Implement storage- and relationship-aware cleanup #973

Draft
wants to merge 1 commit into
base: main
Empty file added services/cleanup/__init__.py
Empty file.
32 changes: 32 additions & 0 deletions services/cleanup/cleanup.py
@@ -0,0 +1,32 @@
from django.db.models.query import QuerySet

from services.cleanup.models import MANUAL_CLEANUP
from services.cleanup.relations import build_relation_graph


def run_cleanup(query: QuerySet) -> tuple[int, int]:
    """
    Cleans up all the models and storage files reachable from the given `QuerySet`.

    This deletes all database models in topological sort order, and also removes
    all the files in storage for any of the models in the relationship graph.

    Returns the number of models and files being cleaned up.
    """
    models_to_cleanup = build_relation_graph(query)

    cleaned_models = 0
    cleaned_files = 0

    for model, query in models_to_cleanup:
        manual_cleanup = MANUAL_CLEANUP.get(model)
        if manual_cleanup is not None:
            res = manual_cleanup(query)
            cleaned_models += res[0]
            cleaned_files += res[1]

        else:
            deleted, _ = query.delete()
            cleaned_models += deleted

    return (cleaned_models, cleaned_files)
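The dispatch in `run_cleanup` (look up a manual handler per model, otherwise fall back to a bulk delete, and sum the `(models, files)` tuples) can be tried out without Django. The following is only a minimal sketch under assumptions: `FakeQuery`, `cleanup_with_files`, `MANUAL`, `run`, and the string model names are hypothetical stand-ins, not part of this PR.

```python
from collections.abc import Callable


class FakeQuery:
    """Hypothetical stand-in for a Django QuerySet; holds rows in a list."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows

    def delete(self) -> tuple[int, dict]:
        # Mirrors the return shape of QuerySet.delete(): (total, per-model counts).
        deleted = len(self.rows)
        self.rows = []
        return deleted, {}


def cleanup_with_files(query: FakeQuery) -> tuple[int, int]:
    # Hypothetical manual handler: one storage file per row that has a path.
    files = sum(1 for row in query.rows if row.get("path") is not None)
    deleted, _ = query.delete()
    return deleted, files


# Keyed by plain strings here; the PR keys its mapping by model class.
MANUAL: dict[str, Callable[[FakeQuery], tuple[int, int]]] = {
    "upload": cleanup_with_files,
}


def run(plan: list[tuple[str, FakeQuery]]) -> tuple[int, int]:
    cleaned_models = 0
    cleaned_files = 0
    for model, query in plan:
        handler = MANUAL.get(model)
        if handler is not None:
            models, files = handler(query)
            cleaned_models += models
            cleaned_files += files
        else:
            # No manual handler registered: a bulk delete is enough.
            deleted, _ = query.delete()
            cleaned_models += deleted
    return cleaned_models, cleaned_files


plan = [
    ("upload", FakeQuery([{"path": "a"}, {"path": None}])),
    ("flag", FakeQuery([{}, {}, {}])),
]
result = run(plan)
```

Here `result` counts five deleted rows and one storage file, since only one `upload` row carries a path.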
123 changes: 123 additions & 0 deletions services/cleanup/models.py
@@ -0,0 +1,123 @@
import dataclasses
import itertools
from collections.abc import Callable
from functools import partial

from django.db.models import Model
from django.db.models.query import Q, QuerySet
from shared.api_archive.storage import StorageService
from shared.config import get_config
from shared.django_apps.core.models import Commit, Pull
from shared.django_apps.reports.models import CommitReport, ReportDetails, ReportSession

from services.archive import ArchiveService, MinioEndpoints

DELETE_CHUNKS = 25


# This has all the `Repository` fields needed by `get_archive_hash`
@dataclasses.dataclass
class FakeRepository:
    repoid: int
    service: str
    service_id: str


def cleanup_archivefield(field_name: str, query: QuerySet) -> tuple[int, int]:
    model_field_name = f"_{field_name}_storage_path"
    # query for a non-`None` `field_name`
    storage_query = query.filter(**{f"{model_field_name}__isnull": False}).values_list(
        model_field_name, flat=True
    )

    # and then delete all those files from storage
    storage = StorageService()
    bucket = get_config("services", "minio", "bucket", default="archive")
    cleaned_files = 0
    # TODO: possibly fan out the batches to a thread pool, as the storage requests are IO-bound
    # TODO: do a limit / range query to avoid loading *all* the paths into memory at once
    for batched_paths in itertools.batched(storage_query, DELETE_CHUNKS):
        storage.delete_files(bucket, batched_paths)
        cleaned_files += len(batched_paths)

    cleaned_models, _ = query.delete()

    return (cleaned_models, cleaned_files)


def cleanup_commitreport(query: QuerySet) -> tuple[int, int]:
    coverage_reports = query.filter(
        Q(report_type=None) | Q(report_type="coverage")
    ).values_list(
        "code",
        "commit__commitid",
        "repository__repoid",
        "repository__owner__service",
        "repository__service_id",
    )

    storage = StorageService()
    bucket = get_config("services", "minio", "bucket", default="archive")
    repo_hashes: dict[int, str] = {}
    cleaned_files = 0
    # TODO: figure out a way to run the deletes in batches
    # TODO: possibly fan out the batches to a thread pool, as the storage requests are IO-bound
    # TODO: do a limit / range query to avoid loading *all* the paths into memory at once
    for (
        report_code,
        commit_sha,
        repoid,
        repo_service,
        repo_service_id,
    ) in coverage_reports:
        if repoid not in repo_hashes:
            fake_repo = FakeRepository(
                repoid=repoid, service=repo_service, service_id=repo_service_id
            )
            repo_hashes[repoid] = ArchiveService.get_archive_hash(fake_repo)
        repo_hash = repo_hashes[repoid]

        chunks_file_name = report_code if report_code is not None else "chunks"
        path = MinioEndpoints.chunks.get_path(
            version="v4",
            repo_hash=repo_hash,
            commitid=commit_sha,
            chunks_file_name=chunks_file_name,
        )
        storage.delete_file(bucket, path)
        cleaned_files += 1

    cleaned_models, _ = query.delete()

    return (cleaned_models, cleaned_files)


def cleanup_upload(query: QuerySet) -> tuple[int, int]:
    storage_query = query.values_list("storage_path", flat=True)

    storage = StorageService()
    bucket = get_config("services", "minio", "bucket", default="archive")
    cleaned_files = 0
    # TODO: possibly fan out the batches to a thread pool, as the storage requests are IO-bound
    # TODO: do a limit / range query to avoid loading *all* the paths into memory at once
    for batched_paths in itertools.batched(storage_query, DELETE_CHUNKS):
        storage.delete_files(bucket, batched_paths)
        cleaned_files += len(batched_paths)

    cleaned_models, _ = query.delete()

    return (cleaned_models, cleaned_files)


# Models that need custom Python code for deletion, for which a bulk `DELETE` query alone is not enough.
MANUAL_CLEANUP: dict[type[Model], Callable[[QuerySet], tuple[int, int]]] = {
    Commit: partial(cleanup_archivefield, "report"),
    Pull: partial(cleanup_archivefield, "flare"),
    ReportDetails: partial(cleanup_archivefield, "files_array"),
    CommitReport: cleanup_commitreport,
    ReportSession: cleanup_upload,
    # TODO: figure out any other models which have files in storage that are not `ArchiveField`
    # TODO: TA is also storing files in GCS
    # TODO: BA is also storing files in GCS
    # TODO: There is also `CompareCommit.report_storage_path`, but that does not seem to be implemented as a Django model?
}
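The TODOs above mention fanning the storage batches out to a thread pool. One possible shape for that, sketched without the real storage backend: `FakeStorage`, `delete_in_batches`, the local `batched` helper, and the example paths are all hypothetical, and the actual `StorageService.delete_files` may behave differently (for example with respect to thread safety).

```python
import itertools
import threading
from collections.abc import Iterable, Iterator
from concurrent.futures import ThreadPoolExecutor


def batched(items: Iterable[str], size: int) -> Iterator[tuple[str, ...]]:
    # Same idea as itertools.batched (Python 3.12+), spelled out with islice
    # so the sketch also runs on older interpreters.
    iterator = iter(items)
    while batch := tuple(itertools.islice(iterator, size)):
        yield batch


class FakeStorage:
    """Hypothetical in-memory stand-in; not the real StorageService API."""

    def __init__(self, paths: Iterable[str]) -> None:
        self.paths = set(paths)
        self._lock = threading.Lock()

    def delete_files(self, bucket: str, paths: tuple[str, ...]) -> int:
        # The lock guards the shared set, since batches run on several threads.
        with self._lock:
            self.paths.difference_update(paths)
        return len(paths)


def delete_in_batches(
    storage: FakeStorage,
    bucket: str,
    paths: Iterable[str],
    chunk_size: int = 25,
    workers: int = 4,
) -> int:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map collects its input eagerly, so every batch is
        # materialized up front; this does not address the memory TODO.
        counts = pool.map(
            lambda batch: storage.delete_files(bucket, batch),
            batched(paths, chunk_size),
        )
        return sum(counts)


# Example paths are made up for illustration.
all_paths = [f"v4/raw/{index}.txt" for index in range(60)]
storage = FakeStorage(all_paths)
deleted = delete_in_batches(storage, "archive", all_paths)
```

With 60 paths and the default chunk size of 25 this issues three `delete_files` calls (25, 25, and 10 paths). Whether the extra threads help depends on how the real storage client handles concurrent requests.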
77 changes: 77 additions & 0 deletions services/cleanup/relations.py
@@ -0,0 +1,77 @@
# from pprint import pprint

from graphlib import TopologicalSorter

from django.db.models import Model
from django.db.models.query import QuerySet


def build_relation_graph(query: QuerySet) -> list[tuple[type[Model], QuerySet]]:
    """
    This takes as input a Django `QuerySet`, like `Repository.objects.filter(repoid=123)`.

    It then walks the Django relation graph, resolving all the models that have a relationship **to** the input model,
    returning those models along with a `QuerySet` that allows either querying or deleting those models.

    The returned list is in topological sorting order, so related models are always sorted before models they depend on.
    """
    graph: TopologicalSorter[type[Model]] = TopologicalSorter()
    querysets: dict[type[Model], QuerySet] = {}

    def process_model(model: type[Model], query: QuerySet):
        if model in querysets:
            return
        querysets[model] = query
        # register the node itself, so a model without any related models is still returned
        graph.add(model)

        if not (meta := model._meta):
            return

        for field in meta.get_fields(include_hidden=True):
            if not field.is_relation:
                continue

            if field.one_to_many or field.one_to_one:
                # Most likely the reverse of a `ForeignKey`
                # <https://docs.djangoproject.com/en/5.1/ref/models/fields/#django.db.models.Field.one_to_many>

                if not hasattr(field, "field"):
                    # I believe this is the actual *forward* definition of a `OneToOne`
                    continue

                # this should be the actual `ForeignKey` definition:
                actual_field = field.field
                if actual_field.model == model:
                    # this field goes from *this* model to another, but we are interested in the reverse actually
                    continue

                related_model = actual_field.model
                related_model_field = actual_field.name
                related_query = related_model.objects.filter(
                    **{f"{related_model_field}__in": query}
                )
                graph.add(model, related_model)
                process_model(related_model, related_query)

            elif field.many_to_many:
                if not hasattr(field, "through"):
                    continue

                # we want to delete all related records on the join table
                related_model = field.through
                join_meta = related_model._meta
                for join_field in join_meta.get_fields(include_hidden=True):
                    # `join_field.model` is the join table itself, so compare the relation target instead
                    if not join_field.is_relation or join_field.related_model != model:
                        continue

                    related_model_field = join_field.name
                    related_query = related_model.objects.filter(
                        **{f"{related_model_field}__in": query}
                    )
                    graph.add(model, related_model)
                    process_model(related_model, related_query)

                # pprint(vars(field.through._meta))

    process_model(query.model, query)

    return [(model, querysets[model]) for model in graph.static_order()]
Empty file.
10 changes: 10 additions & 0 deletions services/cleanup/tests/test_relations.py
@@ -0,0 +1,10 @@
from shared.django_apps.core.models import Repository

from services.cleanup.relations import build_relation_graph


def test_builds_relation_graph(db):
    print()
    relations = build_relation_graph(Repository.objects.filter(repoid=123))
    for model, query in relations:
        print(model, str(query.query))
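The test above only prints the resolved queries. One possible way to turn the ordering property into an assertion, sketched without Django: the helper name `ordering_violations` and the string model names are assumptions, and in a real test the order would come from `[model for model, _ in build_relation_graph(...)]`.

```python
def ordering_violations(
    order: list[str], edges: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return every (model, dependent) edge whose dependent is not ordered first."""
    position = {name: index for index, name in enumerate(order)}
    return [
        (model, dependent)
        for model, dependent in edges
        if position[dependent] > position[model]
    ]


# (model, dependent): the dependent holds a foreign key to the model.
edges = [("Repository", "Commit"), ("Commit", "CommitReport")]

good_order = ["CommitReport", "Commit", "Repository"]
bad_order = ["Commit", "CommitReport", "Repository"]

good_violations = ordering_violations(good_order, edges)
bad_violations = ordering_violations(bad_order, edges)
```

A test could then assert that the list of violations is empty for the graph built from a `Repository` query.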