Create partitions with more similar size #619
Comments
I think it will be easier to implement and more consistent with other tables if we partition according to root tree_entry, not blob hash.
Can't we just partition them as any other table?
@smola I changed the files table to be partitioned by the root tree entry in the main issue description. How can we make references work like any other table? They are not in the packfile, so they don't appear in the index, so we cannot apply the following partitioning to them:
Because the references are just that, references to an object in the packfile, we can just show on each partition the references that point to objects inside that partition.
I guess they would need a different partitioning scheme. For example: The problem I see with partitioning them based on what they point to is that they can reference tag objects or commit objects, so the logic is not that trivial.
@smola If they are objects, they will be in the index; it doesn't matter if it is a tag object or a commit object. Maybe I'm not completely understanding what you mean.
@smola Actually I was thinking about ways to determine partitions upfront. But I see you're proposing to check repositories to find the partitions:
So, yeah, your approach of "references by offset of the object they point to" should work.
Static partition sizes have some benefits. For instance, you don't have to maintain a mapping of where everything is. On the other hand, it still won't be perfectly balanced. We may try to distribute objects across multiple partitions and track their sizes (kind of round-robin), but that would require an extra mapping of where each object is, or which objects each partition holds.
Having issue #686 in mind, this proposal makes less sense. We should think of other ways to parallelize queries. Instead of just parallelizing per repository, we can parallelize parts of the tree after reading the data from the tables. On the other hand, if we still want to create more than one partition per repository, another idea is to think about how to split the DAG into several parallel parts.
We based our parallelization logic on some ideas of this paper:
There you can see other ways to parallelize data processing.
Do we still want to do this @ajnavarro? |
It would solve a lot of the problems that we currently have (unknown remaining query processing time, repositories taking a long time to process, and so on). I thought about partitioning at the repository level plus splitting the DAG into more or less equal parts, but I didn't have time for this.
Partitions with similar size
Description
Right now, our partition implementation is one partition per repository. That means partitions can have very different sizes, causing problems like spending really long times processing 5% of the total amount of repositories (only one thread parsing a huge repo).
With this proposal, I want to normalize partition sizes by changing the way we split the data across several threads.
Proposal
Change the current partition ID:
To:
Values description
- [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
- [PACKFILE_HASH]: packfile hash. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
- [OFFSET_FROM]: packfile offset where the first object for that partition is. Example: 23144
- [OFFSET_TO]: packfile offset where the last object for that partition is. Example: 7009211
How to do repartition
We need to decide the number of objects that will be on each partition in order to generate the offsets. That object count will be a magic number and won't be configurable: if it could be changed via configuration, indexes and other internal modules would stop working correctly.
Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will have the same number of objects, except possibly the last one. This operation should be fast enough to perform per query; we can cache the partition names if necessary.
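The chunking step above can be sketched as follows (a minimal Python illustration, assuming we already have the object offsets from a packfile index; the real implementation would live in gitbase's Go code):

```python
def partition_offsets(offsets, objects_per_partition):
    """Split the object offsets of one packfile into
    (offset_from, offset_to) ranges holding objects_per_partition
    objects each, except possibly the last range."""
    offsets = sorted(offsets)
    ranges = []
    for i in range(0, len(offsets), objects_per_partition):
        chunk = offsets[i:i + objects_per_partition]
        # First and last offset of the chunk bound the partition.
        ranges.append((chunk[0], chunk[-1]))
    return ranges

# Example: 7 objects, 3 per partition -> 3 partitions, last one smaller.
print(partition_offsets([12, 400, 23144, 51000, 90000, 700000, 7009211], 3))
# [(12, 23144), (51000, 700000), (7009211, 7009211)]
```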
This approach also handles repository updates: partitions are tied to the packfile hash, so a packfile's content cannot change; it can only be deleted, or new packfiles can be created.
How to handle these partitions in each main table
Object tables (Commits, Blobs, Trees, Tree entries)
Each partition will only iterate the objects in its range, so no objects will be duplicated in the output.
References
Only references pointing to an object inside the partition's range will be output by that partition. To do this we can use the packfile indexes, which should be fast.
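A minimal sketch of that filter (Python for illustration; `index` stands in for a packfile index mapping object hash to offset, and the hashes are made up):

```python
def references_for_partition(refs, index, offset_from, offset_to):
    """Given refs as {name: object_hash} and a packfile index mapping
    object hash -> offset, keep only references whose target object
    falls inside the partition's offset range. Refs whose target is
    not in this packfile's index are skipped; they belong to another
    packfile's partitions."""
    out = {}
    for name, obj_hash in refs.items():
        off = index.get(obj_hash)
        if off is not None and offset_from <= off <= offset_to:
            out[name] = obj_hash
    return out

refs = {"refs/heads/master": "aaa", "refs/tags/v1.0": "bbb", "refs/heads/dev": "ccc"}
index = {"aaa": 23144, "bbb": 7009211, "ccc": 100}
print(references_for_partition(refs, index, 23144, 7009211))
# {'refs/heads/master': 'aaa', 'refs/tags/v1.0': 'bbb'}
```

Note that this works the same for tags and commits: as discussed above, both are objects in the index, so only the target's offset matters.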
Files
Files will appear only on the partition where the root tree entry is present. It might be necessary to read tree_entries from other packfile ranges to generate the file path using the repository.
Caveats
Problems
Benefits