
Inquiry: SNB Basic + SNB Composite Merge Foreign? #394

Open · aMahanna opened this issue Jun 10, 2022 · 24 comments
@aMahanna commented Jun 10, 2022

Hi again 😄

In an experiment to support our database's multi-model functionality, we are trying to include the edges generated from the SNB Basic dataset, with the files generated from the SNB CompositeMergeForeign dataset.

We are getting inconsistent results, and wondered if there is any consideration of supporting this with the datagen, or if by any chance this is already possible?

For example, we want a data model where both the post_hasCreator_person relationship and creator attribute in the Post document exist.

Happy to move this conversation to the datagen repo if that makes more sense

@szarnyasg (Member)

Hi @aMahanna,

Two things:

  1. Can you please clarify how the results are inconsistent?

The Datagen is deterministic, so the graphs (including the IDs) are the same across the different layouts. Combining files from the two data sets should therefore be possible without introducing inconsistencies.

  2. For pre-processing data sets, I have two approaches:

i. Using the usual UNIX tools like grep, cat, cut, etc. These work well for splitting files.

ii. Using DuckDB. This approach also supports joins, aggregation (string_agg), and unwinding (unnest). I have a couple of example scripts for SNB BI: https://github.com/ldbc/ldbc_snb_example_data/tree/main/export
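The first (UNIX tools) approach can be sketched as follows. This is a hypothetical example: the file names and column layouts are illustrative, not the Datagen's exact output. It folds a post_hasCreator_person edge file back into the Post node file as a creator column, i.e. it derives a MergeForeign-style file from Basic-style files:

```shell
# Illustrative only: tiny sample files, not real Datagen output.
# Node file: id|content
cat > post.csv <<'EOF'
101|hello world
102|second post
EOF

# Edge file: Post.id|Person.id
cat > post_hasCreator_person.csv <<'EOF'
101|9001
102|9002
EOF

# join(1) requires both inputs to be sorted on the join key (field 1).
sort -t'|' -k1,1 post.csv > post.sorted
sort -t'|' -k1,1 post_hasCreator_person.csv > edges.sorted

# Merge into a MergeForeign-style layout: id|content|creator
join -t'|' post.sorted edges.sorted > post_merged.csv
cat post_merged.csv   # 101|hello world|9001
                      # 102|second post|9002
```

The same merge could be written as a single SQL join in DuckDB; join(1) is shown here only because it needs no dependencies beyond coreutils.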

Gabor

@aMahanna (Author) commented Jun 15, 2022

Can you please clarify how the results are inconsistent?

Apologies for the delay and for the confusion. After further investigation, we discovered a formatting mistake on our part when combining the files.

I will follow up shortly regarding your second point, but for now I just want to say thank you for all the help so far.

  • Anthony

@aMahanna (Author) commented Jun 21, 2022

Hi Gabor,

We've been evaluating the various SNB datasets available in an attempt to support our database's multi-model functionality.

We found that using a combination of the Basic & MergeForeign datasets substantially improves our query performance and better suits our data model. Our request would be for the datagen to natively support the data model outlined below, or for you to suggest a way to do so if one already exists. As it stands now, modelling the data in this way requires a lot of pre/post-processing (as suggested above), which we believe would count against us if we were to have the benchmark audited.

In particular, we have a query that benefits from the Basic dataset (IC8), a query that benefits from the MergeForeign dataset (IC3 Sub-Query A), and a query that benefits from a combination of both (IC3 Sub-Query B).

IC8

Understanding that you may not be familiar with AQL (ArangoDB Query Language): this query relies on the edge relationships only available in the Basic dataset (e.g. post_hasCreator_person, comment_hasCreator_person, etc.).

FOR commentReply IN 2..2 INBOUND @personId post_hasCreator_person, comment_hasCreator_person, comment_replyOf_post, comment_replyOf_comment
    SORT commentReply.creationDate DESC, commentReply._id
    LIMIT 20
    FOR creator IN 1..1 OUTBOUND commentReply comment_hasCreator_person
        RETURN {
            id: creator._id,
            firstName: creator.firstName,
            lastName: creator.lastName,
            commentId: commentReply._id,
            commentCreationDate: commentReply.creationDate,
            commentContent: commentReply.content
        }

The alternative approach is to rely solely on the MergeForeign attributes (i.e. creator, replyOfPost, replyOfComment). Since none of the edge relationships mentioned above are included in MergeForeign, switching to these attributes results in query performance that is 6x slower than the current implementation. On the other hand, sticking to a Basic-only data model poses its own challenges, as seen below.

IC3

We've noticed peak performance in IC3 when a combination of Basic SNB edge relationships & MergeForeign SNB attributes are used within the same query.

IC3 Sub-Query A

A portion of IC3 relies on the person.place MergeForeign attribute for efficient query performance.

FOR friend IN 1..2 ANY @personId person_knows_person OPTIONS {bfs: true, uniqueVertices:"global"}
    FILTER friend.place NOT IN [countryXKey, countryYKey]
    RETURN {id: friend.id, place: friend.place}

Attempting to do this using the Basic SNB person_isLocatedIn_place edge relationship results in query performance that is 70x slower.

IC3 Sub-Query B

Another portion of IC3 relies on the post.place and the comment.place MergeForeign attributes, while also benefitting from the post_hasCreator_person and comment_hasCreator_person relationships (found only in the Basic SNB dataset).

FOR message IN 1..1 INBOUND friend post_hasCreator_person, comment_hasCreator_person
    FILTER message.place IN [countryXKey, countryYKey]
    RETURN message

Attempting to do this using the Basic SNB post_isLocatedIn_place & comment_isLocatedIn_place edge relationships results in query performance that is 30x slower.

Conclusion

As far as we can tell, the current datagen utility doesn't support this, which we feel leaves out the multi-model graph capabilities offered by our database. We are not looking to manipulate the data in a way that specifically favours us, but rather for the LDBC datagen to better support the functionality of multi-model graph databases.

Would it be possible to have the datagen support this data model out of the box (assuming it doesn't already)?

@szarnyasg transferred this issue from ldbc/data-sets-surf-repository on Jun 21, 2022
@szarnyasg (Member) commented Jun 21, 2022

@aMahanna I transferred the issue to the (new, Spark-based) Datagen's repository.
I skimmed your suggestion and it seems doable in the Datagen, although it will not have a high priority in our development plans.

This week I'm travelling/have other duties -- I will take a look next week.

@szarnyasg added this to the Milestone 5 milestone Jun 25, 2022
@szarnyasg removed this from the Milestone 5 milestone Oct 10, 2022
@szarnyasg (Member)

Hello again,

The bad news: this functionality is unlikely to be supported in the Datagen.

The good news: I have generated the data sets and uploaded them to Cloudflare R2 (an egress-free object storage):

  • Composite Merged FK

  • Composite Projected FK

Gabor

@cw00dw0rd

Hi @szarnyasg

Sorry to hear that this functionality won't be supported in the utility, as it fits multi-model graph databases quite well. Was there some issue with implementing it, or would you still be open to having it added if we were able to do so?

Apologies if I am missing something, but the datasets you just provided seem to have the same schema as before; was that the intention? I am just trying to determine whether there is a difference between these and the Surf datasets.

Thank you again for all the help so far!

Chris

@szarnyasg (Member) commented Oct 18, 2022 via email

@szarnyasg (Member)

By the way, maybe an important piece of information that's missing from the discussion above: systems can pre-process the data set before loading. So you can take, e.g., the composite merge foreign CSV files, run them through a script (which can use anything: cut, Perl scripts, a DuckDB SQL script, etc.), create a new set of CSV files, and then load those into the system under test. We try to avoid this in the reference implementations, but it is definitely a possibility.
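As a minimal sketch of such a pre-processing script (file names and column positions are hypothetical, not the Datagen's exact layout), the following derives projected-style files from a merged-style one using only cut(1) and tail(1):

```shell
# Illustrative merged-style node file with an embedded creator FK column.
cat > comment_merged.csv <<'EOF'
id|content|creator
201|nice|9001
202|thanks|9002
EOF

# Keep only the node columns (fields 1-2) to get a projected-style file.
cut -d'|' -f1,2 comment_merged.csv > comment_projected.csv

# Extract the FK pairs (fields 1 and 3) as a separate edge file,
# skipping the header line.
tail -n +2 comment_merged.csv | cut -d'|' -f1,3 > comment_hasCreator_person.csv

cat comment_projected.csv           # id|content / 201|nice / 202|thanks
cat comment_hasCreator_person.csv   # 201|9001 / 202|9002
```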

@cw00dw0rd

Hi Gabor @szarnyasg

Sorry to keep this thread going so long, but I downloaded and attempted to decompress the files above; SF1 worked fine, but SF1000 reports the following error:

/*stdin*\ : Read error (39) : premature end
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

The command I ran was the following:
tar --use-compress-program=unzstd -xvf bi-sf1000-composite-projected-fk.tar.zst.000

I attempted this with both the merged and projected files and received the same error for the SF1000 files. Do you have any suggestions?
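For context (the maintainer's replies below went via email, so the actual fix is not visible in this thread): the .000 suffix suggests the SF1000 archive is split into numbered parts, and extracting a single part alone would explain the "premature end" error. A hedged sketch of the likely fix is to concatenate all parts before decompressing; the runnable part below demonstrates the underlying principle with plain coreutils (no zstd required):

```shell
# Likely fix (an assumption -- requires all .tar.zst.NNN parts to be present):
#   cat bi-sf1000-composite-projected-fk.tar.zst.* \
#     | tar --use-compress-program=unzstd -xv -f -
#
# The underlying principle: split parts concatenate back byte-identically.
printf 'payload-bytes-0123456789' > archive.bin
split -b 10 -d -a 3 archive.bin archive.bin.   # -> archive.bin.000 .001 .002
cat archive.bin.* > rejoined.bin               # glob expands in sorted order
cmp archive.bin rejoined.bin && echo "parts rejoin byte-identically"
```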

@szarnyasg (Member) commented Nov 15, 2022 via email

@cw00dw0rd

Hi Gabor,

I am unable to access that link; it shows a 404.

Chris

@szarnyasg (Member) commented Nov 16, 2022 via email

@cw00dw0rd

Thank you, that did the trick!

@cw00dw0rd

Hi Gabor,

Do you, by chance, have the IC substitution parameters used for the datasets you shared above?
I found this: https://github.com/ldbc/ldbc_snb_bi/blob/main/snb-bi-pre-generated-data-sets.md#parameters
but that only has the BI parameters.

To confirm: while this is tagged with BI, I assumed the initial_snapshot would work for the IC queries as well; is this true?

@szarnyasg (Member) commented Nov 23, 2022 via email

@cw00dw0rd

I attempted to run the paramgen, but I must be missing something.
I copied the factors folders into a factors folder I created within the paramgen folder, so the resulting structure looks like the following:

ls /data/ldbc_snb_interactive_driver/paramgen/factors/parquet/raw/composite-merged-fk/

cityNumPersons/                          countryPairsNumFriends/                  languageNumPosts/                        personDays/                              personLikesNumMessages/                  personNumFriendTags/                     sameUniversityConnected/
cityPairsNumFriends/                     creationDayAndLengthCategoryNumMessages/ lengthNumMessages/                       personDisjointEmployerPairs/             personNumFriendComments/                 personNumFriends/                        tagClassNumMessages/
companyNumEmployees/                     creationDayAndTagClassNumMessages/       messageIds/                              personFirstNames/                        personNumFriendOfFriendCompanies/        personNumFriendsOfFriendsOfFriends/      tagClassNumTags/
countryNumMessages/                      creationDayAndTagNumMessages/            people2Hops/                             personKnowsPersonConnected/              personNumFriendOfFriendForums/           personStudyAtUniversityDays/             tagNumMessages/
countryNumPersons/                       creationDayNumMessages/                  people4Hops/                             personKnowsPersonDays/                   personNumFriendOfFriendPosts/            personWorkAtCompanyDays/                 tagNumPersons/

After that, I exported the LDBC_SNB_DATA_ROOT_DIRECTORY variable to point at the data directory:

export LDBC_SNB_DATA_ROOT_DIRECTORY=/data/110822/merged/bi-sf1000-composite-merged-fk

And then I attempted to run the script from the ldbc_snb_interactive_driver/paramgen directory:

./scripts/paramgen.sh

Traceback (most recent call last):
  File "paramgen.py", line 273, in <module>
    PG.run()
  File "paramgen.py", line 110, in run
    path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
    list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
    self.create_views()
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 52, in create_views
    self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/110822/merged/bi-sf1000-composite-merged-fk/graphs/parquet/raw/composite-merged-fk/dynamic/Person/*.parquet"

Do you have any suggestions for how I can resolve this?
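For reference, the IO error above shows that paramgen appends graphs/parquet/raw/... to ${LDBC_SNB_DATA_ROOT_DIRECTORY}, so the variable must point at the directory that contains graphs/, not at graphs/ itself or anything deeper. A hypothetical sketch (the paths are illustrative stand-ins for the unpacked data set) for locating the right root with find:

```shell
# Build an illustrative directory tree, mimicking an unpacked data set.
mkdir -p /tmp/snb-demo/bi-sf1000-composite-merged-fk/graphs/parquet/raw/composite-merged-fk/dynamic/Person

# Find the directory that ends in graphs/parquet/raw, then strip that suffix
# to recover the root directory that paramgen expects.
root=$(find /tmp/snb-demo -type d -path '*/graphs/parquet/raw' | head -n 1)
export LDBC_SNB_DATA_ROOT_DIRECTORY="${root%/graphs/parquet/raw}"
echo "$LDBC_SNB_DATA_ROOT_DIRECTORY"   # /tmp/snb-demo/bi-sf1000-composite-merged-fk
```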

@szarnyasg (Member) commented Nov 28, 2022 via email

@cw00dw0rd

Thank you! To confirm, these generated parameters will be compatible with the Cloudflare datasets you linked above?

@szarnyasg (Member) commented Nov 28, 2022 via email

@cw00dw0rd

After downloading and unpacking I now receive the following:

root@dataloader-0:/data/ldbc_snb_interactive_driver/paramgen# scripts/paramgen.sh
Traceback (most recent call last):
  File "paramgen.py", line 273, in <module>
    PG.run()
  File "paramgen.py", line 110, in run
    path_curation.get_people_4_hops_paths(self.start_date, self.end_date, 1, parquet_output_dir)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 356, in get_people_4_hops_paths
    list_of_paths = self.run(start_date, end_date, time_bucket_size_in_days)
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 286, in run
    self.create_views()
  File "/data/ldbc_snb_interactive_driver/paramgen/path_selection.py", line 66, in create_views
    self.cursor.execute(
duckdb.IOException: IO Error: No files found that match the pattern "/data/ldbc_snb_interactive_driver/paramgen/scratch/factors/people4Hops/*.parquet"

The folders in dynamic are the following:

Comment/                   Forum/                     Forum_hasTag_Tag/          Person_hasInterest_Tag/    Person_likes_Comment/      Person_studyAt_University/ Post/                      _SUCCESS
Comment_hasTag_Tag/        Forum_hasMember_Person/    Person/                    Person_knows_Person/       Person_likes_Post/         Person_workAt_Company/     Post_hasTag_Tag/

@szarnyasg (Member) commented Nov 28, 2022 via email

@cw00dw0rd

Thank you; the directory I was supplying was too far down.
It ran for quite a while, but then I am given the following error:

Traceback (most recent call last):
  File "paramgen.py", line 273, in <module>
    PG.run()
  File "paramgen.py", line 122, in run
    self.generate_parameter_for_query_type(self.start_date, self.start_date, "13b")
  File "paramgen.py", line 200, in generate_parameter_for_query_type
    self.cursor.execute(f"INSERT INTO 'Q_{query_variant}' SELECT * FROM ({parameter_query});")
duckdb.BinderException: Binder Error: Referenced column "useFrom" not found in FROM clause!
Candidate bindings: "personIds.diff"

@szarnyasg reopened this Dec 8, 2022
@szarnyasg (Member)

@cw00dw0rd I added a sample script to the driver's CI that shows how to use the paramgen:

https://github.com/ldbc/ldbc_snb_interactive_driver/blob/bb80725214ada3639869cc5aa1b546298b90e6a9/.circleci/config.yml#L71-L93

Let me know if this fails for any of the larger data sets -- if so, there is a problem with the data sets.

(Note that the ${LDBC_SNB_DATA_ROOT_DIRECTORY} env var is currently used inconsistently for the conversion and the paramgen scripts. We'll fix this eventually - ldbc/ldbc_snb_interactive_v1_driver#219 - in the meantime, it's easy to work around.)

@cw00dw0rd

@szarnyasg thank you, I will give this a try today and report back.
