`s3a` URLs don't work as in documentation #556
Comments
Thanks for finding this issue, and creating a great ticket! I appreciate it. Looks like this was something I missed regression testing when I worked with @helgeho on pulling in Sparkling [here](c8fa256#diff-7e582908ce37e25cdae381cebc539965c62f5a241cf1ea38fcafe9683b6ce44cR96). It'll be a while before I can pivot any of my research time to this, since my funding on this project ended back in July 2023. But I'll definitely try to set aside some time in the future to get this working again. Also, I see your example uses a WAT file from Common Crawl, and @helgeho also flagged the WAT file usage. Documentation flagged for future work:
EDIT: this helped, the doc may need to be updated:
Describe the bug
According to the docs, `aut` should be able to read data from `s3a` URLs, but every way I've tried it, I get the same result (`wrong FS ...`).
This specific run is built from docker-aut (at b64c02a343ad02ac36e84a2393ed52d86f0fb4ee), but a standalone `Sparkling` build does the same thing. I would file the ticket against Sparkling, but your docs actually exist, and no good deed goes unpunished ;)
I've verified that the credentials I'm providing can read this file using `aws s3 cp` etc.
To Reproduce
Steps to reproduce the behavior (e.g.):
git clone git@github.com:archivesunleashed/docker-aut.git && cd docker-aut
docker build .
docker run -it <hash of above>
RecordLoader.loadArchives("s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/1700679103558.93/wat/CC-MAIN-20231211045204-20231211075204-00000.warc.wat.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().take(10)
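For readers unfamiliar with the `wrong FS` message: Hadoop file system objects generally reject paths whose URI scheme differs from their own, so an `s3a://` path handed to a local (`file:///`) file system fails before any credentials are used. The sketch below is only a simplified imitation of that scheme comparison, written to illustrate the symptom. `WrongFsSketch` and `schemeMatches` are hypothetical names, not part of `aut`, Sparkling, or Hadoop, and the path is shortened for illustration. Whether this is the actual cause inside Sparkling is not confirmed in this ticket.

```scala
import java.net.URI

object WrongFsSketch {
  // Hypothetical helper (not part of aut, Sparkling, or Hadoop): a simplified
  // imitation of the scheme comparison a Hadoop FileSystem applies to a path.
  // A path without a scheme is accepted; otherwise the schemes must match.
  def schemeMatches(path: String, fsUri: String): Boolean = {
    val pathScheme = Option(new URI(path).getScheme)
    val fsScheme = Option(new URI(fsUri).getScheme)
    pathScheme match {
      case None    => true
      case Some(s) => fsScheme.exists(_.equalsIgnoreCase(s))
    }
  }

  def main(args: Array[String]): Unit = {
    // Shortened, illustrative path; the full key is in the reproduction step.
    val warc = "s3a://commoncrawl/crawl-data/example.warc.wat.gz"
    // Compared against a local default file system.
    println(schemeMatches(warc, "file:///"))
    // Compared against a file system resolved from the path itself.
    println(schemeMatches(warc, "s3a://commoncrawl"))
  }
}
```

In a `spark-shell` session, a comparable check with the real Hadoop API would be to compare `FileSystem.get(sc.hadoopConfiguration).getUri` with `new Path(url).getFileSystem(sc.hadoopConfiguration).getUri`. The second call assumes the S3A connector is on the classpath.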
Expected behavior
A DataFrame is returned by RecordLoader ;)
Environment information
`docker-aut`, currently at b64c02a343ad02ac36e84a2393ed52d86f0fb4ee
`docker run -it sha256:f6a21678154c9603e5e4b3f453fa043083812bc40456469e3059edb7c5a3b36d`