Google Groups formatting changed, unit test issues #94
Comments
Updating from offline discussions: As part of Project OCEAN's Open Source Data Ecosystem, @nyghtowl (Xoogler) and members of the 20% Dive Crew scoped, designed, and built a data pipeline to aggregate mailing lists from multiple communities, including Python, Angular, and Go. This dataset was used in multiple research projects with our academic partners, including an accepted dataset track submission at MSR 2022.
As outlined by @glasnt, several updates need to be made in the open source project and the GCP project to maintain this dataset. After polling our research stakeholders, we found that this dataset is not currently being used for any ongoing research project. Any changes made now would most likely need to be maintained through future open source dependency version changes, GCP product updates, and changes to the features supported by the Google Groups API/RSS.
Rather than update a project no one is using, we are going to put it all on the shelf with proper documentation for future explorers and experimentation.
Closing, see #97.
TL;DR: Google Groups ingestion is currently not working, because changes to Google Groups cause the scraping code to fail.
Discovered while trying to update dependencies.
Zero topics
Monthly pipeline processing was showing 0 topics returned:
Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).
E.g. https://groups.google.com/g/golang-checkins shows `1–30 of 81553` (specifically, `–` is `\u2013 EN DASH`). The regex in `getTotalTopics` specifies `-` (`\u002D HYPHEN-MINUS`).
So because the topic counts are 0, it's affecting loops later on (in my estimation).
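The dash mismatch described above can be illustrated with a minimal sketch. The pattern and the `parseTotalTopics` helper below are hypothetical and are not the project's actual `getTotalTopics` code; they only assume the counter text has the form `1–30 of 81553` shown on the page. The character class accepts both the en dash and the hyphen-minus.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// topicRangeRE is a hypothetical pattern for this sketch, not the project's regex.
// The class [\-\x{2013}] matches U+002D HYPHEN-MINUS or U+2013 EN DASH.
var topicRangeRE = regexp.MustCompile(`\d+\s*[\-\x{2013}]\s*\d+ of (\d+)`)

// parseTotalTopics extracts the total from text such as "1–30 of 81553".
// It returns an error when nothing matches, so a format change is not
// silently reported as zero topics.
func parseTotalTopics(text string) (int, error) {
	m := topicRangeRE.FindStringSubmatch(text)
	if m == nil {
		return 0, fmt.Errorf("no topic count found in %q", text)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// One input with an en dash, one with a hyphen-minus.
	for _, s := range []string{"1\u201330 of 81553", "1-30 of 81553"} {
		n, err := parseTotalTopics(s)
		fmt.Println(n, err)
	}
}
```

Returning an error on a failed match, rather than a count of 0, is one possible design choice: it would make a UI change like this one fail loudly instead of affecting later loops.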
Nested unit tests
Additionally, when trying to run the unit tests, it appears that running just `mailinglists/` doesn't run the tests for the nested mailing lists, so the unit tests for `googlegroups` weren't being run (and are currently failing).
Failing topic unit tests
Now running the unit tests:
Infinite redirects
This URL is no longer a valid URL format, as trying to curl it gets stuck in an infinite 301 redirect loop:
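A scraper could guard against this kind of loop with a check along the following lines. This is a sketch under stated assumptions, not code from the project: `hasRedirectLoop`, `errRedirectLoop`, and `newSelfRedirectingServer` are hypothetical names, and the local test server merely stands in for a URL that redirects to itself. Go's default `http.Client` already gives up after 10 consecutive redirects; the custom `CheckRedirect` here additionally reports a repeated URL explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errRedirectLoop is a hypothetical sentinel error used only in this sketch.
var errRedirectLoop = errors.New("redirect loop detected")

// hasRedirectLoop reports whether fetching rawURL redirects back to a URL
// that was already requested earlier in the same redirect chain.
func hasRedirectLoop(rawURL string) (bool, error) {
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// Keep a hard cap, since a custom CheckRedirect replaces the default limit.
			if len(via) >= 10 {
				return errors.New("stopped after 10 redirects")
			}
			for _, prev := range via {
				if prev.URL.String() == req.URL.String() {
					return errRedirectLoop
				}
			}
			return nil
		},
	}
	resp, err := client.Get(rawURL)
	if errors.Is(err, errRedirectLoop) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	return false, nil
}

// newSelfRedirectingServer starts a local server that answers every request
// with a 301 pointing at the same path, imitating the loop described above.
func newSelfRedirectingServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, r.URL.Path, http.StatusMovedPermanently)
	}))
}

func main() {
	srv := newSelfRedirectingServer()
	defer srv.Close()
	loop, err := hasRedirectLoop(srv.URL + "/forum")
	fmt.Println(loop, err)
}
```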
Summary
This is going to take some re-engineering to work out what's changed in the Google Groups format to bring this code back to working.