Google Groups formatting changed, unit test issues #94

Closed
glasnt opened this issue Nov 15, 2022 · 2 comments
Labels
bug Something isn't working core Core code functionality enhancement New feature or request

Comments

@glasnt
Member

glasnt commented Nov 15, 2022

TL;DR: Google Groups ingestion is currently broken. Changes to Google Groups cause the scraping code to fail.

Discovered while trying to update dependencies.

Zero topics

Monthly pipeline processing was showing 0 topics returned:

2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.

Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).

E.g. https://groups.google.com/g/golang-checkins shows 1–30 of 81553, where the range separator is \u2013 EN DASH. The regex in getTotalTopics expects - (\u002D HYPHEN-MINUS).

Because the topic count comes back as 0, it's affecting loops later on (in my estimation).
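
A minimal sketch of a count parser that tolerates either dash, for illustration only. The names `totalTopicsRe` and `parseTotalTopics` are hypothetical and are not the project's `getTotalTopics`; the pager text format is assumed from the example above, and the handling of thousands separators is an extra assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// totalTopicsRe matches pager text such as "1–30 of 81553".
// The character class accepts HYPHEN-MINUS (U+002D) and EN DASH (U+2013).
// Hypothetical pattern, not the regex used in the repository.
var totalTopicsRe = regexp.MustCompile(`\d+\s*[-\x{2013}]\s*\d+\s+of\s+([\d,]+)`)

// parseTotalTopics extracts the total topic count from the pager text.
// It returns an error when nothing matches, so a format change is
// reported instead of being read as "0 topics".
func parseTotalTopics(text string) (int, error) {
	m := totalTopicsRe.FindStringSubmatch(text)
	if m == nil {
		return 0, fmt.Errorf("no topic count found in %q", text)
	}
	// Assumption: the count may contain thousands separators.
	return strconv.Atoi(strings.ReplaceAll(m[1], ",", ""))
}

func main() {
	for _, s := range []string{"1–30 of 81553", "1-30 of 81553", "no pager here"} {
		n, err := parseTotalTopics(s)
		fmt.Println(n, err)
	}
}
```

Returning an error on a non-match, rather than a zero count, would make this kind of UI change show up as a pipeline failure rather than as an empty run.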

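Nested unit tests

Additionally, when trying to run the unit tests, it appears that running just mailinglists/ doesn't run the nested mailing list packages, so the unit tests for googlegroups weren't being run (and are currently failing). (go test on a single directory only covers that package; the ./mailinglists/... pattern is needed to include subpackages.)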

Failing topic unit tests

Now running the unit tests:

=== RUN   TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
    googlegroups_data_test.go:300: Result response does not match.
         got: map[2018-09.txt:[]]
        want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]

Infinite redirects

This URL format is no longer valid: trying to curl it gets stuck in an infinite 301 redirect loop, because the response redirects to the same URL:

$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>

Summary

This is going to take some re-engineering to work out what's changed in the Google Groups format to bring this code back to working.

@glasnt glasnt added documentation Improvements or additions to documentation enhancement New feature or request labels Nov 15, 2022
@glasnt glasnt mentioned this issue Nov 15, 2022
2 tasks
@amcasari amcasari assigned amcasari and unassigned amcasari Nov 18, 2022
@amcasari amcasari pinned this issue Nov 18, 2022
@amcasari
Collaborator

amcasari commented Dec 8, 2022

Updating from offline discussions:

As part of Project OCEAN's Open Source Data Ecosystem, @nyghtowl (Xoogler) and members of the 20% Dive Crew scoped, designed, and built a data pipeline to aggregate mailing lists from multiple communities, including Python, Angular, and Go.

This dataset was used in multiple research projects with our academic partners, including an accepted dataset track submission at MSR 2022.

As outlined by @glasnt, several updates are needed in both the open source project and the GCP project to maintain this dataset. Having polled our research stakeholders, we found that this dataset is not currently being used in any ongoing research project.

Any changes currently made would most likely need to be maintained with future open source dependency version changes, GCP product updates, and Google Groups API/RSS supported features.

Rather than update a project no one is using, we are going to put it all on the shelf with proper documentation for future explorers and experimentation.

@amcasari amcasari added bug Something isn't working core Core code functionality and removed documentation Improvements or additions to documentation labels Dec 8, 2022
@glasnt
Member Author

glasnt commented Dec 12, 2022

Closing, see #97

@glasnt glasnt closed this as completed Dec 12, 2022