Draft: Update data on the CI #26

jbruechert · 2021-03-29T12:49:08Z

No description provided.

ulfgebhardt · 2021-03-29T17:15:12Z

The update & push to github must become its own workflow e.g .github/workflows/crawl.yml so we can separate stuff and schedule the job. Ofc what you do is perfectly fine for testing.

name: "Close stale issues"
on:
  schedule:
  - cron: "0 0 * * *"

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/[email protected]
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        stale-issue-message: 'Message to comment on stale issues. If none provided, will not mark issues stale'
        stale-pr-message: 'Message to comment on stale PRs. If none provided, will not mark PRs stale'

src

If you need secrets in the repo to push to the correct git repo, tell me and I make things happen
Is it smart to keep data like banz.json in this repo?
....
Profit! Good work, very awesome to see this year old problem to get solved <3

jbruechert · 2021-03-29T17:53:33Z

Is it smart to keep data like banz.json in this repo?

So far we are treating this data as input data for our other tools. If someone has a use for it outside of this we can move it to a new repo I guess.

ulfgebhardt · 2021-03-29T17:56:37Z

Oh I did not know that this tool processes the crawled data further. Its cool, I just thought for separation of code & data it might be wise to do so. If its not thats perfectly fine.

jbruechert · 2021-03-29T17:59:18Z

Before we can merge this we need to fill in the years between 2013 and 2021 with a manual run locally btw. The script currently only starts at the year of the latest commit in the gesetze repo, which is from this year, so it would get confused by that.

ulfgebhardt · 2021-03-29T18:05:07Z

I guess this is to increase efficiency? How long is the time effort for a whole scrape? My experience with bundestag.de is that they change old stuff regularly and their unstable web interfaces cause faulty data to show up every now and then - this can be detected with regular crawls. So what speaks against crawling everything once a day?

jbruechert · 2021-03-29T18:15:09Z

The Bundesanzeiger scraper is really slow. As far as I understand the code, the actual laws are updated completely, but the index of changed laws is only extended, not refreshed. Even crawling just 2021 in the Bundesanzeiger takes about 5 Minutes iirc, and everything would take multiple hours.

Also run test workflow for pull requests, but don't scrape data in them.

Croydon · 2021-09-14T17:49:03Z

What is the status of this? Any particular reason why the work in this PR did not get continued?

jbruechert · 2021-09-14T19:29:37Z

The data from 2013 and 2021 needs to be added (for example by editing updatelawsgit.py locally). Afterwards I think this could be merged. It would be nice if someone else could take over that work, I'm not that active here right now.

darkdragon-001

Thanks for your work. I would vote in favor of implementing #35 and #36 first, but would be fine otherwise since this is better than nothing.

darkdragon-001 · 2021-09-15T07:23:14Z

updatelawsgit.py

+def clone_lawsgit() -> None:
+    print("Updating gesetze.git…")
+
+    if not os.path.exists(LAWS_PATH):
+        run_command(["git", "clone", "--depth=1", LAWS_REPOSITORY, LAWS_PATH])
+    else:
+        run_command(["git", "-C", LAWS_PATH, "pull"])


Is this really needed since you are running uses: actions/checkout@v2?

darkdragon-001 · 2021-09-15T07:23:53Z

updatelawsgit.py

+def run_command(command: List[str]) -> None:
+    if subprocess.check_call(command) != 0:
+        print("Error while executing", command)
+        exit(1)


Should we really use subprocesses or better just import the python files directly?

darkdragon-001 · 2021-09-15T07:25:46Z

updatelawsgit.py

+def get_latest_year() -> int:
+    repo = Repo(LAWS_PATH)
+    timestamp = repo.head.commit.committed_date
+    date = datetime.fromtimestamp(timestamp)
+    return date.year


How can we ensure that this correlation always is valid? Maybe we should check the content of the downloaded json files instead.

darkdragon-001 · 2021-09-15T07:26:34Z

updatelawsgit.py

+    run_command(["./banz_scraper.py", BANZ_FILE, str(minyear), str(maxyear)])
+    run_command(["./bgbl_scraper.py", BGBL_FILE, str(minyear), str(maxyear)])
+
+    # TODO add the other indexes here, once they are working


Are they working now?

darkdragon-001 · 2021-09-15T07:27:52Z

updatelawsgit.py

+def fetch_raw_xml() -> None:
+    print("Downloading new xml from gesetze-im-internet.de…")
+
+    run_command(["./lawde.py", "loadall", f"--path={RAW_XML_PATH}"])


This is a lot of data. Do we really need to download everything daily again?

ulfgebhardt assigned jbruechert Mar 29, 2021

jbruechert force-pushed the ci-update branch 2 times, most recently from 3372da0 to e5ad111 Compare March 29, 2021 17:52

jbruechert force-pushed the ci-update branch 2 times, most recently from a2c23e3 to 3d7570f Compare March 29, 2021 20:06

jbruechert added 8 commits March 29, 2021 23:33

ci: Add job for updating the gesetze.git repository

721b0b7

Shallow-clone gesetze.git

66f0fcc

Add test stderr print

1cf7529

Use bot account for commit

fdf6489

Push changes in the data directory

d0e5542

Move extracting to new workflow

7489da5

Clean up workflows

be20e9c

Also run test workflow for pull requests, but don't scrape data in them.

Updatelawsgit: Also run bgbl_scraper

32f17fc

jbruechert force-pushed the ci-update branch from ac9839f to 32f17fc Compare March 29, 2021 21:33

Update data

88c4530

darkdragon-001 reviewed Sep 15, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Draft: Update data on the CI #26

Draft: Update data on the CI #26

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021 •

edited

Loading

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021 •

edited

Loading

jbruechert commented Mar 29, 2021 •

edited

Loading

Croydon commented Sep 14, 2021

jbruechert commented Sep 14, 2021

darkdragon-001 left a comment

darkdragon-001 Sep 15, 2021

darkdragon-001 Sep 15, 2021

darkdragon-001 Sep 15, 2021

darkdragon-001 Sep 15, 2021

darkdragon-001 Sep 15, 2021

Draft: Update data on the CI #26

Are you sure you want to change the base?

Draft: Update data on the CI #26

Conversation

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021 • edited Loading

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021

jbruechert commented Mar 29, 2021

ulfgebhardt commented Mar 29, 2021 • edited Loading

jbruechert commented Mar 29, 2021 • edited Loading

Croydon commented Sep 14, 2021

jbruechert commented Sep 14, 2021

darkdragon-001 left a comment

Choose a reason for hiding this comment

darkdragon-001 Sep 15, 2021

Choose a reason for hiding this comment

darkdragon-001 Sep 15, 2021

Choose a reason for hiding this comment

darkdragon-001 Sep 15, 2021

Choose a reason for hiding this comment

darkdragon-001 Sep 15, 2021

Choose a reason for hiding this comment

darkdragon-001 Sep 15, 2021

Choose a reason for hiding this comment

ulfgebhardt commented Mar 29, 2021 •

edited

Loading

ulfgebhardt commented Mar 29, 2021 •

edited

Loading

jbruechert commented Mar 29, 2021 •

edited

Loading