Draft: Update data on the CI #26
base: master
Conversation
Force-pushed from 3372da0 to e5ad111
So far we are treating this data as input for our other tools. If someone has a use for it outside of this, we can move it to a new repo, I guess.
Oh, I did not know that this tool processes the crawled data further. That's cool, I just thought it might be wise to separate code and data. If it's not, that's perfectly fine.
Before we can merge this, we need to fill in the years between 2013 and 2021 with a manual run locally, by the way. The script currently only starts at the year of the latest commit in the gesetze repo, which is from this year, so it would get confused by the gap.
I guess this is to increase efficiency? How long does a whole scrape take? My experience with bundestag.de is that they change old content regularly, and their unstable web interfaces cause faulty data to show up every now and then; this can be detected with regular crawls. So what speaks against crawling everything once a day?
The Bundesanzeiger scraper is really slow. As far as I understand the code, the actual laws are updated completely, but the index of changed laws is only extended, not refreshed. Even crawling just 2021 in the Bundesanzeiger takes about 5 minutes, iirc, and crawling everything would take multiple hours.
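For illustration, a rough sketch of what "extended, not refreshed" could look like; the file layout (a dict keyed by year) is my assumption, not necessarily the actual format the scrapers write:

import json

def extend_index(index_path: str, new_entries: dict) -> None:
    # Load the existing index (or start fresh) and merge in the new years only,
    # leaving previously scraped years untouched.
    try:
        with open(index_path, encoding="utf-8") as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}
    index.update(new_entries)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2, ensure_ascii=False)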
Force-pushed from a2c23e3 to 3d7570f
Also run test workflow for pull requests, but don't scrape data in them.
What is the status of this? Any particular reason why the work in this PR did not get continued?
The data from 2013 to 2021 needs to be added (for example by editing updatelawsgit.py locally). Afterwards I think this could be merged. It would be nice if someone else could take over that work; I'm not that active here right now.
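As a hedged sketch, such a one-off local backfill could reuse the scraper calls visible in the diff below with an explicit year range; where exactly this hook belongs in updatelawsgit.py is an assumption on my part:

# One-off local backfill, mirroring the scraper invocations from the diff below.
run_command(["./banz_scraper.py", BANZ_FILE, "2013", "2021"])
run_command(["./bgbl_scraper.py", BGBL_FILE, "2013", "2021"])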
def clone_lawsgit() -> None:
    print("Updating gesetze.git…")

    if not os.path.exists(LAWS_PATH):
        run_command(["git", "clone", "--depth=1", LAWS_REPOSITORY, LAWS_PATH])
    else:
        run_command(["git", "-C", LAWS_PATH, "pull"])
Is this really needed, since you are running uses: actions/checkout@v2?
def run_command(command: List[str]) -> None:
    if subprocess.check_call(command) != 0:
        print("Error while executing", command)
        exit(1)
Should we really use subprocesses, or would it be better to just import the Python files directly?
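Roughly what I mean; this assumes the scraper scripts are importable and expose a callable entry point (e.g. main()), which I have not verified:

import banz_scraper  # assumption: the scripts can be imported as modules
import bgbl_scraper

def scrape_indexes(minyear: int, maxyear: int) -> None:
    # Calling into Python directly avoids spawning subprocesses and lets
    # exceptions propagate instead of checking return codes.
    banz_scraper.main(BANZ_FILE, minyear, maxyear)  # hypothetical signature
    bgbl_scraper.main(BGBL_FILE, minyear, maxyear)  # hypothetical signature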
def get_latest_year() -> int:
    repo = Repo(LAWS_PATH)
    timestamp = repo.head.commit.committed_date
    date = datetime.fromtimestamp(timestamp)
    return date.year
How can we ensure that this correlation is always valid? Maybe we should check the content of the downloaded JSON files instead.
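Something along these lines; it assumes the downloaded index JSON is keyed by year strings, which is an unverified assumption about the file format:

import json

def get_latest_year_from_index(index_file: str) -> int:
    with open(index_file, encoding="utf-8") as f:
        index = json.load(f)
    # Take the newest year actually present in the scraped data,
    # independent of when the gesetze repo was last committed to.
    return max(int(year) for year in index)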
run_command(["./banz_scraper.py", BANZ_FILE, str(minyear), str(maxyear)]) | ||
run_command(["./bgbl_scraper.py", BGBL_FILE, str(minyear), str(maxyear)]) | ||
|
||
# TODO add the other indexes here, once they are working |
Are they working now?
def fetch_raw_xml() -> None:
    print("Downloading new xml from gesetze-im-internet.de…")

    run_command(["./lawde.py", "loadall", f"--path={RAW_XML_PATH}"])
This is a lot of data. Do we really need to re-download everything every day?
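Just to illustrate the idea of skipping unchanged files, a sketch using an HTTP conditional GET; whether gesetze-im-internet.de honours If-Modified-Since, and whether lawde.py could use it, is not something I have checked:

from typing import Optional
import requests

def fetch_if_modified(url: str, last_fetch: str) -> Optional[bytes]:
    # Conditional GET: the server answers 304 if nothing changed since last_fetch.
    response = requests.get(url, headers={"If-Modified-Since": last_fetch})
    if response.status_code == 304:
        return None  # unchanged, nothing to download
    response.raise_for_status()
    return response.content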