adding China National People's Congress #861

Merged: 12 commits, May 29, 2024

@bgmello bgmello commented May 16, 2024

Fixes opensanctions/crawler-planning#170

It currently fails with an assertion error because it does not create an entity for each row, and I haven't yet found out why.

@bgmello bgmello marked this pull request as draft May 16, 2024 14:43

@jbothma jbothma left a comment

Let's call this something like cn_wikidata_npc so that if we get an official source one day, we can use this name or similar for it.

Review comments (outdated, resolved):
datasets/cn/national_congress/cn_national_congress.yaml (4 comments)
datasets/cn/national_congress/crawler.py (3 comments)

bgmello commented May 21, 2024

I'm actually using the data from NPC Observer, because it was already in a Google Sheet. Do you want me to extract the data from Wikipedia instead, or is this OK?

bgmello and others added 2 commits May 21, 2024 14:46

jbothma commented May 22, 2024

Oh right - I pasted one table from Wikipedia into my Google Sheet and it seems to work fine: https://docs.google.com/spreadsheets/d/1WfvKmydpsyIQe-5V28meRx2miVNCxMAtbtM-taJDFNw/edit#gid=0

I think it's worth using the Wikipedia data since it contains dates of birth, which help a lot to disambiguate when two people have the same name.

bgmello commented May 23, 2024

I added the data from Wikipedia to the Google Sheet. I used the following code to extract it:

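import io

import pandas as pd
import requests

# Chinese Wikipedia list of deputies to the 14th National People's Congress
url = 'https://zh.wikipedia.org/wiki/第十四届全国人民代表大会代表名单'
# Column names: name, party, gender, ethnicity, date of birth, position, notes
columns = ['姓名', '政党', '性别', '民族', '出生日期', '职务', '备注']

r = requests.get(url)
r.raise_for_status()
# Wrap the HTML in StringIO; recent pandas versions deprecate passing a literal string
dfs = pd.read_html(io.StringIO(r.text))

frames = []
# Use tables 1 to 35 of the page (table 0 is skipped)
for df in dfs[1:36]:
    # Flatten the multi-level header and keep only the first seven columns
    df.columns = df.columns.get_level_values(0)
    df = df.iloc[:, :7]
    df.columns = columns
    frames.append(df)

final_df = pd.concat(frames, ignore_index=True)
final_df.to_excel("china_national_people_congress.xlsx", index=False)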

bgmello commented May 24, 2024

I'm having trouble with the assertion because in some cases we don't have enough information to create a unique ID. For example, there are two deputies named Zhang Qiang who are both male and of Han ethnicity. Since we have no other distinguishing information about them, such as a birth date, we generate the same ID for both. What should we do in this case?
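
To illustrate the collision, here is a minimal sketch of how rows that share the same ID inputs could be surfaced. It is not the crawler's actual code: `find_collisions`, the field names, and the sample rows (apart from the name Zhang Qiang) are hypothetical and only stand in for the spreadsheet data.

```python
from collections import defaultdict

# Hypothetical sample rows; only the "Zhang Qiang" case comes from this thread.
rows = [
    {"name": "Zhang Qiang", "gender": "male", "ethnicity": "Han", "birth_date": None},
    {"name": "Zhang Qiang", "gender": "male", "ethnicity": "Han", "birth_date": None},
    {"name": "Example Deputy", "gender": "female", "ethnicity": "Han", "birth_date": None},
]


def find_collisions(rows, key_fields):
    """Group rows by the fields that feed the ID and keep groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        groups[key].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Fields assumed to be used when building the ID.
collisions = find_collisions(rows, ["name", "gender", "ethnicity", "birth_date"])
for key, group in collisions.items():
    print(key, "appears", len(group), "times")
```

Any key reported here would produce the same ID for several rows, which is what trips the one-entity-per-row assertion.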

jbothma commented May 28, 2024

Well spotted!

It looks like one represents Jiangsu Province and the other represents Jiangxi Province.

Let's include the province they represent in the ID.
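
A rough sketch of the idea, assuming the ID is a hash of the identifying fields. The `make_id` helper and the `cn-npc` prefix below are hypothetical stand-ins for illustration, not the project's real ID function.

```python
import hashlib


def make_id(prefix, *parts):
    """Hypothetical helper: hash the non-empty parts into a stable, prefixed ID."""
    cleaned = [str(part).strip() for part in parts if part is not None and str(part).strip()]
    if not cleaned:
        return None
    digest = hashlib.sha1("|".join(cleaned).encode("utf-8")).hexdigest()
    return f"{prefix}-{digest}"


# Without the province, both deputies hash to the same ID.
ids_without_province = {
    make_id("cn-npc", "Zhang Qiang", "male", "Han") for _ in ("Jiangsu", "Jiangxi")
}

# With the province as an extra part, the two IDs differ.
ids_with_province = {
    make_id("cn-npc", "Zhang Qiang", "male", "Han", province)
    for province in ("Jiangsu", "Jiangxi")
}

print(len(ids_without_province), len(ids_with_province))
```

Because the province is only one more input to the hash, existing rows keep deterministic IDs across runs as long as the province value in the sheet does not change.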

jbothma commented May 28, 2024

Could you include your script for scraping the content in the crawler directory? Then it's easy to update.

@jbothma jbothma merged commit e0604fd into opensanctions:main May 29, 2024
4 of 5 checks passed