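In the news feed `pro.major_news`, some `src` values return more than 200 duplicated articles within a very short time span. Because the API caps the number of rows returned per request, these duplicates crowd out news from other `src` values, so some articles cannot be retrieved at all. In the window below, for example, more than 200 rows come back within 5 minutes, but only 16 remain after deduplication. I hope this can be deduplicated on the official side.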
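    import tushare as ts

    # Assumes a token was registered earlier with ts.set_token(...)
    pro = ts.pro_api()

    df = pro.major_news(src='', start_date='2020-06-03 08:30:00', end_date='2020-06-03 08:35:00', fields='pub_time,title,content,src')
    print(len(df))                                               # rows returned
    print(len(df.drop_duplicates(subset=['title', 'content'])))  # rows left after deduplication
    df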
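Until this is fixed upstream, a client-side workaround could look like the sketch below: query in small windows so that each request stays under the per-call cap, then drop repeated rows. The helper names `split_windows` and `dedupe` are hypothetical and not part of tushare, and the 5-minute step is only an assumption; a window that still hits the cap would need a smaller step.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def split_windows(start, end, minutes=5):
    """Yield (start, end) timestamp strings covering [start, end] in steps of `minutes`."""
    cur = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    step = timedelta(minutes=minutes)
    while cur < stop:
        nxt = min(cur + step, stop)
        yield cur.strftime(FMT), nxt.strftime(FMT)
        cur = nxt


def dedupe(records, keys=("title", "content")):
    """Keep the first row for each distinct combination of `keys`, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Each pair from `split_windows` can be passed as `start_date` and `end_date` to `pro.major_news`, and the rows of each result collected with `df.to_dict("records")` before calling `dedupe`. Adjacent windows share a boundary timestamp, so rows published exactly on a boundary may be fetched twice; `dedupe` removes those as well.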
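Below are some of the time spans with heavy duplication. I have not finished pulling the data, so the list is probably incomplete. From what I have seen, src='凤凰财经' accounts for a large share of the duplicates (the problem is much less frequent after February 2022):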
    2020-05-18 08:30:00    2020-05-18 09:00:00
    2020-06-03 08:30:00    2020-06-03 09:00:00
    2020-06-05 08:30:00    2020-06-05 09:00:00
    2020-06-06 08:30:00    2020-06-06 09:00:00
    2020-06-06 15:00:00    2020-06-06 15:30:00
    2020-06-09 09:00:00    2020-06-09 09:30:00
    2020-06-11 08:30:00    2020-06-11 09:00:00
    2020-06-11 15:30:00    2020-06-11 16:00:00
    2020-06-12 08:30:00    2020-06-12 09:00:00
    2020-06-12 20:30:00    2020-06-12 23:59:59
    2020-06-14 16:00:00    2020-06-15 16:59:59
    2020-12-21 09:30:00    2020-12-21 10:00:00
    2021-06-21 05:00:00    2021-06-21 08:00:00
    2021-06-23 00:00:00    2021-06-23 01:00:00
    2021-06-25 01:00:00    2021-06-25 07:00:00
    2021-06-29 03:00:00    2021-06-29 04:00:00
    2021-07-01 01:00:00    2021-07-01 04:00:00
    2021-07-07 01:00:00    2021-07-07 02:00:00
    2021-07-11 22:00:00    2021-07-12 08:00:00
    2021-07-12 17:30:00    2021-07-12 23:00:00
    2021-07-14 01:00:00    2021-07-14 02:00:00
    2021-07-21 20:30:00    2021-07-21 22:00:00
    2021-07-22 18:30:00    2021-07-27 18:00:00
    2021-07-29 13:00:00    2021-07-30 15:00:00
    2021-07-31 14:00:00    2021-07-31 15:00:00
    2021-08-03 18:00:00    2021-08-03 19:00:00
    2021-08-04 13:30:00    2021-08-04 15:00:00
    2021-08-05 21:00:00    2021-08-06 14:00:00
    2021-08-08 07:00:00    2021-08-08 13:00:00
    2021-08-09 18:00:00    2021-08-09 19:00:00
    2021-08-10 18:00:00    2021-08-10 19:00:00
    2021-08-13 19:00:00    2021-08-14 09:00:00
    2021-08-19 06:00:00    2021-08-19 07:00:00
    2021-08-28 21:30:00    2021-08-28 08:00:00
    2021-09-03 07:00:00    2021-09-03 09:00:00
Time spans after 2021-9-3 are in the following CSV file: long-news-import-error-timespan.csv
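To check which `src` contributes the most redundant rows in a given window, a tally along these lines might help. It is only a sketch: `duplicate_counts_by_src` is a hypothetical helper, and it assumes rows are plain dicts with `src`, `title` and `content` keys, for example from `df.to_dict("records")`.

```python
from collections import Counter


def duplicate_counts_by_src(records, keys=("title", "content")):
    """Count, per src, the rows that repeat an earlier row with the same src and `keys`."""
    seen = set()
    extra = Counter()
    for rec in records:
        key = (rec.get("src"),) + tuple(rec.get(k) for k in keys)
        if key in seen:
            extra[rec.get("src")] += 1  # a repeat of a row already seen
        else:
            seen.add(key)
    return extra
```

Calling `duplicate_counts_by_src(rows).most_common()` lists the sources ordered by how many redundant rows each one produced.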
tushare id:382058