Some data is missed when crawling. What should I adjust? A slower crawl is fine, but I want it to be complete. #319

Open
blacksh1982 opened this issue Mar 7, 2024 · 3 comments

Comments

@blacksh1982

blacksh1982 commented Mar 7, 2024

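First of all, thanks to the author for such a great tool. The read-count field that was added a few days ago is very convenient.
At the moment, when I run the crawl several times under identical conditions, the results differ every time, always by about 10 posts. (I crawl month by month; each month has around 300 Weibo posts and never more than 350.) Only one run was complete. A preliminary look at the data shows no obvious pattern: the missing items differ from run to run. Could this be a network problem, or could posts be missed when a page loads slowly during pagination? Should I increase the time-split interval shown in the image below?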

image
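To make the time-split question concrete, here is a minimal sketch of cutting a month into fixed-size date windows so that each window can be crawled and checked separately. It is only an illustration: `split_windows` is a hypothetical helper, and how (or whether) the crawler accepts such windows is not stated in this thread.

```python
from datetime import date, timedelta


def split_windows(start: date, end: date, days: int):
    """Yield inclusive (window_start, window_end) pairs covering start..end.

    Hypothetical helper for illustration; not part of the crawler.
    """
    if days < 1:
        raise ValueError("days must be >= 1")
    cur = start
    while cur <= end:
        # The last window is clipped so it never runs past `end`.
        win_end = min(cur + timedelta(days=days - 1), end)
        yield cur, win_end
        cur = win_end + timedelta(days=1)


# Example: February 2024 in 7-day windows.
windows = list(split_windows(date(2024, 2, 1), date(2024, 2, 29), days=7))
for win_start, win_end in windows:
    print(win_start.isoformat(), "to", win_end.isoformat())
```

Smaller windows mean fewer result pages per query, which may make it easier to see which date range loses posts between runs.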

@blacksh1982
Author

I tried setting DOWNLOAD_DELAY = 20 in setting.py. The crawl did become very slow, but posts are still being missed.

image
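Since `DOWNLOAD_DELAY` is a Scrapy setting, the fragment below sketches other standard Scrapy settings that affect how failed or slow requests are handled. It assumes the tool is Scrapy-based; the values are illustrative guesses, not the project's defaults, and nothing here is confirmed to fix the missing posts.

```python
# Scrapy settings fragment; illustrative values only (assumption: the
# project reads a standard Scrapy settings module).
DOWNLOAD_DELAY = 20            # value already tried in this thread
CONCURRENT_REQUESTS = 1        # send one request at a time
RETRY_ENABLED = True           # let RetryMiddleware re-send failed requests
RETRY_TIMES = 5                # retries per request on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
DOWNLOAD_TIMEOUT = 60          # seconds before a slow response counts as failed
AUTOTHROTTLE_ENABLED = True    # adapt the delay to observed latency
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound for the adaptive delay
LOG_LEVEL = "INFO"             # keep retry and error messages visible in the log
```

Retries only help when a request visibly fails (timeout or one of the listed status codes). If the server answers with HTTP 200 but an incomplete page, these settings will not detect it, which would be consistent with the delay alone not helping.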

@chengcheng0509

I've run into the same problem! Have you solved it? Any advice would be appreciated!

@blacksh1982
Author

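> I've run into the same problem! Have you solved it? Any advice would be appreciated!

I haven't solved it. For now I can only compare the crawled data against the original data and fill in the missing posts by hand. Recently I've also noticed that some data gets crawled more than once.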
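The manual comparison described above could be partly automated. The sketch below merges several crawl runs, drops duplicate posts, and lists which reference IDs are still missing. It assumes each post carries a unique `id` column and that the output is CSV; neither is confirmed in this thread, and all function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List

Row = Dict[str, str]


def read_rows(path: str) -> List[Row]:
    """Load one crawl run from a CSV file (assumed output format)."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def merge_runs(runs: Iterable[Iterable[Row]], key: str = "id") -> Dict[str, Row]:
    """Union several runs, keeping the first row seen for each key."""
    merged: Dict[str, Row] = {}
    for rows in runs:
        for row in rows:
            merged.setdefault(row[key], row)  # later duplicates are ignored
    return merged


def missing_ids(merged: Dict[str, Row], reference_ids: Iterable[str]) -> List[str]:
    """Return reference IDs that no run captured, in sorted order."""
    return sorted(set(reference_ids) - set(merged))


# Tiny in-memory example: run_a contains a duplicate, run_b adds a new post.
run_a = [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}, {"id": "2", "text": "b"}]
run_b = [{"id": "2", "text": "b"}, {"id": "3", "text": "c"}]
merged = merge_runs([run_a, run_b])
still_missing = missing_ids(merged, ["1", "2", "3", "4"])
```

Because each run seems to miss a different handful of posts, the union of a few runs may come closer to the full set than any single run, and `still_missing` narrows down what has to be added by hand.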
