中文 | English (translated by ChatGPT)
A small crawler written to learn Scrapy.
The data is stored in an SQLite database for easy sharing; the full database is 3.7 GB.
I have released the complete SQLite database file on GitHub.
Database Update: 2023-07-22-full
It contains all post data, including hot topics. Post and comment content are now scraped in their original HTML format. In addition, the topic table has a new reply_count field, and the comment table has a new no field.
Crawling took several dozen hours: crawling too fast gets the IP banned, and I didn't use a proxy pool. With the concurrency set to 3, the crawler should be able to run continuously without bans.
The crawler starts from topic_id = 1 and requests https://www.v2ex.com/t/{topic_id}. The server may return 404, 403, 302, or 200: 404 means the post has been deleted, 403 means the crawler has been restricted, 302 is usually a redirect to the login page or the homepage, and 200 is a normal page.
The crawler fetches post content, comments, and user information during the crawling process.
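For illustration, handling those status codes in a Scrapy spider could look like the sketch below. This is not the project's actual spider, just a minimal example of the pattern: handle_httpstatus_list lets the non-200 responses reach the callback instead of being dropped by Scrapy's built-in middlewares.

```python
import scrapy


class TopicSketchSpider(scrapy.Spider):
    """Minimal sketch of the 404/403/302/200 handling described above."""

    name = "topic-sketch"  # hypothetical name, not the project's spider
    # Let these non-200 responses reach parse() instead of being filtered.
    handle_httpstatus_list = [302, 403, 404]

    def start_requests(self):
        for topic_id in range(1, 100):
            yield scrapy.Request(
                f"https://www.v2ex.com/t/{topic_id}",
                meta={"dont_redirect": True},  # keep 302s visible
            )

    def parse(self, response):
        if response.status == 404:
            return  # post deleted: skip it
        if response.status == 403:
            self.logger.warning("403: crawler restricted, back off")
            return
        if response.status == 302:
            self.logger.info("302: probably needs login: %s", response.url)
            return
        # 200: extract the post, its comments, and user info here
```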
Database table structure: Table structure source code
Make sure your Python version is >= 3.10, then install the dependencies:

```shell
pip install -r requirements.txt
```
The default concurrency is 1. To change it, modify CONCURRENT_REQUESTS in v2ex_scrapy/settings.py.
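For example, to use the concurrency of 3 mentioned above, the relevant line in v2ex_scrapy/settings.py would be:

```python
# v2ex_scrapy/settings.py
CONCURRENT_REQUESTS = 3  # default is 1; 3 reportedly crawls without bans
```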
Some posts and some post information require login to crawl. To log in, set a Cookie: modify the COOKIES value in v2ex_scrapy/settings.py:

```python
COOKIES = """
a=b;c=d;e=f
"""
```
To use proxies, change the value of PROXIES in v2ex_scrapy/settings.py, for example:

```python
PROXIES = [
    "http://127.0.0.1:7890"
]
```
Each request randomly picks one of the proxies. If you need a more advanced proxying scheme, use a third-party library or implement a downloader middleware yourself, as sketched below.
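A minimal sketch of such a middleware, assuming PROXIES is read from settings (the class name is hypothetical, and it would need to be registered under DOWNLOADER_MIDDLEWARES):

```python
import random


class RandomProxyMiddleware:
    """Attach a randomly chosen proxy from the PROXIES setting to each request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
```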
Writing to a log file is disabled by default. To enable it, uncomment this line in v2ex_scrapy/settings.py:

```python
LOG_FILE = "v2ex_scrapy.log"
```
Crawl all posts, user information, and comments on the entire site:

```shell
scrapy crawl v2ex
```
Crawl posts, user information, and comments for a specific node (if node-name is empty, the "flamewar" node is crawled):

```shell
scrapy crawl v2ex-node node=${node-name}
```
Crawl user information, starting from uid=1 and crawling up to uid=635000:

```shell
scrapy crawl v2ex-member start_id=${start_id} end_id=${end_id}
```
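For reference, stock Scrapy passes spider arguments with the -a flag (for example `scrapy crawl v2ex-member -a start_id=1 -a end_id=1000`) and hands them to the spider's `__init__`; this project may parse them differently. A hedged sketch of a spider accepting these arguments (the member URL is a guess; V2EX also exposes a JSON API for members):

```python
import scrapy


class MemberSketchSpider(scrapy.Spider):
    name = "member-sketch"  # hypothetical, not the project's spider

    def __init__(self, start_id=1, end_id=635000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes -a arguments as strings.
        self.start_id = int(start_id)
        self.end_id = int(end_id)

    def start_requests(self):
        for uid in range(self.start_id, self.end_id + 1):
            # Assumed endpoint: V2EX's public member API, queried by uid.
            yield scrapy.Request(
                f"https://www.v2ex.com/api/members/show.json?id={uid}"
            )

    def parse(self, response):
        yield response.json()  # sketch: store the member record
```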
If you see `scrapy: command not found`, the Python package installation path has not been added to your PATH environment variable.
To resume an interrupted crawl, simply run the crawl command again; it automatically continues, skipping the posts that have already been crawled:

```shell
scrapy crawl v2ex
```
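Presumably resuming works by checking which topic ids are already present in the SQLite database before scheduling requests. A minimal sketch of that idea (the database filename, table, and column names are assumptions, modeled on the topic table mentioned above):

```python
import sqlite3

import scrapy


def load_crawled_ids(db_path: str = "v2ex.sqlite") -> set[int]:
    """Collect the topic ids already stored in the database (sketch)."""
    with sqlite3.connect(db_path) as conn:
        # Assumes a `topic` table with an `id` column.
        return {row[0] for row in conn.execute("SELECT id FROM topic")}


class ResumeSketchSpider(scrapy.Spider):
    name = "resume-sketch"  # hypothetical, not the project's spider

    def start_requests(self):
        done = load_crawled_ids()
        for topic_id in range(1, 1_000_000):
            if topic_id in done:
                continue  # already crawled: skip
            yield scrapy.Request(f"https://www.v2ex.com/t/{topic_id}")
```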
If you encounter 403 errors while crawling, your IP has most likely been restricted. Wait a while before trying again.
The SQL queries used for the statistics are in the query.sql file, and the source code for the charts is in the analysis subproject, which includes a Python script for exporting the data to JSON and a frontend display project.
The first analysis can be found at https://www.v2ex.com/t/954480
For more detailed data, I suggest downloading the database.
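If you do download it, Python's built-in sqlite3 module is enough to explore the data. A small example (the database filename is an assumption; the topic and comment tables are the ones described above):

```python
import sqlite3

conn = sqlite3.connect("v2ex.sqlite")  # filename is an assumption

# Count the rows in the two tables the README describes.
for table in ("topic", "comment"):
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(table, count)

conn.close()
```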
- Total posts: 801,038 (roughly 800,000)
- Total comments: 10,899,382 (roughly 10 million)
- Total users: 194,534 (roughly 200,000); see Note 2 in the Explanation of Crawled Data
The full "thanks" data is too large to display here, so only the top entries are shown. You can click the links, or download the database and query it with SQL; the SQL queries are also included in the open-source files.
Top 10 posts by votes:

Post Link | Title | Votes |
---|---|---|
https://www.v2ex.com/t/110327 | UP n DOWN vote in V2EX | 321 |
https://www.v2ex.com/t/295433 | Snipaste - 开发了三年的截图工具,但不只是截图 | 274 |
https://www.v2ex.com/t/462641 | 在 D 版发过了,不过因为不少朋友看不到 D 版,我就放在这里吧,说说我最近做的这个 Project | 200 |
https://www.v2ex.com/t/658387 | 剽窃别人成果的人一直有,不过今天遇到了格外厉害的 | 179 |
https://www.v2ex.com/t/745030 | QQ 正在尝试读取你的浏览记录 | 177 |
https://www.v2ex.com/t/689296 | 早上还在睡觉,自如管家进了我卧室... | 145 |
https://www.v2ex.com/t/814025 | 分享一张我精心修改调整的 M42 猎户座大星云(Orion Nebula)壁纸。用了非常多年,首次分享出来,能和 MBP 2021 新屏幕和谐相处。 | 136 |
https://www.v2ex.com/t/511827 | 23 岁,得了癌症,人生无望 | 129 |
https://www.v2ex.com/t/427796 | 隔壁组的小兵集体情愿 要炒了 team leader | 123 |
https://www.v2ex.com/t/534800 | 使用 Github 账号登录 黑客派 之后, Github 自动 follow | 112 |
Top 10 posts by views:

Post Link | Title | Views |
---|---|---|
https://www.v2ex.com/t/510849 | chrome 签到插件 [魂签] 更新啦 | 39,452,510 |
https://www.v2ex.com/t/706595 | 迫于搬家 ··· 继续出 700 本书~ 四折 非技术书还剩 270 多本· | 2,406,584 |
https://www.v2ex.com/t/718092 | 使用 GitHub 的流量数据为仓库创建访问数和克隆数的徽章 | 1,928,267 |
https://www.v2ex.com/t/861832 | 帮朋友推销下福建古田水蜜桃,欢迎各位购买啊 | 635,832 |
https://www.v2ex.com/t/176916 | 王垠这是在想不开吗 | 329,617 |
https://www.v2ex.com/t/303889 | 关于 V2EX 提供的 Android Captive Portal Server 地址的更新 | 295,681 |
https://www.v2ex.com/t/206766 | 如何找到一些有趣的 telegram 群组? | 294,553 |
https://www.v2ex.com/t/265474 | ngrok 客户端和服务端如何不验证证书 | 271,244 |
https://www.v2ex.com/t/308080 | Element UI——一套基于 Vue 2.0 的桌面端组件库 | 221,099 |
https://www.v2ex.com/t/295433 | Snipaste - 开发了三年的截图工具,但不只是截图 | 210,675 |
Top 10 users by comment count:

User | Comment Count |
---|---|
Livid | 19559 |
loading | 19190 |
murmur | 17189 |
msg7086 | 16768 |
Tink | 15919 |
imn1 | 11468 |
20015jjw | 10293 |
x86 | 9704 |
opengps | 9694 |
est | 9532 |
Top 10 users by topic count:

User | Topic Count |
---|---|
Livid | 6974 |
icedx | 722 |
ccming | 646 |
2232588429 | 614 |
razios | 611 |
coolair | 604 |
Kai | 599 |
est | 571 |
Newyorkcity | 553 |
WildCat | 544 |
For detailed data, I recommend downloading the database.
Top 10 nodes by post count:

Node | Post Count |
---|---|
qna | 188011 |
all4all | 103254 |
programmer | 51706 |
jobs | 49959 |
share | 35942 |
apple | 20713 |
macos | 19040 |
create | 18685 |
python | 14124 |
career | 13170 |
Top 10 tags by post count:

Tag | Post Count |
---|---|
开发 | 16414 |
App | 13240 |
Python | 13016 |
Mac | 12931 |
Java | 10984 |
Pro | 9375 |
iOS | 9216 |
微信 | 8922 |
V2EX | 8426 |
域名 | 8424 |