appaulo14/spark_analysis_of_public_data_from_askubuntu


Finding the Most Common Topics on askubuntu.com

Abstract

This article analyzes the questions on askubuntu.com to find the most commonly asked-about topics, in order to better understand which areas of Ubuntu may need more attention for bug fixing and which features might be worth adding in future releases. To do this, I analyzed public data from askubuntu.com using Azure HDInsight with Spark. Counting tags was the most useful approach; counting words in the titles and body text was less useful. Future research might try a natural language processing library such as NLTK to better identify the topics asked about and the types of questions asked for each topic.

Introduction

Big Data consists of large amounts of unstructured or semi-structured data that can be analyzed to derive new insights that cannot easily be found by searching the data manually. For example, one could parse gigabytes of server log files to find common causes of errors or slowdowns on a cluster of servers. Another example would be analyzing tweets from Twitter to determine public sentiment about a product. A third example would be analyzing customer buying behavior to better deliver targeted advertising.

This article focuses on analyzing the questions on askubuntu.com to find the most common topics asked about in order to better understand what areas of Ubuntu may need more attention for bug fixing and also what features might be good to add in future releases of Ubuntu.

I'm in no way affiliated with Ubuntu itself. This analysis is for demonstration purposes only.

Methods

The data was obtained from the Stack Exchange Data Dump on archive.org. The 7z archive was extracted, and the XML files inside were uploaded to HDFS storage on Microsoft Azure.

After the files were uploaded to Azure, two Python scripts for Spark 2.0 (find_top_tags_for_askubuntu.com.py and find_top_words_for_askubuntu.com.py) were written and executed on an HDInsight cluster, following this Azure HDInsight/Spark guide.

These scripts can be run from a Spark 2.0 cluster using the following commands:
spark-submit find_top_tags_for_askubuntu.com.py
spark-submit find_top_words_for_askubuntu.com.py

The scripts are saved in the scripts section of this repository.

The results from these scripts are saved in the results section of this repository. For find_top_words_for_askubuntu.com.py, only the top 1,000 results were saved due to size limitations.
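
The tag-counting step can be illustrated without a cluster. The sketch below is not the repository's Spark script; it is a single-machine approximation in plain Python. It assumes that each question is a `row` element in `Posts.xml` with `PostTypeId="1"` and a `Tags` attribute of the form `<boot><grub2>`, which is the layout I would expect from Stack Exchange dumps of that period. The function name `count_tags` and the sample rows are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed tag encoding in the dump: "<boot><grub2>" (angle-bracket delimited).
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def count_tags(xml_lines):
    """Count tag occurrences across question rows (PostTypeId == "1")."""
    counts = Counter()
    for line in xml_lines:
        line = line.strip()
        if not line.startswith("<row"):
            continue  # skip the XML declaration and the <posts> wrapper
        row = ET.fromstring(line)
        if row.get("PostTypeId") != "1":
            continue  # only questions are counted here
        counts.update(TAG_PATTERN.findall(row.get("Tags", "")))
    return counts


# Hypothetical sample rows, not taken from the real dump.
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<posts>",
    '  <row Id="1" PostTypeId="1" Title="Grub error" Tags="&lt;boot&gt;&lt;grub2&gt;" />',
    '  <row Id="2" PostTypeId="2" Body="Try boot-repair." />',
    '  <row Id="3" PostTypeId="1" Title="Dual boot" Tags="&lt;boot&gt;&lt;dual-boot&gt;" />',
    "</posts>",
]

top_tags = count_tags(sample).most_common(25)
print(top_tags)
```

In a Spark job, the same per-row parsing would typically sit inside a `flatMap` followed by a `reduceByKey`, so that the counting is distributed across the cluster instead of held in one `Counter`.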

Results

Table 1: Top 25 Tags

Rank Tag Count
1 14.04 21148
2 12.04 17412
3 boot 13098
4 command-line 12294
5 networking 12101
6 16.04 11278
7 dual-boot 10458
8 drivers 9723
9 unity 9122
10 wireless 9018
11 server 8852
12 apt 8589
13 grub2 7755
14 partitioning 7474
15 installation 7221
16 nvidia 6498
17 gnome 5818
18 system-installation 5651
19 upgrade 5507
20 bash 5470
21 usb 5404
22 package-management 5356
23 11.10 5125
24 software-installation 5054
25 sound 4961

Table 2: Top 25 Words in Title of Questions*

*Filtered out generic words such as prepositions and conjunctions

Rank Word Count
1 ubuntu 71478
2 install 19539
3 14.04 12902
4 windows 12491
5 boot 11096
6 error 9763
7 16.04 9562
8 file 9426
9 cant 9348
10 12.04 9318
11 installing 8392
12 - 8332
13 working 8046
14 using 7860
15 screen 7003
16 server 6999
17 files 6851
18 usb 6647
19 work 5557
20 installation 5438
21 system 5340
22 command 5055
23 update 5006
24 drive 4921
25 upgrade 4898

Table 3: Top 25 Words in Body of Questions*

*Filtered out generic words such as prepositions and conjunctions

Rank Word Count
1 ubuntu 385199
2 install 264532
3 file 210150
4 using 176274
5 all 151640
6 0 143552
7 - 141588
8 windows 141341
9 installed 140034
10 apt-get 131660
11 like 130961
12 sudo 127988
13 boot 120906
14 system 117750
15 its 114235
16 some 113739
17 need 111594
18 run 111025
19 up 110354
20 one 108843
21 command 105521
22 error 102161
23 1 101823
24 only 100010
25 files 97827

Discussion/Conclusion

Tags were the most useful. The most common questions seemed to be about Ubuntu LTS releases: all three recent LTS releases (12.04, 14.04, 16.04) are among the top 6 tags. This may be because LTS releases are the most widely used. Many questions relate to booting (boot: 3rd place, dual-boot: 7th place, grub2: 13th place). This might be due to the wide variety of hardware Ubuntu runs on, but I cannot say for sure. Networking-related questions were also common (networking: 5th place, wireless: 10th place). Additionally, many people seem to be interested in running Ubuntu as a server, judging by the server tag coming in at 11th place. Other notably high tags were related to drivers, graphics, installation, partitioning, and sound (drivers: 8th place, nvidia: 16th place, installation: 15th place, partitioning: 14th place, sound: 25th place), again possibly due to the wide variety of hardware on which Ubuntu can run.
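
To make the thematic grouping above more concrete, the following sketch sums the Table 1 counts for the tags named in each theme. The theme names and the `theme_totals` helper are hypothetical and only mirror the groupings mentioned in this discussion. Because one question can carry several tags, the totals count tag occurrences and will overstate the number of distinct questions.

```python
# Counts copied from Table 1.
TAG_COUNTS = {
    "14.04": 21148, "12.04": 17412, "16.04": 11278,
    "boot": 13098, "dual-boot": 10458, "grub2": 7755,
    "networking": 12101, "wireless": 9018,
}

# Hypothetical grouping that follows the themes named in the discussion.
THEMES = {
    "lts-releases": ["14.04", "12.04", "16.04"],
    "booting": ["boot", "dual-boot", "grub2"],
    "networking": ["networking", "wireless"],
}


def theme_totals(tag_counts, themes):
    """Sum the tag occurrence counts for every tag listed under a theme."""
    return {
        theme: sum(tag_counts[tag] for tag in tags)
        for theme, tags in themes.items()
    }


totals = theme_totals(TAG_COUNTS, THEMES)
# Print themes from most to least frequent.
for theme, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{theme}: {total}")
```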

Word counting did not provide much useful information compared to tag counting. Many of the words were pronouns, prepositions, conjunctions, or other words that carry no meaningful information. I tried to filter such words out, but this was difficult because there are so many of them.
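
The filtering problem can be seen in a small example. The sketch below is not the repository's script; it is a plain-Python illustration of counting title words with a stop-word list. The `STOP_WORDS` set, the token pattern, and the sample titles are all assumptions made for illustration. Requiring each token to start with a letter or digit also drops stand-alone punctuation such as the `-` entries that appear in Tables 2 and 3.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "the", "to", "on", "in", "of", "and", "or",
              "is", "i", "my", "how", "do"}

# A token starts with a letter or digit and may contain dots or hyphens,
# so "14.04" and "apt-get" survive while a lone "-" does not.
TOKEN_PATTERN = re.compile(r"[a-z0-9][a-z0-9.\-]*")


def count_words(titles, stop_words=STOP_WORDS):
    """Count non-stop-word tokens across a list of question titles."""
    counts = Counter()
    for title in titles:
        # Dropping apostrophes turns "can't" into "cant".
        cleaned = title.lower().replace("'", "")
        for token in TOKEN_PATTERN.findall(cleaned):
            token = token.rstrip(".-")  # trim sentence-final punctuation
            if token and token not in stop_words:
                counts[token] += 1
    return counts


# Hypothetical sample titles.
titles = [
    "How do I install Ubuntu 14.04?",
    "Can't boot Ubuntu - grub error",
    "Install error on 14.04.",
]

word_counts = count_words(titles)
print(word_counts.most_common(5))
```

Even with this filter, generic words such as "using" or "like" would still pass unless they are added to the list by hand, which is the maintenance burden described above.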

The information collected here may be useful for understanding which common problems Ubuntu users face and which features they are most interested in. However, more investigation is needed before it can be turned into actionable insights.

Future research might try using a natural language processing library such as NLTK to better identify the topics asked about and the types of questions asked for each topic.
