This project demonstrates how to count the words in a text file using PySpark and Databricks in the cloud.
- Resilient Distributed Dataset (RDD) - a distributed collection of objects.
- DataFrame - an immutable distributed collection of data.
- SparkContext (sc) - the entry point to any Spark functionality.
- collect() - an action operation that retrieves all the elements of the dataset (from all nodes) to the driver node.
- The input can be any URL that points to a plain-text file. For this project, the data is The Project Gutenberg EBook of My Life and Work, by Henry Ford.
- First, import the required library and fetch the data from the URL:
import urllib.request
stringInURL = "https://www.gutenberg.org/cache/epub/7213/pg7213.txt"
urllib.request.urlretrieve(stringInURL,"/tmp/lifeandwork.txt")
- Next, move the file from the temp folder to the Databricks File System (DBFS) storage folder:
dbutils.fs.mv("file:/tmp/lifeandwork.txt","dbfs:/data/lifeandwork.txt")
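- (Optional) A quick sanity check, assuming the move succeeded, is to preview the first few hundred bytes of the file now sitting in DBFS:
# preview the beginning of the file to confirm it landed in DBFS
print(dbutils.fs.head("dbfs:/data/lifeandwork.txt", 500))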
- Then, load the data file into Spark as an RDD using the SparkContext:
lifeandworkRawRDD = sc.textFile("dbfs:/data/lifeandwork.txt")
- Split each line into words with the flatMap() function, converting the line to lowercase and stripping surrounding whitespace before splitting on spaces:
lifeandworkMessyTokensRDD = lifeandworkRawRDD.flatMap(lambda eachLine: eachLine.lower().strip().split(" "))
- After converting the words to lowercase, remove punctuation and any other non-letter characters (digits, apostrophes, and so on) using a regular expression from Python's re module:
import re
wordsAfterCleanedTokensRDD = lifeandworkMessyTokensRDD.map(lambda letter: re.sub(r'[^A-Za-z]', '', letter))
- Then, filter out all the stop words from the data and create a new RDD with the result, using the stop word list from PySpark's StopWordsRemover:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover()
stopwords = remover.getStopWords()
lifeAndWorkWordsRDD = wordsAfterCleanedTokensRDD.filter(lambda word: word not in stopwords)
# removing the empty strings left over after cleaning
lifeAndWorkRemoveSpaceRDD = lifeAndWorkWordsRDD.filter(lambda x: x != "")
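- (Optional) Before counting, it can help to peek at a few cleaned tokens; this sketch assumes the RDDs above were built as shown:
# take() is an action like collect(), but it returns only the first n elements to the driver
print(lifeAndWorkRemoveSpaceRDD.take(10))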
- Next, pair each word with the number 1 to form intermediate key-value pairs, then combine them with reduceByKey() to get the total count of each distinct word. Finally, collect() brings the results back to the driver as a Python list, which we print.
lifeAndWorkPairsRDD = lifeAndWorkRemoveSpaceRDD.map(lambda eachWord: (eachWord,1))
# transforming the words using reduceByKey() to get (word,count) results
lifeAndWorkWordCountRDD = lifeAndWorkPairsRDD.reduceByKey(lambda acc, value: acc + value)
# collect() action to bring the results back to the driver as a Python list
results = lifeAndWorkWordCountRDD.collect()
print(results)
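- As an aside, the DataFrame API mentioned at the top can express the same count; this is a minimal sketch (not part of the original pipeline) that assumes the cleaned lifeAndWorkRemoveSpaceRDD from the earlier step and the SparkSession that Databricks provides:
# build a single-column DataFrame from the cleaned words RDD
wordsDF = lifeAndWorkRemoveSpaceRDD.map(lambda w: (w,)).toDF(["word"])
# group by word and count, mirroring the (word, count) pairs produced by reduceByKey()
wordCountsDF = wordsDF.groupBy("word").count()
wordCountsDF.show(10)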
- Sort the words by count in descending order and display the top 15:
output = sorted(results, key=lambda t: t[1], reverse=True)[:15]
print(output)
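- An equivalent way to get the top 15 directly from Spark, without first collecting every (word, count) pair, is the takeOrdered() action; a small sketch assuming the same lifeAndWorkWordCountRDD:
# takeOrdered() returns only the top n elements (by the key function) to the driver
top15 = lifeAndWorkWordCountRDD.takeOrdered(15, key=lambda t: -t[1])
print(top15)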
- To visualize the output as a chart, import the required plotting libraries and label the axes as needed:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# preparing chart information
source = 'The Project Gutenberg EBook of My Life and Work, by Henry Ford'
title = 'Top Words in ' + source
xlabel = 'Words'
ylabel = 'Count'
df = pd.DataFrame.from_records(output, columns=[xlabel, ylabel])
plt.figure(figsize=(20,4))
sns.barplot(x=xlabel, y=ylabel, data=df, palette="cubehelix").set_title(title)
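- If the 15 word labels overlap on the x-axis, an optional tweak (assuming the figure above is still the active one) is to rotate them before showing the chart:
# rotate the x-axis labels so longer words stay readable
plt.xticks(rotation=45)
plt.show()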
- To generate a word cloud of the most frequent words in the input file, import the Natural Language Toolkit (NLTK) and WordCloud libraries. Then define a small class with functions to preprocess the data; the result is displayed as a figure.
import matplotlib.pyplot as plt
import nltk
import wordcloud
from nltk.corpus import stopwords # to remove the stopwords from the data
from nltk.tokenize import word_tokenize # breaking down the text into smaller units called tokens
from wordcloud import WordCloud # creating an image using words in the data
# download the NLTK resources used below (needed on a fresh cluster)
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK versions may also require 'punkt_tab'

# defining the class with functions to process the data
class WordCloudGeneration:
    def preprocessing(self, data):
        # convert all words to lowercase
        data = [item.lower() for item in data]
        # load the English stop words
        stop_words = set(stopwords.words('english'))
        # concatenate all the data with spaces
        paragraph = ' '.join(data)
        # tokenize the paragraph using the built-in tokenizer
        word_tokens = word_tokenize(paragraph)
        # filter out words present in the stop words list
        preprocessed_data = ' '.join([word for word in word_tokens if word not in stop_words])
        print("\n Preprocessed Data: ", preprocessed_data)
        return preprocessed_data

    def create_word_cloud(self, final_data):
        # initiate a WordCloud object with width, height, maximum font size and background color
        # call the generate method of the WordCloud class to generate the image
        wordcloud = WordCloud(width=1600, height=800, max_font_size=200, background_color="white").generate(final_data)
        # plot the image generated by the WordCloud class
        plt.figure(figsize=(12, 10))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()
wordcloud_generator = WordCloudGeneration()
# you may uncomment the following line to use custom input
# input_text = input("Enter the text here: ")
import urllib.request
url = "https://www.gutenberg.org/cache/epub/7213/pg7213.txt"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
input_text = response.read().decode('utf-8')
input_text1 = input_text.split('.')
clean_data = wordcloud_generator.preprocessing(input_text1)
wordcloud_generator.create_word_cloud(clean_data)
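- Alternatively, the word cloud can be built directly from the Spark word counts computed earlier instead of downloading and tokenizing the text again; a minimal sketch assuming the results list from the collect() step above:
# WordCloud can also be generated from a {word: count} dictionary
freqs = dict(results)
wc = WordCloud(width=1600, height=800, max_font_size=200, background_color="white").generate_from_frequencies(freqs)
plt.figure(figsize=(12, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()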
Based on the obtained results, the top 15 words of 'The Project Gutenberg EBook of My Life and Work, by Henry Ford' are 'one', 'business', 'work', 'man', 'men', 'money', 'time', 'made', 'make', 'may', 'car', 'every', 'production', 'get' and 'much'. From this we can say that the author focuses mainly on people who spend most of their time in business or at work, and on making that work more productive and profitable.