# Machine Learning Fundamentals: Spam Detection Using a Naive Bayes Classifier
In the previous sessions we learnt how to read data (from files or from public websites) into R, how to process it, explore it and perform some basic analysis for one, two or multiple variables. Now that you know how to do all of this, you are ready for the final steps of data analysis: using the data to find new knowledge and apply it to relevant objectives such as finding patterns or predicting outcomes. This falls under the scope of machine learning, a discipline which combines statistics, mathematics and computer science to make sense of data. The two most common tasks in machine learning are:
A) Clustering: Finding whether any patterns or subgroups exist in our data. We learnt a bit of this in our previous session on heatmaps and dendrograms.
B) Classification: Classifying an individual observation into one of a set of possible groups or classes. There are multiple applications of this in the real world, for example deciding whether a bank transaction is legitimate or fraudulent, determining whether a message is normal or spam, finding whether a star hosts exoplanets or not, classifying cancer patients into different types of cancer, etc.
In this session we will use a specific type of machine learning called "Naive Bayes" to classify SMS messages as normal or spam.
Before starting this session we need to install several R libraries. Some of these libraries are designed to work with text data (remember messages are by definition in a text format) and others contain machine learning functions. To install these libraries, simply run the following lines of code.
```{r install packages, warning=FALSE, message=FALSE, eval=FALSE}
install.packages("tm")
install.packages("e1071")
install.packages("gmodels")
install.packages("rlang")
install.packages("wordcloud")  # also needed below, to draw the word clouds
install.packages("rafalib")    # also needed below, for mypar() to arrange plots
```
Let's now load the libraries.
```{r libraries, warning=FALSE, message=FALSE}
library(tm)
library(wordcloud)
library(rlang)
library(e1071)
library(gmodels)
library(rafalib)
```
You can find the data for this session in the "Data" directory, along with the other materials for this course. The file is called "sms_spam.csv". Let's read this into R. Note that I am asking R not to transform text into factors, because messages are mostly text.
```{r load csv, message=FALSE, warning=FALSE}
sms_raw <- read.csv("./Data/sms_spam.csv", stringsAsFactors = F)
```
We use the function str() to inspect the structure of this data:
```{r, message=FALSE, warning=FALSE}
str(sms_raw)
```
R tells us that we have two variables:
1) A series of 5559 text (SMS) messages.
2) A variable which tells us which of these messages are spam and which are normal ("ham"). This was determined manually by the people who collected the data.
Let's transform the spam/ham labels into a factor, since these really are two categories.
```{r type to factor, message=FALSE, warning=FALSE}
sms_raw$type <- factor(sms_raw$type)
```
Now let's count how many spam and how many "ham" messages there are:
```{r number spam, message=FALSE, warning=FALSE}
table(sms_raw$type)
```
The majority of them (about 86%) are normal messages; only about 14% are spam.
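If you prefer to see these numbers as exact proportions rather than counts, one optional way to obtain them is with prop.table():
```{r spam proportions, message=FALSE, warning=FALSE}
# Optional: the same counts expressed as proportions of all messages
prop.table(table(sms_raw$type))
```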
## Visualising data
Let's now visualise the data. Because this data is in text format, we cannot create histograms or scatter plots for it. Instead, let's use a word cloud to find which words are used most frequently. Because there are almost 6 thousand messages and thousands of words, we ask R to only show us the 50 most common words.
```{r wordcloud sms_raw, message=FALSE, warning=FALSE}
wordcloud(sms_raw$text, max.words = 50, random.order = F)
```
Now, let's use the function subset() to separate the data in two: a list of normal messages and a list of spam messages.
```{r spam and ham, message=FALSE, warning=FALSE}
spam <- subset(sms_raw, type=="spam")
ham <- subset(sms_raw, type=="ham")
```
Let's now repeat the word cloud separately for each group. In this way, we can compare the word composition of spam and non-spam messages.
```{r wordclouds spam and ham, message=FALSE, warning=FALSE}
mypar(1,2)
wordcloud(spam$text, max.words = 50, scale=c(3,0.5), random.order = F)
wordcloud(ham$text, max.words = 50, scale=c(2,0.5), random.order = F)
```
We can already see some major differences between the two types of message. Spam messages use words such as "call", "free", "reply" and even "£200" (because what they want is for you to call them back or to think you have won something). Normal messages between humans talk about much more personal, emotional matters. For example, in normal messages we see words like "can", "know", "sorry", "good" or even "home".
If these messages are so different, can we use their word composition to separate them or create a spam filter? We will try to apply machine learning to achieve this.
## Pre-processing data
The first thing we need to do is to put the data in the right format. Machine learning uses statistics, and statistics can only deal with numbers or proportions. Thus, we will have to convert these messages into a quantitative format.
### Converting data to corpus
We start by converting our almost 6,000 messages into a "corpus". A corpus is a collection of text documents: a list in which each element is one message. We use the function Corpus().
```{r sms_corpus, message=FALSE, warning=FALSE}
sms_corpus <- Corpus(VectorSource(sms_raw$text))
```
We can use the function "inspect()" to look at the first three messages in the corpus:
```{r, message=FALSE, warning=FALSE}
inspect(sms_corpus[1:3])
```
### Cleaning data
Now we need to clean the data and remove all the noise and unnecessary words. There can be a lot of unnecessary words and symbols in 6 thousand messages! The first problem is that R is case sensitive, so words starting with upper case or lower case will be seen by R as different things, even if they are the exact same word. Thus, let's transform all the words to lower case using the tm_map() function and specifying that we want lower case. We'll store our clean body of words in the variable "corpus_clean".
```{r tolower, message=FALSE, warning=FALSE}
# tolower is wrapped in content_transformer() so the corpus structure is
# preserved in recent versions of tm
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
```
Next, let's remove all the numbers and keep only words.
```{r removeNumbers, message=FALSE, warning=FALSE}
corpus_clean <- tm_map(corpus_clean, removeNumbers)
```
Some words are useless for classifying messages. For example, stop words such as "to", "and", "but" or "or" carry very little information and appear very often, so they mostly add noise and take up memory. Let's use tm_map() to remove all stop words.
```{r removeStopWords, message=FALSE, warning=FALSE}
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
```
Now let's also remove all the punctuation marks (dots, commas, quotes, etc...).
```{r removePunctuation, message=FALSE, warning=FALSE}
corpus_clean <- tm_map(corpus_clean, removePunctuation)
```
Finally, let's remove the extra white space, because often people make typos when composing messages and hit the space key multiple times.
```{r stripWhiteSpace, message=FALSE, warning=FALSE}
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
```
Let's have a look at our final, clean set of words.
```{r, message=FALSE, warning=FALSE}
inspect(corpus_clean[1:5])
```
Note how a lot of words disappeared and now all the letters are lower case.
### Visualising clean data
Let's create a new word cloud, but now using the clean data.
```{r wordcloud corpus_clean, message=FALSE, warning=FALSE}
wordcloud(corpus_clean, min.freq = 40, scale=c(3,0.5), random.order = F)
```
### Tokenising
Now comes the most important part: transforming our text data into something quantitative (i.e. numbers). The way we do this is by counting how many times each word appears in each message. This results in a "word count matrix": a matrix in which every row is a message and every column is a word. Each element of the matrix is the number of times that word appears in that message.
This type of matrix is often called a "document-term matrix". Let's use R's function DocumentTermMatrix() to convert our clean data into numbers. This process is often referred to as "tokenising".
```{r sms_dtm, message=FALSE, warning=FALSE}
sms_dtm <- DocumentTermMatrix(corpus_clean)
```
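As an optional check, you can look at the size of this matrix: the first number is the number of messages (rows) and the second the number of distinct words (columns).
```{r dtm dimensions, message=FALSE, warning=FALSE}
# Optional: dimensions of the document-term matrix (messages x words)
dim(sms_dtm)
```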
### Dividing data into training and testing set
Now we are ready to start classifying the messages. As you may have learnt in the theoretical session, the way machine learning works is the following:
1. The algorithm is "trained" on a very big data set. Using this data, it finds the patterns that characterise spam and normal messages.
2. The algorithm is applied to new (test) data and classifies each new message as normal or spam.
3. We can use these results to analyse how the algorithm performed.
Let's first divide the messages into two groups: training and test. We do this for the raw data, the corpus and the word count matrix.
A) Training data set: This set will contain 75% of the messages (messages 1 to 4169):
```{r training data, message=FALSE, warning=FALSE}
sms_raw_train <- sms_raw[1:4169,]
sms_corpus_train <- corpus_clean[1:4169]
sms_dtm_train <- sms_dtm[1:4169,]
```
B) Test data set: This set will contain 25% of the messages (messages 4170 to 5559):
```{r, message=FALSE, warning=FALSE}
sms_raw_test <- sms_raw[4170:5559,]
sms_corpus_test <- corpus_clean[4170:5559]
sms_dtm_test <- sms_dtm[4170:5559,]
```
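Because we split the messages by position rather than at random, it is a good idea (although not strictly required) to check that both subsets contain a similar proportion of spam, for example:
```{r check split proportions, message=FALSE, warning=FALSE}
# Optional sanity check: the spam/ham ratio should be similar in both subsets
prop.table(table(sms_raw_train$type))
prop.table(table(sms_raw_test$type))
```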
### Correcting data sparsity
There is one last problem we need to solve before building our machine learning model. Because so many words exist in the English language (just look at the size of any dictionary!), chances are there will be a lot of words that only appear in one, two or three of the messages. These words are not useful: they will not help us differentiate spam from normal messages. In statistics this is called "sparsity": a very large part of the matrix is full of zeros (in this case, words that appear zero times in most messages). We should reduce this sparsity. To do that, let's restrict our data to frequent words only: all words that appear in fewer than 5 messages will be discarded. Frequent words can be found using the function findFreqTerms().
```{r findFreqTerms, message=FALSE, warning=FALSE}
sms_dict <- findFreqTerms(sms_dtm_train, 5)
```
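If you are curious, you can check how many words survive this filter (this is the number of columns, or "dimensions", that our final matrix will have):
```{r number of frequent words, message=FALSE, warning=FALSE}
# Optional: how many words appear in at least 5 training messages
length(sms_dict)
```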
Let's keep only these frequent words:
```{r keeping frequent words, message=FALSE, warning=FALSE}
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_test, list(dictionary=sms_dict))
```
### Converting data to word appearance table
At this point, we have created a matrix with words as columns and messages as rows. However, to make things simpler we will restrict our matrix to only two values: if a word appears inside a message, the entry is set to "Yes" (1); if it doesn't, to "No" (0).
The following function converts the word counts to Yes/No values.
```{r convert_counts, message=FALSE, warning=FALSE}
convert_counts <- function(x){
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels=c(0,1), labels=c("No","Yes"))
return(x)
}
```
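To see what this function does, here is a small illustrative example on a made-up vector of word counts (the numbers are hypothetical, purely for demonstration):
```{r convert_counts example, message=FALSE, warning=FALSE}
# Hypothetical counts of one word across four messages: 0, 3, 0 and 1 occurrences
convert_counts(c(0, 3, 0, 1))
```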
Let's apply it to both our training and our test data.
```{r apply convert_counts, message=FALSE, warning=FALSE}
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_test, MARGIN = 2, convert_counts)
```
Finally, let's have a look at the data.
```{r, message=FALSE, warning=FALSE}
sms_train[1:10,1:10]
```
Now we have a matrix of "Yes" and "No" values which tells us which message contains which word. There are 1218 words in this matrix, so you can think of it as a space with about 1200 dimensions (each word is a dimension)! Now we are ready to do machine learning.
## Predicting spam and non-spam
### Training classifier
We use the Yes/No data to "train" a classifier. The type of classifier we will work with in this session is called "Naive Bayes". The way this algorithm works involves some advanced statistics, part of which will be explained during the theoretical session. In general terms, however, we look at which words are contained in a message and which are not, and based on this evidence we calculate the probability that the message is spam or normal. If this probability is high, we say that the message is spam; if it is low, we say that the message is normal. We repeat this for every message in the data set.
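In slightly more formal terms (this is only a sketch of the underlying idea, assuming the words in a message are independent of each other, which is the "naive" assumption), the probability that a message containing the words $w_1, \dots, w_n$ is spam is proportional to:

$$
P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
$$

The same quantity is computed for "ham", and the message is assigned to whichever class gives the larger value.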
Let's use the training data to "train" our Naive Bayes algorithm. All you need to do is use the naiveBayes() function as follows:
```{r training classifier, message=FALSE, warning=FALSE}
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
```
Now the algorithm has been trained. In other words, it has looked at the data and found which words appear in spam and in normal messages, then it has calculated probabilities for each of them.
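If you want to see these probabilities, the object returned by naiveBayes() stores one conditional probability table per word in its tables component. For example, assuming the word "call" survived the frequency filter (if not, pick any other word from sms_dict):
```{r inspect classifier, message=FALSE, warning=FALSE}
# Conditional probabilities of the word "call" being absent/present in ham vs spam
sms_classifier$tables[["call"]]
```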
### Testing classifier
Now we can apply the algorithm we just trained to the remaining 25% of the data (the test set). Is our classifier able to predict which messages are spam and which are not? To make this prediction we simply use the function predict(), passing it the variable in which our trained model is stored and the test data. This step can take a few seconds to run; after all, R is using thousands of words to predict whether each message is or isn't spam!
```{r testing classifier, message=FALSE, warning=FALSE}
sms_predicted <- predict(sms_classifier, sms_test)
```
### Evaluating performance
Let's find out how good our predictions were. We use the function CrossTable() to compare the actual class of each message (spam or normal, as determined manually by researchers) with the class we predicted it to have.
```{r evaluate performance, message=FALSE, warning=FALSE}
CrossTable(sms_predicted, sms_raw_test$type,
prop.chisq = FALSE, prop.t =FALSE,
dnn = c("predicted","actual"))
```
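If you want a single summary number, you can also (optionally) compute the overall accuracy, that is, the proportion of test messages that were classified correctly:
```{r overall accuracy, message=FALSE, warning=FALSE}
# Optional: overall proportion of correctly classified test messages
mean(sms_predicted == sms_raw_test$type)
```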
You can see that 99.7% of normal messages were labelled by our filter as normal. Only 4 out of more than one thousand normal messages were misclassified as spam. This is a really good performance. On the other hand, about 82.5% of spam messages were correctly labelled as spam; approximately 32 out of 180 spam messages were misclassified as normal. This performance is good, but could definitely be improved!
Now you know a bit more about how machine learning is used for different commercial and scientific purposes; in this case, to build a spam filter. There are many more machine learning techniques, some of them incredibly accurate and complex. They can be (and in fact are) applied all the time to solve all sorts of problems. Examples include the recommendations displayed to you on Google, Facebook or Amazon; studying which regions of the brain are active when people perform different tasks; classifying cancers into groups; or performing language processing (for example in devices capable of translating from one language to another in real time).
## Discussion
Can you think of any other applications of machine learning? In which ways do you think machine learning is changing our daily lives? Are these changes positive or negative? What are their implications?
## References
This tutorial was adapted from:
1. Lantz B. (2013). Chapter 4. Probabilistic learning and classification using Naive Bayes. In Machine Learning with R. Birmingham: Packt Publishing.
SMS data comes from the study:
2. Gomez-Hidalgo JM, Almeida TA and Yamakami A. (2012). On the validity of a new SMS spam collection. Proceedings of the 11th international conference on machine learning and applications.