-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-Introduction-to-programming-in-R.Rmd
770 lines (539 loc) · 27.6 KB
/
01-Introduction-to-programming-in-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
# Introduction to Programming in R
R is a programming language designed to read, write and manipulate data. It is especially suitable for performing statistical tests and modelling data. This session illustrates some of the most important tasks which can be performed in R.
## Calculations in R
The easiest way to use R is as a simple calculator. For example, you can calculate the sum, difference or product of two or more numbers in R as you would in any other calculator:
```{r}
2+2
```
```{r}
3-1
```
```{r}
3*5
```
```{r}
6/3
```
However, R is also able to perform more complex computations, such as those in a scientific calculator. For instance, one can use R to calculate roots, logarithms or results from exponential operations. This is done via R "functions". Functions have a name (called the function), followed by two parentheses. Inside these parentheses lie the arguments of the function.
For example, the command "sqrt(9)" applies the sqrt function (which calculates the squared root) to its argument (number 9):
```{r}
sqrt(9)
```
Some functions accept more than one argument. For example, the function "log" (which calculates logarithms) accepts two arguments: the number we are aplying logarithm to, and the base of the logarithm: log(100, base=10) calculates the log10 of 100.
```{r}
log(x=100, base=10)
```
Arguments can be specified in two ways:
A) positional: R knows which argument is which because they appear in a certain order
B) explicit: We specify the name of the argument follwed by an equality sign (=) and the desired value.
For example, the following command has the same result as before: calculating the log10 of 100. However, we do not need to specify which is x and which is base, since R will immediately infer this from their position (x always goes first and base always goes second).
```{r}
log(100,10)
```
If you do not know what a function does or how to use it, you can access the "help" information using a question mark:
```{r}
?log
```
This helps exists for any R function.
## Variable assignation in R
Most of the times, we will need to use R as something more than just a scientific computer: we might want to keep track of the things we have done and store our results in the computer memory (RAM). R allows us to do that by assigning values to variables.
In R variables are always object. We analyse see what this means below, but for now we should remember that a variable has three characteristics:
1) A name: We can call our variables with whichever name we want, with only three exceptions:
1.1 Variable names cannot start with a number, they need to start with a letter
1.2 Variable names cannot start with a dot
1.3 You should not name the variables with any "protected name". A protected name is a word which by default already means something in R. The most common example are the letters F and T, which in R mean TRUE and FALSE, two logical values
1.4 Variable names are case sensitive: ST, St, sT and st will be four different variables
2) A type: variables in R can be text, numbers or qualitative variables. We will discuss this in more detail below.
3) A value: this is the actual piece of information we are "setting" that variable to be.
To assign a value to a variable we use the following operator: <-
These two symbols (<-) represent an arrow pointing from right to left and in English mean "Let's assign the value in the right to the word in the left".
For example:
```{r}
A <- log(356,2)
```
Means "Let's set A to be the log2 of 356" (log2 of 356 is simply a number, close to 8.5).
Now this number has been stored in RAM under the name "A". We can access it by simply tying A in the console:
```{r}
A
```
Because it is so common in other languages, now R also accepts the following syntax, which means the exact same:
```{r}
A = log(356, 2)
A
```
## Data types in R
As mentioned above, in R information is stored in the form of objects (variables) which can belong to several data types. Let's take a look at the different data types available in R
### Numeric variables
The simplest data types are those which consist of a single element. This element can, for example, be a number. Numbers in R are called "numeric variables".
There are two types of numbers: integers (non-fractional) and fractional numbers. In computers, fractional numbers are stored as "Floating point numbers", which is a type of scientific notation.
Our previous variable "A"" is a number (specifically, a floating point number). To verify this, we can use the function class(), which tells us the data type of a variable.
```{r}
class(A)
```
If we want to transform this number into an integer, we can use the function as.integer() as follows:
```{r}
as.integer(A)
```
Note that now we don't get 8.4757... but only 8. To transform the fractional number into a integer, R has chopped all the numbers after the decimal point.
Let's look at the class of this new object:
```{r}
class(as.integer(A))
```
We see that it is no longer a "numeric" variable, but an "integer".
### Character variables
Sometimes we will need to work with text. For example, we might want to store a series of words. Texts in R are "character" variables. To tell R that something should be a character variable we put it in quotes:
```{r}
B <- "word"
```
Let's look at the value and clas of B:
```{r}
class(B)
B
```
B is a character variable with value "word".
Be careful of always having the correct data type! Often errors in R come from thinking we have a number when we actually have a text, or vice versa. For examle, the following variable contains number 2, but in a text format:
```{r}
A <- "2"
```
```{r}
class(A)
A
```
If we try to add it two number 2 we will get an error, since we are trying to add a character variable to a numeric variable
```{r, eval=FALSE}
A + 2
```
We can convert characters to numbers using the function as.numeric() or viceversa using the function as.character(). For instance, let's convert A to a numeric format:
```{r}
A <- as.numeric(A)
```
If we try to add 2 to this number, we will see that the error has disappeared:
```{r}
A + 2
```
Inside the computer memory, R character variables are stored as ASCII symbols.
### Logical variables
Often we will need to know if something is true or false. For example, is our variable "A" bigger than 3?
```{r}
A > 3
```
Or, is our variable a number?
```{r}
is.numeric(A)
```
The answer to all these questions comes in a binary format: YES or NO, TRUE or FALSE. These binary (boolean) values are stored in R as "Logical" variables. The importance of logical variables will become apparent as we learn more about R.
### Categorical variables (factors)
Sometimes our data will not consists of numbers of words, but rather of categories or groups. For example, we might have data on female and male individuals: which of them is which? Or we might have data on people from Mexico, the UK and France: which of them is which? This type of information is stored in R as "factors".
To create a factor, we use the function factor(). For example, the following command creates a list of factors for individuals from three nationalities: two are Mexican, 3 British and 2 French.
Don't pay much attention to the c() operator yet, this will be explained below.
```{r}
A <- factor(c("MEX", "UK", "UK", "FR", "MEX", "FR", "UK"))
```
```{r}
A
```
Notice how the factor has two components: the levels and the values. This is because factors are actually stored as numbers (integers), not as words! For example, in this case R created the following equivalence:
Word Number
FR 1
MEX 2
UK 3
So that each Frenck individual is given a number 1, each Mexican a number 2 and each British a number 3. Numbers are easier to store than words! R ordered our labels simply alphabetically. We can verify this with the mode function:
```{r}
mode(A)
```
Indeed, our data is made of numbers! In fact, we can obtain this numbers using as.numeric():
```{r}
as.numeric(A)
```
On the other hand, we can easily transform this to words using as.character():
```{r}
as.character(A)
```
More generally, the levels of a factor can be accessed using the "levels()" function:
```{r}
levels(A)
```
Factors are extremely important in statistics, because they allow us to split our data in groups and look for differencs between them.
## Working with multiple numbers in R
Single numbers or words are not very interesting, and they will hardly ever be the centre of an R analysis. What we really need is a way to store tens, hundreds or thousands of numbers and words. In R, this can actually be done very easily using objects called "vectors".
### Vectors
A series of values can be stored in a single variable in the form of a vector. The easiest way to define a vector in R is using the "c()" operator ("c" stands for create or catenate) as follows:
```{r}
vec <- c(1,2,3,4,5)
```
Since all of the members of our vector are numbers, the class of this vector is numeric:
```{r}
vec
class(vec)
```
Vectors also have an atribute called "length" (number of elements in the vector) which can be accessed with the length() operator:
```{r}
length(vec)
```
To retrieve specific elements from a vector, we can use the [] operator. Inside the brackets we should include the index (position) of the element we want to access. R always indexes its objects with base 1 (as opposed to base 0), which means that the first element with have index 1, the second index 2, and so forth.
To access the second element of our vector, we do the following:
```{r}
vec[2]
```
If we want to create a vector of consecutive numbers (a sequence) without typing each of them, we can use the colon ":" operator, as follows:
```{r}
1:5
```
The same result can be achieved using the seq() function. When using seq() we need to specify the starting and ending point of the sequence:
```{r}
seq(1,10)
```
It is possible to add a third argument to the seq() function, which specifies the interval/increment size of the sequence. For example, the following function generates a sequence of numbers from 1 to 10 in increments of 2:
```{r}
seq(1,10,2)
```
Vectors can also store character variables. For example, we can define a vector which contains animal names:
```{r}
vec2 <- c(A="dog",B="cat",C="rabbit")
vec2
```
The class of this vector will be "character":
```{r}
class(vec2)
```
Note that when I defined the vector I assigned a name to each of its elements (in this case A, B, and C). The name of a vector's elements can be accessed using the names() function:
```{r}
names(vec2)
```
A named element in a vector can be accessed directly by its name:
```{r}
vec2["C"]
```
Names can also be assigned after the vector has been created. For example, let's have a look at our first vector. This vector does not contain any names:
```{r}
vec
names(vec)
```
We can assign names to it using the names() function and the '<-' operator as follows:
```{r}
names(vec) <- c("one","two","three","four","five")
```
Now each element of vec is indexed by a name:
```{r}
vec
vec["two"]
```
Sometimes we will need to create vectors in which a number (or a group of numbers) is repeated over and over again. This can be easily done using the rep() function ("rep" standing for replicate). For example, the following function will create a vector that contains the values 1 and 2, repeated 5 times:
```{r}
rep(c(1,2),5)
```
rep() can also be used to replicate the content of a vector. For example, the following line creates a vector containing three copies o the content of vec2:
```{r}
rep(vec2,3)
```
### Lists
Often we will need to store elements of different classes into a single variable. For instance, imagine you want to store numbers and text in the same variable. You can try to assign them to a vector as follows:
```{r}
myvector <- c(1,3,6:3,"cat","dog")
```
However, the result from this operation is not what we expected:
```{r}
myvector
```
The quotes indicate us that all the variables are text! In fact, R has converted all the numbers to characters:
```{r}
class(myvector)
```
This is because a vector can only store values which all belong to the same class.
If we want to store elements of multiple data types in a single variable, we use a "list". Lists are defined using the list() function:
```{r}
mylist <- list(A=1,B=3,C=6:3,D="cat",E="dog")
```
Note how each element is stored independently and can be of a different class and have different dimentions. In this case, the element "C" is itself a vector of length 4.
```{r}
mylist
```
If the list contains names, then each element can be accessed by its name using the $ sign:
```{r}
mylist$D
```
This is the same as accessing the element using its index:
```{r}
mylist[[4]]
```
Note that for lists you have to use the double bracket operator [[]] instead of the single bracket used for vectors.
Each of the objects in a list can have different classes and modes:
```{r}
class(mylist$A)
class(mylist$C)
class(mylist$D)
```
### Matrices
Sometimes even vectors and lists will not be enough for storing data. This is the case when one deals with tables of data or matrices. For example, we might have information on "cancer samples" and "healthy samples" from a hospital, and some measurements on each of them. We will want to store at least two variables:
A) The cancer and healthy labels (this can be, for instance, a factor)
B) The measurements
To manipulate this kind of data structures we can use matrices and data frames.
A matrix object consists of "n" rows and "m" columns. To create a matrix in R simply use the "matrix()" function and specify the data you want to fill the matrix with. For example, this line creates a sequence of numbers from 1 to 10 and stores them in a matrix format:
```{r}
matrix(1:10)
```
Note that by default matrix() assumes that you only want one column. If you need the data to be stored in multiple columns, simply specify the number of rows and columns you need:
```{r}
matrix(1:10,nrow=5,ncol=2)
```
Importantly, R always assumes you want to arrange the data in a matrix "by column". This means, starting from the top left and going down the first column, then the second one, and so forth. If you need the data to be allocated row by row, simply set the byrow argument to TRUE.
```{r}
matrix(1:100,nrow=10,ncol=10,byrow=TRUE)
```
Just as vectors have names, the columns and rows of a matrix can also be labelled. These labels can be specified using the "dimnames" argument. dimnames has to be a list with two elements, the name of rows and the name of columns.
Let's craete a 100 x 5 matrix containing numbers from 1 to 500. We name the rows of the matrix with numbers and the columns with letters.
```{r}
mat <- matrix(1:500,nrow=100,ncol=5,
dimnames=list(1:100,c("A","B","C","D","E")))
```
The funciton "head()" displays the top elements of the matrix
```{r}
head(mat)
```
The function "tail()" displays the bottom elements
```{r}
tail(mat)
```
You can use "colnames()" and "rownames()" to access the row and column names of a matrix:
```{r}
colnames(mat)
rownames(mat)
```
Finally, it is possible to access specific elements in a matrix using either their names or their indexes. Note that both indexes (the column and row number) need to be specified.
```{r}
mat[1,4]
mat[1,"D"]
```
If the column number is left empty, then R retrieves all the elements in that row (and vice versa).
```{r}
mat[1,]
```
For instances, the following line retrieves all elements in the column labelled "D" and calculates the average (mean):
```{r}
mean(mat[,"D"])
```
Matrices are very useful, since they allow the user to perform operations to each of its rows or columns.
Another way of building matrices is by concatenating multiple vectors. The vectors in question have to be of the same length. Let's create two different vectors with length 10 each:
```{r}
A <- seq(1:10)
B <- seq(11:20)
```
Now let's combine them. We can do this using the "cbind()" function (cbind standing for "column binding").
```{r}
cbind(A,B)
```
One can verify that this new object is a matrix
```{r}
class(cbind(A,B))
```
The columns of this matrix are named after the names of each individual vector
```{r}
colnames(cbind(A,B))
```
Vectors can also be used as two rows of a matrix, insted of two columns. To do this, we use rbind(), which stands for "row binding".
```{r}
rbind(A,B)
```
### Data frames
As with vectors, often we will need to store elements of different classes into a single matrix. For instance, imagine you want to store a list of names along with an identifier for each name. Let's try to store this data as a matrix:
First we generate a vector of IDs, then a vector of names:
```{r}
ID <- 1:6
Name <- c("Jimmy","Amanda","Glenn","Toby","Ren","Amanda")
```
Finally, we bind both columns:
```{r}
names <- cbind(ID,Name)
```
Note that everything has been transformed to a character form (hence the quotes):
```{r}
names
```
This is because, just as a vector, a matrix can only store values which all belong to the same class.
To store objets of several different classes in a single object, we need to build a data frame. To create a data frame in R simply use the data.frame() function as follows:
```{r}
names <- data.frame(ID,Name)
names
```
Note how now each column belongs to a different class:
```{r}
class(names$ID)
class(names$Name)
```
However, the names have been stored as factors instead of characters. This is because R always assumes that the text in a data.frame represents factors. To keep them as characters instead, simply set the stringsAsFactors paramteres to FALSE:
```{r}
names <- data.frame(ID,Name, stringsAsFactors=FALSE)
```
Now we have a numeric and a character column:
```{r}
class(names$ID)
class(names$Name)
```
Data frames are useful because it is easy to perform operations on each row or column individually. For example, we can now use "table()" to tabulate the names and find out if any of them is repeated more than once.
```{r}
table(names$Name)
```
## Manipulating data in R
As a programming language, R has been designed to optimally work with vectors, lists, matrices and data frames, as will be illustrated in the following examples.
Let's create a data frame containing the name of 6 individuals as well as their height, body mass index (BMI), age and hours of sleep. Since we do not have access to such dataset, we "simulate it" using a "random" number generator. We will not discuss random number generation in R just now, but we'll return to it later.
Let's first define a vector of names:
```{r}
names <- c("Jimmy","Amanda","Glenn","Toby","Ren","Amanda")
```
Next, we generate 6 random numbers that represent height in centimetres (a height for each individual). To do this we use rnorm(), which is a function that generates random numbers with mean (average) of 165 cm and standard deviation (a measure of spread which we will learn about later) of 10cm.
```{r}
heights <- rnorm(6,mean=165,sd=10)
```
We repeat this process for BMI and sleep hours (each of them will have its own mean and standard deviaiton).
```{r}
BMIs <- rnorm(6,mean=27,sd=2)
sleep <- rnorm(6,mean=7,sd=2)
```
Now, we create a vector of ages. In this case, we simply invent the age of our fictional characters:
```{r}
age <- c(21,23,23,40,19,35)
```
Finally, we combine everything into a single data.frame object:
```{r}
dat <- data.frame(names,heights,BMIs,age,sleep)
```
Let's have a look at it:
```{r}
dat
```
Using the function "dim()" we verify the dimensions of this data set
```{r}
dim(dat)
```
Let's now look closer into a family of functions called "apply". These functions are designed to repeat an operation across all elements of a data frame, list or vector.
In order to use apply(), we specify the following arguments:
A) X = matrix or data frame where our data is stored
B) MARGIN = whether we want the operation to be repeated per row (1) or per column (2)
C) FUN = operation we want to perform
In this case, we want to find out the class of each of the columns in the data frame "dat":
```{r}
apply(X=dat, MARGIN=2, FUN=class)
```
Now we combine apply() with function "Summary", which takes a group of numbers and summarises them in six parameters: minimum, mean, median, maximum and interquartile ranges.
```{r}
apply(X=dat[,2:5], MARGIN=2, FUN=summary)
```
### Working with lists of data frames
Data frames can themselves be grouped into lists or higher-order structures. To illustrate this, let's create a new fictional dataset of heights, BMIs, ages and sleep hours. Now let's also include the variable sex. We will simulate randm data for 10,000 individuals (Yes, then thousand!) and will name each individual with a numeric ID instead of a name (coming up with 10000 fictional names would be too time consuming and require too much imagination!):
```{r}
IDs <- seq(1,10000)
heights <- rnorm(10000,mean=165,sd=10)
BMIs <- rnorm(10000,mean=27,sd=2)
ages <- round(rnorm(10000,mean=25,sd=10))
sex <- sample(x=c(0,1),size=10000,replace=TRUE)
sleep <- rnorm(10000,mean=7,sd=2)
dat2 <- data.frame(IDs,heights,BMIs,ages,sex,sleep)
```
Let's compare our first, smaller data set with the new data.
```{r}
dat
```
```{r}
head(dat2)
```
Under certain circumstances we will want to store both datasets together. For instance, maybe they are data from two different populations which we later want to compare. We can do this by creating a list of data frames:
```{r}
database <- list(A=dat,B=dat2)
```
Here, the element A of our list is the first data set, and the element B the second:
```{r}
head(database$A)
head(database$B)
```
Now we can use another function of the apply family called "lapply" to repeat an operation in all the elements of a list (in this case, the list of data frames). We can combine lapply() with apply() to calculate the summary numbers of each column o both datasets. Let's store the results in a new variable:
```{r}
results <- lapply(database, function(l){
apply(X=l[,2:dim(l)[2]], MARGIN=2, FUN=summary)
})
```
Note that the results variable is itself a list (since lapply generates a list output):
```{r}
results
class(results)
```
## Extending R's funcitonality with libraries
Sometimes the basic functions in R are not enough for the type of analysis we are interested in. In these cases, we can expand R's functionality by installing additional groups of functions called "libraries". A large proportion of R libraries are stored in the "Comprehensive R Archive Network" (CRAN) and can be installed using the "install.packages()" function followed by the name of the library.
The following line installs the libraries "rafalib" and "reshape2" from CRAN:
```{r, eval=FALSE}
install.packages("rafalib")
install.packages("reshape2")
```
rafalib contains a group of functions which facilitate data exploration and visualisation, while reshape2 contains functions to change the structure of data frames (reshape them).
Let's load the libraries using library().
```{r}
library(rafalib)
library(reshape2)
```
Now we can use the function mypar() from "rafalib" to tell R to create a grid with 6 spaces. Then, we use apply() to plot histograms of each of the columns in our data frame data2 (element B in the "database" list"). We will discuss extensively what a histogram is in our next session, but for now let's just think about it as a way to "see" our data.
```{r}
mypar(3,2)
apply(X=database$B[,2:dim(database$B)[2]], MARGIN=2, FUN=hist, main="",xlab="")
```
The function melt() from reshape2 combines all the different columns of a data frame into a single columns and adds an extra column of labels. This way of restructuring the data is useful for easier visulisation or manipulation.
```{r}
dat
```
```{r}
melt(dat)
```
## Writing data from R
Once we've finished manipulating data, we might want to permanently store it in our computer (hard drive) and not just in RAM. R allows us to easily store data as files. There are two ways of writing data from R:
### Writing human readable files
We might want to write the data in a format that is human readale. This means, a format that any text editor can open and which contains ASCII characters. This can be done with the "write.table()" function. For example, we can store the data frame dat2 into a text file as follows:
```{r, eval=FALSE}
write.table(dat2, file="~/Desktop/Big_data_summer_course_2018/Data/example_data.txt")
```
However, write.table() will automatically add quotes to every value in the text file. You can check this yourself opening the output file. To avoid this, simply set the argument quote to FALSE:
```{r, eval=FALSE}
write.table(dat2, file="~/Desktop/Big_data_summer_course_2018/Data/example_data_noQuotes.txt", quote=FALSE)
```
By default, write.table() separates the values with spaces. Sometimes we might want to separate them using tab ("\t"). For example, genomic distances and coordinates are commonly saved as tabulated files. To do this, specify the separator character by setting sep to "\t".
```{r, eval=FALSE}
write.table(dat2, file="~/Desktop/Big_data_summer_course_2018/Data/example_data.tab", quote=FALSE, sep="\t")
```
Finally, one of the most common data file formats is the comma separated value (csv), which can be read by any spreadsheet software (eg. Excel). To create this type of files, we can set sep="," or simply use the function "write.csv()".
```{r, eval=FALSE}
write.csv(dat2, file="~/Desktop/Big_data_summer_course_2018/Data/example_data.csv", quote=FALSE)
```
### Writing binary files
Often we will be interested in saving objects which were difficult to generate but are too complicated to be stored in a human readable format. For instance, if we wanted to save our object "database", which is a list containing two data frames, we could not use write.table. In these cases, an alternative is to save the data as a binary file (R data file). This is easily done with the function saveRDS(). It is recommended (though not absolutely necessary) to save the files with the suffix rds.
```{r, eval=FALSE}
saveRDS(database, file="./Data/example_database.rds")
```
saveRDS() is standardised so that R will always save data objects in the same binary format regardless of computer architecture. This means that an RDS object generated in one computer can be taken to any other computer and read into R.
## Reading data into R
All the data types in the previous section can be read into R. This is done with the following function:
To read from a csv, we use read.csv()
```{r}
mydata <- read.csv("./Data/example_data.csv")
head(mydata)
```
Note that read.csv() automatically recognises the header. If this is not what you want, you can set the header arugment to FALSE. Also, read.csv() assumes there are no row names. If we want the first column (or any other column) to be used as row names, we just need to set row.names to be the column number. In this case, row.names = 1 will use the IDs as row names:
```{r}
mydata <- read.csv("./Data/example_data.csv", row.names = 1)
head(mydata)
```
To read from a space or tab separated files, we use read.table() as follows:
```{r}
mydata <- read.table("./Data/example_data.tab", row.names = 1)
head(mydata)
```
Finally, we can read a binary (R data) file using readRDS(). Note that the list has exactly the same structure as it had before, and that you can easily access both of the data frames stored in it.
```{r}
mydatabase <- readRDS("./Data/example_database.rds")
lapply(mydatabase,head)
```
## Project work
Now let's try to apply what you just learn to your project data! We will begin by trying to solve the following tasks:
1) Open a new R session and create a new R file. You can use this file as a place for experimentation
2) Try reading your data into R. Most likely this data will be in the form of a CSV file or a table (text file)
3) How many variables does your data contain?
4) Which is the class of each of theses variables?
5) Can you tabulate these variables?
6) Can you identify which of them are quantitative and which qualitative?
7) Try taking a sample of 10 elements from each of these variables