title | author |
---|---|
R refresher | Meeta Mistry, Radhika Khetani, Mary Piper, Jihe Liu |
Approximate time: 45 minutes
- Describe the various data types and data structures (including tibbles) used by R
- Use functions in R and describe how to get help with arguments
- Describe how to install and use packages in R
- Use the pipe (%>%), which the dplyr package re-exports from magrittr
- Describe the syntax used by ggplot2 for making plots
- Let’s create a new project directory for this review:
  - Create a new project called `R_refresher`
  - Create a new R script called `reviewing_R.R`
  - Create the following folders in the project directory: `data`, `figures`
  - Download a counts file to the `data` folder by right-clicking here
- Now that we have our directory structure set up, let's load our libraries and read in our data:
  - Load the `tidyverse` library
  - Use `read.csv()` to read in the downloaded file and save it in the object/variable `counts`
    - What is the syntax for a function?
    - How do we get help for using a function?
    - What is the data structure of `counts`?
    - What main data structures are available in R?
    - What are the data types of the columns?
    - What data types are available in R?
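The steps above can be sketched roughly as follows. Because the name and contents of the downloaded counts file are not given here, the sketch writes a tiny stand-in CSV to a temporary file; `toy_file` and every value in it are hypothetical, and the layout (gene IDs in the first column, one column per sample) is an assumption.

```r
library(tidyverse)

# Hypothetical stand-in for the downloaded counts file: gene IDs in the first
# column, one column of counts per sample. Values are made up for illustration.
toy_file <- tempfile(fileext = ".csv")
writeLines(c(",KO1,KO2,WT1",
             "geneA,10,12,3",
             "geneB,0,1,25"), toy_file)

# Function syntax: function_name(argument1 = value1, argument2 = value2)
counts <- read.csv(file = toy_file, row.names = 1)

# Getting help: `?read.csv` opens the help page in an interactive session;
# args() prints the argument list to the console.
args(read.csv)

# Data structure of the object
class(counts)

# Data type of each column
sapply(counts, class)
```

For the two general questions: the main data structures in R are vectors, factors, matrices, data frames (including tibbles), and lists, and the basic data types are numeric, integer, character, logical, complex, and raw.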
- We are performing RNA-Seq on cancer samples with genotypes of p53 wildtype (WT) and knock-down (KO). You have 8 samples total, with 4 replicates per genotype. Write the R code you would use to construct your metadata table as described below.
  - Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster, try exploring the `rep()` function).
  - Put them together into a dataframe called `meta`.
  - Use the `rownames()` function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster, try exploring the `paste0()` function).

  Your finished metadata table should have information for the variables `sex`, `stage`, `genotype`, and `myc` levels:

  | | sex | stage | genotype | myc |
  |---|---|---|---|---|
  | KO1 | M | 1 | KO | 23 |
  | KO2 | F | 2 | KO | 4 |
  | KO3 | M | 2 | KO | 45 |
  | KO4 | F | 1 | KO | 90 |
  | WT1 | M | 2 | WT | 34 |
  | WT2 | F | 1 | WT | 35 |
  | WT3 | M | 1 | WT | 9 |
  | WT4 | F | 2 | WT | 10 |
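One possible way to build this table, offered as a sketch rather than the only answer. Storing `sex` and `genotype` as factors is a choice made here; the exercise allows either vectors or factors.

```r
# Columns of the metadata table; rep() saves typing where values repeat
sex      <- factor(rep(c("M", "F"), times = 4))
stage    <- c(1, 2, 2, 1, 2, 1, 1, 2)
genotype <- factor(rep(c("KO", "WT"), each = 4))
myc      <- c(23, 4, 45, 90, 34, 35, 9, 10)

# Combine the columns into a data frame
meta <- data.frame(sex, stage, genotype, myc)

# paste0() joins its arguments without a separator: "KO1", "KO2", ...
rownames(meta) <- c(paste0("KO", 1:4), paste0("WT", 1:4))

meta
```

Note the difference between `times = 4`, which repeats the whole vector four times (M, F, M, F, ...), and `each = 4`, which repeats each element four times in a row (KO, KO, KO, KO, WT, ...).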
- Now that we have created our metadata data frame, it's often a good idea to get some descriptive statistics about the data before performing any analyses.
  - Summarize the contents of the `meta` object. How many data types are represented?
  - Check that the row names in the `meta` data frame are identical to the column names in `counts` (content and order).
  - Convert the existing `stage` column into a factor data type.
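A minimal sketch of these checks. The `counts` object below is a hypothetical stand-in (two made-up genes, with sample columns assumed to be named and ordered like the metadata rows); with the real counts file the comparison may well return `FALSE`, which is exactly what the check is for.

```r
# Rebuild the metadata table from the previous exercise
meta <- data.frame(
  sex      = factor(rep(c("M", "F"), times = 4)),
  stage    = c(1, 2, 2, 1, 2, 1, 1, 2),
  genotype = factor(rep(c("KO", "WT"), each = 4)),
  myc      = c(23, 4, 45, 90, 34, 35, 9, 10),
  row.names = c(paste0("KO", 1:4), paste0("WT", 1:4))
)

# Hypothetical stand-in for the real counts table: two genes, eight samples
sample_names <- c("KO1", "KO2", "KO3", "KO4", "WT1", "WT2", "WT3", "WT4")
counts <- as.data.frame(
  matrix(1:16, nrow = 2, dimnames = list(c("geneA", "geneB"), sample_names))
)

# Per-column summary, and the class of each column
summary(meta)
sapply(meta, class)

# Same names in the same order? identical() checks content and order at once
identical(rownames(meta), colnames(counts))

# Convert stage from numeric to factor
meta$stage <- factor(meta$stage)
class(meta$stage)
```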
- Using the `meta` data frame created in the previous question, perform the following exercises (questions DO NOT build upon each other):
  - return only the `genotype` and `sex` columns using `[]`:
  - return the `genotype` values for samples 1, 7, and 8 using `[]`:
  - use `filter()` to return all data for those samples with genotype `WT`:
  - use `filter()`/`select()` to return only the `stage` and `genotype` columns for those samples with `myc` > 50:
  - add a column called `pre_treatment` to the beginning of the dataframe with the values T, F, T, F, T, F, T, F
    - why might this design be problematic?
  - Using `%>%` create a tibble of the `meta` object and call it `meta_tb` (make sure you don't lose the rownames!)
  - change the names of the columns to: "A", "B", "C", "D", "E":
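The exercises above could be approached along these lines. This is a sketch under a few assumptions: the intermediate names `wt`, `high_myc`, and `meta_pre` and the column name `sampleIDs` are hypothetical, and the final renaming is applied to the five-column version of the table, since five names need five columns.

```r
library(tidyverse)

# Rebuild the metadata table from the earlier exercise
meta <- data.frame(
  sex      = factor(rep(c("M", "F"), times = 4)),
  stage    = c(1, 2, 2, 1, 2, 1, 1, 2),
  genotype = factor(rep(c("KO", "WT"), each = 4)),
  myc      = c(23, 4, 45, 90, 34, 35, 9, 10),
  row.names = c(paste0("KO", 1:4), paste0("WT", 1:4))
)

# Columns by name with []
meta[, c("genotype", "sex")]

# Rows 1, 7 and 8 of the genotype column
meta[c(1, 7, 8), "genotype"]

# All columns for the WT samples
wt <- filter(meta, genotype == "WT")

# stage and genotype for samples with myc > 50
high_myc <- meta %>%
  filter(myc > 50) %>%
  select(stage, genotype)

# New first column; cbind() puts its first argument first
pre_treatment <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
meta_pre <- cbind(pre_treatment, meta)

# Tibbles do not keep row names, so move them into a column first
meta_tb <- meta %>%
  rownames_to_column(var = "sampleIDs") %>%
  as_tibble()

# Rename all five columns of the table that includes pre_treatment
colnames(meta_pre) <- c("A", "B", "C", "D", "E")
```

On the design question: the `pre_treatment` values line up exactly with the `sex` column (every `M` sample is `TRUE`, every `F` sample is `FALSE`), so any effect of pre-treatment could not be separated from an effect of sex.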
- Often it is easier to see the patterns or nature of our data when we explore it visually with a variety of graphics. Let's use ggplot2 to explore differences in the expression of the Myc gene based on genotype.
  - Plot a boxplot of the expression of Myc for the KO and WT samples using `theme_minimal()` and give the plot new axis names and a centered title.
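A sketch of one way to build this plot. The title and axis label text are placeholders chosen for illustration, not wording prescribed by the exercise.

```r
library(ggplot2)

# Rebuild the metadata table from the earlier exercise
meta <- data.frame(
  sex      = factor(rep(c("M", "F"), times = 4)),
  stage    = c(1, 2, 2, 1, 2, 1, 1, 2),
  genotype = factor(rep(c("KO", "WT"), each = 4)),
  myc      = c(23, 4, 45, 90, 34, 35, 9, 10),
  row.names = c(paste0("KO", 1:4), paste0("WT", 1:4))
)

# Layers are added with +: data, geometry, theme, then labels
p <- ggplot(meta) +
  geom_boxplot(aes(x = genotype, y = myc)) +
  theme_minimal() +
  ggtitle("Myc expression by genotype") +
  xlab("Genotype") +
  ylab("Myc expression") +
  # hjust = 0.5 centers the title
  theme(plot.title = element_text(hjust = 0.5))

p
```

The `theme()` call is placed after `theme_minimal()` on purpose: a complete theme replaces earlier theme settings, so the centering would be lost if the order were reversed.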
- Many different statistical tools or analytical packages expect all data needed as input to be in the structure of a list. Let's create a list of our count and metadata in preparation for a downstream analysis.
  - Create a list called `project1` with the `meta` and `counts` objects, as well as a new vector with all the sample names extracted from one of the 2 data frames.
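A minimal sketch of the list, again using a hypothetical stand-in for `counts`. Naming the list elements (`meta`, `counts`, `samples`) is an optional choice made here so that they can be retrieved with `$`; an unnamed `list(meta, counts, ...)` satisfies the exercise equally well.

```r
# Rebuild the metadata table from the earlier exercise
meta <- data.frame(
  sex      = factor(rep(c("M", "F"), times = 4)),
  stage    = c(1, 2, 2, 1, 2, 1, 1, 2),
  genotype = factor(rep(c("KO", "WT"), each = 4)),
  myc      = c(23, 4, 45, 90, 34, 35, 9, 10),
  row.names = c(paste0("KO", 1:4), paste0("WT", 1:4))
)

# Hypothetical stand-in for the real counts table: two genes, eight samples
counts <- as.data.frame(
  matrix(1:16, nrow = 2, dimnames = list(c("geneA", "geneB"), rownames(meta)))
)

# A list can hold objects of different structures and sizes
project1 <- list(meta = meta, counts = counts, samples = rownames(meta))

# Inspect the top level of the list, then pull out one element
str(project1, max.level = 1)
project1$samples
```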