diff --git a/Basic_R_Course.zip b/Basic_R_Course.zip
index 84af6c4..aa8da65 100644
Binary files a/Basic_R_Course.zip and b/Basic_R_Course.zip differ
diff --git a/Introduction to Solving Biological Problems Using R - Day 1.pdf b/Introduction to Solving Biological Problems Using R - Day 1.pdf
deleted file mode 100644
index d565573..0000000
Binary files a/Introduction to Solving Biological Problems Using R - Day 1.pdf and /dev/null differ
diff --git a/Introduction to Solving Biological Problems Using R - Day 2.pdf b/Introduction to Solving Biological Problems Using R - Day 2.pdf
deleted file mode 100644
index 5f7fc82..0000000
Binary files a/Introduction to Solving Biological Problems Using R - Day 2.pdf and /dev/null differ
diff --git a/Session1.1-intro.Rmd b/Session1.1-intro.Rmd
new file mode 100644
index 0000000..81b4482
--- /dev/null
+++ b/Session1.1-intro.Rmd
@@ -0,0 +1,592 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 1"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+css: mystyle.css
+---
+```{r include = FALSE}
+library(knitr)
+opts_chunk$set(comment = NA,eval=FALSE) # eliminates hashtag from R outputs
+```
+
+
+
+# Course Aims
+
+- To introduce you to the basics of R
+ + Reading data
+ + Perform simple analyses
+ + Producing graphs
+ + ***How to get help!***
+- Give you all the background you need to ***practice*** by yourselves
+- Introduce tools that will help you to work in a ***reproducible*** manner
+
+# Day 1 Schedule
+
+1. Introduction to R and its environment
+2. Data Structures
+3. Data Analysis Walkthrough
+4. Plotting in R
+
+#1. Introduction to R and its environment
+
+##What's R?
+
+* A statistical programming environment
+ + based on 'S'
+ + suited to high-level data analysis
+* But offers much more than just statistics
+* Open source and cross platform
+* Extensive graphics capabilities
+* Diverse range of add-on packages
+* Active community of developers
+* Thorough documentation
+
+
+
+http://www.r-project.org/
+
+
+
+
+
+
+##R plotting capabilities
+
+https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
+
+
+##Who uses R? Not just academics!
+
+http://www.revolutionanalytics.com/companies-using-r
+
+- Facebook
+ + http://blog.revolutionanalytics.com/2010/12/analysis-of-facebook-status-updates.html
+- Google
+ + http://blog.revolutionanalytics.com/2009/05/google-using-r-to-analyze-effectiveness-of-tv-ads.html
+- Microsoft
+ + http://blog.revolutionanalytics.com/2014/05/microsoft-uses-r-for-xbox-matchmaking.html
+- New York Times
+ + http://blog.revolutionanalytics.com/2011/03/how-the-new-york-times-uses-r-for-data-visualization.html
+- Buzzfeed
+ + http://blog.revolutionanalytics.com/2015/12/buzzfeed-uses-r-for-data-journalism.html
+- New Zealand Tourist Board
+ + https://mbienz.shinyapps.io/tourism_dashboard_prod/
+
+
+## R can facilitate Reproducible Research
+
+
+
+
+
+
+- Statisticians at MD Anderson tried to reproduce results from a Duke paper and unintentionally unravelled a web of incompetence and skullduggery
+ + as reported in the ***New York Times***
+
+
+
+
+
+- Very entertaining talk from Keith Baggerly in Cambridge, December 2010
+
+
+
+According to recent editorials, the reproducibility crisis is still on-going
+
+
+
+
+[Reality check on reproducibility](http://www.nature.com/news/reality-check-on-reproducibility-1.19961)
+
+[1,500 scientists lift the lid on reproducibility](http://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970)
+
+
+##Getting started
+- Latest release 3.3.2 (October 2016)
+ + Base package and Contributed packages (general purpose extras)
+ + `r length(XML:::readHTMLTable("http://cran.r-project.org/web/packages/available_packages_by_date.html")[[1]][[2]])` available packages as of `r date()`
+- Download from http://mirrors.ebi.ac.uk/CRAN/
+- Windows, Mac and Linux versions available
+- Executed using command line, or a graphical user interface (GUI)
+- On this course, we use the RStudio GUI (www.rstudio.com)
+
+
+
+
+To launch RStudio, find the RStudio icon and click
+
+
+
+
+
+
+- The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left)
+- However, for this course we will use a relatively new feature called *R-notebooks*.
+- An R-notebook mixes plain text with R code
+ + The R code can be run from inside the document and the results are displayed directly underneath
+- Each *chunk* of R code looks something like this.
+- Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
+- Or you can press the green triangle on the right-hand side to run everything in the chunk
+- Try this now!
+
+```{r}
+print("Hello World")
+
+```
+
+- The R notebook can be rendered into a format such as PDF or HTML so they can be shared with your collaborators
+- On the course website you will see compiled versions of each session
+
+##Basic concepts in R - simple arithmetic
+
+- The command line can be used as a calculator and understands the usual arithmetic operators +, -, *, /
+- Try adding a few more calculations here
+
+```{r}
+2 + 2
+2 - 2
+4 * 3
+10 / 2
+
+
+```
+
+Note: The number in the square brackets is an indicator of the
+position in the output. In this case the output is a 'vector' of length 1
+(i.e. a single number). More on vectors coming up...
+
+
+In the case of expressions involving multiple operations, R respects the [BODMAS](https://en.wikipedia.org/wiki/Order_of_operations#Mnemonics) system to decide the order in which operations should be performed.
+
+```{r}
+2 + 2 *3
+2 + (2 * 3)
+(2 + 2) * 3
+```
+
+R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.
+
+
+```{r}
+pi
+sin (pi/2)
+cos(pi)
+tan(2)
+log(1)
+
+
+```
+
+We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of *variables*.
+
+
+##Basic concepts in R - variables
+
+- A variable is a letter or word which takes (or contains) a value. We use the **assignment operator: `<-`**
+```{r}
+x <- 10
+x
+myNumber <- 25
+myNumber
+```
+
+- We can perform arithmetic on variables:
+```{r}
+sqrt(myNumber)
+```
+
+
+- We can add variables together:
+```{r}
+x + myNumber
+```
+
+- We can change the value of an existing variable:
+
+```{r}
+x <- 21
+x
+```
+
+
+- We can set one variable to equal the value of another variable:
+```{r}
+x <- myNumber
+x
+```
+
+- We can modify the contents of a variable:
+
+```{r}
+myNumber <- myNumber + sqrt(16)
+myNumber
+```
+
+When we are feeling lazy we might give our variables short names (`x`, `y`, `i`...etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as `.`, `_`, '-'. Naming variables the same as in-built functions in R, such as `c`, `T`, `mean` should also be avoided.
+
+Naming variables is a matter of taste. Some [conventions](http://adv-r.had.co.nz/Style.html) exist such as a separating words with `-` or using *C*amel*C*aps. Whatever convention you decided, stick with it!
+
+##Basic concepts in R - functions
+
+- **Functions** in R perform operations on **arguments** (the inputs(s) to the function). We have already used:
+```{r}
+sin(x)
+```
+
+- This returns the sine of x
+ + In this case the function has one argument: **x**.
+ + Arguments are always contained in parentheses -- curved brackets, **()** -- separated by commas.
+
+
+Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments.
+
+`seq` is a function for generating a numeric sequence *from* and *to* particular numbers.
+
+- Type `?seq` to get the help page for this function.
+- When testing code, it is easier and safer to name the arguments
+
+```{r}
+seq(from = 2, to = 20, by = 4)
+seq(2, 20, 4)
+```
+
+Arguments can have *default* values, meaning we do not need to specify values for these in order to run the function.
+
+`rnorm` is a function that will generate a series of values from a *normal distribution*. In order to use the function, we need to tell R how many values we want
+
+```{r}
+rnorm(n=10)
+```
+
+The normal distribution is defined by a *mean* (average) and *standard deviation* (spread). However, in the above example we didn't tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page
+
+(*N.B sometimes help pages can describe more than one function*)
+
+```{r}
+?rnorm
+```
+
+In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the `mean` and `sd` *arguments*. It is important that we get the spelling of these arguments exactly right, otherwise R will an error message, or (worse?) do something unexpected.
+
+```{r}
+rnorm(n=10, mean=2,sd=3)
+rnorm(10, 2, 3)
+```
+
+In the examples above, `seq` and `rnorm` were both outputting a series of numbers, which is called a *vector* in R and is the most-fundamental data-type.
+
+
+
+##Basic concepts in R - vectors
+
+- The basic data structure in R is a **vector** -- an ordered collection of values.
+- R treats even single values as 1-element vectors.
+- The function **`c`** *combines* its arguments into a vector:
+
+```{r}
+x <- c(3,4,5,6)
+x
+```
+- The square brackets `[]` indicate the position within the vector (the ***index***).
+- We can extract individual elements by using the `[]` notation:
+
+```{r}
+x[1]
+x[4]
+
+```
+
+- We can even put a vector inside the square brackets (*vector indexing*):
+- **Before executing this line of code, what do you think it will produce?**
+
+```{r}
+y <- c(2,3)
+x[y]
+```
+
+- There are a number of shortcuts to create a vector.
+- Instead of:
+
+```{r}
+x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
+x
+```
+- we can write:
+
+```{r}
+x <- 3:12
+x
+```
+
+- or we can use the **`seq()`** function, which returns a vector:
+
+```{r}
+x <- seq(2, 20, 4)
+x
+```
+
+```{r}
+x <- seq(2, 20, length.out=5)
+x
+```
+
+- or we can use the **`rep()`** function:
+
+
+```{r}
+y <- rep(3, 5)
+y
+```
+
+```{r}
+y <- rep(1:3, 5)
+y
+```
+
+
+- We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
+
+```{r}
+x <- 3:12
+# Extract elements from x:
+
+x[3:7]
+x[seq(2, 6, 2)]
+x[rep(3, 2)]
+```
+
+
+- We can add an element to a vector:
+
+```{r}
+y <- c(x, 1)
+y
+```
+
+- We can glue vectors together:
+
+```{r}
+z <- c(x, y)
+z
+```
+
+- We can "remove" element(s) from a vector:
+ + NOTE: the vector x doesn't get modified
+ + we're just displaying what the vector looks like without particular elements
+
+```{r}
+x <- 3:12
+
+x[-3]
+x[-(5:7)]
+x[-seq(2, 6, 2)]
+x
+```
+
+- Finally, we can modify the contents of a vector:
+
+```{r}
+x[6] <- 4
+x
+
+x[3:5] <- 1
+x
+```
+
+**Remember!**
+
+ - **Square** brackets [ ] for ***indexing***
+ - **Parentheses** () for function ***arguments***
+
+##Basic concepts in R - vector arithmetic
+
+- When applying all standard arithmetic operations to vectors,
+application is element-wise
+
+```{r}
+x <- 1:10
+y <- x*2
+```
+
+```{r}
+y
+```
+
+```{r}
+z <- x^2
+```
+
+```{r}
+z
+```
+
+- Adding two vectors:
+
+```{r}
+y + z
+```
+
+- If vectors are not the same length, the shorter one will be recycled:
+
+```{r}
+x + 1:2
+```
+
+- But be careful if the vector lengths aren't factors of each other:
+
+```{r}
+x + 1:3
+```
+
+- Sometimes R will give a *warning* message. It has performed the calculation you asked it to, but the results may be unexpected. You need to check the output carefully to make sure it is what you really wanted.
+
+##Basic concepts in R - Character vectors and naming
+
+- All the vectors we have seen so far have contained numbers, but we can also store text (/"strings") in vector
+ + this is called a **character** vector.
+
+```{r}
+gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
+gene.names
+```
+
+- We can name elements of vectors using the `names()` function, which can be useful to keep track of the meaning of our data:
+
+```{r}
+gene.expression <- c(0, 3.2, 1.2, -2)
+names(gene.expression) <- gene.names
+gene.expression
+
+```
+
+- We can also use the `names()` function to get a vector of the names of an object:
+```{r}
+names(gene.expression)
+```
+
+
+##Exercise: Body-Mass Index
+- Let's try some vector arithmetic. Here are the weights and heights of five individuals
+
+|Person | Weight (kg) | Height (cm)|
+|-------|------------------:|-------------------:|
+|*Jo* | 65.8 | 192 |
+|*Sam* | 67.9 | 179 |
+|*Charlie*| 75.3 | 169 |
+|*Frankie*| 61.9 | 175 |
+|*Alex* | 92.4 | 171 |
+
+
+- Create *weight* and *height* vectors to hold the data in each column using the `c` function. Create a *person* vector and use this vector to name the values in the other two vectors.
+
+1. The body-mass index is given by the formula:- $BMI = (Weight)/(Height^2)$; where Height is given in ***metres***
+ + Create a new vector to record this, called `bmi`.
+2. Create a new vector `bmi.sorted` where the bmi values are put in increasing numeric order (HINT: look up the help on the `sort` function)
+3. The interquartile range (IQR) of a vector is defined as the 75% percentile of the data minus the 25% percentile. Calculate the IQR for our bmi values
+ + check your answer using the `IQR` function
+
+
+```{r}
+### YOUR ANSWER HERE (please) ###
+
+```
+
+
+
+
+##Getting help
+
+- **This is possibly the most important slide in the whole course!?!**
+- To get help on any R function, type **`?`** followed by the function name. For example:
+```{r}
+?seq
+```
+- This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which *package* it belongs to.
+- There is typically a usage example, which you can test using the
+`example` function:
+
+```{r}
+example(seq)
+```
+
+- If you can't remember the exact name, type **`??`** followed by your guess.
+R will return a list of possibilities:
+
+```{r}
+??mean
+```
+
+- The **Packages** tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
+ + Often there will be a user-guide or '*vignette*' too
+
+## R packages
+
+- R comes ready loaded with various libraries of functions called
+**packages**. For example: the function **`sum()`** is in the **base** package and
+**`sd()`**, which calculates the standard deviation of a vector, is in the
+**`stats`** package
+- There are 1000s of additional packages provided by third parties,
+and the packages can be found in numerous server locations on the
+web called **repositories**
+- The two repositories you will come across the most are:
+ + **The Comprehensive R Archive Network (CRAN)**
+ + Use metacran search to find functionality you need: http://www.r-pkg.org/
+ + Or look for packages by theme: http://cran.r-project.org/web/views/
+ + **Bioconductor** specialised in genomics: http://www.bioconductor.org/packages/release/bioc/
+ + **https//github.com** can also host R packages, and hosts the development version of many packages
+- Bottomline: ***always*** first look if there is already an R package that does what you want before trying to implement it yourself
+
+## Installing packages
+
+- CRAN packages can be installed using **`install.packages()`**
+
+ + or clicking on the *Packages* tab in RStudio
+
+```{r eval=FALSE}
+install.packages(name.of.my.package)
+```
+
+
+- Set the *Bioconductor* package download tool by typing:
+```{r eval=FALSE}
+source("http://bioconductor.org/biocLite.R")
+```
+
+- *Bioconductor* packages are then installed with the `biocLite()` function:
+```{r eval=FALSE}
+biocLite("PackageName")
+```
+
+- ggplot2 is a commonly used graphics package:
+ + in RStudio, go to **Tools** → **Install Packages**... and type the package name
+ + or use `install.packages()` function to install it:
+
+```{r eval=FALSE}
+install.packages("ggplot2")
+```
+
+- `DESeq2` is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq data:
+
+```{r eval=FALSE}
+source("http://www.bioconductor.org/biocLite.R")
+biocLite("DESeq2")
+```
+
+##Example: Load packages ggplot2 and DESeq2
+
+- R needs to be told to use the new functions from the installed packages. Use **`library(...)`** function to load the newly installed features:
+
+```{r eval=FALSE}
+
+library(ggplot2) # loads ggplot functions
+library(DESeq2) # loads DESeq functions
+library() # Lists all the packages
+ # you've got installed
+```
+
diff --git a/Session1.1-intro.nb.html b/Session1.1-intro.nb.html
new file mode 100644
index 0000000..ec1340f
--- /dev/null
+++ b/Session1.1-intro.nb.html
@@ -0,0 +1,1140 @@
+
+
+
+
+
Executed using command line, or a graphical user interface (GUI)
+
On this course, we use the RStudio GUI (www.rstudio.com)
+
+
+
+
rstudio
+
+
To launch RStudio, find the RStudio icon and click
+
+
+
RStudio screenshot
+
+
+
The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left)
+
However, for this course we will use a relatively new feature called R-notebooks.
+
An R-notebook mixes plain text with R code
+
+
The R code can be run from inside the document and the results are displayed directly underneath
+
+
Each chunk of R code looks something like this.
+
Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
+
Or you can press the green triangle on the right-hand side to run everything in the chunk
+
Try this now!
+
+
+
+
+
print("Hello World")
+
+
+
+
+
The R notebook can be rendered into a format such as PDF or HTML so they can be shared with your collaborators
+
On the course website you will see compiled versions of each session
+
+
+
+
Basic concepts in R - simple arithmetic
+
+
The command line can be used as a calculator and understands the usual arithmetic operators +, -, *, /
+
Try adding a few more calculations here
+
+
+
+
+
2 + 2
+2 - 2
+4 * 3
+10 / 2
+
+
+
+
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
+
In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.
+
+
+
+
2 + 2 *3
+2 + (2 * 3)
+(2 + 2) * 3
+
+
+
+
R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.
+
+
+
+
pi
+sin (pi/2)
+cos(pi)
+tan(2)
+log(1)
+
+
+
+
We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.
+
+
+
Basic concepts in R - variables
+
+
A variable is a letter or word which takes (or contains) a value. We use the assignment operator: <-
+
+
+
+
+
x <- 10
+x
+myNumber <- 25
+myNumber
+
+
+
+
+
We can perform arithmetic on variables:
+
+
+
+
+
sqrt(myNumber)
+
+
+
+
+
We can add variables together:
+
+
+
+
+
x + myNumber
+
+
+
+
+
We can change the value of an existing variable:
+
+
+
+
+
x <- 21
+x
+
+
+
+
+
We can set one variable to equal the value of another variable:
+
+
+
+
+
x <- myNumber
+x
+
+
+
+
+
We can modify the contents of a variable:
+
+
+
+
+
myNumber <- myNumber + sqrt(16)
+myNumber
+
+
+
+
When we are feeling lazy we might give our variables short names (x, y, i…etc), but a better practice would be to give them meaningful names. There are some restrictions on creating variable names. They cannot start with a number or contain characters such as ., _, ‘-’. Naming variables the same as in-built functions in R, such as c, T, mean should also be avoided.
+
Naming variables is a matter of taste. Some conventions exist such as a separating words with - or using CamelCaps. Whatever convention you decided, stick with it!
+
+
+
Basic concepts in R - functions
+
+
Functions in R perform operations on arguments (the inputs(s) to the function). We have already used:
+
+
+
+
+
sin(x)
+
+
+
+
+
This returns the sine of x
+
+
In this case the function has one argument: x.
+
Arguments are always contained in parentheses – curved brackets, () – separated by commas.
+
+
+
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments.
+
seq is a function for generating a numeric sequence from and to particular numbers.
+
+
Type ?seq to get the help page for this function.
+
When testing code, it is easier and safer to name the arguments
+
+
+
+
+
seq(from = 2, to = 20, by = 4)
+seq(2, 20, 4)
+
+
+
+
Arguments can have default values, meaning we do not need to specify values for these in order to run the function.
+
rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want
+
+
+
+
rnorm(n=10)
+
+
+
+
The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page
+
(N.B sometimes help pages can describe more than one function)
+
+
+
+
?rnorm
+
+
+
+
In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sdarguments. It is important that we get the spelling of these arguments exactly right, otherwise R will an error message, or (worse?) do something unexpected.
+
+
+
+
rnorm(n=10, mean=2,sd=3)
+rnorm(10, 2, 3)
+
+
+
+
In the examples above, seq and rnorm were both outputting a series of numbers, which is called a vector in R and is the most-fundamental data-type.
+
+
+
Basic concepts in R - vectors
+
+
The basic data structure in R is a vector – an ordered collection of values.
+
R treats even single values as 1-element vectors.
+
The function ccombines its arguments into a vector:
+
+
+
+
+
x <- c(3,4,5,6)
+x
+
+
+
+
+
The square brackets [] indicate the position within the vector (the index).
+
We can extract individual elements by using the [] notation:
+
+
+
+
+
x[1]
+x[4]
+
+
+
+
+
We can even put a vector inside the square brackets (vector indexing):
+
Before executing this line of code, what do you think it will produce?
+
+
+
+
+
y <- c(2,3)
+x[y]
+
+
+
+
+
There are a number of shortcuts to create a vector.
+
Instead of:
+
+
+
+
+
x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
+x
+
+
+
+
+
we can write:
+
+
+
+
+
x <- 3:12
+x
+
+
+
+
+
or we can use the seq() function, which returns a vector:
+
+
+
+
+
x <- seq(2, 20, 4)
+x
+
+
+
+
+
+
+
x <- seq(2, 20, length.out=5)
+x
+
+
+
+
+
or we can use the rep() function:
+
+
+
+
+
y <- rep(3, 5)
+y
+
+
+
+
+
+
+
y <- rep(1:3, 5)
+y
+
+
+
+
+
We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
+
+
+
+
+
x <- 3:12
+# Extract elements from x:
+
+x[3:7]
+x[seq(2, 6, 2)]
+x[rep(3, 2)]
+
+
+
+
+
We can add an element to a vector:
+
+
+
+
+
y <- c(x, 1)
+y
+
+
+
+
+
We can glue vectors together:
+
+
+
+
+
z <- c(x, y)
+z
+
+
+
+
+
We can “remove” element(s) from a vector:
+
+
NOTE: the vector x doesn’t get modified
+
we’re just displaying what the vector looks like without particular elements
+
+
+
+
+
+
x <- 3:12
+
+x[-3]
+x[-(5:7)]
+x[-seq(2, 6, 2)]
+x
+
+
+
+
+
Finally, we can modify the contents of a vector:
+
+
+
+
+
x[6] <- 4
+x
+
+x[3:5] <- 1
+x
+
+
+
+
Remember!
+
+
Square brackets [ ] for indexing
+
Parentheses () for function arguments
+
+
+
+
Basic concepts in R - vector arithmetic
+
+
When applying all standard arithmetic operations to vectors, application is element-wise
+
+
+
+
+
x <- 1:10
+y <- x*2
+
+
+
+
+
+
+
y
+
+
+
+
+
+
+
z <- x^2
+
+
+
+
+
+
+
z
+
+
+
+
+
Adding two vectors:
+
+
+
+
+
y + z
+
+
+
+
+
If vectors are not the same length, the shorter one will be recycled:
+
+
+
+
+
x + 1:2
+
+
+
+
+
But be careful if the vector lengths aren’t factors of each other:
+
+
+
+
+
x + 1:3
+
+
+
+
+
Sometimes R will give a warning message. It has performed the calculation you asked it to, but the results may be unexpected. You need to check the output carefully to make sure it is what you really wanted.
+
+
+
+
Basic concepts in R - Character vectors and naming
+
+
All the vectors we have seen so far have contained numbers, but we can also store text (/“strings”) in vector
+
We can also use the names() function to get a vector of the names of an object:
+
+
+
+
+
names(gene.expression)
+
+
+
+
+
+
Exercise: Body-Mass Index
+
+
Let’s try some vector arithmetic. Here are the weights and heights of five individuals
+
+
+
+
+
Person
+
Weight (kg)
+
Height (cm)
+
+
+
+
+
Jo
+
65.8
+
192
+
+
+
Sam
+
67.9
+
179
+
+
+
Charlie
+
75.3
+
169
+
+
+
Frankie
+
61.9
+
175
+
+
+
Alex
+
92.4
+
171
+
+
+
+
+
Create weight and height vectors to hold the data in each column using the c function. Create a person vector and use this vector to name the values in the other two vectors.
+
+
+
The body-mass index is given by the formula:- \(BMI = (Weight)/(Height^2)\); where Height is given in metres
+
+
Create a new vector to record this, called bmi.
+
+
Create a new vector bmi.sorted where the bmi values are put in increasing numeric order (HINT: look up the help on the sort function)
+
The interquartile range (IQR) of a vector is defined as the 75% percentile of the data minus the 25% percentile. Calculate the IQR for our bmi values
+
+
check your answer using the IQR function
+
+
+
+
+
+
### YOUR ANSWER HERE (please) ###
+
+
+
+
+
+
Getting help
+
+
This is possibly the most important slide in the whole course!?!
+
To get help on any R function, type ? followed by the function name. For example:
+
+
+
+
+
?seq
+
+
+
+
+
This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which package it belongs to.
+
There is typically a usage example, which you can test using the example function:
+
+
+
+
+
example(seq)
+
+
+
+
+
If you can’t remember the exact name, type ?? followed by your guess. R will return a list of possibilities:
+
+
+
+
+
??mean
+
+
+
+
+
The Packages tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
+
+
Often there will be a user-guide or ‘vignette’ too
+
+
+
+
+
R packages
+
+
R comes ready loaded with various libraries of functions called packages. For example: the function sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats package
+
There are 1000s of additional packages provided by third parties, and the packages can be found in numerous server locations on the web called repositories
+
The two repositories you will come across the most are:
+
+
+
+
+
+
+
+
+
diff --git a/Session1.2-data-structures.Rmd b/Session1.2-data-structures.Rmd
new file mode 100644
index 0000000..5215284
--- /dev/null
+++ b/Session1.2-data-structures.Rmd
@@ -0,0 +1,463 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 1"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+# 2. Data structures
+
+## R is designed to handle experimental data
+
+- Although the basic unit of R is a vector, we usually handle data in **data frames**.
+- A data frame is a set of observations of a set of variables -- in other words, the outcome of an experiment.
+- For example, we might want to analyse information about a set of patients.
+- To start with, let's say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
+- We are going to create a data frame called 'patients', which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
+- We will explore how to construct these data from scratch.
+ + (in practice, we would usually import such data from a file)
+
+| |First_Name|Second_Name|Full_Name|Sex |Age|Weight |Consent|
+|--|-------|-------|--------------|:----:|--:|------:|:-----:|
+|1 |Adam |Jones |Adam Jones | Male|50 | 70.8 | TRUE|
+|2 |Eve |Parker |Eve Parker |Female|21 | 67.9 | TRUE|
+|3 |John |Evans |John Evans | Male|35 | 75.3 | FALSE|
+|4 |Mary |Davis |Mary Davis |Female|45 | 61.9 | TRUE|
+|5 |Peter |Baker |Peter Baker | Male|28 | 72.4 | FALSE|
+|6 |Paul |Daniels|Paul Daniels | Male|31 | 69.9 | FALSE|
+|7 |Joanna |Edwards|Joanna Edwards|Female|42 | 63.5 | FALSE|
+|8 |Matthew|Smith |Matthew Smith | Male|33 | 71.5 | TRUE|
+|9 |David |Roberts|David Roberts | Male|57 | 73.2 | FALSE|
+|10|Sally |Wilson |Sally Wilson |Female|62 | 64.8 | TRUE|
+
+## Character, numeric and logical data types
+
+- Each column is a vector, like previous vectors we have seen, for
+example:
+
+```{r}
+age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
+weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9,
+ 63.5, 71.5, 73.2, 64.8)
+
+```
+
+- We can define the names using character vectors:
+
+```{r}
+firstName <- c("Adam", "Eve", "John", "Mary",
+ "Peter", "Paul", "Joanna", "Matthew",
+ "David", "Sally")
+secondName <- c("Jones", "Parker", "Evans", "Davis",
+ "Baker","Daniels", "Edwards", "Smith",
+ "Roberts", "Wilson")
+```
+
+Notice how a particular line of R code can be typed over multiple lines. R won't execute the code until it sees the closing bracket `)` that matches the initial bracket `(`)
+- We often use this trick to make our code easier to read
+
+- We also have a new type of vector, the ***logical*** vector, which only
+contains the values `TRUE` and `FALSE`:
+
+```{r}
+consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE,
+ FALSE, FALSE, TRUE, FALSE, TRUE)
+```
+
+
+- Vectors can only contain one type of data; we cannot mix numbers, characters and logical values in the same vector.
+ + If we try this, R will convert everything to characters:
+
+```{r}
+c(20, "a string", TRUE)
+
+```
+
+- We can see the type of a particular vector using the **`class()`** function:
+
+```{r}
+ class(firstName)
+ class(age)
+ class(weight)
+ class(consent)
+```
+
+##Factors
+
+- Character vectors are fine for some variables, like names. But sometimes we have categorical data and we want R to
+recognize this
+- A factor is R's data structure for categorical data:
+
+```{r}
+sex <- c("Male", "Female", "Male", "Female", "Male",
+ "Male", "Female", "Male", "Male", "Female")
+sex
+```
+
+
+
+```{r}
+factor(sex)
+```
+
+- R has converted the strings of the sex character vector into two **levels**, which are the categories in the data
+- Note the values of this factor are not character strings, but levels
+- We can use this factor later-on to compare data for males and females
+
+## Creating a data frame (first attempt)
+
+- We can construct a data frame from other objects (N.B. The **`paste()`** function joins character vectors together)
+
+```{r}
+patients <- data.frame(firstName, secondName,
+ paste(firstName, secondName),
+ sex, age, weight, consent)
+```
+
+
+```{r}
+patients
+
+```
+
+
+
+##Naming data frame variables
+
+- We can access particular variables using the **'`$`'** *operator*:
+- TIP: you can use TAB-complete to select the variable you want
+
+```{r}
+patients$age
+
+
+```
+
+- R has inferred the names of our data frame variables from the names of the vectors or the commands (e.g. the `paste()` command)
+- We can name the variables after we have created a data frame using the **`names()`** function, and we can use the same function to see the names:
+
+```{r}
+names(patients) <- c("First_Name", "Second_Name",
+ "Full_Name", "Sex", "Age",
+ "Weight", "Consent")
+```
+
+```{r}
+names(patients)
+```
+
+
+- Or we can name the variables when we define the data frame
+
+```{r}
+patients <- data.frame(First_Name = firstName,
+ Second_Name = secondName,
+ Full_Name = paste(firstName,
+ secondName),
+ Sex = sex,
+ Age = age,
+ Weight = weight,
+ Consent = consent)
+
+```
+
+```{r}
+names(patients)
+```
+
+##Factors in data frames
+
+- When creating a data frame, R assumes all character vectors should be categorical variables and converts them to factors. This is not
+always what we want:
+ + e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
+
+```{r}
+patients$First_Name
+```
+
+
+- We can avoid this by asking R not to treat strings as factors, and
+then explicitly stating when we want a factor by using **`factor()`**:
+
+```{r}
+patients <- data.frame(First_Name = firstName,
+ Second_Name = secondName,
+ Full_Name = paste(firstName,
+ secondName),
+ Sex = factor(sex),
+ Age = age,
+ Weight = weight,
+ Consent = consent,
+ stringsAsFactors = FALSE)
+patients
+```
+
+```{r}
+patients$Sex
+patients$First_Name
+```
+
+## Removing variables
+
+Now that we are happy with our data frame, we no longer have any use for the vectors that were used to create it
+
+- R has a function called `rm` that will allow us to remove variables
+
+
+```{r eval=FALSE}
+rm(age)
+```
+
+Once something has been removed, we can no longer use it
+
+```{r eval=FALSE}
+age
+```
+
+Multiple objects can be removed at the same time
+
+```{r}
+rm(list = c("age","firstName","secondName","sex","weight","consent"))
+
+```
+
+## Adding additional columns
+
+Recall that we can create a new variable using an assignment operator and specifying a name that R isn't currently using as a variable name
+
+```{r}
+myNewVariable <- 42
+myNewVariable
+```
+
+We use a similar trick to define new columns in the data frame
+- The value you assign must be the same length as the number of rows in the data frame.
+
+```{r}
+patients$ID
+patients$ID <- paste("Patient", 1:10)
+patients
+```
+
+
+##Indexing data frames and matrices
+
+- You can index multidimensional data structures like matrices and data
+frames using commas:
+- **`object[rows, colums]`**
+- Try and predict what each of the following commands will do:-
+
+```{r}
+patients[2,1]
+```
+
+
+```{r}
+patients[1,2]
+
+```
+
+```{r}
+patients[1,1:3]
+
+```
+
+- If you don't provide an index for either rows or columns, all of the rows or columns will be returned.
+
+```{r}
+patients[1,]
+
+```
+
+- Rows or columns can be omitted by putting a `-` in front of the index
+
+```{r}
+patients[,-1]
+patients[-c(5,7),]
+```
+
+##Advanced indexing
+
+- Indices are actually vectors, and can be ***numeric*** or ***logical***:
+- We won't always know in advance which indices we want to return
+ + we might want all values that exceed a particular value or satisfy some other criteria
+- In this example, `letters` is a vector containing all letters in the English alphabet
+
+```{r}
+letters
+s <- letters[1:5]
+s
+```
+
+So far we have seen how to extract the first and third values in the vector
+
+```{r}
+s[c(1,3)]
+```
+
+R can perform the same operation using a vector of logical values. Only indices with a `TRUE` value will get returned
+
+```{r}
+s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
+```
+
+
+- We can do the logical test and indexing in the same line of R code
+ + R will do the test first, and then use the vector of `TRUE` and `FALSE` values to subset the vector
+```{r}
+a <- 1:5
+
+a < 3
+s[a < 3]
+```
+
+
+## Logical Operators
+
+- Operators allow us to combine multiple logical tests
+- comparison operators
+**`<, >, <=, >=, ==, !=`**
+- logical operators
+**`!, &, |, xor`**
+ + The operators for 'comparison' and 'logical' always return logical values! i.e. (`TRUE`, `FALSE`)
+
+
+```{r}
+
+s[a > 1 & a <3]
+s[a == 2]
+```
+
+The vector that you use to perform the logical test could be extracted from a data frame
+
+- which could then be used to subset the data frame
+
+```{r}
+patients$First_Name == "Peter"
+patients[patients$First_Name == "Peter",]
+```
+
+
+
+##Exercise: Exercise 2
+
+- Write R code to print the following subsets of the patients data frame
+- The first and second rows, and the first and second colums
+
+| |First_Name|Second_Name
+|--|-------|-------|
+|1 |Adam |Jones
+|2 |Eve |Parker
+
+- Only even-numbered rows
+
+HINT: you can use the `seq` function that we saw earlier to define a vector of even numbers
+
+| |First_Name|Second_Name|Full_Name|Sex |Age|Weight |Consent|
+|--|-------|-------|--------------|:----:|--:|------:|:-----:|
+|2 |Eve |Parker |Eve Parker |Female|21 | 67.9 | TRUE|
+|4 |Mary |Davis |Mary Davis |Female|45 | 61.9 | TRUE|
+|6 |Paul |Daniels|Paul Daniels | Male|31 | 69.9 | FALSE|
+|8 |Matthew|Smith |Matthew Smith | Male|33 | 71.5 | TRUE|
+|10|Sally |Wilson |Sally Wilson |Female|62 | 64.8 | TRUE|
+
+- All rows except the last one, all columns
+
+HINT: the `nrow` function will give the number of rows in the data frame
+
+| |First_Name|Second_Name|Full_Name|Sex |Age|Weight |Consent|
+|--|-------|-------|--------------|:----:|--:|------:|:-----:|
+|1 |Adam |Jones |Adam Jones | Male|50 | 70.8 | TRUE|
+|2 |Eve |Parker |Eve Parker |Female|21 | 67.9 | TRUE|
+|3 |John |Evans |John Evans | Male|35 | 75.3 | FALSE|
+|4 |Mary |Davis |Mary Davis |Female|45 | 61.9 | TRUE|
+|5 |Peter |Baker |Peter Baker | Male|28 | 72.4 | FALSE|
+|6 |Paul |Daniels|Paul Daniels | Male|31 | 69.9 | FALSE|
+|7 |Joanna |Edwards|Joanna Edwards|Female|42 | 63.5 | FALSE|
+|8 |Matthew|Smith |Matthew Smith | Male|33 | 71.5 | TRUE|
+|9 |David |Roberts|David Roberts | Male|57 | 73.2 | FALSE|
+
+- Use logical indexing to select the following patients from the data frame:
+ 1. Patients under 40
+ 2. Patients who give consent to share their data
+ 3. Men who weigh as much or more than the average European male (70.8 kg)
+
+
+```{r}
+age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
+weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9,
+ 63.5, 71.5, 73.2, 64.8)
+
+firstName <- c("Adam", "Eve", "John", "Mary",
+ "Peter", "Paul", "Joanna", "Matthew",
+ "David", "Sally")
+secondName <- c("Jones", "Parker", "Evans", "Davis",
+ "Baker","Daniels", "Edwards", "Smith",
+ "Roberts", "Wilson")
+
+consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE,
+ FALSE, FALSE, TRUE, FALSE, TRUE)
+
+sex <- c("Male", "Female", "Male", "Female", "Male",
+ "Male", "Female", "Male", "Male", "Female")
+patients <- data.frame(First_Name = firstName,
+ Second_Name = secondName,
+ Full_Name = paste(firstName,
+ secondName),
+ Sex = factor(sex),
+ Age = age,
+ Weight = weight,
+ Consent = consent,
+ stringsAsFactors = FALSE)
+rm(list = c("firstName","secondName","sex","weight","consent"))
+patients
+
+### Your Answer ###
+
+
+```
+
+
+
+## (Supplementary) Matrices
+
+- Data frames are R's speciality, but R also handles matrices:
+ + All columns are assumed to contain the same data type, e.g. numerical
+ + Matrices can be manipulated in the same fashion as data frame
+ + We can easily convert between the two object types
+
+```{r}
+e <- matrix(1:10, nrow=5, ncol=2)
+e
+```
+
+- Some calculations are more efficient to do on matrices, e.g.:
+
+```{r}
+rowMeans(e)
+```
+
+Matrices (and indeed data frames) can be joined together using the functions `cbind` and `rbind`
+
+Let's first create some example data
+```{r}
+
+mat1 <- matrix(11:20, nrow=5,ncol=2)
+mat1
+mat2 <- matrix(21:30, nrow=5, ncol=2)
+mat2
+mat3 <- matrix(31:40,nrow=5,ncol=2)
+mat3
+
+```
+
+and now try out these functions:-
+
+```{r}
+cbind(mat1,mat2)
+rbind(mat1,mat3)
+```
diff --git a/Session1.2-data-structures.nb.html b/Session1.2-data-structures.nb.html
new file mode 100644
index 0000000..168d184
--- /dev/null
+++ b/Session1.2-data-structures.nb.html
@@ -0,0 +1,1389 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 1
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Introduction to Solving Biological Problems Using R - Day 1
+
Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+
Last modified: 16 Jun 2017
+
+
+
+
+
+
+
2. Data structures
+
+
R is designed to handle experimental data
+
+
Although the basic unit of R is a vector, we usually handle data in data frames.
+
A data frame is a set of observations of a set of variables – in other words, the outcome of an experiment.
+
For example, we might want to analyse information about a set of patients.
+
To start with, let’s say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
+
We are going to create a data frame called ‘patients’, which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
+
We will explore how to construct these data from scratch.
+
+
(in practice, we would usually import such data from a file)
+
+
+
+
+
+
+
First_Name
+
Second_Name
+
Full_Name
+
Sex
+
Age
+
Weight
+
Consent
+
+
+
+
+
1
+
Adam
+
Jones
+
Adam Jones
+
Male
+
50
+
70.8
+
TRUE
+
+
+
2
+
Eve
+
Parker
+
Eve Parker
+
Female
+
21
+
67.9
+
TRUE
+
+
+
3
+
John
+
Evans
+
John Evans
+
Male
+
35
+
75.3
+
FALSE
+
+
+
4
+
Mary
+
Davis
+
Mary Davis
+
Female
+
45
+
61.9
+
TRUE
+
+
+
5
+
Peter
+
Baker
+
Peter Baker
+
Male
+
28
+
72.4
+
FALSE
+
+
+
6
+
Paul
+
Daniels
+
Paul Daniels
+
Male
+
31
+
69.9
+
FALSE
+
+
+
7
+
Joanna
+
Edwards
+
Joanna Edwards
+
Female
+
42
+
63.5
+
FALSE
+
+
+
8
+
Matthew
+
Smith
+
Matthew Smith
+
Male
+
33
+
71.5
+
TRUE
+
+
+
9
+
David
+
Roberts
+
David Roberts
+
Male
+
57
+
73.2
+
FALSE
+
+
+
10
+
Sally
+
Wilson
+
Sally Wilson
+
Female
+
62
+
64.8
+
TRUE
+
+
+
+
+
+
Character, numeric and logical data types
+
+
Each column is a vector, like previous vectors we have seen, for example:
Notice how a particular line of R code can be typed over multiple lines. R won’t execute the code until it sees the closing bracket ) that matches the initial bracket () - We often use this trick to make our code easier to read
+
+
We also have a new type of vector, the logical vector, which only contains the values TRUE and FALSE:
When creating a data frame, R assumes all character vectors should be categorical variables and converts them to factors. This is not always what we want:
+
+
e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
+
+
+
+
+
+
patients$First_Name
+
+
+
[1] Adam Eve John Mary Peter Paul Joanna Matthew David Sally
+Levels: Adam David Eve Joanna John Mary Matthew Paul Peter Sally
+
+
+
+
+
We can avoid this by asking R not to treat strings as factors, and then explicitly stating when we want a factor by using factor():
+
+
+
+
+
+
+
+
diff --git a/Session1.3-walkthrough.Rmd b/Session1.3-walkthrough.Rmd
new file mode 100644
index 0000000..9d1b52a
--- /dev/null
+++ b/Session1.3-walkthrough.Rmd
@@ -0,0 +1,293 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 1"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+# 3. R for data analysis
+
+##3 steps to Basic Data Analysis
+
+- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
+
+1. Reading in data
+ + `read.table()`
+ + `read.csv(), read.delim()`
+2. Analysis
+ + Manipulating & reshaping the data
+ + perhaps dealing with "missing data"
+ + Any maths you like
+ + Diagnostic Plots
+3. Writing out results
+ + `write.table()`
+ + `write.csv()`
+
+## A simple walkthrough
+
+- We have data from 100 patients that given consent for their data to use in future studies
+- A researcher wants to undertake a study involving people that are overweight
+- We will walkthrough how to filter the data and write a new file with the candidates for the study
+
+##The Working Directory (wd)
+
+
+- Like many programs R has a concept of a working directory
+- It is the place where R will look for files to execute and where it will
+save files, by default
+- For this course we need to set the working directory to the location
+of the course scripts
+- In RStudio use the mouse and browse to the directory where you saved the Course Materials
+
+- ***Session → Set Working Directory → Choose Directory...***
+
+## 0. Locate the data
+
+Before we even start the analysis, we need to be sure of where the data are located on our hard drive
+
+- Functions that import data need a file location as a character vector
+- The default location is the ***working directory***
+```{r}
+getwd()
+```
+
+- If the file you want to read is in your working directory, you can just use the file name
+
+```{r eval=FALSE}
+list.files()
+```
+
+- The `file.exists` function does exactly what it says on the tin!
+ + a good sanity check for your code
+
+```{r}
+file.exists("patient-info.txt")
+```
+
+- Otherwise you need the *path* to the file
+ + you can get this using **`file.choose()`**
+
+- If you unsure about specifying a file path at the command line, this [online tutorial](http://rik.smith-unna.com/command_line_bootcamp/?id=vczhybjhtyt) will give you hands-on practice
+
+##1. Read in the data
+
+- The data are a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
+- We need to read in the results and assign it to an object (`patients`)
+
+```{r}
+patients <- read.delim("patient-info.txt")
+
+```
+
+In the latest RStudio, there is the option to import data directly from the File menu. ***File*** -> ***Import Dataset*** -> ***From Csv***
+
+- If the data are comma-separated, then use either the argument `sep=","` or the function `read.csv()`:
+- You need to make sure you use the correct function
+ + can you explain the output of the following lines of code?
+
+```{r }
+tmp <- read.csv("patient-info.txt")
+head(tmp)
+```
+- For full list of arguments:
+```{r}
+?read.table
+```
+
+##1b. Check the data
+- *Always* check the object to make sure the contents and dimensions are as you expect
+- R will sometimes create the object without error, but the contents may be un-usable for analysis
+ + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
+
+```{r}
+# View the first 10 rows to ensure import is OK
+patients[1:10,]
+```
+
+
+- or use the `View()` function to get a display of the data in RStudio:
+```{r}
+View(patients)
+```
+
+##1c. Understanding the object
+
+- Once we have read the data successfully, we can start to interact with it
+- The object we have created is a *data frame*:
+```{r}
+class(patients)
+```
+
+- We can query the dimensions:
+
+```{r}
+ncol(patients)
+nrow(patients)
+dim(patients)
+```
+
+
+- The names of the columns are automatically assigned:
+
+```{r}
+colnames(patients)
+```
+
+- We can use any of these names to access a particular column:
+ + and create a vector
+ + TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
+```{r}
+patients$ID
+
+```
+
+## Word of warning
+
+
+
+
+
+
+
+
+> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
+
+You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them
+
+- http://www.datacarpentry.org/spreadsheet-ecology-lesson/
+- http://kbroman.org/dataorg/
+
+##Handling missing values
+
+- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
+- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc)
+- `NA` is not the same as `NULL`:
+ - `NULL` is an empty R object.
+ - `NA` is one missing value within an R object (like a data frame or a vector)
+- Often R functions will handle `NA`s gracefully:
+
+```{r}
+length(patients$Height)
+mean(patients$Height)
+```
+
+- However, sometimes we have to tell the functions what to do with them.
+- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
+ + annoyingly, different functions have different argument names to change their behaviour with regards to `NA` values. *Always check the documentation*
+
+```{r}
+mean(patients$Height, na.rm = TRUE)
+
+mean(na.omit(patients$Height))
+```
+
+##2. Analysis (reshaping data and maths)
+
+- Our analysis involves identifying patients with extreme BMI
+ + we will define this as being two standard deviations from the mean
+
+```{r}
+# Create an index of results:
+BMI <- (patients$Weight)/((patients$Height/100)^2)
+upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
+upper.limit
+```
+
+
+- We can plot a simple chart of the BMI values
+ + add a vertical line to indicate the cut-off
+ + plotting will be covered in detail shortly..
+
+```{r}
+plot(BMI)
+# Add a horizonal line:
+abline(h=upper.limit)
+```
+
+- It is also useful to save the variable we have computed as a new column in the data frame
+
+```{r}
+round(BMI,1)
+patients$BMI <- round(BMI,1)
+head(patients)
+```
+
+- To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
+ + if the second line looks a bit weird, remember that `<-` is doing an assignment. Thevalue we are assigning to our new variable is the logical (`TRUE` or `FALSE`) vector given by testing each item in `BMI` against the `upper.limit`
+
+```{r}
+BMI > upper.limit
+candidates <- BMI > upper.limit
+```
+
+We have seen that a logical vector can be used to subset a data frame
+
+- However, in our case the result looks a bit funny
+- Can you think why this might be?
+
+```{r}
+patients[candidates,]
+```
+
+The `which` function will take a logical vector and return the indices of the `TRUE` values
+
+- This can then be used to subset the data frame
+
+```{r}
+which(BMI > upper.limit)
+candidates <- which(BMI > upper.limit)
+```
+
+
+## 3. Outputting the results
+
+- We write out a data frame of candidates (patients with BMI more than standard deviations from the mean) as a 'comma separated values' text file (CSV):
+
+```{r}
+write.csv(patients[candidates,], file="selectedSamples.csv")
+```
+
+- The output file is directly-readable by Excel
+- It's often helpful to double check where the data has been saved. Use the *get working directory* function:
+
+```{r eval=FALSE}
+getwd() # print working directory
+list.files() # list files in working directory
+
+```
+
+
+To recap, the set of R commands we have used is:-
+
+```{r}
+patients <- read.delim("patient-info.txt")
+BMI <- (patients$Weight)/((patients$Height/100)^2)
+upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
+plot(BMI)
+# Add a horizonal line:
+abline(h=upper.limit)
+patients$BMI <- round(BMI,1)
+candidates <- which(BMI > upper.limit)
+write.csv(patients[candidates,], file="selectedSamples.csv")
+
+```
+
+##Exercise: Exercise 3
+
+- A separate study is looking for patients that are underweight and also smoke;
+ + Modify the condition in our previous code to find these patients
+ + e.g. having BMI that is 2 standard deviations *less* than the mean BMI
+ + Write out a results file of the samples that match these criteria, and open it in a spreadsheet program
+
+
+```{r}
+### Your Answer Here ###
+
+
+
+```
+
diff --git a/Session1.3-walkthrough.nb.html b/Session1.3-walkthrough.nb.html
new file mode 100644
index 0000000..a506a26
--- /dev/null
+++ b/Session1.3-walkthrough.nb.html
@@ -0,0 +1,844 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 1
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
+
+
You will make your life a lot easier if you keep your data tidy and organised. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them
The data frame contains some NA values, which means the values are missing – a common occurrence in real data collection
+
NA is a special value that can be present in objects of any type (logical, character, numeric etc)
+
NA is not the same as NULL:
+
+
NULL is an empty R object.
+
NA is one missing value within an R object (like a data frame or a vector)
+
+
Often R functions will handle NAs gracefully:
+
+
+
+
+
length(patients$Height)
+
+
+
[1] 100
+
+
+
mean(patients$Height)
+
+
+
[1] NA
+
+
+
+
+
However, sometimes we have to tell the functions what to do with them.
+
R has some built-in functions for dealing with NAs, and functions often have their own arguments (like na.rm) for handling them:
+
+
annoyingly, different functions have different argument names to change their behaviour with regards to NA values. Always check the documentation
+
+
+
+
+
+
mean(patients$Height, na.rm = TRUE)
+
+
+
[1] 167.4969
+
+
+
mean(na.omit(patients$Height))
+
+
+
[1] 167.4969
+
+
+
+
+
+
2. Analysis (reshaping data and maths)
+
+
Our analysis involves identifying patients with extreme BMI
+
+
we will define this as being two standard deviations from the mean
+
+
+
+
+
+
# Create an index of results:
+BMI <- (patients$Weight)/((patients$Height/100)^2)
+upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
+upper.limit
+
+
+
[1] 30.9533
+
+
+
+
+
We can plot a simple chart of the BMI values
+
+
add a vertical line to indicate the cut-off
+
plotting will be covered in detail shortly..
+
+
+
+
+
+
plot(BMI)
+# Add a horizonal line:
+abline(h=upper.limit)
+
+
+
+
+
+
+
+
It is also useful to save the variable we have computed as a new column in the data frame
+
+
+
+
+
round(BMI,1)
+
+
+
[1] 22.9 25.1 26.4 30.6 26.5 27.9 26.3 25.6 23.4 28.2 28.2 NA 30.0 27.9 24.5 22.0 25.6 31.5 23.8 NA 23.5 26.7 31.4 NA
+ [25] 24.6 NA 24.8 29.2 NA 24.1 25.1 28.0 29.4 28.2 23.6 26.4 NA 25.0 27.7 27.0 25.6 26.7 24.5 26.1 23.1 28.2 26.9 NA
+ [49] 25.4 25.9 NA 24.8 28.2 NA 30.4 26.8 26.0 25.2 26.9 31.7 25.6 NA 26.7 27.8 28.4 NA 31.5 27.0 30.0 26.5 25.2 NA
+ [73] 26.7 25.8 NA 27.6 29.1 26.6 26.6 26.9 27.6 26.4 27.8 NA 27.8 25.8 27.7 28.7 24.2 24.6 28.3 24.8 27.8 21.4 28.0 26.0
+ [97] 26.2 26.4 27.7 NA
+
+
+
patients$BMI <- round(BMI,1)
+head(patients)
+
+
+
+
+
+
+
+
+
+
To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
+
+
if the second line looks a bit weird, remember that <- is doing an assignment. Thevalue we are assigning to our new variable is the logical (TRUE or FALSE) vector given by testing each item in BMI against the upper.limit
+
+
+
+
+
+
BMI > upper.limit
+
+
+
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE TRUE FALSE NA
+ [21] FALSE FALSE TRUE NA FALSE NA FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE
+ [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE NA FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE TRUE
+ [61] FALSE NA FALSE FALSE FALSE NA TRUE FALSE FALSE FALSE FALSE NA FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE
+ [81] FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE NA
+
+
+
candidates <- BMI > upper.limit
+
+
+
+
We have seen that a logical vector can be used to subset a data frame
+
+
However, in our case the result looks a bit funny
+
Can you think why this might be?
+
+
+
+
+
patients[candidates,]
+
+
+
+
+
+
+
+
+
The which function will take a logical vector and return the indices of the TRUE values
+
+
This can then be used to subset the data frame
+
+
+
+
+
which(BMI > upper.limit)
+
+
+
[1] 18 23 60 67
+
+
+
candidates <- which(BMI > upper.limit)
+
+
+
+
+
+
3. Outputting the results
+
+
We write out a data frame of candidates (patients with BMI more than standard deviations from the mean) as a ‘comma separated values’ text file (CSV):
+
+
+
+
+
+
+
+
diff --git a/Session1.4-plotting.Rmd b/Session1.4-plotting.Rmd
new file mode 100644
index 0000000..3f86a19
--- /dev/null
+++ b/Session1.4-plotting.Rmd
@@ -0,0 +1,340 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 1"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+
+
+# 4. Plotting in R
+
+##Plot basics
+
+- As we have heard, R has extensive graphical capabilities
+- ...but we need to start simple
+- We will describe *base* graphics in R: the plots available with any standard R installation
+ + other more advanced alternatives are, e.g., `lattice`, `ggplot2`
+ + See our [intermediate R course](http://bioinformatics-core-shared-training.github.io/r-intermediate/) for fancy graphics
+- Plotting in R is a *vast* topic:
+ + We cannot cover everything
+ + You can tinker with plots to your hearts content
+ + Best to learn from examples; e.g. [The R Graph Gallery](http://www.r-graph-gallery.com/)
+- ***You need to think about how best to visualise your data***
+ + http://www.bioinformatics.babraham.ac.uk/training.html#figuredesign
+- R cannot prevent you from creating a plotting disaster:
+ + http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6?op=1&IR=T
+
+##Making a Scatter Plot
+
+- If given a single vector as an argument, the function **`plot()`** will make a scatter plot with the *values* of the vector on the *y* axis, and *indices* in the *x* axis
+ + e.g. it puts a point at:
+ + x = 1, y = 70.8
+ + x = 2, y = 67.9 etc...
+- We are going to be using the patients data frame, read using the following command
+
+```{r}
+patients <- read.delim("patient-info.txt")
+```
+
+Remember that `$` can be used to access a particular column. The result is a vector, which is the most-basic type of data used in plotting
+
+```{r}
+patients$Weight
+```
+
+
+- R tries to guess the most appropriate way to visualise the data, according to the type and dimensions of the object(s) provided
+
+
+```{r}
+plot(patients$Weight)
+```
+
+- Axis limits, labels, titles are inferred from the data
+ + We can modify these as we wish, by specifying ***arguments***
+
+- We can give two arguments to `plot()`:
+ + In order to visualise the relationship between two variables
+ + It will put the values from the *first* argument in the *x* axis, and values from the *second* argument on the *y* axis
+
+```{r}
+patients$Age
+plot(patients$Age, patients$Weight)
+```
+
+##Making a barplot
+
+- Other types of visualisation are available:
+ + These are often just special cases of using the `plot()` function
+ + One such function is `barplot()`
+
+
+```{r}
+barplot(patients$Age)
+```
+
+
+- It is more usual to display count data in a barplot
+ + e.g. the counts of a particular ***categorical*** variable
+
+```{r}
+barplot(summary(patients$Sex))
+```
+
+##Plotting a distribution: Histogram
+
+- A histogram is a popular way of visualising a distribution of ***continuous*** data:
+ + You can change the width of bins
+ + The y-axis can be either frequency of density
+
+```{r}
+hist(patients$Weight)
+```
+
+
+
+##Plotting a distribution: Boxplot
+
+- The boxplot is commonly used in statistics to visualise a distribution:
+```{r}
+boxplot(patients$Weight)
+```
+
+- The black solid line is the ***median***
+- The top and bottom of the box are the 75th and 25th percentiles
+ + Hence, the distance between these is a reflection of the *spread* of the data; the Inter-Quartile Range (***IQR***)
+- Whiskers are drawn at 1.5 x IQR and -1.5 x IQR
+
+
+- Sometimes we want to compare distributions between different categories in our data
+- For this we need to use the '*formula*' syntax
+ + For now, `y ~ x` means put continuous variable `y` on the *y* axis and categorical `x` on the x axis
+```{r}
+boxplot(patients$Weight ~ patients$Sex)
+
+```
+
+- Other alternatives to consider:
+ - `example(dotchart)`
+ - `example(stripchart)`
+ - `example(vioplot) # From vioplot library`
+ - `example(beeswarm) # From beeswarm library`
+
+## Exercise: Exercise 4a
+
+- In the course folder you will find the file `ozone.csv`:
+ + Data describing weather conditions in New York City in 1973, obtained from the [supplementary data](http://faculty.washington.edu/heagerty/Books/Biostatistics/index-chapter.html) to *Biostatistics: A Methodology for the Health Sciences*
+ + Full description here: http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozonedoc.txt
+1. Read these data into R using `read.csv` or `read.delim` as described in the previous section
+ + you will need to choose which is appropriate for the file type
+2. What data types are present? Try to think of ways to create the following plots from the data
+ + Scatter plot two variables. e.g. Solar Radiation against Ozone
+ + A histogram. e.g. Wind Speed
+ + Boxplot of a continuous variable against a categorical variable. e.g. Ozone level per month
+
+
+
+
+
+```{r}
+### Your Answer Here ###
+
+
+
+```
+
+
+## Simple customisations
+
+- `plot()` comes with a large collection of arguments that can be set when we call the function:
+ + See `?plot` and `?par`
+- Recall that, unless specified, arguments have a default value
+- We can choose to draw lines on the plot rather than points
+ + The rest of the plot remains the same
+
+```{r}
+plot(patients$Weight, type = "l")
+```
+
+
+- We can also have both lines and points:
+
+```{r}
+plot(patients$Weight, type = "b")
+```
+
+
+- Add an informative title to the plot using the `main` argument:
+
+```{r}
+plot(patients$Age, patients$Weight,
+ main = "Relationship between Weight and Age")
+```
+
+
+
+- Adding the x-axis label:
+```{r}
+plot(patients$Age, patients$Weight, xlab = "Age")
+```
+
+- Adding the y-axis label:
+
+```{r}
+plot(patients$Age, patients$Weight, ylab = "Weight")
+```
+
+- We can specifiy multiple arguments at once:
+ + here `ylim` and `xlim` are used to specify axis limits
+```{r}
+plot(patients$Age,patients$Weight,
+ ylab="Weight",
+ xlab="Age",
+ main="Relationship between Weight and Age",
+ xlim=c(10,70),
+ ylim=c(60,80))
+
+```
+
+##Defining a colour
+
+- R can recognise various strings, such as `"red"`, `"orange"`,`"green"`,`"blue"`,`"yellow"`...
+- Or more exotic ones like ``r sample(colours(),8)``...
+ + See `colours()`
+- See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
+- Can also use **R**ed **G**reen **B**lue and hexadecimal values:
+
+ + `rgb(0.7, 0.7, 0.7)` → A light grey in RGB format`
+ + `"#B3B3B3"` → The same light grey in hexadecimal
+ + `"#0000FF88"`→ A semi-transparent blue, in hexadecimal
+ + The hexadecimal system is the native colour system for screen visualisation (e.g. webs). It indicates the intensity of Red, Green and Blue by using two digits for each colour, in a scale from 0-9 and A-F (0 meaning no intensity and F meaning most intense)
+
+
+Changing the `col` argument to `plot()` changes the colour that the points are plotted in:
+
+```{r}
+plot(patients$Age, patients$Weight, col = "red")
+```
+
+
+##Plotting characters
+
+- R can use a variety of **p**lotting **ch**aracters
+- Each of which has a numeric *code*
+
+
+
+```{r}
+plot(patients$Age, patients$Weight, pch = 16)
+```
+
+
+
+- Or you can specify a character:
+
+```{r}
+plot(patients$Age, patients$Weight, pch = "X")
+```
+
+
+##Size of points
+
+**C**haracter **ex**pansion: Make the size of points 3 times larger than the default
+
+```{r}
+plot(patients$Age, patients$Weight, cex = 3)
+```
+
+or 20% of the original size
+
+```{r}
+plot(patients$Age, patients$Weight, cex = 0.2)
+```
+
+##Colours and characters as vectors
+
+- Previously we have used a *vector* of length 1 as our value of colour and character
+- We can use a vector of any length:
+ + the values will get *recycled* (re-used) so that each point gets assigned a value
+- We can use a pre-defined ***colour palette*** (see later)
+
+```{r}
+plot(patients$Age, patients$Weight,
+ col = c("red","blue"))
+```
+
+##Other plots use the same arguments
+
+- Other plotting functions use the same arguments as `plot()`
+ + technical explanation: the arguments are *'inherited'*
+
+```{r}
+boxplot(patients$Weight~patients$Sex,
+ xlab = "Sex",
+ ylab = "Weight",
+ main = "Relationship between Weight and Gender",
+ col = c("blue","yellow"))
+```
+
+
+##Exercise: exercise4b
+
+- Can you re-create the following plots? Hint:
+ + See the `breaks` and `freq` arguments to hist (`?hist`) to create 20 bins and display density rather than frequency
+ + For third plot, see the rainbow function (`?rainbow`)
+ + Don't worry too much about getting the colours exactly correct
+ + The `las` argument changes the label orientation. See `?par`.
+ + look at the arguments to `boxplot` to see how to change the names printed under each box
+
+
+
+```{r}
+### Your Answer Here ###
+
+
+```
+
+## More on colours
+
+- The **`rainbow()`** function is used to create a vector of colours for the boxplot; in other words a ***palette***:
+ + Red, Orange, Yellow, Green, Blue, Indigo, Violet, etc.
+ + Other palette functions available: `heat.colors(), terrain.colors(), topo.colors(), cm.colors()`
+ + Red, Orange, Yellow, Green, Blue, Indigo, Violet....etc
+
+
+
+
+- More aesthetically-pleasing palettes are provided by the **`RColorBrewer`** package:
+ + can also check for palettes that are accepted for those with colour-blindness
+- You may need to *install* `RColorBrewer` with the following line of code
+ + remember, you only need to do this once
+
+```{r eval=FALSE}
+install.packages("RColorBrewer")
+```
+
+
+```{r}
+library(RColorBrewer)
+display.brewer.all()
+display.brewer.all(colorblindFriendly = TRUE)
+```
+
+```{r}
+weather <- read.csv("ozone.csv")
+boxplot(weather$Temp ~ weather$Month,col=brewer.pal(5,"Set1"))
+```
+
+
+#End of Day 1
+
+## To come tomorrow...
+- More customisation of plots
+- Statistics
+- Further manipulation of data
+- Report writing
diff --git a/Session1.4-plotting.nb.html b/Session1.4-plotting.nb.html
new file mode 100644
index 0000000..7e7ee2b
--- /dev/null
+++ b/Session1.4-plotting.nb.html
@@ -0,0 +1,803 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 1
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
If given a single vector as an argument, the function plot() will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
+
+
e.g. it puts a point at:
+
+
x = 1, y = 70.8
+
x = 2, y = 67.9 etc…
+
+
+
We are going to be using the patients data frame, read using the following command
+
+
+
+
+
patients <- read.delim("patient-info.txt")
+
+
+
+
Remember that $ can be used to access a particular column. The result is a vector, which is the most-basic type of data used in plotting
+
+
+
+
patients$Weight
+
+
+
+
+
R tries to guess the most appropriate way to visualise the data, according to the type and dimensions of the object(s) provided
+
+
+
+
+
plot(patients$Weight)
+
+
+
+
+
Axis limits, labels, titles are inferred from the data
+
+
We can modify these as we wish, by specifying arguments
+
+
We can give two arguments to plot():
+
+
In order to visualise the relationship between two variables
+
It will put the values from the first argument in the x axis, and values from the second argument on the y axis
+
+
+
+
+
+
patients$Age
+plot(patients$Age, patients$Weight)
+
+
+
+
+
+
Making a barplot
+
+
Other types of visualisation are available:
+
+
These are often just special cases of using the plot() function
+
One such function is barplot()
+
+
+
+
+
+
barplot(patients$Age)
+
+
+
+
+
It is more usual to display count data in a barplot
+
+
e.g. the counts of a particular categorical variable
+
+
+
+
+
+
barplot(summary(patients$Sex))
+
+
+
+
+
+
Plotting a distribution: Histogram
+
+
A histogram is a popular way of visualising a distribution of continuous data:
+
+
You can change the width of bins
+
The y-axis can be either frequency of density
+
+
+
+
+
+
hist(patients$Weight)
+
+
+
+
+
+
Plotting a distribution: Boxplot
+
+
The boxplot is commonly used in statistics to visualise a distribution:
+
+
+
+
+
boxplot(patients$Weight)
+
+
+
+
+
The black solid line is the median
+
The top and bottom of the box are the 75th and 25th percentiles
+
+
Hence, the distance between these is a reflection of the spread of the data; the Inter-Quartile Range (IQR)
+
+
Whiskers are drawn at 1.5 x IQR and -1.5 x IQR
+
Sometimes we want to compare distributions between different categories in our data
+
For this we need to use the ‘formula’ syntax
+
+
For now, y ~ x means put continuous variable y on the y axis and categorical x on the x axis
+
+
+
+
+
+
boxplot(patients$Weight ~ patients$Sex)
+
+
+
+
+
Other alternatives to consider:
+
+
example(dotchart)
+
example(stripchart)
+
example(vioplot) # From vioplot library
+
example(beeswarm) # From beeswarm library
+
+
+
+
+
Exercise: Exercise 4a
+
+
In the course folder you will find the file ozone.csv:
+
+
Data describing weather conditions in New York City in 1973, obtained from the supplementary data to Biostatistics: A Methodology for the Health Sciences
Can also use Red Green Blue and hexadecimal values:
+
+
rgb(0.7, 0.7, 0.7) → A light grey in RGB format`
+
"#B3B3B3" → The same light grey in hexadecimal
+
"#0000FF88"→ A semi-transparent blue, in hexadecimal
+
+
The hexadecimal system is the native colour system for screen visualisation (e.g. webs). It indicates the intensity of Red, Green and Blue by using two digits for each colour, in a scale from 0-9 and A-F (0 meaning no intensity and F meaning most intense)
+
+
+
+
Changing the col argument to plot() changes the colour that the points are plotted in:
+
+
+
+
plot(patients$Age, patients$Weight, col = "red")
+
+
+
+
+
+
Plotting characters
+
+
R can use a variety of plotting characters
+
Each of which has a numeric code
+
+
+
+
+
+
+
+
+
plot(patients$Age, patients$Weight, pch = 16)
+
+
+
+
+
Or you can specify a character:
+
+
+
+
+
plot(patients$Age, patients$Weight, pch = "X")
+
+
+
+
+
+
Size of points
+
Character expansion: Make the size of points 3 times larger than the default
+
+
+
+
plot(patients$Age, patients$Weight, cex = 3)
+
+
+
+
or 20% of the original size
+
+
+
+
plot(patients$Age, patients$Weight, cex = 0.2)
+
+
+
+
+
+
Colours and characters as vectors
+
+
Previously we have used a vector of length 1 as our value of colour and character
+
We can use a vector of any length:
+
+
the values will get recycled (re-used) so that each point gets assigned a value
+
+
We can use a pre-defined colour palette (see later)
+
+
+
+
+
plot(patients$Age, patients$Weight,
+ col = c("red","blue"))
+
+
+
+
+
+
Other plots use the same arguments
+
+
Other plotting functions use the same arguments as plot()
+
+
technical explanation: the arguments are ‘inherited’
+
+
+
+
+
+
boxplot(patients$Weight~patients$Sex,
+ xlab = "Sex",
+ ylab = "Weight",
+ main = "Relationship between Weight and Gender",
+ col = c("blue","yellow"))
+
+
+
+
+
+
Exercise: exercise4b
+
+
Can you re-create the following plots? Hint:
+
+
See the breaks and freq arguments to hist (?hist) to create 20 bins and display density rather than frequency
+
For third plot, see the rainbow function (?rainbow)
+
Don’t worry too much about getting the colours exactly correct
+
The las argument changes the label orientation. See ?par.
+
look at the arguments to boxplot to see how to change the names printed under each box
+
+
+
+
+
+
+
+
+
+
### Your Answer Here ###
+
+
+
+
+
+
More on colours
+
+
The rainbow() function is used to create a vector of colours for the boxplot; in other words a palette:
+
+
Red, Orange, Yellow, Green, Blue, Indigo, Violet, etc.
+
Other palette functions available: heat.colors(), terrain.colors(), topo.colors(), cm.colors()
+
+
+
+
+
+
+
+
diff --git a/Session2.1-plotting2.Rmd b/Session2.1-plotting2.Rmd
new file mode 100644
index 0000000..e8a8d43
--- /dev/null
+++ b/Session2.1-plotting2.Rmd
@@ -0,0 +1,421 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 2"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+# Day 2 Schedule
+
+1. Further customisation of plots
+2. Statistics
+3. Data Manipulation Techniques
+4. Programming in R
+5. Further report writing
+
+#1. Further customisation of plots
+
+## Recap
+
+- We have seen how to use `plot()`, `boxplot()` , `hist()` etc to make simple plots
+- These come with arguments that can be used to change the appearance of the plot
+ + `col`, `pch`
+ + `main`, `xlab`, `ylab`
+ + etc....
+- We will now look at ways to modify the plot appearance after it has been created
+- Also, how to export the graphs
+
+
+
+## The painter's model
+
+- R employs a painter's model to construct it's plots
+- Elements of the graph are added to the canvas one layer at a time, and the picture built up in levels.
+- Lower levels are obscured by higher levels,
+ + allowing for blending, masking and overlaying of objects.
+- You can't undo the changes you make to the plot
+ + you have to re-run the code from scratch
+
+
+
+[http://www.inquisitr.com/309687/jesus-painting-restoration-goes-wrong-well-intentioned-old-lady-destroys-100-year-old-fresco/](http://www.inquisitr.com/309687/jesus-painting-restoration-goes-wrong-well-intentioned-old-lady-destroys-100-year-old-fresco/)
+
+
+- We will re-use the patients data from yesterday:
+
+```{r}
+patients <- read.delim("patient-info.txt")
+```
+
+- Recall our patients dataset from yesterday
+ + we might want to display other characteristics on the plot, e.g. gender of individual:
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+```
+
+##The points function
+
+- `points()` can be used to set of points to an *existing* plot
+- It requires a vector of x and y coordinates
+ + These do not have to be the same length as the number of points in the initial plot:
+ + Hence we can use `points()` to highlight observations
+ + ...or add a set of new observations
+```{r }
+plot(patients$Height, patients$Weight, pch=16)
+points(160,90, pch="X")
+```
+
+- Note that axis limits of the existing plot are not altered
+- Often it is useful to create a blank 'canvas' with the correct labels and limits
+
+```{r}
+plot(patients$Height, patients$Weight, type="n")
+```
+
+## Adding points to differentiate gender
+
+- Selecting males using the **`==`** comparison we saw yesterday
+ + Gives a `TRUE` or `FALSE` value
+ + Can be used to index the data frame
+ + Which means we can get the relevant Age and Weight values
+```{r}
+males <- patients$Sex == "Male"
+males
+```
+
+```{r, eval=FALSE}
+males
+patients[males,]
+patients$Age[males]
+patients$Weight[males]
+```
+
+
+
+- The points we add have to be within the `x` and `y` limits of the original plot axes, otherwise they won't be displayed
+ + R won't give a warning or error if you attempt to plot points outside the axis limits
+
+```{r}
+plot(patients$Height, patients$Weight, type="n")
+points(patients$Height[males], patients$Weight[males],
+ pch=16, col="steelblue")
+
+```
+
+
+We can do the same for Females
+
+```{r}
+females <- patients$Sex == "Female"
+females
+patients[females,]
+```
+
+- Again, we have to be careful that all the points are within the `x` and `y` limits
+ + this is why creating the blank plot containing the limits of the data is useful
+
+```{r}
+plot(patients$Height, patients$Weight, type="n")
+points(patients$Height[males], patients$Weight[males],
+ pch=16, col="steelblue")
+points(patients$Height[females], patients$Weight[females],
+ pch=16, col="orangered1")
+
+```
+
+- Each set of points can have a different colour and shape
+- Axis labels and title and limits are defined by the plot
+- Once you've added points to a plot, they cannot be removed
+
+
+```{r }
+plot(patients$Height, patients$Weight, type="n")
+points(patients$Height[males], patients$Weight[males],
+ pch=16, col="steelblue")
+points(patients$Height[females], patients$Weight[females],
+ pch=17, col="orangered1",
+ xlab="Age of Patient",
+ ylab="Weight",
+ main="Relationship between Age and Weights")
+
+## The arguments xlab, ylab, main in the points functions are not used
+## Need to specify these labels when you create the plot initially
+
+```
+
+
+
+
+## Adding a legend
+
+- Should also add a legend to help interpret the plot
+ + use the `legend` function
+ + can give x and y coordinates where legend will appear
+ + also recognises shortcuts such as ***topleft*** and ***bottomright***...
+
+```{r}
+plot(patients$Height, patients$Weight, type="n")
+points(patients$Height[males], patients$Weight[males],
+ pch=16, col="steelblue")
+points(patients$Height[females], patients$Weight[females],
+ pch=17, col="orangered1")
+legend("topleft", legend=c("M","F"),
+ col=c("steelblue","orangered1"), pch=c(16,17))
+```
+
+##Adding text
+
+- Text can also be added to a plot in a similar manner
+ + the `labels` argument specifies the text we want to add
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+ text(patients$Height, patients$Weight, labels=patients$Race)
+```
+
+- In the above we used to same co-ordinates from the original plot to place the text
+- We can use arguments `pos` or `adj` to offset the positions of the labels
+ + `pos` can be 1, 2, 3, 4 for below, left, above, right
+ + `adj` is an adjustment in the range 0 to 1
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+ text(patients$Height, patients$Weight, labels=patients$Race,pos = 3)
+```
+
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+ text(patients$Height, patients$Weight, labels=patients$Race,adj =0.1)
+```
+
+
+- To aid our interpretation, it is often helpful to add guidelines
+ + `grid()` is one easy way of doing this:
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+grid(col="steelblue")
+```
+
+
+- Can also add lines that intersect the axes:
+ + `v =` for vertical lines
+ + `h =` for horizontal
+ + can specify multiple lines in a vector
+
+```{r}
+plot(patients$Height, patients$Weight, pch=16)
+abline(v=160, col="red")
+abline(h=c(65,70,75), col="blue")
+```
+
+
+## Plot layouts
+
+- The `par` function can be used specify the appearance of a plot
+- The settings persist until the plot is closed with **`dev.off()`**
+- `?par` and scroll to ***graphical parameters***
+- One example is `mfrow`:
+ + "multiple figures per row"
+ + needs to be a vector of rows and columns:
+ + e.g. a plot with one row and two columns `par(mfrow=c(1,2))`
+ + don't need the same kind of plot in each cell
+
+```{r}
+par(mfrow=c(1,2))
+plot(patients$Height, patients$Weight, pch=16)
+boxplot(patients$Weight ~ patients$Sex)
+
+```
+
+- See also `mar` for setting the margins:
+ + `par(mar=c(...))`
+
+## Exporting graphs from RStudio
+
+- You can use the `pdf()` function to export one or more plots to a pdf file:
+ + You will see that the plot does not appear in RStudio
+- You need to use the `dev.off()` to stop printing graphs to the pdf and 'close' the file
+ + It allows you to create a pdf document with multiple pages
+
+```{r}
+pdf("ExampleGraph.pdf")
+boxplot(patients$Weight ~ patients$Sex)
+dev.off()
+```
+
+- pdf is a good choice for publication as they can be imported into Photoshop, Inkscape, etc.
+ - Sometimes it is easier to edit in these tools than R!
+ - If it is taking too long to customise a plot in R, consider if you should be using one of these tools instead
+
+
+- To save any graph you have created to a pdf, repeat the code you used to create the plot with `pdf()` before and `dev.off()` afterwards
+ + you can have as many lines of code in-between as you like
+ + each new plot will be written to a different page
+
+```{r}
+pdf("mygraph.pdf")
+plot(patients$Height, patients$Weight, pch=16)
+abline(v=40, col="red")
+abline(h=c(65,70,75), col="blue")
+boxplot(patients$Weight ~ patients$Sex)
+x <- 1:10
+y <- x^2 + 4*x
+plot(x,y)
+dev.off()
+```
+
+- If no plots are appearing in RStudio, it could be you are still writing to a pdf file
+ + run `dev.off()` multiple times until you see a message `cannot shut down device (the null device)`
+
+
+- We can specify the dimensions of the plot, and other properties of the file (`?pdf`)
+
+```{r}
+pdf("ExampleGraph.pdf", width=10, height=5)
+boxplot(patients$Weight ~ patients$Sex)
+dev.off()
+```
+
+- Other formats can be created:
+ + e.g. ***png***, or others `?jpeg`
+ + more appropriate for email, presentations, web page
+
+```{r}
+png("ExampleGraph.png")
+boxplot(patients$Weight ~ patients$Sex)
+dev.off()
+```
+
+##Exercise: Exercise 5a
+- Return to the weather data from yesterday:
+
+```{r}
+weather <- read.csv("ozone.csv")
+```
+
+- Using the `par` function, create a layout with three columns
+- Plot Ozone versus Solar Radiation, Wind Speed and Temperature on separate graphs
+ + use different colours and plotting characters on each plot
+- Save the plot to a pdf
+- HINT: Create the graph first in RStudio. When you're happy with it, re-run the code preceeded by the `pdf` function to save to a file
+ + don't forget to use `dev.off()` to close the file
+
+
+
+```{r}
+### Your Answer Here ###
+
+
+```
+
+
+##Exercise: Exercise 5b
+- Temperature and Ozone level seem to be correlated
+- However, there are some observations that do not seem to fit the trend
+ + those with Ozone level > 100
+- Modify the plot so that these outlier observations are in a different colour
+- Add a legend to help interpret the plot
+
+
+
+
+
+HINT: You can break down the problem into the following steps
+
+- Create a blank plot
+- Identify observations with ozone > 100
+ + plot the corresponding Temperature and Ozone values for these in red
+- Identify observations with ozone < 100
+ + plot the corresponding Temperature and Ozone values for these in orange
+- Can you modify your code so that the cut-off is not hard-coded (fixed) to 100
+ + e.g. Ozone > 80, Ozone, 90 etc...
+ + can you re-generate the plots the minimal changes to the code
+
+```{r}
+
+### Your Answer Here ###
+
+
+```
+
+An alternative, and equally-valid, solution involves creating a vector of colours which will either be `red` or `orange` depending whether on the particular value of ozone is an outlier
+
+- we can use the `rep` function to create a vector of the required length for the entire dataset
+ + one vector for plotting colours, and another vector for plotting characters
+
+```{r}
+weather <- read.csv("ozone.csv")
+mycol <-rep("orange",nrow(weather))
+mypch <- rep(17, nrow(weather))
+
+```
+
+- now use a logical expression to identify the high ozone level, and replace the corresponding entries in the colour and plotting character vectors
+
+```{r}
+highO <- which(weather$Ozone > 100)
+mycol[highO] <- "red"
+mypch[highO] <- 18
+```
+
+- creating the plot can now be done with a single `plot` command
+
+```{r}
+plot(weather$Temp,weather$Ozone,
+ col=mycol, pch=mypch,ylab="Ozone level",
+ xlab="Temperature")
+abline(h=100,lty=2)
+legend("topleft", legend = c("Ozone > 100","Normal Ozone"),col=c("red","orange"),pch=17)
+```
+
+## Other plotting features
+
+We can choose not to show the x- and y-axis in the initial plot
+
+```{r}
+plot(patients$Age, patients$Weight, pch=16,axes=FALSE)
+```
+
+They can be added afterwards with the `axis` function
+ - first argument is whether the axis appears on the bottom (1), left (2), top (3) or right (4)
+ - can also define where the tick marks are located, and the labels to display
+ - the `box` function can be used to enclose the plot in a box
+```{r}
+plot(patients$Age, patients$Weight, pch=16,axes=FALSE)
+axis(1)
+axis(4, at = c(65,75,85,95),labels = c("65kg","75kg","85kg","95kg"))
+box()
+
+```
+
+
+
+## Ordering the boxes in a set of boxplots
+
+N.B. The order in which the boxes are displayed on a boxplot isn't something that can be controlled by graphical paramaters
+
+- the order is derived from the order of the categories
+ + in this case `Female` comes first alphabetically
+
+```{r}
+boxplot(patients$Weight~patients$Sex)
+```
+
+- a different order can be achieved by defining a new factor where the `levels` are in the order you want
+
+```{r}
+patients$Sex <- factor(patients$Sex,levels = c("Male","Female"))
+boxplot(patients$Weight~patients$Sex)
+```
+
+
+## Need inspiration?
+
+The [R Graph Gallery](http://www.r-graph-gallery.com/) has lots of examples (and code) showing what can be achieved with R graphics
\ No newline at end of file
diff --git a/Session2.1-plotting2.nb.html b/Session2.1-plotting2.nb.html
new file mode 100644
index 0000000..2831c91
--- /dev/null
+++ b/Session2.1-plotting2.nb.html
@@ -0,0 +1,1012 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
points(patients$Height[females], patients$Weight[females],
+ pch=17, col="orangered1",
+ xlab="Age of Patient",
+ ylab="Weight",
+ main="Relationship between Age and Weights")
+
+
+
+
+
+
## The arguments xlab, ylab, main in the points functions are not used
+## Need to specify these labels when you create the plot initially
+
+
+
+
+
+
Adding a legend
+
+
Should also add a legend to help interpret the plot
+
+
use the legend function
+
can give x and y coordinates where legend will appear
+
also recognises shortcuts such as topleft and bottomright…
Using the par function, create a layout with three columns
+
Plot Ozone versus Solar Radiation, Wind Speed and Temperature on separate graphs
+
+
use different colours and plotting characters on each plot
+
+
Save the plot to a pdf
+
HINT: Create the graph first in RStudio. When you’re happy with it, re-run the code preceeded by the pdf function to save to a file
+
+
don’t forget to use dev.off() to close the file
+
+
+
+
+
+
+
+
+
+
### Your Answer Here ###
+
+
+
+
+
+
Exercise: Exercise 5b
+
+
Temperature and Ozone level seem to be correlated
+
However, there are some observations that do not seem to fit the trend
+
+
those with Ozone level > 100
+
+
Modify the plot so that these outlier observations are in a different colour
+
Add a legend to help interpret the plot
+
+
+
+
+
+
HINT: You can break down the problem into the following steps
+
+
Create a blank plot
+
Identify observations with ozone > 100
+
+
plot the corresponding Temperature and Ozone values for these in red
+
+
Identify observations with ozone < 100
+
+
plot the corresponding Temperature and Ozone values for these in orange
+
+
Can you modify your code so that the cut-off is not hard-coded (fixed) to 100
+
+
e.g. Ozone > 80, Ozone, 90 etc…
+
can you re-generate the plots the minimal changes to the code
+
+
+
+
+
+
### Your Answer Here ###
+
+
+
+
An alternative, and equally-valid, solution involves creating a vector of colours which will either be red or orange depending whether on the particular value of ozone is an outlier
+
+
we can use the rep function to create a vector of the required length for the entire dataset
+
+
one vector for plotting colours, and another vector for plotting characters
They can be added afterwards with the axis function - first argument is whether the axis appears on the bottom (1), left (2), top (3) or right (4) - can also define where the tick marks are located, and the labels to display - the box function can be used to enclose the plot in a box
+
+
+
+
+
+
+
+
diff --git a/Session2.2-stats.Rmd b/Session2.2-stats.Rmd
new file mode 100644
index 0000000..d513a38
--- /dev/null
+++ b/Session2.2-stats.Rmd
@@ -0,0 +1,332 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 2"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+
+# 2. Statistics
+##Built-in support for statistics
+- R is a statistical programming language:
+ + Classical statistical tests are built-in
+ + Statistical modeling functions are built-in
+ + Regression analysis is fully supported
+ + Additional mathematical packages are available (`MASS`, Waves, sparse matrices, etc)
+
+##Distribution functions
+- Most commonly used distributions are built-in, functions have stereotypical names, e.g. for normal distribution:
+ + **`pnorm`** - cumulative distribution for x
+ + **`qnorm`** - inverse of pnorm (from probability gives x)
+ + **`dnorm`** - distribution density
+ + **`rnorm`** - random number from normal distribution
+
+
+
+- Available for variety of distributions: `punif` (uniform), `pbinom` (binomial), `pnbinom` (negative binomial), `ppois` (poisson), `pgeom` (geometric), `phyper` (hyper-geometric), `pt` (T distribution), pf (F distribution)
+
+
+- 10 random values from the Normal distribution with mean 10 and standard deviation 5:
+```{r}
+rnorm(10, mean=10, sd=5)
+```
+- The probability of drawing exactly 10 from this distribution:
+```{r}
+dnorm(10, mean=10, sd=5)
+```
+
+```{r}
+dnorm(100, mean=10, sd=5)
+```
+
+
+- The probability of drawing a value smaller than 10:
+```{r}
+pnorm(10, mean=10, sd=5)
+```
+- The inverse of `pnorm()`:
+```{r}
+qnorm(0.5, mean=10, sd=5)
+```
+- How many standard deviations for statistical significance?
+```{r}
+qnorm(0.95, mean=0, sd=1)
+```
+
+## Example
+
+Recall our histogram of Wind Speed from yesterday:
+
+- The data look to be roughly normally-distributed
+- An assumption we rely on for various statistical tests
+
+```{r}
+weather <- read.csv("ozone.csv")
+hist(weather$Wind, col="steelblue", xlab="Wind Speed",
+ main="Distribution of Wind Speed",
+ breaks = 20, freq=FALSE)
+```
+
+## Create a normal distribution curve
+
+- If our data are normally-distributed, we can calculate the probability of drawing particular values.
+ + e.g. a Wind Speed of 10
+
+```{r}
+windMean <- mean(weather$Wind)
+windSD <- sd(weather$Wind)
+dnorm(10, mean=windMean, sd=windSD)
+```
+
+- We can overlay this on the histogram using `points` as we just saw:
+```{r}
+hist(weather$Wind, col="steelblue", xlab="Wind Speed",
+ main="Distribution of Wind Speed",
+ breaks = 20, freq=FALSE)
+points(10, dnorm(10, mean=windMean, sd=windSD),
+ col="red", pch=16)
+```
+
+- We can repeat the calculation for a vector of values
+ + remember that functions in R are often ***vectorized***
+ + use `lines` in this case rather than `points`
+
+```{r}
+xs <- c(0,5,10,15,20)
+ys <- dnorm(xs, mean=windMean, sd=windSD)
+hist(weather$Wind, col="steelblue", xlab="Wind Speed",
+ main="Distribution of Wind Speed",
+ breaks = 20, freq=FALSE)
+lines(xs, ys, col="red")
+```
+
+
+
+
+- For a smoother curve, use a longer vector:
+ + we can generate x values using the `seq()` function
+- Inspecting the data in this manner is usually acceptable to assess normality
+ + the fit doesn't have to be exact
+ + the shapiro test is also available `?shapiro.test` (but not really recommended by statisticians)
+
+
+```{r}
+hist(weather$Wind,col="steelblue",xlab="Wind Speed",
+ main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
+xs <- seq(0,20, length.out = 10000)
+ys <- dnorm(xs, mean=windMean,sd=windSD)
+lines(xs,ys,col="red")
+```
+
+## Simple testing
+
+- If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
+
+$$t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}$$
+
+- Say a Wind Speed of 2; which from the histogram seems to be unlikely
+
+```{r}
+t <- (windMean - 2) / (windSD/sqrt(length(weather$Wind)))
+t
+```
+
+- ...or use the **`t.test()`** function to compute the statistic and corresponding p-value
+
+```{r}
+t.test(weather$Wind, mu=2)
+```
+
+- A variety of tests are supported the R authors have tried to make them as consistent as possible
+
+```{r}
+?var.test
+?t.test
+?wilcox.test
+?prop.test
+?cor.test
+?chisq.test
+?fisher.test
+```
+
+
+- Bottom-line: Pretty much any statistical test you care to name will probably be in R
+ + This is not supposed to be a statistics course (sorry!)
+ + None of them are particular harder than others to use
+ + The difficulty is deciding which test to use:
+ + whether the assumptions of the test are met, etc.
+ + Consult your local statistician if not sure
+ + An upcoming course that will help
+ + [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
+ + Some good references:
+ + [Statistical Analysis Using R (Course from the Babaraham Bioinformatics Core)](http://training.csx.cam.ac.uk/bioinformatics/event/1827771)
+ + [Quick-R guide to stats](http://www.statmethods.net/stats/index.html)
+ + [Simple R eBook](https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
+ + [R wiki](https://en.wikibooks.org/wiki/R_Programming/Descriptive_Statistics)
+ + [R tutor](http://www.r-tutor.com/elementary-statistics)
+ + [Statistical Cheatsheet](https://rawgit.com/bioinformatics-core-shared-training/intermediate-stats/master/cheatsheet.pdf)
+ - R will do whatever you ask it, it will never check the assumptions or help you to interpret the result
+
+
+
+## Example analysis
+
+- We have already seen that men in our `patients` dataset tend to be heavier than women
+- We can **test this formally** in R
+ + N.B. the weight of people in a population tends to be normally distributed, so we can probably be safe to use a parametric test
+
+
+
+
+
+
+We need to run this if we don't have the patients data in our R environment
+```{r}
+patients <- read.delim("patient-info.txt")
+```
+
+We can test if the variance of the two groups is the same
+
++ this will influence what flavour of t-test we use
+
+```{r}
+var.test(patients$Weight~patients$Sex)
+```
+
+
+```{r}
+t.test(patients$Weight~patients$Sex, var.equal=FALSE)
+```
+
+If were unwilling to make a assumption of normality, a non-parametric test could be used.
+
+```{r}
+wilcox.test(patients$Weight~patients$Sex)
+```
+
+## Linear Regression
+
+- Linear modeling is supported by the function `lm()`:
+ + `example(lm)`
+- The output assumes you know a fair bit about the subject
+- `lm` is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson's correlation coefficient
+ + This is very easy in R
+- Three steps to plotting with a best fit line:
+
+ + Plot XY scatter-plot data
+ + Fit a linear model
+ + Add bestfit line data to plot with `abline()` function
+- Let's see a toy example:-
+
+```{r}
+x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
+y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
+plot(x,y, xlim=c(0,10), ylim=c(0,10))
+```
+
+- The `~` is used to define a formula; i.e. "y is given by x"
+ + Take care about the order of `x` and `y` in the `plot` and `lm` expressions
+
+```{r}
+plot(x,y, xlim=c(0,10), ylim=c(0,10))
+myModel <- lm(y~x)
+abline(myModel, col="red")
+```
+
+The generic `summary` function give an overview of the model
+
+- particular components are accessible if we know their names
+
+```{r}
+summary(myModel)
+names(myModel) # Names of the objects within myModel
+```
+
+- alternatively, various helper functions are provided.
+
+```{r}
+coef(myModel) # Coefficients
+resid(myModel) # Residuals
+fitted(myModel) # Fitted values
+residuals(myModel) + fitted(myModel) # what values does this give?
+```
+
+You can also get some diagnostic information on the model.
+
+- Some explanation can be found [here](http://data.library.virginia.edu/diagnostic-plots/) and [here](http://strata.uga.edu/6370/rtips/regressionPlots.html)
+
+```{r}
+par(mfrow=c(2,2))
+plot(myModel)
+```
+
+- R has a very powerful formula syntax for describing statistical models
+- Suppose we had two explanatory variables `x` and `z`, and one response variable `y`
+- We can describe a relationship between, say, `y` and `x` using a tilde `~`, placing the response variable on the left of the tilde and the explanatory variables on the right:
+ + y~x
+- It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
+
+```{r}
+y~x #If x is continuous, this is linear regression
+y~x #If x is categorical, ANOVA
+y~x+z #If x and z are continuous, multiple regression
+y~x+z #If x and z are categorical, two-way ANOVA
+y~x+z+x:z # : is the symbol for the interaction term
+y~x*z # * is a shorthand for x+z+x:z
+```
+
+## Exercise: Exercise 6
+
+- There are suggestions that Ozone level could be influenced by Temperature:
+```{r}
+plot(weather$Temp, weather$Ozone,xlab="Temperature",ylab="Ozone level",pch=16)
+```
+
+- Perform a linear regression analysis to assess this:
+ + Fit the linear model and print a summary of the output
+ + Plot the two variables and overlay a best-fit line
+ + What is the equation of the best-fit line in the form
+ $$ y = ax + b$$
+- Calculate the correlation between the two variables using the function cor (?cor)
+ + can you add text to the plot to show the correlation coefficient?
+ + HINT: The `paste` can be used to join strings of text together, or variables
+
+```{r}
+paste("Hello","World")
+age <- 35
+paste("My age is", age)
+```
+
+
+```{r}
+
+## Your Answer Here ##
+
+```
+
+
+
+
+
+## More words of warning
+
+***Correlation != Causation***
+
+
+
+http://tylervigen.com/spurious-correlations
+
+
+
+
+[So if I want to win a nobel prize, I should eat even more chocolate?!?!?](http://www.businessinsider.com/chocolate-consumption-vs-nobel-prizes-2014-4?IR=T)
+
+But no-one would ever take such trends seriously....would they?
+
+
+
+
diff --git a/Session2.2-stats.nb.html b/Session2.2-stats.nb.html
new file mode 100644
index 0000000..ef7e2b5
--- /dev/null
+++ b/Session2.2-stats.nb.html
@@ -0,0 +1,969 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
+
+
\[t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}\]
+
+
Say a Wind Speed of 2; which from the histogram seems to be unlikely
+
+
+
+
+
t <- (windMean - 2) / (windSD/sqrt(length(weather$Wind)))
+t
+
+
+
[1] 27.93897
+
+
+
+
+
…or use the t.test() function to compute the statistic and corresponding p-value
+
+
+
+
+
t.test(weather$Wind, mu=2)
+
+
+
+ One Sample t-test
+
+data: weather$Wind
+t = 27.939, df = 152, p-value < 2.2e-16
+alternative hypothesis: true mean is not equal to 2
+95 percent confidence interval:
+ 9.394804 10.520229
+sample estimates:
+mean of x
+ 9.957516
+
+
+
+
+
A variety of tests are supported the R authors have tried to make them as consistent as possible
R will do whatever you ask it, it will never check the assumptions or help you to interpret the result
+
+
+
+
+
+
+
+
Example analysis
+
+
We have already seen that men in our patients dataset tend to be heavier than women
+
We can test this formally in R
+
+
N.B. the weight of people in a population tends to be normally distributed, so we can probably be safe to use a parametric test
+
+
+
+
+
+
+
We need to run this if we don’t have the patients data in our R environment
+
+
+
+
patients <- read.delim("patient-info.txt")
+
+
+
+
We can test if the variance of the two groups is the same
+
+
this will influence what flavour of t-test we use
+
+
+
+
+
var.test(patients$Weight~patients$Sex)
+
+
+
+ F test to compare two variances
+
+data: patients$Weight by patients$Sex
+F = 0.14216, num df = 49, denom df = 44, p-value = 3.59e-10
+alternative hypothesis: true ratio of variances is not equal to 1
+95 percent confidence interval:
+ 0.07900337 0.25344664
+sample estimates:
+ratio of variances
+ 0.1421572
+ Welch Two Sample t-test
+
+data: patients$Weight by patients$Sex
+t = -11.204, df = 55.168, p-value = 7.786e-16
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -15.62079 -10.88094
+sample estimates:
+mean in group Female mean in group Male
+ 68.95980 82.21067
+
+
+
+
If were unwilling to make a assumption of normality, a non-parametric test could be used.
+
+
+
+
wilcox.test(patients$Weight~patients$Sex)
+
+
+
+ Wilcoxon rank sum test with continuity correction
+
+data: patients$Weight by patients$Sex
+W = 59, p-value = 1.993e-15
+alternative hypothesis: true location shift is not equal to 0
+
+
+
+
+
+
Linear Regression
+
+
Linear modeling is supported by the function lm():
+
+
example(lm)
+
+
The output assumes you know a fair bit about the subject
+
lm is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson’s correlation coefficient
+
+
This is very easy in R
+
+
Three steps to plotting with a best fit line:
+
+
Plot XY scatter-plot data
+
Fit a linear model
+
Add bestfit line data to plot with abline() function
R has a very powerful formula syntax for describing statistical models
+
Suppose we had two explanatory variables x and z, and one response variable y
+
We can describe a relationship between, say, y and x using a tilde ~, placing the response variable on the left of the tilde and the explanatory variables on the right:
+
+
y~x
+
+
It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
+
+
+
+
+
y~x #If x is continuous, this is linear regression
+
+
+
y ~ x
+
+
+
y~x #If x is categorical, ANOVA
+
+
+
y ~ x
+
+
+
y~x+z #If x and z are continuous, multiple regression
+
+
+
y ~ x + z
+
+
+
y~x+z #If x and z are categorical, two-way ANOVA
+
+
+
y ~ x + z
+
+
+
y~x+z+x:z # : is the symbol for the interaction term
+
+
+
y ~ x + z + x:z
+
+
+
y~x*z # * is a shorthand for x+z+x:z
+
+
+
y ~ x * z
+
+
+
+
+
+
Exercise: Exercise 6
+
+
There are suggestions that Ozone level could be influenced by Temperature:
+
+
+
+
+
+
+
+
diff --git a/Session2.3-data-manipulation.Rmd b/Session2.3-data-manipulation.Rmd
new file mode 100644
index 0000000..4ae2c2f
--- /dev/null
+++ b/Session2.3-data-manipulation.Rmd
@@ -0,0 +1,400 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 2"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+
+# 3. Data Manipulation Techniques
+
+## Motivation
+
+- So far we have been lucky that all our data have been in the same file:
+ + This is not usually the case
+ + Dataset may be spread over several files
+ + This takes longer, and is harder, than many people realise
+ + We need to combine before doing an analysis
+
+
+
+## Combining data from multiple sources: Gene Clustering Example
+
+- R has powerful functions to combine heterogeneous data sources into a single data set
+- Gene clustering example data:
+ + Gene expression values in ***gene.expression.txt***
+ + Gene information in ***gene.description.txt***
+ + Patient information in ***cancer.patients.txt***
+- A breast cancer dataset with numerous patient characteristics:
+ + We will concentrate on ***ER status*** (positive / negative)
+ + What genes show a statistically-significant different change between ER groups?
+
+## Analysis goals
+
+- We will show how to lookup a particular gene in the dataset
+- Also, how to look-up genes in a given genomic region
+- Perform a "sanity-check" to see if a previously-known gene exhibits a difference in our dataset
+- How many genes on chromosome 8 are differentially-expressed?
+- Create a heatmap to cluster the samples and reveal any subgroups in the data
+ + do the subgroups agree with our prior knowledge about the samples
+
+## Peek at the data
+
+- `gene.expression.txt` is a tab-delimited file, so we can use `read.delim` to import it
+- here the `head` function is used as a convenient way to see the first six rows of the resulting data frame
+
+```{r}
+normalizedValues <- read.delim("gene.expression.txt")
+head(normalizedValues)
+```
+
+
+- `r nrow(normalizedValues)` rows and `r ncol(normalizedValues)` columns
++ One row for each gene:
+ + Rows are named according to particular technology used to make measurement
+ + The names of each row can be returned by `rownames(normalizedValues)`; giving a vector
++ One column for each patient:
+ + The names of each column can be returned by `colnames(normalizedValues)`; giving a vector
+
+
+```{r}
+geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
+head(geneAnnotation)
+```
+
+
+- `r nrow(geneAnnotation)` rows and `r ncol(geneAnnotation)` columns
+- One for each gene
+- Includes mapping between manufacturer ID and Gene name
+
+
+```{r}
+patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
+head(patientMetadata)
+```
+
+- One for each patient in the study
+- Each column is a different characteristic of that patient
+ + e.g. whether a patient is ER positive (value of 1) or negative (value of 0)
+
+```{r}
+table(patientMetadata$er)
+```
+
+
+
+## Ordering and sorting
+
+To get a feel for these data, we will look at how we can subset and order
+
+- R allows us to do the kinds of filtering, sorting and ordering operations you might be familiar with in Excel
+- For example, if we want to get information about patients that are ER negative
+ + these are indicated by an entry of ***0*** in the `er` column
+
+```{r eval=FALSE}
+patientMetadata$er == 0
+```
+
+We can do the comparison within the square brackets
+
+- Remembering to include a `,` to index the columns as well
+- Best practice to create a new variable and leave the original data frame untouched
+
+```{r}
+erNegPatients <- patientMetadata[patientMetadata$er == 0,]
+head(erNegPatients)
+```
+
+or
+
+```{r}
+View(erNegPatients)
+```
+
+Sorting is supported by the **`sort()`** function
+
+- Given a vector, it will return a sorted version of the same length
+
+```{r}
+sort(erNegPatients$grade)
+```
+
+- But this is not useful in all cases
+ + We have lost the extra information that we have about the patients
+
+- Instead, we can use **`order()`**
+- Given a vector, `order()` will give a set of numeric values which will give an ordered version of the vector
+ + default is smallest --> largest
+
+```{r}
+myvec <- c(90,100,40,30,80,50,60,20,10,70)
+myvec
+order(myvec)
+```
+
+- i.e. number in position 9 is the smallest, number in position 8 is the second smallest:
+
+```{r}
+myvec[9]
+myvec[8]
+```
+
+N.B. `order` will also work on character vectors
+
+```{r}
+firstName <- c("Adam", "Eve", "John", "Mary", "Peter", "Paul", "Joanna", "Matthew", "David", "Sally")
+order(firstName)
+```
+
+- We can use the result of `order()` to perform a subset of our original vector
+- The result is an ordered vector
+```{r}
+myvec.ord <- myvec[order(myvec)]
+myvec.ord
+```
+
+- Implication: We can use `order` on a particular column of a data frame, and use the result to sort all the rows
+
+- We might want to select the youngest ER negative patients for a follow-up study
+- Here we order the `age` column and use the result to re-order the rows in the data frame
+
+```{r}
+erNegPatientsByAge <- erNegPatients[order(erNegPatients$age),]
+head(erNegPatientsByAge)
+```
+
+
+- can change the behaviour of `order` to be Largest --> Smallest
+```{r}
+erNegPatientsByAge <- erNegPatients[order(erNegPatients$age,decreasing = TRUE),]
+head(erNegPatientsByAge)
+```
+
+- we can write the result to a file if we wish
+
+```{r eval=FALSE}
+write.table(erNegPatientsByAge, file="erNegativeSubjectsByAge.txt", sep="\t")
+```
+
+
+## Exercise: Exercise7
+
+- Imagine we want to know information about chromosome 8 genes that have been measured.
+1. Create a new data frame containing information on genes on Chromosome 8
+2. Order the rows in this data frame according to start position, and write the results to a file
+
+```{r}
+
+## Your Answer Here ###
+
+```
+
+
+### Alternative:
+
+- you might find the function `subset` a bit easier to use
+ + no messing around with square brackets
+ + no need to remember row and column indices
+ + no need for `$` operator to access columns
+- more advanced packages like dplyr use a similar approach
+ + you'll find out about this on our intermediate course
+
+```{r}
+chr8Genes <- subset(geneAnnotation, Chromosome=="chr8")
+head(chr8Genes)
+```
+
+
+## Retrieving data for a particular gene
+
+ - Gene `ESR1` is known to be hugely-different between ER positive and negative patient
+ + let's check that this is evident in our dataset
+ + if not, something has gone wrong!
+- First step is to locate this gene in our dataset
+- We can use `==` to do this, but there are some alternatives that are worth knowing about
+
+## Character matching in R
+
+- `match()` and `grep()` are often used to find particular matches
+ + CAUTION: by default, match will only return the ***first*** match!
+
+```{r}
+match("D", LETTERS)
+grep("F", rep(LETTERS,2))
+match("F", rep(LETTERS,2))
+```
+
+- `grep` can also do partial matching
+ + can also do complex matching using "regular expressions"
+
+```{r}
+month.name
+grep("ary",month.name)
+grep("ber",month.name)
+```
+
+- `%in%` will return a logical if each element is contained in a shortened list
+
+```{r}
+month.name %in% c("May", "June")
+```
+
+
+## Retrieving data for a particular gene
+
+- Find the name of the ID that corresponds to gene ***ESR1*** using `match`
+ + mapping between IDs and genes is in the ***genes*** data frame
+ + ID in first column, gene name in the second
+- Save this ID as a variable
+
+```{r}
+rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
+geneAnnotation[rowInd,]
+myProbe <- geneAnnotation$probe[rowInd]
+myProbe
+```
+
+Now, find which row in our expression matrix is indexed by this ID
+
+- recall that the rownames of the expression matrix are the probe IDs
+- save the expression values as a variable
+
+```{r}
+match(myProbe, rownames(normalizedValues))
+normalizedValues[match(myProbe, rownames(normalizedValues)), 1:10]
+myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
+class(myGeneExpression)
+```
+
+
+
+## Relating to patient characteristics
+
+We have expression values and want to visualise them against our categorical data
+
+- use a boxplot, for example
+- however, we have to first make sure our values are treat as numeric data
+- as we created the subset of a data frame, the result was also a data frame
+ + use `as.numeric` to create a vector that we can plot
+ + various `as.` functions exist to convert between various data types
+
+
+```{r}
+boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
+```
+
+
+- In this case there is a clear difference, so we probably don't even need a p-value to convince ourselves of the difference
+ + in real-life, we would probably test lots of genes and implement some kind of multiple-testing
+ + e.g. `p.adjust` (`?p.adjust`)
+
+```{r}
+t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
+
+```
+
+
+
+## Complete script
+
+```{r}
+geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
+patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
+normalizedValues <- read.delim("gene.expression.txt")
+
+rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
+myProbe <- geneAnnotation$probe[rowInd]
+myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
+
+boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
+t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
+```
+
+## Exercise: Exercise 8
+
+Repeat the same steps we performed for the gene ESR1, but for GATA3:
+
+- Try and make as few changes as possible from the ESR1 script
+- Can you see why making a markdown document is useful for analysis?
+
+
+```{r}
+
+### Your Answer Here ###
+
+```
+
+## Extra Discussion
+
+This example has been simplified by the fact that the columns in the expression matrix are in the same order as the patient metadata. This would normally be the case for data obtained in a public repository such as Gene Expression Omnibus
+
+```{r}
+colnames(normalizedValues)
+patientMetadata$samplename
+
+```
+
+There is a quick shortcut to check that these names are the same using the `all` function
+
+```{r}
+colnames(normalizedValues) == patientMetadata$samplename
+all(colnames(normalizedValues) == patientMetadata$samplename)
+```
+
+Let's say that our metadata have been re-ordered by ER status and age, and not by patient ID
+
+
+```{r}
+patientMetadata <- patientMetadata[order(patientMetadata$er,patientMetadata$age),]
+patientMetadata
+```
+
+- If we run the same code as before to produce the boxplot and perform the t-test we would get a very different result.
+- This should make use immediately suspicious, as the ESR1 gene is known to be highly differentially-expressed in the contrast we are making
+- Such sanity checks are important to check to your code
+
+```{r}
+rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
+myProbe <- geneAnnotation$probe[rowInd]
+myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
+
+boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
+t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
+```
+
+If we run the same check as before on the column names and patient IDs, we see that it fails:-
+
+```{r}
+all(colnames(normalizedValues) == patientMetadata$samplename)
+```
+
+A solution is to use `match` again. Specifically, we want to know where each column in the expression matrix can be found in the patient metadata. The result is a vector, each item of which is an index for a particular row in the patient metadata
+
+```{r}
+match(colnames(normalizedValues),patientMetadata$samplename)
+
+```
+
+The vector we have just generated can then by used to re-order the rows in the patient metadata
+
+```{r}
+patientMetadata <- patientMetadata[match(colnames(normalizedValues),patientMetadata$samplename),]
+patientMetadata
+all(colnames(normalizedValues) == patientMetadata$samplename)
+```
+
+And we can now proceed to perform the analysis and can the result we expect
+
+```{r}
+rowInd <- match("ESR1", geneAnnotation$HUGO.gene.symbol)
+myProbe <- geneAnnotation$probe[rowInd]
+myGeneExpression <- normalizedValues[match(myProbe,rownames(normalizedValues)),]
+
+boxplot(as.numeric(myGeneExpression) ~ patientMetadata$er)
+t.test(as.numeric(myGeneExpression) ~ patientMetadata$er)
+```
+
diff --git a/Session2.3-data-manipulation.nb.html b/Session2.3-data-manipulation.nb.html
new file mode 100644
index 0000000..d7486a9
--- /dev/null
+++ b/Session2.3-data-manipulation.nb.html
@@ -0,0 +1,1193 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Welch Two Sample t-test
+
+data: as.numeric(myGeneExpression) by patientMetadata$er
+t = -38.746, df = 205.88, p-value < 2.2e-16
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -1.246953 -1.126198
+sample estimates:
+mean in group 0 mean in group 1
+ -1.17388506 0.01269076
+ Welch Two Sample t-test
+
+data: as.numeric(myGeneExpression) by patientMetadata$er
+t = -38.746, df = 205.88, p-value < 2.2e-16
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -1.246953 -1.126198
+sample estimates:
+mean in group 0 mean in group 1
+ -1.17388506 0.01269076
+
+
+
+
+
+
Exercise: Exercise 8
+
Repeat the same steps we performed for the gene ESR1, but for GATA3:
+
+
Try and make as few changes as possible from the ESR1 script
+
Can you see why making a markdown document is useful for analysis?
+
+
+
+
+
### Your Answer Here ###
+
+
+
+
+
+
Extra Discussion
+
This example has been simplified by the fact that the columns in the expression matrix are in the same order as the patient metadata. This would normally be the case for data obtained in a public repository such as Gene Expression Omnibus
+ Welch Two Sample t-test
+
+data: as.numeric(myGeneExpression) by patientMetadata$er
+t = -1.7848, df = 133.53, p-value = 0.07656
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -0.29727383 0.01525417
+sample estimates:
+mean in group 0 mean in group 1
+ -0.3990460 -0.2580361
+
+
+
+
If we run the same check as before on the column names and patient IDs, we see that it fails:-
A solution is to use match again. Specifically, we want to know where each column in the expression matrix can be found in the patient metadata. The result is a vector, each item of which is an index for a particular row in the patient metadata
+ Welch Two Sample t-test
+
+data: as.numeric(myGeneExpression) by patientMetadata$er
+t = -38.746, df = 205.88, p-value < 2.2e-16
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -1.246953 -1.126198
+sample estimates:
+mean in group 0 mean in group 1
+ -1.17388506 0.01269076
+
+
+
+
+
+
+
+
diff --git a/Session2.4-programming.Rmd b/Session2.4-programming.Rmd
new file mode 100644
index 0000000..5b61732
--- /dev/null
+++ b/Session2.4-programming.Rmd
@@ -0,0 +1,341 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 2"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+ html_notebook:
+ toc: yes
+ toc_float: yes
+---
+
+#4. Programming in R
+
+## Motivation
+
+From the previous exercise, you should see how we can easily adapt our markdown scripts:
+
+- e.g. ESR1 versus GATA3
+- But what if we want to analyse many genes?
+- It would be tedious to create a new markdown document for every gene
+- ...and prone to error too
+
+##Introducing loops
+
+- Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping**
+- Loops are not used in R so often, because we can usually achieve the same thing using vector calculations
+- For example, to add two vectors together, we do not need to add each pair of elements one by one, we can just add the vectors
+
+```{r}
+x <- 1:10
+y <- 11:20
+x+y
+```
+
+- But there are some situations where R functions can not take vectors as input. For example, `t.test()` will only test one gene at a time
+- What if we wanted to test multiple genes?
+
+For completeness, we can re-run the R code to import the data
+```{r}
+geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
+patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
+normalizedValues <- read.delim("gene.expression.txt")
+```
+
+
+- We could run the following code to perform t-tests on the first two genes
+
+```{r eval=FALSE}
+t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
+t.test(as.numeric(normalizedValues[2,]) ~ factor(patientMetadata$er))
+
+```
+
+- But for many genes this will be boring to type, difficult to change, and prone to error
+- As we are doing the same thing multiple times, but with a different index each time, we can use a **loop** instead
+
+##Loops: Commands and flow control
+- R has two basic types of loop
+ + a **`for`** loop: run some code on every value in a vector
+ + a **`while`** loop: run some code while some condition is true (*hardly ever used!*)
+
+`for`
+```{r eval=FALSE}
+for(i in 1:10) {
+ print(i)
+ }
+
+```
+
+`while`
+
+```{r eval=FALSE}
+i <- 1
+while(i <= 10 ) {
+ print(i)
+ i <- i + 1
+ }
+```
+
+
+
+- Here's how we might use a `for` loop to test the first 10 genes
+
+
+```{r}
+for(i in 1:10) {
+
+ t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
+
+ }
+```
+
+- This is *exactly* the same as:
+
+```{r eval=FALSE}
+i <- 1
+t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
+i <- 2
+t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
+i <- 3
+###....etc....####
+```
+
+
+
+## Storing results
+
+However, this for loop is doing the calculations but not storing the results
+
+- The output of `t.test()` is an object with data placed in different slots
+ + the `names()` of the object tells us what data we can retrieve, and what variable name to use
+
+
+```{r}
+testResult <- t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
+names(testResult)
+testResult$statistic
+```
+
+
+- When using a loop, we often create an empty "dummy" variable
+- This is used store the results at each stage of the loop
+
+```{r}
+stats <- NULL
+for(i in 1:10) {
+ testResult <- t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
+ stats[i] <- testResult$statistic
+ }
+stats
+```
+
+## Practical application
+
+Previously we have identified probes on chromosome 8
+
+- Lets say that we want to do a t-test for each gene on chromosome 8
+```{r}
+chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
+head(chr8Genes)
+chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
+head(chr8GenesOrd)
+```
+
+- The first step is to extract the expression values for chromosome 8 genes from our expression matrix, which has expression values for all genes
+- We can use the `match` function to tell us which rows in the matrix correspond to chromosome 8 genes
+
+```{r}
+match(chr8GenesOrd$probe, rownames(normalizedValues))
+chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
+dim(chr8Expression)
+```
+
+We are now ready to write the for loop
+
+## Exercise:
+
+- Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
+- Store the ***p-value*** from each individual test
+- How many genes have a p-value < 0.05?
+- N.B. Our code will be more robust if we store the number of chromosome 8 genes as a variable
+ + if the data change, the code should still run
+
+```{r}
+chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
+chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
+chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
+### Your Answer Here ###
+
+```
+
+
+
+
+##Conditional branching: Commands and flow control
+
+- Use an `if` statement for any kind of condition testing
+- Different outcomes can be selected based on a condition within brackets
+
+```
+if (condition) {
+ ... do this ...
+ } else {
+ ... do something else ...
+ }
+```
+
+- `condition` is any logical value, and can contain multiple conditions.
+ + e.g. `(a == 2 & b < 5)`, this is a compound conditional argument
+- The condition should return a *single* value of `TRUE` or `FALSE`
+
+
+
+## Other conditional tests
+
+- There are various tests that can check the type of data stored in a variable
+ + these tend to be called **`is...()`**.
+ + try *tab-complete* on `is.`
+
+```{r}
+is.numeric(10)
+is.numeric("TEN")
+is.character(10)
+```
+
+- `is.na()` is useful for seeing if an `NA` value is found
+ + cannot use `== NA`!
+- could be used to check if a gene symbol is found in the data before proceeding with statistical test
+
+```{r}
+match("foo", geneAnnotation$HUGO.gene.symbol)
+is.na(match("foo", geneAnnotation$HUGO.gene.symbol))
+```
+
+
+- Using the **`for`** loop we wrote before, we could add some code to plot the expression of each gene
+ + a boxplot would be ideal
+- However, we might only want plots for genes with a "significant" pvalue
+- Here's how we can use an `if` statement to test for this
+ + for each iteration of the the loop:
+ 1. test if the p-value from the test is below 0.05 or not
+ 2. if the p-value is less than 0.05 make a boxplot
+ 3. if not, do nothing
+
+```{r}
+pdf("Chromosome8Genes.pdf")
+pvals <- NULL
+for (i in 1:18) {
+ testResult <- t.test(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er))
+ pvals[i] <- testResult$p.value
+ if(testResult$p.value < 0.05){
+ boxplot(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er),
+ main=chr8Genes$HUGO.gene.symbol[i])
+ }
+}
+pvals
+dev.off()
+
+```
+
+
+##Code formatting avoids bugs!
+Compare:
+```{r eval=FALSE}
+f<-26
+while(f!=0){
+print(letters[f])
+f<-f-1}
+```
+to:
+```{r eval=FALSE}
+f <- 26
+while(f != 0 ){
+ print(letters[f])
+ f <- f-1
+ }
+```
+- The code between brackets `{}` *always* is *indented*, this clearly separates what is executed once, and what is run multiple times
+- Trailing bracket `}` always alone on the line at the same indentation level as the initial bracket `{`
+- Use white spaces to divide the horizontal space between units of your code, e.g. around assignments, comparisons
+
+
+# Making a heatmap
+
+- A heatmap is often used to visualise how the expression level of a set of genes vary between conditions
+- Making the plot is actually quite straightforward
+ + providing you have processed the data appropriately!
+- Let's take a list of "most-variable genes"
+ + see below for how we identified such genes
+- `heatmap` requires a matrix object rather than a data frame
+
+```{r}
+genelist <- c("CLIC6","TFF3","PDZK1","SCUBE2","CYP2B6","HOXB13","NAT1","LY6D","SLC7A2")
+probes <- geneAnnotation$probe[match(genelist, geneAnnotation$HUGO.gene.symbol)]
+probes
+exprows <- match(probes, rownames(normalizedValues))
+
+heatmap(as.matrix(normalizedValues[exprows,]))
+
+```
+
+
+## Heatmap adjustments
+
+- We can provide a colour legend for the samples
+- Adjust colour of cells
+- Label the rows according to gene name
+- Don't print the sample names as they are too cluttered
+
+```{r}
+library(RColorBrewer)
+
+sampcol <- rep("blue", ncol(normalizedValues))
+
+sampcol[patientMetadata$er == 1 ] <- "yellow"
+
+rbPal <- brewer.pal(10, "RdBu")
+
+heatmap(as.matrix(normalizedValues[exprows,]),
+ ColSideColors = sampcol,
+ col=rbPal,
+ labRow = genelist,labCol="")
+```
+
+- see also
+ + `heatmap.2` from `library(gplots)`; `example(heatmap.2)`
+ + `heatmap.plus` from `library(heatmap.plus)`; `example(heatmap.plus)`
+
+## (Supplementary) Choosing the genes for the heatmap
+
+Often when using R you will come across a convenient shortcut function that can save you many hours of coding and frustration.
+
+- the `genefilter` package in Bioconductor contains many useful methods for filtering genomic datasets
+- you can install this package with the following commands
+
+```{r eval=FALSE}
+source("http://www.bioconductor.org/biocLite.R")
+biocLite("genefilter")
+```
+
+- the `rowSds` function will calculate the standard deviation for each row in a numeric matrix
+- the output will be vector, with each element being the standard deviation for a corresponding gene
+
+```{r}
+library(genefilter)
+geneVar <- rowSds(normalizedValues)
+geneVar[1]
+sd(normalizedValues[1,])
+```
+
+- we can now `order` this matrix to get the subset with `[]` to get the indices of the most-variable genes (10 in this case).
+- the same indices can be used to subset the gene annotation data frame
+ + we can do this because the annotation data frame and expression matrix are in the same order
+
+```{r}
+topVar <- order(geneVar,decreasing = TRUE)[1:10]
+topVar
+geneAnnotation[topVar,]
+
+```
+
diff --git a/Session2.4-programming.nb.html b/Session2.4-programming.nb.html
new file mode 100644
index 0000000..b873b37
--- /dev/null
+++ b/Session2.4-programming.nb.html
@@ -0,0 +1,766 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
+
Store the p-value from each individual test
+
How many genes have a p-value < 0.05?
+
N.B. Our code will be more robust if we store the number of chromosome 8 genes as a variable
+
+
if the data change, the code should still run
+
+
+
+
+
+
chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
+chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
+chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
+### Your Answer Here ###
+
+
+
+
+
+
+
Conditional branching: Commands and flow control
+
+
Use an if statement for any kind of condition testing
+
Different outcomes can be selected based on a condition within brackets
+
+
if (condition) {
+ ... do this ...
+ } else {
+ ... do something else ...
+ }
+
+
condition is any logical value, and can contain multiple conditions.
+
+
e.g. (a == 2 & b < 5), this is a compound conditional argument
+
+
The condition should return a single value of TRUE or FALSE
+
+
+
+
Other conditional tests
+
+
There are various tests that can check the type of data stored in a variable
+
+
+
+
+
+
+
+
+
diff --git a/Session2.5-reports-and-wrap-up.Rmd b/Session2.5-reports-and-wrap-up.Rmd
new file mode 100644
index 0000000..ddf35d6
--- /dev/null
+++ b/Session2.5-reports-and-wrap-up.Rmd
@@ -0,0 +1,160 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 2"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+ Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output: html_notebook
+---
+
+# 5. Report Writing and Wrap-Up
+
+
+## Creating a markdown file from scratch
+
+So far you have been working with the R notebooks supplied with the course, but for your own analyses you'll want to start from scratch.
+
+***File → New File → R Markdown***
+
+- Choose 'Document' and the default output type (HTML)
+- A new tab is created in RStudio
+- The header allows you to specify a Page title, author and output type
+```
+---
+title: "Untitled"
+author: "Mark Dunning"
+date: "22/12/2016"
+output: html_document
+---
+```
+##Text formatting
+See ***Help*** → ***Markdown Quick Reference*** in RStudio:
+
+- Enclose text in \* to format in *italics*
+- Enclose text in \*\* to format in **bold**
+- \*\*\* for ***bold italics***
+- \` to format like `code`
+- \$ to include equations: $e =mc^2$
+- \> quoted text:
+
+>To be or not to be
+
+- See **Help -> Markdown Quick Reference** for more:
+ + Adding images
+ + Adding web links
+ + Tables
+
+
+## Not quite enough for a reproducible document
+
+- Minimally, you should record what version of R, and the packages you used.
+- Use the `sessionInfo()` function
+ + e.g. for the version of R I used to make the slides
+
+```{r}
+sessionInfo()
+```
+
+
+## Defining chunks
+
+- It is not great practice to have one long, continuous R script
+- Better to break-up into smaller pieces; '*chunks*'
+- You can document each chunk separately
+- Easier to catch errors
+- The characteristics of each chunk can be modified:
+ + You might not want to print the R code for each chunk
+ + ...or the output
+
+
+
+## Chunk options
+
+
+- It's a good idea to name each chunk
+ + Easier to track-down errors
+- We can display R code, but not run it
+ + `eval=FALSE`
+- We can run R code, but not display it
+ + `echo=FALSE`
+ + e.g. setting display options
+- Suppress warning messages
+ + `warning=FALSE`
+
+
+## Chunk options: eval
+
+- Sometimes we want to format code for display, but not execute; we want to show the code for how we read our data, but want our report to compile quickly
+
+e.g.
+
+```{r, eval=FALSE}
+data <- read.delim("path.to.my.file")
+```
+
+## Chunk options: echo
+
+- Might want to load some data, silently, from disk
+ + e.g. the R object from reading the data in the previous slide
+ + or your P.I. wants to see your results, but doesn't really want to know about the R code that you used
+
+```{r echo=FALSE}
+plot(rnorm(100))
+```
+
+
+## Chunk options: results
+
+- Some code or functions might produce lots of output to the screen that we don't need
+ + `results='hide'`
+```{r results='hide'}
+for(i in 1:100) {
+ print(i)
+ }
+```
+
+
+##Chunk options: message and warning
+
+- Loading an R package will sometimes print messages and / or warnings to the screen
+- This is not always helpful in a report:
+- Using `message=FALSE` and `warning=FALSE`
+
+
+```{r message=FALSE, warning=FALSE}
+library(DESeq)
+
+```
+- Could also need `suppressPackageStartupMessages`
+
+##Chunk options: cache
+
+- The argument `cache=TRUE` will stop certain chunks from being evaluate if their code does not change
+- It speeds-up the compilation of the document
+ + we don't want to reload our dataset if we've only made a tiny change downstream
+
+```{r echo=FALSE, cache=TRUE}
+evals <- read.delim("gene.expression.txt")
+```
+
+## Running R code from the main text
+
+- We can add R code to our main text, which gets evaluated
+ + make sure we always have the latest figures, p-values etc
+- See example `gene-expression-analysis.Rmd`
+
+# End of Course
+
+## Wrap-up
+
+- **Thanks for your attention**
+- Practice, practice, practice
+ + ... & persevere
+- Need inspiration? R code is freely-available, so read other people's code!
+ + Read [blogs](http://www.r-bloggers.com/)
+ + Follow the [forums](http://stackoverflow.com/questions/tagged/r)
+ + Download [datasets](http://vincentarelbundock.github.io/Rdatasets/datasets.html) to practice with
+ + Bookmark some [reference](https://en.wikibooks.org/wiki/R_Programming) guides
+ + on twitter @rstudio, @Rbloggers, @RLangTip
+ + Attend the [follow-on course](http://training.csx.cam.ac.uk/bioinformatics/event/1800066) on data manipulation and graphics
+- Please fill in the feedback form for us to improve the course
+
diff --git a/Session2.5-reports-and-wrap-up.nb.html b/Session2.5-reports-and-wrap-up.nb.html
new file mode 100644
index 0000000..3e225e2
--- /dev/null
+++ b/Session2.5-reports-and-wrap-up.nb.html
@@ -0,0 +1,417 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Introduction to Solving Biological Problems Using R - Day 2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Minimally, you should record what version of R, and the packages you used.
+
Use the sessionInfo() function
+
+
e.g. for the version of R I used to make the slides
+
+
+
+
+
+
sessionInfo()
+
+
+
+
+
+
Defining chunks
+
+
It is not great practice to have one long, continuous R script
+
Better to break-up into smaller pieces; ‘chunks’
+
You can document each chunk separately
+
Easier to catch errors
+
The characteristics of each chunk can be modified:
+
+
You might not want to print the R code for each chunk
+
…or the output
+
+
+
+
+
Chunk options
+
+
It’s a good idea to name each chunk
+
+
Easier to track-down errors
+
+
We can display R code, but not run it
+
+
eval=FALSE
+
+
We can run R code, but not display it
+
+
echo=FALSE
+
e.g. setting display options
+
+
Suppress warning messages
+
+
warning=FALSE
+
+
+
+
+
Chunk options: eval
+
+
Sometimes we want to format code for display, but not execute; we want to show the code for how we read our data, but want our report to compile quickly
+
+
e.g.
+
+
+
+
data <- read.delim("path.to.my.file")
+
+
+
+
+
+
Chunk options: echo
+
+
Might want to load some data, silently, from disk
+
+
e.g. the R object from reading the data in the previous slide
+
or your P.I. wants to see your results, but doesn’t really want to know about the R code that you used
+
+
+
+
+
+
+
+
+
Chunk options: results
+
+
Some code or functions might produce lots of output to the screen that we don’t need
+
+
results='hide'
+
+
+
+
+
+
+
+
+
Chunk options: message and warning
+
+
Loading an R package will sometimes print messages and / or warnings to the screen
+
This is not always helpful in a report:
+
Using message=FALSE and warning=FALSE
+
+
+
+
+
library(DESeq)
+
+
+
+
+
+
Could also need suppressPackageStartupMessages
+
+
+
+
Chunk options: cache
+
+
The argument cache=TRUE will stop certain chunks from being evaluate if their code does not change
+
It speeds-up the compilation of the document
+
+
we don’t want to reload our dataset if we’ve only made a tiny change downstream
+
+
+
+
+
+
+
+
+
Running R code from the main text
+
+
We can add R code to our main text, which gets evaluated
+
+
make sure we always have the latest figures, p-values etc
+
+
See example gene-expression-analysis.Rmd
+
+
+
+
+
End of Course
+
+
Wrap-up
+
+
Thanks for your attention
+
Practice, practice, practice
+
+
… & persevere
+
+
Need inspiration? R code is freely-available, so read other people’s code!
+
+
+
+
+
+
+
+
+
diff --git a/Slides-day1.Rmd b/Slides-day1.Rmd
deleted file mode 100644
index bb2315b..0000000
--- a/Slides-day1.Rmd
+++ /dev/null
@@ -1,1621 +0,0 @@
----
-title: "Introduction to Solving Biological Problems Using R - Day 1"
-author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
- Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
-date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
-output:
- slidy_presentation:
- footer: 'This work is licensed under the Creative Commons Attribution-ShareAlike
- 2.0. COURSE HOME: cambiotraining.github.io/r-intro/'
- ioslides_presentation: default
- beamer_presentation: default
-css: mystyle.css
-toc: yes
----
-```{r include = FALSE}
-library(knitr)
-opts_chunk$set(comment = NA) # eliminates hashtag from R outputs
-```
-
-## Course Aims
-
-- To introduce you to the basics of R
- + Reading data
- + Perform simple analyses
- + Producing graphs
- + ***How to get help!***
-- Give you all the background you need to ***practice*** by yourselves
-- Introduce tools that will help you to work in a ***reproducible*** manner
-
-## Day 1 Schedule
-
-1. Introduction to R and its environment
-2. Data Structures
-3. Data Analysis Example
-4. Plotting in R
-
-#1. Introduction to R and its environment
-
-##What's R?
-
-* A statistical programming environment
- + based on 'S'
- + suited to high-level data analysis
-* Open source and cross platform
-* Extensive graphics capabilities
-* Diverse range of add-on packages
-* Active community of developers
-* Thorough documentation
-
-
-## The R-project page
-
-http://www.r-project.org/
-
-
-
-##R in the New York Times
-http://goo.gl/pww4ZO
-
-
-
-## R - "the ultimate virus"
-
-
-
-http://www.ingenio-magazine.com/r-the-ultimate-virus/
-
-##R plotting capabilities
-http://spatial.ly/2012/02/great-maps-ggplot2/
-
-
-
-##R plotting capabilities
-https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
-
-
-##Who uses R? Not just academics!
-http://www.revolutionanalytics.com/companies-using-r
-
-- Facebook
- + http://blog.revolutionanalytics.com/2010/12/analysis-of-facebook-status-updates.html
-- Google
- + http://blog.revolutionanalytics.com/2009/05/google-using-r-to-analyze-effectiveness-of-tv-ads.html
-- Microsoft
- + http://blog.revolutionanalytics.com/2014/05/microsoft-uses-r-for-xbox-matchmaking.html
-- New York Times
- + http://blog.revolutionanalytics.com/2011/03/how-the-new-york-times-uses-r-for-data-visualization.html
-- Buzzfeed
- + http://blog.revolutionanalytics.com/2015/12/buzzfeed-uses-r-for-data-journalism.html
-- New Zealand Tourist Board
- + https://mbienz.shinyapps.io/tourism_dashboard_prod/
-
-
-## R can facilitate Reproducible Research
-
-
-
-
-
-##It is a hot topic at the moment
-
-- Statisticians at MD Anderson tried to reproduce results from a Duke paper and unintentionally unravelled a web of incompetence and skullduggery
- + as reported in the ***New York Times***
-
-
-
-##Hear the full account
-
-- Very entertaining talk from Keith Baggerly in Cambridge, December 2010
-
-
-
-## Nature editorial - May 2016
-
-
-
-
-[Reality check on reproducibility](http://www.nature.com/news/reality-check-on-reproducibility-1.19961)
-
-[1,500 scientists lift the lid on reproducibility](http://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970)
-
-
-
-##Various platforms supported
-
-- Release 3.3.0 (May 2016)
- + Base package and Contributed packages (general purpose extras)
- + `r length(XML:::readHTMLTable("http://cran.r-project.org/web/packages/available_packages_by_date.html")[[1]][[2]])` available packages as of `r date()`
-- Download from http://mirrors.ebi.ac.uk/CRAN/
-- Windows, Mac and Linux versions available
-- Executed using command line, or a graphical user interface (GUI)
-- On this course, we use the RStudio GUI (www.rstudio.com)
-- Everything you need is installed on the training machines
-- If you are using your own machine, download both R and RStudio
-
-
-
-
-##Getting started
-
-- R is a program which, once installed on your system, can be
-launched and is immediately ready to take input directly from the
-user
-- There are two ways to launch R:
- + From the command line (particularly useful if you're quite
-familiar with Linux; in the console at the prompt simply type `R`)
- + As an application called  (very good for beginners)
-
-##Launching R Using RStudio
-
-To launch RStudio, find the RStudio icon in the menu bar on the left
-of the screen and click
-
-
-
-##Basic concepts in R - command line calculation
-
-- The command line can be used as a calculator. Type:
-
-```{r basic-calc1, eval=FALSE}
-2 + 2
-20/5 - sqrt(25) + 3^2
-sin(pi/2)
-```
-
-Note: The number in the square brackets is an indicator of the
-position in the output. In this case the output is a 'vector' of length 1
-(i.e. a single number). More on vectors coming up...
-
-##Basic concepts in R - variables
-
-- A variable is a letter or word which takes (or contains) a value. We use the **assignment operator: `<-`**
-```{r variables1, eval=FALSE}
-
-x <- 10
-x
-
-myNumber <- 25
-myNumber
-```
-
-- We can perform arithmetic on variables:
-```{r variables2, eval=FALSE}
-sqrt(myNumber)
-```
-
-##Basic concepts in R - variables
-
-- We can add variables together:
-```{r variables3, eval=FALSE}
-x + myNumber
-```
-
-- We can change the value of an existing variable:
-
-```{r variables4, eval=FALSE}
-x <- 21
-x
-```
-
-##Basic concepts in R - variables
-
-- We can set one variable to equal the value of another variable:
-```{r variables5, eval=FALSE}
-x <- myNumber
-x
-```
-
-- We can modify the contents of a variable:
-
-```{r variables6, eval=FALSE}
-myNumber <- myNumber + sqrt(16)
-myNumber
-```
-
-##Basic concepts in R - functions
-
-- **Functions** in R perform operations on **arguments** (the inputs(s) to the function). We have already used:
-```{r eval=FALSE}
-sin(x)
-```
-- This returns the sine of x
- + In this case the function has one argument: **x**. Arguments are always contained in parentheses -- curved brackets, **()** -- separated by commas.
-
-- Try these:
-
-```{r functions1, eval=FALSE}
-sum(3,4,5,6)
-max(3,4,5,6)
-min(3,4,5,6)
-```
-
-##Basic concepts in R - functions
-
-- Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order)
-
-```{r functions2, eval=FALSE}
-seq(from = 2, to = 20, by = 4)
-seq(2, 20, 4)
-```
-- When testing code, it is easier and safer to name the arguments
-
-##Basic concepts in R - vectors
-
-- The basic data structure in R is a **vector** -- an ordered collection of values.
-- R treats even single values as 1-element vectors.
-- The function **`c`** *combines* its arguments into a vector:
-
-```{r vectors1, eval=FALSE}
-x <- c(3,4,5,6)
-x
-```
-- The square brackets `[]` indicate the position within the vector (the ***index***).
-- We can extract individual elements by using the `[]` notation:
-```{r vectors2, eval=FALSE}
-x[1]
-x[4]
-```
-
-##Basic concepts in R - vectors
-
-- We can even put a vector inside the square brackets (*vector indexing*):
-
-```{r vectors3, eval=FALSE}
-y <- c(2,3)
-x[y]
-```
-
-##Basic concepts in R - vectors
-
-- There are a number of shortcuts to create a vector.
-- Instead of:
-```{r vectors4}
-x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-```
-- we can write:
-```{r vectors5, eval=FALSE}
-x <- 3:12
-x
-```
-
-##Basic concepts in R - vectors
-
-- or we can use the **`seq()`** function, which returns a vector:
-
-```{r vectors6, eval=FALSE}
-x <- seq(2, 20, 4)
-x
-```
-
-```{r vectors7, eval=FALSE}
-x <- seq(2, 20, length.out=5)
-x
-```
-
-- or we can use the **`rep()`** function:
-
-
-```{r vectors8, eval=FALSE}
-y <- rep(3, 5)
-y
-```
-
-```{r vectors9, eval=FALSE}
-y <- rep(1:3, 5)
-y
-```
-
-##Basic concepts in R - vectors
-
-- We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
-
-```{r vectors10, eval=FALSE}
-x <- 3:12
-# Extract elements from x:
-
-x[3:7]
-x[seq(2, 6, 2)]
-x[rep(3, 2)]
-```
-
-##Basic concepts in R - vectors
-
-- We can add an element to a vector:
-```{r vectors11, eval=FALSE}
-y <- c(x, 1)
-y
-```
-- We can glue vectors together:
-```{r vectors12, eval=FALSE}
-z <- c(x, y)
-z
-```
-
-##Basic concepts in R - vectors
-
-- We can "remove" element(s) from a vector:
- + NOTE: the vector x doesn't get modified
- + we're just displaying what the vector looks like without particular elements
-
-```{r vectors13, eval=FALSE}
-x <- 3:12
-
-x[-3]
-x[-(5:7)]
-x[-seq(2, 6, 2)]
-x
-```
-
-##Basic concepts in R - vectors
-
-- Finally, we can modify the contents of a vector:
-```{r vectors14, eval=FALSE}
-x[6] <- 4
-x
-
-x[3:5] <- 1
-x
-```
-
-**Remember!**
-
- - **Square** brackets [ ] for ***indexing***
- - **Parentheses** () for function ***arguments***
-
-##Basic concepts in R - vector arithmetic
-
-- When applying all standard arithmetic operations to vectors,
-application is element-wise
-
-```{r vector-arithmetic1}
-x <- 1:10
-y <- x*2
-```
-```{r eval=FALSE}
-y
-```
-```{r vector-arithmetic1b}
-z <- x^2
-```
-```{r eval=FALSE}
-z
-```
-
-##Basic concepts in R - vector arithmetic
-
-- Adding two vectors:
-```{r vector-arithmetic2, eval=FALSE}
-y + z
-```
-- If vectors are not the same length, the shorter one will be recycled:
-```{r vector-arithmetic3, eval=FALSE}
-x + 1:2
-```
-- But be careful if the vector lengths aren't factors of each other:
-
-```{r eval=FALSE}
-x + 1:3
-```
-
-```{r vector-arithmetic4,echo=FALSE}
-options(width=50)
-x + 1:3
-```
-
-
-
-##Basic concepts in R - Character vectors and naming
-
-- All the vectors we have seen so far have contained numbers, but we can also store text (/"strings") in vectors -- this is called a **character** vector.
-
-```{r vector-naming1}
-gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
-gene.names
-```
-
-##Basic concepts in R - Character vectors and naming
-
-- We can name elements of vectors using the **`names()`** function, which can be useful to keep track of the meaning of our data:
-```{r vector-naming2}
-gene.expression <- c(0, 3.2, 1.2, -2)
-names(gene.expression) <- gene.names
-gene.expression
-
-```
-
-- We can also use the `names()` function to get a vector of the names of an object:
-```{r vector-naming4, eval=FALSE}
-
-names(gene.expression)
-```
-
-
-##Exercise: Body-Mass Index
-- Let's try some vector arithmetic. Here are the weights and heights of five individuals
-
-|Person | Weight (kg) | Height (cm)|
-|-------|------------------:|-------------------:|
-|*Jo*|65.8|192|
-|*Sam*|67.9|179|
-|*Charlie*|75.3|169|
-|*Frankie*|61.9|175|
-|*Alex*|92.4|171|
-
-
-- Create *weight* and *height* vectors to hold the data in each column using the `c` function. Create a *person* vector and use this vector to name the values in the other two vectors.
-
-##Exercise: Body-Mass Index
-
-1. The body-mass index is given by the formula:- $BMI = (Weight)/(Height^2)$; where Height is given in ***metres***
- + Create a new vector to record this, called `bmi`.
-2. Create a new vector `bmi.sorted` where the bmi values are put in increasing numeric order (HINT: look up the help on the `sort` function)
-3. The interquartile range (IQR) of a vector is defined as the 75% percentile of the data minus the 25% percentile. Calculate the IQR for our bmi values
- + check your answer using the `IQR` function
-
-##Answers to exercise
-
-```{r ex1answers1}
-weight <- c(65.8,67.9,75.3,61.9,92.4)
-height <- c(192,179,169,175,171)
-person <- c("Jo","Sam","Charlie","Frankie","Alex")
-names(weight) <- names(height) <- person
-```
-
-##Answers to exercise
-
-1. The body-mass index is given by the formula:- $BMI = (Weight)/(Height^2)$; where Height is given in ***metres***
- + Create a new vector to record this, called `bmi`.
-```{r ex1answers2}
-bmi <- (weight)/((height/100)^2)
-bmi
-```
-
-
-##Answers to exercise
-
-2. Create a new vector `bmi.sorted` where the bmi values are put in increasing numeric order
-3. Print the second, third and fourth items in the vector `bmi.sorted`
- + check your answer using the `IQR` functio
-```{r}
-bmi.sorted <- sort(bmi)
-bmi.sorted[4] - bmi.sorted[2]
-IQR(bmi)
-```
-
-## Answers to exercise
-
-Names are usually carried across to the new vector. Sometimes this is what we want (as for `bmi`) but sometimes it is not (when we are comparing values in `bmi.sorted`). We can remove names by setting them to the special NULL value:
-
-```{r}
-names(bmi.sorted) <- NULL
-bmi.sorted[4] - bmi.sorted[2]
-```
-
-
-## Documenting your analysis with RStudio
-
-Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
-
-
-- To create an R markdown file, Click on ***File → New File → R Markdown*** in Rstudio
- + markdown is a easy-to-read, easy-to-write text format often used to write HTML, readme files, etc.
- + a simpler (but not so informative) alternative is to use a script
-
-
-- This will make our analyses ***reproducible***
-
-## Format of an R markdown file
-
-- **Lines 8 - 10**: plain text description
-- **Lines 12 - 14**: an R code 'chunk'
-- **Lines 18 to 20**: another code chunk, this time producing a plot
-
-
-
-- Pressing the ***Knit HTML*** (/***Knit PDF***) button will create a report
-- See `solution-exercise1.Rmd ` for solution to Exercise 1
-- All exercises have a markdown template that you can edit
-
-##Getting help
-
-- **This is possibly the most important slide in the whole course!?!**
-- To get help on any R function, type **`?`** followed by the function name. For example:
-```{r eval=FALSE}
-?seq
-```
-- This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which *package* it belongs to.
-- There is typically a usage example, which you can test using the
-`example` function:
-
-```{r eval=FALSE}
-example(seq)
-```
-
-##Getting help
-
-- If you can't remember the exact name, type **`??`** followed by your guess.
-R will return a list of possibilities:
-
-```{r eval=FALSE}
-??mean
-```
-- The **Packages** tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
- + Often there will be a user-guide or '*vignette*' too
-
-##Interacting with the R console
-
-- **Important** -- R console symbols:
- + **`;`** end of line
- (Enables multiple commands to be placed on one line of text)
- + **`#`** comment
- (indicates text is a comment and not executed)
- + **`+`** command line wrap
- (R is waiting for you to complete an expression)
-
-- *Ctrl-c* or *escape* to clear input line and try again
-- *Ctrl-l* to clear window
-- Use the *TAB* key for command auto completion
-- Use up and down arrows to scroll through the command history
-
-##R packages
-
-- R comes ready loaded with various libraries of functions called
-**packages**. For example: the function **`sum()`** is in the **base** package and
-**`sd()`**, which calculates the standard deviation of a vector, is in the
-**`stats`** package
-- There are 1000s of additional packages provided by third parties,
-and the packages can be found in numerous server locations on the
-web called **repositories**
-- The two repositories you will come across the most are:
- + **The Comprehensive R Archive Network (CRAN)**
- + Use metacran search to find functionality you need: http://www.r-pkg.org/
- + Or look for packages by theme: http://cran.r-project.org/web/views/
- + **Bioconductor** specialised in genomics: http://www.bioconductor.org/packages/release/bioc/
- + **https//github.com** can also host R packages, and hosts the development version of many packages
-- Bottomline: ***always*** first look if there is already an R package that does what you want before trying to implement it yourself
-
-## Installing packages
-
-- CRAN packages can be installed using **`install.packages()`**
-
- + or clicking on the *Packages* tab in RStudio
-
-```{r eval=FALSE}
-install.packages(name.of.my.package)
-```
-
-
-- Set the *Bioconductor* package download tool by typing:
-```{r eval=FALSE}
-source("http://bioconductor.org/biocLite.R")
-```
-
-- *Bioconductor* packages are then installed with the `biocLite()` function:
-```{r eval=FALSE}
-biocLite("PackageName")
-```
-
-##Example: Install packages ggplot2 and DESeq
-
-- ggplot2 is a commonly used graphics package:
- + in RStudio, go to **Tools** → **Install Packages**... and type the package name
- + or use `install.packages()` function to install it:
-
-```{r eval=FALSE}
-install.packages("ggplot2")
-```
-
-- `DESeq` is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq data:
-
-```{r eval=FALSE}
-source("http://www.bioconductor.org/biocLite.R")
-biocLite("DESeq2")
-```
-
-##Example: Load packages ggplot2 and DESeq2
-
-- R needs to be told to use the new functions from the installed packages. Use **`library(...)`** function to load the newly installed features:
-
-```{r eval=FALSE}
-
-library(ggplot2) # loads ggplot functions
-library(DESeq2) # loads DESeq functions
-library() # Lists all the packages
- # you've got installed
-```
-
-# 2. Data structures
-
-##R is designed to handle experimental data
-
-- Although the basic unit of R is a vector, we usually handle data in **data frames**.
-- A data frame is a set of observations of a set of variables -- in other words, the outcome of an experiment.
-- For example, we might want to analyse information about a set of patients.
-- To start with, let's say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
-
-
-##The patients data frame
-- We are going to create a data frame called 'patients', which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
-- We will explore how to construct these data from scratch.
- + (in practice, we would usually import such data from a file)
-
-
-```{r echo=FALSE, tidy=T}
-age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
-weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5,
- 71.5, 73.2, 64.8)
-firstName <- c("Adam", "Eve", "John", "Mary", "Peter",
- "Paul", "Joanna", "Matthew", "David", "Sally")
-secondName <- c("Jones", "Parker", "Evans", "Davis",
- "Baker","Daniels", "Edwards", "Smith", "Roberts", "Wilson")
-consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
- FALSE, TRUE, FALSE, TRUE)
-sex <- c("Male", "Female", "Male", "Female", "Male", "Male",
- "Female", "Male", "Male", "Female")
-patients <- data.frame(firstName, secondName,
- paste(firstName, secondName),
- sex, age, weight, consent)
-names(patients) <- c("First_Name", "Second_Name",
- "Full_Name", "Sex", "Age", "Weight", "Consent")
-kable(patients)
-```
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-##Character, numeric and logical data types
-
-- Each column is a vector, like previous vectors we have seen, for
-example:
-
-```{r patients1}
-age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
-weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9,
- 63.5, 71.5, 73.2, 64.8)
-
-```
-- We can define the names using character vectors:
-```{r patients2}
-firstName <- c("Adam", "Eve", "John", "Mary",
- "Peter", "Paul", "Joanna", "Matthew",
- "David", "Sally")
-secondName <- c("Jones", "Parker", "Evans", "Davis",
- "Baker","Daniels", "Edwards", "Smith",
- "Roberts", "Wilson")
-```
-
-##Character, numeric and logical data types
-
-- We also have a new type of vector, the ***logical*** vector, which only
-contains the values `TRUE` and `FALSE`:
-```{r patients3}
-consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE,
- FALSE, FALSE, TRUE, FALSE, TRUE)
-```
-
-##Character, numeric and logical data types
-
-- Vectors can only contain one type of data; we cannot mix numbers, characters and logical values in the same vector.
- + If we try this, R will convert everything to characters:
-
-```{r}
-c(20, "a string", TRUE)
-
-```
-- We can see the type of a particular vector using the **`class()`** function:
-
-```{r, eval=FALSE}
- class(firstName)
- class(age)
- class(weight)
- class(consent)
-```
-
-##Factors
-
-- Character vectors are fine for some variables, like names. But sometimes we have categorical data and we want R to
-recognize this
-- A factor is R's data structure for categorical data:
-
-```{r patients4}
-sex <- c("Male", "Female", "Male", "Female", "Male",
- "Male", "Female", "Male", "Male", "Female")
-sex
-```
-
-##Factors
-
-```{r}
-factor(sex)
-```
-- R has converted the strings of the sex character vector into two **levels**, which are the categories in the data
-- Note the values of this factor are not character strings, but levels
-- We can use this factor to compare data for males and females
-
-##Creating a data frame (first attempt)
-
-- We can construct a data frame from other objects (N.B. The **`paste()`** function joins character vectors together)
-
-```{r patients5}
-patients <- data.frame(firstName, secondName,
- paste(firstName, secondName),
- sex, age, weight, consent)
-```
-```{r patients5b, eval=FALSE}
-patients
-```
-
-```{r echo=FALSE}
-options(width=60)
-patients2 <- data.frame(firstName, secondName,
- paste(firstName, secondName),
- "..."="", stringsAsFactors = F)
-patients2 <- rbind(patients2[c(1:6),], c("...", "", "", ""))
-patients2
-```
-
-
-##Naming data frame variables
-- We can access particular variables using the **'`$`'** *operator*:
-```{r patients6, eval=F}
-patients$age
-```
-
-- R has inferred the names of our data frame variables from the names of the vectors or the commands (e.g. the `paste()` command)
-- We can name the variables after we have created a data frame using the **`names()`** function, and we can use the same function to see the names:
-```{r patients7}
-names(patients) <- c("First_Name", "Second_Name",
- "Full_Name", "Sex", "Age",
- "Weight", "Consent")
-```
-
-```{r eval=FALSE}
-names(patients)
-```
-
-##Naming data frame variables
-
-- Or we can name the variables when we define the data frame
-
-```{r patients8}
-patients <- data.frame(First_Name = firstName,
- Second_Name = secondName,
- Full_Name = paste(firstName,
- secondName),
- Sex = sex,
- Age = age,
- Weight = weight,
- Consent = consent)
-
-```
-
-```{r eval=FALSE}
-names(patients)
-```
-
-##Factors in data frames
-
-- When creating a data frame, R assumes all character vectors should
-be categorical variables and converts them to factors. This is not
-always what we want:
- + e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
-```{r patients9, eval=FALSE}
-patients$First_Name
-```
-
-##Factors in data frames
-
-- We can avoid this by asking R not to treat strings as factors, and
-then explicitly stating when we want a factor by using **`factor()`**:
-
-```{r patients10}
-patients <- data.frame(First_Name = firstName,
- Second_Name = secondName,
- Full_Name = paste(firstName,
- secondName),
- Sex = factor(sex),
- Age = age,
- Weight = weight,
- Consent = consent,
- stringsAsFactors = FALSE)
-
-```
-
-```{r patients11, eval=FALSE}
-patients$Sex
-patients$First_Name
-```
-
-##Matrices
-- Data frames are R's speciality, but R also handles matrices:
- + All columns are assumed to contain the same data type, e.g. numerical
-
- + Matrices can be manipulated in the same fashion as data frame
- + We can easily convert between the two object types
-```{r matrices1}
-e <- matrix(1:10, nrow=5, ncol=2)
-e
-```
-
-- Some calculations are more efficient to do on matrices, e.g.:
-
-```{r}
-rowMeans(e)
-```
-
-
-##Indexing data frames and matrices
-
-- You can index multidimensional data structures like matrices and data
-frames using commas:
-- **`object[rows, colums]`**
-
-```{r index1, eval=FALSE}
-e[1,2]
-e[1,]
-patients[1,2]
-patients[1,]
-```
-
-- If you don't provide an index for either rows or
-columns, all of the rows or columns will be returned.
-
-##Advanced indexing
-- 'Values' in R are really vectors
-- Indices are actually vectors, and can be ***numeric*** or ***logical***:
-```{r index3}
-s <- letters[1:5]
-s
-
-# View some of the values in s:
-```
-
-```{r index3b, eval=FALSE}
-s[c(1,3)]
-s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
-```
-
-##Advanced indexing
-
-- We can do the logical test and indexing in the same line of R code
- + R will do the test first, and then use the vector of `TRUE` and `FALSE` values to subset the vector
-```{r}
-a <- 1:5
-
-# Logical tests:
-a < 3
-s[a < 3]
-```
-
-
-##Operators
-
-- Operators allow us to combine multiple logical tests
-- comparison operators
-**`<, >, <=, >=, ==, !=`**
-- logical operators
-**`!, &, |, xor`**
- + The operators for 'comparison' and 'logical' always return logical values! i.e. (`TRUE`, `FALSE`)
-
-```{r}
-s
-a
-```
-
-```{r, eval=FALSE}
-
-s[a > 1 & a <3]
-s[a == 2]
-```
-
-
-##Exercise: exercise2.Rmd
-
-- The markdown template has code to create the patients data frame from the slides
-- Make a new data frame with three extra variables: **`country`**,
-**`continent`**, and **`height`**
- + Make up the data
- + Make `country` a *character* vector but `continent` a *factor*
-- Try the **`summary()`** function on your data frame. What does it do?
-How does it treat vectors (numeric, character, logical) and factors?
-(What does it do for matrices?)
-- Use logical indexing to select the following patients from the data frame described in the slides:
- 1. Patients under 40
- 2. Patients who give consent to share their data
- 3. Men who weigh as much or more than the average European male (70.8 kg)
-
-##Logical indexing answers: solution-exercise2.pdf
-
-1. Patients under 40:
-```{r ex2ans1, eval=FALSE}
-patients[patients$Age < 40, ]
-
-```
-
-2. Patients who give consent to share their data:
-```{r ex2ans2, eval=FALSE}
-patients[patients$Consent == TRUE, ]
-
-```
-
-3. Men who weigh as much or more than the average European male (70.8 kg):
-```{r ex2ans3, eval=FALSE}
-patients[patients$Sex=="Male" & patients$Weight>=70.8, ]
-```
-
-
-# 3. R for data analysis
-
-##3 steps to Basic Data Analysis
-
-- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
-
-1. Reading in data
- + `read.table()`
- + `read.csv(), read.delim()`
-2. Analysis
- + Manipulating & reshaping the data
- + Any maths you like
- + Plotting the outcome
-3. Writing out results
- + `write.table()`
- + `write.csv()`
-
-##A simple walkthrough
-
-- 50 neuroblastoma patients were tested for NMYC gene copy number by interphase nuclei FISH:
- + Amplification of NMYC correlates with worse prognosis
- + We have count data
- + Numbers of cells per patient assayed
- + For each we have NMYC copy number relative to base ploidy
-- We need to determine which patients have amplifications
- + (i.e > 33% of cells show NMYC amplification)
-
-##The Working Directory (wd)
-
-
-- Like many programs R has a concept of a working directory (**`wd`**)
-- It is the place where R will look for files to execute and where it will
-save files, by default
-- For this course we need to set the working directory to the location
-of the course scripts
-- At the command prompt in the terminal or in RStudio console type:
-
-```{r eval=FALSE}
-setwd("/home/participant/Course_Materials")
-```
-
-
-- Alternatively in RStudio use the mouse and browse to the directory
-location
-- ***Session → Set Working Directory → Choose Directory...***
-
-## 0. Locate the data
-
-Before we even start the analysis, we need to be sure of where the data are located on our hard drive
-
-- Functions that import data need a file location as a character vector
-- The default location is the ***working directory***
-```{r eval=FALSE}
-getwd()
-```
-
-- If the file you want to read is in your working directory, you can just use the file name
-```{r eval=FALSE}
-list.files()
-```
-
-- Otherwise you need the *path* to the file
- + you can get this using **`file.choose()`**
-
-##1. Read in the data
-
-- The data is a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
-- We need to read in the results and assign it to an object (`rawdata`)
-
-```{r}
-rawData <- read.delim("countData.txt")
-```
-
-- Using `file.choose()`:
-
-```{r eval=FALSE}
-myfile <- file.choose()
-rawData <- read.delim(myfile)
-```
-
-
-- If the data is comma-separated, then use either the argument `sep=","` or the function `read.csv()`:
-```{r eval=FALSE}
-read.csv("countData.csv")
-```
-- For full list of arguments:
-```{r eval=FALSE}
-?read.table
-```
-
-##1b. Check the data
-- *Always* check the object to make sure the contents and dimensions are as you expect
-- R will sometimes create the object without error, but the contents may be un-usable for analysis
- + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
-
-```{r analysis1, eval=FALSE}
-# View the first 10 rows to ensure import is OK
-rawData[1:10,]
-```
-
-
-- or use the `View()` function to get a display of the data in RStudio:
-```{r eval=FALSE}
-View(rawData)
-```
-
-##1c. Understanding the object
-
-- Once we have read the data successfully, we can start to interact with it
-- The object we have created is a *data frame*:
-```{r}
-class(rawData)
-```
-
-- We can query the dimensions:
-
-```{r, eval=FALSE}
-ncol(rawData)
-nrow(rawData)
-dim(rawData)
-```
-
-- Or the structure of an object:
- + TIP: In RStudio, window *Environment*, click the blue arrow in the left of an object's name, in order to see the object structure
-```{r}
-str(rawData)
-```
-
-##1c. Understanding the object
-
-- The names of the columns are automatically assigned:
-
-```{r}
-colnames(rawData)
-```
-
-- We can use any of these names to access a particular column:
- + and create a vector
- + TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
-```{r, eval=FALSE}
-rawData$Nuclei
-```
-
-##Word of caution
-
-
-
-
-
-
-
-
-> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
-
-##Word of caution
-
-- You will make your life a lot easier if you keep your data ***tidy***:
- + http://vimeo.com/33727555
- + http://vita.had.co.nz/papers/tidy-data.pdf
-
-- ...and ***organised***:
- + http://kbroman.org/dataorg/
-
-##Handling missing values
-
-- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
-- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc)
-- `NA` is not the same as `NULL`:
- - `NULL` is an empty R object.
- - `NA` is one missing value within an R object (like a data frame or a vector)
-- Often R functions will handle `NA`s gracefully:
-
-```{r, eval=FALSE}
-x <- c(1, NA, 3)
-length(x)
-```
-
-##Handling missing values
-
-- However, sometimes we have to tell the functions what to do with them.
-- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
-
-
-```{r analysis2, eval=FALSE}
-mean(x, na.rm = TRUE)
-
-mean(na.omit(x))
-```
-
-##2. Analysis (reshaping data and maths)
-
-- Our analysis involves identifying patients with > 33% NB amplification
- + we can use the **`which()`** function to select indices from a logical vector that are `TRUE`
-
-```{r analysis3}
-# Create an index of results:
-prop <- rawData$NB_Amp / rawData$Nuclei
-
-```
-```{r analysis3b, eval=FALSE}
-prop > 0.33
-```
-```{r analysis3c}
-# Get sample names of amplified patients:
-amp <- which(prop > 0.33)
-```
-```{r analysis3d, eval=FALSE}
-amp
-```
-
-##2. Analysis (reshaping data and maths)
-
-- We can plot a simple chart of the % NB amplification
- + Note that two samples are amplified
- + Plotting will be covered in detail shortly
-
-```{r analysis4, fig.height=5,fig.width=10}
-plot(prop, ylim=c(0,1))
-# Add a horizonal line:
-abline(h=0.33)
-```
-
-##3. Outputting the results
-
-- We write out a data frame of results (patients > 33% NB amplification) as a 'comma separated values' text file (CSV):
-```{r analysis5, eval=FALSE}
-write.csv(rawData[amp,], file="selectedSamples.csv")
-```
-- The output file is directly-readable by Excel
-- It's often helpful to double check where the data has been saved. Use the *get working directory* function:
-
-```{r analysis6, eval=FALSE}
-getwd() # print working directory
-list.files() # list files in working directory
-
-```
-
-##Data analysis exercise: exercise3.Rmd
-
-
-- Patients are *near normal* if:
-`(NB_Amp / Nuclei < 0.33 & NB_Del == 0)`
-- Modify the condition in our previous code to find these patients
-- Write out a results file of the samples that match these criteria, and open it in a spreadsheet program
-
-##Solution: solution-exercise3.pdf
-
-```{r analysis7}
-norm <- which(prop < 0.33 & rawData$NB_Del == 0)
-norm
-write.csv(rawData[norm,], "My_NB_output.csv")
-```
-
-
-# 4. Plotting in R
-
-##Plot basics
-
-- As we have heard, R has extensive graphical capabilities
-- ...but we need to start simple
-- We will describe *base* graphics in R: the plots available with any standard R installation
- + other more advanced alternatives are, e.g., `lattice`, `ggplot2`
- + See our [intermediate R course](http://bioinformatics-core-shared-training.github.io/r-intermediate/) for fancy graphics
-- Plotting in R is a *vast* topic:
- + We cannot cover everything
- + You can tinker with plots to your hearts content
- + Best to learn from examples
-- ***You need to think about how best to visualise your data***
- + http://www.bioinformatics.babraham.ac.uk/training.html#figuredesign
-- R cannot prevent you from creating a plotting disaster:
- + http://www.businessinsider.com/the-27-worst-charts-of-all-time-2013-6?op=1&IR=T
-
-##Making a Scatter Plot
-
-- If given a single vector as an argument, the function **`plot()`** will make a scatter plot with the *values* of the vector on the *y* axis, and *indices* in the *x* axis
- + e.g. it puts a point at:
- + x = 1, y = 70.8
- + x = 2, y = 67.9 etc...
-```{r,fig.height=3}
-patients$Weight
-plot(patients$Weight)
-```
-
-##Making a Scatter Plot
-- R tries to guess the most appropriate way to visualise the data, according to the type and dimensions of the object(s) provided
-
-
-```{r echo=FALSE,fig.height=3}
-plot(patients$Weight)
-```
-
-- Axis limits, labels, titles are inferred from the data
- + We can modify these as we wish, by specifying ***arguments***
-
-##Making a Scatter Plot of two variables
-
-- We can give two arguments to `plot()`:
- + In order to visualise the relationship between two variables
- + It will put the values from the *first* argument in the *x* axis, and values from the *second* argument on the *y* axis
-
-```{r,fig.height=3}
-patients$Age
-plot(patients$Age, patients$Weight)
-```
-
-##Making a barplot
-
-- Other types of visualisation are available:
- + These are often just special cases of using the `plot()` function
- + One such function is `barplot()`
-
-
-```{r eval=FALSE}
-barplot(patients$Age)
-```
-
-
-```{r,fig.height=3,echo=FALSE}
-par(mar=c(1,5,1,5))
-barplot(patients$Age)
-```
-
-##Making a barplot
-
-- It is more usual to display count data in a barplot
- + e.g. the counts of a particular ***categorical*** variable
-
-```{r eval=FALSE}
-barplot(summary(patients$Sex))
-```
-
-```{r,fig.height=3,echo=FALSE}
-par(mar=c(1,5,1,5))
-barplot(summary(patients$Sex))
-```
-
-
-##Plotting a distribution: Histogram
-
-- A histogram is a popular way of visualising a distribution of ***continuous*** data:
- + You can change the width of bins
- + The y-axis can be either frequency of density
-
-```{r,fig.height=5}
-hist(patients$Weight)
-```
-
-
-
-##Plotting a distribution: Boxplot
-
-- The boxplot is commonly used in statistics to visualise a distribution:
-```{r,eval=FALSE}
-boxplot(patients$Weight, horizontal = TRUE)
-```
-```{r,fig.height=2.5,echo=FALSE}
-par(mar=c(3,5,1,5))
-boxplot(patients$Weight, horizontal=T)
-```
-
-- The black solid line is the ***median***
-- The top and bottom of the box are the 75th and 25th percentiles
- + Hence, the distance between these is a reflection of the *spread* of the data; the Inter-Quartile Range (***IQR***)
-- Whiskers are drawn at 1.5 x IQR and -1.5 x IQR
-
-##Plotting a distribution: Boxplot
-
-- Sometimes we want to compare distributions between different categories in our data
-- For this we need to use the '*formula*' syntax
- + For now, `y ~ x` means put continuous variable `y` on the *y* axis and categorical `x` on the x axis
-```{r,fig.height=3}
-boxplot(patients$Weight ~ patients$Sex, horizontal = T)
-
-```
-
-- Other alternatives to consider:
- - `example(dotchart)`
- - `example(stripchart)`
- - `example(vioplot) # From vioplot library`
- - `example(beeswarm) # From beeswarm library`
-
-## Exercise: exercise4a.Rmd
-
-- In the course folder you will find the file `ozone.csv`:
- + Data describing weather conditions in New York City in 1973, obtained from the [supplementary data](http://faculty.washington.edu/heagerty/Books/Biostatistics/index-chapter.html) to *Biostatistics: A Methodology for the Health Sciences*
- + Full description here: http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozonedoc.txt
-1. Import these data into R
-2. What data types are present? Try to think of ways to create the following plots from the data
- + Scatter plot two variables. e.g. Solar Radiation against Ozone
- + A histogram. e.g. Wind Speed
- + Boxplot of a continuous variable against a categorical variable. e.g. Ozone level per month
-
-
-## Suggestions: solution-exercise4a.pdf
-
-```{r include=FALSE}
-weather <- read.csv("ozone.csv")
-```
-
-```{r eval=FALSE}
-weather <- read.csv("ozone.csv")
-View(weather)
-```
-
-```{r echo=FALSE,results='asis'}
-knitr:::kable(head(weather))
-```
-
-## Suggestions
-
-```{r}
-plot(weather$Solar.R, weather$Ozone)
-```
-
-## Suggestions
-
-```{r}
-hist(weather$Wind)
-```
-
-
-## Suggestions
-
-```{r}
-boxplot(weather$Ozone)
-```
-
-## Suggestions
-
-```{r}
-boxplot(weather$Ozone ~ weather$Month)
-```
-
-## Simple customisations
-
-- `plot()` comes with a large collection of arguments that can be set when we call the function:
- + See `?plot` and `?par`
-- Recall that, unless specified, arguments have a default value
-- We can choose to draw lines on the plot rather than points
- + The rest of the plot remains the same
-
-```{r,fig.height=5}
-plot(patients$Weight, type = "l")
-```
-
-## Simple customisations
-
-- We can also have both lines and points:
-
-```{r,fig.height=5}
-plot(patients$Weight, type = "b")
-```
-
-## Simple customisations
-
-- Add an informative title to the plot using the `main` argument:
-
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight,
- main = "Relationship between Weight and Age")
-```
-
-
-## Simple customisations
-
-- Adding the x-axis label:
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, xlab = "Age")
-```
-
-## Simple customisations
-- Adding the y-axis label:
-
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, ylab = "Weight")
-```
-
-## Simple customisations
-- We can specifiy multiple arguments at once:
- + here `ylim` and `xlim` are used to specify axis limits
-```{r,fig.height=4}
-plot(patients$Age,patients$Weight,
- ylab="Weight",
- xlab="Age",
- main="Relationship between Weight and Age",
- xlim=c(10,70),
- ylim=c(60,80))
-
-```
-
-##Defining a colour
-
-- R can recognise various strings, such as `"red"`, `"orange"`,`"green"`,`"blue"`,`"yellow"`...
-- Or more exotic ones like ``r sample(colours(),8)``...
- + See `colours()`
-- See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
-- Can also use **R**ed **G**reen **Blue** and hexadecimal values:
-
- + `rgb(0.7, 0.7, 0.7)` → A light grey in RGB format`
- + `"#B3B3B3"` → The same light grey in hexadecimal
- + `"#0000FF88"`→ A semi-transparent blue, in hexadecimal
- + The hexadecimal system is the native colour system for screen visualisation (e.g. webs). It indicates the intensity of Red, Green and Blue by using two digits for each colour, in a scale from 0-9 and A-F (0 meaning no intensity and F meaning most intense)
-
-##Use of colours
-
-Changing the `col` argument to `plot()` changes the colour that the points are plotted in:
-
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, col = "red")
-```
-
-
-##Plotting characters
-
-- R can use a variety of **p**lotting **ch**aracters
-- Each of which has a numeric *code*
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, pch = 16)
-```
-
-##Plotting characters
-
-```{r echo=FALSE}
-par(mar=c(0.1,0.1,0.1,0.1))
-i <- 0:24
-
-x <- floor(i /5) + 1
-y <- i %%5
-
-plot(1:10, type="n", xlim = c(1,5), ylim=c(-1,5),axes=F,xlab="",ylab="")
-points(x,y,pch=i+1, cex=2)
-text(x,y-0.3,i+1)
-```
-
-
-##Plotting characters
-- Or you can specify a character:
-
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, pch = "X")
-```
-
-
-##Size of points
-**C**haracter **ex**pansion:
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, cex = 3)
-```
-
-##Size of points
-**C**haracter **ex**pansion:
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, cex = 0.2)
-```
-
-##Colours and characters as vectors
-
-- Previously we have used a *vector* of length 1 as our value of colour and character
-- We can use a vector of any length:
- + the values will get *recycled* (re-used) so that each point gets assigned a value
-- We can use a pre-defined ***colour palette*** (see later)
-```{r,fig.height=4}
-plot(patients$Age, patients$Weight,
- pch = 1:10, cex = 1:5,
- col = c("red", "orange", "green", "blue"))
-```
-
-##Other plots use the same arguments
-
-- Other plotting functions use the same arguments as `plot()`
- + technical explanation: the arguments are *'inherited'*
-
-```{r,fig.height=4}
-boxplot(patients$Weight~patients$Sex,
- xlab = "Sex",
- ylab = "Weight",
- main = "Relationship between Weight and Gender",
- col = c("blue","yellow"))
-```
-
-
-##Exercise: exercise4b.Rmd
-
-- Can you re-create the following plots? Hint:
- + See the `breaks` and `freq` arguments to hist (`?hist`) to create 20 bins and display density rather than frequency
- + For third plot, see the rainbow function (`?rainbow`)
- + Don't worry too much about getting the colours exactly correct
- + The `las` argument changes the label orientation. See `?par`.
-
-```{r echo=FALSE, fig.height=5}
-par(mfrow=c(2,2))
-plot(weather$Solar.R,weather$Ozone, col="orange", pch=16,
- ylab="Ozone level", xlab="Solar Radiation",
- main="Relationship between ozone level and solar radiation")
-hist(weather$Wind, col="purple", xlab="Wind Speed", main="Distribution of Wind Speed", breaks=20, freq=FALSE)
-boxplot(weather$Ozone~weather$Month,col=rainbow(5),names=c("May", "Jun", "Jul", "Aug","Sep"),las=2,lab="Ozone Level",main="Distribution of Ozone per-month")
-```
-
-## Solutions: solution-exercise4b.pdf
-
-```{r,fig.height=5}
-plot(weather$Solar.R, weather$Ozone, col="orange", pch=16,
- ylab="Ozone level", xlab="Solar Radiation",
- main="Relationship between ozone level and
- solar radiation")
-```
-
-
-## Solutions
-
-```{r,fig.height=5}
-hist(weather$Wind, col="purple", xlab="Wind Speed",
- main="Distribution of Wind Speed", breaks = 20,
- freq=FALSE)
-```
-
-## Solutions
-
-- The **`rainbow()`** function is used to create a vector of colours for the boxplot; in other words a ***palette***:
- + Red, Orange, Yellow, Green, Blue, Indigo, Violet, etc.
- + Other palette functions available: `heat.colors(), terrain.colors(), topo.colors(), cm.colors()`
- + Red, Orange, Yellow, Green, Blue, Indigo, Violet....etc
-
-
-
-```{r,fig.height=5}
-boxplot(weather$Ozone~weather$Month,col=rainbow(5),
- names=c("May", "Jun", "Jul", "Aug","Sep"),
- las=2,lab="Ozone Level",
- main="Distribution of Ozone per-month")
-```
-
-## Solutions
-
-- More aesthetically-pleasing palettes are provided by the **`RColorBrewer`** package:
- + Can also check for palettes that are accepted for those with colour-blindness
-
-```{r eval=FALSE}
-library(RColorBrewer)
-display.brewer.all()
-display.brewer.all(colorblindFriendly = TRUE)
-```
-
-```{r echo=FALSE}
-library(RColorBrewer)
-display.brewer.all(colorblindFriendly = TRUE)
-```
-
-
-## Solutions
-
-```{r fig.height=5}
-boxplot(weather$Ozone ~ weather$Month,
- col=brewer.pal(5,"Set1"))
-```
-
-
-#End of Day 1
-
-## To come tomorrow...
-- More customisation of plots
-- Statistics
-- Further manipulation of data
-- Report writing
diff --git a/Slides-day1.html b/Slides-day1.html
deleted file mode 100644
index 6cbe66e..0000000
--- a/Slides-day1.html
+++ /dev/null
@@ -1,1749 +0,0 @@
-
-
-
-
-
-
-
-
-
- Introduction to Solving Biological Problems Using R - Day 1
-
-
-
-
-
-
-
-
-
Introduction to Solving Biological Problems Using R - Day 1
-
-Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
-
-
Last modified: 02 Dec 2016
-
-
-
Course Aims
-
-
To introduce you to the basics of R
-
-
Reading data
-
Perform simple analyses
-
Producing graphs
-
How to get help!
-
-
Give you all the background you need to practice by yourselves
-
Introduce tools that will help you to work in a reproducible manner
Executed using command line, or a graphical user interface (GUI)
-
On this course, we use the RStudio GUI (www.rstudio.com)
-
Everything you need is installed on the training machines
-
If you are using your own machine, download both R and RStudio
-
-
-
Getting started
-
-
R is a program which, once installed on your system, can be launched and is immediately ready to take input directly from the user
-
There are two ways to launch R:
-
-
From the command line (particularly useful if you’re quite familiar with Linux; in the console at the prompt simply type R)
-
As an application called (very good for beginners)
-
-
-
-
Launching R Using RStudio
-
To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
-
-
-
RStudio screenshot
-
-
-
Basic concepts in R - command line calculation
-
-
The command line can be used as a calculator. Type:
-
-
2 +2
-20/5 -sqrt(25) +3^2
-sin(pi/2)
-
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
-
-
Basic concepts in R - variables
-
-
A variable is a letter or word which takes (or contains) a value. We use the assignment operator: <-
-
-
x <-10
-x
-
-myNumber <-25
-myNumber
-
-
We can perform arithmetic on variables:
-
-
sqrt(myNumber)
-
-
Basic concepts in R - variables
-
-
We can add variables together:
-
-
x +myNumber
-
-
We can change the value of an existing variable:
-
-
x <-21
-x
-
-
Basic concepts in R - variables
-
-
We can set one variable to equal the value of another variable:
-
-
x <-myNumber
-x
-
-
We can modify the contents of a variable:
-
-
myNumber <-myNumber +sqrt(16)
-myNumber
-
-
Basic concepts in R - functions
-
-
Functions in R perform operations on arguments (the inputs(s) to the function). We have already used:
-
-
sin(x)
-
-
This returns the sine of x
-
-
In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.
-
-
Try these:
-
-
sum(3,4,5,6)
-max(3,4,5,6)
-min(3,4,5,6)
-
-
Basic concepts in R - functions
-
-
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order)
-
-
seq(from =2, to =20, by =4)
-seq(2, 20, 4)
-
-
When testing code, it is easier and safer to name the arguments
-
-
-
Basic concepts in R - vectors
-
-
The basic data structure in R is a vector – an ordered collection of values.
-
R treats even single values as 1-element vectors.
-
The function ccombines its arguments into a vector:
-
-
x <-c(3,4,5,6)
-x
-
-
The square brackets [] indicate the position within the vector (the index).
-
We can extract individual elements by using the [] notation:
-
-
x[1]
-x[4]
-
-
Basic concepts in R - vectors
-
-
We can even put a vector inside the square brackets (vector indexing):
-
-
y <-c(2,3)
-x[y]
-
-
Basic concepts in R - vectors
-
-
There are a number of shortcuts to create a vector.
-
Instead of:
-
-
x <-c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
we can write:
-
-
x <-3:12
-x
-
-
Basic concepts in R - vectors
-
-
or we can use the seq() function, which returns a vector:
-
-
x <-seq(2, 20, 4)
-x
-
x <-seq(2, 20, length.out=5)
-x
-
-
or we can use the rep() function:
-
-
y <-rep(3, 5)
-y
-
y <-rep(1:3, 5)
-y
-
-
Basic concepts in R - vectors
-
-
We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
-
-
x <-3:12
-# Extract elements from x:
-
-x[3:7]
-x[seq(2, 6, 2)]
-x[rep(3, 2)]
-
-
Basic concepts in R - vectors
-
-
We can add an element to a vector:
-
-
y <-c(x, 1)
-y
-
-
We can glue vectors together:
-
-
z <-c(x, y)
-z
-
-
Basic concepts in R - vectors
-
-
We can “remove” element(s) from a vector:
-
-
NOTE: the vector x doesn’t get modified
-
we’re just displaying what the vector looks like without particular elements
-
-
-
x <-3:12
-
-x[-3]
-x[-(5:7)]
-x[-seq(2, 6, 2)]
-x
-
-
Basic concepts in R - vectors
-
-
Finally, we can modify the contents of a vector:
-
-
x[6] <-4
-x
-
-x[3:5] <-1
-x
-
Remember!
-
-
Square brackets [ ] for indexing
-
Parentheses () for function arguments
-
-
-
Basic concepts in R - vector arithmetic
-
-
When applying all standard arithmetic operations to vectors, application is element-wise
-
-
x <-1:10
-y <-x*2
-
y
-
z <-x^2
-
z
-
-
Basic concepts in R - vector arithmetic
-
-
Adding two vectors:
-
-
y +z
-
-
If vectors are not the same length, the shorter one will be recycled:
-
-
x +1:2
-
-
But be careful if the vector lengths aren’t factors of each other:
-
-
x +1:3
-
Warning in x + 1:3: longer object length is not a
-multiple of shorter object length
-
[1] 2 4 6 5 7 9 8 10 12 11
-
-
Basic concepts in R - Character vectors and naming
-
-
All the vectors we have seen so far have contained numbers, but we can also store text (/“strings”) in vectors – this is called a character vector.
We can also use the names() function to get a vector of the names of an object:
-
-
names(gene.expression)
-
-
Exercise: Body-Mass Index
-
-
Let’s try some vector arithmetic. Here are the weights and heights of five individuals
-
-
-
-
-
Person
-
Weight (kg)
-
Height (cm)
-
-
-
-
-
Jo
-
65.8
-
192
-
-
-
Sam
-
67.9
-
179
-
-
-
Charlie
-
75.3
-
169
-
-
-
Frankie
-
61.9
-
175
-
-
-
Alex
-
92.4
-
171
-
-
-
-
-
Create weight and height vectors to hold the data in each column using the c function. Create a person vector and use this vector to name the values in the other two vectors.
-
-
-
Exercise: Body-Mass Index
-
-
The body-mass index is given by the formula:- \(BMI = (Weight)/(Height^2)\); where Height is given in metres
-
-
Create a new vector to record this, called bmi.
-
-
Create a new vector bmi.sorted where the bmi values are put in increasing numeric order (HINT: look up the help on the sort function)
-
The interquartile range (IQR) of a vector is defined as the 75% percentile of the data minus the 25% percentile. Calculate the IQR for our bmi values
-
Names are usually carried across to the new vector. Sometimes this is what we want (as for bmi) but sometimes it is not (when we are comparing values in bmi.sorted). We can remove names by setting them to the special NULL value:
Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
-
-
To create an R markdown file, Click on File → New File → R Markdown in Rstudio
-
-
markdown is a easy-to-read, easy-to-write text format often used to write HTML, readme files, etc.
-
a simpler (but not so informative) alternative is to use a script
-
-
This will make our analyses reproducible
-
-
-
Format of an R markdown file
-
-
Lines 8 - 10: plain text description
-
Lines 12 - 14: an R code ‘chunk’
-
Lines 18 to 20: another code chunk, this time producing a plot
-
-
-
-
md-format
-
-
-
Pressing the Knit HTML (/Knit PDF) button will create a report
-
See solution-exercise1.Rmd for solution to Exercise 1
-
All exercises have a markdown template that you can edit
-
-
-
Getting help
-
-
This is possibly the most important slide in the whole course!?!
-
To get help on any R function, type ? followed by the function name. For example:
-
-
?seq
-
-
This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which package it belongs to.
-
There is typically a usage example, which you can test using the example function:
-
-
example(seq)
-
-
Getting help
-
-
If you can’t remember the exact name, type ?? followed by your guess. R will return a list of possibilities:
-
-
??mean
-
-
The Packages tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
-
-
Often there will be a user-guide or ‘vignette’ too
-
-
-
-
Interacting with the R console
-
-
Important – R console symbols:
-
-
; end of line (Enables multiple commands to be placed on one line of text)
-
# comment (indicates text is a comment and not executed)
-
+ command line wrap (R is waiting for you to complete an expression)
-
-
Ctrl-c or escape to clear input line and try again
-
Ctrl-l to clear window
-
Use the TAB key for command auto completion
-
Use up and down arrows to scroll through the command history
-
-
-
R packages
-
-
R comes ready loaded with various libraries of functions called packages. For example: the function sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats package
-
There are 1000s of additional packages provided by third parties, and the packages can be found in numerous server locations on the web called repositories
-
The two repositories you will come across the most are:
-
Although the basic unit of R is a vector, we usually handle data in data frames.
-
A data frame is a set of observations of a set of variables – in other words, the outcome of an experiment.
-
For example, we might want to analyse information about a set of patients.
-
To start with, let’s say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
-
-
-
The patients data frame
-
-
We are going to create a data frame called ‘patients’, which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
-
We will explore how to construct these data from scratch.
-
-
(in practice, we would usually import such data from a file)
-
-
-
-
-
-
First_Name
-
Second_Name
-
Full_Name
-
Sex
-
Age
-
Weight
-
Consent
-
-
-
-
-
Adam
-
Jones
-
Adam Jones
-
Male
-
50
-
70.8
-
TRUE
-
-
-
Eve
-
Parker
-
Eve Parker
-
Female
-
21
-
67.9
-
TRUE
-
-
-
John
-
Evans
-
John Evans
-
Male
-
35
-
75.3
-
FALSE
-
-
-
Mary
-
Davis
-
Mary Davis
-
Female
-
45
-
61.9
-
TRUE
-
-
-
Peter
-
Baker
-
Peter Baker
-
Male
-
28
-
72.4
-
FALSE
-
-
-
Paul
-
Daniels
-
Paul Daniels
-
Male
-
31
-
69.9
-
FALSE
-
-
-
Joanna
-
Edwards
-
Joanna Edwards
-
Female
-
42
-
63.5
-
FALSE
-
-
-
Matthew
-
Smith
-
Matthew Smith
-
Male
-
33
-
71.5
-
TRUE
-
-
-
David
-
Roberts
-
David Roberts
-
Male
-
57
-
73.2
-
FALSE
-
-
-
Sally
-
Wilson
-
Sally Wilson
-
Female
-
62
-
64.8
-
TRUE
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Character, numeric and logical data types
-
-
Each column is a vector, like previous vectors we have seen, for example:
firstName secondName paste.firstName..secondName. ...
-1 Adam Jones Adam Jones
-2 Eve Parker Eve Parker
-3 John Evans John Evans
-4 Mary Davis Mary Davis
-5 Peter Baker Peter Baker
-6 Paul Daniels Paul Daniels
-7 ...
-
-
Naming data frame variables
-
-
We can access particular variables using the ‘$’operator:
-
-
patients$age
-
-
R has inferred the names of our data frame variables from the names of the vectors or the commands (e.g. the paste() command)
-
We can name the variables after we have created a data frame using the names() function, and we can use the same function to see the names:
When creating a data frame, R assumes all character vectors should be categorical variables and converts them to factors. This is not always what we want:
-
-
e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
-
-
-
patients$First_Name
-
-
Factors in data frames
-
-
We can avoid this by asking R not to treat strings as factors, and then explicitly stating when we want a factor by using factor():
Some calculations are more efficient to do on matrices, e.g.:
-
-
rowMeans(e)
-
[1] 3.5 4.5 5.5 6.5 7.5
-
-
Indexing data frames and matrices
-
-
You can index multidimensional data structures like matrices and data frames using commas:
-
object[rows, colums]
-
-
e[1,2]
-e[1,]
-patients[1,2]
-patients[1,]
-
-
If you don’t provide an index for either rows or columns, all of the rows or columns will be returned.
-
-
-
Advanced indexing
-
-
‘Values’ in R are really vectors
-
Indices are actually vectors, and can be numeric or logical:
-
-
s <-letters[1:5]
-s
-
[1] "a" "b" "c" "d" "e"
-
# View some of the values in s:
-
s[c(1,3)]
-s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
-
-
Advanced indexing
-
-
We can do the logical test and indexing in the same line of R code
-
-
R will do the test first, and then use the vector of TRUE and FALSE values to subset the vector
-
-
-
a <-1:5
-
-# Logical tests:
-a <3
-
[1] TRUE TRUE FALSE FALSE FALSE
-
s[a <3]
-
[1] "a" "b"
-
-
Operators
-
-
Operators allow us to combine multiple logical tests
-
comparison operators <, >, <=, >=, ==, !=
-
logical operators !, &, |, xor
-
-
The operators for ‘comparison’ and ‘logical’ always return logical values! i.e. (TRUE, FALSE)
-
-
-
s
-
[1] "a" "b" "c" "d" "e"
-
a
-
[1] 1 2 3 4 5
-
s[a >1 &a <3]
-s[a ==2]
-
-
Exercise: exercise2.Rmd
-
-
The markdown template has code to create the patients data frame from the slides
-
Make a new data frame with three extra variables: country, continent, and height
-
-
Make up the data
-
Make country a character vector but continent a factor
-
-
Try the summary() function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)
-
Use logical indexing to select the following patients from the data frame described in the slides:
-
-
Patients under 40
-
Patients who give consent to share their data
-
Men who weigh as much or more than the average European male (70.8 kg)
-
-
-
-
Logical indexing answers: solution-exercise2.pdf
-
-
Patients under 40:
-
-
patients[patients$Age <40, ]
-
-
Patients who give consent to share their data:
-
-
patients[patients$Consent ==TRUE, ]
-
-
Men who weigh as much or more than the average European male (70.8 kg):
The names of the columns are automatically assigned:
-
-
colnames(rawData)
-
[1] "Patient" "Nuclei" "NB_Amp" "NB_Nor" "NB_Del"
-
-
We can use any of these names to access a particular column:
-
-
and create a vector
-
TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
-
-
-
rawData$Nuclei
-
-
Word of caution
-
-
-
-
-
-
-
-
-
-
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
-
-
-
Word of caution
-
-
You will make your life a lot easier if you keep your data tidy:
-
If given a single vector as an argument, the function plot() will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
-
In the course folder you will find the file ozone.csv:
-
-
Data describing weather conditions in New York City in 1973, obtained from the supplementary data to Biostatistics: A Methodology for the Health Sciences
Can also use Red Green Blue and hexadecimal values:
-
-
rgb(0.7, 0.7, 0.7) → A light grey in RGB format`
-
"#B3B3B3" → The same light grey in hexadecimal
-
"#0000FF88"→ A semi-transparent blue, in hexadecimal
-
-
The hexadecimal system is the native colour system for screen visualisation (e.g. webs). It indicates the intensity of Red, Green and Blue by using two digits for each colour, in a scale from 0-9 and A-F (0 meaning no intensity and F meaning most intense)
-
-
-
-
-
Use of colours
-
Changing the col argument to plot() changes the colour that the points are plotted in:
-
plot(patients$Age, patients$Weight, col ="red")
-
-
-
Plotting characters
-
-
R can use a variety of plotting characters
-
Each of which has a numeric code
-
-
plot(patients$Age, patients$Weight, pch =16)
-
-
-
Plotting characters
-
-
-
Plotting characters
-
-
Or you can specify a character:
-
-
plot(patients$Age, patients$Weight, pch ="X")
-
-
-
Size of points
-
Character expansion:
-
plot(patients$Age, patients$Weight, cex =3)
-
-
-
Size of points
-
Character expansion:
-
plot(patients$Age, patients$Weight, cex =0.2)
-
-
-
Colours and characters as vectors
-
-
Previously we have used a vector of length 1 as our value of colour and character
-
We can use a vector of any length:
-
-
the values will get recycled (re-used) so that each point gets assigned a value
-
-
We can use a pre-defined colour palette (see later)
-
-
-
-
-
-
diff --git a/Slides-day2.Rmd b/Slides-day2.Rmd
deleted file mode 100644
index 2a9feb5..0000000
--- a/Slides-day2.Rmd
+++ /dev/null
@@ -1,1804 +0,0 @@
----
-title: "Introduction to Solving Biological Problems Using R - Day 2"
-author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
- Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
-date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
-output:
- slidy_presentation:
- footer: 'This work is licensed under the Creative Commons Attribution-ShareAlike
- 2.0. COURSE HOME: cambiotraining.github.io/r-intro/'
- beamer_presentation: default
-css: mystyle.css
-toc: yes
----
-```{r include = FALSE}
-library(knitr)
-opts_chunk$set(comment = NA) # eliminates hashtag from R outputs
-```
-
-## Day 2 Schedule
-
-1. Further customisation of plots
-2. Statistics
-3. Data Manipulation Techniques
-4. Programming in R
-5. Further report writing
-
-#1. Further customisation of plots
-
-## Recap
-
-- We have seen how to use `plot()`, `boxplot()` , `hist()` etc to make simple plots
-- These come with arguments that can be used to change the appearance of the plot
- + `col`, `pch`
- + `main`, `xlab`, `ylab`
- + etc....
-- We will now look at ways to modify the plot appearance after it has been created
-- Also, how to export the graphs
-
-
-
-## The painter's model
-
-- R employs a painter's model to construct it's plots
-- Elements of the graph are added to the canvas one layer at a time, and the picture built up in levels.
-- Lower levels are obscured by higher levels,
- + allowing for blending, masking and overlaying of objects.
-- Caution: You can't undo the changes you make to the plot
-
-
-
-[http://www.inquisitr.com/309687/jesus-painting-restoration-goes-wrong-well-intentioned-old-lady-destroys-100-year-old-fresco/](http://www.inquisitr.com/309687/jesus-painting-restoration-goes-wrong-well-intentioned-old-lady-destroys-100-year-old-fresco/)
-
-## Example data
-
-- We will re-use the patients data from yesterday:
-
-```{r}
-age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
-weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5,
- 71.5, 73.2, 64.8)
-firstName <- c("Adam", "Eve", "John", "Mary", "Peter",
- "Paul", "Joanna", "Matthew", "David", "Sally")
-secondName <- c("Jones", "Parker", "Evans", "Davis",
- "Baker","Daniels", "Edwards", "Smith",
- "Roberts", "Wilson")
-
-consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
- FALSE, TRUE, FALSE, TRUE)
-
-sex <- c("Male", "Female", "Male", "Female", "Male",
- "Male", "Female", "Male", "Male", "Female")
-```
-
-## Example data
-
-```{r}
-patients <- data.frame(First_Name = firstName,
- Second_Name = secondName,
- Full_Name = paste(firstName, secondName),
- Sex = factor(sex),
- Age = age,
- Weight = weight,
- Consent = consent,
- stringsAsFactors = FALSE)
-```
-
-
-
-##Initial plot
-
-- Recall our patients dataset from yesterday
- + we might want to display other characteristics on the plot, e.g. gender of individual:
-
-```{r, fig.height=5}
-plot(patients$Age, patients$Weight, pch=16)
-```
-
-##The points function
-
-- `points()` can be used to set of points to an *existing* plot
-- It requires a vector of x and y coordinates
- + These do not have to be the same length as the number of points in the initial plot:
- + Hence we can use `points()` to highlight observations
- + ...or add a set of new observations
-```{r fig.height=4}
-plot(patients$Age, patients$Weight, pch=16)
-points(40,68, pch="X")
-```
-
-- Note that axis limits of the existing plot are not altered
-
-## Creating a blank plot
-
-- Often it is useful to create a blank 'canvas' with the correct labels and limits
-
-```{r, fig.height=5}
-plot(patients$Age, patients$Weight, type="n")
-```
-
-## Adding points to differentiate gender
-
-- Selecting males using the **`==`** comparison we saw yesterday
- + Gives a `TRUE` or `FALSE` value
- + Can be used to index the data frame
- + Which means we can get the relevant Age and Weight values
-```{r}
-males <- patients$Sex == "Male"
-```
-```{r, eval=FALSE}
-males
-patients[males,]
-patients$Age[males]
-patients$Weight[males]
-```
-
-## Adding points to differentiate gender
-
-- The points we add have to be within the `x` and `y` limits of the original plot axes, otherwise they won't be displayed
-
-```{r, fig.height=5}
-plot(patients$Age, patients$Weight, type="n")
-points(patients$Age[males], patients$Weight[males],
- pch=16, col="steelblue")
-
-```
-
-
-## Adding points to differentiate gender
-
-
-
-```{r}
-females <- patients$Sex == "Female"
-females
-patients[females,]
-```
-
-## Adding points to differentiate gender
-
-- Again, we have to be careful that all the points are within the `x` and `y` limits
- + this is why creating the blank plot containing the limits of the data is useful
-
-```{r, fig.height=5}
-plot(patients$Age, patients$Weight, type="n")
-points(patients$Age[males], patients$Weight[males],
- pch=16, col="steelblue")
-points(patients$Age[females], patients$Weight[females],
- pch=16, col="orangered1")
-
-```
-
-
-##Adding points
-
-- Each set of points can have a different colour and shape
-- Axis labels and title and limits are defined by the plot
-- Once you've added points to a plot, they cannot be removed
-- A call to `plot` will start a new graphics window
- - or typing `dev.off()`
-
-```{r fig.height=3,fig.width=8}
-plot(patients$Age, patients$Weight, type="n")
-points(patients$Age[males], patients$Weight[males],
- pch=16, col="steelblue")
-points(patients$Age[females], patients$Weight[females],
- pch=17, col="orangered1")
-```
-
-
-
-
-## Adding a legend
-
-- Should also add a legend to help interpret the plot
- + use the `legend` function
- + can give x and y coordinates where legend will appear
- + also recognises shortcuts such as ***topleft*** and ***bottomright***...
-
-```{r fig.height=3,fig.width=8}
-plot(patients$Age, patients$Weight, type="n")
-points(patients$Age[males], patients$Weight[males],
- pch=16, col="steelblue")
-points(patients$Age[females], patients$Weight[females],
- pch=17, col="orangered1")
-legend("topleft", legend=c("M","F"),
- col=c("steelblue","orangered1"), pch=c(16,17))
-```
-
-##Adding text
-
-- Text can also be added to a plot in a similar manner
- + The `labels` argument specifies the text we want to add
-
-```{r fig.height=4,fig.width=8}
-plot(patients$Age, patients$Weight, pch=16)
-text(patients$Age, patients$Weight, labels=patients$Full_Name)
-```
-
-##Adding text
-
-- Can alter the positions so they don't interfere with the points of the graph
-
-```{r fig.height=3,fig.width=8}
-plot(patients$Age, patients$Weight, pch=16,
- xlim=c(10,70), ylim=c(60,75))
-text(patients$Age-1, patients$Weight-0.5,
- labels=patients$Full_Name)
-```
-
-- Alternatively, you can use the argument `adj`
-
-## Adding lines
-
-- To aid our interpretation, it is often helpful to add guidelines
- + `grid()` is one easy way of doing this:
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, pch=16)
-grid(col="steelblue")
-```
-
-## Adding lines
-
-- Can also add lines that intersect the axes:
- + `v =` for vertical lines
- + `h =` for horizontal
- + can specify multiple lines in a vector
-```{r,fig.height=5}
-plot(patients$Age, patients$Weight, pch=16)
-abline(v=40, col="red")
-abline(h=c(65,70,75), col="blue")
-```
-
-
-## Plot layouts
-
-- The `par` function can be used specify the appearance of a plot
-- The settings persist until the plot is closed with **`dev.off()`**
-- `?par` and scroll to ***graphical parameters***
-- One example is `mfrow`:
- + "multiple figures per row"
- + needs to be a vector of rows and columns:
- + e.g. a plot with one row and two columns `par(mfrow=c(1,2))`
- + don't need the same kind of plot in each cell
-
-## Plot layouts
-
-```{r,fig.height=4}
-par(mfrow=c(1,2))
-plot(patients$Age, patients$Weight, pch=16,
- xlim=c(10,70), ylim=c(60,75))
-boxplot(patients$Weight ~ patients$Sex)
-
-```
-
-- See also `mar` for setting the margins:
- + `par(mar=c(...))`
-
-## Exporting graphs from RStudio
-
-- When using Rstudio interactively, the easiest option is to use the *Export* button from the *Plots* panel
-- Otherwise, in an R script you can use the `pdf()` function:
- + You will see that the plot does not appear in RStudio
-```{r eval=FALSE}
-pdf("ExampleGraph.pdf")
-plot(rnorm(1:10))
-```
-
-- You need to use the `dev.off()` to stop printing graphs to the pdf and 'close' the file
- + It allows you to create a pdf document with multiple pages
-```{r eval=FALSE}
-dev.off()
-```
-
-- pdf is a good choice for publication as they can be imported into Photoshop, Inkscape, etc.
- - Sometimes it is easier to edit in these tools than R!
- - If it is taking too long to customise a plot in R, consider if you should be using one of these tools instead
-
-## Exporting graphs from RStudio
-
-- To save any graph you have created to a pdf, repeat the code you used to create the plot with `pdf()` before and `dev.off()` afterwards
- + you can have as many lines of code in-between as you like
-
-```{r}
-pdf("mygraph.pdf")
-plot(patients$Age, patients$Weight, pch=16)
-abline(v=40, col="red")
-abline(h=c(65,70,75), col="blue")
-dev.off()
-```
-
-- If no plots are appearing in RStudio, it could be you are still writing to a pdf file
- + run `dev.off()` multiple times until you see a message `cannot shut down device (the null device)`
-
-
-## Exporting graphs from RStudio
-
-- We can specify the dimensions of the plot, and other properties of the file (`?pdf`)
-
-```{r}
-pdf("ExampleGraph.pdf", width=10, height=10)
-plot(rnorm(1:10))
-dev.off()
-```
-
-- Other formats can be created:
- + e.g. ***png***, or others `?jpeg`
- + more appropriate for email, presentations, web page
-
-```{r}
-png("ExampleGraph.png")
-plot(rnorm(1:10))
-dev.off()
-```
-
-##Exercise: exercise5.Rmd
-- Return to the weather data from yesterday:
-
-```{r}
-weather <- read.csv("ozone.csv")
-```
-
-- Using the `par` function, create a layout with three columns
-- Plot Ozone versus Solar Radiation, Wind Speed and Temperature on separate graphs
- + use different colours and plotting characters on each plot
-- Save the plot to a pdf
-- HINT: Create the graph first in RStudio. When you're happy with it, re-run the code preceeded by the `pdf` function to save to a file
- + don't forget to use `dev.off()` to close the file
-
-```{r echo=FALSE,fig.height=4,fig.width=10}
-par(mfrow=c(1,3))
-plot(weather$Solar.R,weather$Ozone,pch=16,col="lightgreen",ylab="Ozone level",xlab="Solar Radiation")
-plot(weather$Wind,weather$Ozone, pch=15,col="steelblue",ylab="Ozone level", xlab="Wind Speed")
-plot(weather$Temp,weather$Ozone,pch=17,col="orange", ylab="Ozone level",xlab="Temperature")
-
-```
-
-##Exercise: exercise5.Rmd
-
-- Temperature and Ozone level seem to be correlated
-- However, there are some observations that do not seem to fit the trend
- + those with Ozone level > 100
-- Modify the plot so that these outlier observations are in a different colour
-
-```{r fig.height=5}
-plot(weather$Temp,weather$Ozone, pch=17,
- col="orange", ylab="Ozone level",
- xlab="Temperature")
-```
-
-## Target graph
-
-HINT: You can break down the problem into the following steps
-
-- Create a blank plot
-- Identify observations with ozone > 100
- + plot the corresponding Temperature and Ozone values for these in red
-- Identify observations with ozone < 100
- + plot the corresponding Temperature and Ozone values for these in orange
-
-
-```{r echo=FALSE, fig.height=5}
-plot(weather$Temp,weather$Ozone, pch=17,
- col="orange", ylab="Ozone level",
- xlab="Temperature")
-highO <- which(weather$Ozone > 100)
-abline(h=100,lty=2)
-points(weather$Temp[highO],weather$Ozone[highO],pch=17,col="red")
-```
-
-
-## Solution
-
-```{r eval=FALSE}
-pdf("ozoneCorrelations.pdf")
-par(mfrow=c(1,3))
-plot(weather$Solar.R, weather$Ozone, pch=16,
- col="lightgreen", ylab="Ozone level",
- xlab="Solar Radiation")
-plot(weather$Wind, weather$Ozone, pch=15,
- col="steelblue", ylab="Ozone level",
- xlab="Wind Speed")
-plot(weather$Temp,weather$Ozone, pch=17,
- col="orange", ylab="Ozone level",
- xlab="Temperature")
-dev.off()
-```
-
-## Solution
-
-If the graph looks a bit stretched...
-
-```{r eval=FALSE}
-pdf("ozoneCorrelations.pdf", width=10,height = 6)
-par(mfrow=c(1,3))
-plot(weather$Solar.R, weather$Ozone, pch=16,
- col="lightgreen", ylab="Ozone level",
- xlab="Solar Radiation")
-plot(weather$Wind, weather$Ozone, pch=15,
- col="steelblue", ylab="Ozone level",
- xlab="Wind Speed")
-plot(weather$Temp,weather$Ozone, pch=17,
- col="orange", ylab="Ozone level",
- xlab="Temperature")
-dev.off()
-```
-
-
-## Solution: solution-exercise5.pdf
-
-```{r fig.height=5}
-highO <- which(weather$Ozone > 100)
-lowO <- which(weather$Ozone < 100)
-
-plot(weather$Temp,weather$Ozone, type="n",
- ylab="Ozone level",
- xlab="Temperature")
-points(weather$Temp[highO],weather$Ozone[highO],
- col="red",pch=17)
-points(weather$Temp[lowO],weather$Ozone[lowO],
- col="orange",pch=17)
-abline(h=100,lty=2)
-```
-
-
-## Alternative Solution
-
-Defining a vector of colours and plotting characters, and over-writing particular entries
-
-- `rep` is used to repeat a value a certain number of times
-
-```{r}
-mycol <-rep("orange",nrow(weather))
-highO <- which(weather$Ozone > 100)
-mycol[highO] <- "red"
-mypch <- rep(17, nrow(weather))
-mypch[highO] <- 18
-plot(weather$Temp,weather$Ozone,
- col=mycol, pch=mypch,ylab="Ozone level",
- xlab="Temperature")
-abline(h=100,lty=2)
-```
-
-## Alternative Solution
-
-`ifelse` is a handy function for creating a vector of values based on a logical test
-
-```{r}
-mycol <- ifelse(weather$Ozone > 100, "red","orange")
-mypch <- ifelse(weather$Ozone > 100, 18,17)
-
-plot(weather$Temp,weather$Ozone,
- col=mycol, pch=mypch,ylab="Ozone level",
- xlab="Temperature")
-abline(h=100,lty=2)
-```
-
-
-# 2. Statistics
-##Built-in support for statistics
-- R is a statistical programming language:
- + Classical statistical tests are built-in
- + Statistical modeling functions are built-in
- + Regression analysis is fully supported
- + Additional mathematical packages are available (`MASS`, Waves, sparse matrices, etc)
-
-##Distribution functions
-- Most commonly used distributions are built-in, functions have stereotypical names, e.g. for normal distribution:
- + **`pnorm`** - cumulative distribution for x
- + **`qnorm`** - inverse of pnorm (from probability gives x)
- + **`dnorm`** - distribution density
- + **`rnorm`** - random number from normal distribution
-
-
-
-- Available for variety of distributions: `punif` (uniform), `pbinom` (binomial), `pnbinom` (negative binomial), `ppois` (poisson), `pgeom` (geometric), `phyper` (hyper-geometric), `pt` (T distribution), pf (F distribution)
-
-##Distribution functions
-- 10 random values from the Normal distribution with mean 10 and standard deviation 5:
-```{r eval=FALSE}
-rnorm(10, mean=10, sd=5)
-```
-- The probability of drawing exactly 10 from this distribution:
-```{r}
-dnorm(10, mean=10, sd=5)
-```
-
-```{r}
-dnorm(100, mean=10, sd=5)
-```
-
-##Distribution functions (continued)
-
-- The probability of drawing a value smaller than 10:
-```{r}
-pnorm(10, mean=10, sd=5)
-```
-- The inverse of `pnorm()`:
-```{r}
-qnorm(0.5, mean=10, sd=5)
-```
-- How many standard deviations for statistical significance?
-```{r}
-qnorm(0.95, mean=0, sd=1)
-```
-
-## Example
-
-Recall our histogram of Wind Speed from yesterday:
-
-- The data look to be roughly normally-distributed
-- An assumption we rely on for various statistical tests
-
-```{r fig.height=4}
-hist(weather$Wind, col="purple", xlab="Wind Speed",
- main="Distribution of Wind Speed",
- breaks = 20, freq=FALSE)
-```
-
-## Create a normal distribution curve
-
-- If our data are normally-distributed, we can calculate the probability of drawing particular values.
- + e.g. a Wind Speed of 10
-
-```{r eval=FALSE}
-tempMean <- mean(weather$Wind)
-tempSD <- sd(weather$Wind)
-dnorm(10, mean=tempMean, sd=tempSD)
-```
-
-```{r echo=FALSE}
-tempMean <- mean(weather$Wind)
-tempSD <- sd(weather$Wind)
-```
-
-- We can overlay this on the histogram using `points` as we just saw:
-```{r fig.height=4}
-hist(weather$Wind, col="purple", xlab="Wind Speed",
- main="Distribution of Wind Speed",
- breaks = 20, freq=FALSE)
-points(10, dnorm(10, mean=tempMean, sd=tempSD),
- col="red", pch=16)
-```
-
-
-## Create a normal distribution curve
-
-- We can repeat the calculation for a vector of values
- + remember that functions in R are often ***vectorized***
- + use `lines` in this case rather than `points`
-
- ```{r eval=FALSE}
-xs <- c(0,5,10,15,20)
-ys <- dnorm(xs, mean=tempMean, sd=tempSD)
-lines(xs, ys, col="red")
-```
-
-```{r fig.height=4,echo=FALSE}
-hist(weather$Wind,col="purple",xlab="Wind Speed",
- main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
-xs <- c(0,5,10,15,20)
-ys <- dnorm(xs, mean=tempMean,sd=tempSD)
-lines(xs,ys,col="red")
-```
-
-## Create a normal distribution curve
-
-- For a smoother curve, use a longer vector:
- + we can generate x values using the `seq()` function
-- Inspecting the data in this manner is usually acceptable to assess normality
- + the fit doesn't have to be exact
- + the shapiro test is also available `?shapiro.test`
-
-```{r eval=FALSE}
-xs <- seq(00,20, length.out = 10000)
-ys <- dnorm(xs, mean=tempMean, sd=tempSD)
-lines(xs, ys, col="red")
-```
-
-```{r fig.height=4,echo=FALSE}
-hist(weather$Wind,col="purple",xlab="Wind Speed",
- main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
-xs <- seq(00,20, length.out = 10000)
-ys <- dnorm(xs, mean=tempMean,sd=tempSD)
-lines(xs,ys,col="red")
-```
-
-## Simple testing
-
-- If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
-
-$$t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}$$
-
-- Say a Wind Speed of 2; which from the histogram seems to be unlikely
-
-```{r}
-t <- (tempMean - 2) / (tempSD/sqrt(length(weather$Wind)))
-t
-```
-
-## Simple testing
-
-- ...or use the **`t.test()`** function to compute the statistic and corresponding p-value
-
-```{r}
-t.test(weather$Wind, mu=2)
-```
-
-
-##Two-sample tests: Basic data analysis
-
-- Comparing 2 variances:
- + Fisher's F test
-```{r eval=FALSE}
-var.test()
-```
-- Comparing 2 sample means with normal errors:
- + Student's t test
-```{r eval=FALSE}
-t.test()
-```
-- Comparing 2 means with non-normal errors:
- + Wilcoxon's rank test
-```{r eval=FALSE}
-wilcox.test()
-```
-
-##Two-sample tests: Basic data analysis
-- Comparing 2 proportions:
- + Binomial test
- + e.g. [here](http://www.r-tutor.com/elementary-statistics/inference-about-two-populations/comparison-two-population-proportions)
-
-```{r eval=FALSE}
-prop.test()
-```
-- Correlating 2 variables:
- + Pearson's / Spearman's rank correlation
-```{r eval=FALSE}
-cor.test()
-```
-- Testing for independence of 2 variables in a contingency table:
- + Chi-squared / Fisher's exact test
-```{r eval=FALSE}
-chisq.test(); fisher.test()
-```
-
-## Statistical tests in R
-
-- Bottom-line: Pretty much any statistical test you care to name will probably be in R
- + This is not supposed to be a statistics course (sorry!)
- + None of them are particular harder than others to use
- + The difficulty is deciding which test to use:
- + whether the assumptions of the test are met, etc.
- + Consult your local statistician if not sure
- + An upcoming course that will help
- + [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
- + Some good references:
- + [Statistical Analysis Using R (Course from the Babaraham Bioinformatics Core)](http://training.csx.cam.ac.uk/bioinformatics/event/1827771)
- + [Quick-R guide to stats](http://www.statmethods.net/stats/index.html)
- + [Simple R eBook](https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
- + [R wiki](https://en.wikibooks.org/wiki/R_Programming/Descriptive_Statistics)
- + [R tutor](http://www.r-tutor.com/elementary-statistics)
- + [Statistical Cheatsheet](https://rawgit.com/bioinformatics-core-shared-training/intermediate-stats/master/cheatsheet.pdf)
-
-## Example analysis
-
-- We have already seen that men in our `patients` dataset tend to be heavier than women
-- We can **test this formally** in R
-
-```{r echo=FALSE,fig.height=5}
-par(mfrow=c(1,2))
-plot(patients$Age, patients$Weight,pch=16,type="n")
-points(patients$Age[males], patients$Weight[males],pch=16,col="steelblue")
-points(patients$Age[females], patients$Weight[females],pch=17,col="orangered1")
-legend("topleft", legend=c("M","F"),
- col=c("steelblue","orangered1"), pch=c(16,17))
-boxplot(patients$Weight~patients$Sex)
-```
-
-
-## Test variance assumption
-
-```{r}
-var.test(patients$Weight~patients$Sex)
-```
-
-## Perform the t-test
-
-```{r}
-t.test(patients$Weight~patients$Sex, var.equal=TRUE)
-```
-
-- This function can be tuned in various ways (`?t.test`):
- - Assumed equal variances, or not (and use Welch's correction)
- - Deal with paired samples
- - Two-sided, or one-sided p-value
-
-##Linear regression: Basic data analysis
-
-- Linear modeling is supported by the function **`lm()`**:
- + `example(lm)`
- + The output assumes you know a fair bit about the subject
-
-- `lm` is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson's correlation coefficient
- + This is very easy in R
-
-- Three steps to plotting with a best fit line:
- 1. Plot XY scatter-plot data
- 2. Fit a linear model
- 3. Add bestfit line data to plot with `abline()` function
-
-##Typical linear regression analysis: Basic data analysis
-
-- The ~ (***tilde***) is used to define a ***formula***; i.e. "y is given by x"
-
-
-```{r fig.height=5}
-x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
-y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
-plot(x,y, xlim=c(0,10), ylim=c(0,10))
-```
-
-
-##Typical linear regression analysis: Basic data analysis
-
- The ~ is used to define a formula; i.e. "y is given by x"
-- Take care about the order of x and y in the plot and lm expressions
-
-```{r,fig.height=5}
-plot(x,y, xlim=c(0,10), ylim=c(0,10))
-myModel <- lm(y~x)
-abline(myModel, col="red")
-```
-
-## In-depth summary
-
-```{r}
-summary(myModel)
-```
-
-
-##Typical linear regression analysis: Basic data analysis
-- Get the coefficients of the fit from:
-```{r, eval=FALSE}
-coef(myModel) # Coefficients
-resid(myModel) # Residuals
-fitted(myModel) # Fitted values
-names(myModel) # Names of the objects within myModel
-residuals(myModel) + fitted(myModel) # what values does this give?
-```
-
-## Diagnostic plots of the fit
-
-- Get QC of fit from:
- + Some explanation is given [here](http://data.library.virginia.edu/diagnostic-plots/) and [here](http://strata.uga.edu/6370/rtips/regressionPlots.html)
-
-```{r,fig.height=5}
-par(mfrow=c(2,2))
-plot(myModel)
-```
-
-##Modelling formulae
-- R has a very powerful formula syntax for describing statistical models
-- Suppose we had two explanatory variables `x` and `z`, and one response
-variable `y`
-- We can describe a relationship between, say, `y` and `x` using a tilde `~`,
-placing the response variable on the left of the tilde and the explanatory variables on the right:
- + `y~x`
-- It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
-```{r eval=FALSE}
-y~x #If x is continuous, this is linear regression
-y~x #If x is categorical, ANOVA
-y~x+z #If x and z are continuous, multiple regression
-y~x+z #If x and z are categorical, two-way ANOVA
-y~x+z+x:z # : is the symbol for the interaction term
-y~x*z # * is a shorthand for x+z+x:z
-```
-
-
-## Exercise: exercise6.Rmd
-
-- There are suggestions that Ozone level could be influenced by Temperature:
-
-```{r echo=FALSE,fig.height=4}
-plot(weather$Temp, weather$Ozone,xlab="Temperature",ylab="Ozone level",pch=16)
-```
-
-- Perform a linear regression analysis to assess this:
- + Fit the linear model and print a summary of the output
- + Plot the two variables and overlay a best-fit line
- + What is the equation of the best-fit line in the form
- + $$ y = mx + c$$
- + Calculate the correlation between the two variables using the function `cor` (`?cor`)
- + can you annotate the plot with the correlation coefficient?
-
-## Solution: solution-exercise6.pdf
-```{r fig.show=FALSE}
-mod1 <- lm(weather$Ozone~weather$Temp)
-summary(mod1)
-```
-
-## Solution
-```{r}
-plot(weather$Temp, weather$Ozone, pch=16)
-abline(mod1, col="red", lty=2)
-c = coef(mod1)
-text(95,150, paste("y = ", round(c[2],2), "x",round(c[1],2),sep=""))
-```
-
-## Solution
-```{r}
-plot(weather$Temp, weather$Ozone, pch=16)
-abline(mod1, col="red", lty=2)
-cor = cor(weather$Temp,weather$Ozone,use="c")
-cor
-text(95,150, paste("r^2 = ", round(cor^2,2)))
-```
-
-## Solution
-```{r}
-plot(weather$Temp, weather$Ozone, pch=16)
-abline(mod1, col="red", lty=2)
-cor = cor(weather$Temp,weather$Ozone,use="c")
-cor
-text(95,150, substitute(paste(r^2, "=" ,x),list(x=round(cor^2,2))))
-```
-
-
-## Word of caution
-
-***Correlation != Causation***
-
-
-
-http://tylervigen.com/spurious-correlations
-
-## Word of caution
-
-
-
-[So if I want to win a nobel prize, I should eat even more chocolate?!?!?](http://www.businessinsider.com/chocolate-consumption-vs-nobel-prizes-2014-4?IR=T)
-
-But no-one would ever take such trends seriously....would they?
-
-## Wrong!
-
-
-
-
-# 3. Data Manipulation Techniques
-
-## Motivation
-
-- So far we have been lucky that all our data have been in the same file:
- + This is not usually the case
- + Dataset may be spread over several files
- + This takes longer, and is harder, than many people realise
- + We need to combine before doing an analysis
-
-
-
-## Combining data from multiple sources: Gene Clustering Example
-
-- R has powerful functions to combine heterogeneous data sources into a single data set
-- Gene clustering example data:
- + Gene expression values in ***gene.expression.txt***
- + Gene information in ***gene.description.txt***
- + Patient information in ***cancer.patients.txt***
-- A breast cancer dataset with numerous patient characteristics:
- + We will concentrate on ***ER status*** (positive / negative)
- + What genes show a statistically-significant different change between ER groups?
-
-
-```{r echo=FALSE}
-
-if(!file.exists("gene.expression.txt")){
-
- if(!require(breastCancerNKI) | require(genefilter)) {
- source("http://www.bioconductor.org/biocLite.R")
- biocLite(c("breastCancerNKI","genefilter"))
- }
- data("nki")
- cancer.patients <- pData(nki)[,c("samplename","age","er","grade")]
- genes <- fData(nki)[,c("probe","HUGO.gene.symbol","Cytoband")]
-
- exprs(nki) <- exprs(nki)[!is.na(genes$HUGO.gene.symbol),]
-
- genes <- genes[!is.na(genes$HUGO.gene.symbol),]
-
- ##get the top50 DE genes, plus 500 random
- ps <- NULL
- for(i in 1:nrow(genes)){
- ps[i] <- t.test(exprs(nki)[i,] ~ factor(cancer.patients$er))$p.value
- }
-
- set.seed(070815)
- ind <- order(ps, decreasing = FALSE)[1:50]
- ind <- sort(c(ind, sample(setdiff(1:nrow(genes),ind),500)))
-
- evalues <- exprs(nki)[ind,]
- genes <- genes[ind,]
- library(org.Hs.eg.db)
-
- chr <- select(org.Hs.eg.db, columns=c("CHR","CHRLOC"),keys = as.character(genes$HUGO.gene.symbol),keytype = "SYMBOL")
- genes$Chromosome <- chr[match(genes$HUGO.gene.symbol, chr[,1]),2]
- genes$Chromosome <- ifelse(!is.na(genes$Chromosome),paste0("chr", genes$Chromosome),NA)
- genes$Start <- abs(chr[match(genes$HUGO.gene.symbol, chr[,1]),3])
-
- genes <- genes[,-3]
-
- final <- !is.na(genes$Chromosome)
- genes <- genes[final,]
- evalues <- evalues[final,]
-
-
-
-
-
- write.table(evalues, file="gene.expression.txt",quote=FALSE,sep="\t")
- write.table(genes, file="gene.description.txt",quote=FALSE,sep="\t")
- write.table(cancer.patients, file="cancer.patients.txt",quote=FALSE,sep="\t")
-}
-```
-
-## Analysis goals
-
-- We will show how to lookup a particular gene in the dataset
-- Also, how to look-up genes in a given genomic region
-- Assess if a given gene is differentially-expressed
-- Create a heatmap to cluster the samples and reveal any subgroups in the data
-
-## Peek at the data
-
-```{r eval=FALSE}
-evals <- read.delim("gene.expression.txt")
-evals[1:2,1:5]
-dim(evals)
-```
-
-```{r echo=FALSE}
-evals <- read.delim("gene.expression.txt")
-evals[1:2,1:5]
-dim(evals)
-```
-- `r nrow(evals)` rows and `r ncol(evals)` columns
-+ One row for each gene:
- + Rows are named according to particular technology used to make measurement
- + The names of each row can be returned by `rownames(evals)`; giving a vector
-+ One column for each patient:
- + The names of each column can be returned by `colnames(evals)`; giving a vector
-
-## Peek at the data
-
-```{r}
-genes <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
-head(genes)
-dim(genes)
-```
-
-
-- `r nrow(genes)` rows and `r ncol(genes)` columns
-- One for each gene
-- Includes mapping between manufacturer ID and Gene name
-
-## Peek at the data
-```{r}
-subjects <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
-head(subjects)
-dim(subjects)
-```
-
-- One for each patient in the study
-- Each column is a different characteristic of that patient
- + e.g. whether a patient is ER positive (value of 1) or negative (value of 0)
-
-```{r}
-table(subjects$er)
-```
-
-
-
-## Ordering and sorting
-
-To get a feel for these data, we will look at how we can subset and order
-
-- R allows us to do the kinds of filtering, sorting and ordering operations you might be familiar with in Excel
-- For example, if we want to get information about patients that are ER negative
- + these are indicated by an entry of ***0*** in the `er` column
-
-```{r eval=FALSE}
-subjects$er == 0
-```
-
-```{r echo=FALSE}
-Biobase:::selectSome(subjects$er==0)
-
-```
-
-
-## Ordering and sorting
-
-We can do the comparison within the square brackets
-
-- Remembering to include a `,` to index the columns as well
-- Best practice to create a new variable and leave the original data frame untouched
-
-```{r}
-erNegPatients <- subjects[subjects$er == 0,]
-head(erNegPatients)
-```
-
-or
-
-```{r}
-View(erNegPatients)
-```
-
-
-## Ordering and sorting
-
-Sorting is supported by the **`sort()`** function
-
-- Given a vector, it will return a sorted version of the same length
-
-```{r eval=FALSE}
-sort(erNegPatients$grade)
-```
-
-```{r echo=FALSE}
-Biobase:::selectSome(sort(erNegPatients$grade),maxToShow = 20)
-```
-
-
-- But this is not useful in all cases
- + We have lost the extra information that we have about the patients
-
-## Ordering and sorting
-
-- Instead, we can use **`order()`**
-- Given a vector, `order()` will give a set of numeric values which will give an ordered version of the vector
- + default is smallest --> largest
-
-```{r}
-myvec <- c(90,100,40,30,80,50,60,20,10,70)
-myvec
-order(myvec)
-```
-
-- i.e. number in position 9 is the smallest, number in position 8 is the second smallest:
-
-```{r}
-myvec[9]
-myvec[8]
-```
-
-N.B. `order` will also work on character vectors
-
-```{r}
-firstName <- c("Adam", "Eve", "John", "Mary", "Peter", "Paul", "Joanna", "Matthew", "David", "Sally")
-order(firstName)
-```
-
-
-## Ordering and sorting
-
-- We can use the result of `order()` to perform a subset of our original vector
-- The result is an ordered vector
-```{r}
-myvec.ord <- myvec[order(myvec)]
-myvec.ord
-```
-
-- Implication: We can use `order` on a particular column of a data frame, and use the result to sort all the rows
-
-## Ordering and sorting
-
-- We might want to select the youngest ER negative patients for a follow-up study
-- Here we order the `age` column and use the result to re-order the rows in the data frame
-
-```{r}
-erNegPatientsByAge <- erNegPatients[order(erNegPatients$age),]
-head(erNegPatientsByAge)
-```
-
-```{r}
-
-```
-
-## Ordering and sorting
-
-- can change the behaviour of `order` to be Largest --> Smallest
-```{r}
-erNegPatientsByAge <- erNegPatients[order(erNegPatients$age,decreasing = TRUE),]
-head(erNegPatientsByAge)
-```
-
-- we can write the result to a file if we wish
-
-```{r eval=FALSE}
-write.table(erNegPatientsByAge, file="erNegativeSubjectsByAge.txt", sep="\t")
-```
-
-
-## Exercise: exercise7.Rmd
-
-- Imagine we want to know information about chromosome 8 genes that have been measured.
-1. Create a new data frame containing information on genes on Chromosome 8
-2. Order the rows in this data frame according to start position, and write the results to a file
-
-## Solution: solution-exercise7.pdf
-
-```{r}
-chr8Genes <- genes[genes$Chromosome=="chr8",]
-head(chr8Genes)
-chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
-head(chr8GenesOrd)
-write.table(chr8GenesOrd, "chromosome8.gene.info.txt", sep="\t")
-```
-
-## Alternative:
-
-- you might find the function `subset` a bit easier to use
- + no messing around with square brackets
- + no need to remember row and column indices
- + no need for `$` operator to access columns
-
-```{r}
-chr8Genes <- subset(genes, Chromosome=="chr8")
-head(chr8Genes)
-```
-
-
-## Retrieving data for a particular gene
-
- - Gene `ESR1` is known to be hugely-different between ER positive and negative patient
- + let's check that this is evident in our dataset
- + if not, something has gone wrong!
-- First step is to locate this gene in our dataset
-- We can use `==` to do this, but there are some alternatives that are worth knowing about
-
-## Character matching in R
-
-- `match()` and `grep()` are often used to find particular matches
- + CAUTION: by default, match will only return the ***first*** match!
-
-```{r}
-match("D", LETTERS)
-grep("F", rep(LETTERS,2))
-match("F", rep(LETTERS,2))
-```
-
-## Character matching in R
-
-- `grep` can also do partial matching
- + can also do complex matching using "regular expressions"
-
-```{r}
-month.name
-grep("ary",month.name)
-grep("ber",month.name)
-```
-
-- `%in%` will return a logical if each element is contained in a shortened list
-
-```{r}
-month.name %in% c("May", "June")
-```
-
-
-## Retrieving data for a particular gene
-
-- Find the name of the ID that corresponds to gene ***ESR1*** using `match`
- + mapping between IDs and genes is in the ***genes*** data frame
- + ID in first column, gene name in the second
-- Save this ID as a variable
-
-```{r}
-ind <- match("ESR1", genes$HUGO.gene.symbol)
-genes[ind,]
-probe <- genes[ind,1]
-probe
-```
-
-
-## Retrieving data for a particular gene
-
-Now, find which row in our expression matrix is indexed by this ID
-
-- recall that the rownames of the expression matrix are the probe IDs
-- save the expression values as a variable
-
-```{r}
-match(probe, rownames(evals))
-evals[match(probe, rownames(evals)), 1:10]
-genevals <- evals[match(probe,rownames(evals)),]
-class(genevals)
-```
-
-
-
-## Relating to patient characteristics
-
-We have expression values and want to visualise them against our categorical data
-
-- use a boxplot, for example
-- however, we have to first make sure our values are treat as numeric data
-- as we created the subset of a data frame, the result was also a data frame
- + use `as.numeric`
-
-
-```{r}
-boxplot(as.numeric(genevals) ~ subjects$er)
-```
-
-## Relating to patient characteristics
-
-- In this case there is a clear difference, so we probably don't even need a p-value to convince ourselves of the difference
- + in real-life, we would probably test lots of genes and implement some kind of multiple-testing
- + e.g. `p.adjust` (`?p.adjust`)
-
-```{r}
-t.test(as.numeric(genevals) ~ subjects$er)
-
-```
-
-
-
-## Complete script
-
-`esr1Example.Rmd`
-
-```{r eval=FALSE}
-genes <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
-subjects <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
-evals <- read.delim("gene.expression.txt")
-
-ind <- match("ESR1", genes[,2])
-probe <- genes[ind,1]
-genevals <- evals[match(probe,rownames(evals)),]
-
-boxplot(as.numeric(genevals) ~ subjects$er)
-t.test(as.numeric(genevals) ~ subjects$er)
-```
-
-## Exercise: exercise8.Rmd
-
-Repeat the same steps we performed for the gene ESR1, but for GATA3:
-
-- Try and make as few changes as possible from the ESR1 script
-- Can you see why making a markdown document is useful for analysis?
-
-#4. Programming in R
-
-## Motivation
-
-From the previous exercise, you should see how we can easily adapt our markdown scripts:
-
-- e.g. ESR1 versus GATA3
-- But what if we want to analyse many genes?
-- It would be tedious to create a new markdown document for every gene
-- ...and prone to error too
-
-##Introducing loops
-
-- Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping**
-- Loops are not used in R so often, because we can usually achieve the same thing using vector calculations
-- For example, to add two vectors together, we do not need to add each pair of elements one by one, we can just add the vectors
-```{r eval=FALSE}
-x <- 1:10
-y <- 11:20
-x+y
-```
-- But there are some situations where R functions can not take vectors as input. For example, `t.test()` will only test one gene at a time
-- What if we wanted to test multiple genes?
-
-##Introducing loops
-- We could do this:
-
-```{r eval=FALSE}
-t.test(evals[1,] ~ factor(subjects$er))
-t.test(evals[2,] ~ factor(subjects$er))
-
-```
-
-- But this will be boring to type, difficult to change, and prone to error
-- As we are doing the same thing multiple times, but with a different index each time, we can use a **loop** instead
-
-##Loops: Commands and flow control
-- R has two basic types of loop
- + a **`for`** loop: run some code on every value in a vector
- + a **`while`** loop: run some code while some condition is true (*hardly ever used!*)
-
-`for`
-```{r loops1, eval=FALSE}
-for(i in 1:10) {
- print(i)
- }
-```
-`while`
-```{r eval=FALSE}
-i <- 1
-while(i <= 10 ) {
- print(i)
- i <- i + 1
- }
-```
-
-##Loops: Commands and flow control
-
-- Here's how we might use a `for` loop to test the first 10 genes
-
-
-```{r loops2, eval=FALSE}
-for(i in 1:10) {
- t.test(as.numeric(evals[i,]) ~ factor(subjects$er))
- }
-```
-
-- This is exactly the same as:
-
-```{r eval=FALSE}
-i <- 1
-t.test(evals[i,] ~ factor(subjects$er))
-i <- 2
-t.test(evals[i,] ~ factor(subjects$er))
-i <- 3
-...
-```
-
-
-
-## Storing results
-
-However, this for loop is doing the calculations but not storing the results
-
-- The output of `t.test()` is an object with data placed in different slots
- + the `names()` of the object tells us what data we can retrieve, and what name to use
- + N.B it is a "list" object
-
-```{r}
-t <- t.test(as.numeric(evals[1,]) ~ factor(subjects$er))
-names(t)
-t$statistic
-```
-
-## Storing results
-
-- When using a loop, we often create an empty "dummy" variable
-- This is used store the results at each stage of the loop
-
-```{r}
-stats <- NULL
-for(i in 1:10) {
- tmp <- t.test(as.numeric(evals[i,]) ~ factor(subjects$er))
- stats[i] <- tmp$statistic
- }
-stats
-```
-
-## Practical application
-
-Previously we have identified probes on chromosome 8
-
-- Lets say that we want to do a t-test for each gene on chromosome 8
-```{r}
-head(chr8GenesOrd)
-```
-
-- The first step is to extract the expression values for chromosome 8 genes from our expression matrix, which has expression values for all genes
-- We can use the `match` function to tell us which rows in the matrix correspond to chromosome 8 genes
-
-```{r}
-match(chr8GenesOrd$probe, rownames(evals))
-chr8Expression <- evals[match(chr8GenesOrd$probe, rownames(evals)),]
-dim(chr8Expression)
-```
-
-
-## Exercise: exercise9.Rmd
-
-- Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
-- Store the ***p-value*** from each individual test
-- How many genes have a p-value < 0.05?
-
-## Solution: solution-exercise9.pdf
-
-```{r}
-pvals <- NULL
-for(i in 1:18) {
- tmp <- t.test(as.numeric(chr8Expression[i,]) ~ factor(subjects$er))
- pvals[i] <- tmp$p.value
- }
-pvals
-table(pvals < 0.05)
-sum(pvals < 0.05)
-```
-
-## Solution: solution-exercise9.pdf
-
-- Our code will be more robust if we store the number of chromosome 8 genes as a variable
- + if the data change, the code should still run
-
-```{r}
-ngenes <- nrow(chr8Expression)
-pvals <- NULL
-for(i in 1:ngenes) {
- tmp <- t.test(as.numeric(chr8Expression[i,]) ~ factor(subjects$er))
- pvals[i] <- tmp$p.value
- }
-pvals
-```
-
-
-##Conditional branching: Commands and flow control
-
-- Use an `if` statement for any kind of condition testing
-- Different outcomes can be selected based on a condition within brackets
-
-```{r if, eval=FALSE}
-if (condition) {
- ... do this ...
- } else {
- ... do something else ...
- }
-
-```
-
-- `condition` is any logical value, and can contain multiple conditions.
- + e.g. `(a == 2 & b < 5)`, this is a compound conditional argument
-- The condition should return a *single* value of `TRUE` or `FALSE`
-
-
-
-## Other conditional tests
-
-- There are various tests that can check the type of data stored in a variable
- + these tend to be called **`is...()`**.
- + try *tab-complete* on `is.`
-
-```{r}
-is.numeric(10)
-is.numeric("TEN")
-is.character(10)
-```
-
-- `is.na()` is useful for seeing if an `NA` value is found
- + cannot use `== NA`!
-
-```{r}
-match("foo", genes[,2])
-is.na(match("foo", genes[,2]))
-```
-
-
-##Conditional branching: Commands and flow control
-
-- Using the **`for`** loop we wrote before, we could add some code to plot the expression of each gene
- + a boxplot would be ideal
-- However, we might only want plots for genes with a "significant" pvalue
-- Here's how we can use an `if` statement to test for this
- + for each iteration of the the loop:
- 1. test if the p-value from the test is below 0.05 or not
- 2. if the p-value is less than 0.05 make a boxplot
- 3. if not, do nothing
-
-```{r flow-control,eval=FALSE}
-pdf("Chromosome8Genes.pdf")
-pvals <- NULL
-for (i in 1:18) {
- tmp <- t.test(as.numeric(chr8Expression[i,]) ~ factor(subjects$er))
- pvals[i] <- tmp$p.value
- if(tmp$p.value < 0.05){
- boxplot(as.numeric(chr8Expression[i,]) ~ factor(subjects$er),
- main=chr8Genes$HUGO.gene.symbol[i])
- }
- }
-pvals
-dev.off()
-
-```
-
-
-##Code formatting avoids bugs!
-Compare:
-```{r eval=FALSE}
-f<-26
-while(f!=0){
-print(letters[f])
-f<-f-1}
-```
-to:
-```{r eval=FALSE}
-f <- 26
-while(f != 0 ){
- print(letters[f])
- f <- f-1
- }
-```
-- The code between brackets `{}` *always* is *indented*, this clearly separates what is executed once, and what is run multiple times
-- Trailing bracket `}` always alone on the line at the same indentation level as the initial bracket `{`
-- Use white spaces to divide the horizontal space between units of your code, e.g. around assignments, comparisons
-
-
-
-# 5. Report Writing
-
-
-## Creating a markdown file from scratch
-
-***File → New File → R Markdown***
-
-- Choose 'Document' and the default output type (HTML)
-- A new tab is created in RStudio
-- The header allows you to specify a Page title, author and output type
-```{r eval=FALSE}
----
-title: "Untitled"
-author: "Mark Dunning"
-date: "18/08/2015"
-output: html_document
----
-```
-
-## Format of the file
-
-- **Lines 8 - 10**: Plain text description
-- **Lines 12 - 14**: An R code 'chunk'
-- **Lines 18 to 20**: Another code chunk, this time producing a plot
-
-
-
-- Pressing the ***Knit HTML*** button will create the report:
- + Note that you need to 'save' the markdown file before you will see the compiled report in your working directory
-
-##Text formatting
-See ***?*** → ***Markdown Quick Reference*** in RStudio:
-
-- Enclose text in \* to format in *italics*
-- Enclose text in \*\* to format in **bold**
-- \*\*\* for ***bold italics***
-- \` to format like `code`
-- \$ to include equations: $e =mc^2$
-- \> quoted text:
-
->To be or not to be
-
-- See **Help -> Markdown Quick Reference** for more:
- + Adding images
- + Adding web links
- + Tables
-
-
-## Not quite enough for a reproducible document
-
-- Minimally, you should record what version of R, and the packages you used.
-- Use the `sessionInfo()` function
- + e.g. for the version of R I used to make the slides
-
-```{r}
-sessionInfo()
-```
-
-
-## Defining chunks
-
-- It is not great practice to have one long, continuous R script
-- Better to break-up into smaller pieces; '*chunks*'
-- You can document each chunk separately
-- Easier to catch errors
-- The characteristics of each chunk can be modified:
- + You might not want to print the R code for each chunk
- + ...or the output
- + etc.
-
-
-## Chunk options
-
-Code chunks are encapsulated between backticks. Options for the chunk can be put inside the curly brackets`{...}`
-
-```
-'''{r}
-my code here...
-'''
-```
-
-- It's a good idea to name each chunk
- + Easier to track-down errors
-- We can display R code, but not run it
- + `eval=FALSE`
-- We can run R code, but not display it
- + `echo=FALSE`
- + e.g. setting display options
-- Suppress warning messages
- + `warning=FALSE`
-
-
-## Chunk options: eval
-
-- Sometimes we want to format code for display, but not execute; we want to show the code for how we read our data, but want our report to compile quickly
-
-```
-'''{r, eval=FALSE}
-data <- read.delim("path.to.my.file")
-'''
-```
-
-
-## Chunk options: echo
-
-- Might want to load some data from disk
- + e.g. the R object from reading the data in the previous slide
-```
-'''{r echo=FALSE}
-load("mydata.rda")
-'''
-```
-- or your P.I. wants to see your results, but doesn't really want to know about the R code that you used
-
-## Chunk options: results
-
-- Some code or functions might produce lots of output to the screen that we don't need
- + `results='hide'`
-```{r results='hide'}
-for(i in 1:100) {
- print(i)
- }
-```
-
-
-##Chunk options: message and warning
-
-- Loading an R package will sometimes print messages and / or warnings to the screen
-- This is not always helpful in a report:
-```
-'''{r}
-library(DESeq)
-'''
-```
-
-```{r echo=FALSE}
-library(DESeq)
-```
-
-##Chunk options: message and warning
-
-- Using `message=FALSE` and `warning=FALSE`
-```
-'''{r message=FALSE, warning=FALSE}
-library(DESeq)
-'''
-```
-- Could also need `suppressPackageStartupMessages`
-
-##Chunk options: cache
-
-- The argument `cache=TRUE` will stop certain chunks from being evaluate if their code does not change
-- It speeds-up the compilation of the document
- + we don't want to reload our dataset if we've only made a tiny change downstream
-```
-'''{r echo=FALSE, cache=TRUE}
-load("mydata.rda")
-'''
-```
-
-## Running R code from the main text
-
-- We can add R code to our main text, which gets evaluated
- + make sure we always have the latest figures, p-values etc
-
-```
-'''
-...the sample population consisted of 'r table(gender)[1]' females
-and 'r table(gender)[2]' males...
-'''
-```
-```{r echo=FALSE}
-gender <- c(rep("F", 47), rep("M", 50))
-```
-...the sample population consisted of `r table(gender)[1]` females and `r table(gender)[2]` males...
-
-- Alternatively:
-
-```
-'''
-...the p-value of the t-test is 'r pval', which indicates that...
-'''
-```
-```{r echo=FALSE}
-pval <- 0.05
-```
-...the p-value of the t-test is `r pval`, which indicates that...
-
-- We call this **"in-line" code**
-
-## Running R code from the main text
-
-- Like the rest of our report these R statements will get updated each time we compile the report
-
-```
-'''
-...the sample population consisted of 'r table(gender)[1]' females
-and 'r table(gender)[2]' males...
-'''
-```
-
-```{r echo=FALSE}
-gender <- c(rep("F", 41), rep("M", 54))
-```
-...the sample population consisted of `r table(gender)[1]` females and `r table(gender)[2]` males...
-
-
-
-```
-'''
-...the p-value of the t-test is 'r pval', which indicates that...
-'''
-```
-```{r echo=FALSE}
-pval <- 0.1
-```
-...the p-value of the t-test is `r pval`, which indicates that...
-
-
-## Making a heatmap
-
-- A heatmap is often used to visualise how the expression level of a set of genes vary between conditions
-- Making the plot is actually quite straightforward
- + providing you have processed the data appropriately!
- + here, we use `na.omit()` to ensure we have no `NA` values
-
-```{r}
-genelist <- c("ESR1", "NAT1", "SUSD3","SLC7A2" ,"SCUBE2")
-probes <- na.omit(genes[match(genelist, genes[,2]), 1])
-exprows <- match(probes, rownames(evals))
-
-heatmap(as.matrix(evals[exprows,]))
-
-
-```
-
-## Heatmap adjustments
-
-- We can provide a colour legend for the samples
-- Adjust colour of cells
-
-```{r}
-library(RColorBrewer)
-sampcol <- rep("blue", ncol(evals))
-sampcol[subjects$er == 1 ] <- "yellow"
-rbPal <- brewer.pal(10, "RdBu")
-heatmap(as.matrix(evals[exprows,]), ColSideColors = sampcol, col=rbPal)
-```
-
-- see also
- + `heatmap.2` from `library(gplots)`; `example(heatmap.2)`
- + `heatmap.plus` from `library(heatmap.plus)`; `example(heatmap.plus)`
-
-
-## Exercise
-
-This analysis is recorded in `exercise10.Rmd`.
-
-- Use in-line R code to report how many patients were involved in the study
-- Hide the code chunk used to produce the plot
-- Cache the code chunk used to read the raw data
-- Print the version of R, and version numbers of all packages, used to do the analysis
-
-Solution: solution-exercise10.pdf
-
-# End of Course
-
-## Wrap-up
-
-- **Thanks for your attention**
-- Practice, practice, practice
- + ... & persevere
-- Need inspiration? R code is freely-available, so read other people's code!
- + Read [blogs](http://www.r-bloggers.com/)
- + Follow the [forums](http://stackoverflow.com/questions/tagged/r)
- + Download [datasets](http://vincentarelbundock.github.io/Rdatasets/datasets.html) to practice with
- + Bookmark some [reference](https://en.wikibooks.org/wiki/R_Programming) guides
- + on twitter @rstudio, @Rbloggers, @RLangTip
- + Attend the [follow-on course](http://training.csx.cam.ac.uk/bioinformatics/event/1800066) on data manipulation and graphics
-- Please fill in the feedback form for us to improve the course
-
diff --git a/Slides-day2.html b/Slides-day2.html
deleted file mode 100644
index 804f0cc..0000000
--- a/Slides-day2.html
+++ /dev/null
@@ -1,1964 +0,0 @@
-
-
-
-
-
-
-
-
-
- Introduction to Solving Biological Problems Using R - Day 2
-
-
-
-
-
-
-
-
-
Introduction to Solving Biological Problems Using R - Day 2
-
-Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
-
-
Last modified: 06 Sep 2016
-
-
-
Day 2 Schedule
-
-
Further customisation of plots
-
Statistics
-
Data Manipulation Techniques
-
Programming in R
-
Further report writing
-
-
-
1. Further customisation of plots
-
Recap
-
-
We have seen how to use plot(), boxplot() , hist() etc to make simple plots
-
These come with arguments that can be used to change the appearance of the plot
-
-
col, pch
-
main, xlab, ylab
-
etc….
-
-
We will now look at ways to modify the plot appearance after it has been created
-
Also, how to export the graphs
-
-
-
The painter’s model
-
-
R employs a painter’s model to construct it’s plots
-
Elements of the graph are added to the canvas one layer at a time, and the picture built up in levels.
-
Lower levels are obscured by higher levels,
-
-
allowing for blending, masking and overlaying of objects.
-
-
Caution: You can’t undo the changes you make to the plot
First_Name Second_Name Full_Name Sex Age Weight Consent
-2 Eve Parker Eve Parker Female 21 67.9 TRUE
-4 Mary Davis Mary Davis Female 45 61.9 TRUE
-7 Joanna Edwards Joanna Edwards Female 42 63.5 FALSE
-10 Sally Wilson Sally Wilson Female 62 64.8 TRUE
-
-
Adding points to differentiate gender
-
-
Again, we have to be careful that all the points are within the x and y limits
-
-
this is why creating the blank plot containing the limits of the data is useful
If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
-
-
\[t = \frac{\bar{x} -\mu_0}{s / \sqrt(n)}\]
-
-
Say a Wind Speed of 2; which from the histogram seems to be unlikely
-
-
t <-(tempMean -2) /(tempSD/sqrt(length(weather$Wind)))
-t
-
[1] 27.93897
-
-
Simple testing
-
-
…or use the t.test() function to compute the statistic and corresponding p-value
-
-
t.test(weather$Wind, mu=2)
-
- One Sample t-test
-
-data: weather$Wind
-t = 27.939, df = 152, p-value < 2.2e-16
-alternative hypothesis: true mean is not equal to 2
-95 percent confidence interval:
- 9.394804 10.520229
-sample estimates:
-mean of x
- 9.957516
We have already seen that men in our patients dataset tend to be heavier than women
-
We can test this formally in R
-
-
-
-
Test variance assumption
-
var.test(patients$Weight~patients$Sex)
-
- F test to compare two variances
-
-data: patients$Weight by patients$Sex
-F = 1.759, num df = 3, denom df = 5, p-value = 0.5417
-alternative hypothesis: true ratio of variances is not equal to 1
-95 percent confidence interval:
- 0.2265757 26.1830147
-sample estimates:
-ratio of variances
- 1.759041
- Two Sample t-test
-
-data: patients$Weight by patients$Sex
-t = -5.4584, df = 8, p-value = 0.0006027
-alternative hypothesis: true difference in means is not equal to 0
-95 percent confidence interval:
- -10.893759 -4.422908
-sample estimates:
-mean in group Female mean in group Male
- 64.52500 72.18333
-
-
This function can be tuned in various ways (?t.test):
-
-
Assumed equal variances, or not (and use Welch’s correction)
-
Deal with paired samples
-
Two-sided, or one-sided p-value
-
-
-
-
Linear regression: Basic data analysis
-
-
Linear modeling is supported by the function lm():
-
-
example(lm)
-
The output assumes you know a fair bit about the subject
-
-
lm is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson’s correlation coefficient
-
-
This is very easy in R
-
-
Three steps to plotting with a best fit line:
-
-
Plot XY scatter-plot data
-
Fit a linear model
-
Add bestfit line data to plot with abline() function
-
-
-
-
Typical linear regression analysis: Basic data analysis
-
-
The ~ (tilde) is used to define a formula; i.e. “y is given by x”
Typical linear regression analysis: Basic data analysis
-
-
Get the coefficients of the fit from:
-
-
coef(myModel) # Coefficients
-resid(myModel) # Residuals
-fitted(myModel) # Fitted values
-names(myModel) # Names of the objects within myModel
-residuals(myModel) +fitted(myModel) # what values does this give?
R has a very powerful formula syntax for describing statistical models
-
Suppose we had two explanatory variables x and z, and one response variable y
-
We can describe a relationship between, say, y and x using a tilde ~, placing the response variable on the left of the tilde and the explanatory variables on the right:
-
-
y~x
-
-
It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
-
-
y~x #If x is continuous, this is linear regression
-y~x #If x is categorical, ANOVA
-y~x+z #If x and z are continuous, multiple regression
-y~x+z #If x and z are categorical, two-way ANOVA
-y~x+z+x:z # : is the symbol for the interaction term
-y~x*z # * is a shorthand for x+z+x:z
-
-
Exercise: exercise6.Rmd
-
-
There are suggestions that Ozone level could be influenced by Temperature:
-
-
-
-
Perform a linear regression analysis to assess this:
-
-
Fit the linear model and print a summary of the output
-
Plot the two variables and overlay a best-fit line
-
What is the equation of the best-fit line in the form
-
-
\[ y = mx + c\]
-
-
Calculate the correlation between the two variables using the function cor (?cor)
-
-
can you annotate the plot with the correlation coefficient?
We have expression values and want to visualise them against our categorical data
-
-
use a boxplot, for example
-
however, we have to first make sure our values are treat as numeric data
-
as we created the subset of a data frame, the result was also a data frame
-
-
use as.numeric
-
-
-
boxplot(as.numeric(genevals) ~subjects$er)
-
-
-
Relating to patient characteristics
-
-
In this case there is a clear difference, so we probably don’t even need a p-value to convince ourselves of the difference
-
-
in real-life, we would probably test lots of genes and implement some kind of multiple-testing
-
e.g. p.adjust (?p.adjust)
-
-
-
t.test(as.numeric(genevals) ~subjects$er)
-
- Welch Two Sample t-test
-
-data: as.numeric(genevals) by subjects$er
-t = -38.746, df = 205.88, p-value < 2.2e-16
-alternative hypothesis: true difference in means is not equal to 0
-95 percent confidence interval:
- -1.246953 -1.126198
-sample estimates:
-mean in group 0 mean in group 1
- -1.17388506 0.01269076
Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
It is not great practice to have one long, continuous R script
-
Better to break-up into smaller pieces; ‘chunks’
-
You can document each chunk separately
-
Easier to catch errors
-
The characteristics of each chunk can be modified:
-
-
You might not want to print the R code for each chunk
-
…or the output
-
etc.
-
-
-
-
Chunk options
-
Code chunks are encapsulated between backticks. Options for the chunk can be put inside the curly brackets{...}
-
'''{r}
-my code here...
-'''
-
-
It’s a good idea to name each chunk
-
-
Easier to track-down errors
-
-
We can display R code, but not run it
-
-
eval=FALSE
-
-
We can run R code, but not display it
-
-
echo=FALSE
-
e.g. setting display options
-
-
Suppress warning messages
-
-
warning=FALSE
-
-
-
-
Chunk options: eval
-
-
Sometimes we want to format code for display, but not execute; we want to show the code for how we read our data, but want our report to compile quickly
Welcome to Bioconductor
-
- Vignettes contain introductory material; view with
- 'browseVignettes()'. To cite Bioconductor, see
- 'citation("Biobase")', and for packages 'citation("pkgname")'.
-
Loading required package: locfit
-
locfit 1.5-9.1 2013-03-22
-
Loading required package: lattice
-
Welcome to 'DESeq'. For improved performance, usability and
- functionality, please consider migrating to 'DESeq2'.