bioinformatics-core-shared-training · CristianRiccio · Dec 21, 2016 · Dec 22, 2016 · Dec 22, 2016 · Dec 22, 2016
diff --git a/Basic_R_Course.zip b/Basic_R_Course.zip
diff --git a/Introduction to Solving Biological Problems Using R - Day 1.pdf b/Introduction to Solving Biological Problems Using R - Day 1.pdf
diff --git a/Introduction to Solving Biological Problems Using R - Day 2.pdf b/Introduction to Solving Biological Problems Using R - Day 2.pdf
diff --git a/Session1.1-intro.Rmd b/Session1.1-intro.Rmd
diff --git a/Session1.1-intro.nb.html b/Session1.1-intro.nb.html
diff --git a/Session1.2-data-structures.Rmd b/Session1.2-data-structures.Rmd
diff --git a/Session1.2-data-structures.nb.html b/Session1.2-data-structures.nb.html
diff --git a/Session1.3-walkthrough.Rmd b/Session1.3-walkthrough.Rmd
@@ -0,0 +1,293 @@
+---
+title: "Introduction to Solving Biological Problems Using R - Day 1"
+author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
+  Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
+date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
+output:
+  html_notebook:
+    toc: yes
+    toc_float: yes
+---
+
+# 3. R for data analysis
+
+##3 steps to Basic Data Analysis
+
+- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
+
+1. Reading in data
+    + `read.table()`
+    + `read.csv(), read.delim()`
+2. Analysis
+    + Manipulating & reshaping the data
+        + perhaps dealing with "missing data"
+    + Any maths you like
+    + Diagnostic Plots
+3. Writing out results
+    + `write.table()`
+    + `write.csv()`
+
+## A simple walkthrough
+
+- We have data from 100 patients that given consent for their data to use in future studies
+- A researcher wants to undertake a study involving people that are overweight
+- We will walkthrough how to filter the data and write a new file with the candidates for the study    
+
+##The Working Directory (wd)
+
+
+- Like many programs R has a concept of a working directory 
+- It is the place where R will look for files to execute and where it will
+save files, by default
+- For this course we need to set the working directory to the location
+of the course scripts
+- In RStudio use the mouse and browse to the directory where you saved the Course Materials
+
+- ***Session → Set Working Directory → Choose Directory...***
+
+## 0. Locate the data
+
+Before we even start the analysis, we need to be sure of where the data are located on our hard drive
+
+- Functions that import data need a file location as a character vector
+- The default location is the ***working directory***
+```{r}
+getwd()
+```
+
+- If the file you want to read is in your working directory, you can just use the file name
+
+```{r eval=FALSE}
+list.files()
+```
+
+- The `file.exists` function does exactly what it says on the tin!
+    + a good sanity check for your code
+
+```{r}
+file.exists("patient-info.txt")
+```
+
+- Otherwise you need the *path* to the file
+    + you can get this using **`file.choose()`**
+
+- If you unsure about specifying a file path at the command line, this [online tutorial](http://rik.smith-unna.com/command_line_bootcamp/?id=vczhybjhtyt) will give you hands-on practice
+
+##1. Read in the data
+
+- The data are a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
+- We need to read in the results and assign it to an object (`patients`)
+
+```{r}
+patients <- read.delim("patient-info.txt")
+
+```
+
+In the latest RStudio, there is the option to import data directly from the File menu. ***File*** -> ***Import Dataset*** -> ***From Csv***
+
+- If the data are comma-separated, then use either the argument `sep=","` or the function `read.csv()`:
+- You need to make sure you use the correct function
+    + can you explain the output of the following lines of code?
+
+```{r }
+tmp <- read.csv("patient-info.txt")
+head(tmp)
+```
+- For full list of arguments:
+```{r}
+?read.table
+```
+
+##1b. Check the data
+- *Always* check the object to make sure the contents and dimensions are as you expect
+- R will sometimes create the object without error, but the contents may be un-usable for analysis
+    + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
+
+```{r}
+# View the first 10 rows to ensure import is OK
+patients[1:10,]  
+```
+
+
+- or use the `View()` function to get a display of the data in RStudio:
+```{r}
+View(patients)
+```
+
+##1c. Understanding the object
+
+- Once we have read the data successfully, we can start to interact with it
+- The object we have created is a *data frame*:
+```{r}
+class(patients)
+```
+
+- We can query the dimensions:
+
+```{r}
+ncol(patients)
+nrow(patients)
+dim(patients)
+```
+
+
+- The names of the columns are automatically assigned:
+
+```{r}
+colnames(patients)
+```
+
+- We can use any of these names to access a particular column:
+    + and create a vector
+    + TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
+```{r}
+patients$ID
+
+```
+
+## Word of warning
+
+
+![](images/tolstoy.jpg)
+
+
+
+![](images/hadley.jpg)
+
+> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
+
+You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them
+
+- http://www.datacarpentry.org/spreadsheet-ecology-lesson/
+- http://kbroman.org/dataorg/
+
+##Handling missing values
+
+- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
+- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc)
+- `NA` is not the same as `NULL`:
+    - `NULL` is an empty R object. 
+    - `NA` is one missing value within an R object (like a data frame or a vector)
+- Often R functions will handle `NA`s gracefully:
+
+```{r}
+length(patients$Height)
+mean(patients$Height)
+```
+
+- However, sometimes we have to tell the functions what to do with them. 
+- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
+    + annoyingly, different functions have different argument names to change their behaviour with regards to `NA` values. *Always check the documentation*
+
+```{r}
+mean(patients$Height, na.rm = TRUE)
+
+mean(na.omit(patients$Height))
+```
+
+##2. Analysis (reshaping data and maths)
+
+- Our analysis involves identifying patients with extreme BMI
+    + we will define this as being two standard deviations from the mean
+
+```{r}
+# Create an index of results:
+BMI <- (patients$Weight)/((patients$Height/100)^2)
+upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
+upper.limit
+```
+
+
+- We can plot a simple chart of the BMI values 
+    + add a vertical line to indicate the cut-off
+    + plotting will be covered in detail shortly..
+
+```{r}
+plot(BMI)
+# Add a horizonal line:
+abline(h=upper.limit) 
+```
+
+- It is also useful to save the variable we have computed as a new column in the data frame
+
+```{r}
+round(BMI,1)
+patients$BMI <- round(BMI,1)
+head(patients)
+```
+
+- To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
+    + if the second line looks a bit weird, remember that `<-` is doing an assignment. Thevalue we are assigning to our new variable is the logical (`TRUE` or `FALSE`) vector given by testing each item in `BMI` against the `upper.limit`
+
+```{r}
+BMI > upper.limit
+candidates <- BMI > upper.limit
+```
+
+We have seen that a logical vector can be used to subset a data frame
+
+- However, in our case the result looks a bit funny
+- Can you think why this might be?
+
+```{r}
+patients[candidates,]
+```
+
+The `which` function will take a logical vector and return the indices of the `TRUE` values
+
+- This can then be used to subset the data frame
+
+```{r}
+which(BMI > upper.limit)
+candidates <- which(BMI > upper.limit)
+```
+
+
+## 3. Outputting the results
+
+- We write out a data frame of candidates (patients with BMI more than standard deviations from the mean) as a 'comma separated values' text file (CSV):
+
+```{r}
+write.csv(patients[candidates,], file="selectedSamples.csv")
+```
+
+- The output file is directly-readable by Excel
+- It's often helpful to double check where the data has been saved. Use the *get working directory* function:
+
+```{r eval=FALSE}
+getwd()      # print working directory
+list.files() # list files in working directory
+
+```
+
+
+To recap, the set of R commands we have used is:-
+
+```{r}
+patients <- read.delim("patient-info.txt")
+BMI <- (patients$Weight)/((patients$Height/100)^2)
+upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
+plot(BMI)
+# Add a horizonal line:
+abline(h=upper.limit) 
+patients$BMI <- round(BMI,1)
+candidates <- which(BMI > upper.limit)
+write.csv(patients[candidates,], file="selectedSamples.csv")
+
+```
+
+##Exercise: Exercise 3
+
+- A separate study is looking for patients that are underweight and also smoke; 
+  + Modify the condition in our previous code to find these patients
+  + e.g. having BMI that is 2 standard deviations *less* than the mean BMI
+  + Write out a results file of the samples that match these criteria, and open it in a spreadsheet program
+
+
+```{r}
+### Your Answer Here ### 
+
+
+
+```
+
diff --git a/Session1.3-walkthrough.nb.html b/Session1.3-walkthrough.nb.html