Open
45 commits
a0a9a63
version of slides more-compatible with being shown from the markdown …
markdunning Dec 21, 2016
dde5048
New R notebooks to replace the Day 1 and 2 slides
markdunning Dec 22, 2016
cc5eb7b
add convenience scripts to generate data, images and zip file
markdunning Dec 22, 2016
585be6a
Update solutions for Day 1
markdunning Dec 22, 2016
adfb46e
more solution updates
markdunning Dec 22, 2016
1b45261
small update I missed to exercise 7
markdunning Dec 22, 2016
a81cc87
New solutions zip file
markdunning Dec 22, 2016
145f679
separate example of cbind and rbind into different chunks
markdunning Jan 10, 2017
71036be
Minor grammar change
markdunning Jan 10, 2017
5a2f8f1
call a script to create the patients data
markdunning Jan 10, 2017
08a5c75
Add some blank code chunks for solutions
markdunning Jan 10, 2017
35f7f27
add new script to create patients data
markdunning Jan 10, 2017
d2e3c15
Add files from Jan 2017 run of course using R notebooks
markdunning Jan 24, 2017
9d5ea04
Day 1 re-compiled to have contents in HTML notebook
markdunning Jan 27, 2017
8ce3e1c
Re-compile Day 2
markdunning Jan 27, 2017
959adb7
Revert exercise 6 to linear regression example
markdunning Jan 27, 2017
cb6c15e
re-compile exercise 6
markdunning Jan 27, 2017
886185d
Merge pull request #9 from cambiotraining/cruk
cambioinfo Feb 14, 2017
d5ef070
- add a missing sentance in the instructions for exercise 5a
markdunning Feb 23, 2017
5b267c5
Correct typos and issues from Feb 2017 course
markdunning Mar 1, 2017
9820883
correct the path to the images files
markdunning May 12, 2017
5d9b75e
Add css file to zip (required to preview the notebook)
markdunning May 12, 2017
f3ad08f
try and fix image files inside course zip
markdunning May 12, 2017
bbfa078
Update the link to the spreadsheet lesson from Data Carpentry
markdunning May 15, 2017
feca6cf
Add missing images for stats exercise
markdunning May 16, 2017
81735bf
Add Hint about using paste in the stats exercise
markdunning May 25, 2017
ffe9430
Re-compile some sections with a larger patient cohort dataset
markdunning Jun 12, 2017
d62f8e1
Re-write exercise 3 solution
markdunning Jun 12, 2017
3322c56
re-generate zip file
markdunning Jun 12, 2017
e47ded2
Hint about producing a data frame in reverse order
markdunning Jun 15, 2017
3951f05
Fix typo
markdunning Jun 15, 2017
d6fbf3e
Delete todo
markdunning Jun 15, 2017
6fc304d
add example of pos and adj
markdunning Jun 16, 2017
ddcaf83
Merge branch 'master' of https://github.com/cambiotraining/r-intro
markdunning Jun 16, 2017
893813a
correct small typo in one code chunk definition
markdunning Jun 16, 2017
72a3282
add extra hint about changing the names under each box
markdunning Jun 16, 2017
1940039
Add note about different solution to exercise 5 where a single vector…
markdunning Jun 16, 2017
5fb6c02
small change to where axis labels are printed in axis example
markdunning Jun 16, 2017
aeedc0a
Change colour for the wind speed histogram to avoid possible clash wi…
markdunning Jun 16, 2017
c27b39f
Move matrices examples to the end of section
markdunning Jun 16, 2017
051b932
Remove example of subsetting the matrix e (which no longer exists)
markdunning Jun 16, 2017
81bfdc1
correct a bad variable name
markdunning Jun 16, 2017
f5f5087
- add a couple of extra explanation sentances
markdunning Jun 16, 2017
0a99aba
clarity of use parameter in cor function
ruidlpm Oct 25, 2018
4c9293e
Merge pull request #23 from ruidlpm/patch-1
tavareshugo Nov 2, 2018
Binary file modified Basic_R_Course.zip
Binary file not shown.
Binary file not shown.
Binary file not shown.
592 changes: 592 additions & 0 deletions Session1.1-intro.Rmd

Large diffs are not rendered by default.

1,140 changes: 1,140 additions & 0 deletions Session1.1-intro.nb.html

Large diffs are not rendered by default.

463 changes: 463 additions & 0 deletions Session1.2-data-structures.Rmd

Large diffs are not rendered by default.

1,389 changes: 1,389 additions & 0 deletions Session1.2-data-structures.nb.html

Large diffs are not rendered by default.

293 changes: 293 additions & 0 deletions Session1.3-walkthrough.Rmd
@@ -0,0 +1,293 @@
---
title: "Introduction to Solving Biological Problems Using R - Day 1"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
  Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
---

# 3. R for data analysis

## 3 steps to Basic Data Analysis

- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:

1. Reading in data
    + `read.table()`
    + `read.csv()`, `read.delim()`
2. Analysis
    + Manipulating & reshaping the data
    + perhaps dealing with "missing data"
    + Any maths you like
    + Diagnostic plots
3. Writing out results
    + `write.table()`
    + `write.csv()`

## A simple walkthrough

- We have data from 100 patients who have given consent for their data to be used in future studies
- A researcher wants to undertake a study involving people who are overweight
- We will walk through how to filter the data and write a new file with the candidates for the study

## The Working Directory (wd)


- Like many programs, R has a concept of a working directory
- By default, it is the place where R will look for files to read or execute and where it will save files
- For this course we need to set the working directory to the location of the course scripts
- In RStudio use the mouse and browse to the directory where you saved the Course Materials

- ***Session → Set Working Directory → Choose Directory...***
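The same change can be made from code with `setwd()`. The following is a minimal sketch only: `tempdir()` is a stand-in for the folder where you saved the course materials, and the variable names `old_wd` and `course_dir` are made up for this illustration.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is only a stand-in; substitute the folder holding the course materials
course_dir <- tempdir()
setwd(course_dir)

# Confirm that the working directory has changed
getwd()

# Go back to where we started
setwd(old_wd)
```

Restoring the previous directory at the end keeps the sketch from affecting the rest of your session.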

## 0. Locate the data

Before we even start the analysis, we need to be sure of where the data are located on our hard drive

- Functions that import data need a file location as a character vector
- The default location is the ***working directory***
```{r}
getwd()
```

- If the file you want to read is in your working directory, you can just use the file name

```{r eval=FALSE}
list.files()
```

- The `file.exists` function does exactly what it says on the tin!
    + a good sanity check for your code

```{r}
file.exists("patient-info.txt")
```

- Otherwise you need the *path* to the file
    + you can get this using **`file.choose()`**

- If you are unsure about specifying a file path at the command line, this [online tutorial](http://rik.smith-unna.com/command_line_bootcamp/?id=vczhybjhtyt) will give you hands-on practice
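As a rough illustration of working with paths, the sketch below builds a path with `file.path()` and checks it with `file.exists()`. The folder name `course-data` and the file `example.txt` are hypothetical and are created in a temporary directory purely for the demonstration.

```r
# Build a path in a platform-independent way; "course-data" is a made-up folder name
data_dir <- file.path(tempdir(), "course-data")
dir.create(data_dir, showWarnings = FALSE)

# Create a small example file so that there is something to find
example_file <- file.path(data_dir, "example.txt")
writeLines(c("ID\tHeight", "P1\t170"), example_file)

# The full path points at a file that exists
file.exists(example_file)

# A bare file name is only looked up in the working directory,
# so this is TRUE only if such a file happens to be there
file.exists("example.txt")
```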

## 1. Read in the data

- The data are stored in a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
- We need to read in the data and assign them to an object (`patients`)

```{r}
patients <- read.delim("patient-info.txt")

```

Recent versions of RStudio also offer the option to import data directly from the File menu: ***File*** -> ***Import Dataset*** -> ***From Csv***

- If the data are comma-separated, then either set the argument `sep=","` or use the function `read.csv()`
- You need to make sure you use the correct function
    + can you explain the output of the following lines of code?

```{r }
tmp <- read.csv("patient-info.txt")
head(tmp)
```
- For the full list of arguments:
```{r}
?read.table
```

## 1b. Check the data

- *Always* check the object to make sure the contents and dimensions are as you expect
- R will sometimes create the object without error, but the contents may be unusable for analysis
    + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column

```{r}
# View the first 10 rows to ensure import is OK
patients[1:10,]
```
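The effect of a wrong separator can be seen on a small scale. The sketch below uses a made-up three-column file written to a temporary location (not the course data), and compares the dimensions obtained with `read.delim()` and `read.csv()`.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("ID\tHeight\tWeight",
             "P1\t170\t65",
             "P2\t182\t90"), tmp_file)

# Correct function for tab-delimited data: the three columns are found
good <- read.delim(tmp_file)
dim(good)

# Wrong separator: read.csv() looks for commas, finds none,
# and puts each whole line into a single column
bad <- read.csv(tmp_file)
dim(bad)
```

Comparing `dim()` or `ncol()` against what you expect is a quick way to catch this kind of import problem.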


- or use the `View()` function to get a display of the data in RStudio:
```{r}
View(patients)
```

## 1c. Understanding the object

- Once we have read the data successfully, we can start to interact with it
- The object we have created is a *data frame*:
```{r}
class(patients)
```

- We can query the dimensions:

```{r}
ncol(patients)
nrow(patients)
dim(patients)
```


- The names of the columns are assigned automatically, here from the header line of the file:

```{r}
colnames(patients)
```

- We can use any of these names to access a particular column:
    + this creates a vector
    + TOP TIP: type the name of the object followed by `$` and hit TAB: you can select the column from the drop-down list!
```{r}
patients$ID

```
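For comparison, here is a small sketch of the common ways to pull a column out of a data frame. The data frame `df` and its values are made up for illustration and are not the course data.

```r
# A toy data frame with made-up values
df <- data.frame(ID = c("P1", "P2", "P3"),
                 Height = c(170, 182, 165))

# $ extracts one column as a vector
df$Height

# [[ ]] does the same, and also works when the column name is stored in a variable
column <- "Height"
df[[column]]

# Single brackets with a name keep the result as a one-column data frame
class(df["Height"])
```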

## Word of warning


![](images/tolstoy.jpg)



![](images/hadley.jpg)

> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)

You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider whether your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them:

- http://www.datacarpentry.org/spreadsheet-ecology-lesson/
- http://kbroman.org/dataorg/

## Handling missing values

- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc)
- `NA` is not the same as `NULL`:
    - `NULL` is an empty R object.
    - `NA` is one missing value within an R object (like a data frame or a vector)
- Many R functions accept `NA`s without raising an error, although a summary such as `mean()` returns `NA` if any value is missing:

```{r}
length(patients$Height)
mean(patients$Height)
```

- However, sometimes we have to tell the functions what to do with them.
- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
    + annoyingly, different functions have different argument names to change their behaviour with regards to `NA` values. *Always check the documentation*

```{r}
mean(patients$Height, na.rm = TRUE)

mean(na.omit(patients$Height))
```
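To make the points above concrete, the sketch below uses two short made-up vectors (not the course data) to contrast `NA` with `NULL` and to show three base R functions that each spell their `NA`-handling argument differently.

```r
# Made-up measurements with one missing value each
height <- c(170, NA, 165, 180)
weight <- c(65, 90, NA, 82)

is.na(height)    # which values are missing?
length(NULL)     # NULL is empty, so its length is 0
length(NA)       # NA is one (missing) value, so its length is 1

mean(height)                 # NA, because one value is missing
mean(height, na.rm = TRUE)   # mean() uses the na.rm argument

# cor() uses the 'use' argument rather than na.rm
cor(height, weight, use = "complete.obs")

# table() uses useNA to decide whether missing values are counted
table(height, useNA = "ifany")
```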

## 2. Analysis (reshaping data and maths)

- Our analysis involves identifying patients with an extreme BMI
    + we will define this as being more than two standard deviations above the mean

```{r}
# Calculate BMI, then an upper cut-off two standard deviations above the mean:
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
upper.limit
```
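The calculation is vectorised: the arithmetic is applied to every patient at once. A small worked sketch with made-up values follows; it assumes weight is recorded in kilograms and height in centimetres, which is what the division by 100 in the code above suggests.

```r
# Made-up patients: weight assumed to be in kg, height assumed to be in cm
weight <- c(70, 95, 54)
height <- c(175, 180, 160)

# The same formula is applied to every element of the vectors
bmi <- weight / (height / 100)^2
round(bmi, 1)

# The first patient worked by hand: 70 kg at 1.75 m
70 / 1.75^2

# A cut-off defined in the same way as upper.limit above
mean(bmi) + 2 * sd(bmi)
```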


- We can plot a simple chart of the BMI values
    + add a horizontal line to indicate the cut-off
    + plotting will be covered in detail shortly

```{r}
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit)
```

- It is also useful to save the variable we have computed as a new column in the data frame

```{r}
round(BMI,1)
patients$BMI <- round(BMI,1)
head(patients)
```

- To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
    + if the second line looks a bit weird, remember that `<-` is doing an assignment. The value we are assigning to our new variable is the logical (`TRUE` or `FALSE`) vector given by testing each item in `BMI` against the `upper.limit`

```{r}
BMI > upper.limit
candidates <- BMI > upper.limit
```

We have seen that a logical vector can be used to subset a data frame

- However, in our case the result looks a bit funny
- Can you think why this might be?

```{r}
patients[candidates,]
```

The `which` function will take a logical vector and return the indices of the `TRUE` values

- This can then be used to subset the data frame

```{r}
which(BMI > upper.limit)
candidates <- which(BMI > upper.limit)
```
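One likely reason for the odd-looking result from the logical subset is missing values. The sketch below uses a made-up data frame `df` with one missing BMI to compare subsetting by a logical vector with subsetting by `which()`; the names and values are illustrative only.

```r
# Made-up BMI values, one of them missing
bmi <- c(22, 41, NA, 19)
df <- data.frame(ID = c("P1", "P2", "P3", "P4"), BMI = bmi)
limit <- 30

# Comparing NA with a number gives NA, not TRUE or FALSE
bmi > limit

# Subsetting with that logical vector adds a row of NAs for the missing value
df[bmi > limit, ]

# which() returns only the positions that are TRUE, so the NA is dropped
which(bmi > limit)
df[which(bmi > limit), ]
```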


## 3. Outputting the results

- We write out a data frame of candidates (patients with a BMI more than two standard deviations above the mean) as a 'comma separated values' text file (CSV):

```{r}
write.csv(patients[candidates,], file="selectedSamples.csv")
```

- The output file is directly readable by Excel
- It's often helpful to double-check where the data have been saved. Use the *get working directory* function:

```{r eval=FALSE}
getwd() # print working directory
list.files() # list files in working directory

```
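Another way to check the output is to read the file straight back in. This is a sketch with a made-up data frame written to a temporary file; `row.names = FALSE` is an optional argument of `write.csv()` that leaves out the extra column of row names.

```r
# A toy data frame with made-up values
selected <- data.frame(ID = c("P2", "P7"), BMI = c(41.2, 39.8))

# Write to a temporary file; row.names = FALSE omits the column of row names
out_file <- tempfile(fileext = ".csv")
write.csv(selected, file = out_file, row.names = FALSE)

# Check that the file exists, then read it back and compare the dimensions
file.exists(out_file)
check <- read.csv(out_file)
dim(check)
```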


To recap, the set of R commands we have used is:

```{r}
patients <- read.delim("patient-info.txt")
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit)
patients$BMI <- round(BMI,1)
candidates <- which(BMI > upper.limit)
write.csv(patients[candidates,], file="selectedSamples.csv")

```

## Exercise: Exercise 3

- A separate study is looking for patients who are underweight and also smoke:
    + Modify the condition in our previous code to find these patients
    + e.g. having a BMI that is more than 2 standard deviations *below* the mean BMI
    + Write out a results file of the samples that match these criteria, and open it in a spreadsheet program


```{r}
### Your Answer Here ###



```

844 changes: 844 additions & 0 deletions Session1.3-walkthrough.nb.html

Large diffs are not rendered by default.
