R fridays #1: Reading in Sam's qPCR data

The Data

Sam Lisy of the Ascano lab has volunteered her qPCR data as an example for R Fridays! Here is a summary of her data: If you’d like a copy of Sam’s data to follow along with this example, just send me an email

  • 2 technical replicates of 2 biological replicates
  • Three genes tested (IFN, IFIT, Tubulin)
  • A positive control (Katie Rothamel’s old samples) and a no-template control (NTC)
  • Samples were run with the Applied Biosystems StepOnePlus real-time qPCR machine
  • The raw Ct values were output to excel using the Applied Biosystems software
  • Sam added another sheet to the excel file describing the samples and conditions tested

Step 1: Read the data into R from the excel sheet

For this example, we will use the Tidyverse collection of packages for analyzing data in R. We will also use a package called readxl that allows R to read in excel files. First, let’s install and load the tidyverse and load readxl to your R session:

The ‘#’ symbol denotes a comment in R code. A comment is a line in an Rscript that is not evaluated. You can use comments to make notes to yourself about what a line of code is doing. You can also use them to prevent a line of code from being run– a useful way to de-bug your Rscript.

#This line installs the tidyverse packages
install.packages("tidyverse")
#This line loads the packages into your current R session
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Notice that while we had to install and load the tidyverse, we only had to load readxl. The readxl package is installed as part of the tidyverse, but as it isn’t a ‘core’ package, we have to load it explicitly.

#This line loads the readxl package
library(readxl)

Once you have installed a package, you won’t need to do it again unless you update R or RStudio. Every time you start a new R session, you will need to load any packages you want to use in that session with the library(package_name) command.

To read the data into R, we will use the read_xlsx() function from the readxl package. The read_xlsx() function takes a number of different arguments to figure out how to read data given how our excel file is formatted. To see what these arguments are, view the help page of the read_xlsx() function:

?read_xlsx

In the help pane of your Rstudio window, you should see the help page for read_xlsx. From this page, we can see the arguments for the read_xlsx function:

read_xlsx(path, sheet = NULL, range = NULL, col_names = TRUE, col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), progress = readxl_progress(), .name_repair = "unique")

Some of the arguments are given default values, like sheet = NULL. For these arguments, we can choose to provide a different value, or we can leave the default value.

  • For the path argument, we need to provide the path to the excel file. If you put the excel file in the same folder as your R Project, the path is just the file name: R-Friday_SL_qPCR_009.xlsx.
  • sheet: Sam added a sheet to describe her samples and conditions, so the sheet that contains the data is the second sheet. We will set this argument as sheet = 2.
  • range: This specifies the range of cells to read the data from. We will leave this as the default NULL for now.
  • col_names: This argument is asking if the columns in Sam’s data already have names. They do, so we don’t have to change the default value of TRUE.
  • col_types: This would let us dictate the “type” of each column (ie: numeric, character, string). We can leave the default value of NULL, so R will guess which type to use.
  • na: In qPCR data, sometimes a Ct value is “Undetermined”. In order for R to understand that the rest of the Ct values should be numeric, not strings, we can use the na argument to tell R to turn any instance of “Undetermined” into an NA value.
  • trim_ws: We don’t need to remove whitespace from our data, so we can leave this as the default value.
  • skip: This argument tells R if there are lines at the beginning of the file to skip. Sam’s file has header lines that contain info about the machine settings, but aren’t important for analyzing the data in R. There are 7 of these lines, so we can set skip = 7.
  • n_max: Sam’s data also has lines at the end of the file that we don’t need. We can set the n_max argument to tell R the maximum number of lines to read from the file, in order to ignore the extra lines at the bottom. n_max = 37

We can leave the remaining arguments with their default values. We are left with the following function call to read in Sam’s data: read_xlsx("../../data/R-Friday_SL_qPCR_009.xlsx", sheet = 2, na = "Undetermined", skip = 7, n_max = 37)

Now we can read in the data and save it as a variable called sams_data:

The ‘<-’ operator tells R to assign the data it reads from the excel file to the variable sams_data.

#Reading in Sam's qPCR data from excel

sams_data <- read_xlsx("../../data/R-Friday_SL_qPCR_009.xlsx", sheet = 2, na = "Undetermined", skip = 7, n_max = 37)

We can view the structure of the data in R using glimpse(). Wow! That is a lot of variables (columns). In the next step, we will get rid of any columns we don’t need to use for the analysis.

glimpse(sams_data)
## Observations: 36
## Variables: 32
## $ Well                     <chr> "A1", "A2", "A3", "A4", "A5", "A6", "B1", "B…
## $ `Sample Name`            <chr> "+1", "+1", "+1", "+1", "+1", "+1", "+2", "+…
## $ `Target Name`            <chr> "Tubulin", "Tubulin", "IFIT", "IFIT", "IFN",…
## $ Task                     <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN", …
## $ Reporter                 <chr> "SYBR", "SYBR", "SYBR", "SYBR", "SYBR", "SYB…
## $ Quencher                 <chr> "None", "None", "None", "None", "None", "Non…
## $ RQ                       <chr> NA, NA, "12.828105926513672", "12.8281059265…
## $ `RQ Min`                 <chr> NA, NA, "9.98699951171875", "9.9869995117187…
## $ `RQ Max`                 <chr> NA, NA, "16.477451324462891", "16.4774513244…
## $ Cт                       <dbl> 30.07280, 29.90723, 29.82261, 29.85040, 35.8…
## $ `Cт Mean`                <chr> "29.990015029907227", "29.990015029907227", …
## $ `Cт SD`                  <chr> "0.11707787960767746", "0.11707787960767746"…
## $ ΔCт                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `ΔCт Mean`               <chr> NA, NA, "-0.15351009368896484", "-0.15351009…
## $ `ΔCт SE`                 <chr> NA, NA, "8.3944700658321381E-2", "8.39447006…
## $ `HK Control ΔCт Mean`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `HK Control ΔCт SE`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ ΔΔCт                     <chr> NA, NA, "-3.6812362670898438", "-3.681236267…
## $ `Automatic Ct Threshold` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ `Ct Threshold`           <dbl> 1.1494783, 1.1494783, 0.4703859, 0.4703859, …
## $ `Automatic Baseline`     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ `Baseline Start`         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ `Baseline End`           <dbl> 22, 22, 23, 23, 29, 39, 19, 19, 20, 20, 28, …
## $ Efficiency               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Tm1                      <dbl> 80.90846, 80.90846, 80.16264, 80.01333, 77.0…
## $ Tm2                      <chr> NA, NA, NA, NA, NA, "63.885162353515625", NA…
## $ Tm3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Comments                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ HIGHSD                   <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N",…
## $ NOAMP                    <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N",…
## $ EXPFAIL                  <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N",…
## $ MTP                      <chr> "N", "N", "N", "N", "N", "Y", "N", "N", "N",…

Organizing the data (part 1):

First, we should select only the columns we need to use to calculate a \(\Delta\Delta\)Ct from Sam’s data. The columns we need are Sample Name, Target Name, and Cт Mean. Notice that the т from the Applied Biosystems data is not a normal T. For now you can copy the symbol from the output of glimpse, and later we can change this to a normal T. Other columns like CT and CT SD are important to really understand the data, and we will use them later.

To “Select” columns from a table, the Tidyverse has a handy function called select(). Let’s view the help page to see what arguments we need to set:

?select()

From the help page, the select function arguments are select(.data, ...).

  • .data: This argument is the name of the table we want to select columns from. In our case, that is sams_data.
  • ...: This argument is a bit more confusing. For our purposes, the ... is telling us that the remaining arguments should be the names of the variables (columns) from sams_data that we want to keep.

So our call to select will be: select(sams_data, "Sample Name", "Target Name", "Cт Mean").

  • But wait! In the details of the ... argument on the help page, it says we can use select to rename columns by setting new_name = old_name in our list of columns. We can use this to change the weird т to a normal T.

Our new function call is: select(sams_data, "Sample Name", "Target Name", "CT Mean" = "Cт Mean")

#Save the table with the selected columns as a new variable
sams_data_meanCT <- select(sams_data, "Sample Name", "Target Name", "CT Mean" = "Cт Mean")

#Glimpse the data to see the new structure
glimpse(sams_data_meanCT)
## Observations: 36
## Variables: 3
## $ `Sample Name` <chr> "+1", "+1", "+1", "+1", "+1", "+1", "+2", "+2", "+2", "…
## $ `Target Name` <chr> "Tubulin", "Tubulin", "IFIT", "IFIT", "IFN", "IFN", "Tu…
## $ `CT Mean`     <chr> "29.990015029907227", "29.990015029907227", "29.8365058…

Wrap-up

Out of time for today! In our next session, we will finish organizing (cleaning) Sam’s data. We will then calculate and plot the \(\Delta\Delta\)Ct values for her samples. If we have time, we can add color to the plot to denote which samples had a high SD between technical replicates. See you next week!

As always, if you have any questions don’t hesitate to email me.