R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

The general consensus is that R compares well with other popular statistical packages, such as SAS, SPSS, and Stata. In a comparison of all basic features for a statistical software R is heads up with the best of statistical software.

R Studio as a cross-platform software (multi-platform, or platform independent software) is computer software that is implemented on multiple computing platforms. Cross-platform software may be divided into two types; one requires individual building or compilation for each platform that it supports, and the other one can be directly run on any platform without special preparation, e.g., software written in an interpreted language or pre-compiled portable bytecode for which the interpreters or run-time packages are common or standard components of all platforms. You can convert the R scripts to different formats such as HTML, PDF and word .

R is a free platform and it’s new packages are available as quickly as the theory is published. It is now being accepted by a wider audience as a valid alternative to the commercial software. While SPSS and SAS are likely not going away, but as budget cuts encourage the use of open source and freeware, the younger generation of scientists who learn R will encourage its use in subsequent years.

By this document, we are trying to introduce you R environment and different types of data and plots in R.

  1. Introduction

  2. Operators and assigning

  3. Setting values

  4. Special constants

  5. Vectors

  6. Factors

  7. Matrices & Arrays

  8. Lists

  9. Data Frames

  10. Flow Control, loops

  11. Functions

  12. R plots and colors

  13. Text mining and word cloud fundamentals

  14. R troubleshooting

                                1. Introduction
############################ Example of real data

#reading the data

data<-"example.txt"

data.list<- read.table(data,header=FALSE)

data.frame<-as.data.frame(data.list)

rm(data.list)

data.matrix<-data.matrix(data.frame)

rm(data.frame)

colnames<-"colnames.txt"

cols<- read.table(colnames,header=FALSE)

t.cols<-t(cols)

rm(cols)

rownames<-"rownames.txt"

rows<- read.table(rownames,header=FALSE)

t.rows<-t(rows)

rm(rows)

dimnames(data.matrix) <- list(rownames(data.matrix)<-(t.rows),
                              colnames(data.matrix)<-(t.cols))

t.data.matrix<-t(data.matrix)

Q-Q plot: The q-q plot is similar to a probability plot. For a probability plot, the quantiles for one of the data samples are replaced with the quantiles of a theoretical distribution.

par(mfrow=c(1,2))
ps = seq(0.01 , 0.99 , by = 0.01)
qs = quantile(data.matrix[,1] , ps)
normalqs = qnorm(ps , mean = mean(data.matrix[,1]) ,sd = sd(data.matrix[,1]))
plot(normalqs , qs , xlab = "QQplot" , ylab = "Data", col = "orange")
abline(0, 1, col = "blue")
normalqs = qnorm(ps , mean = mean(data.matrix[,2]) ,sd = sd(data.matrix[,2]))
plot(normalqs , qs , xlab = "QQplot" , ylab = "Data", col = "orange")
abline(0, 1, col = "blue")

Histogram plot:

par(mfrow=c(1,2))
hist(data.matrix[,1],col = "red")
hist(data.matrix[,2],col = "red")

Box Plot: A simple way of representing statistical data on a plot in which a rectangle is drawn to represent the second and third quartiles, usually with a vertical line inside to indicate the median value. The lower and upper quartiles are shown as horizontal lines either side of the rectangle.

par(mfrow=c(1,2))
boxplot(data.matrix[,1])
boxplot(data.matrix[,2])

Pairwise Scatterplot: A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set.

par(mfrow=c(1,1))
pairs(data.matrix, gap = 0, pch = ".")

M vs. A Matrix Plot: An MA plot is an application of a Bland-Altman plot for visual representation of genomic data. The plot visualises the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. A matrix of M vs. A plots is produced. Plots are made on the upper triangle and the IQR of the Ms are displayed in the lower triangle.

Red and Green Color Image of Data Matrix:

# install.packages("WGCNA")
library("WGCNA")
## Loading required package: dynamicTreeCut
## Loading required package: fastcluster
## 
## Attaching package: 'fastcluster'
## The following object is masked from 'package:stats':
## 
##     hclust
## 
## ==========================================================================
## *
## *  Package WGCNA 1.51 loaded.
## *
## *    Important note: It appears that your system supports multi-threading,
## *    but it is not enabled within WGCNA in R. 
## *    To allow multi-threading within WGCNA with all available cores, use 
## *
## *          allowWGCNAThreads()
## *
## *    within R. Use disableWGCNAThreads() to disable threading if necessary.
## *    Alternatively, set the following environment variable on your system:
## *
## *          ALLOW_WGCNA_THREADS=<number_of_processors>
## *
## *    for example 
## *
## *          ALLOW_WGCNA_THREADS=4
## *
## *    To set the environment variable in linux bash shell, type 
## *
## *           export ALLOW_WGCNA_THREADS=4
## *
## *     before running R. Other operating systems or shells will
## *     have a similar command to achieve the same aim.
## *
## ==========================================================================
## 
## Attaching package: 'WGCNA'
## The following object is masked from 'package:stats':
## 
##     cor
par(mfrow=c(1,1))

plotMat(x=t.data.matrix, nrgcols=50, rlabels=TRUE, clabels=TRUE,
        rcols=1, ccols=1, title="Red and Green Color Image of Data Matrix")

# # Load packages
# # try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("flowCore")

# # try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("flowStats")

# # try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("flowViz")


library(flowCore)
library(flowStats)
## Loading required package: fda
## Loading required package: splines
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:flowCore':
## 
##     %&%
## 
## Attaching package: 'fda'
## The following object is masked from 'package:graphics':
## 
##     matplot
## Loading required package: cluster
## Loading required package: flowWorkspace
## Loading required package: ncdfFlow
## Loading required package: RcppArmadillo
## Loading required package: BH
library(flowViz) # for flow data visualization
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following objects are masked from 'package:ncdfFlow':
## 
##     densityplot, histogram, xyplot
## Load data
data(ITN)
ITN
## A flowSet with 15 experiments.
## 
## An object of class 'AnnotatedDataFrame'
##   rowNames: sample01 sample02 ... sample15 (15 total)
##   varLabels: GroupID SiteCode ... name (7 total)
##   varMetadata: labelDescription
## 
##   column names:
##   FSC SSC CD8 CD69 CD4 CD3 HLADr Time
## Create a workflow instance and transform data using asinh
wf <- workFlow(ITN)
## Warning: 'workFlow' is deprecated.
## Use 'flowWorkspace::GatingSet' instead.
## See help("Deprecated")
## Warning: 'workFlow' is deprecated.
## Use 'flowWorkspace::GatingSet' instead.
## See help("Deprecated")

asinh <- arcsinhTransform()
tl <- transformList(colnames(ITN)[3:7], asinh, 
                    transformationId = "asinh")
add(wf, tl)

We use the lymphGate function to find the T-cells in the CD3/SSC projection. Identify T-cells population:

lg <- lymphGate(Data(wf[["asinh"]]), channels=c("SSC", "CD3"),
                preselection="CD4", filterId="TCells", eval=FALSE,
                scale=2.5)
add(wf, lg$n2gate, parent="asinh")
print(xyplot(SSC ~ CD3| PatientID, wf[["TCells+"]],
             par.settings=list(gate=list(col="red", 
                                         fill="red", alpha=0.3))))
## Note: method with signature 'filter#missing' chosen for function 'glpolygon',
##  target signature 'logicalFilterResult#missing'.
##  "filterResult#ANY" would also be valid

RNA-seq example plots at the gene level:

# ## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("airway")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("Rsamtools")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("vsn")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("GenomicFeatures")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("GenomicAlignments")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("BiocParallel")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("DESeq2")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("genefilter")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("AnnotationDbi")

# # try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("org.Hs.eg.db")


## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("ReportingTools")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("Gviz")

## try http:// if https:// URLs are not supported
# source("http://bioconductor.org/biocLite.R")
# biocLite("DESeq2")

## try http:// if https:// URLs are not supported
# source("https://bioconductor.org/biocLite.R")
# biocLite("impute")

# install.packages("PoiClaClu")
# install.packages("RColorBrewer")
# install.packages("pheatmap")
# install.packages("hexbin")
# install.packages("impute")

We load the data package with the example data: The R function system.file can be used to find out where on your computer the files from a package have been installed. Here we ask for the full path to the extdata directory, where R packages store external data, that is part of the airway package.

library("airway")
## Loading required package: SummarizedExperiment
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:flowStats':
## 
##     normalize
## The following object is masked from 'package:Matrix':
## 
##     which
## The following objects are masked from 'package:flowCore':
## 
##     normalize, sort
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:Matrix':
## 
##     colMeans, colSums, expand, rowMeans, rowSums
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
dir <- system.file("extdata", package="airway", mustWork=TRUE)
# In this directory, we find the eight BAM files (and some other files):
list.files(dir)
##  [1] "GSE52778_series_matrix.txt"       
##  [2] "Homo_sapiens.GRCh37.75_subset.gtf"
##  [3] "sample_table.csv"                 
##  [4] "SraRunInfo_SRP033351.csv"         
##  [5] "SRR1039508_subset.bam"            
##  [6] "SRR1039509_subset.bam"            
##  [7] "SRR1039512_subset.bam"            
##  [8] "SRR1039513_subset.bam"            
##  [9] "SRR1039516_subset.bam"            
## [10] "SRR1039517_subset.bam"            
## [11] "SRR1039520_subset.bam"            
## [12] "SRR1039521_subset.bam"

Typically, we have a table with detailed information for each of our samples that links samples to the associated FASTQ and BAM files. For your own project, you might create such a comma-separated value (CSV) file using a text editor or spreadsheet software such as Excel.

# We load such a CSV file with read.csv:
csvfile <- file.path(dir,"sample_table.csv")
sampleTable <- read.csv(csvfile,row.names=1)
sampleTable
##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677

Once the reads have been aligned, there are a number of tools that can be used to count the number of reads/fragments that can be assigned to genomic features for each sample. These often take as input SAM/BAM alignment files and a file specifying the genomic features, e.g. a GFF3 or GTF file specifying the gene models.

DESeq2 import functions: We now proceed using the summarizeOverlaps method of counting. Using the Run column in the sample table, we construct the full paths to the files we want to perform the counting operation on:

We indicate in Bioconductor that these files are BAM files using the BamFileList function from the Rsamtools package that provides an R interface to BAM files. Here we also specify details about how the BAM files should be treated, e.g., only process 2 million reads at a time. See ?BamFileList for more information.

filenames <- file.path(dir, paste0(sampleTable$Run, "_subset.bam"))
file.exists(filenames)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
library("Rsamtools")
## Loading required package: Biostrings
## Loading required package: XVector
bamfiles <- BamFileList(filenames, yieldSize=2000000)

Note: make sure that the chromosome names of the genomic features in the an