-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path05-Bioinformatics.Rmd
1185 lines (821 loc) · 35.8 KB
/
05-Bioinformatics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Bioinformatics with [Bioconductor][] {#five}
[Bioconductor]: https://bioconductor.org
```{r, echo = FALSE, eval = FALSE}
knitr::opts_chunk$set(cache = TRUE)
```
## Day 29 (Monday) Zoom check-in
### Review and trouble shoot
### [Bioconductor][], packages, and biological data
[Bioconductor][]
- Repository of more than 1900 packages for the 'analysis and comprehension of high throughput genomic data'
- Bulk and single cell RNASeq; ChIP and other gene regulation, called variants, flow cytometery, proteomics, ...
- Package discovery using [biocViews][] and [workflows][]
- Landing pages & vignettes
Packages
- One-time installation of [BiocManager][]
```{r, message = FALSE}
if (!requireNamespace("BiocManager"))
install.packages("BiocManager", repos = "https://cran.r-project.org")
```
- Install _Bioconductor_ or _CRAN_ packages
```{r, eval = FALSE}
## Biostrings and genbankr are Bioconductor packages, rentrez is from CRAN
BiocManager::install(c("Biostrings", "rentrez"))
```
- Validate installation
```{r}
BiocManager::version() # current release is 3.11; needs R 4.0
BiocManager::valid()
```
- Types of packages
- Software (e.g., [Biostrings][], [SingleCellExperiment][])
- Annotation (e.g., [org.Hs.eg.db][], [TxDb.Hsapiens.UCSC.hg38.knownGene][], [GO.db][])
- Experiment data (e.g., [airway][], [scRNAseq][])
[biocViews]: https://bioconductor.org/packages
[workflows]: https://bioconductor.org/packages/release/BiocViews.html#___Workflow
[BiocManager]: https://cran.r-project.org/package=BiocManager
[Biostrings]: https://bioconductor.org/packages/Biostrings
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[SummarizedExperiment]: https://bioconductor.org/packages/Summarizedexperiment
[SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment
[org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db
[TxDb.Hsapiens.UCSC.hg38.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene
[GO.db]: https://bioconductor.org/packages/GO.db
[airway]: https://bioconductor.org/packages/airway
[scRNAseq]: https://bioconductor.org/packages/scRNAseq
Key packages
- [Biostrings][] for representing sequence data
- [GenomicRanges][] for working with genomic coordinates
- [SummarizedExperiment][] and [SingleCellExperiment][] for representing experimental summaries, e.g., a gene x sample matrix of RNA-seq expression values.
'Class' and 'method'
- Classes represent data in ways that allow the specialized nature of the data to be exploited, e.g., it makes sense to calculate the `reverseComplement()` of a DNA sequence, but not of an arbitrary character vector.
```{r, message = FALSE}
library(Biostrings)
sequences <- c(
chr1 = "CTGACT",
chr2 = "CTGGGAACT"
)
dna <- DNAStringSet(sequences)
dna
## fails! `T` not in the RNA alphabet
result <- try( RNAStringSet(sequences) )
```
- Methods represent the implementation of common ('generic') operations, specialized to the specific classes that the method operates on.
```{r}
length(dna)
reverseComplement(dna)
translate(dna)
```
- E.g., one could `translate()` both a DNA sequence, and an RNA sequence, so there is both `translate,DNAStringSet-method` and `translate,RNAStringSet-method`
- Not all operations are implemented as methods on a class
- E.g., `length()` is a concept shared by collections of DNA and RNA sequences, so the method is defined on a common 'parent' class
- E.g., `dim()` of a two-dimensional object is sufficient to implement `nrow()` (as `dim()[1]`) and `ncol()` (as `dim()[2]`) so one could use a 'plain old' function `nrow <- function(x) dim(x)[1]` for `nrow()` / `ncol()`, relying on a `dim()` generic and methods to provide specialized behavior.
Getting help
- Vignettes, e.g., [DESeq2 vignette][]! `browseVignettes(package = "DESeq2")`
- Help pages for classes, e.g., `?DNAStringSet`, `?GRanges`
- Discovering methods
```{r, eval = FALSE}
methods("reverse")
methods(class = "DNAStringSet")
```
- Looking at symbols made available by packages (after the package is attached to the `search()` path): `ls("package:GenomicRanges")`
[DESeq2 vignette]:
### This week: SARS-CoV-2 sequence and human tissue-specific gene expression data
Biological background
- RNA virus ([source][img-src-05-1], license: [CC BY-SA 4.0][])
```{r echo = FALSE, out.width = "60%"}
knitr::include_graphics('images/512px-Coronavirus_virion_structure.svg.png')
```
- Key domains ([source][img-src-05-2], license: [CC BY 2][])
```{r echo = FALSE, out.width = "100%"}
knitr::include_graphics('images/SARS-CoV_genome_organization.jpg')
```
[img-src-05-1]: https://en.wikipedia.org/wiki/File:Coronavirus_virion_structure.svg
[img-src-05-2]: https://en.wikipedia.org/wiki/File:SARS-CoV_genome_organization.jpg
[CC BY-SA 4.0]: https://creativecommons.org/licenses/by-sa/4.0/deed.en
[CC BY 2]: https://creativecommons.org/licenses/by/2.0/deed.en
Setup
- Some essential _Bioconductor_ data structures
```{r, message = FALSE}
library(Biostrings)
library(GenomicRanges)
```
- Additional software
```{r, message = FALSE}
library(rentrez)
library(genbankr)
library(DECIPHER)
library(ggtree)
```
- Need to install software?
- Install [BiocManager][]
```{r}
if (!requireNamespace("BiocManager"))
install.packages("BiocManager", repos = "https://cran.r-project.org")
```
- Install other _CRAN_ or _Bioconductor_ packges
```{r, eval = FALSE}
pkgs <- c(
## packages needing installation, e.g.,
"Biostrings", "GenomicRanges", "genbankr", "rentrez", "DECIPHER",
"ggtree"
)
BiocManager::install(pkgs)
BiocManager::valid()
```
[Biostrings][] and [GenomicRanges][]
- [Biostrings][]: representing DNA sequences
```{r}
sequences <- c(
chr1 = "CTGACT",
chr2 = "CTGGGAACT"
)
dna <- DNAStringSet(sequences)
dna
```
- [GenomicRanges][]: annotations in genome space, e.g., create a variable representing 'regions of interest'
```{r}
regions_of_interest <- c(
"chr1:1-3",
"chr1:3-6",
"chr2:4-6"
)
roi <- GRanges(regions_of_interest)
roi
```
- Many fun operations, e.g., extract sequences of regions of interest
```{r}
getSeq(dna, roi)
```
Sequences and annotations
- Input FASTA files from New York State samples (it makes little biological sense to use just New York samples, but we'll do it anyway!) to `DNAStringSet`
- Work the Genbank accession of the reference sequence
- DNA sequence
- Coding sequence annotations
- Coding sequences
Alignment and visualization
- Multiple alignment using [DECIPHER][]
- Visualization using [ggtree][]
Gene expression
- Retrieving example data from ExperimentHub
- Annotating cell types
[DECIPHER]: https://bioconductor.org/packages/DECIPHER
[ggtree]: https://bioconductor.org/packages/ggtree
## Day 30 DNA sequences and annotations
### Setup
Set up a directory for working
```{r}
workdir <- "workdir"
if (!dir.exists(workdir))
dir.create(workdir)
```
Make sure the following packages are installed
```{r, message = FALSE}
library(Biostrings)
library(GenomicRanges)
library(rentrez)
library(genbankr)
```
- Install necessary packages, if needed
```{r, eval = FALSE}
if (!requireNamespace("BiocManager"))
install.packages("BiocManager", repos = "https://cran.r-project.org")
## for example...
pkgs <- c("Biostrings", "GenomicRanges", "rentrez", "genbankr")
BiocManager::install(pkgs)
```
### Representing DNA sequences
Sample information
- Visit the [NCBI SARS-CoV-2][] site for orientation. We'll retrieve and analyse some of the samples sequenced in New York state
- Read a csv file summarizing records available from the US NCBI.
```{r}
url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-genbank-accessions.csv"
csv_file <- file.path(workdir, basename(url))
if ( !file.exists(csv_file) )
download.file(url, csv_file)
records <- readr::read_csv(csv_file)
records
```
[NCBI SARS-CoV-2]: https://www.ncbi.nlm.nih.gov/projects/genome/sars-cov-2-seqs
FASTA sequence files
- Download a FASTA file containing DNA sequences of all samples sequenced in New York State (this doesn't make much sense biologically!)
```{r}
url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-SARS-CoV2-NY.fasta"
fasta_file <- file.path(workdir, basename(url))
if ( !file.exists(fasta_file) )
download.file(url, fasta_file)
```
- This is a plain text file with a simple format. Each DNA sequence is on a line starting with an identifier, `>MT...`, and then lines representing the DNA (or RNA or amino acid) sequence.
```{r}
readLines(fasta_file, 5)
```
- Load the [Biostrings][] package
```{r, message = FALSE}
library(Biostrings)
options(Biostrings.coloring = FALSE)
```
- Read the DNA sequences and explore basic properties
```{r}
dna <- readDNAStringSet(fasta_file)
dna
```
- Perform simple manipulations
```{r}
## how many records?
length(dna)
## subset
dna[1:3]
dna[c('MT434817', 'MT434815')]
## extract
dna[[1]]
## number of characters in each record
head(nchar(dna))
hist(nchar(dna)) # histogram of sequence lengths
```
### Working with `GenBankRecord` objects
Data is often represented in complicated formats that do not fit into the tidy 'table' format of a CSV file. An example is the information displayed on the [SARS-CoV-2 reference genome][] web page.
[SARS-CoV-2 reference genome]: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512
Often 'someone' has written software to manipulate this data in a more convient way.
- Attach the [rentrez][] package, which provides a way to query the NCBI's 'Entrez' collection of data bases. Use `BiocManager::install('rentrez')` if you need to install the package
```{r, message = FALSE}
library(rentrez)
## who's responsible?
maintainer("rentrez")
## how to cite?
citation("rentrez")
```
- The accession number for the SARS-CoV-2 reference sequence is `NC_045512`. Use `entrez_search()` to search the NCBI 'nuccore' database to discover information about this accession.
```{r}
accession <- "NC_045512"
nuccore_record <- entrez_search("nuccore", accession)
id <- nuccore_record$ids
id
```
- The `id` is an identifier that can be used to actually retrieve the genbank record
```{r}
gb_record <-
entrez_fetch("nuccore", id, rettype = "gbwithparts", retmode = "text")
```
- The record contains all the text in the record; if you like visualize it with
```{r, eval = FALSE}
cat(gb_record)
```
[rentrez]: https://cran.r-project.org/package=rentrez
Represent the data as `GenBankR` object
- Rather than try to 'munge' (e.g., copy and paste) relevant information from the GenBank record into R, use the [genbankr][] package. Again, install with `BiocManager::install("genbankr")` if necessary
```{r, message = FALSE}
library(genbankr)
maintainer("genbankr")
citation("genbankr")
```
- Use the function `readGenBank()` to parse the complicated text of `gb_record`
```{r, message = FALSE}
gb <- readGenBank(text = gb_record)
gb
```
- Extract the reference sequence from the object
```{r}
ref <- getSeq(gb)
ref
```
- Extract the coding sequences from the reference sequence; the `cds()` command is explained a little later.
```{r}
cds <- getSeq(ref, cds(gb))
cds
```
- There are 13 coding sequences in the genome. Translate these to their amino acid sequences
```{r}
translate(cds)
```
[genbankr]: https://bioconductor.org/packages/genbankr
### Genomic ranges and DNA sequences
Genomic ranges
- Attach the [GenomicRanges][] package. If necessary, install with `BiocManager::install("GenomicRanges")`
```{r, message = FALSE}
library(GenomicRanges)
```
- Extract the coordinates of the coding sequences contained in the GenBank accession
```{r}
cds <- cds(gb)
```
- Look at the genomic ranges defined in `cds`
```{r}
granges(cds)
```
- There are 13 coding sequences in the reference genome. The sequence is named 'Severe acute respiratory syndrome coronavirus2'. The first region starts at position `r start(cds)[1]` of the reference genome, and goes to position `r end(cds)[1]`. The coding sequence is on the '+' strand.
- The full `cds` object has additional detail about each coding sequence. Display the full object (the display 'wraps' around several lines
```{r}
cds
```
or coerce to a tibble and explore
```{r}
library(tibble)
library(dplyr)
as_tibble(cds)
```
For instance, we can find the gene associated with each coding sequence, using the following equivalent commands
```{r}
cds$gene
subset(cds, , gene)
as_tibble(cds) %>%
## info on gene and protein product
dplyr::select(gene, product)
```
Using genomic ranges to extract DNA sequences
- Recall the sequence of the entire genome
```{r}
ref <- getSeq(gb)
ref
```
- Use `getSeq()` to extract the `DNAStringSet()` corresponding to the genomic ranges specified by `cds`
```{r}
dna <- getSeq(ref, cds)
dna
```
- It might be useful to coordinate information about each coding sequence with the sequence itself
```{r}
cds$DNA <- dna
cds
```
## Day 31 Sequence alignment and visualization
### 'S' gene exploration
Read the 'S' gene sequence extracted from all genbank records into a `DNAStringSet`
```{r}
url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-SARS-CoV2-S.fasta"
fasta_file <- file.path(workdir, basename(url))
if ( !file.exists(fasta_file) )
download.file(url, fasta_file)
dna <- readDNAStringSet(fasta_file)
dna
```
Explore the data. How many sequences? How many unique sequences?
```{r}
length(dna)
dna %>% unique() %>% length()
```
Note that the DNA alphabet contains 'ambiguity' letters, corresponding to base sequences that were uncertain, e.g., 'A' or 'T'. Check out the help page for `?translate` to see options for translating ambiguity codes. How many unique protein sequences?
```{r}
IUPAC_CODE_MAP
dna %>% alphabetFrequency() %>% colSums()
aa <- dna %>% translate(if.fuzzy.codon = "solve") %>% unique()
aa
```
The `table()` function counts the number of occurrences of each element, so
```{r}
nchar(aa) %>% table() %>% as_tibble()
```
shows us that most sequences have 1274 amino acids.
The `names(aa)` are misleading -- these are the identifiers of a representative accession, but several accessions may map to the same identifier. The following code creates a table that indicates which cluster each original identifier belongs to; the names of `aa` are updated accordingly.
```{r}
all <- dna %>% translate(if.fuzzy.codon = "solve")
## match(x, y) finds the index of each element of x in y.
## match(c("apple", "pear"), c("pear", "apple", "banana")) finds
## that 'apple' in x matches the second element of y, and 'pear'
## matches the first element of y. So the result is c(2, 1).
id <- match(all, aa)
clusters <- tibble(
accession = names(all),
label = as.character(id)
)
names(aa) <- match(aa, aa)
```
With this information, it's easy to count the number of times each sequence was represented. The reference sequence belongs to cluster 1.
```{r}
clusters %>% filter(accession == "NC_045512")
clusters %>%
count(label) %>%
arrange(desc(n))
```
### Sequence alignment
Setup
- Create a working directory, if necessary. E.g.,
```{r}
workdir <- "workdir"
if (!dir.exists(workdir))
dir.create(workdir)
```
- Install (if necessary, using `BiocManager::install("DECIPHER")`) and attach the [DECIPHER][] package
```{r, message = FALSE}
library(DECIPHER)
maintainer("DECIPHER")
citation("DECIPHER")
```
- This package contains tools for multiple sequence alignment. Explore the introductory and other vignettes in this package.
```{r, eval = FALSE}
browseVignettes(package = "DECIPHER")
```
- Check out one of the help pages, e.g.,
```{r}
?AlignSeqs
```
Use `example(AlignSeqs)` to run the examples on the help page.
Perform multiple alignment
- Perform multiple alignment on the unique amino acid sequences using DECIPHER; this has minimal effect, with single insertions in the few sequences with unusual `nchar()`.
```{r, message = FALSE}
aln <- AlignSeqs(aa, verbose = FALSE)
aln
nchar(aln) %>% table() %>% as_tibble()
```
- The 'distance matrix' measures the proximity (Hamming distance) of each sequence to other sequences in sequence space. The result is a large matrix of pairwise distances
```{r}
dist <- DistanceMatrix(aln, verbose = FALSE)
class(dist)
dim(dist)
dist[1:5, 1:5]
```
- Use the distance matrix to cluster sequences using the 'Neighbor Joining' algorithm; visualize the result (see `?plot.dendrogram` for plotting options when the first argument is of class `dendrogram`)
```{r, message = FALSE}
dendrogram <- IdClusters(dist, method = "NJ", type = "dendrogram")
dendrogram
plot(dendrogram, leaflab = "none")
```
Save the tree to the working directory using the 'Newick' format
```{r}
dendrogram_file <- file.path(workdir, "SARS-CoV2-S.newick")
WriteDendrogram(dendrogram, dendrogram_file, quoteLabels = FALSE)
```
### Phylogenetic tree visualization
Setup
- Install (if necessary, using `BiocManager::install("ggtree")`) and attach the [ggtree][] package
```{r, message = FALSE}
library(ggtree)
maintainer("ggtree")
citation("ggtree")
```
- The visualizations this package provides are not well-suited to _R_'s help pages; check out the online [ggtree book][] to orient yourself.
[ggtree book]: https://yulab-smu.github.io/treedata-book
Read the data from the Newick-format file created by DECIPHER
```{r}
dendrogram_file <- file.path(workdir, "SARS-CoV2-S.newick")
tree <- treeio::read.newick(dendrogram_file)
tree
tibble::as_tibble(tree)
```
Plot the tree, adding tip labels to the longest branches
```{r}
ggtree(tree) +
geom_tiplab(
aes(
subset = branch.length > .001,
label = label
),
size = 3
) +
ggplot2::ggtitle("SARS-CoV-2 S-locus phylogeny; genbank, 10 May 2020")
```
Color sequences by region
- Find nodes representing just a single sequence -- 'singletons'
- Use GenBank country to identify which country these were sequenced from
- Color lineages according to singleton country of origin
## Day 32 Single-cell expression data
This is a very preliminary foray into COVID-19 single-cell RNA-seq gene expression data. It is derived from a preprint by [Muus et al][], including a 'Terra' (NHGRI) cloud computing [workspace][].
RNA-seq is a technology for measuring RNA expression levels as a proxy for gene expression. Single-cell RNA seq measures RNA expression levels at the single-cell resolution, sometimes in conjunction with imaging or other spatial information. Often, the overall pattern of expression of a single cell can be used to classify the cell to a particular type, e.g., epithelial.
Many interesting questions present themselves in the context of COVID-19. It would be interesting to investigate patterns of expression of human genes known to interact with the SARS-CoV-2 virus, especially in tissues likely to be involved in the introduction of the virus to the human body. One could imagine further studies identifying, e.g., genetic variants that influence susceptibility or response to infection.
Our goal is to do some preliminary visualization of scRNA seq at human genes whose product is known to interact with virus proteins. A much more comprehensive guide is available as [Orchestrating Single Cell Analysis with _Bioconductor_][osca].
[Muus et al]: https://www.biorxiv.org/content/10.1101/2020.04.19.049254v2
[workspace]: https://anvil.terra.bio/#workspaces/kco-incubator/COVID-19_cross_tissue_analysis
[osca]: https://osca.bioconductor.org/
### Setup
This module requires a recent version of _R_ and _Bioconductor_. It also requires some advanced packages that may be difficult to install. Definitely it is necessary that
```{r}
as.package_version(R.version) >= "3.6"
BiocManager::version() >= "3.10"
```
If those conditions are met, load the following libraries (use `BiocManager::install(<pkgs>)` if necessary)
```{r, message = FALSE}
library(SingleCellExperiment)
library(Matrix)
library(rhdf5)
library(dplyr)
library(hexbin)
```
Create a working directory for today.
```{r}
workdir <- "workdir"
if (!dir.exists(workdir))
dir.create(workdir)
```
A couple of scripts provide helper functions for working with the data. Download the following file to the working directory
```{r}
url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-scRNAseq-import.R"
destination <- file.path(workdir, basename(url))
if (!file.exists(destination))
download.file(url, destination)
```
Take a quick look at the downloaded file (e.g., in RStudio File -> open) and verify that it is an _R_ script. Source the script in to your _R_ session so that you have access to the functions in the file.
```{r, cache=FALSE, message = FALSE}
source(destination)
```
Download our sample data set
```{r}
url <- "https://github.com/mtmorgan/QuaRantine/raw/master/assets/05-internal_nonsmokerslung.h5ad"
destination <- file.path(workdir, basename(url))
if (!file.exists(destination))
download.file(url, destination)
```
The file is an 'hdf5' file format, used to store relatively large data sets in a binary format that allows for faster and more robust input than a CSV file. _R_ and other software packages can read hdf5 files, but Excel, etc.
### Data input and exploration
There are two main components to the data. The first describes each observation (cell).
```{r, cache = FALSE}
obs <- read_obs(destination)
obs
```
If _R_ complains that it can't find `read_obs`, then remember to `source()` the script downloaded earlier.
This experiment measured gene expression in `r nrow(obs)` cells. Use standard dplyr commands to explore information about the cells
- How many patients are represented? (we need to use `dplyr::count()` because another package attached to our search path also has a `count()` function)
```{r}
obs %>% dplyr::count(Patient_ID)
```
- What is the age and gender of each patient?
```{r}
obs %>% distinct(Patient_ID, Age, Gender)
```
- Which cell types are represented (the column contains a space in its name, which causes problems for _R_ unless the column is quoted)
```{r}
obs %>% dplyr::count(`Cell Type`) %>% arrange(desc(n))
```
- Each cell was classified to cell type based on overall gene expression. How many cells of each type were observed?
```{r}
obs %>% dplyr::count(cell_subset)
```
Single-cell RNA assays expression across all genes, but the current data set has been restricted to 22 genes relevant to virus interaction. Read in the expression levels of these 22 genes across all cells.
```{r, cache = FALSE}
x <- read_expression(destination)
dim(x)
class(x)
x[, 1:3]
```
This is a matrix of `r nrow(x)` x `r ncol(x)` values, one entry for each gene and cell. However, single-cell assays are 'sparse', with a large number of zeros. To take advantage of this sparse data, _R_ has represented the matrix as a special _sparse_ matrix. The `.` in the display above represent zeros; the object `x` only contains information about the non-zero values.
How sparse is the data, i.e., what is the fraction of zeros in the data?
- Total number of elements
```{r}
nrow(x) * ncol(x)
```
- Total number of zero elements (`x == 0` returns a logical vector of `TRUE` and `FALSE` values; `sum()` treats `TRUE` as 1, and `FALSE` as 0, so `sum(x -- 0)` is the number of zero elements)
```{r}
sum(x == 0)
```
- Fraction of zero elements
```{r}
sum(x == 0) / (nrow(x) * ncol(x))
```
- More than 80% of the values are zeros! This primarily represents limitations of the technology, where relatively few 'reads' are available from each cell, so only a fraction of the genes being expressed are detected.
- Use the function `rowMeans()` to calculate average expression of each gene; present this as a tibble
```{r}
tibble(
gene = rownames(x),
average_expression = rowMeans(x)
)
```
### Coordinating cell and expression data
Notice that we have two separate data structures -- a large matrix of expression values, and a tibble describing each cell. It seems error-prone to manage these data separately, e.g., what if we remove some rows from the cell annotations without removing the corresponding columns from the expression matrix?
A better strategy is to introduce an object that can manage both parts of the data in a coordinated way. This container is provided by the [SingleCellExperiment][] package
```{r, message = FALSE}
library(SingleCellExperiment)
```
Use the _constructor_ `SingleCellExperiment()` to create this object. The basic metaphor is that a `SingleCellExperiment` is like a matrix, but with column (and row) annotations. The column annotations are called `colData`.
```{r, out.width = "60%"}
knitr::include_graphics('images/SummarizedExperiment.png')
```
Create and view a `SingleCellExperiment`
```{r}
sce <- SingleCellExperiment(x, colData = obs)
sce
dim(sce)
```
Use `subset()` to select particular rows and columns. The subset operates in a coordinated fashion on both the cell annotations and expression data, so there is no risk of 'book-keeping' error.
```{r}
my_sce <-
sce %>%
subset(
rownames(.) %in% "ACE2",
`Cell Type` %in% c("Fibroblast", "SmoothMuscle")
)
my_sce
dim(my_sce)
```
For the exercises below, focus on the ACE2 gene. ACE2 is a gene on the outer surface of the cell membrane in lung and other tissue.
```{r}
ace2 <- subset(sce, rownames(sce) %in% "ACE2")
ace2
```
We will perform some simple visualization, as we start to explore the data. Create a tibble containing the `cell_subset` annotation and the expression values of the ACE2 gene. We make use of `$` to access columns of the cell annotation (`colData`) and `assay()` to access the expression data. `as.vector()` changes the one-dimensional array returned by `assay()` into a numeric vector.
```{r}
tbl <- tibble(
cell_subset = ace2$cell_subset,
expression = as.vector(assay(ace2))
)
tbl
```
Summarize the number of cells, non-zero expression values, and average expression of ACE2 for each cell subset.
```{r}
tbl %>%
group_by(cell_subset) %>%
summarize(
n = n(),
n_non_zero = sum(expression != 0),
mean_expression = mean(expression)
)
```
Pictures speak louder than words. Use the [ggplot2][] package
```{r, message = FALSE}
library(ggplot2)
```
to visualize as a horizontal boxplot the ACE2 expression values of each cell subtype. I had initially used `geom_boxplot()` without `coord_flip()`, but the cell subtype labels overlapped. I experimented with rotating the labels, but that just made them difficult to read. So `coord_flip()` makes the boxplot horizontal rather than vertical, the labels are easy to read, and the figure easy to interpret.
```{r}
tbl %>%
filter(expression > 0) %>%
ggplot(aes(cell_subset, expression)) +
geom_boxplot() +
coord_flip() +
ggtitle("ACE2")
```
I want to explore co-expression of two genes, ACE2 and TMPRSS2. I used `subset()` to select the relevant expression and cell annotation values, then created a tibble with gene, cell, and expression columns.
- The `rep()` function replicates each element of its first argument the number of times indicated by it's second argument; explore it's behavior with simple examples, e.g.,
```{r}
1:3
rep(1:3, 2)
rep(1:3, each = 2)
```
- When a two-dimensional array is coerced to a vector, it is 'flattened' by appending columns
```{r}
m <- matrix(1:12, nrow = 4)
m
as.vector(m)
```
- Putting this together
```{r}
two_genes <- subset(sce, rownames(sce) %in% c("ACE2", "TMPRSS2"))
tbl <- tibble(
gene = rep(rownames(two_genes), ncol(two_genes)),
cell = rep(colnames(two_genes), each = nrow(two_genes)),
expression = as.vector(assay(two_genes))
)
tbl
```
I'd like to plot the expression of one gene against the expression of the other. It is convenient to transform the tidy tibble we just created to have a separate column for each gene. Use `tidyr::pivot_wider()` for this operation
```{r}
wider_tbl <-
tbl %>%
mutate(expression = log(expression)) %>%
tidyr::pivot_wider(names_from=gene, values_from=expression)
wider_tbl
```
We're ready to plot expression values of TMPRSS2 against ACE2. If we were to use `geom_point()`, many points would over-plot and we would have a difficult time seeing overall pattern. So instead we use `geom_hex()` to divide the plot area into hexagonal regions, and to fill each region with a color that indicates the number of points in the region. `coord_fix()` ensures that the x- and y- axes have the same scale, and the plot is printed so that the 'aspect ratio' (height / width) is 1.
```{r, fix.asp=1}
wider_tbl %>%
ggplot(aes(ACE2, TMPRSS2)) +
geom_hex() +
coord_fixed()
```
The plot shows a positive relation between expression of the two genes.
A major caveat to any conclusion of co-expression from the figure is that _covariates_ may introduce spurious correlations. For instance, sequencing of one cell can be more efficient than sequencing of another, so _all_ genes in one cell might appear to have higher expression than in another cell, but for completely un-interesting reasons (i.e., a technological artifact of sequencing efficiency). We would need to do a more sophisticated analysis to reach robust conclusions.
## Day 33 (Friday) Zoom check-in
Biological background
- RNA virus ([source][img-src-05-1], license: [CC BY-SA 4.0][])
```{r echo = FALSE, out.width = "60%"}
knitr::include_graphics('images/512px-Coronavirus_virion_structure.svg.png')
```
- Key domains ([source][img-src-05-2], license: [CC BY 2][])
```{r echo = FALSE, out.width = "100%"}
knitr::include_graphics('images/SARS-CoV_genome_organization.jpg')
```
### Monday troubles (10 minutes)
Where we ran into trouble on Monday
- Load specific packages
```{r, message = FALSE}
library(Biostrings)
options(Biostrings.coloring = FALSE)
library(GenomicRanges)
```