-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathEpicapture Target regions-2_Make Feature Annotations.R
411 lines (295 loc) · 22.4 KB
/
Epicapture Target regions-2_Make Feature Annotations.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
############################################################################
# Title: Epicapture Target Regions - 2.step
# Description: Script used to make annotation files for genomic features
# Author: Miljana Tanic ([email protected])
# Created: November 2018
# Last edited: Mart 2019
#############################################################################
# !!! If packages, commands or something doesn't work - disconect from the server and reconect, it will usually "miracolously" start working!
#=========================================
# Load libraries and set working directory
#=========================================
source("https://bioconductor.org/biocLite.R")
# biocLite("rtracklayer")
library(rtracklayer) # not necessary for importing BED files, can do it manualy crating GRanges object from dataframe library(dplyr)
library(readr)
# biocLite("annotatr")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("annotatr")
library(annotatr)
library(BSgenome.Hsapiens.UCSC.hg19) # sequence of the hg19 genome
library(ggplot2)
library(AnnotationHub)
#------------------------------------------
# Set Path to files
#------------------------------------------
#setwd("~/home/[email protected]/EpiCapture/BEDfiles")
setwd("~/RDS_C2c/EpiCapture/BEDfiles/FeatureAnnotations")
list.files()
#=========================================
# Make Anotations - GRanges from anotate R
#==========================================
#-------------------------------------------------
# Install annotater for anotation of genomic regions:
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("annotatr")
library(annotatr)
# annotate_regions() requires a GRanges object (either the result of read_regions() or an existing object), a GRanges object of the annotations, and a logical value indicating whether to ignore.strand when calling GenomicRanges::findOverlaps().
# The positive integer minoverlap is also passed to GenomicRanges::findOverlaps() and specifies the minimum overlap required for a region to be assigned to an annotation.
# Before annotating regions, they must be built with build_annotations() which requires a character vector of desired annotation codes.
# There are three types of annotations available to annotatr:
# 1. Built-in annotations including CpG annotations, genic annotations, enhancers, GENCODE lncRNAs, and chromatin states from chromHMM.
# 2. AnnotationHub annotations include any GRanges resource within the Bioconductor AnnotationHub web resource.
# 3. Custom annotations provided by the user.
#-----------------------
# CpG Annotations
# The CpG islands are the basis for all CpG annotations, and are given by the AnnotationHub package for the given organism.
# CpG shores are defined as 2Kb upstream/downstream from the ends of the CpG islands, less the CpG islands.
# CpG shelves are defined as another 2Kb upstream/downstream of the farthest upstream/downstream limits of the CpG shores, less the CpG islands and CpG shores.
# The remaining genomic regions make up the inter-CGI annotation.
hg19_cpgs # shortcut annotates regions to CpG islands, CpG shores, CpG shelves, and inter-CGI.
#-----------------------
# Genic Annotations
# The genic annotations are determined by functions from GenomicFeatures and data from the TxDb.* and org.*.eg.db packages.
# Genic annotations include 1-5Kb upstream of the TSS, the promoter (< 1Kb upstream of the TSS), 5’UTR, first exons, exons, introns, CDS, 3’UTR, and intergenic regions (the intergenic regions exclude the previous list of annotations).
# Also included in genic annotations are intronexon and exonintron boundaries. These annotations are 200bp up/down stream of any boundary between an exon and intron. Important to note, is that the boundaries are with respect to the strand of the gene.
# Non-intergenic gene annotations include Entrez ID and gene symbol information where it exists. The org.*.eg.db packages for the appropriate organisms are used to provide gene IDs and gene symbols.
# The genic annotations have populated tx_id, gene_id, and symbol columns. Respectively they are, the knownGene transcript name, Entrez Gene ID, and gene symbol.
hg19_basicgenes # shortcut annotates regions to 1-5Kb, promoters, 5’UTRs, exons, introns, and 3’UTRs.
#------------------------------
# FANTOM5 Permissive Enhancers
# FANTOM5 permissive enhancers were determined from bi-directional CAGE transcription as in Andersson et al. (2014), and are downloaded and processed for hg19 and mm9 from the FANTOM5 resource. Using the rtracklayer::liftOver() function, enhancers from hg19 are lifted to hg38, and mm9 to mm10.
#------------------------------
# GENCODE lncRNA transcripts
# The long non-coding RNA (lncRNA) annotations are from GENCODE for hg19, hg38, and mm10. The lncRNA transcripts are used, and we eventually plan to include the lncRNA introns/exons at a later date. The lncRNA annotations have populated tx_id, gene_id, and symbol columns. Respectively they are, the Ensembl transcript name, Entrez Gene ID, and gene symbol. As per the transcript_type field in the GENCODE anntotations, the biotypes are given in the id column.
#--------------------------------
# Chromatin states from ChromHMM
# Chromatin states determined by chromHMM (Ernst and Kellis (2012)) in hg19 are available for nine cell lines (Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, and Nhlf) via the UCSC Genome Browser tracks. Annotations for all states can be built using a shortcut like hg19_Gm12878-chromatin, or specific chromatin states can be accessed via codes like hg19_chromatin_Gm12878-StrongEnhancer or hg19_chromatin_Gm12878-Repressed.
#--------------------------------
# AnnotationHub Annotations
# The AnnotationHub web resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered. The resource includes metadata about each resource, e.g., a textual description, tags, and date of modification. The client creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.
#--------------------------------
# Custom Annotations
# Users may load their own annotations from BED files using the read_annotations() function, which uses the rtracklayer::import() function.
# The output is a GRanges with mcols() for id, tx_id, gene_id, symbol, and type.
# If a user wants to include tx_id, gene_id, and/or symbol in their custom annotations they can be included as extra columns on a BED input file.
## Custom annotation objects are given names of the form genome_custom_name
gencode_lncRNA.bed <- import("/home/[email protected]/RDS_C2c/EpiCapture/BEDfiles/FeatureAnnotations/gencode.v19.long_noncoding_RNAs.bed")
head(gencode_lncRNA.bed)
read_annotations(con=gencode_lncRNA.bed, genome = "hg19", name = "gencode_lncRNA", format="bed")
print(annotatr_cache$list_env())
#======================
# Define Annotations
#=====================
builtin_genomes()
builtin_annotations()
# Make list of built-in annotations for hg19
hg19_annotations <- grepl("hg19", builtin_annotations()) # grep logical --> logical vector
# index and subset annotations for hg19:
hg19_annotations <- builtin_annotations()[hg19_annotations]
#-----------------------------
# Build the gene annotations:
#-----------------------------
# Built in annotates regions to 1-5Kb, promoters, 5’UTRs, exons, introns, and 3’UTR:
annotations_genes.gr = build_annotations(genome = 'hg19', annotations = "hg19_basicgenes")
annotations_genes.gr
# get names of chromosomes:
seqnames(annotations_genes.gr) # has all chromosomes as in hg19 including chrUn
# calculate reduced region size:
sum(as.numeric(width(reduce(annotations_genes.gr)))) # size of all genic reagions in bp
# show metadata columns:
elementMetadata(annotations_genes.gr)
# subset GRanges according to defined features (CGI, genic, regulatory, etc...)
# mcols(annotations_genes.gr)$type=="hg19_genes_cds" # 2 subset - logical
annotations_genes.gr[mcols(annotations_genes.gr)$type=="hg19_genes_cds" , ]
# Make GRangesList object by splitting by type of annotation
annotations_genes.grl <- split(annotations_genes.gr, mcols(annotations_genes.gr)$type)
names(annotations_genes.grl) # get names of the list
# You can always unlist back to GRanges object:
unlist(annotations_genes.grl)
# calculate size in bp of each feature - use reduce not to count overlaping regions twice
#sum(as.numeric((width(reduce(annotations_genic.gr[mcols(annotations_genic.gr)$type=="hg19_genes_5UTRs" , ])))))
# better way:
sum(width(reduce(annotations_genes.grl)))
# It is counting both strands! (double the size of the genome? it would make sense since genes are direrently defined on different strands - the iformation load is actualy for double stranded DNA - countng 6 Gb of information) - solved by using reduce not to count overlaping regions
# Note inclusion of custom annotation, and use of shortcuts
annot_genes= c("hg19_genes_1to5kb", "hg19_genes_promoters", "hg19_genes_cds" , "hg19_genes_5UTRs", "hg19_genes_exons", "hg19_genes_firstexons" , "hg19_genes_introns" , "hg19_genes_intronexonboundaries", "hg19_genes_exonintronboundaries" , "hg19_genes_3UTRs", "hg19_genes_intergenic" )
annotations_genic.gr = build_annotations(genome = 'hg19', annotations = annot_genes)
# Make GRangesList object by splitting by type of annotation
annotations_genic.grl <- split(annotations_genic.gr, mcols(annotations_genic.gr)$type)
names(annotations_genic.grl)
# getSeq() doesnt't work if there is any undefined strand "*" - need to remove intergenic info
#intergenic.gr <- annotations_genic.grl[8]
#------------------------------------
# Build the CpG islands annotations:
#------------------------------------
# annotates regions to CpG islands, CpG shores, CpG shelves, and inter-CGI:
annotations_CpGs.gr = build_annotations(genome = 'hg19', annotations = "hg19_cpgs")
# Make GRangesList object by splitting by type of annotation
annotations_CpGs.grl <- split(annotations_CpGs.gr, mcols(annotations_CpGs.gr)$type)
names(annotations_CpGs.grl)
# Custom:
annot_CpG = c("hg19_cpg_islands", "hg19_cpg_shores", "hg19_cpg_shelves", "hg19_cpg_inter")
annotations_CGI.gr = build_annotations(genome = 'hg19', annotations = "hg19_cpg_islands")
annotations_shores.gr = build_annotations(genome = 'hg19', annotations = "hg19_cpg_shores")
annotations_shelves.gr = build_annotations(genome = 'hg19', annotations = "hg19_cpg_shelves")
annotations_interCGI.gr = build_annotations(genome = 'hg19', annotations = "hg19_cpg_inter")
sum(width(reduce(annotations_interCGI.gr)))
annotations_CpGs.grl <- GRangesList("hg19_cpg_islands"=annotations_CGI.gr,"hg19_cpg_shores"=annotations_shores.gr,"hg19_cpg_shelves"=annotations_shelves.gr, "hg19_cpg_inter"=annotations_interCGI.gr )
# Size of each CGI feature
# sum(width(reduce(annotations_CpGs.grl))) # sum of widths of each grl element BUT gives wrong data for inter-CGI!
# this is the correct way of doing this and gives good numbers!
lapply(annotations_CpGs.grl, function(x) sum(width(reduce(x))))
#------------------------------------
# Build the Regulatory annotations:
#------------------------------------
# Enhancers:
annot_enhancers_fantom <- c("hg19_enhancers_fantom")
annotations_enhancers_fantom.gr = build_annotations(genome = 'hg19', annotations = annot_enhancers_fantom)
sum(as.numeric(width(reduce(annotations_enhancers_fantom.gr))))
# Long non coding RNA:
path2lncRNA <- "/home/[email protected]/RDS_C2c/EpiCapture/BEDfiles/FeatureAnnotations"
file = "gencode.v19.long_noncoding_RNAs.bed"
# load their own annotations from BED files using the read_annotations() function
annotations_lncRNA.gr <- read_annotations(con = paste0(path2lncRNA,"/", file), genome="hg19", name="lncRNA", format = "bed")
sum(as.numeric(width(reduce(annotations_lncRNA.gr))))
#Using the build_ah_annots() function, users can turn any resource of class GRanges into an annotation for use in annotatr eg. from AnnotationHub
#------------------------------------
# Chromatine annotations:
#------------------------------------
annot_chromHMM <- c("hg19_Gm12878-chromatin",
"hg19_H1hesc-chromatin", "hg19_Hepg2-chromatin" , "hg19_Hmec-chromatin","hg19_Hsmm-chromatin" , "hg19_Huvec-chromatin" , "hg19_K562-chromatin" , "hg19_Nhek-chromatin", "hg19_Nhlf-chromatin")
#list of all annotations for chromatin - this will list also stratifed by type of chromatin mark
# hg19_annotations_chromHMM <- grepl("chromatin", hg19_annotations) # grep logical --> logical vector
annotations_chromHMM.gr = build_annotations(genome = 'hg19', annotations = annot_chromHMM)
sum(as.numeric(width(reduce(annotations_chromHMM.gr))))
#-----------------------------------------------------
# subset GRanges according to type of chromatine mark
#-----------------------------------------------------
grepl("Insulator", mcols(annotations_chromHMM.gr)$type)
annotations_chromHMM.gr[ grepl("Insulator", mcols(annotations_chromHMM.gr)$type)==TRUE,]
# Make a GRangesList object from regulatory annotations - split for each cell line a chromatine mark:
annotations_chromHMM.grl = split(annotations_chromHMM.gr,mcols(annotations_chromHMM.gr)$type)
names(annotations_chromHMM.grl) # 107 elements
# grep logical --> logical vector
print(grepl("Insulator", hg19_annotations))
# subset annotations that contain same type of chromatine mark for all analyzed cell lines:
annotations_ActivePromoter.gr = annotations_chromHMM.gr[ grepl("ActivePromoter", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_WeakPromoter.gr = annotations_chromHMM.gr[ grepl("WeakPromoter", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_PoisedPromoter.gr =annotations_chromHMM.gr[ grepl("PoisedPromoter", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_StrongEnhancer.gr = annotations_chromHMM.gr[ grepl("StrongEnhancer", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_WeakEnhancer.gr = annotations_chromHMM.gr[ grepl("WeakEnhancer", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_Insulator.gr = annotations_chromHMM.gr[ grepl("Insulator", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_TxnElongation.gr =annotations_chromHMM.gr[ grepl("TxnElongation", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_TxnTransition.gr = annotations_chromHMM.gr[ grepl("TxnTransition", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_WeakTxn.gr = annotations_chromHMM.gr[ grepl("WeakTxn", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_Repressed.gr = annotations_chromHMM.gr[ grepl("Repressed", mcols(annotations_chromHMM.gr)$type)==TRUE,]
annotations_Heterochrom.gr = annotations_chromHMM.gr[ grepl("Heterochrom/lo", mcols(annotations_chromHMM.gr)$type)==TRUE,]
#annotations_Repetitive-CNV.gr = annotations_chromHMM.gr[ grepl("Repetitive", mcols(annotations_chromHMM.gr)$type)==TRUE,] # object 'annotations_Repetitive' not found
# Calculate the size of each region:
sum(width(reduce(annotations_chromHMM.gr[grepl("Repetitive/CNV", mcols(annotations_chromHMM.gr)$type)==TRUE,])))
annotations_regul.grl <- GRangesList(annotations_ActivePromoter.gr, annotations_WeakPromoter.gr, annotations_PoisedPromoter.gr, annotations_StrongEnhancer.gr, annotations_WeakEnhancer.gr, annotations_Insulator.gr, annotations_TxnElongation.gr, annotations_TxnTransition.gr, annotations_WeakTxn.gr, annotations_WeakTxn.gr, annotations_Repressed.gr, annotations_Heterochrom.gr)
# Get sizes of each chromHHM feature in each cell line:
file <- sum(width(reduce(annotations_regul.grl)))
write.table(x=file, file="ChromHHM_size.txt", sep="\t")
#-----------------------------------------------
# Save Annotation GRanges objects to load later:
#-----------------------------------------------
save(annotations_genes.grl, file="Genes_annotations_GRangesList.RData")
save(annotations_genes.gr, file="Genes_annotations_GRanges.RData")
save(annotations_genic.grl, file="Genic-extended_annotations_GRangesList.RData")
save(annotations_CpGs.grl, file="CpG_annottations_GRangesList.RData")
save(annotations_CpGs.gr, file="CpG_annottations_GRanges.RData")
save(annotations_enhancers_fantom.gr, file="Ehnahcers_annotations_GRanges.RData")
save(annotations_lncRNA.gr, file="lncRNA_annotations_GRanges.RData")
save(annotations_regul.grl, file="ChromHMM-byState_GRangesList.RData")
save(annotations_chromHMM.grl, file="ChromHMM-byCellLine-byState_GRangesList.RData")
save(annotations_ActivePromoter.gr, annotations_WeakPromoter.gr, annotations_PoisedPromoter.gr, annotations_StrongEnhancer.gr, annotations_WeakEnhancer.gr, annotations_Insulator.gr, annotations_TxnElongation.gr, annotations_TxnTransition.gr, annotations_WeakTxn.gr, annotations_WeakTxn.gr, annotations_Repressed.gr, annotations_Heterochrom.gr, file="ChromHMM-byState_GRanges.RData")
# these can be reloaded into another R session using the load() function:
load(file="Genes_annotations_GRangesList.RData")
load(file="extended_annotations_GRangesList.RData")
load(file="CpG_annottations_GRangesList.RData")
load(file="Enhancers_annotations_GRanges.RData")
load(file="lncRNA_annotations_GRanges.RData")
load(file="ChromHMM-byState_GRangesList.RData")
load(file="ChromHMM-byCellLine-byState_GRangesList.RData")
load(file="ChromHMM-byState_GRanges.RData")
#=====================
# AnnotationHub
#=====================
library(AnnotationHub)
ah <- AnnotationHub()
# data provaiders
unique(ah$dataprovider)
# species available
head(unique(ah$species))
# type of data availeble
head(unique(ah$rdataclass))
# make queries for spcific data type
dm <- query(ah, c("GRanges", "UCSC", "Homo sapiens"))
dm["AH5012"]
#-----------------------------------------------
# get the CpG islands positions from UCSC, most easily from AnnotationHub
#-----------------------------------------------
query(ah, c("cpg","hg19"))
cpgs_ucsc.gr <- ah[["AH5086"]]
# save CGI annotations
save(cpgs_ucsc.gr, file="UCSC_CGI_GRanges.RData")
#-----------------------------------------------
# get the Roadmap data from AnnotatioHub
#-----------------------------------------------
epiFiles <- query(ah, "EpigenomeRoadMap")
# get a more descriptive idea of these different files
sort(table(epiFiles$description), decreasing=TRUE)
#=========================================================
# get positions of all CpG sites genome postion for hg19
#=========================================================
library(BSgenome.Hsapiens.UCSC.hg19)
chrs <- names(Hsapiens)[1:24]
cgs <- lapply(chrs, function(x) start(matchPattern("CG", Hsapiens[[x]])))
cpgr <- do.call(c, lapply(1:24, function(x) GRanges(names(Hsapiens)[x], IRanges(cgs[[x]], width = 2))))
cpgr
# save CpG sites annotations
save(cpgr, file="hg19_CpGsites_GRanges.RData")
#=========================================================
# Get anotations from Enseble Regulatory build
#=========================================================
# source("http://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
# library(biomaRt)
# listMarts()
# listEnsembl()
# listEnsembl("GRCh37")
# listEnsemblArchives(name="GRCh37")
# # GRC38 Ensembl Regulatory Build:
# #ensembl <- wget(http://www.ensembl.org/biomart/martview/29726bf6c14acd6b0f252bd47f0064c5?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_regulatory_feature.default.regulatory_feature.chromosome_name|hsapiens_regulatory_feature.default.regulatory_feature.chromosome_start|hsapiens_regulatory_feature.default.regulatory_feature.chromosome_end|hsapiens_regulatory_feature.default.regulatory_feature.feature_type_name&FILTERS=&VISIBLEPANEL=resultspanel)
# ensembl <- useMart("ensembl")
# ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
# The Ensembl Regulatory Build (Zerbino et al. 2015) contains a genome-wide set of regions that are likely to be involved in gene regulation. The classification of these regions are:
# Promoters
# Promoter flanking regions
# Enhancers
# CTCF binding sites
# Transcription factor binding sites
# Open chromatin regions
# Cell type specific activity levels
# For each cell type the regulatory features are assigned labels to describe their activity levels. For each cell type and each regulatory feature defined in the Regulatory Build, we check whether there is a segmentation state or experimental evidence (TF binding from ChIP-seq data or open chromatin from DNase-seq data) that could be used to test the activity of this annotation. Based on this then the activity levels can be one of the following:
# ACTIVE, when the feature displays an active epigenetic signature.
# POISED, when the feature displays a epigenetic signature with the potential to be activated.
# REPRESSED, when the feature is epigenetically repressed.
# INACTIVE, when the region bears no epigenetic modifications from the ones included in the Regulatory Build.
# NA, when there is no available data in the cell type for this feature.
#=========================================================
# Repetitive elements
#=========================================================
# download the RepeatMasker results performed on hg19 from the UCSC Table Browser tool.
# I’ve downloaded the results as "all fields from selected table" file and named it hg19_rmsk.gz
$ gunzip hg19_rmsk.gz
# use shell bedops to convert to BED file and keep lass annotation
# convert it to sorted BED data
$ rmsk2bed < hg19_rmsk
# Use bedops –merge to merge elements, e.g.: rmsk2bed < foo.out | bedops --merge - > merged_repeatmasker_elements.bed
# we strip the header element, but adding the --keep-header option will preserve this header as a BED element that uses _header as a chromosome name:
# Note the conversion from 1- to 0-based coordinate indexing, in the transition from RepeatMasker annotation output to BED. BEDOPS supports operations on input with any coordinate indexing, but the coordinate change made here is believed to be convenient for most end users.