From 334110149cd9cdb73d769a8452f1ad1c61c041d4 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 15 Jul 2014 15:46:46 +0100 Subject: [PATCH 01/12] Copied example verbatim from Aaron Quinlan Original URL: https://raw.githubusercontent.com/arq5x/tutorials/master/bedtools.md --- novice/extras/bedtools-capstone.md | 548 +++++++++++++++++++++++++++++ 1 file changed, 548 insertions(+) create mode 100644 novice/extras/bedtools-capstone.md diff --git a/novice/extras/bedtools-capstone.md b/novice/extras/bedtools-capstone.md new file mode 100644 index 000000000..1952e39fe --- /dev/null +++ b/novice/extras/bedtools-capstone.md @@ -0,0 +1,548 @@ +% bedtools Tutorial +% Aaron Quinlan +% November 22, 2013 + + +Synopsis +======== + +Our goal is to work through examples that demonstrate how to +explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package. + +Some of our analysis will be based upon the Maurano et al exploration of DnaseI hypersensitivity sites in hundreds of primary tissue types. + + Maurano et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012. Vol. 337 no. 6099 pp. 1190-1195. + + www.sciencemag.org/content/337/6099/1190.short + + +This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). + + + + +\ + + +Setup +===== +From the Terminal, create a new directory on your Desktop called "bedtools-demo". + + cd ~/Desktop + mkdir bedtools-demo + +Navigate into that directory. + + cd bedtools-demo + +Download the sample BED files I have provided. + + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/maurano.dnaseI.tgz + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt + +Now, we need to extract all of the 20 Dnase I hypersensitivity BED files from the "tarball" named +`maurano.dnaseI.tgz`. + + tar -zxvf maurano.dnaseI.tgz + rm maurano.dnaseI.tgz + +Let's take a look at what files we now have. + + ls -1 + + +\ + + +What are these files? +========================= +Your directory should now contain 23 BED files and 1 genome file. Twenty of these files (those starting with "f" for "fetal tissue") reflect Dnase I hypersensitivity sites measured in twenty different fetal tissue samples from the brain, heart, intestine, kidney, lung, muscle, skin, and stomach. + +In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). + +The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). + + +\ + + +The bedtools help +================== +To bring up the help, just type + + bedtools + +As you can see, there are multiple "subcommands" and for bedtools to +work you must tell it which subcommand you want to use. Examples: + + bedtools intersect + bedtools merge + bedtools subtract + + +bedtools "intersect" +==================== + +The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common). + +![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png) + +Default behavior +---------------- +By default, `intersect` reports the intervals that represent overlaps between your two files. To demonstrate, let's identify all of the CpG islands that overlap exons. + + bedtools intersect -a cpg.bed -b exons.bed | head -5 + chr1 29320 29370 CpG:_116 + chr1 135124 135563 CpG:_30 + chr1 327790 328229 CpG:_29 + chr1 327790 328229 CpG:_29 + chr1 327790 328229 CpG:_29 + +Reporting the original feature in each file. +-------------------------------------------- +The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected. + + bedtools intersect -a cpg.bed -b exons.bed -wa -wb \ + | head -5 + chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - + chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + + chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + + +How many base pairs of overlap were there? +------------------------------------------ +The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. + + bedtools intersect -a cpg.bed -b exons.bed -wo \ + | head -10 + chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 + chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 + chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 + chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 + +Counting the number of overlapping features. +-------------------------------------------- +We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. + + bedtools intersect -a cpg.bed -b exons.bed -c \ + | head + chr1 28735 29810 CpG:_116 1 + chr1 135124 135563 CpG:_30 1 + chr1 327790 328229 CpG:_29 3 + chr1 437151 438164 CpG:_84 0 + chr1 449273 450544 CpG:_99 0 + chr1 533219 534114 CpG:_94 0 + chr1 544738 546649 CpG:_171 0 + chr1 713984 714547 CpG:_60 1 + chr1 762416 763445 CpG:_115 10 + chr1 788863 789211 CpG:_28 9 + +\ + + +Find features that DO NOT overlap +-------------------------------------------- +Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. + + bedtools intersect -a cpg.bed -b exons.bed -v \ + | head + chr1 437151 438164 CpG:_84 + chr1 449273 450544 CpG:_99 + chr1 533219 534114 CpG:_94 + chr1 544738 546649 CpG:_171 + chr1 801975 802338 CpG:_24 + chr1 805198 805628 CpG:_50 + chr1 839694 840619 CpG:_83 + chr1 844299 845883 CpG:_153 + chr1 912869 913153 CpG:_28 + chr1 919726 919927 CpG:_15 + + +Require a minimal fraction of overlap. +-------------------------------------------- +Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported. + +Let's be more strict and require 50% of overlap. + + bedtools intersect -a cpg.bed -b exons.bed \ + -wo -f 0.50 \ + | head + chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 + chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047525_exon_4_0_chr1_788771_f 0 + 348 + chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047524_exon_3_0_chr1_788771_f 0 + 348 + chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047523_exon_3_0_chr1_788771_f 0 + 348 + chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047522_exon_5_0_chr1_788859_f 0 + 348 + chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047521_exon_4_0_chr1_788771_f 0 + 348 + chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047520_exon_6_0_chr1_788859_f 0 + 348 + + +\ + + +bedtools "merge" +==================== +Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you. + +![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png) + + +Input must be sorted +-------------------- +The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM. + +If you run the merge command on a file that is not sorted, you will get an error. For example: + + bedtools merge -i exons.bed + chr1 66999824 67000051 + chr1 67091529 67091593 + chr1 67098752 67098777 + chr1 67101626 67101698 + chr1 67105459 67105516 + chr1 67108492 67108547 + chr1 67109226 67109402 + chr1 67126195 67126207 + chr1 67133212 67133224 + chr1 67136677 67136702 + chr1 67137626 67137678 + chr1 67138963 67139049 + chr1 67142686 67142779 + chr1 67145360 67145435 + chr1 67147551 67148052 + chr1 67154830 67154958 + chr1 67155872 67155999 + chr1 67161116 67161176 + chr1 67184976 67185088 + chr1 67194946 67195102 + chr1 67199430 67199563 + chr1 67205017 67205220 + chr1 67206340 67206405 + chr1 67206954 67207119 + ERROR: input file: (exons.bed) is not sorted by chrom then start. + The start coordinate at line 26 is less than the start at line 25 + +To correct this, you need to sort your BED using the UNIX `sort` utility. + + sort -k1,1 -k2,2n exons.bed > exons.sort.bed + +Let's try again. + + bedtools merge -i exons.sort.bed | head -10 + +The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. + + +Count the number of overlapping intervals. +------------------------------------------ +A more sophisticated approach would be to not only merge overlapping intervals, but also report the *number* of intervals that were integrated into the new, merged interval. One does this with the `-n` option. + + bedtools merge -i exons.sort.bed -n | head -20 + chr1 11873 12227 1 + chr1 12612 12721 1 + chr1 13220 14829 2 + chr1 14969 15038 1 + chr1 15795 15947 1 + chr1 16606 16765 1 + chr1 16857 17055 1 + chr1 17232 17368 1 + chr1 17605 17742 1 + chr1 17914 18061 1 + chr1 18267 18366 1 + chr1 24737 24891 1 + chr1 29320 29370 1 + chr1 34610 35174 2 + chr1 35276 35481 2 + chr1 35720 36081 2 + chr1 69090 70008 1 + chr1 134772 139696 1 + chr1 139789 139847 1 + chr1 140074 140566 1 + +Merging features that are close to one another. +----------------------------------------------- +With the `-d` (distance) option, one can also merge intervals that do not overlap, yet are close to one another. For example, to merge features that are no more than 1000bp apart, one would run: + + bedtools merge -i exons.sort.bed -d 1000 -n | head -20 + chr1 11873 18366 12 + chr1 24737 24891 1 + chr1 29320 29370 1 + chr1 34610 36081 6 + chr1 69090 70008 1 + chr1 134772 140566 3 + chr1 323891 328581 10 + chr1 367658 368597 3 + chr1 621095 622034 3 + chr1 661138 665731 3 + chr1 700244 700627 1 + chr1 701708 701767 1 + chr1 703927 705092 2 + chr1 708355 708487 1 + chr1 709550 709660 1 + chr1 713663 714068 1 + chr1 752750 755214 2 + chr1 761585 763229 10 + chr1 764382 764484 9 + chr1 776579 778984 1 + + +\ + + +bedtools "complement" +===================== +We often want to know which intervals of the genome are **NOT** "covered" by intervals in a given feature file. For example, if you have a set of ChIP-seq peaks, you may also want to know which regions of the genome are not bound by the factor you assayed. The `complement` addresses this task. + +![](http://bedtools.readthedocs.org/en/latest/_images/complement-glyph.png) + +As an example, let's find all of the non-exonic (i.e., intronic or intergenic) regions of the genome. Note, to do this you need a ["genome"](http://bedtools.readthedocs.org/en/latest/content/general-usage.html#genome-file-format) file, which tells `bedtools` the length of each chromosome in your file. *Consider why the tool would need this information...* + + bedtools complement -i exons.bed -g genome.txt \ + > non-exonic.bed + head non-exonic.bed + chr1 0 11873 + chr1 12227 12612 + chr1 12721 13220 + chr1 14829 14969 + chr1 15038 15795 + chr1 15947 16606 + chr1 16765 16857 + chr1 17055 17232 + chr1 17368 17605 + chr1 17742 17914 + + +\ + + +bedtools "genomecov" +==================== +For many analyses, one wants to measure the genome wide coverage of a feature file. For example, we often want to know what fraction of the genome is covered by 1 feature, 2 features, 3 features, etc. This is frequently crucial when assessing the "uniformity" of coverage from whole-genome sequencing. This is done with the versatile `genomecov` tool. + +![](http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png) + +As an example, let's produce a histogram of coverage of the exons throughout the genome. Like the `merge` tool, `genomecov` requires pre-sorted data. It also needs a genome file as above. + + bedtools genomecov -i exons.sort.bed -g genome.txt + +This should run for 3 minutes or so. At the end of your output, you should see something like: + + genome 0 3062406951 3137161264 0.976171 + genome 1 44120515 3137161264 0.0140638 + genome 2 15076446 3137161264 0.00480576 + genome 3 7294047 3137161264 0.00232505 + genome 4 3650324 3137161264 0.00116358 + genome 5 1926397 3137161264 0.000614057 + genome 6 1182623 3137161264 0.000376972 + genome 7 574102 3137161264 0.000183 + genome 8 353352 3137161264 0.000112634 + genome 9 152653 3137161264 4.86596e-05 + genome 10 113362 3137161264 3.61352e-05 + genome 11 57361 3137161264 1.82844e-05 + genome 12 52000 3137161264 1.65755e-05 + genome 13 55368 3137161264 1.76491e-05 + genome 14 19218 3137161264 6.12592e-06 + genome 15 19369 3137161264 6.17405e-06 + genome 16 26651 3137161264 8.49526e-06 + genome 17 9942 3137161264 3.16911e-06 + genome 18 13442 3137161264 4.28477e-06 + genome 19 1030 3137161264 3.28322e-07 + genome 20 6329 3137161264 2.01743e-06 + ... + + +\ + + +Producing BEDGRAPH output +-------------------------- +Using the `-bg` option, one can also produce BEDGRAPH output which represents the "depth" fo feature coverage for each base pair in the genome: + + bedtools genomecov -i exons.sort.bed -g genome.txt -bg | head -20 + chr1 11873 12227 1 + chr1 12612 12721 1 + chr1 13220 14361 1 + chr1 14361 14409 2 + chr1 14409 14829 1 + chr1 14969 15038 1 + chr1 15795 15947 1 + chr1 16606 16765 1 + chr1 16857 17055 1 + chr1 17232 17368 1 + chr1 17605 17742 1 + chr1 17914 18061 1 + chr1 18267 18366 1 + chr1 24737 24891 1 + chr1 29320 29370 1 + chr1 34610 35174 2 + chr1 35276 35481 2 + chr1 35720 36081 2 + chr1 69090 70008 1 + chr1 134772 139696 1 + + +\ + + +Sophistication through chaining multiple bedtools +================================================= +Analytical power in `bedtools` comes from the ability to "chain" together multiple tools in order to construct rather sophisicated analyses with very little programming - you just need **genome arithmetic**! Have a look at the examples [here](http://bedtools.readthedocs.org/en/latest/content/advanced-usage.html). + + +\ + + +Principal component analysis +============================= + +We will use the bedtools implementation of a Jaccard statistic to meaure the similarity of two +datasets. Briefly, the Jaccard statistic measures the ratio of the number of *intersecting* base +pairs to the *total* number of base pairs in the two sets. As such, the score ranges from 0.0 to 1. +0; lower values reflect lower similarity, whereas higher values reflect higher similarity. + +Let's walk through an example: we would expect the Dnase hypersensivity sites to be rather similar +between two samples of the **same** fetal tissue type. Let's test: + + bedtools jaccard \ + -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \ + -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed + intersection union jaccard + 81269248 160493950 0.50637 + +But what about the similarity of two **different** tissue types? + + bedtools jaccard \ + -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \ + -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed + intersection union jaccard + 28076951 164197278 0.170995 + +Hopefully this demonstrates how the Jaccard statistic can be used as a simple statistic to reduce +the dimensionality of the comparison between two large (e.g., often containing thousands or +millions of intervals) feature sets. + + +\ + + +A Jaccard statistic for all 400 pairwise comparisons. +------------------------------------------------------ + + +We are going to take this a bit further and use the Jaccard statistic to measure the similarity of +all 20 tissue samples against all other 20 samples. Once we have a 20x20 matrix of similarities, +we can use dimensionality reduction techniques such as hierarchical clustering or principal +component analysis to detect higher order similarities among **all** of the datasets. + + +We will use GNU parallel to compute a Jaccard statistic for the 400 (20*20) pairwise comparisons +among the fetal tissue samples. + +But first, we need to install [GNU parallel](http://www.gnu.org/software/parallel/). + + brew install parallel + +Next, we need to install a tiny script I wrote for this analysis. + + curl -O http://quinlanlab.cs.virginia.edu/cshl2013/make-matrix.py + + +Now, we can use `parallel` to, you guessed it, compute the 400 pairwise Jaccard statistics in parallel using as many processors as you have available. + + parallel "bedtools jaccard -a {1} -b {2} \ + | awk 'NR>1' \ + | cut -f 3 \ + > {1}.{2}.jaccard" \ + ::: `ls *.merge.bed` ::: `ls *.merge.bed` + +This command will create a single file containing the pairwise Jaccard measurements from all 400 tests. + + find . \ + | grep jaccard \ + | xargs grep "" \ + | sed -e s"/\.\///" \ + | perl -pi -e "s/.bed./.bed\t/" \ + | perl -pi -e "s/.jaccard:/\t/" \ + > pairwise.dnase.txt + +A bit of cleanup to use more intelligible names for each of the samples. + + cat pairwise.dnase.txt \ + | sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \ + | sed -e 's/.hg19//g' \ + > pairwise.dnase.shortnames.txt + +Now let's make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R. + + awk 'NF==3' pairwise.dnase.shortnames.txt \ + | awk '$1 ~ /^f/ && $2 ~ /^f/' \ + | python make-matrix.py \ + > dnase.shortnames.distance.matrix + +Let's also make a file of labels for each dataset so that we can label each dataset in our R plot. + + cut -f 1 dnase.shortnames.distance.matrix | cut -f 1 -d "-" | cut -f 1 -d "_" > labels.txt + +Now start up R. (This assumes you have installed the `ggplot2` package). + + R + +You should see something very similar to this: + + + R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows" + Copyright (C) 2012 The R Foundation for Statistical Computing + ISBN 3-900051-07-0 + Platform: x86_64-apple-darwin12.0.0 (64-bit) + + R is free software and comes with ABSOLUTELY NO WARRANTY. + You are welcome to redistribute it under certain conditions. + Type 'license()' or 'licence()' for distribution details. + + Natural language support but running in an English locale + + R is a collaborative project with many contributors. + Type 'contributors()' for more information and + 'citation()' on how to cite R or R packages in publications. + + Type 'demo()' for some demos, 'help()' for on-line help, or + 'help.start()' for an HTML browser interface to help. + Type 'q()' to quit R. + + > + +No paste these commands into the R console: + + library(ggplot2) + library(RColorBrewer) + blues <- colorRampPalette(c('dark blue', 'light blue')) + greens <- colorRampPalette(c('dark green', 'light green')) + reds <- colorRampPalette(c('pink', 'dark red')) + + setwd("~/Desktop/bedtools-demo") + x <- read.table('dnase.shortnames.distance.matrix') + labels <- read.table('labels.txt') + ngroups <- length(unique(labels)) + pca <- princomp(x) + qplot(pca$scores[,1], pca$scores[,2], color=labels[,1], geom="point", size=1) + + scale_color_manual(values = c(blues(4), greens(5), reds(5))) + +You should see this: + +![](http://quinlanlab.cs.virginia.edu/cshl2013/pca.png) + +Et voila. Note that PCA was used in this case as an example of what PCA does for the CSHL Adv. Seq. course. Heatmaps are a more informative visualization in this case. + + + From 165b0029449ab642c7b45074b0f89f406d2530c4 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 15 Jul 2014 15:48:56 +0100 Subject: [PATCH 02/12] Removed the last part of Aaron's tutorial To be replaced with permutation --- novice/extras/bedtools-capstone.md | 301 ----------------------------- 1 file changed, 301 deletions(-) diff --git a/novice/extras/bedtools-capstone.md b/novice/extras/bedtools-capstone.md index 1952e39fe..9a640e130 100644 --- a/novice/extras/bedtools-capstone.md +++ b/novice/extras/bedtools-capstone.md @@ -244,305 +244,4 @@ Let's try again. The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. -Count the number of overlapping intervals. ------------------------------------------- -A more sophisticated approach would be to not only merge overlapping intervals, but also report the *number* of intervals that were integrated into the new, merged interval. One does this with the `-n` option. - - bedtools merge -i exons.sort.bed -n | head -20 - chr1 11873 12227 1 - chr1 12612 12721 1 - chr1 13220 14829 2 - chr1 14969 15038 1 - chr1 15795 15947 1 - chr1 16606 16765 1 - chr1 16857 17055 1 - chr1 17232 17368 1 - chr1 17605 17742 1 - chr1 17914 18061 1 - chr1 18267 18366 1 - chr1 24737 24891 1 - chr1 29320 29370 1 - chr1 34610 35174 2 - chr1 35276 35481 2 - chr1 35720 36081 2 - chr1 69090 70008 1 - chr1 134772 139696 1 - chr1 139789 139847 1 - chr1 140074 140566 1 - -Merging features that are close to one another. ------------------------------------------------ -With the `-d` (distance) option, one can also merge intervals that do not overlap, yet are close to one another. For example, to merge features that are no more than 1000bp apart, one would run: - - bedtools merge -i exons.sort.bed -d 1000 -n | head -20 - chr1 11873 18366 12 - chr1 24737 24891 1 - chr1 29320 29370 1 - chr1 34610 36081 6 - chr1 69090 70008 1 - chr1 134772 140566 3 - chr1 323891 328581 10 - chr1 367658 368597 3 - chr1 621095 622034 3 - chr1 661138 665731 3 - chr1 700244 700627 1 - chr1 701708 701767 1 - chr1 703927 705092 2 - chr1 708355 708487 1 - chr1 709550 709660 1 - chr1 713663 714068 1 - chr1 752750 755214 2 - chr1 761585 763229 10 - chr1 764382 764484 9 - chr1 776579 778984 1 - - -\ - - -bedtools "complement" -===================== -We often want to know which intervals of the genome are **NOT** "covered" by intervals in a given feature file. For example, if you have a set of ChIP-seq peaks, you may also want to know which regions of the genome are not bound by the factor you assayed. The `complement` addresses this task. - -![](http://bedtools.readthedocs.org/en/latest/_images/complement-glyph.png) - -As an example, let's find all of the non-exonic (i.e., intronic or intergenic) regions of the genome. Note, to do this you need a ["genome"](http://bedtools.readthedocs.org/en/latest/content/general-usage.html#genome-file-format) file, which tells `bedtools` the length of each chromosome in your file. *Consider why the tool would need this information...* - - bedtools complement -i exons.bed -g genome.txt \ - > non-exonic.bed - head non-exonic.bed - chr1 0 11873 - chr1 12227 12612 - chr1 12721 13220 - chr1 14829 14969 - chr1 15038 15795 - chr1 15947 16606 - chr1 16765 16857 - chr1 17055 17232 - chr1 17368 17605 - chr1 17742 17914 - - -\ - - -bedtools "genomecov" -==================== -For many analyses, one wants to measure the genome wide coverage of a feature file. For example, we often want to know what fraction of the genome is covered by 1 feature, 2 features, 3 features, etc. This is frequently crucial when assessing the "uniformity" of coverage from whole-genome sequencing. This is done with the versatile `genomecov` tool. - -![](http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png) - -As an example, let's produce a histogram of coverage of the exons throughout the genome. Like the `merge` tool, `genomecov` requires pre-sorted data. It also needs a genome file as above. - - bedtools genomecov -i exons.sort.bed -g genome.txt - -This should run for 3 minutes or so. At the end of your output, you should see something like: - - genome 0 3062406951 3137161264 0.976171 - genome 1 44120515 3137161264 0.0140638 - genome 2 15076446 3137161264 0.00480576 - genome 3 7294047 3137161264 0.00232505 - genome 4 3650324 3137161264 0.00116358 - genome 5 1926397 3137161264 0.000614057 - genome 6 1182623 3137161264 0.000376972 - genome 7 574102 3137161264 0.000183 - genome 8 353352 3137161264 0.000112634 - genome 9 152653 3137161264 4.86596e-05 - genome 10 113362 3137161264 3.61352e-05 - genome 11 57361 3137161264 1.82844e-05 - genome 12 52000 3137161264 1.65755e-05 - genome 13 55368 3137161264 1.76491e-05 - genome 14 19218 3137161264 6.12592e-06 - genome 15 19369 3137161264 6.17405e-06 - genome 16 26651 3137161264 8.49526e-06 - genome 17 9942 3137161264 3.16911e-06 - genome 18 13442 3137161264 4.28477e-06 - genome 19 1030 3137161264 3.28322e-07 - genome 20 6329 3137161264 2.01743e-06 - ... - - -\ - - -Producing BEDGRAPH output --------------------------- -Using the `-bg` option, one can also produce BEDGRAPH output which represents the "depth" fo feature coverage for each base pair in the genome: - - bedtools genomecov -i exons.sort.bed -g genome.txt -bg | head -20 - chr1 11873 12227 1 - chr1 12612 12721 1 - chr1 13220 14361 1 - chr1 14361 14409 2 - chr1 14409 14829 1 - chr1 14969 15038 1 - chr1 15795 15947 1 - chr1 16606 16765 1 - chr1 16857 17055 1 - chr1 17232 17368 1 - chr1 17605 17742 1 - chr1 17914 18061 1 - chr1 18267 18366 1 - chr1 24737 24891 1 - chr1 29320 29370 1 - chr1 34610 35174 2 - chr1 35276 35481 2 - chr1 35720 36081 2 - chr1 69090 70008 1 - chr1 134772 139696 1 - - -\ - - -Sophistication through chaining multiple bedtools -================================================= -Analytical power in `bedtools` comes from the ability to "chain" together multiple tools in order to construct rather sophisicated analyses with very little programming - you just need **genome arithmetic**! Have a look at the examples [here](http://bedtools.readthedocs.org/en/latest/content/advanced-usage.html). - - -\ - - -Principal component analysis -============================= - -We will use the bedtools implementation of a Jaccard statistic to meaure the similarity of two -datasets. Briefly, the Jaccard statistic measures the ratio of the number of *intersecting* base -pairs to the *total* number of base pairs in the two sets. As such, the score ranges from 0.0 to 1. -0; lower values reflect lower similarity, whereas higher values reflect higher similarity. - -Let's walk through an example: we would expect the Dnase hypersensivity sites to be rather similar -between two samples of the **same** fetal tissue type. Let's test: - - bedtools jaccard \ - -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \ - -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed - intersection union jaccard - 81269248 160493950 0.50637 - -But what about the similarity of two **different** tissue types? - - bedtools jaccard \ - -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \ - -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed - intersection union jaccard - 28076951 164197278 0.170995 - -Hopefully this demonstrates how the Jaccard statistic can be used as a simple statistic to reduce -the dimensionality of the comparison between two large (e.g., often containing thousands or -millions of intervals) feature sets. - - -\ - - -A Jaccard statistic for all 400 pairwise comparisons. ------------------------------------------------------- - - -We are going to take this a bit further and use the Jaccard statistic to measure the similarity of -all 20 tissue samples against all other 20 samples. Once we have a 20x20 matrix of similarities, -we can use dimensionality reduction techniques such as hierarchical clustering or principal -component analysis to detect higher order similarities among **all** of the datasets. - - -We will use GNU parallel to compute a Jaccard statistic for the 400 (20*20) pairwise comparisons -among the fetal tissue samples. - -But first, we need to install [GNU parallel](http://www.gnu.org/software/parallel/). - - brew install parallel - -Next, we need to install a tiny script I wrote for this analysis. - - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/make-matrix.py - - -Now, we can use `parallel` to, you guessed it, compute the 400 pairwise Jaccard statistics in parallel using as many processors as you have available. - - parallel "bedtools jaccard -a {1} -b {2} \ - | awk 'NR>1' \ - | cut -f 3 \ - > {1}.{2}.jaccard" \ - ::: `ls *.merge.bed` ::: `ls *.merge.bed` - -This command will create a single file containing the pairwise Jaccard measurements from all 400 tests. - - find . \ - | grep jaccard \ - | xargs grep "" \ - | sed -e s"/\.\///" \ - | perl -pi -e "s/.bed./.bed\t/" \ - | perl -pi -e "s/.jaccard:/\t/" \ - > pairwise.dnase.txt - -A bit of cleanup to use more intelligible names for each of the samples. - - cat pairwise.dnase.txt \ - | sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \ - | sed -e 's/.hg19//g' \ - > pairwise.dnase.shortnames.txt - -Now let's make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R. - - awk 'NF==3' pairwise.dnase.shortnames.txt \ - | awk '$1 ~ /^f/ && $2 ~ /^f/' \ - | python make-matrix.py \ - > dnase.shortnames.distance.matrix - -Let's also make a file of labels for each dataset so that we can label each dataset in our R plot. - - cut -f 1 dnase.shortnames.distance.matrix | cut -f 1 -d "-" | cut -f 1 -d "_" > labels.txt - -Now start up R. (This assumes you have installed the `ggplot2` package). - - R - -You should see something very similar to this: - - - R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows" - Copyright (C) 2012 The R Foundation for Statistical Computing - ISBN 3-900051-07-0 - Platform: x86_64-apple-darwin12.0.0 (64-bit) - - R is free software and comes with ABSOLUTELY NO WARRANTY. - You are welcome to redistribute it under certain conditions. - Type 'license()' or 'licence()' for distribution details. - - Natural language support but running in an English locale - - R is a collaborative project with many contributors. - Type 'contributors()' for more information and - 'citation()' on how to cite R or R packages in publications. - - Type 'demo()' for some demos, 'help()' for on-line help, or - 'help.start()' for an HTML browser interface to help. - Type 'q()' to quit R. - - > - -No paste these commands into the R console: - - library(ggplot2) - library(RColorBrewer) - blues <- colorRampPalette(c('dark blue', 'light blue')) - greens <- colorRampPalette(c('dark green', 'light green')) - reds <- colorRampPalette(c('pink', 'dark red')) - - setwd("~/Desktop/bedtools-demo") - x <- read.table('dnase.shortnames.distance.matrix') - labels <- read.table('labels.txt') - ngroups <- length(unique(labels)) - pca <- princomp(x) - qplot(pca$scores[,1], pca$scores[,2], color=labels[,1], geom="point", size=1) + - scale_color_manual(values = c(blues(4), greens(5), reds(5))) - -You should see this: - -![](http://quinlanlab.cs.virginia.edu/cshl2013/pca.png) - -Et voila. Note that PCA was used in this case as an example of what PCA does for the CSHL Adv. Seq. course. Heatmaps are a more informative visualization in this case. - - From 33bb59d722aa651cd601c0cae3fc4b645ad72860 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Mon, 21 Jul 2014 15:14:23 +0100 Subject: [PATCH 03/12] Fleshed out the section on shuffling beds --- .../bedtools-python}/bedtools-capstone.md | 139 +++++++++++++++++- .../capstones/bedtools-python/do_shuffle.sh | 8 + novice/capstones/bedtools-python/readings.py | 37 +++++ 3 files changed, 182 insertions(+), 2 deletions(-) rename novice/{extras => capstones/bedtools-python}/bedtools-capstone.md (69%) create mode 100755 novice/capstones/bedtools-python/do_shuffle.sh create mode 100644 novice/capstones/bedtools-python/readings.py diff --git a/novice/extras/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md similarity index 69% rename from novice/extras/bedtools-capstone.md rename to novice/capstones/bedtools-python/bedtools-capstone.md index 9a640e130..f586947a2 100644 --- a/novice/extras/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -191,6 +191,27 @@ Let's be more strict and require 50% of overlap. \ +Do exons and CPG islands overlap significantly? +=============================================== + +Going back to base pairs overlap +-------------------------------- + + bedtools intersect -a cpg.bed -b exons.bed -wo \ + | head -10 + chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 + chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 + chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 + chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 + chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 + + + bedtools "merge" ==================== @@ -243,5 +264,119 @@ Let's try again. The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. - - +Or alternatively: + + bedtools sort -i exons.bed | bedtools merge > exons.merged.bed + +Now the overlap with merged exons: + + + bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ + | head -n 10 + chr1 28735 29810 CpG:_116 chr1 29320 29370 50 + chr1 135124 135563 CpG:_30 chr1 134772 139696 439 + chr1 327790 328229 CpG:_29 chr1 324438 328581 439 + chr1 713984 714547 CpG:_60 chr1 713663 714068 84 + chr1 762416 763445 CpG:_115 chr1 761585 762902 486 + chr1 762416 763445 CpG:_115 chr1 762970 763155 185 + chr1 762416 763445 CpG:_115 chr1 763177 763229 52 + chr1 788863 789211 CpG:_28 chr1 788770 794826 348 + chr1 854765 854973 CpG:_16 chr1 854714 854817 52 + chr1 858970 861632 CpG:_257 chr1 861120 861180 60 + +All we want is the very last column: + + bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ + | cut -f 8,8 | head -n 10 + 50 + 439 + 439 + 84 + 486 + 185 + 52 + 348 + 52 + 60 + +Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons: + + bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ + | cut -f 8,8 | python readings.py --sum + 7235722.0 + +How do we know if this overlap is significant? Shuffle one of the files. + + bedtools shuffle -i cpg.bed -g genome.txt \ + | head -n 10 + chr7 154365006 154366081 CpG:_116 + chr2 104893603 104894042 CpG:_30 + chr11 11786369 11786808 CpG:_29 + chr1 116263073 116264086 CpG:_84 + chr10 10674197 10675468 CpG:_99 + chrX 7680868 7681763 CpG:_94 + chr6 34093178 34095089 CpG:_171 + chr1 88263374 88263937 CpG:_60 + chr1 150416635 150417664 CpG:_115 + chr6 39650957 39651305 CpG:_28 + +But keep the intervals on the same chromosome... + + bedtools shuffle -chrom -i cpg.bed -g genome.txt \ + | head -n 10 + chr1 2367858 2368933 CpG:_116 + chr1 225323760 225324199 CpG:_30 + chr1 37368198 37368637 CpG:_29 + chr1 13872987 13874000 CpG:_84 + chr1 91280571 91281842 CpG:_99 + chr1 74656135 74657030 CpG:_94 + chr1 64079046 64080957 CpG:_171 + chr1 54034535 54035098 CpG:_60 + chr1 237633659 237634688 CpG:_115 + chr1 51954093 51954441 CpG:_28 + +Save to a file, and count the overlap for the shuffled features + + bedtools shuffle -chrom -i cpg.bed -g genome.txt \ + > cpg.shuffled.bed + bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \ + | cut -f 8,8 | python readings.py --sum + 681980.0 + +Do this lots of times to get a p-value + + mkdir shuffled_cpg + + + #! /bin/bash + + + for i in {1..100} + do + bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed + done + +And then: + + for f in shuffled_cpg/*; + do bedtools intersect -a $f -b exons.merged.bed -wo \ + | cut -f 8,8 | python readings.py --sum; done | head -n 10 + 651838.0 + 655178.0 + 633355.0 + 636853.0 + 668339.0 + 651838.0 + 655178.0 + 633355.0 + 636853.0 + 668339.0 + +Save that to a file: + + for f in shuffled_cpg/*; + do bedtools intersect -a $f -b exons.merged.bed -wo \ + | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt + + +TODO: Write a python program to plot the distribution of shuffled results, and the real result diff --git a/novice/capstones/bedtools-python/do_shuffle.sh b/novice/capstones/bedtools-python/do_shuffle.sh new file mode 100755 index 000000000..2c7e2edc1 --- /dev/null +++ b/novice/capstones/bedtools-python/do_shuffle.sh @@ -0,0 +1,8 @@ +#! /bin/bash + + +for i in {1..5} +do + bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed +done + diff --git a/novice/capstones/bedtools-python/readings.py b/novice/capstones/bedtools-python/readings.py new file mode 100644 index 000000000..28c61922e --- /dev/null +++ b/novice/capstones/bedtools-python/readings.py @@ -0,0 +1,37 @@ +import sys +import numpy as np + + +def main(): + script = sys.argv[0] + action = sys.argv[1] + filenames = sys.argv[2:] + assert action in ['--min', '--mean', '--max', '--sum'], \ + 'Action is not one of --min, --mean, or --max: ' + action + if len(filenames) == 0: + process(sys.stdin, action) + else: + for f in filenames: + process(f, action) + + +def process(filename, action): + data = np.loadtxt(filename, delimiter=',') + + data_shape = data.shape + if len(data_shape) == 1: + data = data.reshape((1, len(data))) + + if action == '--min': + values = data.min(axis=1) + elif action == '--mean': + values = data.mean(axis=1) + elif action == '--max': + values = data.max(axis=1) + elif action == '--sum': + values = data.sum(axis=1) + + for m in values: + print m + +main() From a56b93fb20630de6c613eb8322403d92f4b3f951 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 22 Jul 2014 11:31:32 +0100 Subject: [PATCH 04/12] Added python script for plotting the shuffled results. --- .../bedtools-python/bedtools-capstone.md | 172 +++++++++++++++++- novice/capstones/bedtools-python/results.py | 42 +++++ 2 files changed, 213 insertions(+), 1 deletion(-) create mode 100644 novice/capstones/bedtools-python/results.py diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index f586947a2..1df8735cb 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -379,4 +379,174 @@ Save that to a file: | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt -TODO: Write a python program to plot the distribution of shuffled results, and the real result +Plotting our results with python +-------------------------------- + +Let's copy the readings.py script and alter it to read in our results files: + + import sys + import numpy as np + + + def main(): + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + + def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + main() + +Which prints: + + Real data: + 7235722.0 + + Shuffled data: + [ 640473. 638611. 644321. 642422. 634641. 660745. 645816. 635427. + 595615. 644916. 662412. 645489. 666141. 674597. 637099. 621515. + 647847. 653944. 641051. 685423. 658610. 686554. 618233. 655932. + 670053. 641412. 617865. 651189. 639806. 658123. 639381. 644652. + 667240. 689363. 627791. 625137. 635577. 643151. 616453. 633041. + 629223. 645209. 629201. 639179. 649602. 638849. 667827. 637550. + 652560. 647235. 710669. 626332. 689819. 646094. 631575. 633863. + 657661. 642538. 648691. 660292. 649780. 643179. 615128. 653863. + 610901. 613489. 624788. 680903. 617416. 654761. 663484. 672619. + 615038. 630960. 622431. 634951. 659257. 649931. 633901. 612363. + 639325. 683887. 656753. 690935. 661310. 692022. 633441. 644006. + 652187. 643428. 643711. 682621. 607918. 674996. 674909. 625800. + 642090. 662203. 667386. 660448.] + +What happens if we call the script with no arguments? + + python results.py + + Traceback (most recent call last): + File "results.py", line 23, in + main() + File "results.py", line 7, in main + shuffled_results_file = sys.argv[1] + IndexError: list index out of range + +Not very helpful. Let's add a usage description: + + import sys + import numpy as np + + usage_string = """ + Results.py plots the real overlap between two bed files + versus the overlap when one of the results files is randomly + shuffled. + + Usage: python random.py shuffled_results.txt real_results.txt + """ + + def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + +Now we can plot a histogram of the shuffled data: + + plt.hist(shuffled_data, color='black') + plt.show() + +And we can plot the real result as a vertical red line: + + import sys + import numpy as np + from matplotlib import pyplot as plt + + usage_string = """ + Results.py plots the real overlap between two bed files + versus the overlap when one of the results files is randomly + shuffled. + + Usage: python random.py shuffled_results.txt real_results.txt + """ + + def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + + def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black') + plt.axvline(real_data, color='red', ls='--', lw=2) + plt.show() + + main() + +Finally, let's add some axis labels and a legend: + + import sys + import numpy as np + from matplotlib import pyplot as plt + + usage_string = """ + Results.py plots the real overlap between two bed files + versus the overlap when one of the results files is randomly + shuffled. + + Usage: python random.py shuffled_results.txt real_results.txt + """ + + def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + + def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black', label='Shuffled Data') + plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') + plt.xlabel('Nucleotides overlap') + plt.ylabel('Number of shuffled results') + plt.legend() + plt.show() + + main() + + diff --git a/novice/capstones/bedtools-python/results.py b/novice/capstones/bedtools-python/results.py new file mode 100644 index 000000000..5b71b4f56 --- /dev/null +++ b/novice/capstones/bedtools-python/results.py @@ -0,0 +1,42 @@ +import sys +import numpy as np +from matplotlib import pyplot as plt + +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. + +Usage: python random.py shuffled_results.txt real_results.txt +""" + +def main(): + + if not len(sys.argv) == 3: + sys.exit(usage_string) + + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] + + process(shuffled_results_file, real_results_file) + + +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) + + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black', label='Shuffled Data') + plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') + plt.xlabel('Nucleotides overlap') + plt.ylabel('Number of shuffled results') + plt.legend() + plt.show() + +main() From 7916e338f6fe4e2cfe7490c6afd319d73f33d087 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 22 Jul 2014 14:19:46 +0100 Subject: [PATCH 05/12] Get jekyll to build our new example --- Makefile | 1 + novice/capstones/bedtools-python/bedtools-capstone.md | 9 +++++---- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/Makefile b/Makefile index cb6805a12..57460b891 100644 --- a/Makefile +++ b/Makefile @@ -87,6 +87,7 @@ BOOK_SRC = \ # All source pages (including things not in the book). PAGES_SRC = \ contents.md \ + $(wildcard novice/capstones/*/*.md) \ $(wildcard intermediate/python/*.md) \ $(BOOK_SRC) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 1df8735cb..5a4bd2842 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -1,7 +1,8 @@ -% bedtools Tutorial -% Aaron Quinlan -% November 22, 2013 - +--- +layout: lesson +root: ../../.. +title: Capstone example Python and Bedtools +--- Synopsis ======== From 3226f2fe8ed42f67e0ff453e98b3d9f675b31319 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 22 Jul 2014 15:49:00 +0100 Subject: [PATCH 06/12] Sorted formatting out --- .../bedtools-python/bedtools-capstone.md | 819 ++++++++++-------- 1 file changed, 475 insertions(+), 344 deletions(-) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 5a4bd2842..9dbcf36dd 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -10,53 +10,42 @@ Synopsis Our goal is to work through examples that demonstrate how to explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package. -Some of our analysis will be based upon the Maurano et al exploration of DnaseI hypersensitivity sites in hundreds of primary tissue types. - - Maurano et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012. Vol. 337 no. 6099 pp. 1190-1195. - - www.sciencemag.org/content/337/6099/1190.short - - This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). - - - -\ - - Setup ===== From the Terminal, create a new directory on your Desktop called "bedtools-demo". - cd ~/Desktop - mkdir bedtools-demo +~~~ +cd ~/Desktop +mkdir bedtools-demo +~~~ +{:class="in"} + Navigate into that directory. - cd bedtools-demo +~~~ +cd bedtools-demo +~~~ +{:class="in"} Download the sample BED files I have provided. - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/maurano.dnaseI.tgz - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed - curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt - -Now, we need to extract all of the 20 Dnase I hypersensitivity BED files from the "tarball" named -`maurano.dnaseI.tgz`. - - tar -zxvf maurano.dnaseI.tgz - rm maurano.dnaseI.tgz +~~~ +curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed +curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed +curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed +curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt +~~~ +{:class="in"} Let's take a look at what files we now have. - ls -1 - - -\ - +~~~ +ls -1 +~~~ +{:class="in"} What are these files? ========================= @@ -66,22 +55,24 @@ In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` r The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). - -\ - - The bedtools help ================== To bring up the help, just type - bedtools +~~~ +bedtools +~~~ +{:class="in"} As you can see, there are multiple "subcommands" and for bedtools to work you must tell it which subcommand you want to use. Examples: - bedtools intersect - bedtools merge - bedtools subtract +~~~ +bedtools intersect +bedtools merge +bedtools subtract +~~~ +{:class="in"} bedtools "intersect" @@ -95,78 +86,105 @@ Default behavior ---------------- By default, `intersect` reports the intervals that represent overlaps between your two files. To demonstrate, let's identify all of the CpG islands that overlap exons. - bedtools intersect -a cpg.bed -b exons.bed | head -5 - chr1 29320 29370 CpG:_116 - chr1 135124 135563 CpG:_30 - chr1 327790 328229 CpG:_29 - chr1 327790 328229 CpG:_29 - chr1 327790 328229 CpG:_29 +~~~ +bedtools intersect -a cpg.bed -b exons.bed | head -5 +~~~ +{:class="in"} +~~~ +chr1 29320 29370 CpG:_116 +chr1 135124 135563 CpG:_30 +chr1 327790 328229 CpG:_29 +chr1 327790 328229 CpG:_29 +chr1 327790 328229 CpG:_29 +~~~ +{:class="out"} Reporting the original feature in each file. -------------------------------------------- The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected. - bedtools intersect -a cpg.bed -b exons.bed -wa -wb \ - | head -5 - chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - - chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + - chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + +~~~ +bedtools intersect -a cpg.bed -b exons.bed -wa -wb \ +| head -5 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + +~~~ +{:class="out"} How many base pairs of overlap were there? ------------------------------------------ The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. - bedtools intersect -a cpg.bed -b exons.bed -wo \ - | head -10 - chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 - chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 - chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 - chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +~~~ +bedtools intersect -a cpg.bed -b exons.bed -wo \ +| head -10 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +~~~ +{:class="out"} Counting the number of overlapping features. -------------------------------------------- We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. - bedtools intersect -a cpg.bed -b exons.bed -c \ - | head - chr1 28735 29810 CpG:_116 1 - chr1 135124 135563 CpG:_30 1 - chr1 327790 328229 CpG:_29 3 - chr1 437151 438164 CpG:_84 0 - chr1 449273 450544 CpG:_99 0 - chr1 533219 534114 CpG:_94 0 - chr1 544738 546649 CpG:_171 0 - chr1 713984 714547 CpG:_60 1 - chr1 762416 763445 CpG:_115 10 - chr1 788863 789211 CpG:_28 9 - -\ - +~~~ +bedtools intersect -a cpg.bed -b exons.bed -c \ +| head +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 1 +chr1 135124 135563 CpG:_30 1 +chr1 327790 328229 CpG:_29 3 +chr1 437151 438164 CpG:_84 0 +chr1 449273 450544 CpG:_99 0 +chr1 533219 534114 CpG:_94 0 +chr1 544738 546649 CpG:_171 0 +chr1 713984 714547 CpG:_60 1 +chr1 762416 763445 CpG:_115 10 +chr1 788863 789211 CpG:_28 9 +~~~ +{:class="out"} Find features that DO NOT overlap -------------------------------------------- Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. - bedtools intersect -a cpg.bed -b exons.bed -v \ - | head - chr1 437151 438164 CpG:_84 - chr1 449273 450544 CpG:_99 - chr1 533219 534114 CpG:_94 - chr1 544738 546649 CpG:_171 - chr1 801975 802338 CpG:_24 - chr1 805198 805628 CpG:_50 - chr1 839694 840619 CpG:_83 - chr1 844299 845883 CpG:_153 - chr1 912869 913153 CpG:_28 - chr1 919726 919927 CpG:_15 +~~~ +bedtools intersect -a cpg.bed -b exons.bed -v \ +| head +~~~ +{:class="in"} +~~~ +chr1 437151 438164 CpG:_84 +chr1 449273 450544 CpG:_99 +chr1 533219 534114 CpG:_94 +chr1 544738 546649 CpG:_171 +chr1 801975 802338 CpG:_24 +chr1 805198 805628 CpG:_50 +chr1 839694 840619 CpG:_83 +chr1 844299 845883 CpG:_153 +chr1 912869 913153 CpG:_28 +chr1 919726 919927 CpG:_15 +~~~ +{:class="out"} Require a minimal fraction of overlap. @@ -175,22 +193,25 @@ Recall that the default is to report overlaps between features in A and B so lon Let's be more strict and require 50% of overlap. - bedtools intersect -a cpg.bed -b exons.bed \ - -wo -f 0.50 \ - | head - chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 - chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047525_exon_4_0_chr1_788771_f 0 + 348 - chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047524_exon_3_0_chr1_788771_f 0 + 348 - chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047523_exon_3_0_chr1_788771_f 0 + 348 - chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047522_exon_5_0_chr1_788859_f 0 + 348 - chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047521_exon_4_0_chr1_788771_f 0 + 348 - chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047520_exon_6_0_chr1_788859_f 0 + 348 - - -\ +~~~ +bedtools intersect -a cpg.bed -b exons.bed \ +-wo -f 0.50 \ +| head +~~~ +{:class="in"} +~~~ +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047525_exon_4_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047524_exon_3_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047523_exon_3_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047522_exon_5_0_chr1_788859_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047521_exon_4_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047520_exon_6_0_chr1_788859_f 0 + 348 +~~~ +{:class="out"} Do exons and CPG islands overlap significantly? =============================================== @@ -198,21 +219,24 @@ Do exons and CPG islands overlap significantly? Going back to base pairs overlap -------------------------------- - bedtools intersect -a cpg.bed -b exons.bed -wo \ - | head -10 - chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 - chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 - chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 - chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 - chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 - - - +~~~ +bedtools intersect -a cpg.bed -b exons.bed -wo \ +| head -10 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +~~~ +{:class="out"} bedtools "merge" ==================== @@ -227,157 +251,237 @@ The merge tool requires that the input file is sorted by chromosome, then by sta If you run the merge command on a file that is not sorted, you will get an error. For example: - bedtools merge -i exons.bed - chr1 66999824 67000051 - chr1 67091529 67091593 - chr1 67098752 67098777 - chr1 67101626 67101698 - chr1 67105459 67105516 - chr1 67108492 67108547 - chr1 67109226 67109402 - chr1 67126195 67126207 - chr1 67133212 67133224 - chr1 67136677 67136702 - chr1 67137626 67137678 - chr1 67138963 67139049 - chr1 67142686 67142779 - chr1 67145360 67145435 - chr1 67147551 67148052 - chr1 67154830 67154958 - chr1 67155872 67155999 - chr1 67161116 67161176 - chr1 67184976 67185088 - chr1 67194946 67195102 - chr1 67199430 67199563 - chr1 67205017 67205220 - chr1 67206340 67206405 - chr1 67206954 67207119 - ERROR: input file: (exons.bed) is not sorted by chrom then start. - The start coordinate at line 26 is less than the start at line 25 +~~~ +bedtools merge -i exons.bed +~~~ +{:class="in"} +~~~ +chr1 66999824 67000051 +chr1 67091529 67091593 +chr1 67098752 67098777 +chr1 67101626 67101698 +chr1 67105459 67105516 +chr1 67108492 67108547 +chr1 67109226 67109402 +chr1 67126195 67126207 +chr1 67133212 67133224 +chr1 67136677 67136702 +chr1 67137626 67137678 +chr1 67138963 67139049 +chr1 67142686 67142779 +chr1 67145360 67145435 +chr1 67147551 67148052 +chr1 67154830 67154958 +chr1 67155872 67155999 +chr1 67161116 67161176 +chr1 67184976 67185088 +chr1 67194946 67195102 +chr1 67199430 67199563 +chr1 67205017 67205220 +chr1 67206340 67206405 +chr1 67206954 67207119 +ERROR: input file: (exons.bed) is not sorted by chrom then start. + The start coordinate at line 26 is less than the start at line 25 +~~~ +{:class="err"} To correct this, you need to sort your BED using the UNIX `sort` utility. - sort -k1,1 -k2,2n exons.bed > exons.sort.bed +~~~ +sort -k1,1 -k2,2n exons.bed > exons.sort.bed +~~~ +{:class="in"} Let's try again. - bedtools merge -i exons.sort.bed | head -10 +~~~ +bedtools merge -i exons.sort.bed | head -10 +~~~ +{:class="in"} The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. Or alternatively: - bedtools sort -i exons.bed | bedtools merge > exons.merged.bed +~~~ +bedtools sort -i exons.bed | bedtools merge > exons.merged.bed +~~~ +{:class="in"} Now the overlap with merged exons: - bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ - | head -n 10 - chr1 28735 29810 CpG:_116 chr1 29320 29370 50 - chr1 135124 135563 CpG:_30 chr1 134772 139696 439 - chr1 327790 328229 CpG:_29 chr1 324438 328581 439 - chr1 713984 714547 CpG:_60 chr1 713663 714068 84 - chr1 762416 763445 CpG:_115 chr1 761585 762902 486 - chr1 762416 763445 CpG:_115 chr1 762970 763155 185 - chr1 762416 763445 CpG:_115 chr1 763177 763229 52 - chr1 788863 789211 CpG:_28 chr1 788770 794826 348 - chr1 854765 854973 CpG:_16 chr1 854714 854817 52 - chr1 858970 861632 CpG:_257 chr1 861120 861180 60 +~~~ +bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ +| head -n 10 +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 84 +chr1 762416 763445 CpG:_115 chr1 761585 762902 486 +chr1 762416 763445 CpG:_115 chr1 762970 763155 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 52 +chr1 788863 789211 CpG:_28 chr1 788770 794826 348 +chr1 854765 854973 CpG:_16 chr1 854714 854817 52 +chr1 858970 861632 CpG:_257 chr1 861120 861180 60 +~~~ +{:class="out"} All we want is the very last column: - bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ - | cut -f 8,8 | head -n 10 - 50 - 439 - 439 - 84 - 486 - 185 - 52 - 348 - 52 - 60 +~~~ +bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ +| cut -f 8,8 | head -n 10 +~~~ +{:class="in"} +~~~ +50 +439 +439 +84 +486 +185 +52 +348 +52 +60 +~~~ +{:class="out"} Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons: - bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ - | cut -f 8,8 | python readings.py --sum - 7235722.0 +~~~ +bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ +| cut -f 8,8 | python readings.py --sum +~~~ +{:class="in"} +~~~ +7235722.0 +~~~ +{:class="out"} How do we know if this overlap is significant? Shuffle one of the files. - bedtools shuffle -i cpg.bed -g genome.txt \ - | head -n 10 - chr7 154365006 154366081 CpG:_116 - chr2 104893603 104894042 CpG:_30 - chr11 11786369 11786808 CpG:_29 - chr1 116263073 116264086 CpG:_84 - chr10 10674197 10675468 CpG:_99 - chrX 7680868 7681763 CpG:_94 - chr6 34093178 34095089 CpG:_171 - chr1 88263374 88263937 CpG:_60 - chr1 150416635 150417664 CpG:_115 - chr6 39650957 39651305 CpG:_28 +~~~ +bedtools shuffle -i cpg.bed -g genome.txt \ +| head -n 10 +~~~ +{:class="in"} +~~~ +chr7 154365006 154366081 CpG:_116 +chr2 104893603 104894042 CpG:_30 +chr11 11786369 11786808 CpG:_29 +chr1 116263073 116264086 CpG:_84 +chr10 10674197 10675468 CpG:_99 +chrX 7680868 7681763 CpG:_94 +chr6 34093178 34095089 CpG:_171 +chr1 88263374 88263937 CpG:_60 +chr1 150416635 150417664 CpG:_115 +chr6 39650957 39651305 CpG:_28 +~~~ +{:class="out"} But keep the intervals on the same chromosome... - bedtools shuffle -chrom -i cpg.bed -g genome.txt \ - | head -n 10 - chr1 2367858 2368933 CpG:_116 - chr1 225323760 225324199 CpG:_30 - chr1 37368198 37368637 CpG:_29 - chr1 13872987 13874000 CpG:_84 - chr1 91280571 91281842 CpG:_99 - chr1 74656135 74657030 CpG:_94 - chr1 64079046 64080957 CpG:_171 - chr1 54034535 54035098 CpG:_60 - chr1 237633659 237634688 CpG:_115 - chr1 51954093 51954441 CpG:_28 +~~~ +bedtools shuffle -chrom -i cpg.bed -g genome.txt \ +| head -n 10 +~~~ +{:class="in"} +~~~ +chr1 2367858 2368933 CpG:_116 +chr1 225323760 225324199 CpG:_30 +chr1 37368198 37368637 CpG:_29 +chr1 13872987 13874000 CpG:_84 +chr1 91280571 91281842 CpG:_99 +chr1 74656135 74657030 CpG:_94 +chr1 64079046 64080957 CpG:_171 +chr1 54034535 54035098 CpG:_60 +chr1 237633659 237634688 CpG:_115 +chr1 51954093 51954441 CpG:_28 +~~~ +{:class="out"} Save to a file, and count the overlap for the shuffled features - bedtools shuffle -chrom -i cpg.bed -g genome.txt \ - > cpg.shuffled.bed - bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \ - | cut -f 8,8 | python readings.py --sum - 681980.0 +~~~ +bedtools shuffle -chrom -i cpg.bed -g genome.txt \ +> cpg.shuffled.bed +bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \ +| cut -f 8,8 | python readings.py --sum +~~~ +{:class="in"} +~~~ +681980.0 +~~~ +{:class="out"} Do this lots of times to get a p-value - mkdir shuffled_cpg +~~~ +mkdir shuffled_cpg +~~~ +{:class="in"} + +We have a script for that: + +~~~ +cat do_shuffle.sh +~~~ +{:class="in"} +~~~ +#! /bin/bash - #! /bin/bash +for i in {1..100} +do + bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed +done +~~~ +{:class="out"} +Run the script: + +~~~ +./do_suffle.sh +ls -l shuffled_cpg | head -n 5 +~~~ +{:class="in"} - for i in {1..100} - do - bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed - done And then: - for f in shuffled_cpg/*; - do bedtools intersect -a $f -b exons.merged.bed -wo \ - | cut -f 8,8 | python readings.py --sum; done | head -n 10 - 651838.0 - 655178.0 - 633355.0 - 636853.0 - 668339.0 - 651838.0 - 655178.0 - 633355.0 - 636853.0 - 668339.0 +~~~ +for f in shuffled_cpg/*; +do bedtools intersect -a $f -b exons.merged.bed -wo \ +| cut -f 8,8 | python readings.py --sum; done | head -n 10 +~~~ +{:class="in"} +~~~ +651838.0 +655178.0 +633355.0 +636853.0 +668339.0 +651838.0 +655178.0 +633355.0 +636853.0 +668339.0 +~~~ +{:class="out"} Save that to a file: - for f in shuffled_cpg/*; - do bedtools intersect -a $f -b exons.merged.bed -wo \ - | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt +~~~ +for f in shuffled_cpg/*; +do bedtools intersect -a $f -b exons.merged.bed -wo \ +| cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt +~~~ +{:class="in"} Plotting our results with python @@ -385,169 +489,196 @@ Plotting our results with python Let's copy the readings.py script and alter it to read in our results files: - import sys - import numpy as np +~~~ +cat results.py +~~~ +{:class="in"} +~~~ +import sys +import numpy as np - def main(): - script = sys.argv[0] - shuffled_results_file = sys.argv[1] - real_results_file = sys.argv[2] +def main(): + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] - process(shuffled_results_file, real_results_file) + process(shuffled_results_file, real_results_file) - def process(shuffled_results_file, real_results_file): - shuffled_data = np.loadtxt(shuffled_results_file) - real_data = np.loadtxt(real_results_file) +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) - print 'Real data:' - print real_data - print - print 'Shuffled data:' - print shuffled_data + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data - main() +main() +~~~ +{:class="out"} Which prints: - Real data: - 7235722.0 - - Shuffled data: - [ 640473. 638611. 644321. 642422. 634641. 660745. 645816. 635427. - 595615. 644916. 662412. 645489. 666141. 674597. 637099. 621515. - 647847. 653944. 641051. 685423. 658610. 686554. 618233. 655932. - 670053. 641412. 617865. 651189. 639806. 658123. 639381. 644652. - 667240. 689363. 627791. 625137. 635577. 643151. 616453. 633041. - 629223. 645209. 629201. 639179. 649602. 638849. 667827. 637550. - 652560. 647235. 710669. 626332. 689819. 646094. 631575. 633863. - 657661. 642538. 648691. 660292. 649780. 643179. 615128. 653863. - 610901. 613489. 624788. 680903. 617416. 654761. 663484. 672619. - 615038. 630960. 622431. 634951. 659257. 649931. 633901. 612363. - 639325. 683887. 656753. 690935. 661310. 692022. 633441. 644006. - 652187. 643428. 643711. 682621. 607918. 674996. 674909. 625800. - 642090. 662203. 667386. 660448.] +~~~ +python results.py shuffled_results.txt cpg_result.txt +~~~ +{:class="in"} +~~~ +Real data: +7235722.0 + +Shuffled data: +[ 640473. 638611. 644321. 642422. 634641. 660745. 645816. 635427. + 595615. 644916. 662412. 645489. 666141. 674597. 637099. 621515. + 647847. 653944. 641051. 685423. 658610. 686554. 618233. 655932. + 670053. 641412. 617865. 651189. 639806. 658123. 639381. 644652. + 667240. 689363. 627791. 625137. 635577. 643151. 616453. 633041. + 629223. 645209. 629201. 639179. 649602. 638849. 667827. 637550. + 652560. 647235. 710669. 626332. 689819. 646094. 631575. 633863. + 657661. 642538. 648691. 660292. 649780. 643179. 615128. 653863. + 610901. 613489. 624788. 680903. 617416. 654761. 663484. 672619. + 615038. 630960. 622431. 634951. 659257. 649931. 633901. 612363. + 639325. 683887. 656753. 690935. 661310. 692022. 633441. 644006. + 652187. 643428. 643711. 682621. 607918. 674996. 674909. 625800. + 642090. 662203. 667386. 660448.] +~~~ +{:class="out"} What happens if we call the script with no arguments? - python results.py - - Traceback (most recent call last): - File "results.py", line 23, in - main() - File "results.py", line 7, in main - shuffled_results_file = sys.argv[1] - IndexError: list index out of range +~~~ +python results.py +~~~ +{:class="in"} +~~~ +Traceback (most recent call last): + File "results.py", line 23, in + main() + File "results.py", line 7, in main + shuffled_results_file = sys.argv[1] +IndexError: list index out of range +~~~ +{:class="err"} Not very helpful. Let's add a usage description: - import sys - import numpy as np +~~~ +import sys +import numpy as np - usage_string = """ - Results.py plots the real overlap between two bed files - versus the overlap when one of the results files is randomly - shuffled. +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. - Usage: python random.py shuffled_results.txt real_results.txt - """ +Usage: python random.py shuffled_results.txt real_results.txt +""" - def main(): +def main(): - if not len(sys.argv) == 3: - sys.exit(usage_string) + if not len(sys.argv) == 3: + sys.exit(usage_string) +~~~ Now we can plot a histogram of the shuffled data: - plt.hist(shuffled_data, color='black') - plt.show() +~~~ +plt.hist(shuffled_data, color='black') +plt.show() +~~~ And we can plot the real result as a vertical red line: - import sys - import numpy as np - from matplotlib import pyplot as plt +~~~ +import sys +import numpy as np +from matplotlib import pyplot as plt - usage_string = """ - Results.py plots the real overlap between two bed files - versus the overlap when one of the results files is randomly - shuffled. +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. - Usage: python random.py shuffled_results.txt real_results.txt - """ +Usage: python random.py shuffled_results.txt real_results.txt +""" - def main(): +def main(): - if not len(sys.argv) == 3: - sys.exit(usage_string) + if not len(sys.argv) == 3: + sys.exit(usage_string) - script = sys.argv[0] - shuffled_results_file = sys.argv[1] - real_results_file = sys.argv[2] + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] - process(shuffled_results_file, real_results_file) + process(shuffled_results_file, real_results_file) - def process(shuffled_results_file, real_results_file): - shuffled_data = np.loadtxt(shuffled_results_file) - real_data = np.loadtxt(real_results_file) +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) - print 'Real data:' - print real_data - print - print 'Shuffled data:' - print shuffled_data + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data - plt.hist(shuffled_data, color='black') - plt.axvline(real_data, color='red', ls='--', lw=2) - plt.show() + plt.hist(shuffled_data, color='black') + plt.axvline(real_data, color='red', ls='--', lw=2) + plt.show() - main() +main() +~~~ Finally, let's add some axis labels and a legend: - import sys - import numpy as np - from matplotlib import pyplot as plt +~~~ +import sys +import numpy as np +from matplotlib import pyplot as plt - usage_string = """ - Results.py plots the real overlap between two bed files - versus the overlap when one of the results files is randomly - shuffled. +usage_string = """ +Results.py plots the real overlap between two bed files +versus the overlap when one of the results files is randomly +shuffled. - Usage: python random.py shuffled_results.txt real_results.txt - """ +Usage: python random.py shuffled_results.txt real_results.txt +""" - def main(): +def main(): - if not len(sys.argv) == 3: - sys.exit(usage_string) + if not len(sys.argv) == 3: + sys.exit(usage_string) - script = sys.argv[0] - shuffled_results_file = sys.argv[1] - real_results_file = sys.argv[2] + script = sys.argv[0] + shuffled_results_file = sys.argv[1] + real_results_file = sys.argv[2] - process(shuffled_results_file, real_results_file) + process(shuffled_results_file, real_results_file) - def process(shuffled_results_file, real_results_file): - shuffled_data = np.loadtxt(shuffled_results_file) - real_data = np.loadtxt(real_results_file) +def process(shuffled_results_file, real_results_file): + shuffled_data = np.loadtxt(shuffled_results_file) + real_data = np.loadtxt(real_results_file) - print 'Real data:' - print real_data - print - print 'Shuffled data:' - print shuffled_data + print 'Real data:' + print real_data + print + print 'Shuffled data:' + print shuffled_data - plt.hist(shuffled_data, color='black', label='Shuffled Data') - plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') - plt.xlabel('Nucleotides overlap') - plt.ylabel('Number of shuffled results') - plt.legend() - plt.show() + plt.hist(shuffled_data, color='black', label='Shuffled Data') + plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') + plt.xlabel('Nucleotides overlap') + plt.ylabel('Number of shuffled results') + plt.legend() + plt.show() - main() +main() +~~~ From 6dc35eeb5b83617bfa20fdc1c9b951a35980930c Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Tue, 22 Jul 2014 17:29:07 +0100 Subject: [PATCH 07/12] Expanding explanatory text and fixing some formatting issues --- .../bedtools-python/bedtools-capstone.md | 215 ++++++++++-------- 1 file changed, 126 insertions(+), 89 deletions(-) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 9dbcf36dd..6536a1d1a 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -10,15 +10,16 @@ Synopsis Our goal is to work through examples that demonstrate how to explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package. -This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). +This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). Setup ===== + From the Terminal, create a new directory on your Desktop called "bedtools-demo". ~~~ -cd ~/Desktop -mkdir bedtools-demo +$ cd ~/Desktop +$ mkdir bedtools-demo ~~~ {:class="in"} @@ -26,41 +27,50 @@ mkdir bedtools-demo Navigate into that directory. ~~~ -cd bedtools-demo +$ cd bedtools-demo ~~~ {:class="in"} -Download the sample BED files I have provided. +Download the sample BED files. ~~~ -curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed -curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed -curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed -curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt +$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed +$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed +$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed +$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt ~~~ {:class="in"} Let's take a look at what files we now have. ~~~ -ls -1 +$ ls -1 ~~~ {:class="in"} +~~~ +cpg.bed +exons.bed +genome.txt +gwas.bed +~~~ +{:class="out"} + + What are these files? ========================= -Your directory should now contain 23 BED files and 1 genome file. Twenty of these files (those starting with "f" for "fetal tissue") reflect Dnase I hypersensitivity sites measured in twenty different fetal tissue samples from the brain, heart, intestine, kidney, lung, muscle, skin, and stomach. +Your directory should now contain 3 BED files and 1 genome file. -In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). +`cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). -The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). +These 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). The bedtools help ================== To bring up the help, just type ~~~ -bedtools +$ bedtools ~~~ {:class="in"} @@ -68,26 +78,27 @@ As you can see, there are multiple "subcommands" and for bedtools to work you must tell it which subcommand you want to use. Examples: ~~~ -bedtools intersect -bedtools merge -bedtools subtract +$ bedtools intersect +$ bedtools merge +$ bedtools subtract ~~~ {:class="in"} +This should bring up the specific help text associated with each tool. bedtools "intersect" ==================== -The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common). +The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common). ![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png) Default behavior ---------------- -By default, `intersect` reports the intervals that represent overlaps between your two files. To demonstrate, let's identify all of the CpG islands that overlap exons. +By default, `intersect` reports the intervals that represent overlaps between your two files. To demonstrate, let's identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons. ~~~ -bedtools intersect -a cpg.bed -b exons.bed | head -5 +$ bedtools intersect -a cpg.bed -b exons.bed | head -5 ~~~ {:class="in"} ~~~ @@ -104,16 +115,15 @@ Reporting the original feature in each file. The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected. ~~~ -bedtools intersect -a cpg.bed -b exons.bed -wa -wb \ -| head -5 +$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5 ~~~ {:class="in"} ~~~ -chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - -chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + -chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + ~~~ {:class="out"} @@ -122,20 +132,19 @@ How many base pairs of overlap were there? The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. ~~~ -bedtools intersect -a cpg.bed -b exons.bed -wo \ -| head -10 +$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 ~~~ {:class="in"} ~~~ -chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 -chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 -chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 -chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 ~~~ {:class="out"} @@ -145,8 +154,7 @@ Counting the number of overlapping features. We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. ~~~ -bedtools intersect -a cpg.bed -b exons.bed -c \ -| head +$ bedtools intersect -a cpg.bed -b exons.bed -c | head ~~~ {:class="in"} ~~~ @@ -168,8 +176,7 @@ Find features that DO NOT overlap Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. ~~~ -bedtools intersect -a cpg.bed -b exons.bed -v \ -| head +$ bedtools intersect -a cpg.bed -b exons.bed -v | head ~~~ {:class="in"} ~~~ @@ -194,52 +201,56 @@ Recall that the default is to report overlaps between features in A and B so lon Let's be more strict and require 50% of overlap. ~~~ -bedtools intersect -a cpg.bed -b exons.bed \ --wo -f 0.50 \ -| head +$ bedtools intersect -a cpg.bed -b exons.bed -wo -f 0.50 | head ~~~ {:class="in"} ~~~ -chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047525_exon_4_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047524_exon_3_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047523_exon_3_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047522_exon_5_0_chr1_788859_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_ 047521_exon_4_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_ 047520_exon_6_0_chr1_788859_f 0 + 348 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047525_exon_4_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047524_exon_3_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047523_exon_3_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_047522_exon_5_0_chr1_788859_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047521_exon_4_0_chr1_788771_f 0 + 348 +chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_047520_exon_6_0_chr1_788859_f 0 + 348 ~~~ {:class="out"} -Do exons and CPG islands overlap significantly? +Do exons and CpG islands overlap significantly? =============================================== +OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do the overlap more than we would expect by random chance? + Going back to base pairs overlap -------------------------------- +In order to start answering this question, we need to get the base pairs of overlap between all of the elements. From the section above, we know we can get this using the -wo flag: + ~~~ -bedtools intersect -a cpg.bed -b exons.bed -wo \ -| head -10 +$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 ~~~ {:class="in"} ~~~ -chr1 28735 29810 CpG:_116 chr1 29320 29370 NR _024540_exon_10_0_chr1_29321_r 0 - 50 -chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_ 039983_exon_0_0_chr1_134773_r 0 - 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028322_exon_2_0_chr1_324439_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_ 028327_exon_3_0_chr1_327036_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_ 028325_exon_2_0_chr1_324439_f 0 + 439 -chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_ 033908_exon_6_0_chr1_713664_r 0 - 84 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _015368_exon_0_0_chr1_762971_f 0 + 185 -chr1 762416 763445 CpG:_115 chr1 763177 763229 NR _047525_exon_0_0_chr1_763178_f 0 + 52 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR _047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 +chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 +chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 ~~~ {:class="out"} +Can you spot anything odd about this output? Lines 3-5 indicate that one of our CpG islands is overlapping with three alternative versions of the same exon. For the purposes of our analysis though, this should really count as a single overlap. What we need to do is merge overlapping exons in exons.bed together. + bedtools "merge" ==================== + Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you. ![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png) @@ -247,6 +258,7 @@ Many datasets of genomic features have many individual features that overlap one Input must be sorted -------------------- + The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM. If you run the merge command on a file that is not sorted, you will get an error. For example: @@ -280,40 +292,66 @@ chr1 67199430 67199563 chr1 67205017 67205220 chr1 67206340 67206405 chr1 67206954 67207119 +~~~ +{:class="in"} +~~~ ERROR: input file: (exons.bed) is not sorted by chrom then start. The start coordinate at line 26 is less than the start at line 25 ~~~ {:class="err"} -To correct this, you need to sort your BED using the UNIX `sort` utility. +To correct this, you need to sort your BED. We could use the UNIX `sort` utility that we already covered, which would look like this: ~~~ -sort -k1,1 -k2,2n exons.bed > exons.sort.bed +$ sort -k1,1 -k2,2n exons.bed | head -n 5 ~~~ {:class="in"} - -Let's try again. - ~~~ -bedtools merge -i exons.sort.bed | head -10 +chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + +chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + +chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + +chr1 14361 14829 NR_024540_exon_0_0_chr1_14362_r 0 - +chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - ~~~ -{:class="in"} +{:class="out"} -The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. +But if you can't remember the right arguments to use for sort, the command `bedtools sort` will also do what we want: -Or alternatively: +~~~ +$ bedtools sort -i exons.bed | head -n 5 +~~~ +{:class="in"} +~~~ +chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + +chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + +chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + +chr1 14361 14829 NR_024540_exon_0_0_chr1_14362_r 0 - +chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - +~~~ +{:class="out"} +Now instead of saving the sorted results to a separate file, we can actually pipe the result into bedtools merge: + ~~~ -bedtools sort -i exons.bed | bedtools merge > exons.merged.bed +$ bedtools sort -i exons.bed | bedtools merge | head -n 5 ~~~ {:class="in"} +~~~ +chr1 11873 12227 +chr1 12612 12721 +chr1 13220 14829 +chr1 14969 15038 +chr1 15795 15947 +~~~ +{:class="out"} -Now the overlap with merged exons: +The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. +Let's save the output of bedtools merge to a file and recalculate the overlap: ~~~ -bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ -| head -n 10 +$ bedtools sort -i exons.bed | bedtools merge > exons.merged.bed +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head -n 10 ~~~ {:class="in"} ~~~ @@ -330,11 +368,10 @@ chr1 858970 861632 CpG:_257 chr1 861120 861180 60 ~~~ {:class="out"} -All we want is the very last column: +This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. But all we actually need is the overlap column. We can use the unix `cut` command to extract just the 8th column. ~~~ -bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ -| cut -f 8,8 | head -n 10 +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 10 ~~~ {:class="in"} ~~~ @@ -351,11 +388,10 @@ bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ ~~~ {:class="out"} -Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons: +Now we need a program to read in this column of numbers and give us the sum. The `readings.py` program we wrote earlier will do this with a few small changes. ~~~ -bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ -| cut -f 8,8 | python readings.py --sum +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python readings.py --sum ~~~ {:class="in"} ~~~ @@ -363,11 +399,12 @@ bedtools intersect -a cpg.bed -b exons.merged.bed -wo \ ~~~ {:class="out"} -How do we know if this overlap is significant? Shuffle one of the files. +So we know what the overlap is between CpG islands and exons. How do we know whether this is significant? How do we know what we would expect by random chance? + +BedTools includes a command called shuffle which will randomize one of our input files. Let's try randomizing the CpG islands: ~~~ -bedtools shuffle -i cpg.bed -g genome.txt \ -| head -n 10 +bedtools shuffle -i cpg.bed -g genome.txt | head -n 10 ~~~ {:class="in"} ~~~ From 40ac97e61138a2231b71d7c2741289c7d9d38b81 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Wed, 23 Jul 2014 09:36:48 +0100 Subject: [PATCH 08/12] Moved the scripts to their own repository --- .../capstones/bedtools-python/do_shuffle.sh | 8 ---- novice/capstones/bedtools-python/readings.py | 37 ---------------- novice/capstones/bedtools-python/results.py | 42 ------------------- 3 files changed, 87 deletions(-) delete mode 100755 novice/capstones/bedtools-python/do_shuffle.sh delete mode 100644 novice/capstones/bedtools-python/readings.py delete mode 100644 novice/capstones/bedtools-python/results.py diff --git a/novice/capstones/bedtools-python/do_shuffle.sh b/novice/capstones/bedtools-python/do_shuffle.sh deleted file mode 100755 index 2c7e2edc1..000000000 --- a/novice/capstones/bedtools-python/do_shuffle.sh +++ /dev/null @@ -1,8 +0,0 @@ -#! /bin/bash - - -for i in {1..5} -do - bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed -done - diff --git a/novice/capstones/bedtools-python/readings.py b/novice/capstones/bedtools-python/readings.py deleted file mode 100644 index 28c61922e..000000000 --- a/novice/capstones/bedtools-python/readings.py +++ /dev/null @@ -1,37 +0,0 @@ -import sys -import numpy as np - - -def main(): - script = sys.argv[0] - action = sys.argv[1] - filenames = sys.argv[2:] - assert action in ['--min', '--mean', '--max', '--sum'], \ - 'Action is not one of --min, --mean, or --max: ' + action - if len(filenames) == 0: - process(sys.stdin, action) - else: - for f in filenames: - process(f, action) - - -def process(filename, action): - data = np.loadtxt(filename, delimiter=',') - - data_shape = data.shape - if len(data_shape) == 1: - data = data.reshape((1, len(data))) - - if action == '--min': - values = data.min(axis=1) - elif action == '--mean': - values = data.mean(axis=1) - elif action == '--max': - values = data.max(axis=1) - elif action == '--sum': - values = data.sum(axis=1) - - for m in values: - print m - -main() diff --git a/novice/capstones/bedtools-python/results.py b/novice/capstones/bedtools-python/results.py deleted file mode 100644 index 5b71b4f56..000000000 --- a/novice/capstones/bedtools-python/results.py +++ /dev/null @@ -1,42 +0,0 @@ -import sys -import numpy as np -from matplotlib import pyplot as plt - -usage_string = """ -Results.py plots the real overlap between two bed files -versus the overlap when one of the results files is randomly -shuffled. - -Usage: python random.py shuffled_results.txt real_results.txt -""" - -def main(): - - if not len(sys.argv) == 3: - sys.exit(usage_string) - - script = sys.argv[0] - shuffled_results_file = sys.argv[1] - real_results_file = sys.argv[2] - - process(shuffled_results_file, real_results_file) - - -def process(shuffled_results_file, real_results_file): - shuffled_data = np.loadtxt(shuffled_results_file) - real_data = np.loadtxt(real_results_file) - - print 'Real data:' - print real_data - print - print 'Shuffled data:' - print shuffled_data - - plt.hist(shuffled_data, color='black', label='Shuffled Data') - plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data') - plt.xlabel('Nucleotides overlap') - plt.ylabel('Number of shuffled results') - plt.legend() - plt.show() - -main() From 0d9e0ca5bb615cec8ccc80a024e4a98d0e7e623c Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Wed, 23 Jul 2014 10:18:35 +0100 Subject: [PATCH 09/12] Added instructions for forking repository --- .../bedtools-python/bedtools-capstone.md | 49 +++++++++++++------ 1 file changed, 35 insertions(+), 14 deletions(-) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 6536a1d1a..b07c0e831 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -12,14 +12,25 @@ explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) wit This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). +During this lesson, you'll be working on some scripts and data which are in their own github repository. In our git lesson, we covered the situation where you have been given write access to a repository. Now we're going to cover what to do if you don't have write access to a repository - this is called forking. + +Forking the github repository +============================= + +The data and code that you need to work on are located at [](git@github.com:rbeagrie/bedtools-example.git). You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface: + +The Fork Button + +If you visit the [bedtools-example repository](git@github.com:rbeagrie/bedtools-example.git) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example. + Setup ===== -From the Terminal, create a new directory on your Desktop called "bedtools-demo". +From the Terminal, clone your fork of the repository into a new folder on your Desktop called "bedtools-example". ~~~ $ cd ~/Desktop -$ mkdir bedtools-demo +$ git clone git@github.com:YOUR_USERNAME_GOES_HERE/bedtools-example.git ~~~ {:class="in"} @@ -27,24 +38,27 @@ $ mkdir bedtools-demo Navigate into that directory. ~~~ -$ cd bedtools-demo +$ cd bedtools-example ~~~ {:class="in"} -Download the sample BED files. +Let's take a look at what files we now have. ~~~ -$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed -$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed -$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed -$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt +$ ls -1 ~~~ {:class="in"} -Let's take a look at what files we now have. +~~~ +data +scripts +~~~ +{:class="out"} + +So we have two directories, one containing data and the other containing some scripts. We're going to start with the data folder, so lets look at what's in there. ~~~ -$ ls -1 +$ ls -1 data ~~~ {:class="in"} @@ -58,15 +72,22 @@ gwas.bed What are these files? -========================= -Your directory should now contain 3 BED files and 1 genome file. +--------------------- + +The data directory should contain 3 BED files and 1 genome file. -`cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). +`data/cpg.bed` represents CpG islands in the human genome; `data/exons.bed` represents RefSeq exons from human genes; and `data/gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). `data/genome.txt` file contains the lengths of all the human chromosomes. -These 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). +These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). + +Introducing BedTools +==================== + +We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate The bedtools help ================== + To bring up the help, just type ~~~ From 069945c25f8ae920b7294cdf78fa2fc3c31bf900 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Wed, 23 Jul 2014 14:38:03 +0100 Subject: [PATCH 10/12] Adding some more explanation about all of the commands --- .../bedtools-python/bedtools-capstone.md | 213 ++++++++++++------ 1 file changed, 142 insertions(+), 71 deletions(-) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index b07c0e831..09d3bdf38 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -83,10 +83,28 @@ These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http Introducing BedTools ==================== -We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate +We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`: + +~~~ +head data/cpg.bed +~~~ +{:class="in"} +~~~ +chr1 28735 29810 CpG:_116 +chr1 135124 135563 CpG:_30 +chr1 327790 328229 CpG:_29 +chr1 437151 438164 CpG:_84 +chr1 449273 450544 CpG:_99 +chr1 533219 534114 CpG:_94 +chr1 544738 546649 CpG:_171 +chr1 713984 714547 CpG:_60 +chr1 762416 763445 CpG:_115 +chr1 788863 789211 CpG:_28 +~~~ +{:class="out"} The bedtools help -================== +----------------- To bring up the help, just type @@ -110,7 +128,7 @@ This should bring up the specific help text associated with each tool. bedtools "intersect" ==================== -The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common). +The `intersect` command is the workhorse of the `bedtools` suite. It compares two bed files and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common). ![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png) @@ -131,9 +149,25 @@ chr1 327790 328229 CpG:_29 ~~~ {:class="out"} +Notice how the new file we create lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get the exon name instead: + +~~~ +$ bedtools intersect -a data/exons.bed -b data/cpg.bed | head -5 +~~~ +{:class="in"} +~~~ +chr1 50489434 50489626 NM_032785_exon_13_0_chr1_50489435_r 0 - +chr1 16767166 16767348 NM_018090_exon_0_0_chr1_16767167_f 0 + +chr1 33546713 33546895 NM_052998_exon_0_0_chr1_33546714_f 0 + +chr1 33546988 33547109 NM_052998_exon_1_0_chr1_33546989_f 0 + +chr1 33547201 33547243 NM_052998_exon_2_0_chr1_33547202_f 0 + +~~~ +{:class="out"} + + Reporting the original feature in each file. -------------------------------------------- -The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected. +The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected. ~~~ $ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5 @@ -148,47 +182,51 @@ chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_ ~~~ {:class="out"} -How many base pairs of overlap were there? ------------------------------------------- -The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. +Counting the number of overlapping features. +-------------------------------------------- +We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. ~~~ -$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 +$ bedtools intersect -a cpg.bed -b exons.bed -c | head ~~~ {:class="in"} ~~~ -chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 -chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 -chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 -chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 -chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 -chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 +chr1 28735 29810 CpG:_116 1 +chr1 135124 135563 CpG:_30 1 +chr1 327790 328229 CpG:_29 3 +chr1 437151 438164 CpG:_84 0 +chr1 449273 450544 CpG:_99 0 +chr1 533219 534114 CpG:_94 0 +chr1 544738 546649 CpG:_171 0 +chr1 713984 714547 CpG:_60 1 +chr1 762416 763445 CpG:_115 10 +chr1 788863 789211 CpG:_28 9 ~~~ {:class="out"} -Counting the number of overlapping features. +Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`. + +Require a minimal fraction of overlap. -------------------------------------------- -We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. +Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported. + +Let's be more strict and require 50% of overlap. ~~~ -$ bedtools intersect -a cpg.bed -b exons.bed -c | head +$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c -f 0.50 | head ~~~ {:class="in"} ~~~ -chr1 28735 29810 CpG:_116 1 +chr1 28735 29810 CpG:_116 0 chr1 135124 135563 CpG:_30 1 chr1 327790 328229 CpG:_29 3 chr1 437151 438164 CpG:_84 0 chr1 449273 450544 CpG:_99 0 chr1 533219 534114 CpG:_94 0 chr1 544738 546649 CpG:_171 0 -chr1 713984 714547 CpG:_60 1 -chr1 762416 763445 CpG:_115 10 -chr1 788863 789211 CpG:_28 9 +chr1 713984 714547 CpG:_60 0 +chr1 762416 763445 CpG:_115 0 +chr1 788863 789211 CpG:_28 7 ~~~ {:class="out"} @@ -214,28 +252,25 @@ chr1 919726 919927 CpG:_15 ~~~ {:class="out"} - -Require a minimal fraction of overlap. --------------------------------------------- -Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported. - -Let's be more strict and require 50% of overlap. +How many base pairs of overlap were there? +------------------------------------------ +The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files. ~~~ -$ bedtools intersect -a cpg.bed -b exons.bed -wo -f 0.50 | head +$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10 ~~~ {:class="in"} ~~~ +chr1 28735 29810 CpG:_116 chr1 29320 29370 NR_024540_exon_10_0_chr1_29321_r 0 - 50 chr1 135124 135563 CpG:_30 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 439 chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 439 chr1 327790 328229 CpG:_29 chr1 327035 328581 NR_028327_exon_3_0_chr1_327036_f 0 + 439 chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 439 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047525_exon_4_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047524_exon_3_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047523_exon_3_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_047522_exon_5_0_chr1_788859_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788770 794826 NR_047521_exon_4_0_chr1_788771_f 0 + 348 -chr1 788863 789211 CpG:_28 chr1 788858 794826 NR_047520_exon_6_0_chr1_788859_f 0 + 348 +chr1 713984 714547 CpG:_60 chr1 713663 714068 NR_033908_exon_6_0_chr1_713664_r 0 - 84 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_015368_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 763177 763229 NR_047525_exon_0_0_chr1_763178_f 0 + 52 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047524_exon_0_0_chr1_762971_f 0 + 185 +chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_chr1_762971_f 0 + 185 ~~~ {:class="out"} @@ -324,7 +359,7 @@ ERROR: input file: (exons.bed) is not sorted by chrom then start. To correct this, you need to sort your BED. We could use the UNIX `sort` utility that we already covered, which would look like this: ~~~ -$ sort -k1,1 -k2,2n exons.bed | head -n 5 +$ sort -k1,1 -k2,2n exons.bed | head ~~~ {:class="in"} ~~~ @@ -339,7 +374,7 @@ chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - But if you can't remember the right arguments to use for sort, the command `bedtools sort` will also do what we want: ~~~ -$ bedtools sort -i exons.bed | head -n 5 +$ bedtools sort -i exons.bed | head ~~~ {:class="in"} ~~~ @@ -354,7 +389,7 @@ chr1 14969 15038 NR_024540_exon_1_0_chr1_14970_r 0 - Now instead of saving the sorted results to a separate file, we can actually pipe the result into bedtools merge: ~~~ -$ bedtools sort -i exons.bed | bedtools merge | head -n 5 +$ bedtools sort -i exons.bed | bedtools merge | head ~~~ {:class="in"} ~~~ @@ -366,13 +401,13 @@ chr1 15795 15947 ~~~ {:class="out"} -The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. +The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. Notice how the merged file has less columns than the original file - we've lost the name of the exon and the strand. Let's save the output of bedtools merge to a file and recalculate the overlap: ~~~ $ bedtools sort -i exons.bed | bedtools merge > exons.merged.bed -$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head -n 10 +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head ~~~ {:class="in"} ~~~ @@ -389,10 +424,14 @@ chr1 858970 861632 CpG:_257 chr1 861120 861180 60 ~~~ {:class="out"} -This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. But all we actually need is the overlap column. We can use the unix `cut` command to extract just the 8th column. +This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. Notice how the repetition in line 3-5 is now gone. + +#Challenge: stranded merge + +All we actually need from the output of `bedtools intersect` is the overlap column. We can use the unix `cut` command to extract just the 8th field (or column) using the -f flag. ~~~ -$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 10 +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | head ~~~ {:class="in"} ~~~ @@ -409,10 +448,10 @@ $ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 1 ~~~ {:class="out"} -Now we need a program to read in this column of numbers and give us the sum. The `readings.py` program we wrote earlier will do this with a few small changes. +Now we need a program to read in this column of numbers and give us the sum. Hopefully this should ring some bells - The `readings.py` program we wrote earlier will do this with a few small changes. ~~~ -$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python readings.py --sum +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum ~~~ {:class="in"} ~~~ @@ -420,12 +459,20 @@ $ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python re ~~~ {:class="out"} +Let's save this result to a file so that we remember it later: + +~~~ +$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum > cpg_result.txt +~~~ +{:class="in"} + + So we know what the overlap is between CpG islands and exons. How do we know whether this is significant? How do we know what we would expect by random chance? BedTools includes a command called shuffle which will randomize one of our input files. Let's try randomizing the CpG islands: ~~~ -bedtools shuffle -i cpg.bed -g genome.txt | head -n 10 +bedtools shuffle -i cpg.bed -g genome.txt | head ~~~ {:class="in"} ~~~ @@ -442,11 +489,10 @@ chr6 39650957 39651305 CpG:_28 ~~~ {:class="out"} -But keep the intervals on the same chromosome... +Notice how the CpG islands have been shuffled to random chromosomes. This means that each chromosome will have roughly the same number of CpG islands, regardless of length. A better random model would be to keep the distribution of CpG islands over chromosomes and just shuffle the positions of each CpG on each chromosome. We can do this using the -chrom flag. ~~~ -bedtools shuffle -chrom -i cpg.bed -g genome.txt \ -| head -n 10 +$ bedtools shuffle -chrom -i cpg.bed -g genome.txt | head ~~~ {:class="in"} ~~~ @@ -463,13 +509,11 @@ chr1 51954093 51954441 CpG:_28 ~~~ {:class="out"} -Save to a file, and count the overlap for the shuffled features +Now let's save that result to a file, and count the overlap for the shuffled features. ~~~ -bedtools shuffle -chrom -i cpg.bed -g genome.txt \ -> cpg.shuffled.bed -bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \ -| cut -f 8,8 | python readings.py --sum +$ bedtools shuffle -chrom -i cpg.bed -g genome.txt > cpg.shuffled.bed +$ bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum ~~~ {:class="in"} ~~~ @@ -477,17 +521,19 @@ bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \ ~~~ {:class="out"} -Do this lots of times to get a p-value +Great, so our shuffled file overlaps much less with exons than real CpG islands do. How do we know if the difference is significant? One approach is to make lots of different randomly shuffled files, and count the overlap for each one. If we do this 100 times, and we never see an overlap that is as big as the one we see for the original `data/cpg.bed` file, then we can say that the overlap between exons and CpG islands is significant with a P-value (roughly) < 0.01. + +First, prepare a directory to hold our shuffled results: ~~~ -mkdir shuffled_cpg +$ mkdir shuffled_cpg ~~~ {:class="in"} -We have a script for that: +In the `scripts` folder, you should see a file called `scripts/do_shuffle.sh`. This is a bash script that contains a loop from 1 to 100. It runs bedtools shuffle 100 times, and saves the results to a folder. Have a look at the contents of the file: ~~~ -cat do_shuffle.sh +$ cat scripts/do_shuffle.sh ~~~ {:class="in"} ~~~ @@ -496,26 +542,50 @@ cat do_shuffle.sh for i in {1..100} do - bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed + bedtools shuffle -chrom -i data/cpg.bed -g data/genome.txt > shuffled_cpg/cpg.shuffled${i}.bed done ~~~ {:class="out"} -Run the script: +Run the script, and then check to make sure you have some shiny new files in the `shuffled_cpg` folder: ~~~ -./do_suffle.sh -ls -l shuffled_cpg | head -n 5 +$ scripts/do_suffle.sh +$ ls -1 shuffled_cpg | head ~~~ {:class="in"} +~~~ +cpg.shuffled100.bed +cpg.shuffled10.bed +cpg.shuffled11.bed +cpg.shuffled12.bed +cpg.shuffled13.bed +cpg.shuffled14.bed +cpg.shuffled15.bed +cpg.shuffled16.bed +cpg.shuffled17.bed +cpg.shuffled18.bed +~~~ +{:class="out"} + +Now we need to calculate the overlap between `data/exons.bed` and each of our shuffled bed files. The command for the first file is: +~~~ +$ bedtools intersect -a shuffled_cpg/cpg.shuffled1.bed -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +~~~ +{:class="in"} +~~~ +630845.0 +~~~ +{:class="out"} -And then: +See if you can adapt this command and turn it into a for loop which runs over all of our shuffled results. ~~~ -for f in shuffled_cpg/*; -do bedtools intersect -a $f -b exons.merged.bed -wo \ -| cut -f 8,8 | python readings.py --sum; done | head -n 10 +$ for filename in shuffled_cpg/* +> do +> bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +> done | head ~~~ {:class="in"} ~~~ @@ -535,9 +605,10 @@ do bedtools intersect -a $f -b exons.merged.bed -wo \ Save that to a file: ~~~ -for f in shuffled_cpg/*; -do bedtools intersect -a $f -b exons.merged.bed -wo \ -| cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt +$ for filename in shuffled_cpg/*; +do + bedtools intersect -a $filename -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum +done > shuffled_results.txt ~~~ {:class="in"} From c1cf14192100b2d7d19541a4c4667e8c92ddcd89 Mon Sep 17 00:00:00 2001 From: Rob Beagrie Date: Wed, 23 Jul 2014 16:11:16 +0100 Subject: [PATCH 11/12] Changes from PR review comments --- .../bedtools-python/bedtools-capstone.md | 88 +++++++++--------- .../bedtools-python/img/git-fork-ui.png | Bin 0 -> 9847 bytes 2 files changed, 45 insertions(+), 43 deletions(-) create mode 100644 novice/capstones/bedtools-python/img/git-fork-ui.png diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 09d3bdf38..665462981 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -7,8 +7,9 @@ title: Capstone example Python and Bedtools Synopsis ======== -Our goal is to work through examples that demonstrate how to -explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package. +Our goal is to work through examples that demonstrate how to explore, process and manipulate bed files with the `bedtools` software package. + +Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. If you are working with published data from other groups (for example, transcription factor binding sites identified by ChIP-seq) you will likely find that the data is provided in bed format. One of the most commonly asked questions of bioinformaticians is whether two different sets of features overlap with each other, so today we'll be showing you how to answer that question. This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/). @@ -17,11 +18,11 @@ During this lesson, you'll be working on some scripts and data which are in thei Forking the github repository ============================= -The data and code that you need to work on are located at [](git@github.com:rbeagrie/bedtools-example.git). You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface: +The data and code that you need to work on are located at . You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface: The Fork Button -If you visit the [bedtools-example repository](git@github.com:rbeagrie/bedtools-example.git) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example. +If you visit the [bedtools-example repository](http://github.com/rbeagrie/bedtools-example) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example. Setup ===== @@ -45,28 +46,24 @@ $ cd bedtools-example Let's take a look at what files we now have. ~~~ -$ ls -1 +$ ls ~~~ {:class="in"} ~~~ -data -scripts +data scripts ~~~ {:class="out"} So we have two directories, one containing data and the other containing some scripts. We're going to start with the data folder, so lets look at what's in there. ~~~ -$ ls -1 data +$ ls data ~~~ {:class="in"} ~~~ -cpg.bed -exons.bed -genome.txt -gwas.bed +cpg.bed exons.bed genome.txt gwas.bed ~~~ {:class="out"} @@ -83,8 +80,7 @@ These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http Introducing BedTools ==================== -We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`: - +All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`: ~~~ head data/cpg.bed ~~~ @@ -277,7 +273,7 @@ chr1 762416 763445 CpG:_115 chr1 762970 763155 NR_047523_exon_0_0_c Do exons and CpG islands overlap significantly? =============================================== -OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do the overlap more than we would expect by random chance? +OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do they overlap more than we would expect by random chance? Going back to base pairs overlap -------------------------------- @@ -307,7 +303,7 @@ Can you spot anything odd about this output? Lines 3-5 indicate that one of our bedtools "merge" ==================== -Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you. +Many datasets of genomic features have many individual features that overlap one another (e.g. alignments from a ChiP seq experiment). It is often useful to just combine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you. ![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png) @@ -551,20 +547,23 @@ Run the script, and then check to make sure you have some shiny new files in the ~~~ $ scripts/do_suffle.sh -$ ls -1 shuffled_cpg | head +$ ls shuffled_cpg ~~~ {:class="in"} ~~~ -cpg.shuffled100.bed -cpg.shuffled10.bed -cpg.shuffled11.bed -cpg.shuffled12.bed -cpg.shuffled13.bed -cpg.shuffled14.bed -cpg.shuffled15.bed -cpg.shuffled16.bed -cpg.shuffled17.bed -cpg.shuffled18.bed +cpg.shuffled100.bed cpg.shuffled21.bed cpg.shuffled33.bed cpg.shuffled45.bed cpg.shuffled57.bed cpg.shuffled69.bed cpg.shuffled80.bed cpg.shuffled92.bed +cpg.shuffled10.bed cpg.shuffled22.bed cpg.shuffled34.bed cpg.shuffled46.bed cpg.shuffled58.bed cpg.shuffled6.bed cpg.shuffled81.bed cpg.shuffled93.bed +cpg.shuffled11.bed cpg.shuffled23.bed cpg.shuffled35.bed cpg.shuffled47.bed cpg.shuffled59.bed cpg.shuffled70.bed cpg.shuffled82.bed cpg.shuffled94.bed +cpg.shuffled12.bed cpg.shuffled24.bed cpg.shuffled36.bed cpg.shuffled48.bed cpg.shuffled5.bed cpg.shuffled71.bed cpg.shuffled83.bed cpg.shuffled95.bed +cpg.shuffled13.bed cpg.shuffled25.bed cpg.shuffled37.bed cpg.shuffled49.bed cpg.shuffled60.bed cpg.shuffled72.bed cpg.shuffled84.bed cpg.shuffled96.bed +cpg.shuffled14.bed cpg.shuffled26.bed cpg.shuffled38.bed cpg.shuffled4.bed cpg.shuffled61.bed cpg.shuffled73.bed cpg.shuffled85.bed cpg.shuffled97.bed +cpg.shuffled15.bed cpg.shuffled27.bed cpg.shuffled39.bed cpg.shuffled50.bed cpg.shuffled62.bed cpg.shuffled74.bed cpg.shuffled86.bed cpg.shuffled98.bed +cpg.shuffled16.bed cpg.shuffled28.bed cpg.shuffled3.bed cpg.shuffled51.bed cpg.shuffled63.bed cpg.shuffled75.bed cpg.shuffled87.bed cpg.shuffled99.bed +cpg.shuffled17.bed cpg.shuffled29.bed cpg.shuffled40.bed cpg.shuffled52.bed cpg.shuffled64.bed cpg.shuffled76.bed cpg.shuffled88.bed cpg.shuffled9.bed +cpg.shuffled18.bed cpg.shuffled2.bed cpg.shuffled41.bed cpg.shuffled53.bed cpg.shuffled65.bed cpg.shuffled77.bed cpg.shuffled89.bed +cpg.shuffled19.bed cpg.shuffled30.bed cpg.shuffled42.bed cpg.shuffled54.bed cpg.shuffled66.bed cpg.shuffled78.bed cpg.shuffled8.bed +cpg.shuffled1.bed cpg.shuffled31.bed cpg.shuffled43.bed cpg.shuffled55.bed cpg.shuffled67.bed cpg.shuffled79.bed cpg.shuffled90.bed +cpg.shuffled20.bed cpg.shuffled32.bed cpg.shuffled44.bed cpg.shuffled56.bed cpg.shuffled68.bed cpg.shuffled7.bed cpg.shuffled91.bed ~~~ {:class="out"} @@ -606,9 +605,9 @@ Save that to a file: ~~~ $ for filename in shuffled_cpg/*; -do - bedtools intersect -a $filename -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum -done > shuffled_results.txt +> do +> bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum +> done > shuffled_results.txt ~~~ {:class="in"} @@ -616,10 +615,10 @@ done > shuffled_results.txt Plotting our results with python -------------------------------- -Let's copy the readings.py script and alter it to read in our results files: +Also in your `scripts` directory, there should be a file called results.py. Open it and have a look. ~~~ -cat results.py +$ cat scripts/results.py ~~~ {:class="in"} ~~~ @@ -679,7 +678,7 @@ Shuffled data: What happens if we call the script with no arguments? ~~~ -python results.py +$ python scripts/results.py ~~~ {:class="in"} ~~~ @@ -692,7 +691,7 @@ IndexError: list index out of range ~~~ {:class="err"} -Not very helpful. Let's add a usage description: +Not very helpful. Let's add a usage description. Modify the python file so that it looks like this: ~~~ import sys @@ -703,7 +702,7 @@ Results.py plots the real overlap between two bed files versus the overlap when one of the results files is randomly shuffled. -Usage: python random.py shuffled_results.txt real_results.txt +Usage: python results.py shuffled_results.txt real_results.txt """ def main(): @@ -712,14 +711,17 @@ def main(): sys.exit(usage_string) ~~~ -Now we can plot a histogram of the shuffled data: +Let's try plotting a histogram of the shuffled data, first you'll need to add `from matplotlib import pyplot as plt` at the top of the file, then add this at the bottom of the process function. ~~~ -plt.hist(shuffled_data, color='black') -plt.show() + print 'Shuffled data:' + print shuffled_data + + plt.hist(shuffled_data, color='black') + plt.savefig('cpg_overlap_exons.png') ~~~ -And we can plot the real result as a vertical red line: +Have a look at the image you just made. At the moment we're only plotting the shuffled results, so let's add the real overlap as a vertical red line. Update your results.py file so that it looks like this: ~~~ import sys @@ -731,7 +733,7 @@ Results.py plots the real overlap between two bed files versus the overlap when one of the results files is randomly shuffled. -Usage: python random.py shuffled_results.txt real_results.txt +Usage: python results.py shuffled_results.txt real_results.txt """ def main(): @@ -758,7 +760,7 @@ def process(shuffled_results_file, real_results_file): plt.hist(shuffled_data, color='black') plt.axvline(real_data, color='red', ls='--', lw=2) - plt.show() + plt.savefig('cpg_overlap_exons.png') main() ~~~ @@ -775,7 +777,7 @@ Results.py plots the real overlap between two bed files versus the overlap when one of the results files is randomly shuffled. -Usage: python random.py shuffled_results.txt real_results.txt +Usage: python results.py shuffled_results.txt real_results.txt """ def main(): @@ -805,7 +807,7 @@ def process(shuffled_results_file, real_results_file): plt.xlabel('Nucleotides overlap') plt.ylabel('Number of shuffled results') plt.legend() - plt.show() + plt.savefig('cpg_overlap_exons.png') main() ~~~ diff --git a/novice/capstones/bedtools-python/img/git-fork-ui.png b/novice/capstones/bedtools-python/img/git-fork-ui.png new file mode 100644 index 0000000000000000000000000000000000000000..fe90450479b2f3f8fbc08d9db63ecfe80d651d0c GIT binary patch literal 9847 zcmd6NRa9NwvgXE};O_1a+&1ok;LgT1I2(6&CrEH7xVyW1u;A_zAh4_Oo*^5cDLEHVWEAl6xliz~^B zi<2oi*_&J0m;nGXD3vL0TB@T2Lx-8SQb@t5dG{NTGSzwTbE0Zu81T%YB4{;~QH(7( z5)vYySxNdpYls2ZaAPBMn`K;6bfv|?K$K)J$kdj5A-DJTBnxxj#{=Nz57+EN7=OJ< z4EQ-Cd;t5yN*|lBqNltK5%E2AJOr!~fS9g!SsOBLW`^u4ZvPbsh=8-thycG9ocVnk zaca@F0pu!~@+Kq=PKi8VsUR&w8gT>YU@?|J$;@bA!MkR?q-N^V;B1K>Y7`USL>O^d9-T^m5{~>^79h5arp}K`k%*Go?*>LaoX~I4T~^ z8?)l|X4;_K_dV0vL~6lqoITmS!3}!tKxd7OwVZd0qy!eRJL@V-GG;zTW`65 zI&GnCq={YjK?`nNTvM-5W(PTL4&wk1iuVy1oFYk2CWqf+A`#g+Exe}WQ&ZPscfLXv zLaY>?0epJ)IWRa?$Vo@_PJPsWyt9{fU=$b_MZ7n_U=sxNCm!(pE7c{oX_(x>sGcO{6qjSKX}QzAp^>U02WpCl6{$v-jB!vi*7V5G;9J@bS1*CP4loGUD4{8meS|HRG8ZTmO0A?fd`Xh*(ZMW#hm~~i3$s)D1=@7vy%Q#G&Ldy1Zs&&aqNOI zWv-tzY64|&E`(gc=@LF-CbMds#I~WXQqKjHvo_8&p2VJvZ5TdS>wyd=oZJqL1L92P>&;MFMW0nosDU(Alsj;y;v2iFEJY1rW|($)Th=L^gW z^9!DDD4F;SB`>-dJdqfQk-#=vUY0`=n_L?C7zRo>%K*GFgCk{a;Ic%Aytlmf7m9=i z+1NtrLaaGh6_{h*V+O0V+XyCUZW^zI-BF5b?rU@>xJQUb<`A*jk`~n+x|Q#?-}%R= z$5h9p>C3*iVyQ+_HtLtjl}VnY{9?2UY{+g1yLCGDK6W}zn&T;=HG^k=V%KM9`r*Q^ z&Te4}YgTMlTBlJ*UWZ?&Yj$c@JsLZ{y?49sy~pbm;uN;WG8&_#tVUdho2;-d0hI5Q zCr%z7BOKw_SKa@z7dLu8sz^UVzsw*(zsI0PpQ(wLpJ(~RNyD7gl;s-e8h59y44M^B zC~uLhmHm^l14pe+?M7XtQmaxw2P&E#b2cO0=iJX4s~l5Qn_(DY_@bs!dZyW0!CuNx zeNr)|!K?wyds0Ht;Lt!))va`usnoR90g10Gf^xewUlX4SVHNs8l37tt#&~NcP1v`PBKu^+Jl-q*8Jg3bkafEdEY$(?#1jrS`~tjV9rxheMb>zaK^M!s@DIb^wOInUzIqRlD8sQ@kZTLqierf1vBRps-*bM94_6`w}88r zV-Jah`BntX1hO2S##bizCR;Z=yc2pUF7o@Mr1Tm|G zIO;FX4Q_L&KbrYSRXlATY~GJNi&VqZV3tyyYrzQy8wD*zUM_D*cb|u6`04O-^f(F> zO34I}@|^M?Wfx`A{LX?e1!2Za#*0I1;p!0z5w=nNgOY;_Lkg^awcEUoEjvKo$}gQ} zu;WgtAl<4uL7&G*>^Vj)#;-Jn82Okdm|ZnuT3i*!66VG(J6NkxonfzuyHoVL@(!Su zqP1G8)WW&KKc=aBpixXg9(_K0_CdxX_I;N!zmtGppRCBOr02}%3+6L8uJ94?Egqa+ zIDPNsVv<`K;h<>(DSDNPRSzyE=#6q~xRJ(mQWjE}^~p8*AaB4ceYu_ET+d;hc0p1%dx0~k0^HO3fqADpmNulKSt9njx zHj7<=&b3aljz^`qZLKYo1>}@-qd;?xkhSmK2wx>w{SbS~zFVY2$=&GA_-*hQRdm?Y z!s$?EDY8mB)tF`dSbuKajqmDufw4=I<*)m#UdM^k-!-lhYl|w5s^*QF?%6r(C9fx* z_RX*NEBB3S&Mhm8m&=<%HI`d_%+0!4&7JlJt^FO~-HV-esC$abb<${3qPFQ);Wx6^ z-qSF?PZlJ$?$I84FA@i`kE)F&T-}7P8-k|7&c4fDE%(|F_0p7Q38{XjuYb%MuK4yn z9@7rA)?5BQv?6Rqyy1Q^i0;Db$o#?dqzPE^HOc%*OB%1=?ahTCPO#y zdg>0e%B!Cc3!Q80|8uwfN-gAB5;5O znd*=K1;Ag)4M4F5VPisBDx;=XmuQ2g8H^SFD^e@zdS#_KoM!+Z;>yLr)wS5g_0F1k z$AtuP=8x}&LDV2Y_)?rFfUOM$C}1`a@J{y*$xZf$BPxM7;A1(PX{Dy+tfe5&Z(?uD zVq|J>Y{uel>+rF<1poxy`9DNkGiM_*cUv1fFuyyH;$H~<5BVQ8D+SrVAkNl63M~aC zGI4t+Gcqog&n)Z|LWpE!WP(nn=KLxWQvXf=_yST`Iy*b?v$DFmxv{u$u-H3Uuzu#_ z<6~uGXJuz+{y;E;J?xx~+?nmbl>g=AfBTUz1DiNmIXGL{+mZd_*T~r3#TiIJ@lT+C zKmV1dnY-10BH4leyR44_S^u%Her91~{kQKAs^C9bekCh+GaD@lD_b)=@J9?G4o)t? zf8qZx%YP#N52p5inB4yd@qZZpn<&WoPyYX-p8smrzqB9S5<(PY{debu5KB&BkpTew zWmySPHFt<}5Q4Y7=4y_>GZiL+_8#>>cI2E&V=`T$3KnaTeVMwB_3X}|X3>%6m-C{# z#NP{ZzZn+EI2V;TDQYDo;1kIb<;`0;?!7KDe8^eyz^^_T&1BdIcMBg6559!?-uT1EMc8q1vmV4gEZjYa zB%jK>5c(2+3qd;o80t#K88F0k3b5tP}{d8?B z?Jg^h4Cmx((=B)(2Pr9KHdLZrG9FiCq9~G(=AsY!II1uKfP;zFB3t&ID<;cWY=RP_ z#~2clN7f~Ef*msFRu_kp1c_##!{1-%aELU5$JCdLQYU!y>|Ge3;#_)>_-Gt}5!F=j zSjAlwu9!*}KP`(Uq7VQ|@sV{ym_9Qps%v>t~g zpNB|F9+KX?lJ5w3Q?Zy0*Nx^F97wA?6z+SNlGeo*!)z9K7Ib~=bX`KE5Fq-nsllA z$uN3vs_v2oEZ>;slku|G)DFvixU4073(C#IQ=W$~u`!p6)K@$;^+`eb zSWv@38pG{*ZNGqZymJ15mXCgbe%_*I`QT8#aKksFwXDbyf;=5Lj9q{DHKroI>0fe3 zlJBcKcteG&uBJza=jDkcHN;_CtsVWRDXgzn(@@!`g!n+ihn{=J@<$?&($xTy^7FSZ zH-q1ApPOgh{ijG&6^Hg2DNHcUsaQ|=nBU~x_B`T{GdnPLk7Nkwn*EZYkyPxfGiVjFcwfeUoGSK?OsclW z(ohizcQD@6qNmdjH;Hn@RQ@FQUJ=`FSu30bwK?NAKMBepMgqC-e%A&K@5b&CGRq(h zDB;p%J{Hoq6YC?E8$y1T^? zSG$FN$@B9ullW2kosY1?*w}hJyhupG-k|MU-hFq9a6sx-exYwxTi5DwQu(8x^9UDv zi1k3Qj*{LOlBvVZXx9Av@NHal^qKJ4^8lwW%J;SwWMmjjG?lE+#2a<4=Wm#wekFT1 z2%YDR!#EwITXR0fk9AC!#59xYgTD#)HeFk=8LP}4y5$f?CH!+f7nibgYp9tKbcY&nRaf@Fg|)S;2BXV zR;^Fjv00b2@XcuovBKC9wXC@qz6zXcdWxdISdTvu87{y#hHFCkEJeu5fld z@0j$n)A(AP50m+#%^c|6)_6kFFjl{%+=5aHWtKxDp4j2Cl z#L3oM)82zPR+@(xxx}p(;1Q5*W{+QK79Iv;`a*vj*V(pn+lSP`f?wT2QR4*KAKa_@ zoGweOgsF5VqWAjtA{a~j1X{%IRy;PRcy0RMgYv|MeYnzA|Mo{ebj-M-OxEMtJQ*KX z3B|l5xM8tfQ@}DMHXPa09uSn@rZqaNgi(T@VpmG#E?eS7PWa+&l|2$xcb zx-T&|7$y|Ina7xGP^eycRB33jl`j#VS*UO2j#NUcVb~VNF6RY-}=ADV;{?x9peR;!mC1{Z60-nw* z?&`2onESm&2*S~l7vRW z??sr6C;4<~3DnuiuN7;FYkto8Zr-+3TUYsS8h6yCd;RM{>Fen;z-9 z8&SY}in7&4`_k=Ho9CV0^&fDUhiGHq^yLT>lXU}OKpgK+DcDg0rO4Rp>?ZKRa$o1+ zry|%cAoaxXpP0n&cm6}MY(|cre zBKDP>J4)60!OTw}BqRjd|4aaCkg-1Naje5etO@N&7`5c7>`S8YAa_cz;c`x33PN5m?| zRCO{NdydZKKn0`a{cL86&*9`LZg$N#WU4>{Vp^M*i=3y^+8jkK)6r?>I@Z_OFZa?s zmCk3w`AKI#xe!-H`U5< z)l~awZ5kTi-+B*W>Gsr6tpEtJ%eEY0%nCE5-+Y_ewV4kytca*yoF>++XqA&xUF{u@ z3f9MWa`_xCVtdim6|@mLJ)=kE9`>uxTp>lKbXe=mkuQ+?P6&WJRfMDyJy=%|hP!DZ zBHP7Z#H@rfTF!h(%n_UdKIY?SQ`IsA-^xS^;IPB~we5I03RU#jxYOh6pO;k#mLsb4 z1gE6K5o@?_Hq$8Xo`4{u7o9G`E|-;%ka|_Vh)qVL#>tqIQ_ag4-w5b*TmAB}&yPc4 zOZWD1n@tOM(fby^jK}7&1Nh5X8RiI+3EtKxepZ>zo~m>4S8Z;;XJvabDd`&m?1KPKtQMvctn zEJ){mi|u_rwN z@d?jmh<4OMI|iUt9#ntH!H4d})&PDQdJeAsre)9WO?mpyKiJkZhKt2wjs8|3qcNZ; zX%L!eC2z6O_+zJsn^Xo#F=YdcNI*h1(?x<(&vMO+sIuI-lftLpesug=H()tPI^9by zfZ9gsH6nE~Qds@NL>F01j;+4Xe^QVlSmUEr!^vYc9eDnyIDJ<4Qpz}z}k5wewIJsCZUO^5IvvQWQ1U|Ll;bABU82Iw< zr-;xsRq7qcjA!uZbY$#+sFi_lgm0JkP+39#)B?Vy(z588QBg>X3l@W&>rD#di{kUz zn>9XI1fK5t6;>v4d9u36j9kZ-R`-R_Im(veE z7D-dYf{v&ve(Pn}S%rdT{^C@T@K^{AY^0XpyI9uHXgrRKQy1xndG3-kz7}gb1k|~{ zHnLsSZpZcb+;{>mu{^cE>JP!3cbaSmNVs-_u%DaA_ATomD5^Cl?1b8qa#0xpHR&pCZqc(dLf*C4 zrfRt*p?Z*_avd#2rb|Ml9mHr5MI~*npqmnwI&}f+^5wo#m&ACeF)?~9U+2g;tghxP z1J`pq0RJp4I?hv*?&rMhtI_E4)nv?NU!2dZ`qxTNiN{%{QQbB`Pt}?X&tyWx&`uX? z_mMhD!4P~a$`AHL-JugR54b*!YNyN`;Za7|bp-#Z8NKD3RAN~jh&N{3wgAJ6$u z&4SvvxP&{C9KOGo-Fs@q58nQi>FbR&Xmp<*KCcLK4SsJkez9#LgP$Vjh9kYJfsy~E z2Hp5@aD%#=YWLLfPP27O#8K^Nh(MwYnE{VtNFk;KCh|k<8JwQ3c3kzPl$LzfMvfm6 z!8;(hPYf2f@wHT{pZ5{_@Iw@ zypk&J2Y-uXxe?Dk3Tzw!m+!BhwmQaF-G(>kok(x(ByDTd=>1WmeAz_ERwXFol$u=DB6!~D92AU=&~`E z<`JhxIR;^EPn{^T?X{%-D@jz;di~`HX$h_Ga^K(@?}%_gZ~MoYEPc&NiFs8+XY*5Y zR=mT-TAp}VR}LC6-M(V9)?gV(FAoM&trDXRR>HVOC&Q=dp?sqn!m_){z)xtt!gl87 z5Icw~lPh@{qV{P&Zw5G-5bR30RLe4hbYLGMz@+DI(yT47g$+`VzK=H>p3c$6FJ@ku zaxwcdvJ1|OKP=12%+%}jbom9vaCdfeH38ivU!_(mFIy6Ow%&fZ)eAlXBYAIFlb~ZS zRDy?_Q1)PfZ-J*a(&t6VU#M~1pLu#vQh$x>QkqfG1jT9^nG(bZvhu7_+?T5LexV2< ziKzWKGcelOVG8>&#F&(ZQlWQFJu5sld3c%srmU?C4QGN-0E;k`zl7a8Dp&RE=f~Xu z>zU(6D453yw^->jdh(v_sVsq_=VzFauR1z9l{$@R=XG~z-Uu07f7cavv@yMLU(9dA zT%b6u4)5eMIA5!!4``qcrrrzbtul!!)QXi#zlWB@W%r4WQ~g0fR&r#6Vd|Gj{uZ+W zoq9>s{~587(erO4+4fG0;Wp?=J6q`OCUE=XyUInT_2cWL>DSJ)p z?X5Fg?jO-G01HR-BQe!#@fE1iG?x$wOf)7_b^R_jq!9k=LJI~af5koI_ddW1#6>=S zOf{K1kgX?2^5@*o`Lb7K-1|y>ueVx=MQU_FDH6DVky9=?9Z>_};lkQyt>38`FA{M! z5ym6w%L{^yY({W-ce2Fi7HEbR4LlQ_dcNYR&fQN9T#uSGijbkU69hJK@>K`Xdxq0yufS~6k8to*~8&i_JC(jFXPU$*xeqTTrBtcW^I8$49sUz zd|FVC4!9-3)%kG~`MXJ>cR6%xkM;@*+?7+S#J~*tZW4w>E^(F-<=nJH%H+R42#cDi6-rc*CP+_N9M$=* ziB8=T<18h?X5(bof&^bHMj?43#vMZQOGyNGK_vdVB?4g%GT#xGX6;56Nec$P7V>&V z@>?=-3^6%m2o(;1K8PyI`ZGv`u+9Bw9QLIy2wq#tWVk>aYFHt_jp_^;brgLb#yfXB zQuRlLis(;2@%ec$gsG^z=J!#X951wRc%&n-miQkSZ8gXg&ZOFpdk~@ktr;V{`1(7!^%s)(n?E)M^gboI;?5<`I%WKn%J} zHlv4L`33UXF8# Date: Wed, 23 Jul 2014 17:15:59 +0100 Subject: [PATCH 12/12] Last work for me from the #mozsprint - more fixes from PR comments --- .../bedtools-python/bedtools-capstone.md | 68 +++++++++++++++---- 1 file changed, 53 insertions(+), 15 deletions(-) diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md index 665462981..3d53a645a 100644 --- a/novice/capstones/bedtools-python/bedtools-capstone.md +++ b/novice/capstones/bedtools-python/bedtools-capstone.md @@ -81,6 +81,7 @@ Introducing BedTools ==================== All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`: + ~~~ head data/cpg.bed ~~~ @@ -99,18 +100,21 @@ chr1 788863 789211 CpG:_28 ~~~ {:class="out"} +The fourth column here is the name of each CpG island - which in this case is just a numerical ID assigned by the folks at UCSC. + The bedtools help ----------------- -To bring up the help, just type +You should have installed bedtools as part of the setup for this bootcamp. In which case, you can bring up the help by typing: ~~~ $ bedtools ~~~ {:class="in"} -As you can see, there are multiple "subcommands" and for bedtools to -work you must tell it which subcommand you want to use. Examples: +If that didn't work, you probably don't have bedtools installed properly. Have a look at for some help. + +If it did work, you should see that there are multiple "subcommands". For bedtools to work you must tell it which subcommand you want to use. Examples: ~~~ $ bedtools intersect @@ -130,10 +134,10 @@ The `intersect` command is the workhorse of the `bedtools` suite. It compares tw Default behavior ---------------- -By default, `intersect` reports the intervals that represent overlaps between your two files. To demonstrate, let's identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons. +By default, `intersect` reports the intervals that represent overlaps between your two files. This is represented by the top line of the image above (the one which reads "A intersect B"). As an example, you may know that CpG islands often mark transcription start sites. Imagine that we want to find all the exons in the genome which might contain a promotor. One way of doing this would be to identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons. ~~~ -$ bedtools intersect -a cpg.bed -b exons.bed | head -5 +$ bedtools intersect -a data/cpg.bed -b data/exons.bed | head ~~~ {:class="in"} ~~~ @@ -142,28 +146,38 @@ chr1 135124 135563 CpG:_30 chr1 327790 328229 CpG:_29 chr1 327790 328229 CpG:_29 chr1 327790 328229 CpG:_29 +chr1 713984 714068 CpG:_60 +chr1 762970 763155 CpG:_115 +chr1 763177 763229 CpG:_115 +chr1 762970 763155 CpG:_115 +chr1 762970 763155 CpG:_115 ~~~ {:class="out"} -Notice how the new file we create lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get the exon name instead: +Notice how the output lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get all the extra columns from the exons.bed file, including the name (fourth column). The fifth column is a score (not used in this context) and the sixth is the strand. ~~~ $ bedtools intersect -a data/exons.bed -b data/cpg.bed | head -5 ~~~ {:class="in"} ~~~ -chr1 50489434 50489626 NM_032785_exon_13_0_chr1_50489435_r 0 - -chr1 16767166 16767348 NM_018090_exon_0_0_chr1_16767167_f 0 + -chr1 33546713 33546895 NM_052998_exon_0_0_chr1_33546714_f 0 + -chr1 33546988 33547109 NM_052998_exon_1_0_chr1_33546989_f 0 + -chr1 33547201 33547243 NM_052998_exon_2_0_chr1_33547202_f 0 + +chr1 50489434 50489626 NM_032785_exon_13_0_chr1_50489435_r 0 - +chr1 16767166 16767348 NM_018090_exon_0_0_chr1_16767167_f 0 + +chr1 33546713 33546895 NM_052998_exon_0_0_chr1_33546714_f 0 + +chr1 33546988 33547109 NM_052998_exon_1_0_chr1_33546989_f 0 + +chr1 33547201 33547243 NM_052998_exon_2_0_chr1_33547202_f 0 + +chr1 16767166 16767270 NM_001145278_exon_0_0_chr1_16767167_f 0 + +chr1 16767166 16767348 NM_001145277_exon_0_0_chr1_16767167_f 0 + +chr1 8384389 8384719 NM_001080397_exon_0_0_chr1_8384390_f 0 + +chr1 8385895 8386102 NM_001080397_exon_2_0_chr1_8385878_f 0 + +chr1 8390268 8390800 NM_001080397_exon_3_0_chr1_8390269_f 0 + ~~~ {:class="out"} Reporting the original feature in each file. -------------------------------------------- -The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected. +The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected. This is represented by the second line of the image above (the one which reads "A intersect B (-wa)"). A ~~~ $ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5 @@ -178,12 +192,14 @@ chr1 327790 328229 CpG:_29 chr1 324438 328581 NR_028325_exon_2_0_chr1_ ~~~ {:class="out"} +The first three columns of the output show us where there are overlaps between exons and cpg islands. The fourth column contains the name of the CpG island which is overlapping. The rest of the output is the full line from the exons.bed file. + Counting the number of overlapping features. -------------------------------------------- We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option. ~~~ -$ bedtools intersect -a cpg.bed -b exons.bed -c | head +$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c | head ~~~ {:class="in"} ~~~ @@ -200,7 +216,27 @@ chr1 788863 789211 CpG:_28 9 ~~~ {:class="out"} -Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`. +Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`. If we reverse the order of the two files again, we get a different output: + +~~~ +$ bedtools intersect -a data/exons.bed -b data/cpg.bed -c | head +~~~ +{:class="in"} +~~~ +chr1 66999824 67000051 NM_032291_exon_0_0_chr1_66999825_f 0 + 0 +chr1 67091529 67091593 NM_032291_exon_1_0_chr1_67091530_f 0 + 0 +chr1 67098752 67098777 NM_032291_exon_2_0_chr1_67098753_f 0 + 0 +chr1 67101626 67101698 NM_032291_exon_3_0_chr1_67101627_f 0 + 0 +chr1 67105459 67105516 NM_032291_exon_4_0_chr1_67105460_f 0 + 0 +chr1 67108492 67108547 NM_032291_exon_5_0_chr1_67108493_f 0 + 0 +chr1 67109226 67109402 NM_032291_exon_6_0_chr1_67109227_f 0 + 0 +chr1 67126195 67126207 NM_032291_exon_7_0_chr1_67126196_f 0 + 0 +chr1 67133212 67133224 NM_032291_exon_8_0_chr1_67133213_f 0 + 0 +chr1 67136677 67136702 NM_032291_exon_9_0_chr1_67136678_f 0 + 0 +~~~ +{:class="out"} + +Now we get one line for every exon. In this case, the first ten exons actually don't overlap with any CpG islands. Require a minimal fraction of overlap. -------------------------------------------- @@ -226,9 +262,11 @@ chr1 788863 789211 CpG:_28 7 ~~~ {:class="out"} +If you compare this to the result you see without the -f flag you will see that the very first CpG island is now reported as having 0 overlaps. This means it must have overlapped an exon by less than 50%. + Find features that DO NOT overlap -------------------------------------------- -Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. +Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. This is represented in the last line of the image above. ~~~ $ bedtools intersect -a cpg.bed -b exons.bed -v | head