From 334110149cd9cdb73d769a8452f1ad1c61c041d4 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 15 Jul 2014 15:46:46 +0100
Subject: [PATCH 01/12] Copied example verbatim from Aaron Quinlan

Original URL: https://raw.githubusercontent.com/arq5x/tutorials/master/bedtools.md
---
 novice/extras/bedtools-capstone.md | 548 +++++++++++++++++++++++++++++
 1 file changed, 548 insertions(+)
 create mode 100644 novice/extras/bedtools-capstone.md

diff --git a/novice/extras/bedtools-capstone.md b/novice/extras/bedtools-capstone.md
new file mode 100644
index 000000000..1952e39fe
--- /dev/null
+++ b/novice/extras/bedtools-capstone.md
@@ -0,0 +1,548 @@
+% bedtools Tutorial
+% Aaron Quinlan
+% November 22, 2013
+
+
+Synopsis
+========
+
+Our goal is to work through examples that demonstrate how to 
+explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package.
+
+Some of our analysis will be based upon the Maurano et al exploration of DnaseI hypersensitivity sites in hundreds of primary tissue types.
+
+    Maurano et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012. Vol. 337 no. 6099 pp. 1190-1195.
+
+    www.sciencemag.org/content/337/6099/1190.short
+
+
+This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
+
+
+
+
+\
+
+
+Setup
+=====
+From the Terminal, create a new directory on your Desktop called "bedtools-demo".
+
+    cd ~/Desktop
+    mkdir bedtools-demo
+
+Navigate into that directory.
+
+    cd bedtools-demo
+
+Download the sample BED files I have provided.
+
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/maurano.dnaseI.tgz
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
+
+Now, we need to extract all of the 20 Dnase I hypersensitivity BED files from the "tarball" named
+`maurano.dnaseI.tgz`.
+
+    tar -zxvf maurano.dnaseI.tgz
+    rm maurano.dnaseI.tgz
+
+Let's take a look at what files we now have.
+
+    ls -1
+
+
+\
+
+
+What are these files?
+=========================
+Your directory should now contain 23 BED files and 1 genome file. Twenty of these files (those starting with "f" for "fetal tissue") reflect Dnase I hypersensitivity sites measured in twenty different fetal tissue samples from the brain, heart, intestine, kidney, lung, muscle, skin, and stomach.
+
+In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS).
+
+The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
+
+
+\
+
+
+The bedtools help
+==================
+To bring up the help, just type
+
+    bedtools
+
+As you can see, there are multiple "subcommands" and for bedtools to
+work you must tell it which subcommand you want to use. Examples:
+
+    bedtools intersect
+    bedtools merge
+    bedtools subtract
+
+
+bedtools "intersect"
+====================
+
+The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common).
+
+![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png)
+
+Default behavior
+----------------
+By default, `intersect` reports the intervals that represent overlaps between your two files.  To demonstrate, let's identify all of the CpG islands that overlap exons.
+
+    bedtools intersect -a cpg.bed -b exons.bed | head -5
+    chr1    29320   29370   CpG:_116
+    chr1    135124  135563  CpG:_30
+    chr1    327790  328229  CpG:_29
+    chr1    327790  328229  CpG:_29
+    chr1    327790  328229  CpG:_29
+
+Reporting the original feature in each file.
+--------------------------------------------
+The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected.
+
+    bedtools intersect -a cpg.bed -b exons.bed -wa -wb \
+    | head -5
+    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -
+    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +
+    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +
+
+How many base pairs of overlap were there?
+------------------------------------------
+The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files.
+
+    bedtools intersect -a cpg.bed -b exons.bed -wo \
+    | head -10
+    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
+    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+    chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
+    chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+
+Counting the number of overlapping features.
+--------------------------------------------
+We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
+
+    bedtools intersect -a cpg.bed -b exons.bed -c \
+    | head
+    chr1    28735   29810   CpG:_116    1
+    chr1    135124  135563  CpG:_30 1
+    chr1    327790  328229  CpG:_29 3
+    chr1    437151  438164  CpG:_84 0
+    chr1    449273  450544  CpG:_99 0
+    chr1    533219  534114  CpG:_94 0
+    chr1    544738  546649  CpG:_171    0
+    chr1    713984  714547  CpG:_60 1
+    chr1    762416  763445  CpG:_115    10
+    chr1    788863  789211  CpG:_28 9
+
+\
+
+
+Find features that DO NOT overlap
+--------------------------------------------
+Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case.
+
+    bedtools intersect -a cpg.bed -b exons.bed -v \
+    | head
+    chr1    437151  438164  CpG:_84
+    chr1    449273  450544  CpG:_99
+    chr1    533219  534114  CpG:_94
+    chr1    544738  546649  CpG:_171
+    chr1    801975  802338  CpG:_24
+    chr1    805198  805628  CpG:_50
+    chr1    839694  840619  CpG:_83
+    chr1    844299  845883  CpG:_153
+    chr1    912869  913153  CpG:_28
+    chr1    919726  919927  CpG:_15
+
+
+Require a minimal fraction of overlap.
+--------------------------------------------
+Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported.
+
+Let's be more strict and require 50% of overlap.
+
+    bedtools intersect -a cpg.bed -b exons.bed \
+    -wo -f 0.50 \
+    | head
+    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047525_exon_4_0_chr1_788771_f    0   +   348
+    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047524_exon_3_0_chr1_788771_f    0   +   348
+    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047523_exon_3_0_chr1_788771_f    0   +   348
+    chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047522_exon_5_0_chr1_788859_f    0   +   348
+    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047521_exon_4_0_chr1_788771_f    0   +   348
+    chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047520_exon_6_0_chr1_788859_f    0   +   348
+
+
+\
+
+
+bedtools "merge"
+====================
+Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you.
+
+![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png)
+
+
+Input must be sorted
+--------------------
+The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM.
+
+If you run the merge command on a file that is not sorted, you will get an error. For example:
+
+    bedtools merge -i exons.bed
+    chr1    66999824    67000051
+    chr1    67091529    67091593
+    chr1    67098752    67098777
+    chr1    67101626    67101698
+    chr1    67105459    67105516
+    chr1    67108492    67108547
+    chr1    67109226    67109402
+    chr1    67126195    67126207
+    chr1    67133212    67133224
+    chr1    67136677    67136702
+    chr1    67137626    67137678
+    chr1    67138963    67139049
+    chr1    67142686    67142779
+    chr1    67145360    67145435
+    chr1    67147551    67148052
+    chr1    67154830    67154958
+    chr1    67155872    67155999
+    chr1    67161116    67161176
+    chr1    67184976    67185088
+    chr1    67194946    67195102
+    chr1    67199430    67199563
+    chr1    67205017    67205220
+    chr1    67206340    67206405
+    chr1    67206954    67207119
+    ERROR: input file: (exons.bed) is not sorted by chrom then start.
+       The start coordinate at line 26 is less than the start at line 25
+
+To correct this, you need to sort your BED using the UNIX `sort` utility.
+
+    sort -k1,1 -k2,2n exons.bed > exons.sort.bed
+
+Let's try again.
+    
+    bedtools merge -i exons.sort.bed | head -10
+
+The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
+
+
+Count the number of overlapping intervals.
+------------------------------------------
+A more sophisticated approach would be to not only merge overlapping intervals, but also report the *number* of intervals that were integrated into the new, merged interval. One does this with the `-n` option.
+
+    bedtools merge -i exons.sort.bed -n | head -20
+    chr1    11873   12227   1
+    chr1    12612   12721   1
+    chr1    13220   14829   2
+    chr1    14969   15038   1
+    chr1    15795   15947   1
+    chr1    16606   16765   1
+    chr1    16857   17055   1
+    chr1    17232   17368   1
+    chr1    17605   17742   1
+    chr1    17914   18061   1
+    chr1    18267   18366   1
+    chr1    24737   24891   1
+    chr1    29320   29370   1
+    chr1    34610   35174   2
+    chr1    35276   35481   2
+    chr1    35720   36081   2
+    chr1    69090   70008   1
+    chr1    134772  139696  1
+    chr1    139789  139847  1
+    chr1    140074  140566  1
+
+Merging features that are close to one another.
+-----------------------------------------------
+With the `-d` (distance) option, one can also merge intervals that do not overlap, yet are close to one another. For example, to merge features that are no more than 1000bp apart, one would run:
+
+    bedtools merge -i exons.sort.bed -d 1000 -n | head -20
+    chr1    11873   18366   12
+    chr1    24737   24891   1
+    chr1    29320   29370   1
+    chr1    34610   36081   6
+    chr1    69090   70008   1
+    chr1    134772  140566  3
+    chr1    323891  328581  10
+    chr1    367658  368597  3
+    chr1    621095  622034  3
+    chr1    661138  665731  3
+    chr1    700244  700627  1
+    chr1    701708  701767  1
+    chr1    703927  705092  2
+    chr1    708355  708487  1
+    chr1    709550  709660  1
+    chr1    713663  714068  1
+    chr1    752750  755214  2
+    chr1    761585  763229  10
+    chr1    764382  764484  9
+    chr1    776579  778984  1
+
+
+\
+
+
+bedtools "complement"
+=====================
+We often want to know which intervals of the genome are **NOT** "covered" by intervals in a given feature file. For example, if you have a set of ChIP-seq peaks, you may also want to know which regions of the genome are not bound by the factor you assayed. The `complement` addresses this task.
+
+![](http://bedtools.readthedocs.org/en/latest/_images/complement-glyph.png)
+
+As an example, let's find all of the non-exonic (i.e., intronic or intergenic) regions of the genome.  Note, to do this you need a ["genome"](http://bedtools.readthedocs.org/en/latest/content/general-usage.html#genome-file-format) file, which tells `bedtools` the length of each chromosome in your file.  *Consider why the tool would need this information...*
+
+    bedtools complement -i exons.bed -g genome.txt \
+    > non-exonic.bed
+    head non-exonic.bed
+    chr1    0   11873
+    chr1    12227   12612
+    chr1    12721   13220
+    chr1    14829   14969
+    chr1    15038   15795
+    chr1    15947   16606
+    chr1    16765   16857
+    chr1    17055   17232
+    chr1    17368   17605
+    chr1    17742   17914
+
+
+\
+
+
+bedtools "genomecov"
+====================
+For many analyses, one wants to measure the genome wide coverage of a feature file. For example, we often want to know what fraction of the genome is covered by 1 feature, 2 features, 3 features, etc. This is frequently crucial when assessing the "uniformity" of coverage from whole-genome sequencing. This is done with the versatile `genomecov` tool.
+
+![](http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png)
+
+As an example, let's produce a histogram of coverage of the exons throughout the genome. Like the `merge` tool, `genomecov` requires pre-sorted data. It also needs a genome file as above.
+
+    bedtools genomecov -i exons.sort.bed -g genome.txt
+
+This should run for 3 minutes or so. At the end of your output, you should see something like:
+
+    genome  0   3062406951  3137161264  0.976171
+    genome  1   44120515    3137161264  0.0140638
+    genome  2   15076446    3137161264  0.00480576
+    genome  3   7294047 3137161264  0.00232505
+    genome  4   3650324 3137161264  0.00116358
+    genome  5   1926397 3137161264  0.000614057
+    genome  6   1182623 3137161264  0.000376972
+    genome  7   574102  3137161264  0.000183
+    genome  8   353352  3137161264  0.000112634
+    genome  9   152653  3137161264  4.86596e-05
+    genome  10  113362  3137161264  3.61352e-05
+    genome  11  57361   3137161264  1.82844e-05
+    genome  12  52000   3137161264  1.65755e-05
+    genome  13  55368   3137161264  1.76491e-05
+    genome  14  19218   3137161264  6.12592e-06
+    genome  15  19369   3137161264  6.17405e-06
+    genome  16  26651   3137161264  8.49526e-06
+    genome  17  9942    3137161264  3.16911e-06
+    genome  18  13442   3137161264  4.28477e-06
+    genome  19  1030    3137161264  3.28322e-07
+    genome  20  6329    3137161264  2.01743e-06
+    ...
+
+
+\
+
+
+Producing BEDGRAPH output
+--------------------------
+Using the `-bg` option, one can also produce BEDGRAPH output which represents the "depth" fo feature coverage for each base pair in the genome:
+
+    bedtools genomecov -i exons.sort.bed -g genome.txt -bg | head -20
+    chr1    11873   12227   1
+    chr1    12612   12721   1
+    chr1    13220   14361   1
+    chr1    14361   14409   2
+    chr1    14409   14829   1
+    chr1    14969   15038   1
+    chr1    15795   15947   1
+    chr1    16606   16765   1
+    chr1    16857   17055   1
+    chr1    17232   17368   1
+    chr1    17605   17742   1
+    chr1    17914   18061   1
+    chr1    18267   18366   1
+    chr1    24737   24891   1
+    chr1    29320   29370   1
+    chr1    34610   35174   2
+    chr1    35276   35481   2
+    chr1    35720   36081   2
+    chr1    69090   70008   1
+    chr1    134772  139696  1
+
+    
+\
+
+
+Sophistication through chaining multiple bedtools
+=================================================
+Analytical power in `bedtools` comes from the ability to "chain" together multiple tools in order to construct rather sophisicated analyses with very little programming - you just need **genome arithmetic**!  Have a look at the examples [here](http://bedtools.readthedocs.org/en/latest/content/advanced-usage.html).
+
+
+\
+
+
+Principal component analysis
+=============================
+
+We will use the bedtools implementation of a Jaccard statistic to meaure the similarity of two 
+datasets. Briefly, the Jaccard statistic measures the ratio of the number of *intersecting* base 
+pairs to the *total* number of base pairs in the two sets.  As such, the score ranges from 0.0 to 1.
+0; lower values reflect lower similarity, whereas higher values reflect higher similarity.
+
+Let's walk through an example: we would expect the Dnase hypersensivity sites to be rather similar 
+between two samples of the **same** fetal tissue type.  Let's test:
+
+    bedtools jaccard \
+        -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
+        -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
+    intersection    union   jaccard
+    81269248    160493950   0.50637
+
+But what about the similarity of two **different** tissue types?
+
+    bedtools jaccard \
+        -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
+        -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed
+    intersection    union   jaccard
+    28076951    164197278   0.170995
+
+Hopefully this demonstrates how the Jaccard statistic can be used as a simple statistic to reduce 
+the dimensionality of the comparison between two large (e.g., often containing thousands or 
+millions of intervals) feature sets.
+
+
+\
+
+
+A Jaccard statistic for all 400 pairwise comparisons.
+------------------------------------------------------
+
+
+We are going to take this a bit further and use the Jaccard statistic to measure the similarity of 
+all 20 tissue samples against all other 20 samples.  Once we have a 20x20 matrix of similarities, 
+we can use dimensionality reduction techniques such as hierarchical clustering or principal 
+component analysis to detect higher order similarities among **all** of the datasets.
+
+
+We will use GNU parallel to compute a Jaccard statistic for the 400 (20*20) pairwise comparisons 
+among the fetal tissue samples.
+
+But first, we need to install [GNU parallel](http://www.gnu.org/software/parallel/).
+
+    brew install parallel
+
+Next, we need to install a tiny script I wrote for this analysis.
+
+    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/make-matrix.py
+
+
+Now, we can use `parallel` to, you guessed it, compute the 400 pairwise Jaccard statistics in parallel using as many processors as you have available.
+
+    parallel "bedtools jaccard -a {1} -b {2} \
+             | awk 'NR>1' \
+             | cut -f 3 \
+             > {1}.{2}.jaccard" \
+             ::: `ls *.merge.bed` ::: `ls *.merge.bed`
+
+This command will create a single file containing the pairwise Jaccard measurements from all 400 tests.
+
+    find . \
+        | grep jaccard \
+        | xargs grep "" \
+        | sed -e s"/\.\///" \
+        | perl -pi -e "s/.bed./.bed\t/" \
+        | perl -pi -e "s/.jaccard:/\t/" \
+        > pairwise.dnase.txt
+
+A bit of cleanup to use more intelligible names for each of the samples.
+
+    cat pairwise.dnase.txt \
+    | sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
+    | sed -e 's/.hg19//g' \
+    > pairwise.dnase.shortnames.txt
+ 
+Now let's make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R.
+
+    awk 'NF==3' pairwise.dnase.shortnames.txt \
+    | awk '$1 ~ /^f/ && $2 ~ /^f/' \
+    | python make-matrix.py \
+    > dnase.shortnames.distance.matrix
+ 
+Let's also make a file of labels for each dataset so that we can label each dataset in our R plot.
+
+    cut -f 1 dnase.shortnames.distance.matrix | cut -f 1 -d "-" | cut -f 1 -d "_" > labels.txt
+ 
+Now start up R. (This assumes you have installed the `ggplot2` package).
+
+    R
+
+You should see something very similar to this:
+
+
+    R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
+    Copyright (C) 2012 The R Foundation for Statistical Computing
+    ISBN 3-900051-07-0
+    Platform: x86_64-apple-darwin12.0.0 (64-bit)
+    
+    R is free software and comes with ABSOLUTELY NO WARRANTY.
+    You are welcome to redistribute it under certain conditions.
+    Type 'license()' or 'licence()' for distribution details.
+    
+      Natural language support but running in an English locale
+    
+    R is a collaborative project with many contributors.
+    Type 'contributors()' for more information and
+    'citation()' on how to cite R or R packages in publications.
+    
+    Type 'demo()' for some demos, 'help()' for on-line help, or
+    'help.start()' for an HTML browser interface to help.
+    Type 'q()' to quit R.
+    
+    >
+
+No paste these commands into the R console:
+
+    library(ggplot2)
+    library(RColorBrewer)
+    blues <- colorRampPalette(c('dark blue', 'light blue'))
+    greens <- colorRampPalette(c('dark green', 'light green'))
+    reds <- colorRampPalette(c('pink', 'dark red'))
+     
+    setwd("~/Desktop/bedtools-demo")
+    x <- read.table('dnase.shortnames.distance.matrix')
+    labels <- read.table('labels.txt')
+    ngroups <- length(unique(labels))
+    pca <- princomp(x)
+    qplot(pca$scores[,1], pca$scores[,2], color=labels[,1],     geom="point", size=1) +
+      scale_color_manual(values = c(blues(4), greens(5), reds(5))) 
+
+You should see this:
+
+![](http://quinlanlab.cs.virginia.edu/cshl2013/pca.png)
+
+Et voila. Note that PCA was used in this case as an example of what PCA does for the CSHL Adv. Seq. course. Heatmaps are a more informative visualization in this case.
+
+
+

From 165b0029449ab642c7b45074b0f89f406d2530c4 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 15 Jul 2014 15:48:56 +0100
Subject: [PATCH 02/12] Removed the last part of Aaron's tutorial

To be replaced with permutation
---
 novice/extras/bedtools-capstone.md | 301 -----------------------------
 1 file changed, 301 deletions(-)

diff --git a/novice/extras/bedtools-capstone.md b/novice/extras/bedtools-capstone.md
index 1952e39fe..9a640e130 100644
--- a/novice/extras/bedtools-capstone.md
+++ b/novice/extras/bedtools-capstone.md
@@ -244,305 +244,4 @@ Let's try again.
 The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
 
 
-Count the number of overlapping intervals.
-------------------------------------------
-A more sophisticated approach would be to not only merge overlapping intervals, but also report the *number* of intervals that were integrated into the new, merged interval. One does this with the `-n` option.
-
-    bedtools merge -i exons.sort.bed -n | head -20
-    chr1    11873   12227   1
-    chr1    12612   12721   1
-    chr1    13220   14829   2
-    chr1    14969   15038   1
-    chr1    15795   15947   1
-    chr1    16606   16765   1
-    chr1    16857   17055   1
-    chr1    17232   17368   1
-    chr1    17605   17742   1
-    chr1    17914   18061   1
-    chr1    18267   18366   1
-    chr1    24737   24891   1
-    chr1    29320   29370   1
-    chr1    34610   35174   2
-    chr1    35276   35481   2
-    chr1    35720   36081   2
-    chr1    69090   70008   1
-    chr1    134772  139696  1
-    chr1    139789  139847  1
-    chr1    140074  140566  1
-
-Merging features that are close to one another.
------------------------------------------------
-With the `-d` (distance) option, one can also merge intervals that do not overlap, yet are close to one another. For example, to merge features that are no more than 1000bp apart, one would run:
-
-    bedtools merge -i exons.sort.bed -d 1000 -n | head -20
-    chr1    11873   18366   12
-    chr1    24737   24891   1
-    chr1    29320   29370   1
-    chr1    34610   36081   6
-    chr1    69090   70008   1
-    chr1    134772  140566  3
-    chr1    323891  328581  10
-    chr1    367658  368597  3
-    chr1    621095  622034  3
-    chr1    661138  665731  3
-    chr1    700244  700627  1
-    chr1    701708  701767  1
-    chr1    703927  705092  2
-    chr1    708355  708487  1
-    chr1    709550  709660  1
-    chr1    713663  714068  1
-    chr1    752750  755214  2
-    chr1    761585  763229  10
-    chr1    764382  764484  9
-    chr1    776579  778984  1
-
-
-\
-
-
-bedtools "complement"
-=====================
-We often want to know which intervals of the genome are **NOT** "covered" by intervals in a given feature file. For example, if you have a set of ChIP-seq peaks, you may also want to know which regions of the genome are not bound by the factor you assayed. The `complement` addresses this task.
-
-![](http://bedtools.readthedocs.org/en/latest/_images/complement-glyph.png)
-
-As an example, let's find all of the non-exonic (i.e., intronic or intergenic) regions of the genome.  Note, to do this you need a ["genome"](http://bedtools.readthedocs.org/en/latest/content/general-usage.html#genome-file-format) file, which tells `bedtools` the length of each chromosome in your file.  *Consider why the tool would need this information...*
-
-    bedtools complement -i exons.bed -g genome.txt \
-    > non-exonic.bed
-    head non-exonic.bed
-    chr1    0   11873
-    chr1    12227   12612
-    chr1    12721   13220
-    chr1    14829   14969
-    chr1    15038   15795
-    chr1    15947   16606
-    chr1    16765   16857
-    chr1    17055   17232
-    chr1    17368   17605
-    chr1    17742   17914
-
-
-\
-
-
-bedtools "genomecov"
-====================
-For many analyses, one wants to measure the genome wide coverage of a feature file. For example, we often want to know what fraction of the genome is covered by 1 feature, 2 features, 3 features, etc. This is frequently crucial when assessing the "uniformity" of coverage from whole-genome sequencing. This is done with the versatile `genomecov` tool.
-
-![](http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png)
-
-As an example, let's produce a histogram of coverage of the exons throughout the genome. Like the `merge` tool, `genomecov` requires pre-sorted data. It also needs a genome file as above.
-
-    bedtools genomecov -i exons.sort.bed -g genome.txt
-
-This should run for 3 minutes or so. At the end of your output, you should see something like:
-
-    genome  0   3062406951  3137161264  0.976171
-    genome  1   44120515    3137161264  0.0140638
-    genome  2   15076446    3137161264  0.00480576
-    genome  3   7294047 3137161264  0.00232505
-    genome  4   3650324 3137161264  0.00116358
-    genome  5   1926397 3137161264  0.000614057
-    genome  6   1182623 3137161264  0.000376972
-    genome  7   574102  3137161264  0.000183
-    genome  8   353352  3137161264  0.000112634
-    genome  9   152653  3137161264  4.86596e-05
-    genome  10  113362  3137161264  3.61352e-05
-    genome  11  57361   3137161264  1.82844e-05
-    genome  12  52000   3137161264  1.65755e-05
-    genome  13  55368   3137161264  1.76491e-05
-    genome  14  19218   3137161264  6.12592e-06
-    genome  15  19369   3137161264  6.17405e-06
-    genome  16  26651   3137161264  8.49526e-06
-    genome  17  9942    3137161264  3.16911e-06
-    genome  18  13442   3137161264  4.28477e-06
-    genome  19  1030    3137161264  3.28322e-07
-    genome  20  6329    3137161264  2.01743e-06
-    ...
-
-
-\
-
-
-Producing BEDGRAPH output
---------------------------
-Using the `-bg` option, one can also produce BEDGRAPH output which represents the "depth" fo feature coverage for each base pair in the genome:
-
-    bedtools genomecov -i exons.sort.bed -g genome.txt -bg | head -20
-    chr1    11873   12227   1
-    chr1    12612   12721   1
-    chr1    13220   14361   1
-    chr1    14361   14409   2
-    chr1    14409   14829   1
-    chr1    14969   15038   1
-    chr1    15795   15947   1
-    chr1    16606   16765   1
-    chr1    16857   17055   1
-    chr1    17232   17368   1
-    chr1    17605   17742   1
-    chr1    17914   18061   1
-    chr1    18267   18366   1
-    chr1    24737   24891   1
-    chr1    29320   29370   1
-    chr1    34610   35174   2
-    chr1    35276   35481   2
-    chr1    35720   36081   2
-    chr1    69090   70008   1
-    chr1    134772  139696  1
-
-    
-\
-
-
-Sophistication through chaining multiple bedtools
-=================================================
-Analytical power in `bedtools` comes from the ability to "chain" together multiple tools in order to construct rather sophisicated analyses with very little programming - you just need **genome arithmetic**!  Have a look at the examples [here](http://bedtools.readthedocs.org/en/latest/content/advanced-usage.html).
-
-
-\
-
-
-Principal component analysis
-=============================
-
-We will use the bedtools implementation of a Jaccard statistic to meaure the similarity of two 
-datasets. Briefly, the Jaccard statistic measures the ratio of the number of *intersecting* base 
-pairs to the *total* number of base pairs in the two sets.  As such, the score ranges from 0.0 to 1.
-0; lower values reflect lower similarity, whereas higher values reflect higher similarity.
-
-Let's walk through an example: we would expect the Dnase hypersensivity sites to be rather similar 
-between two samples of the **same** fetal tissue type.  Let's test:
-
-    bedtools jaccard \
-        -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
-        -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
-    intersection    union   jaccard
-    81269248    160493950   0.50637
-
-But what about the similarity of two **different** tissue types?
-
-    bedtools jaccard \
-        -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
-        -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed
-    intersection    union   jaccard
-    28076951    164197278   0.170995
-
-Hopefully this demonstrates how the Jaccard statistic can be used as a simple statistic to reduce 
-the dimensionality of the comparison between two large (e.g., often containing thousands or 
-millions of intervals) feature sets.
-
-
-\
-
-
-A Jaccard statistic for all 400 pairwise comparisons.
-------------------------------------------------------
-
-
-We are going to take this a bit further and use the Jaccard statistic to measure the similarity of 
-all 20 tissue samples against all other 20 samples.  Once we have a 20x20 matrix of similarities, 
-we can use dimensionality reduction techniques such as hierarchical clustering or principal 
-component analysis to detect higher order similarities among **all** of the datasets.
-
-
-We will use GNU parallel to compute a Jaccard statistic for the 400 (20*20) pairwise comparisons 
-among the fetal tissue samples.
-
-But first, we need to install [GNU parallel](http://www.gnu.org/software/parallel/).
-
-    brew install parallel
-
-Next, we need to install a tiny script I wrote for this analysis.
-
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/make-matrix.py
-
-
-Now, we can use `parallel` to, you guessed it, compute the 400 pairwise Jaccard statistics in parallel using as many processors as you have available.
-
-    parallel "bedtools jaccard -a {1} -b {2} \
-             | awk 'NR>1' \
-             | cut -f 3 \
-             > {1}.{2}.jaccard" \
-             ::: `ls *.merge.bed` ::: `ls *.merge.bed`
-
-This command will create a single file containing the pairwise Jaccard measurements from all 400 tests.
-
-    find . \
-        | grep jaccard \
-        | xargs grep "" \
-        | sed -e s"/\.\///" \
-        | perl -pi -e "s/.bed./.bed\t/" \
-        | perl -pi -e "s/.jaccard:/\t/" \
-        > pairwise.dnase.txt
-
-A bit of cleanup to use more intelligible names for each of the samples.
-
-    cat pairwise.dnase.txt \
-    | sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
-    | sed -e 's/.hg19//g' \
-    > pairwise.dnase.shortnames.txt
- 
-Now let's make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R.
-
-    awk 'NF==3' pairwise.dnase.shortnames.txt \
-    | awk '$1 ~ /^f/ && $2 ~ /^f/' \
-    | python make-matrix.py \
-    > dnase.shortnames.distance.matrix
- 
-Let's also make a file of labels for each dataset so that we can label each dataset in our R plot.
-
-    cut -f 1 dnase.shortnames.distance.matrix | cut -f 1 -d "-" | cut -f 1 -d "_" > labels.txt
- 
-Now start up R. (This assumes you have installed the `ggplot2` package).
-
-    R
-
-You should see something very similar to this:
-
-
-    R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
-    Copyright (C) 2012 The R Foundation for Statistical Computing
-    ISBN 3-900051-07-0
-    Platform: x86_64-apple-darwin12.0.0 (64-bit)
-    
-    R is free software and comes with ABSOLUTELY NO WARRANTY.
-    You are welcome to redistribute it under certain conditions.
-    Type 'license()' or 'licence()' for distribution details.
-    
-      Natural language support but running in an English locale
-    
-    R is a collaborative project with many contributors.
-    Type 'contributors()' for more information and
-    'citation()' on how to cite R or R packages in publications.
-    
-    Type 'demo()' for some demos, 'help()' for on-line help, or
-    'help.start()' for an HTML browser interface to help.
-    Type 'q()' to quit R.
-    
-    >
-
-No paste these commands into the R console:
-
-    library(ggplot2)
-    library(RColorBrewer)
-    blues <- colorRampPalette(c('dark blue', 'light blue'))
-    greens <- colorRampPalette(c('dark green', 'light green'))
-    reds <- colorRampPalette(c('pink', 'dark red'))
-     
-    setwd("~/Desktop/bedtools-demo")
-    x <- read.table('dnase.shortnames.distance.matrix')
-    labels <- read.table('labels.txt')
-    ngroups <- length(unique(labels))
-    pca <- princomp(x)
-    qplot(pca$scores[,1], pca$scores[,2], color=labels[,1],     geom="point", size=1) +
-      scale_color_manual(values = c(blues(4), greens(5), reds(5))) 
-
-You should see this:
-
-![](http://quinlanlab.cs.virginia.edu/cshl2013/pca.png)
-
-Et voila. Note that PCA was used in this case as an example of what PCA does for the CSHL Adv. Seq. course. Heatmaps are a more informative visualization in this case.
-
-
 

From 33bb59d722aa651cd601c0cae3fc4b645ad72860 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Mon, 21 Jul 2014 15:14:23 +0100
Subject: [PATCH 03/12] Fleshed out the section on shuffling beds

---
 .../bedtools-python}/bedtools-capstone.md     | 139 +++++++++++++++++-
 .../capstones/bedtools-python/do_shuffle.sh   |   8 +
 novice/capstones/bedtools-python/readings.py  |  37 +++++
 3 files changed, 182 insertions(+), 2 deletions(-)
 rename novice/{extras => capstones/bedtools-python}/bedtools-capstone.md (69%)
 create mode 100755 novice/capstones/bedtools-python/do_shuffle.sh
 create mode 100644 novice/capstones/bedtools-python/readings.py

diff --git a/novice/extras/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
similarity index 69%
rename from novice/extras/bedtools-capstone.md
rename to novice/capstones/bedtools-python/bedtools-capstone.md
index 9a640e130..f586947a2 100644
--- a/novice/extras/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -191,6 +191,27 @@ Let's be more strict and require 50% of overlap.
 
 \
 
+Do exons and CPG islands overlap significantly?
+===============================================
+
+Going back to base pairs overlap
+--------------------------------
+
+    bedtools intersect -a cpg.bed -b exons.bed -wo \
+    | head -10
+    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
+    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+    chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
+    chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+
+
+
 
 bedtools "merge"
 ====================
@@ -243,5 +264,119 @@ Let's try again.
 
 The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
 
-
-
+Or alternatively:
+
+    bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
+
+Now the overlap with merged exons:
+
+
+    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+    | head -n 10
+    chr1    28735   29810   CpG:_116    chr1    29320   29370   50
+    chr1    135124  135563  CpG:_30 chr1    134772  139696  439
+    chr1    327790  328229  CpG:_29 chr1    324438  328581  439
+    chr1    713984  714547  CpG:_60 chr1    713663  714068  84
+    chr1    762416  763445  CpG:_115    chr1    761585  762902  486
+    chr1    762416  763445  CpG:_115    chr1    762970  763155  185
+    chr1    762416  763445  CpG:_115    chr1    763177  763229  52
+    chr1    788863  789211  CpG:_28 chr1    788770  794826  348
+    chr1    854765  854973  CpG:_16 chr1    854714  854817  52
+    chr1    858970  861632  CpG:_257    chr1    861120  861180  60
+
+All we want is the very last column:
+
+    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+    | cut -f 8,8 | head -n 10
+    50
+    439
+    439
+    84
+    486
+    185
+    52
+    348
+    52
+    60
+
+Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons:
+
+    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+    | cut -f 8,8 | python readings.py --sum
+    7235722.0
+
+How do we know if this overlap is significant? Shuffle one of the files.
+
+    bedtools shuffle -i cpg.bed -g genome.txt \
+    | head -n 10 
+    chr7    154365006   154366081   CpG:_116
+    chr2    104893603   104894042   CpG:_30
+    chr11   11786369    11786808    CpG:_29
+    chr1    116263073   116264086   CpG:_84
+    chr10   10674197    10675468    CpG:_99
+    chrX    7680868 7681763 CpG:_94
+    chr6    34093178    34095089    CpG:_171
+    chr1    88263374    88263937    CpG:_60
+    chr1    150416635   150417664   CpG:_115
+    chr6    39650957    39651305    CpG:_28
+
+But keep the intervals on the same chromosome...
+
+    bedtools shuffle -chrom -i cpg.bed -g genome.txt \
+    | head -n 10 
+    chr1    2367858 2368933 CpG:_116
+    chr1    225323760   225324199   CpG:_30
+    chr1    37368198    37368637    CpG:_29
+    chr1    13872987    13874000    CpG:_84
+    chr1    91280571    91281842    CpG:_99
+    chr1    74656135    74657030    CpG:_94
+    chr1    64079046    64080957    CpG:_171
+    chr1    54034535    54035098    CpG:_60
+    chr1    237633659   237634688   CpG:_115
+    chr1    51954093    51954441    CpG:_28
+
+Save to a file, and count the overlap for the shuffled features
+
+    bedtools shuffle -chrom -i cpg.bed -g genome.txt \
+    > cpg.shuffled.bed
+    bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \
+    | cut -f 8,8 | python readings.py --sum
+    681980.0
+
+Do this lots of times to get a p-value
+
+    mkdir shuffled_cpg
+
+
+    #! /bin/bash
+
+
+    for i in {1..100}
+    do
+        bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
+    done
+
+And then:
+
+    for f in shuffled_cpg/*; 
+    do bedtools intersect -a $f -b exons.merged.bed -wo \
+    | cut -f 8,8 | python readings.py --sum; done | head -n 10
+    651838.0
+    655178.0
+    633355.0
+    636853.0
+    668339.0
+    651838.0
+    655178.0
+    633355.0
+    636853.0
+    668339.0
+
+Save that to a file:
+
+    for f in shuffled_cpg/*; 
+    do bedtools intersect -a $f -b exons.merged.bed -wo \
+    | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt
+
+
+TODO: Write a python program to plot the distribution of shuffled results, and the real result
diff --git a/novice/capstones/bedtools-python/do_shuffle.sh b/novice/capstones/bedtools-python/do_shuffle.sh
new file mode 100755
index 000000000..2c7e2edc1
--- /dev/null
+++ b/novice/capstones/bedtools-python/do_shuffle.sh
@@ -0,0 +1,8 @@
+#! /bin/bash
+
+
+for i in {1..5}
+do
+    bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
+done
+
diff --git a/novice/capstones/bedtools-python/readings.py b/novice/capstones/bedtools-python/readings.py
new file mode 100644
index 000000000..28c61922e
--- /dev/null
+++ b/novice/capstones/bedtools-python/readings.py
@@ -0,0 +1,37 @@
+import sys
+import numpy as np
+
+
+def main():
+    script = sys.argv[0]
+    action = sys.argv[1]
+    filenames = sys.argv[2:]
+    assert action in ['--min', '--mean', '--max', '--sum'], \
+           'Action is not one of --min, --mean, or --max: ' + action
+    if len(filenames) == 0:
+        process(sys.stdin, action)
+    else:
+        for f in filenames:
+            process(f, action)
+
+
+def process(filename, action):
+    data = np.loadtxt(filename, delimiter=',')
+
+    data_shape = data.shape
+    if len(data_shape) == 1:
+        data = data.reshape((1, len(data)))
+
+    if action == '--min':
+        values = data.min(axis=1)
+    elif action == '--mean':
+        values = data.mean(axis=1)
+    elif action == '--max':
+        values = data.max(axis=1)
+    elif action == '--sum':
+        values = data.sum(axis=1)
+
+    for m in values:
+        print m
+
+main()

From a56b93fb20630de6c613eb8322403d92f4b3f951 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 22 Jul 2014 11:31:32 +0100
Subject: [PATCH 04/12] Added python script for plotting the shuffled results.

---
 .../bedtools-python/bedtools-capstone.md      | 172 +++++++++++++++++-
 novice/capstones/bedtools-python/results.py   |  42 +++++
 2 files changed, 213 insertions(+), 1 deletion(-)
 create mode 100644 novice/capstones/bedtools-python/results.py

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index f586947a2..1df8735cb 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -379,4 +379,174 @@ Save that to a file:
     | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt
 
 
-TODO: Write a python program to plot the distribution of shuffled results, and the real result
+Plotting our results with python
+--------------------------------
+
+Let's copy the readings.py script and alter it to read in our results files:
+
+    import sys
+    import numpy as np
+
+
+    def main():
+        script = sys.argv[0]
+        shuffled_results_file = sys.argv[1]
+        real_results_file = sys.argv[2]
+
+        process(shuffled_results_file, real_results_file)
+
+
+    def process(shuffled_results_file, real_results_file):
+        shuffled_data = np.loadtxt(shuffled_results_file)
+        real_data = np.loadtxt(real_results_file)
+
+        print 'Real data:'
+        print real_data
+        print
+        print 'Shuffled data:'
+        print shuffled_data
+
+    main()
+
+Which prints:
+
+    Real data:
+    7235722.0
+
+    Shuffled data:
+    [ 640473.  638611.  644321.  642422.  634641.  660745.  645816.  635427.
+      595615.  644916.  662412.  645489.  666141.  674597.  637099.  621515.
+      647847.  653944.  641051.  685423.  658610.  686554.  618233.  655932.
+      670053.  641412.  617865.  651189.  639806.  658123.  639381.  644652.
+      667240.  689363.  627791.  625137.  635577.  643151.  616453.  633041.
+      629223.  645209.  629201.  639179.  649602.  638849.  667827.  637550.
+      652560.  647235.  710669.  626332.  689819.  646094.  631575.  633863.
+      657661.  642538.  648691.  660292.  649780.  643179.  615128.  653863.
+      610901.  613489.  624788.  680903.  617416.  654761.  663484.  672619.
+      615038.  630960.  622431.  634951.  659257.  649931.  633901.  612363.
+      639325.  683887.  656753.  690935.  661310.  692022.  633441.  644006.
+      652187.  643428.  643711.  682621.  607918.  674996.  674909.  625800.
+      642090.  662203.  667386.  660448.]
+
+What happens if we call the script with no arguments?
+
+    python results.py
+
+    Traceback (most recent call last):
+      File "results.py", line 23, in <module>
+        main()
+      File "results.py", line 7, in main
+        shuffled_results_file = sys.argv[1]
+    IndexError: list index out of range
+
+Not very helpful. Let's add a usage description:
+
+    import sys
+    import numpy as np
+
+    usage_string = """
+    Results.py plots the real overlap between two bed files
+    versus the overlap when one of the results files is randomly
+    shuffled.
+
+    Usage: python random.py shuffled_results.txt real_results.txt
+    """
+
+    def main():
+
+        if not len(sys.argv) == 3:
+            sys.exit(usage_string)
+
+Now we can plot a histogram of the shuffled data:
+
+    plt.hist(shuffled_data, color='black')
+    plt.show()
+
+And we can plot the real result as a vertical red line:
+
+    import sys 
+    import numpy as np
+    from matplotlib import pyplot as plt 
+
+    usage_string = """ 
+    Results.py plots the real overlap between two bed files
+    versus the overlap when one of the results files is randomly
+    shuffled.
+
+    Usage: python random.py shuffled_results.txt real_results.txt
+    """
+
+    def main():
+
+        if not len(sys.argv) == 3:
+            sys.exit(usage_string)
+
+        script = sys.argv[0]
+        shuffled_results_file = sys.argv[1]
+        real_results_file = sys.argv[2]
+
+        process(shuffled_results_file, real_results_file)
+
+
+    def process(shuffled_results_file, real_results_file):
+        shuffled_data = np.loadtxt(shuffled_results_file)
+        real_data = np.loadtxt(real_results_file)
+
+        print 'Real data:'
+        print real_data
+        print
+        print 'Shuffled data:'
+        print shuffled_data
+
+        plt.hist(shuffled_data, color='black')
+        plt.axvline(real_data, color='red', ls='--', lw=2)
+        plt.show()
+
+    main()
+
+Finally, let's add some axis labels and a legend:
+
+    import sys
+    import numpy as np
+    from matplotlib import pyplot as plt
+
+    usage_string = """
+    Results.py plots the real overlap between two bed files
+    versus the overlap when one of the results files is randomly
+    shuffled.
+
+    Usage: python random.py shuffled_results.txt real_results.txt
+    """
+
+    def main():
+
+        if not len(sys.argv) == 3:
+            sys.exit(usage_string)
+
+        script = sys.argv[0]
+        shuffled_results_file = sys.argv[1]
+        real_results_file = sys.argv[2]
+
+        process(shuffled_results_file, real_results_file)
+
+
+    def process(shuffled_results_file, real_results_file):
+        shuffled_data = np.loadtxt(shuffled_results_file)
+        real_data = np.loadtxt(real_results_file)
+
+        print 'Real data:'
+        print real_data
+        print
+        print 'Shuffled data:'
+        print shuffled_data
+
+        plt.hist(shuffled_data, color='black', label='Shuffled Data')
+        plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data')
+        plt.xlabel('Nucleotides overlap')
+        plt.ylabel('Number of shuffled results')
+        plt.legend()
+        plt.show()
+
+    main()
+
+
diff --git a/novice/capstones/bedtools-python/results.py b/novice/capstones/bedtools-python/results.py
new file mode 100644
index 000000000..5b71b4f56
--- /dev/null
+++ b/novice/capstones/bedtools-python/results.py
@@ -0,0 +1,42 @@
+import sys
+import numpy as np
+from matplotlib import pyplot as plt
+
+usage_string = """
+Results.py plots the real overlap between two bed files
+versus the overlap when one of the results files is randomly
+shuffled.
+
+Usage: python random.py shuffled_results.txt real_results.txt
+"""
+
+def main():
+
+    if not len(sys.argv) == 3:
+        sys.exit(usage_string)
+
+    script = sys.argv[0]
+    shuffled_results_file = sys.argv[1]
+    real_results_file = sys.argv[2]
+
+    process(shuffled_results_file, real_results_file)
+
+
+def process(shuffled_results_file, real_results_file):
+    shuffled_data = np.loadtxt(shuffled_results_file)
+    real_data = np.loadtxt(real_results_file)
+
+    print 'Real data:'
+    print real_data
+    print
+    print 'Shuffled data:'
+    print shuffled_data
+
+    plt.hist(shuffled_data, color='black', label='Shuffled Data')
+    plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data')
+    plt.xlabel('Nucleotides overlap')
+    plt.ylabel('Number of shuffled results')
+    plt.legend()
+    plt.show()
+
+main()

From 7916e338f6fe4e2cfe7490c6afd319d73f33d087 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 22 Jul 2014 14:19:46 +0100
Subject: [PATCH 05/12] Get jekyll to build our new example

---
 Makefile                                              | 1 +
 novice/capstones/bedtools-python/bedtools-capstone.md | 9 +++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
index cb6805a12..57460b891 100644
--- a/Makefile
+++ b/Makefile
@@ -87,6 +87,7 @@ BOOK_SRC = \
 # All source pages (including things not in the book).
 PAGES_SRC = \
 	contents.md \
+	$(wildcard novice/capstones/*/*.md) \
 	$(wildcard intermediate/python/*.md) \
 	$(BOOK_SRC)
 
diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 1df8735cb..5a4bd2842 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -1,7 +1,8 @@
-% bedtools Tutorial
-% Aaron Quinlan
-% November 22, 2013
-
+---
+layout: lesson
+root: ../../..
+title: Capstone example Python and Bedtools
+---
 
 Synopsis
 ========

From 3226f2fe8ed42f67e0ff453e98b3d9f675b31319 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 22 Jul 2014 15:49:00 +0100
Subject: [PATCH 06/12] Sorted formatting out

---
 .../bedtools-python/bedtools-capstone.md      | 819 ++++++++++--------
 1 file changed, 475 insertions(+), 344 deletions(-)

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 5a4bd2842..9dbcf36dd 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -10,53 +10,42 @@ Synopsis
 Our goal is to work through examples that demonstrate how to 
 explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package.
 
-Some of our analysis will be based upon the Maurano et al exploration of DnaseI hypersensitivity sites in hundreds of primary tissue types.
-
-    Maurano et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012. Vol. 337 no. 6099 pp. 1190-1195.
-
-    www.sciencemag.org/content/337/6099/1190.short
-
-
 This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
 
-
-
-
-\
-
-
 Setup
 =====
 From the Terminal, create a new directory on your Desktop called "bedtools-demo".
 
-    cd ~/Desktop
-    mkdir bedtools-demo
+~~~
+cd ~/Desktop
+mkdir bedtools-demo
+~~~
+{:class="in"}
+
 
 Navigate into that directory.
 
-    cd bedtools-demo
+~~~
+cd bedtools-demo
+~~~
+{:class="in"}
 
 Download the sample BED files I have provided.
 
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/maurano.dnaseI.tgz
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
-    curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
-
-Now, we need to extract all of the 20 Dnase I hypersensitivity BED files from the "tarball" named
-`maurano.dnaseI.tgz`.
-
-    tar -zxvf maurano.dnaseI.tgz
-    rm maurano.dnaseI.tgz
+~~~
+curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
+curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
+curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
+curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
+~~~
+{:class="in"}
 
 Let's take a look at what files we now have.
 
-    ls -1
-
-
-\
-
+~~~
+ls -1
+~~~
+{:class="in"}
 
 What are these files?
 =========================
@@ -66,22 +55,24 @@ In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` r
 
 The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
 
-
-\
-
-
 The bedtools help
 ==================
 To bring up the help, just type
 
-    bedtools
+~~~
+bedtools
+~~~
+{:class="in"}
 
 As you can see, there are multiple "subcommands" and for bedtools to
 work you must tell it which subcommand you want to use. Examples:
 
-    bedtools intersect
-    bedtools merge
-    bedtools subtract
+~~~
+bedtools intersect
+bedtools merge
+bedtools subtract
+~~~
+{:class="in"}
 
 
 bedtools "intersect"
@@ -95,78 +86,105 @@ Default behavior
 ----------------
 By default, `intersect` reports the intervals that represent overlaps between your two files.  To demonstrate, let's identify all of the CpG islands that overlap exons.
 
-    bedtools intersect -a cpg.bed -b exons.bed | head -5
-    chr1    29320   29370   CpG:_116
-    chr1    135124  135563  CpG:_30
-    chr1    327790  328229  CpG:_29
-    chr1    327790  328229  CpG:_29
-    chr1    327790  328229  CpG:_29
+~~~
+bedtools intersect -a cpg.bed -b exons.bed | head -5
+~~~
+{:class="in"}
+~~~
+chr1    29320   29370   CpG:_116
+chr1    135124  135563  CpG:_30
+chr1    327790  328229  CpG:_29
+chr1    327790  328229  CpG:_29
+chr1    327790  328229  CpG:_29
+~~~
+{:class="out"}
 
 Reporting the original feature in each file.
 --------------------------------------------
 The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected.
 
-    bedtools intersect -a cpg.bed -b exons.bed -wa -wb \
-    | head -5
-    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -
-    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +
-    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +
+~~~
+bedtools intersect -a cpg.bed -b exons.bed -wa -wb \
+| head -5
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +
+~~~
+{:class="out"}
 
 How many base pairs of overlap were there?
 ------------------------------------------
 The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files.
 
-    bedtools intersect -a cpg.bed -b exons.bed -wo \
-    | head -10
-    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
-    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-    chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
-    chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+~~~
+bedtools intersect -a cpg.bed -b exons.bed -wo \
+| head -10
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+~~~
+{:class="out"}
 
 Counting the number of overlapping features.
 --------------------------------------------
 We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
 
-    bedtools intersect -a cpg.bed -b exons.bed -c \
-    | head
-    chr1    28735   29810   CpG:_116    1
-    chr1    135124  135563  CpG:_30 1
-    chr1    327790  328229  CpG:_29 3
-    chr1    437151  438164  CpG:_84 0
-    chr1    449273  450544  CpG:_99 0
-    chr1    533219  534114  CpG:_94 0
-    chr1    544738  546649  CpG:_171    0
-    chr1    713984  714547  CpG:_60 1
-    chr1    762416  763445  CpG:_115    10
-    chr1    788863  789211  CpG:_28 9
-
-\
-
+~~~
+bedtools intersect -a cpg.bed -b exons.bed -c \
+| head
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116    1
+chr1    135124  135563  CpG:_30 1
+chr1    327790  328229  CpG:_29 3
+chr1    437151  438164  CpG:_84 0
+chr1    449273  450544  CpG:_99 0
+chr1    533219  534114  CpG:_94 0
+chr1    544738  546649  CpG:_171    0
+chr1    713984  714547  CpG:_60 1
+chr1    762416  763445  CpG:_115    10
+chr1    788863  789211  CpG:_28 9
+~~~
+{:class="out"}
 
 Find features that DO NOT overlap
 --------------------------------------------
 Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case.
 
-    bedtools intersect -a cpg.bed -b exons.bed -v \
-    | head
-    chr1    437151  438164  CpG:_84
-    chr1    449273  450544  CpG:_99
-    chr1    533219  534114  CpG:_94
-    chr1    544738  546649  CpG:_171
-    chr1    801975  802338  CpG:_24
-    chr1    805198  805628  CpG:_50
-    chr1    839694  840619  CpG:_83
-    chr1    844299  845883  CpG:_153
-    chr1    912869  913153  CpG:_28
-    chr1    919726  919927  CpG:_15
+~~~
+bedtools intersect -a cpg.bed -b exons.bed -v \
+| head
+~~~
+{:class="in"}
+~~~
+chr1    437151  438164  CpG:_84
+chr1    449273  450544  CpG:_99
+chr1    533219  534114  CpG:_94
+chr1    544738  546649  CpG:_171
+chr1    801975  802338  CpG:_24
+chr1    805198  805628  CpG:_50
+chr1    839694  840619  CpG:_83
+chr1    844299  845883  CpG:_153
+chr1    912869  913153  CpG:_28
+chr1    919726  919927  CpG:_15
+~~~
+{:class="out"}
 
 
 Require a minimal fraction of overlap.
@@ -175,22 +193,25 @@ Recall that the default is to report overlaps between features in A and B so lon
 
 Let's be more strict and require 50% of overlap.
 
-    bedtools intersect -a cpg.bed -b exons.bed \
-    -wo -f 0.50 \
-    | head
-    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047525_exon_4_0_chr1_788771_f    0   +   348
-    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047524_exon_3_0_chr1_788771_f    0   +   348
-    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047523_exon_3_0_chr1_788771_f    0   +   348
-    chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047522_exon_5_0_chr1_788859_f    0   +   348
-    chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047521_exon_4_0_chr1_788771_f    0   +   348
-    chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047520_exon_6_0_chr1_788859_f    0   +   348
-
-
-\
+~~~
+bedtools intersect -a cpg.bed -b exons.bed \
+-wo -f 0.50 \
+| head
+~~~
+{:class="in"}
+~~~
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047525_exon_4_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047524_exon_3_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047523_exon_3_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047522_exon_5_0_chr1_788859_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047521_exon_4_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047520_exon_6_0_chr1_788859_f    0   +   348
+~~~
+{:class="out"}
 
 Do exons and CPG islands overlap significantly?
 ===============================================
@@ -198,21 +219,24 @@ Do exons and CPG islands overlap significantly?
 Going back to base pairs overlap
 --------------------------------
 
-    bedtools intersect -a cpg.bed -b exons.bed -wo \
-    | head -10
-    chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
-    chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-    chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
-    chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
-
-
-
+~~~
+bedtools intersect -a cpg.bed -b exons.bed -wo \
+| head -10
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+~~~
+{:class="out"}
 
 bedtools "merge"
 ====================
@@ -227,157 +251,237 @@ The merge tool requires that the input file is sorted by chromosome, then by sta
 
 If you run the merge command on a file that is not sorted, you will get an error. For example:
 
-    bedtools merge -i exons.bed
-    chr1    66999824    67000051
-    chr1    67091529    67091593
-    chr1    67098752    67098777
-    chr1    67101626    67101698
-    chr1    67105459    67105516
-    chr1    67108492    67108547
-    chr1    67109226    67109402
-    chr1    67126195    67126207
-    chr1    67133212    67133224
-    chr1    67136677    67136702
-    chr1    67137626    67137678
-    chr1    67138963    67139049
-    chr1    67142686    67142779
-    chr1    67145360    67145435
-    chr1    67147551    67148052
-    chr1    67154830    67154958
-    chr1    67155872    67155999
-    chr1    67161116    67161176
-    chr1    67184976    67185088
-    chr1    67194946    67195102
-    chr1    67199430    67199563
-    chr1    67205017    67205220
-    chr1    67206340    67206405
-    chr1    67206954    67207119
-    ERROR: input file: (exons.bed) is not sorted by chrom then start.
-       The start coordinate at line 26 is less than the start at line 25
+~~~
+bedtools merge -i exons.bed
+~~~
+{:class="in"}
+~~~
+chr1    66999824    67000051
+chr1    67091529    67091593
+chr1    67098752    67098777
+chr1    67101626    67101698
+chr1    67105459    67105516
+chr1    67108492    67108547
+chr1    67109226    67109402
+chr1    67126195    67126207
+chr1    67133212    67133224
+chr1    67136677    67136702
+chr1    67137626    67137678
+chr1    67138963    67139049
+chr1    67142686    67142779
+chr1    67145360    67145435
+chr1    67147551    67148052
+chr1    67154830    67154958
+chr1    67155872    67155999
+chr1    67161116    67161176
+chr1    67184976    67185088
+chr1    67194946    67195102
+chr1    67199430    67199563
+chr1    67205017    67205220
+chr1    67206340    67206405
+chr1    67206954    67207119
+ERROR: input file: (exons.bed) is not sorted by chrom then start.
+   The start coordinate at line 26 is less than the start at line 25
+~~~
+{:class="err"}
 
 To correct this, you need to sort your BED using the UNIX `sort` utility.
 
-    sort -k1,1 -k2,2n exons.bed > exons.sort.bed
+~~~
+sort -k1,1 -k2,2n exons.bed > exons.sort.bed
+~~~
+{:class="in"}
 
 Let's try again.
     
-    bedtools merge -i exons.sort.bed | head -10
+~~~
+bedtools merge -i exons.sort.bed | head -10
+~~~
+{:class="in"}
 
 The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
 
 Or alternatively:
 
-    bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
+~~~
+bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
+~~~
+{:class="in"}
 
 Now the overlap with merged exons:
 
 
-    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-    | head -n 10
-    chr1    28735   29810   CpG:_116    chr1    29320   29370   50
-    chr1    135124  135563  CpG:_30 chr1    134772  139696  439
-    chr1    327790  328229  CpG:_29 chr1    324438  328581  439
-    chr1    713984  714547  CpG:_60 chr1    713663  714068  84
-    chr1    762416  763445  CpG:_115    chr1    761585  762902  486
-    chr1    762416  763445  CpG:_115    chr1    762970  763155  185
-    chr1    762416  763445  CpG:_115    chr1    763177  763229  52
-    chr1    788863  789211  CpG:_28 chr1    788770  794826  348
-    chr1    854765  854973  CpG:_16 chr1    854714  854817  52
-    chr1    858970  861632  CpG:_257    chr1    861120  861180  60
+~~~
+bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+| head -n 10
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116    chr1    29320   29370   50
+chr1    135124  135563  CpG:_30 chr1    134772  139696  439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  439
+chr1    713984  714547  CpG:_60 chr1    713663  714068  84
+chr1    762416  763445  CpG:_115    chr1    761585  762902  486
+chr1    762416  763445  CpG:_115    chr1    762970  763155  185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  52
+chr1    788863  789211  CpG:_28 chr1    788770  794826  348
+chr1    854765  854973  CpG:_16 chr1    854714  854817  52
+chr1    858970  861632  CpG:_257    chr1    861120  861180  60
+~~~
+{:class="out"}
 
 All we want is the very last column:
 
-    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-    | cut -f 8,8 | head -n 10
-    50
-    439
-    439
-    84
-    486
-    185
-    52
-    348
-    52
-    60
+~~~
+bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+| cut -f 8,8 | head -n 10
+~~~
+{:class="in"}
+~~~
+50
+439
+439
+84
+486
+185
+52
+348
+52
+60
+~~~
+{:class="out"}
 
 Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons:
 
-    bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-    | cut -f 8,8 | python readings.py --sum
-    7235722.0
+~~~
+bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
+| cut -f 8,8 | python readings.py --sum
+~~~
+{:class="in"}
+~~~
+7235722.0
+~~~
+{:class="out"}
 
 How do we know if this overlap is significant? Shuffle one of the files.
 
-    bedtools shuffle -i cpg.bed -g genome.txt \
-    | head -n 10 
-    chr7    154365006   154366081   CpG:_116
-    chr2    104893603   104894042   CpG:_30
-    chr11   11786369    11786808    CpG:_29
-    chr1    116263073   116264086   CpG:_84
-    chr10   10674197    10675468    CpG:_99
-    chrX    7680868 7681763 CpG:_94
-    chr6    34093178    34095089    CpG:_171
-    chr1    88263374    88263937    CpG:_60
-    chr1    150416635   150417664   CpG:_115
-    chr6    39650957    39651305    CpG:_28
+~~~
+bedtools shuffle -i cpg.bed -g genome.txt \
+| head -n 10 
+~~~
+{:class="in"}
+~~~
+chr7    154365006   154366081   CpG:_116
+chr2    104893603   104894042   CpG:_30
+chr11   11786369    11786808    CpG:_29
+chr1    116263073   116264086   CpG:_84
+chr10   10674197    10675468    CpG:_99
+chrX    7680868 7681763 CpG:_94
+chr6    34093178    34095089    CpG:_171
+chr1    88263374    88263937    CpG:_60
+chr1    150416635   150417664   CpG:_115
+chr6    39650957    39651305    CpG:_28
+~~~
+{:class="out"}
 
 But keep the intervals on the same chromosome...
 
-    bedtools shuffle -chrom -i cpg.bed -g genome.txt \
-    | head -n 10 
-    chr1    2367858 2368933 CpG:_116
-    chr1    225323760   225324199   CpG:_30
-    chr1    37368198    37368637    CpG:_29
-    chr1    13872987    13874000    CpG:_84
-    chr1    91280571    91281842    CpG:_99
-    chr1    74656135    74657030    CpG:_94
-    chr1    64079046    64080957    CpG:_171
-    chr1    54034535    54035098    CpG:_60
-    chr1    237633659   237634688   CpG:_115
-    chr1    51954093    51954441    CpG:_28
+~~~
+bedtools shuffle -chrom -i cpg.bed -g genome.txt \
+| head -n 10 
+~~~
+{:class="in"}
+~~~
+chr1    2367858 2368933 CpG:_116
+chr1    225323760   225324199   CpG:_30
+chr1    37368198    37368637    CpG:_29
+chr1    13872987    13874000    CpG:_84
+chr1    91280571    91281842    CpG:_99
+chr1    74656135    74657030    CpG:_94
+chr1    64079046    64080957    CpG:_171
+chr1    54034535    54035098    CpG:_60
+chr1    237633659   237634688   CpG:_115
+chr1    51954093    51954441    CpG:_28
+~~~
+{:class="out"}
 
 Save to a file, and count the overlap for the shuffled features
 
-    bedtools shuffle -chrom -i cpg.bed -g genome.txt \
-    > cpg.shuffled.bed
-    bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \
-    | cut -f 8,8 | python readings.py --sum
-    681980.0
+~~~
+bedtools shuffle -chrom -i cpg.bed -g genome.txt \
+> cpg.shuffled.bed
+bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \
+| cut -f 8,8 | python readings.py --sum
+~~~
+{:class="in"}
+~~~
+681980.0
+~~~
+{:class="out"}
 
 Do this lots of times to get a p-value
 
-    mkdir shuffled_cpg
+~~~
+mkdir shuffled_cpg
+~~~
+{:class="in"}
+
+We have a script for that:
+
+~~~
+cat do_shuffle.sh
+~~~
+{:class="in"}
+~~~
+#! /bin/bash
 
 
-    #! /bin/bash
+for i in {1..100}
+do
+    bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
+done
+~~~
+{:class="out"}
 
+Run the script:
+
+~~~
+./do_suffle.sh
+ls -l shuffled_cpg | head -n 5
+~~~
+{:class="in"}
 
-    for i in {1..100}
-    do
-        bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
-    done
 
 And then:
 
-    for f in shuffled_cpg/*; 
-    do bedtools intersect -a $f -b exons.merged.bed -wo \
-    | cut -f 8,8 | python readings.py --sum; done | head -n 10
-    651838.0
-    655178.0
-    633355.0
-    636853.0
-    668339.0
-    651838.0
-    655178.0
-    633355.0
-    636853.0
-    668339.0
+~~~
+for f in shuffled_cpg/*; 
+do bedtools intersect -a $f -b exons.merged.bed -wo \
+| cut -f 8,8 | python readings.py --sum; done | head -n 10
+~~~
+{:class="in"}
+~~~
+651838.0
+655178.0
+633355.0
+636853.0
+668339.0
+651838.0
+655178.0
+633355.0
+636853.0
+668339.0
+~~~
+{:class="out"}
 
 Save that to a file:
 
-    for f in shuffled_cpg/*; 
-    do bedtools intersect -a $f -b exons.merged.bed -wo \
-    | cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt
+~~~
+for f in shuffled_cpg/*; 
+do bedtools intersect -a $f -b exons.merged.bed -wo \
+| cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt
+~~~
+{:class="in"}
 
 
 Plotting our results with python
@@ -385,169 +489,196 @@ Plotting our results with python
 
 Let's copy the readings.py script and alter it to read in our results files:
 
-    import sys
-    import numpy as np
+~~~
+cat results.py
+~~~
+{:class="in"}
+~~~
+import sys
+import numpy as np
 
 
-    def main():
-        script = sys.argv[0]
-        shuffled_results_file = sys.argv[1]
-        real_results_file = sys.argv[2]
+def main():
+    script = sys.argv[0]
+    shuffled_results_file = sys.argv[1]
+    real_results_file = sys.argv[2]
 
-        process(shuffled_results_file, real_results_file)
+    process(shuffled_results_file, real_results_file)
 
 
-    def process(shuffled_results_file, real_results_file):
-        shuffled_data = np.loadtxt(shuffled_results_file)
-        real_data = np.loadtxt(real_results_file)
+def process(shuffled_results_file, real_results_file):
+    shuffled_data = np.loadtxt(shuffled_results_file)
+    real_data = np.loadtxt(real_results_file)
 
-        print 'Real data:'
-        print real_data
-        print
-        print 'Shuffled data:'
-        print shuffled_data
+    print 'Real data:'
+    print real_data
+    print
+    print 'Shuffled data:'
+    print shuffled_data
 
-    main()
+main()
+~~~
+{:class="out"}
 
 Which prints:
 
-    Real data:
-    7235722.0
-
-    Shuffled data:
-    [ 640473.  638611.  644321.  642422.  634641.  660745.  645816.  635427.
-      595615.  644916.  662412.  645489.  666141.  674597.  637099.  621515.
-      647847.  653944.  641051.  685423.  658610.  686554.  618233.  655932.
-      670053.  641412.  617865.  651189.  639806.  658123.  639381.  644652.
-      667240.  689363.  627791.  625137.  635577.  643151.  616453.  633041.
-      629223.  645209.  629201.  639179.  649602.  638849.  667827.  637550.
-      652560.  647235.  710669.  626332.  689819.  646094.  631575.  633863.
-      657661.  642538.  648691.  660292.  649780.  643179.  615128.  653863.
-      610901.  613489.  624788.  680903.  617416.  654761.  663484.  672619.
-      615038.  630960.  622431.  634951.  659257.  649931.  633901.  612363.
-      639325.  683887.  656753.  690935.  661310.  692022.  633441.  644006.
-      652187.  643428.  643711.  682621.  607918.  674996.  674909.  625800.
-      642090.  662203.  667386.  660448.]
+~~~
+python results.py shuffled_results.txt cpg_result.txt
+~~~
+{:class="in"}
+~~~
+Real data:
+7235722.0
+
+Shuffled data:
+[ 640473.  638611.  644321.  642422.  634641.  660745.  645816.  635427.
+  595615.  644916.  662412.  645489.  666141.  674597.  637099.  621515.
+  647847.  653944.  641051.  685423.  658610.  686554.  618233.  655932.
+  670053.  641412.  617865.  651189.  639806.  658123.  639381.  644652.
+  667240.  689363.  627791.  625137.  635577.  643151.  616453.  633041.
+  629223.  645209.  629201.  639179.  649602.  638849.  667827.  637550.
+  652560.  647235.  710669.  626332.  689819.  646094.  631575.  633863.
+  657661.  642538.  648691.  660292.  649780.  643179.  615128.  653863.
+  610901.  613489.  624788.  680903.  617416.  654761.  663484.  672619.
+  615038.  630960.  622431.  634951.  659257.  649931.  633901.  612363.
+  639325.  683887.  656753.  690935.  661310.  692022.  633441.  644006.
+  652187.  643428.  643711.  682621.  607918.  674996.  674909.  625800.
+  642090.  662203.  667386.  660448.]
+~~~
+{:class="out"}
 
 What happens if we call the script with no arguments?
 
-    python results.py
-
-    Traceback (most recent call last):
-      File "results.py", line 23, in <module>
-        main()
-      File "results.py", line 7, in main
-        shuffled_results_file = sys.argv[1]
-    IndexError: list index out of range
+~~~
+python results.py
+~~~
+{:class="in"}
+~~~
+Traceback (most recent call last):
+  File "results.py", line 23, in <module>
+    main()
+  File "results.py", line 7, in main
+    shuffled_results_file = sys.argv[1]
+IndexError: list index out of range
+~~~
+{:class="err"}
 
 Not very helpful. Let's add a usage description:
 
-    import sys
-    import numpy as np
+~~~
+import sys
+import numpy as np
 
-    usage_string = """
-    Results.py plots the real overlap between two bed files
-    versus the overlap when one of the results files is randomly
-    shuffled.
+usage_string = """
+Results.py plots the real overlap between two bed files
+versus the overlap when one of the results files is randomly
+shuffled.
 
-    Usage: python random.py shuffled_results.txt real_results.txt
-    """
+Usage: python random.py shuffled_results.txt real_results.txt
+"""
 
-    def main():
+def main():
 
-        if not len(sys.argv) == 3:
-            sys.exit(usage_string)
+    if not len(sys.argv) == 3:
+        sys.exit(usage_string)
+~~~
 
 Now we can plot a histogram of the shuffled data:
 
-    plt.hist(shuffled_data, color='black')
-    plt.show()
+~~~
+plt.hist(shuffled_data, color='black')
+plt.show()
+~~~
 
 And we can plot the real result as a vertical red line:
 
-    import sys 
-    import numpy as np
-    from matplotlib import pyplot as plt 
+~~~
+import sys 
+import numpy as np
+from matplotlib import pyplot as plt 
 
-    usage_string = """ 
-    Results.py plots the real overlap between two bed files
-    versus the overlap when one of the results files is randomly
-    shuffled.
+usage_string = """ 
+Results.py plots the real overlap between two bed files
+versus the overlap when one of the results files is randomly
+shuffled.
 
-    Usage: python random.py shuffled_results.txt real_results.txt
-    """
+Usage: python random.py shuffled_results.txt real_results.txt
+"""
 
-    def main():
+def main():
 
-        if not len(sys.argv) == 3:
-            sys.exit(usage_string)
+    if not len(sys.argv) == 3:
+        sys.exit(usage_string)
 
-        script = sys.argv[0]
-        shuffled_results_file = sys.argv[1]
-        real_results_file = sys.argv[2]
+    script = sys.argv[0]
+    shuffled_results_file = sys.argv[1]
+    real_results_file = sys.argv[2]
 
-        process(shuffled_results_file, real_results_file)
+    process(shuffled_results_file, real_results_file)
 
 
-    def process(shuffled_results_file, real_results_file):
-        shuffled_data = np.loadtxt(shuffled_results_file)
-        real_data = np.loadtxt(real_results_file)
+def process(shuffled_results_file, real_results_file):
+    shuffled_data = np.loadtxt(shuffled_results_file)
+    real_data = np.loadtxt(real_results_file)
 
-        print 'Real data:'
-        print real_data
-        print
-        print 'Shuffled data:'
-        print shuffled_data
+    print 'Real data:'
+    print real_data
+    print
+    print 'Shuffled data:'
+    print shuffled_data
 
-        plt.hist(shuffled_data, color='black')
-        plt.axvline(real_data, color='red', ls='--', lw=2)
-        plt.show()
+    plt.hist(shuffled_data, color='black')
+    plt.axvline(real_data, color='red', ls='--', lw=2)
+    plt.show()
 
-    main()
+main()
+~~~
 
 Finally, let's add some axis labels and a legend:
 
-    import sys
-    import numpy as np
-    from matplotlib import pyplot as plt
+~~~
+import sys
+import numpy as np
+from matplotlib import pyplot as plt
 
-    usage_string = """
-    Results.py plots the real overlap between two bed files
-    versus the overlap when one of the results files is randomly
-    shuffled.
+usage_string = """
+Results.py plots the real overlap between two bed files
+versus the overlap when one of the results files is randomly
+shuffled.
 
-    Usage: python random.py shuffled_results.txt real_results.txt
-    """
+Usage: python random.py shuffled_results.txt real_results.txt
+"""
 
-    def main():
+def main():
 
-        if not len(sys.argv) == 3:
-            sys.exit(usage_string)
+    if not len(sys.argv) == 3:
+        sys.exit(usage_string)
 
-        script = sys.argv[0]
-        shuffled_results_file = sys.argv[1]
-        real_results_file = sys.argv[2]
+    script = sys.argv[0]
+    shuffled_results_file = sys.argv[1]
+    real_results_file = sys.argv[2]
 
-        process(shuffled_results_file, real_results_file)
+    process(shuffled_results_file, real_results_file)
 
 
-    def process(shuffled_results_file, real_results_file):
-        shuffled_data = np.loadtxt(shuffled_results_file)
-        real_data = np.loadtxt(real_results_file)
+def process(shuffled_results_file, real_results_file):
+    shuffled_data = np.loadtxt(shuffled_results_file)
+    real_data = np.loadtxt(real_results_file)
 
-        print 'Real data:'
-        print real_data
-        print
-        print 'Shuffled data:'
-        print shuffled_data
+    print 'Real data:'
+    print real_data
+    print
+    print 'Shuffled data:'
+    print shuffled_data
 
-        plt.hist(shuffled_data, color='black', label='Shuffled Data')
-        plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data')
-        plt.xlabel('Nucleotides overlap')
-        plt.ylabel('Number of shuffled results')
-        plt.legend()
-        plt.show()
+    plt.hist(shuffled_data, color='black', label='Shuffled Data')
+    plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data')
+    plt.xlabel('Nucleotides overlap')
+    plt.ylabel('Number of shuffled results')
+    plt.legend()
+    plt.show()
 
-    main()
+main()
+~~~
 
 
From 6dc35eeb5b83617bfa20fdc1c9b951a35980930c Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Tue, 22 Jul 2014 17:29:07 +0100
Subject: [PATCH 07/12] Expanding explanatory text and fixing some formatting
 issues

---
 .../bedtools-python/bedtools-capstone.md      | 215 ++++++++++--------
 1 file changed, 126 insertions(+), 89 deletions(-)

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 9dbcf36dd..6536a1d1a 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -10,15 +10,16 @@ Synopsis
 Our goal is to work through examples that demonstrate how to 
 explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package.
 
-This tutorial is merely meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
+This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
 
 Setup
 =====
+
 From the Terminal, create a new directory on your Desktop called "bedtools-demo".
 
 ~~~
-cd ~/Desktop
-mkdir bedtools-demo
+$ cd ~/Desktop
+$ mkdir bedtools-demo
 ~~~
 {:class="in"}
 
@@ -26,41 +27,50 @@ mkdir bedtools-demo
 Navigate into that directory.
 
 ~~~
-cd bedtools-demo
+$ cd bedtools-demo
 ~~~
 {:class="in"}
 
-Download the sample BED files I have provided.
+Download the sample BED files.
 
 ~~~
-curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
-curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
-curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
-curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
+$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
+$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
+$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
+$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
 ~~~
 {:class="in"}
 
 Let's take a look at what files we now have.
 
 ~~~
-ls -1
+$ ls -1
 ~~~
 {:class="in"}
 
+~~~
+cpg.bed
+exons.bed
+genome.txt
+gwas.bed
+~~~
+{:class="out"}
+
+
 What are these files?
 =========================
-Your directory should now contain 23 BED files and 1 genome file. Twenty of these files (those starting with "f" for "fetal tissue") reflect Dnase I hypersensitivity sites measured in twenty different fetal tissue samples from the brain, heart, intestine, kidney, lung, muscle, skin, and stomach.
+Your directory should now contain 3 BED files and 1 genome file. 
 
-In addition: `cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS).
+`cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS).
 
-The latter 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
+These 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
 
 The bedtools help
 ==================
 To bring up the help, just type
 
 ~~~
-bedtools
+$ bedtools
 ~~~
 {:class="in"}
 
@@ -68,26 +78,27 @@ As you can see, there are multiple "subcommands" and for bedtools to
 work you must tell it which subcommand you want to use. Examples:
 
 ~~~
-bedtools intersect
-bedtools merge
-bedtools subtract
+$ bedtools intersect
+$ bedtools merge
+$ bedtools subtract
 ~~~
 {:class="in"}
 
+This should bring up the specific help text associated with each tool.
 
 bedtools "intersect"
 ====================
 
-The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the gemome where the features in the two files overlap (that is, share at least one base pair in common).
+The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common).
 
 ![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png)
 
 Default behavior
 ----------------
-By default, `intersect` reports the intervals that represent overlaps between your two files.  To demonstrate, let's identify all of the CpG islands that overlap exons.
+By default, `intersect` reports the intervals that represent overlaps between your two files.  To demonstrate, let's identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed | head -5
+$ bedtools intersect -a cpg.bed -b exons.bed | head -5
 ~~~
 {:class="in"}
 ~~~
@@ -104,16 +115,15 @@ Reporting the original feature in each file.
 The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed -wa -wb \
-| head -5
+$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5
 ~~~
 {:class="in"}
 ~~~
-chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -
-chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +
-chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR_024540_exon_10_0_chr1_29321_r    0   -
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +
 ~~~
 {:class="out"}
 
@@ -122,20 +132,19 @@ How many base pairs of overlap were there?
 The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed -wo \
-| head -10
+$ bedtools intersect -a cpg.bed -b exons.bed -wo  | head -10
 ~~~
 {:class="in"}
 ~~~
-chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
-chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
-chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR_024540_exon_10_0_chr1_29321_r    0   -   50
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_033908_exon_6_0_chr1_713664_r    0   -   84
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_015368_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  NR_047525_exon_0_0_chr1_763178_f    0   +   52
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047524_exon_0_0_chr1_762971_f    0   +   185
 chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
 ~~~
 {:class="out"}
@@ -145,8 +154,7 @@ Counting the number of overlapping features.
 We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed -c \
-| head
+$ bedtools intersect -a cpg.bed -b exons.bed -c  | head
 ~~~
 {:class="in"}
 ~~~
@@ -168,8 +176,7 @@ Find features that DO NOT overlap
 Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed -v \
-| head
+$ bedtools intersect -a cpg.bed -b exons.bed -v  | head
 ~~~
 {:class="in"}
 ~~~
@@ -194,52 +201,56 @@ Recall that the default is to report overlaps between features in A and B so lon
 Let's be more strict and require 50% of overlap.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed \
--wo -f 0.50 \
-| head
+$ bedtools intersect -a cpg.bed -b exons.bed -wo -f 0.50 | head
 ~~~
 {:class="in"}
 ~~~
-chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047525_exon_4_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047524_exon_3_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047523_exon_3_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047522_exon_5_0_chr1_788859_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_ 047521_exon_4_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_ 047520_exon_6_0_chr1_788859_f    0   +   348
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047525_exon_4_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047524_exon_3_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047523_exon_3_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_047522_exon_5_0_chr1_788859_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047521_exon_4_0_chr1_788771_f    0   +   348
+chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_047520_exon_6_0_chr1_788859_f    0   +   348
 ~~~
 {:class="out"}
 
-Do exons and CPG islands overlap significantly?
+Do exons and CpG islands overlap significantly?
 ===============================================
 
+OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do the overlap more than we would expect by random chance?
+
 Going back to base pairs overlap
 --------------------------------
 
+In order to start answering this question, we need to get the base pairs of overlap between all of the elements. From the section above, we know we can get this using the -wo flag:
+
 ~~~
-bedtools intersect -a cpg.bed -b exons.bed -wo \
-| head -10
+$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10
 ~~~
 {:class="in"}
 ~~~
-chr1    28735   29810   CpG:_116    chr1    29320   29370   NR  _024540_exon_10_0_chr1_29321_r    0   -   50
-chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_ 039983_exon_0_0_chr1_134773_r    0   -   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028322_exon_2_0_chr1_324439_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_ 028327_exon_3_0_chr1_327036_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_ 028325_exon_2_0_chr1_324439_f    0   +   439
-chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_ 033908_exon_6_0_chr1_713664_r    0   -   84
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _015368_exon_0_0_chr1_762971_f    0   +   185
-chr1    762416  763445  CpG:_115    chr1    763177  763229  NR  _047525_exon_0_0_chr1_763178_f    0   +   52
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR  _047524_exon_0_0_chr1_762971_f    0   +   185
+chr1  28735   29810   CpG:_116    chr1    29320   29370   NR_024540_exon_10_0_chr1_29321_r    0   -   50
+chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +   439
+chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +   439
+chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_033908_exon_6_0_chr1_713664_r    0   -   84
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_015368_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  NR_047525_exon_0_0_chr1_763178_f    0   +   52
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047524_exon_0_0_chr1_762971_f    0   +   185
 chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
 ~~~
 {:class="out"}
 
+Can you spot anything odd about this output? Lines 3-5 indicate that one of our CpG islands is overlapping with three alternative versions of the same exon. For the purposes of our analysis though, this should really count as a single overlap. What we need to do is merge overlapping exons in exons.bed together.
+
 bedtools "merge"
 ====================
+
 Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you.
 
 ![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png)
@@ -247,6 +258,7 @@ Many datasets of genomic features have many individual features that overlap one
 
 Input must be sorted
 --------------------
+
 The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM.
 
 If you run the merge command on a file that is not sorted, you will get an error. For example:
@@ -280,40 +292,66 @@ chr1    67199430    67199563
 chr1    67205017    67205220
 chr1    67206340    67206405
 chr1    67206954    67207119
+~~~
+{:class="in"}
+~~~
 ERROR: input file: (exons.bed) is not sorted by chrom then start.
    The start coordinate at line 26 is less than the start at line 25
 ~~~
 {:class="err"}
 
-To correct this, you need to sort your BED using the UNIX `sort` utility.
+To correct this, you need to sort your BED. We could use the UNIX `sort` utility that we already covered, which would look like this:
 
 ~~~
-sort -k1,1 -k2,2n exons.bed > exons.sort.bed
+$ sort -k1,1 -k2,2n exons.bed | head -n 5
 ~~~
 {:class="in"}
-
-Let's try again.
-    
 ~~~
-bedtools merge -i exons.sort.bed | head -10
+chr1    11873   12227   NR_046018_exon_0_0_chr1_11874_f 0   +
+chr1    12612   12721   NR_046018_exon_1_0_chr1_12613_f 0   +
+chr1    13220   14409   NR_046018_exon_2_0_chr1_13221_f 0   +
+chr1    14361   14829   NR_024540_exon_0_0_chr1_14362_r 0   -
+chr1    14969   15038   NR_024540_exon_1_0_chr1_14970_r 0   -
 ~~~
-{:class="in"}
+{:class="out"}
 
-The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
+But if you can't remember the right arguments to use for sort, the command `bedtools sort` will also do what we want:
 
-Or alternatively:
+~~~
+$ bedtools sort -i exons.bed | head -n 5
+~~~
+{:class="in"}
+~~~
+chr1    11873   12227   NR_046018_exon_0_0_chr1_11874_f 0   +
+chr1    12612   12721   NR_046018_exon_1_0_chr1_12613_f 0   +
+chr1    13220   14409   NR_046018_exon_2_0_chr1_13221_f 0   +
+chr1    14361   14829   NR_024540_exon_0_0_chr1_14362_r 0   -
+chr1    14969   15038   NR_024540_exon_1_0_chr1_14970_r 0   -
+~~~
+{:class="out"}
 
+Now instead of saving the sorted results to a separate file, we can actually pipe the result into bedtools merge:
+    
 ~~~
-bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
+$ bedtools sort -i exons.bed | bedtools merge | head -n 5
 ~~~
 {:class="in"}
+~~~
+chr1    11873   12227
+chr1    12612   12721
+chr1    13220   14829
+chr1    14969   15038
+chr1    15795   15947
+~~~
+{:class="out"}
 
-Now the overlap with merged exons:
+The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
 
+Let's save the output of bedtools merge to a file and recalculate the overlap:
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-| head -n 10
+$ bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head -n 10
 ~~~
 {:class="in"}
 ~~~
@@ -330,11 +368,10 @@ chr1    858970  861632  CpG:_257    chr1    861120  861180  60
 ~~~
 {:class="out"}
 
-All we want is the very last column:
+This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. But all we actually need is the overlap column. We can use the unix `cut` command to extract just the 8th column.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-| cut -f 8,8 | head -n 10
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 10
 ~~~
 {:class="in"}
 ~~~
@@ -351,11 +388,10 @@ bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
 ~~~
 {:class="out"}
 
-Then we want the total bp overlap, using a slight alteration of the readings.py from the novice python lessons:
+Now we need a program to read in this column of numbers and give us the sum. The `readings.py` program we wrote earlier will do this with a few small changes.
 
 ~~~
-bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
-| cut -f 8,8 | python readings.py --sum
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python readings.py --sum
 ~~~
 {:class="in"}
 ~~~
@@ -363,11 +399,12 @@ bedtools intersect -a cpg.bed -b exons.merged.bed -wo \
 ~~~
 {:class="out"}
 
-How do we know if this overlap is significant? Shuffle one of the files.
+So we know what the overlap is between CpG islands and exons. How do we know whether this is significant? How do we know what we would expect by random chance?
+
+BedTools includes a command called shuffle which will randomize one of our input files. Let's try randomizing the CpG islands:
 
 ~~~
-bedtools shuffle -i cpg.bed -g genome.txt \
-| head -n 10 
+bedtools shuffle -i cpg.bed -g genome.txt  | head -n 10 
 ~~~
 {:class="in"}
 ~~~

From 40ac97e61138a2231b71d7c2741289c7d9d38b81 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Wed, 23 Jul 2014 09:36:48 +0100
Subject: [PATCH 08/12] Moved the scripts to their own repository

---
 .../capstones/bedtools-python/do_shuffle.sh   |  8 ----
 novice/capstones/bedtools-python/readings.py  | 37 ----------------
 novice/capstones/bedtools-python/results.py   | 42 -------------------
 3 files changed, 87 deletions(-)
 delete mode 100755 novice/capstones/bedtools-python/do_shuffle.sh
 delete mode 100644 novice/capstones/bedtools-python/readings.py
 delete mode 100644 novice/capstones/bedtools-python/results.py

diff --git a/novice/capstones/bedtools-python/do_shuffle.sh b/novice/capstones/bedtools-python/do_shuffle.sh
deleted file mode 100755
index 2c7e2edc1..000000000
--- a/novice/capstones/bedtools-python/do_shuffle.sh
+++ /dev/null
@@ -1,8 +0,0 @@
-#! /bin/bash
-
-
-for i in {1..5}
-do
-    bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
-done
-
diff --git a/novice/capstones/bedtools-python/readings.py b/novice/capstones/bedtools-python/readings.py
deleted file mode 100644
index 28c61922e..000000000
--- a/novice/capstones/bedtools-python/readings.py
+++ /dev/null
@@ -1,37 +0,0 @@
-import sys
-import numpy as np
-
-
-def main():
-    script = sys.argv[0]
-    action = sys.argv[1]
-    filenames = sys.argv[2:]
-    assert action in ['--min', '--mean', '--max', '--sum'], \
-           'Action is not one of --min, --mean, or --max: ' + action
-    if len(filenames) == 0:
-        process(sys.stdin, action)
-    else:
-        for f in filenames:
-            process(f, action)
-
-
-def process(filename, action):
-    data = np.loadtxt(filename, delimiter=',')
-
-    data_shape = data.shape
-    if len(data_shape) == 1:
-        data = data.reshape((1, len(data)))
-
-    if action == '--min':
-        values = data.min(axis=1)
-    elif action == '--mean':
-        values = data.mean(axis=1)
-    elif action == '--max':
-        values = data.max(axis=1)
-    elif action == '--sum':
-        values = data.sum(axis=1)
-
-    for m in values:
-        print m
-
-main()
diff --git a/novice/capstones/bedtools-python/results.py b/novice/capstones/bedtools-python/results.py
deleted file mode 100644
index 5b71b4f56..000000000
--- a/novice/capstones/bedtools-python/results.py
+++ /dev/null
@@ -1,42 +0,0 @@
-import sys
-import numpy as np
-from matplotlib import pyplot as plt
-
-usage_string = """
-Results.py plots the real overlap between two bed files
-versus the overlap when one of the results files is randomly
-shuffled.
-
-Usage: python random.py shuffled_results.txt real_results.txt
-"""
-
-def main():
-
-    if not len(sys.argv) == 3:
-        sys.exit(usage_string)
-
-    script = sys.argv[0]
-    shuffled_results_file = sys.argv[1]
-    real_results_file = sys.argv[2]
-
-    process(shuffled_results_file, real_results_file)
-
-
-def process(shuffled_results_file, real_results_file):
-    shuffled_data = np.loadtxt(shuffled_results_file)
-    real_data = np.loadtxt(real_results_file)
-
-    print 'Real data:'
-    print real_data
-    print
-    print 'Shuffled data:'
-    print shuffled_data
-
-    plt.hist(shuffled_data, color='black', label='Shuffled Data')
-    plt.axvline(real_data, color='red', ls='--', lw=2, label='Real Data')
-    plt.xlabel('Nucleotides overlap')
-    plt.ylabel('Number of shuffled results')
-    plt.legend()
-    plt.show()
-
-main()

From 0d9e0ca5bb615cec8ccc80a024e4a98d0e7e623c Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Wed, 23 Jul 2014 10:18:35 +0100
Subject: [PATCH 09/12] Added instructions for forking repository

---
 .../bedtools-python/bedtools-capstone.md      | 49 +++++++++++++------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 6536a1d1a..b07c0e831 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -12,14 +12,25 @@ explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) wit
 
 This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
 
+During this lesson, you'll be working on some scripts and data which are in their own github repository. In our git lesson, we covered the situation where you have been given write access to a repository. Now we're going to cover what to do if you don't have write access to a repository - this is called forking.
+
+Forking the github repository
+=============================
+
+The data and code that you need to work on are located at [](git@github.com:rbeagrie/bedtools-example.git). You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface:
+
+<img src="img/git-fork-ui.png" alt="The Fork Button" />
+
+If you visit the [bedtools-example repository](git@github.com:rbeagrie/bedtools-example.git) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example.
+
 Setup
 =====
 
-From the Terminal, create a new directory on your Desktop called "bedtools-demo".
+From the Terminal, clone your fork of the repository into a new folder on your Desktop called "bedtools-example".
 
 ~~~
 $ cd ~/Desktop
-$ mkdir bedtools-demo
+$ git clone git@github.com:YOUR_USERNAME_GOES_HERE/bedtools-example.git
 ~~~
 {:class="in"}
 
@@ -27,24 +38,27 @@ $ mkdir bedtools-demo
 Navigate into that directory.
 
 ~~~
-$ cd bedtools-demo
+$ cd bedtools-example
 ~~~
 {:class="in"}
 
-Download the sample BED files.
+Let's take a look at what files we now have.
 
 ~~~
-$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/cpg.bed
-$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/exons.bed
-$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/gwas.bed
-$ curl -O http://quinlanlab.cs.virginia.edu/cshl2013/genome.txt
+$ ls -1
 ~~~
 {:class="in"}
 
-Let's take a look at what files we now have.
+~~~
+data
+scripts
+~~~
+{:class="out"}
+
+So we have two directories, one containing data and the other containing some scripts. We're going to start with the data folder, so lets look at what's in there.
 
 ~~~
-$ ls -1
+$ ls -1 data
 ~~~
 {:class="in"}
 
@@ -58,15 +72,22 @@ gwas.bed
 
 
 What are these files?
-=========================
-Your directory should now contain 3 BED files and 1 genome file. 
+---------------------
+
+The data directory should contain 3 BED files and 1 genome file. 
 
-`cpg.bed` represents CpG islands in the human genome; `exons.bed` represents RefSeq exons from human genes; and `gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS).
+`data/cpg.bed` represents CpG islands in the human genome; `data/exons.bed` represents RefSeq exons from human genes; and `data/gwas.bed` represents human disease-associated SNPs that were identified in genome-wide association studies (GWAS). `data/genome.txt` file contains the lengths of all the human chromosomes.
 
-These 3 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
+These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
+
+Introducing BedTools
+====================
+
+We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate
 
 The bedtools help
 ==================
+
 To bring up the help, just type
 
 ~~~

From 069945c25f8ae920b7294cdf78fa2fc3c31bf900 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Wed, 23 Jul 2014 14:38:03 +0100
Subject: [PATCH 10/12] Adding some more explanation about all of the commands

---
 .../bedtools-python/bedtools-capstone.md      | 213 ++++++++++++------
 1 file changed, 142 insertions(+), 71 deletions(-)

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index b07c0e831..09d3bdf38 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -83,10 +83,28 @@ These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http
 Introducing BedTools
 ====================
 
-We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate
+We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`:
+
+~~~
+head data/cpg.bed 
+~~~
+{:class="in"}
+~~~
+chr1    28735   29810   CpG:_116
+chr1    135124  135563  CpG:_30
+chr1    327790  328229  CpG:_29
+chr1    437151  438164  CpG:_84
+chr1    449273  450544  CpG:_99
+chr1    533219  534114  CpG:_94
+chr1    544738  546649  CpG:_171
+chr1    713984  714547  CpG:_60
+chr1    762416  763445  CpG:_115
+chr1    788863  789211  CpG:_28
+~~~
+{:class="out"}
 
 The bedtools help
-==================
+-----------------
 
 To bring up the help, just type
 
@@ -110,7 +128,7 @@ This should bring up the specific help text associated with each tool.
 bedtools "intersect"
 ====================
 
-The `intersect` command is the workhorse of the `bedtools` suite. It compares two BED/VCF/GFF files (or a BAM file and one of the aforementioned files) and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common).
+The `intersect` command is the workhorse of the `bedtools` suite. It compares two bed files and identifies all the regions in the genome where the features in the two files overlap (that is, share at least one base pair in common).
 
 ![](http://bedtools.readthedocs.org/en/latest/_images/intersect-glyph.png)
 
@@ -131,9 +149,25 @@ chr1    327790  328229  CpG:_29
 ~~~
 {:class="out"}
 
+Notice how the new file we create lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get the exon name instead:
+
+~~~
+$ bedtools intersect -a data/exons.bed -b data/cpg.bed | head -5 
+~~~
+{:class="in"}
+~~~
+chr1    50489434    50489626    NM_032785_exon_13_0_chr1_50489435_r 0   -
+chr1    16767166    16767348    NM_018090_exon_0_0_chr1_16767167_f  0   +
+chr1    33546713    33546895    NM_052998_exon_0_0_chr1_33546714_f  0   +
+chr1    33546988    33547109    NM_052998_exon_1_0_chr1_33546989_f  0   +
+chr1    33547201    33547243    NM_052998_exon_2_0_chr1_33547202_f  0   +
+~~~
+{:class="out"}
+
+
 Reporting the original feature in each file.
 --------------------------------------------
-The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of not only showing you *where* the intersections occurred, it shows you *what* intersected.
+The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected.
 
 ~~~
 $ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5
@@ -148,47 +182,51 @@ chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_
 ~~~
 {:class="out"}
 
-How many base pairs of overlap were there?
-------------------------------------------
-The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files.
+Counting the number of overlapping features.
+--------------------------------------------
+We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.bed -wo  | head -10
+$ bedtools intersect -a cpg.bed -b exons.bed -c  | head
 ~~~
 {:class="in"}
 ~~~
-chr1    28735   29810   CpG:_116    chr1    29320   29370   NR_024540_exon_10_0_chr1_29321_r    0   -   50
-chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +   439
-chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +   439
-chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_033908_exon_6_0_chr1_713664_r    0   -   84
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_015368_exon_0_0_chr1_762971_f    0   +   185
-chr1    762416  763445  CpG:_115    chr1    763177  763229  NR_047525_exon_0_0_chr1_763178_f    0   +   52
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047524_exon_0_0_chr1_762971_f    0   +   185
-chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
+chr1    28735   29810   CpG:_116    1
+chr1    135124  135563  CpG:_30 1
+chr1    327790  328229  CpG:_29 3
+chr1    437151  438164  CpG:_84 0
+chr1    449273  450544  CpG:_99 0
+chr1    533219  534114  CpG:_94 0
+chr1    544738  546649  CpG:_171    0
+chr1    713984  714547  CpG:_60 1
+chr1    762416  763445  CpG:_115    10
+chr1    788863  789211  CpG:_28 9
 ~~~
 {:class="out"}
 
-Counting the number of overlapping features.
+Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`.
+
+Require a minimal fraction of overlap.
 --------------------------------------------
-We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
+Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported.
+
+Let's be more strict and require 50% of overlap.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.bed -c  | head
+$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c -f 0.50 | head
 ~~~
 {:class="in"}
 ~~~
-chr1    28735   29810   CpG:_116    1
+chr1    28735   29810   CpG:_116    0
 chr1    135124  135563  CpG:_30 1
 chr1    327790  328229  CpG:_29 3
 chr1    437151  438164  CpG:_84 0
 chr1    449273  450544  CpG:_99 0
 chr1    533219  534114  CpG:_94 0
 chr1    544738  546649  CpG:_171    0
-chr1    713984  714547  CpG:_60 1
-chr1    762416  763445  CpG:_115    10
-chr1    788863  789211  CpG:_28 9
+chr1    713984  714547  CpG:_60 0
+chr1    762416  763445  CpG:_115    0
+chr1    788863  789211  CpG:_28 7
 ~~~
 {:class="out"}
 
@@ -214,28 +252,25 @@ chr1    919726  919927  CpG:_15
 ~~~
 {:class="out"}
 
-
-Require a minimal fraction of overlap.
---------------------------------------------
-Recall that the default is to report overlaps between features in A and B so long as *at least one basepair* of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported.
-
-Let's be more strict and require 50% of overlap.
+How many base pairs of overlap were there?
+------------------------------------------
+The `-wo` (write overlap) option allows one to also report the *number* of base pairs of overlap between the features that overlap between each of the files.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.bed -wo -f 0.50 | head
+$ bedtools intersect -a cpg.bed -b exons.bed -wo  | head -10
 ~~~
 {:class="in"}
 ~~~
+chr1    28735   29810   CpG:_116    chr1    29320   29370   NR_024540_exon_10_0_chr1_29321_r    0   -   50
 chr1    135124  135563  CpG:_30 chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r    0   -   439
 chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f    0   +   439
 chr1    327790  328229  CpG:_29 chr1    327035  328581  NR_028327_exon_3_0_chr1_327036_f    0   +   439
 chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f    0   +   439
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047525_exon_4_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047524_exon_3_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047523_exon_3_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_047522_exon_5_0_chr1_788859_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788770  794826  NR_047521_exon_4_0_chr1_788771_f    0   +   348
-chr1    788863  789211  CpG:_28 chr1    788858  794826  NR_047520_exon_6_0_chr1_788859_f    0   +   348
+chr1    713984  714547  CpG:_60 chr1    713663  714068  NR_033908_exon_6_0_chr1_713664_r    0   -   84
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_015368_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    763177  763229  NR_047525_exon_0_0_chr1_763178_f    0   +   52
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047524_exon_0_0_chr1_762971_f    0   +   185
+chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_chr1_762971_f    0   +   185
 ~~~
 {:class="out"}
 
@@ -324,7 +359,7 @@ ERROR: input file: (exons.bed) is not sorted by chrom then start.
 To correct this, you need to sort your BED. We could use the UNIX `sort` utility that we already covered, which would look like this:
 
 ~~~
-$ sort -k1,1 -k2,2n exons.bed | head -n 5
+$ sort -k1,1 -k2,2n exons.bed | head
 ~~~
 {:class="in"}
 ~~~
@@ -339,7 +374,7 @@ chr1    14969   15038   NR_024540_exon_1_0_chr1_14970_r 0   -
 But if you can't remember the right arguments to use for sort, the command `bedtools sort` will also do what we want:
 
 ~~~
-$ bedtools sort -i exons.bed | head -n 5
+$ bedtools sort -i exons.bed | head
 ~~~
 {:class="in"}
 ~~~
@@ -354,7 +389,7 @@ chr1    14969   15038   NR_024540_exon_1_0_chr1_14970_r 0   -
 Now instead of saving the sorted results to a separate file, we can actually pipe the result into bedtools merge:
     
 ~~~
-$ bedtools sort -i exons.bed | bedtools merge | head -n 5
+$ bedtools sort -i exons.bed | bedtools merge | head
 ~~~
 {:class="in"}
 ~~~
@@ -366,13 +401,13 @@ chr1    15795   15947
 ~~~
 {:class="out"}
 
-The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.
+The result is a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file. Notice how the merged file has less columns than the original file - we've lost the name of the exon and the strand.
 
 Let's save the output of bedtools merge to a file and recalculate the overlap:
 
 ~~~
 $ bedtools sort -i exons.bed | bedtools merge > exons.merged.bed
-$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head -n 10
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | head
 ~~~
 {:class="in"}
 ~~~
@@ -389,10 +424,14 @@ chr1    858970  861632  CpG:_257    chr1    861120  861180  60
 ~~~
 {:class="out"}
 
-This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. But all we actually need is the overlap column. We can use the unix `cut` command to extract just the 8th column.
+This gives us the coordinates of the overlap, the name of the CpG island, the coordinates of the CpG island and the overlap in base pairs. Notice how the repetition in line 3-5 is now gone.
+
+#Challenge: stranded merge
+
+All we actually need from the output of `bedtools intersect` is the overlap column. We can use the unix `cut` command to extract just the 8th field (or column) using the -f flag.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 10
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | head
 ~~~
 {:class="in"}
 ~~~
@@ -409,10 +448,10 @@ $ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | head -n 1
 ~~~
 {:class="out"}
 
-Now we need a program to read in this column of numbers and give us the sum. The `readings.py` program we wrote earlier will do this with a few small changes.
+Now we need a program to read in this column of numbers and give us the sum. Hopefully this should ring some bells - The `readings.py` program we wrote earlier will do this with a few small changes.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python readings.py --sum
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum
 ~~~
 {:class="in"}
 ~~~
@@ -420,12 +459,20 @@ $ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8,8 | python re
 ~~~
 {:class="out"}
 
+Let's save this result to a file so that we remember it later:
+
+~~~
+$ bedtools intersect -a cpg.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum > cpg_result.txt
+~~~
+{:class="in"}
+
+
 So we know what the overlap is between CpG islands and exons. How do we know whether this is significant? How do we know what we would expect by random chance?
 
 BedTools includes a command called shuffle which will randomize one of our input files. Let's try randomizing the CpG islands:
 
 ~~~
-bedtools shuffle -i cpg.bed -g genome.txt  | head -n 10 
+bedtools shuffle -i cpg.bed -g genome.txt  | head 
 ~~~
 {:class="in"}
 ~~~
@@ -442,11 +489,10 @@ chr6    39650957    39651305    CpG:_28
 ~~~
 {:class="out"}
 
-But keep the intervals on the same chromosome...
+Notice how the CpG islands have been shuffled to random chromosomes. This means that each chromosome will have roughly the same number of CpG islands, regardless of length. A better random model would be to keep the distribution of CpG islands over chromosomes and just shuffle the positions of each CpG on each chromosome. We can do this using the -chrom flag.
 
 ~~~
-bedtools shuffle -chrom -i cpg.bed -g genome.txt \
-| head -n 10 
+$ bedtools shuffle -chrom -i cpg.bed -g genome.txt | head 
 ~~~
 {:class="in"}
 ~~~
@@ -463,13 +509,11 @@ chr1    51954093    51954441    CpG:_28
 ~~~
 {:class="out"}
 
-Save to a file, and count the overlap for the shuffled features
+Now let's save that result to a file, and count the overlap for the shuffled features.
 
 ~~~
-bedtools shuffle -chrom -i cpg.bed -g genome.txt \
-> cpg.shuffled.bed
-bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \
-| cut -f 8,8 | python readings.py --sum
+$ bedtools shuffle -chrom -i cpg.bed -g genome.txt > cpg.shuffled.bed
+$ bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum
 ~~~
 {:class="in"}
 ~~~
@@ -477,17 +521,19 @@ bedtools intersect -a cpg.shuffled.bed -b exons.merged.bed -wo \
 ~~~
 {:class="out"}
 
-Do this lots of times to get a p-value
+Great, so our shuffled file overlaps much less with exons than real CpG islands do. How do we know if the difference is significant? One approach is to make lots of different randomly shuffled files, and count the overlap for each one. If we do this 100 times, and we never see an overlap that is as big as the one we see for the original `data/cpg.bed` file, then we can say that the overlap between exons and CpG islands is significant with a P-value (roughly) < 0.01.
+
+First, prepare a directory to hold our shuffled results:
 
 ~~~
-mkdir shuffled_cpg
+$ mkdir shuffled_cpg
 ~~~
 {:class="in"}
 
-We have a script for that:
+In the `scripts` folder, you should see a file called `scripts/do_shuffle.sh`. This is a bash script that contains a loop from 1 to 100. It runs bedtools shuffle 100 times, and saves the results to a folder. Have a look at the contents of the file:
 
 ~~~
-cat do_shuffle.sh
+$ cat scripts/do_shuffle.sh
 ~~~
 {:class="in"}
 ~~~
@@ -496,26 +542,50 @@ cat do_shuffle.sh
 
 for i in {1..100}
 do
-    bedtools shuffle -chrom -i cpg.bed -g genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
+    bedtools shuffle -chrom -i data/cpg.bed -g data/genome.txt > shuffled_cpg/cpg.shuffled${i}.bed
 done
 ~~~
 {:class="out"}
 
-Run the script:
+Run the script, and then check to make sure you have some shiny new files in the `shuffled_cpg` folder:
 
 ~~~
-./do_suffle.sh
-ls -l shuffled_cpg | head -n 5
+$ scripts/do_suffle.sh
+$ ls -1 shuffled_cpg | head
 ~~~
 {:class="in"}
+~~~
+cpg.shuffled100.bed
+cpg.shuffled10.bed
+cpg.shuffled11.bed
+cpg.shuffled12.bed
+cpg.shuffled13.bed
+cpg.shuffled14.bed
+cpg.shuffled15.bed
+cpg.shuffled16.bed
+cpg.shuffled17.bed
+cpg.shuffled18.bed
+~~~
+{:class="out"}
+
+Now we need to calculate the overlap between `data/exons.bed` and each of our shuffled bed files. The command for the first file is:
 
+~~~
+$ bedtools intersect -a shuffled_cpg/cpg.shuffled1.bed -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum
+~~~
+{:class="in"}
+~~~
+630845.0
+~~~
+{:class="out"}
 
-And then:
+See if you can adapt this command and turn it into a for loop which runs over all of our shuffled results.
 
 ~~~
-for f in shuffled_cpg/*; 
-do bedtools intersect -a $f -b exons.merged.bed -wo \
-| cut -f 8,8 | python readings.py --sum; done | head -n 10
+$ for filename in shuffled_cpg/*
+> do
+>     bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum
+> done | head
 ~~~
 {:class="in"}
 ~~~
@@ -535,9 +605,10 @@ do bedtools intersect -a $f -b exons.merged.bed -wo \
 Save that to a file:
 
 ~~~
-for f in shuffled_cpg/*; 
-do bedtools intersect -a $f -b exons.merged.bed -wo \
-| cut -f 8,8 | python readings.py --sum; done > shuffled_results.txt
+$ for filename in shuffled_cpg/*; 
+do
+    bedtools intersect -a $filename -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum
+done > shuffled_results.txt
 ~~~
 {:class="in"}
 

From c1cf14192100b2d7d19541a4c4667e8c92ddcd89 Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Wed, 23 Jul 2014 16:11:16 +0100
Subject: [PATCH 11/12] Changes from PR review comments

---
 .../bedtools-python/bedtools-capstone.md      |  88 +++++++++---------
 .../bedtools-python/img/git-fork-ui.png       | Bin 0 -> 9847 bytes
 2 files changed, 45 insertions(+), 43 deletions(-)
 create mode 100644 novice/capstones/bedtools-python/img/git-fork-ui.png

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 09d3bdf38..665462981 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -7,8 +7,9 @@ title: Capstone example Python and Bedtools
 Synopsis
 ========
 
-Our goal is to work through examples that demonstrate how to 
-explore, process and manipulate genomic interval files (e.g., BED, VCF, BAM) with the `bedtools` software package.
+Our goal is to work through examples that demonstrate how to explore, process and manipulate bed files with the `bedtools` software package.
+
+Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. If you are working with published data from other groups (for example, transcription factor binding sites identified by ChIP-seq) you will likely find that the data is provided in bed format. One of the most commonly asked questions of bioinformaticians is whether two different sets of features overlap with each other, so today we'll be showing you how to answer that question.
 
 This tutorial is meant as an introduction to whet your appetite. There are many, many more tools and options than presented here. We therefore encourage you to read the bedtools [documentation](http://bedtools.readthedocs.org/en/latest/).
 
@@ -17,11 +18,11 @@ During this lesson, you'll be working on some scripts and data which are in thei
 Forking the github repository
 =============================
 
-The data and code that you need to work on are located at [](git@github.com:rbeagrie/bedtools-example.git). You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface:
+The data and code that you need to work on are located at <http://github.com/rbeagrie/bedtools-example>. You are going to want to make changes to the code, but this repository is owned by someone else. Instead of creating a new project, you want to [fork](../../gloss.html#fork) it; i.e., clone it on GitHub. You can do this using the GitHub web interface:
 
 <img src="img/git-fork-ui.png" alt="The Fork Button" />
 
-If you visit the [bedtools-example repository](git@github.com:rbeagrie/bedtools-example.git) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example.
+If you visit the [bedtools-example repository](http://github.com/rbeagrie/bedtools-example) and click the "fork" button, you will have your own copy of the repository under your own github account - so there should now be a page that looks like https://github.com/YOUR_USERNAME/bedtools-example.
 
 Setup
 =====
@@ -45,28 +46,24 @@ $ cd bedtools-example
 Let's take a look at what files we now have.
 
 ~~~
-$ ls -1
+$ ls
 ~~~
 {:class="in"}
 
 ~~~
-data
-scripts
+data scripts
 ~~~
 {:class="out"}
 
 So we have two directories, one containing data and the other containing some scripts. We're going to start with the data folder, so lets look at what's in there.
 
 ~~~
-$ ls -1 data
+$ ls data
 ~~~
 {:class="in"}
 
 ~~~
-cpg.bed
-exons.bed
-genome.txt
-gwas.bed
+cpg.bed exons.bed genome.txt gwas.bed
 ~~~
 {:class="out"}
 
@@ -83,8 +80,7 @@ These 4 files were extracted from the UCSC Genome Browser's [Table Browser](http
 Introducing BedTools
 ====================
 
-We're going to be exploring these files using a suite of tools called BedTools. As the name suggests, these tools all work on a particular type of file called a bed file. Bed files are commonly used to represent features in genomes, for example promotors, genes, centromeres etc. - essentially anything that can be identified with a particular genomic location. All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`:
-
+All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`:
 ~~~
 head data/cpg.bed 
 ~~~
@@ -277,7 +273,7 @@ chr1    762416  763445  CpG:_115    chr1    762970  763155  NR_047523_exon_0_0_c
 Do exons and CpG islands overlap significantly?
 ===============================================
 
-OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do the overlap more than we would expect by random chance?
+OK, now we've explored the capabilities of BedTools a little, let's use it to answer a biological question. Let's try to answer if CpG islands and exons overlap significantly - that is, do they overlap more than we would expect by random chance?
 
 Going back to base pairs overlap
 --------------------------------
@@ -307,7 +303,7 @@ Can you spot anything odd about this output? Lines 3-5 indicate that one of our
 bedtools "merge"
 ====================
 
-Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you.
+Many datasets of genomic features have many individual features that overlap one another (e.g. alignments from a ChiP seq experiment). It is often useful to just combine the overlapping into a single, contiguous interval. The bedtools `merge` command will do this for you.
 
 ![](http://bedtools.readthedocs.org/en/latest/_images/merge-glyph.png)
 
@@ -551,20 +547,23 @@ Run the script, and then check to make sure you have some shiny new files in the
 
 ~~~
 $ scripts/do_suffle.sh
-$ ls -1 shuffled_cpg | head
+$ ls shuffled_cpg
 ~~~
 {:class="in"}
 ~~~
-cpg.shuffled100.bed
-cpg.shuffled10.bed
-cpg.shuffled11.bed
-cpg.shuffled12.bed
-cpg.shuffled13.bed
-cpg.shuffled14.bed
-cpg.shuffled15.bed
-cpg.shuffled16.bed
-cpg.shuffled17.bed
-cpg.shuffled18.bed
+cpg.shuffled100.bed  cpg.shuffled21.bed  cpg.shuffled33.bed  cpg.shuffled45.bed  cpg.shuffled57.bed  cpg.shuffled69.bed  cpg.shuffled80.bed  cpg.shuffled92.bed
+cpg.shuffled10.bed   cpg.shuffled22.bed  cpg.shuffled34.bed  cpg.shuffled46.bed  cpg.shuffled58.bed  cpg.shuffled6.bed   cpg.shuffled81.bed  cpg.shuffled93.bed
+cpg.shuffled11.bed   cpg.shuffled23.bed  cpg.shuffled35.bed  cpg.shuffled47.bed  cpg.shuffled59.bed  cpg.shuffled70.bed  cpg.shuffled82.bed  cpg.shuffled94.bed
+cpg.shuffled12.bed   cpg.shuffled24.bed  cpg.shuffled36.bed  cpg.shuffled48.bed  cpg.shuffled5.bed   cpg.shuffled71.bed  cpg.shuffled83.bed  cpg.shuffled95.bed
+cpg.shuffled13.bed   cpg.shuffled25.bed  cpg.shuffled37.bed  cpg.shuffled49.bed  cpg.shuffled60.bed  cpg.shuffled72.bed  cpg.shuffled84.bed  cpg.shuffled96.bed
+cpg.shuffled14.bed   cpg.shuffled26.bed  cpg.shuffled38.bed  cpg.shuffled4.bed   cpg.shuffled61.bed  cpg.shuffled73.bed  cpg.shuffled85.bed  cpg.shuffled97.bed
+cpg.shuffled15.bed   cpg.shuffled27.bed  cpg.shuffled39.bed  cpg.shuffled50.bed  cpg.shuffled62.bed  cpg.shuffled74.bed  cpg.shuffled86.bed  cpg.shuffled98.bed
+cpg.shuffled16.bed   cpg.shuffled28.bed  cpg.shuffled3.bed   cpg.shuffled51.bed  cpg.shuffled63.bed  cpg.shuffled75.bed  cpg.shuffled87.bed  cpg.shuffled99.bed
+cpg.shuffled17.bed   cpg.shuffled29.bed  cpg.shuffled40.bed  cpg.shuffled52.bed  cpg.shuffled64.bed  cpg.shuffled76.bed  cpg.shuffled88.bed  cpg.shuffled9.bed
+cpg.shuffled18.bed   cpg.shuffled2.bed   cpg.shuffled41.bed  cpg.shuffled53.bed  cpg.shuffled65.bed  cpg.shuffled77.bed  cpg.shuffled89.bed
+cpg.shuffled19.bed   cpg.shuffled30.bed  cpg.shuffled42.bed  cpg.shuffled54.bed  cpg.shuffled66.bed  cpg.shuffled78.bed  cpg.shuffled8.bed
+cpg.shuffled1.bed    cpg.shuffled31.bed  cpg.shuffled43.bed  cpg.shuffled55.bed  cpg.shuffled67.bed  cpg.shuffled79.bed  cpg.shuffled90.bed
+cpg.shuffled20.bed   cpg.shuffled32.bed  cpg.shuffled44.bed  cpg.shuffled56.bed  cpg.shuffled68.bed  cpg.shuffled7.bed   cpg.shuffled91.bed
 ~~~
 {:class="out"}
 
@@ -606,9 +605,9 @@ Save that to a file:
 
 ~~~
 $ for filename in shuffled_cpg/*; 
-do
-    bedtools intersect -a $filename -b exons.merged.bed -wo | cut -f 8 | python readings.py --sum
-done > shuffled_results.txt
+> do
+>     bedtools intersect -a $filename -b data/exons.merged.bed -wo | cut -f 8 | python scripts/readings.py --sum
+> done > shuffled_results.txt
 ~~~
 {:class="in"}
 
@@ -616,10 +615,10 @@ done > shuffled_results.txt
 Plotting our results with python
 --------------------------------
 
-Let's copy the readings.py script and alter it to read in our results files:
+Also in your `scripts` directory, there should be a file called results.py. Open it and have a look.
 
 ~~~
-cat results.py
+$ cat scripts/results.py
 ~~~
 {:class="in"}
 ~~~
@@ -679,7 +678,7 @@ Shuffled data:
 What happens if we call the script with no arguments?
 
 ~~~
-python results.py
+$ python scripts/results.py
 ~~~
 {:class="in"}
 ~~~
@@ -692,7 +691,7 @@ IndexError: list index out of range
 ~~~
 {:class="err"}
 
-Not very helpful. Let's add a usage description:
+Not very helpful. Let's add a usage description. Modify the python file so that it looks like this:
 
 ~~~
 import sys
@@ -703,7 +702,7 @@ Results.py plots the real overlap between two bed files
 versus the overlap when one of the results files is randomly
 shuffled.
 
-Usage: python random.py shuffled_results.txt real_results.txt
+Usage: python results.py shuffled_results.txt real_results.txt
 """
 
 def main():
@@ -712,14 +711,17 @@ def main():
         sys.exit(usage_string)
 ~~~
 
-Now we can plot a histogram of the shuffled data:
+Let's try plotting a histogram of the shuffled data, first you'll need to add `from matplotlib import pyplot as plt` at the top of the file, then add this at the bottom of the process function.
 
 ~~~
-plt.hist(shuffled_data, color='black')
-plt.show()
+    print 'Shuffled data:'
+    print shuffled_data
+
+    plt.hist(shuffled_data, color='black')
+    plt.savefig('cpg_overlap_exons.png')
 ~~~
 
-And we can plot the real result as a vertical red line:
+Have a look at the image you just made. At the moment we're only plotting the shuffled results, so let's add the real overlap as a vertical red line. Update your results.py file so that it looks like this: 
 
 ~~~
 import sys 
@@ -731,7 +733,7 @@ Results.py plots the real overlap between two bed files
 versus the overlap when one of the results files is randomly
 shuffled.
 
-Usage: python random.py shuffled_results.txt real_results.txt
+Usage: python results.py shuffled_results.txt real_results.txt
 """
 
 def main():
@@ -758,7 +760,7 @@ def process(shuffled_results_file, real_results_file):
 
     plt.hist(shuffled_data, color='black')
     plt.axvline(real_data, color='red', ls='--', lw=2)
-    plt.show()
+    plt.savefig('cpg_overlap_exons.png')
 
 main()
 ~~~
@@ -775,7 +777,7 @@ Results.py plots the real overlap between two bed files
 versus the overlap when one of the results files is randomly
 shuffled.
 
-Usage: python random.py shuffled_results.txt real_results.txt
+Usage: python results.py shuffled_results.txt real_results.txt
 """
 
 def main():
@@ -805,7 +807,7 @@ def process(shuffled_results_file, real_results_file):
     plt.xlabel('Nucleotides overlap')
     plt.ylabel('Number of shuffled results')
     plt.legend()
-    plt.show()
+    plt.savefig('cpg_overlap_exons.png')
 
 main()
 ~~~
diff --git a/novice/capstones/bedtools-python/img/git-fork-ui.png b/novice/capstones/bedtools-python/img/git-fork-ui.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe90450479b2f3f8fbc08d9db63ecfe80d651d0c
GIT binary patch
literal 9847
zcmd6NRa9NwvgXE};O_1a+&1ok;LgT1I2(6&CrEH7xVyW1u;A_zAh<ODxwp@`J^H;z
zuZJ~iRefLC%bGRkj8OV2je<yk2mk<3WMw2&000Q>4_Oo*^5cDLEHVWEAl6xliz~^B
zi<2oi*_&J0m;nGXD3vL0TB@T2Lx-8SQb@t5dG{NTGSzwTbE0Zu81T%YB4{;~QH(7(
z5)vYySxNdpYls2ZaAPBMn`K;6bfv|?K$K)J$kdj5A-DJTBnxxj#{=Nz57+EN7=OJ<
z4EQ-Cd;t5yN*|lBqNltK5%E2AJOr!~fS9g!SsOBLW`^u4ZvPbsh=8-thycG9ocVnk
zaca@F0pu!~@+Kq=PKi8VsUR&w8gT>YU@?|J$;@bA!M<xT(?EEnflR^a5aX|zhmm{=
z5^1>kR?q-N^V;B1K>Y7`USL>O^d9-T^m5{~>^79h5arp}K`k%*Go?*>LaoX~I4T~^
z8?)l|X4;_K_dV0vL~6lqoITmS!3}!tKxd7OwVZd0qy!eRJL@V-GG;zTW<j*$h>`65
zI&GnCq={YjK?`nNTvM-5W(PTL4&wk1iuVy1oFYk2CWqf+A`#g+Exe}WQ&ZPscfLXv
zLaY>?0epJ)IWRa?$Vo@_PJPsWyt9{fU=$b_MZ7n_U=sxNCm!(pE7c{oX_(x>sG<Rr
z_LsyN&HQb19|L=CRxxj1FkgKLI);KuBrBMF3iIj9PT*~;0I&o%vXAPyXJ1<GF24Ud
z<%~HTcCiQNQ5!0LauR#<(^jKOJgvO!f!RDTK4Y4zh~v=)tlHLCl#GJrCvA)Zu)m!4
zAw@%Z2c|3`_&;m!#WHxu4W>cO{6qjSKX}QzAp^>U02WpCl6{$v-j<GGq8bb+djae-
z0RA4dH1eA|xTzlE2Y&z>B!vi*7V5G;9J@bS1*CP4loGUD4{8meS|HRG8ZTmO0A?<B
zCRElIVhhAmFX##$IY9XeLq6c%5H^sENFv~8JbNL+u_#SE*MPqsIVl}HVvunjG#&nq
zIE5uj0|G;UT%Oz!nHt<HVpafup7jji7C;b6lgl$>fd`Xh<e^2p7{+V}dj<j5hknJ*
zi$4{7vn}RMs0Xps6XSrihOp^>*(ZMW#hm~~i3$s)D1=@7vy%Q#G&Ldy1Zs&&aqNOI
zWv-tzY64|&E`(gc=@LF-CbMds#I~WXQqKjHvo_8&p2VJvZ5TdS>wy<in^K<&N@h0M
zu%Zcupm&3fjE&hw)6~-_<oSwmzyx4?FfK==N8F&Xl@?S5)FLTil)_-uklo-#KevfM
z4b(FF87FF(>d=oZJqL1L92P>&;MFMW0nosDU(Alsj;y;v2iFEJY1rW|($)Th=L^gW
z^9!DDD4F;SB`>-dJdqfQk-#=vUY0`=n_L?C7zRo>%K*GFgCk{a;Ic%Aytlmf7m9=i
z+1NtrLaaGh6_{h*V+O0V+XyCUZW^zI-BF5b?rU@>xJQUb<`A*jk`~n+x|Q#?-}%R=
z$5h9p>C3*iVyQ+_HtLtjl}VnY{9?2UY{+g1yLCGDK6W}zn&T;=HG^k=V%KM9`r*Q^
z&Te4}YgTMlTBlJ*UWZ?&Yj$c@JsLZ{y?49sy~pbm;uN;WG8&_#tVUdho2;-d0hI5Q
zCr%z7BOKw_SKa@z7dLu8sz^UVzsw*(zsI0PpQ(wLpJ(~RNyD7gl;s-e8h59y44M^B
zC~uLhmHm^l14pe+?M7XtQmaxw2P&E#b2cO0=iJX4s~l5Qn_(DY_@bs!dZyW0!CuNx
zeNr)|!K?wyds0Ht;Lt!))va`usnoR90g10Gf^xewUlX4SVHNs8l<EcYyMaT&H0tx}
z9Fq0&H~9xhZIJ6h*<Y)3-R4Ugel@%_O6b5fJ~fy(pfpf6A~j?*<{Y89;PK1yoAHDA
zRol|rdfTj>37tt#&~NcP1v`PBKu^+Jl-q*8Jg3bkafEdEY$(?#1jr<WD#+`|94OYv
zmv}}ocv4nU&(TjYMkWF#*^bG76fl<uHmC+ob4LB?hkng|x7RV@x0K?(VH^K!%|^&w
z@&oP%Zd!ACcil{5c>S`~tjV9rxheMb>zaK^M!s@DIb^wOInUzIqRlD8sQ@<?p+Ux9
zZd>kZTLqierf1vBRps-*bM94_6`w}88r<JJD~h~F{8YSZ0?6!p_<7Nl(dDst(u%`>
zV-Jah`BntX1hO2S##bizCR;Z=yc2pUF7o@Mr1Tm<S}bs&P`%?Sgeu{7)K;t3+O=<M
z*{!E7;5F;@VZmrY6(9jn-$Ta}5vb_pd*yVkbGon{bA5Nia(8i)a1(J`dh2mgex7l*
zx+So0aoGI3x}S8H^r(B@FTQljl;bo_6#^B)8mbLyG+;X*Ibb?4Iw)B5yw|}&<f>|G
zIO;FX4Q_L&KbrYSRXlATY~GJNi&VqZV3tyyYrzQy8wD*zUM_D*cb|u6`04O-^f(F>
zO34I}@|^M?Wfx`A{LX?e1!2Za#*0I1;p!0z5w=nNgOY;_Lkg^awcEUoEjvKo$}gQ}
zu;WgtAl<4uL7&G*>^Vj)#;-Jn82Okdm|ZnuT3i*!66VG(J6NkxonfzuyHoVL@(!Su
zqP1G8)WW&KKc=aBpixXg9(_K0_CdxX_I;N!zmtGppRCBOr02}%3+6L8uJ94?Egqa+
zIDPNsVv<`K;h<>(DSDNPRSzyE=#6q~xRJ(mQWjE}<M2}S6Y`GWW*uU!zp76St1Fj(
zJs3N<27BE^-!$#T>^~p8*AaB4ceYu_ET+d;hc0<m09#OXQd_F5JExxd{uU6va1*V|
ztfe*mX@(pRoC|cz$meOqEyJ~=BYNz;3=5?wr7)V^GC!Pz^Obl{yEfUx$zb+s_GUc5
zGGMJ`qGx@qOYZ#pq^KdR{=(VLY^A&D`h5QEX6Sb1wzhiQbYod+W9551QndwQHoL9c
zsdH=BJnPc1)Acp{<QQi;tt6==xvIb3DRAdox~INTd0p|XN~3&3<<*GPVzct~;yDBL
zCn+7@)#Ks!oJ-PMfn}8f165ymFS^5-n||dmFJGe<70q%LU6tp*@b4otvo!_lE~qXp
zf4}`*CES8l?w2s}HhKSiX=1Bjt^k)|HAy|(>p1%dx0~k0^HO3fqADpmNulKSt9njx
zHj7<=&b3aljz^`qZLKYo1>}@-qd;?xkhSmK2wx>w{SbS~zFVY2$=&GA_-*hQRdm?Y
z!s$?EDY8mB)tF`dSbuKajqmDufw4=I<*)m#UdM^k-!-lhYl|w5s^*QF?%6r(C9fx*
z_RX*NEBB3S&Mhm8m&=<%HI`d_%+0!4&7JlJt^FO~-HV-esC$abb<${3qPFQ);Wx6^
z-qSF?PZlJ$?$I84FA@i`kE)F&T-}7P8-k|7&c4fDE%(|F_0p7Q38{XjuYb%MuK4yn
z9@7rA)?5BQv?6Rqyy1Q^i0;Db$o#?dq<S}fq8VEK^>zPE^HOc%*OB%1=?ahTCPO#y
zdg>0e<a~5He~*7@A%x|p_U3+zxvM&Tcr<OoQQo!HrRL4`^!ttZYIS!ddPN)J%SieS
zBwb-hQX;?|3=NQD+QEK8N221o1Rr9}bnr-Poy`ua>%B!Cc3!Q80|8uwfN-gAB5;5O
znd*=K1;Ag)4M4F5VPisBDx;=XmuQ2g8H^SFD^e@zdS#_KoM!+Z;>yLr)wS5g_0F1k
z$AtuP=8x}&LDV2Y_)?rFfUOM$C}1`a@J{y*$xZf$BPxM7;A1(PX{Dy+tfe5&Z(?uD
zVq|J>Y{uel>+rF<1poxy`9DNkGiM_*cUv1fFuyyH;$H~<5BVQ8D+SrVAkNl63M~aC
zGI4t+Gcqog&n)Z|LWpE!WP(nn=KLxWQvXf=_yST`Iy*b?v$DFmxv{u$u-H3Uuzu#_
z<6~uGXJuz+{y;E;J?xx~+?nmbl>g=AfBTUz1DiNmIXGL{+mZd_*T~r3#TiIJ@lT+C
zKmV1dnY-10BH4leyR44_S^u%Her91~{kQKAs^C9bekCh+GaD@lD_b)=@J9?G4o)t?
zf8qZx%YP#N52p5inB4yd@qZZpn<&WoPyYX-p8smrzqB9S5<(PY{debu5KB&BkpTew
zWmySPHFt<}5Q4Y7=4y_>GZiL+_8#>>cI2E&V=`T$3KnaTeVMwB_3X}|X3>%6m-C{#
z#NP{ZzZn+EI2V;TDQYDo;1kIb<;`0;?!7KDe<X3Z9mLAYY~Ck<eSK-CJ*Mij-uVt&
z53@52*ocN@E%V@&pvQoab53yp7=H>8^eyz<R{tufNPwijwEqEQ8!8<n76b`II?@q;
z?g!Kz+L9&r^#3EE$OSnvPSpOzdOHjZ%&mQXtAu=|M*Cj7g<O$JrN-&y<uqF~;b<y?
zH{$n<h#`<!B*q;D8Q!ORZYszWGe*s-3MD!bm_-&9iQwlVCWps~V%j@k0y_9su#LsG
zxVShpEUf*^X&PFlzP=t}&fg!vlgNtlfJLY1DyLmQ)ifvQBRw)b{Rt5<aeSW+CRb;I
z?y43KjhdOa+pup1@}e)eJ{k`$>^^_T&1BdIcMBg6559!?-uT1EMc8q1vmV4gEZjYa
zB%jK>5c(2+3qd;o80t#K88F0k3<bSqio|s?cpWa{t)f3cBh%JE=@8k6_CRr-<U8n%
zum^ggK99dHDuz&Z*wZJ5)+e}HkNc%y{|GUL0zZ9r4Nd)-GW!6sP|=R>b5tP}{d8?B
z?Jg^h4Cmx((=B)(2Pr9KHdLZrG9FiCq9~G(=AsY!II1uKfP;zFB3t&ID<;cWY=RP_
z#~2clN7f~Ef*msFRu_kp1c_##!{1-%aELU5$JCdLQYU!y>|Ge3;#_)>_-Gt}5!F=j
zSjAlwu9!*}KP`(Uq7VQ<m^7aLJjtV-7?0w}>|@sV{ym_9Qps%v>t<hN871kK(R>~g
zpNB<XMsu{6wIiXbT$OSrX-gJgF&`Uh40F9M#qpDx(bV68sOHYHU)DO^1d&u8moN4y
z@AOD7dm|TYr7jRq1~&ezm`YQa%7J=Y712ulT(WpbPM4-@of*mJ2G#08XrR7?sER+_
zbq?1@+s;L9N#O_B*Z`p%p_=^hhyi&C`1BF;Qfoz<M0P6BT(a^B*&x=ap$IHj!qvQW
zyAa9uj?-l~(M4|h2upVNI?XMlRjd)iE|ItH`iOk44T0YUOG`&~#pBB^yZLDxxbwUB
zpnx>|F9+KX?lJ5w3Q?Zy0*Nx^F97wA?6z+SNlGeo*!)z9K7Ib~=bX`KE5Fq-nsllA
z$uN3vs_v2oEZ>;slku|G)DFvixU<HWk@Hhr=>4073(C#IQ=W$~u`!p6)K@$;^+`eb
zSWv@38pG{*ZNGqZymJ15mXCgbe%_*I`QT8#aKksFwXDbyf;=5Lj9q{DHKroI>0fe3
zlJBcKcteG&uBJza=jDkcHN;_CtsVWRDXgzn(@@!`g!n+ihn{=J@<$?&($xTy^7FSZ
zH-q1ApPOgh{ijG&6^Hg2DNHcUsaQ|=nBU~<Uv5-2GNp`d?AF{221g;E2QMnh8|!Ma
zZdG+PhBwHB@lUT?+(Kice`@!xgmBRAkA$YZHhUoxu8eVbSRnZk4$oi0qoL&nA?7QG
z_w<59);b&Nb(f|cDB_>x_B`T{GdnPLk7Nkwn*EZYkyPxfGiVjFcwfeUoGSK?OsclW
z(ohizcQD@6qNmdjH;Hn@RQ@FQUJ=`FSu30bwK?NAKMBepMgqC-e%A&K@5b&CGRq(h
zDB;p%J{Hoq6YC?E8$y1T^?<OYZ{sgBiTbfxjrc(Q?YgTHreTYAPBNIoAl22-EFaa>
zSG$FN$@B9ullW2kosY1?*w}hJyhupG-k|MU-hFq9a6sx-exYwxTi5DwQu(8x^9UDv
zi1k3Qj*{LOlBvVZXx9Av@NHal^qKJ4^8lwW%J;SwWMmjjG?lE+#2a<4=Wm#wekFT1
z2<bzz3UNVgjR6n}2%s;h2ZsmGsmEu9jV}ewYw<p&{WWOFv847AObp}8gN$6<bjY|g
ztK>%YDR!#EwITXR0fk9AC!<I#Mi}NhCgu?O(bs_(pDsPb^vv7jSyd@#OO5@ZL8_pR
zor1nH>#59xYgTD#)HeFk=8LP}4y5$f?CH!+f7nibgYp9tKbcY&nRaf@Fg|)S;2BXV
zR;^Fj<JibOaX_VmC=EoTl+k`jox%~33Rm*ZME}l6VA~)-5j@hWA9&-FFdPw5-*L^v
z${yTW&cuZKMEqcOPTG2=+xl|A)E?N^v=}0w$s`&(!6)Dc{OzUmti<EXEJ;sjav9VU
zmZae~3w)6~%bzZ`;kK~n_2g>v0<ro@dFCKo*O4Htk?j_b$5`v)rZziWEfcRincJpk
z18aRA9L)62EGt(Hx;&c{G%<Ra^TVXibn8qJXsxg3=cZ6zy|SPE`5te6vZm7f{C2n5
z?b4FMK^RkEm_)y@F{0J&h#7?M`-L!O2$uW>0b2@XcuovBKC<AzG)QI$X^MrM<7mhk
zF6#<Gik&L&W$|bF%(}$n{`GD5>9wXC@qz6zXcdWxdISdTvu87{y#hHFCkEJeu5fld
z@0j$n)A(AP50m+#%^c|6)_6kFFjl{%+=5aHWtKx<MO?p1xpX11yR3)P?m1<z=MG$=
zGI<HN_SIq*o)y%(bgxh?tSZmu5o7&~W15}C;o_+$$8y`Xm^WQx6jXLPHIezq!HIc4
z@jUyo5+A3Fy&;zd^tp@puE%@YMF@1Cu~-jOTLduQyA4gm8FXW2`zE}WyiWHtFWj+u
zxYLaAh+_fDW_xPW{CgBs28%7*n%z$VIol~MAg1}~VB_5oe0KsWaqTuHJF>Dp4j2Cl
z#L3oM)82zPR+@(xxx}p(;1Q5*W{+QK79Iv;`a*vj*V(pn+lSP`f?wT2QR4*KAKa_@
zoGweOgsF5VqWAjtA{a~j1X{%IRy;PRcy0RMgYv|MeYnzA|Mo{ebj-M-OxEMtJQ*KX
z3B|l5xM8tfQ@}DMHXPa09uSn@rZqaNgi(T@Vpm<s3;dDp>G#E?eS7PWa+&l|2$xcb
zx-T&|7$y|Ina7xGP^eycRB33jl`j#VS*UO2j#NUcVb~V<eY#vouD2JSnyACM*tuk+
znfUF`LZ)*^#N+tOt1Eg6BA!~{^y;^v0`>NF6RY-}=ADV;{?x9peR;!mC1{Z60-nw*
z?&`2onESm&2*S~<C|KGu$tINy*Pj3)p3F-QwL0Bx!8iftZjJX)Tz$L$*|S<sXHYkb
z|4T)y;ahjN-_4fZn{zH}i?iu5TzA+-`X;f#od14VbrYhw?_`@hGe^7m31yT1Dczmv
zS|edIi8+HIpN7qyD?Avy9A=`*l=54FGN|<w!BMwce%|hhoAZXC`*!68uC%!>l7vRW
z??sr6C;4<~3Dnu<ca!-=k<sS03?IIS-sH56SYZ?1&q3(%DqXgVoKVUI8AeFk(AoBP
z136!skmC+?K02hGLl1+93|CB3HZUWjq08hwgsR)Qfv|);8M~n5mBt(=_O)4+rB7J9
zM|krYZKJ^k90vcRSj?;x3uAF+>iuN7;FYkto8Zr-+3TUYsS8h6yCd;RM{>Fen;z-9
z8&SY}in7&4`_k=Ho9CV0^&fDUhiGHq^yLT>lXU}OKpgK+DcDg0rO4Rp>?ZKRa$o1+
zry|%cAoaxXpP0n&cm6}MY(|<p$gtU**aXNpsOvzk@DrYelA?)fRGY6%aqw%<*strV
ztG;1hph7u*nEL+x-CX5%-V>cre<!+RU8yT9Bg6Z5HPiP~t1SlwF+jPG_4t~k)0D5>
zBKDP>J4)60!OTw}BqRjd|4aaCkg<hfQV5jBo9i&^|J6DP<RZ*d&DrWNf5S#%neO6e
zyZi&T7x~#WchS(;#VV+M7&&Wpe<rTp7F~w~8|TNZ6Zm33tyi_lajxc938<g(%HARu
zl|9;EJzQxaCA!=~QM1FW7Ws5>-1Naje5etO@N&7`5c7>`S8YAa_cz;c`x33PN5m?|
zRCO{NdydZKKn0`a{cL86&*9`LZg$N#WU4>{Vp^M*i=3y^+8jkK)6r?>I@Z_OFZa?s
zmCk3w<hx7n9;YOxXhOKTfSC%Bn}<KU=;-Jo@i1vx{(DKh@h|N8%E*PPL@&Y{N8r7s
zt$@z%KxXEHbx-}%^3Qma8X}SBBjt*ET#kKdR(`-S!4{MCUg8c2^}k=oas7a1^JTZ(
zP*pM~b{oK=NTGgwzcjugm#4N*23LdKI_|w0IV&laBWw<tRP|{G>`AKI#xe!-H`U5<
z)l~awZ5kTi-+B*W>Gsr6tpEtJ%eEY0%nCE5-+Y_ewV4kytca*yoF>++XqA&xUF{u@
z3f9MWa`_xCVtdim6|@mLJ)=kE9`>uxTp>lKbXe=mkuQ+?P6&WJRfMDyJy=%|hP!DZ
zBHP7Z#H@rfTF!h(%n_UdKIY?SQ`IsA-^xS^;IPB~we5I03RU#jxYOh6pO;k#mLsb4
z1gE6K5o@?_Hq$8Xo`4{u7o9G`E|-;%ka|_Vh)qVL#>tqIQ_ag4-w5b*TmAB}&yPc4
zOZWD1n@tOM(fby^jK}7&1Nh5X8RiI+3EtKxepZ>zo~m>4S8Z;<cp^Uu8ul@2B(bpq
zuR1;OlYxvkpxAl}8Mql}{8Sj=z0wN{j6NAATm-!6d>;XJvabDd`&m?1KPKtQMvctn
z<db~1Kods+Px`qYg?m(!`Cepa)-4R*@lrK)mT)b-`O<6jYm~2-o3H&T5F*o#^pw$B
zjpeFY2g=6l-cz*~U%~uMJ1nt%ocDZ+*!MtBu!eFS?X&knAjIGIHQsMCfd+AU_(ZR5
z=b(PkUqPV&d{IpN8Q}@pR?i<kPq#Z=D5n??&{zxY#7s!I?MVhC9T>EJ){mi|u_rwN
z@d?jmh<4OMI|iUt9#ntH!H4d})&PDQdJeAsre)9WO?mpyKiJkZhKt2wjs8|3qcNZ;
zX%L!eC2z6O_+zJsn^Xo#F=YdcNI*h1(?x<(&vMO+sIuI-lftLpesug=H()tPI^9by
zfZ9gsH6nE~Qds@NL>F01j;+4Xe^QVlSmUEr!^vYc9eDnyIDJ<4Q<mV{sWWNUG=aZM
zRiWGiOD`7ol_tU)1HzgoVcl0>pz}z}k5wewIJsCZUO^5IvvQWQ1U|Ll;bABU82Iw<
zr-;xsRq7qcjA!uZbY$#+sFi_lgm0JkP+39#)B?Vy(z588QBg>X3l@W&>rD#di{kUz
zn>9<icO;xf?$RWnQu(1=S{S|dGV1(?n>XI1fK5t6;>v4d9u36j9kZ-R`-R_Im(veE
z7D-dYf{v&ve(Pn}S%rdT{^C@T@K^{AY^0XpyI9uHXgrRKQy1xndG3-kz7}gb1k|~{
zHnLsSZpZcb+;{>mu{^cE>JP!3cbaSmNVs-_u%DaA_ATomD5^Cl?<D8cOzW1J0-Fkb
z)vC=G_P5+l6iXZ_$T|%TM6M8b1}zn|t6~jy__O~CUbX^4P%VF!b=7-ZoFCN+Pvbx8
zU;IiAGQYfjw%RD${E92rVC#uO%<*%5vfsKptpzgw>1b8qa#0xpHR&pCZqc(dLf*C4
zrfRt*p?Z*_avd#2rb|Ml9mHr5MI~*npqmnwI&}f+^5wo#m&ACeF)?~9U+2g;tghxP
z1J`pq0RJp4I?hv*?&rMhtI_E4)nv?NU!2dZ`qxTNiN{%{QQbB`Pt}?X&tyWx&`uX?
z_mMhD!4P~a$`AHL-JugR<O>54b*!YNyN`;Za7|bp-#Z8NKD3RAN~jh&N{3wgAJ6$u
z&4SvvxP&{C9KOGo-Fs@q58nQi>FbR&Xmp<*KCcLK4SsJkez9#LgP$Vjh9kYJfsy~E
z2Hp5@aD%#=YWLLfPP27O#8K^Nh(MwYnE{VtNFk;KCh|k<8JwQ3c3kzPl$LzfMvfm6
z!8;(hPYf<an^ME!vuLVDeE)gDbzk{4i^F-jGVXT?={huP(@8q;1a6E4^BZp($?K4_
z@wdiT%c=cI1LCt9);iSp!hz1<P-`0c+`wt0Xie=k;uK`(a1s>2f@wHT{pZ5{_@Iw@
zypk&J2Y-uXxe?Dk3Tzw!m+!BhwmQaF-G(>kok(x(ByDTd=<qW0;4fOkki|^$GGQZR
zWE++Qu+O1?D({MgyZ!PhPWqraRXSZkV#!adj`Or%r;#m5CfGt7)Jx`Felu)utKdKH
z=}e3zmBDu6#suC&j`mg(lA)_b;a%VGB&gi_4lc7yC+qlHcv@EuW8{VhhJ_;hipYp1
z$8CluWRcl~^})czv_P@k%de(l+WW+*BnYX_sG5_gmx;u1O{)9Y=a)6<u0~WvX;%B-
z*XgC*TpWPx#^btF1trb8_vJ<C2Np+-x>>1WmeAz_ERwXFol$u=DB6!~D92AU=&~`E
z<`JhxIR;^EPn{^T?X{%-D@jz;di~`HX$h_Ga^K(@?}%_gZ~MoYEPc&NiFs8+XY*5Y
zR=mT-TAp}VR}LC6-M(V9)?gV(FAoM&trDXRR>HVOC&Q=dp?sqn!m_){z)xtt!gl87
z5Icw~lPh@{qV{P&Zw5G-5bR30RLe4hbYLGMz@+DI(yT47g$+`VzK=H>p3c$6FJ@ku
zaxwcdvJ1|OKP=12%+%}jbom9vaCdfeH38ivU!_(mFIy6Ow%&fZ)eAlXBYAIFlb~ZS
zRDy?_Q1)PfZ-J*a(&t6VU#M~1pLu#vQh$x>QkqfG1jT9^nG(bZvhu7_+?T5LexV2<
ziKzWKGcelOVG8>&#F&(ZQlWQFJu5sld3c%srmU?C4QGN-0E;k`zl7a8Dp&RE=f~Xu
z>zU(6D453yw^->jdh(v_sVsq_=VzFauR1z9l{$@R=XG~z-Uu07f7cavv@yMLU(9dA
zT%b6u4)5eMIA5!!4``qcrrrzbtul!!)QXi#zlWB@W%r4WQ~g0fR&r#6Vd|Gj{uZ+W
zoq9>s{~587<Bh-c_m?THNJJNZW(1mJq<MUN{CCmZL{}f6aRkp(XTL11z`5SK=I$Mn
ze_skc*S7~Vk0{v^jvjb)yBU}ak;+-T8`8al47^rFnVPsR8GBfo+GP=~(mR#0A{w8(
zfPFOrdl<nPJ{<YpekB{1WG*>(erO4+4fG0;Wp?=J6q`OCUE=XyUInT_2cWL>DSJ)p
z?X5Fg?jO-G01HR-BQe!#@fE1iG?x$wOf)7_b^R_jq!9k=LJI~af5koI_ddW1#6>=S
zOf{<?EaS{^WoQdUAF3&iukC6S82P7kPji&O%dNza=r$k02rj$7l_&s5HLuXOgp@YM
zl0>K1kgX?2^5@*o`Lb7K-1|y>ueVx=MQU_FDH6DVky9=?9Z>_};lkQyt>38`FA{M!
z5ym6w%L{^yY({W-ce2Fi7HEbR4LlQ_dcNYR&fQN9T#uSGijbkU69hJK@><iuJ|n3i
z(RE@XfcGc~m=NxXkwI9ap1?gnk3}7Tyzy~D@fpVZY);VxjGf7Wyi_Lw)Rz7)Uxq)X
z8a;_CHtj#CNkzp#Q!IOnf02%<G=`9BX}=9%N=MV!ES5#kbawST*}~Xp(*`2M5!K}u
zz+_7XVv;pQz(>K`Xdxq0yufS~6k8to*~8&i_JC(jFXPU$*xeqTTrBtcW^I8$49sUz
zd|FVC4!9-3)%kG~`MXJ>cR6%xkM;@*+?7+S#J~*tZW4<O{!S8}ALpQ(Z{f?$40X)A
zhj%9%=yVpLf)93jTN)FAjGl6c7K+krjKF*D5~*;qqUkU#gfu!;j+etbD0I~t19}df
zPfJBAnzi`(9qA{=9wZ=y=!V0J=@M3yDF|LIYBY&`YrW`#qu^Q(u!<WM7TpW3FG?Ed
zb^jT=4@4kADA1A5{!R|ZvgL)RzY>w>E^(F-<=nJH%H+R42#cDi6-rc*CP+_N9M$=*
ziB8=T<18h?X5(bof&^bHMj?43#vMZQOGyNGK_vdVB?4g%GT#xGX6;56Nec$P7V>&V
z@>?=-3^6%m2o(;1K8PyI`ZGv`u+9Bw9QLIy2wq#tWVk>aYFHt_jp_^;brgLb#yfXB
zQuRlLis(;2@%ec$gsG^z=J!#X951wRc%&n-miQkSZ8gXg&<dYiWZ?(S;=OZ~@p8i<
zg9zYM+wf$bm$A^;zU@ICso*}Sj$EbAu9pX;e(Z(c2VtKw3yVaMZy{`Aejhk<Pe<zL
z?937L*f~9onulnjTQ>ZOFpdk~@ktr;V{`1(7!^%s)(n?E)M^gboI;?5<`I%WKn%J}
zHlv4L`33UXF8#<pzuzxmOA5A4-}q|%6Vbg&E*q8|{3m6M`mJL#20^If_hbrKimUNk
z;b!+AJtZQo+LA5h^2&Ys(7`3GBDqo-SP+ptUqX90Hq;F2X6Lq*py?ou;b-*7LXr1X
zp~r9`IdQ*NKNc4h!H1KJfCKQ@zbm5C8GHbgx(+g8Qz*zn0h2JsBT7P&x!xZD(Z_Uq
zew|2vyboSDx%eYiz;(z6VBfmy*pmMLz2hrC0ujX<B+#zWKPPUol3yjN#0&%f7q$13
AI{*Lx

literal 0
HcmV?d00001


From c34e721e225c5dc869d7051418dd32b919ba406a Mon Sep 17 00:00:00 2001
From: Rob Beagrie <rob@beagrie.com>
Date: Wed, 23 Jul 2014 17:15:59 +0100
Subject: [PATCH 12/12] Last work for me from the #mozsprint - more fixes from
 PR comments

---
 .../bedtools-python/bedtools-capstone.md      | 68 +++++++++++++++----
 1 file changed, 53 insertions(+), 15 deletions(-)

diff --git a/novice/capstones/bedtools-python/bedtools-capstone.md b/novice/capstones/bedtools-python/bedtools-capstone.md
index 665462981..3d53a645a 100644
--- a/novice/capstones/bedtools-python/bedtools-capstone.md
+++ b/novice/capstones/bedtools-python/bedtools-capstone.md
@@ -81,6 +81,7 @@ Introducing BedTools
 ====================
 
 All bed files contain three columns which list the chromosome, the starting coordinate and the end co-ordinate. They may also include some other information, such as a name for each region, but the first three columns are always the same. For example, have a look at `data/cpg.bed`:
+
 ~~~
 head data/cpg.bed 
 ~~~
@@ -99,18 +100,21 @@ chr1    788863  789211  CpG:_28
 ~~~
 {:class="out"}
 
+The fourth column here is the name of each CpG island - which in this case is just a numerical ID assigned by the folks at UCSC.
+
 The bedtools help
 -----------------
 
-To bring up the help, just type
+You should have installed bedtools as part of the setup for this bootcamp. In which case, you can bring up the help by typing:
 
 ~~~
 $ bedtools
 ~~~
 {:class="in"}
 
-As you can see, there are multiple "subcommands" and for bedtools to
-work you must tell it which subcommand you want to use. Examples:
+If that didn't work, you probably don't have bedtools installed properly. Have a look at <http://bedtools.readthedocs.org/en/latest/content/installation.html> for some help.
+
+If it did work, you should see that there are multiple "subcommands". For bedtools to work you must tell it which subcommand you want to use. Examples:
 
 ~~~
 $ bedtools intersect
@@ -130,10 +134,10 @@ The `intersect` command is the workhorse of the `bedtools` suite. It compares tw
 
 Default behavior
 ----------------
-By default, `intersect` reports the intervals that represent overlaps between your two files.  To demonstrate, let's identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons.
+By default, `intersect` reports the intervals that represent overlaps between your two files. This is represented by the top line of the image above (the one which reads "A intersect B"). As an example, you may know that CpG islands often mark transcription start sites. Imagine that we want to find all the exons in the genome which might contain a promotor. One way of doing this would be to identify all of the CpG islands that overlap exons. For example, let's list the first five overlaps between CpG islands and exons.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.bed | head -5
+$ bedtools intersect -a data/cpg.bed -b data/exons.bed | head
 ~~~
 {:class="in"}
 ~~~
@@ -142,28 +146,38 @@ chr1    135124  135563  CpG:_30
 chr1    327790  328229  CpG:_29
 chr1    327790  328229  CpG:_29
 chr1    327790  328229  CpG:_29
+chr1    713984  714068  CpG:_60
+chr1    762970  763155  CpG:_115
+chr1    763177  763229  CpG:_115
+chr1    762970  763155  CpG:_115
+chr1    762970  763155  CpG:_115
 ~~~
 {:class="out"}
 
-Notice how the new file we create lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get the exon name instead:
+Notice how the output lists the name of the CpG island which is overlapping, but not the name of the exon. If we switch the order of files, we get all the extra columns from the exons.bed file, including the name (fourth column). The fifth column is a score (not used in this context) and the sixth is the strand.
 
 ~~~
 $ bedtools intersect -a data/exons.bed -b data/cpg.bed | head -5 
 ~~~
 {:class="in"}
 ~~~
-chr1    50489434    50489626    NM_032785_exon_13_0_chr1_50489435_r 0   -
-chr1    16767166    16767348    NM_018090_exon_0_0_chr1_16767167_f  0   +
-chr1    33546713    33546895    NM_052998_exon_0_0_chr1_33546714_f  0   +
-chr1    33546988    33547109    NM_052998_exon_1_0_chr1_33546989_f  0   +
-chr1    33547201    33547243    NM_052998_exon_2_0_chr1_33547202_f  0   +
+chr1  50489434  50489626  NM_032785_exon_13_0_chr1_50489435_r    0  -
+chr1  16767166  16767348  NM_018090_exon_0_0_chr1_16767167_f     0  +
+chr1  33546713  33546895  NM_052998_exon_0_0_chr1_33546714_f     0  +
+chr1  33546988  33547109  NM_052998_exon_1_0_chr1_33546989_f     0  +
+chr1  33547201  33547243  NM_052998_exon_2_0_chr1_33547202_f     0  +
+chr1  16767166  16767270  NM_001145278_exon_0_0_chr1_16767167_f  0  +
+chr1  16767166  16767348  NM_001145277_exon_0_0_chr1_16767167_f  0  +
+chr1  8384389   8384719   NM_001080397_exon_0_0_chr1_8384390_f   0  +
+chr1  8385895   8386102   NM_001080397_exon_2_0_chr1_8385878_f   0  +
+chr1  8390268   8390800   NM_001080397_exon_3_0_chr1_8390269_f   0  +
 ~~~
 {:class="out"}
 
 
 Reporting the original feature in each file.
 --------------------------------------------
-The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected.
+The `-wa` (write A) and `-wb` (write B) options allow one to see the original records from the A and B files that overlapped.  As such, instead of only showing you *where* the intersections occurred, it shows you *what* intersected. This is represented by the second line of the image above (the one which reads "A intersect B (-wa)"). A
 
 ~~~
 $ bedtools intersect -a cpg.bed -b exons.bed -wa -wb | head -5
@@ -178,12 +192,14 @@ chr1    327790  328229  CpG:_29 chr1    324438  328581  NR_028325_exon_2_0_chr1_
 ~~~
 {:class="out"}
 
+The first three columns of the output show us where there are overlaps between exons and cpg islands. The fourth column contains the name of the CpG island which is overlapping. The rest of the output is the full line from the exons.bed file.
+
 Counting the number of overlapping features.
 --------------------------------------------
 We can also count, for each feature in the "A" file, the number of overlapping features in the "B" file. This is handled with the `-c` option.
 
 ~~~
-$ bedtools intersect -a cpg.bed -b exons.bed -c  | head
+$ bedtools intersect -a data/cpg.bed -b data/exons.bed -c  | head
 ~~~
 {:class="in"}
 ~~~
@@ -200,7 +216,27 @@ chr1    788863  789211  CpG:_28 9
 ~~~
 {:class="out"}
 
-Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`.
+Again, we see that the order of the bed files is important. We get an output line for every feature in `data/cpg.bed`, even if there is no overlap with `data/exons.bed`. If we reverse the order of the two files again, we get a different output:
+
+~~~
+$ bedtools intersect -a data/exons.bed -b data/cpg.bed -c  | head
+~~~
+{:class="in"}
+~~~
+chr1    66999824    67000051    NM_032291_exon_0_0_chr1_66999825_f  0   +   0
+chr1    67091529    67091593    NM_032291_exon_1_0_chr1_67091530_f  0   +   0
+chr1    67098752    67098777    NM_032291_exon_2_0_chr1_67098753_f  0   +   0
+chr1    67101626    67101698    NM_032291_exon_3_0_chr1_67101627_f  0   +   0
+chr1    67105459    67105516    NM_032291_exon_4_0_chr1_67105460_f  0   +   0
+chr1    67108492    67108547    NM_032291_exon_5_0_chr1_67108493_f  0   +   0
+chr1    67109226    67109402    NM_032291_exon_6_0_chr1_67109227_f  0   +   0
+chr1    67126195    67126207    NM_032291_exon_7_0_chr1_67126196_f  0   +   0
+chr1    67133212    67133224    NM_032291_exon_8_0_chr1_67133213_f  0   +   0
+chr1    67136677    67136702    NM_032291_exon_9_0_chr1_67136678_f  0   +   0
+~~~
+{:class="out"}
+
+Now we get one line for every exon. In this case, the first ten exons actually don't overlap with any CpG islands.
 
 Require a minimal fraction of overlap.
 --------------------------------------------
@@ -226,9 +262,11 @@ chr1    788863  789211  CpG:_28 7
 ~~~
 {:class="out"}
 
+If you compare this to the result you see without the -f flag you will see that the very first CpG island is now reported as having 0 overlaps. This means it must have overlapped an exon by less than 50%.
+
 Find features that DO NOT overlap
 --------------------------------------------
-Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case.
+Often we want to identify those features in our A file that **do not** overlap features in the B file. The `-v` option is your friend in this case. This is represented in the last line of the image above.
 
 ~~~
 $ bedtools intersect -a cpg.bed -b exons.bed -v  | head