📗
Essential Python For Genome Science
  • Before Start
  • Chapter Contents
  • Prerequisites
    • About the UNIX system
    • About python
  • UNDERSTAND RAW DATA
    • Stages of Genome Data Generation
    • From Bulk To Single Cell
    • Introduction To the Datasets
      • bulk RNA-seq
      • single-cell data
  • Work Environment
    • Chapter Ensemble
    • All About Installations
    • Keep Running
    • Coding Environment
    • Git and Github
    • Other Tips
  • Python and UNIX System
    • Run Python
    • File I/O
    • Run Shell Command In Python - I
    • 🎉Case Study: Mapping bulk RNA-seq reads with salmon
  • Data Cleaning
    • 🎉Key Concept of Pandas
    • 🎉Case Study: Aggregate Salmon Quant
    • Case Study: Exploring The Dataset 🚩
    • The "copy" and "inplace" Parameter 🚩
    • Case Study: Extract and Reformat GTF file 🚩
    • the correct vs. the wrong way of using pandas 🚩
    • Case Study: Bulk Sample PCA 🚩
  • PYTHON BASICS
    • Python can be lightning-fast ⚡️ 🚩
    • Run Shell Command In Python - II 🚩
    • Pointers In Python 🚩
    • Everything is an object 🚩
    • Thread and Process 🚩
    • Resource For Intermediate Python Knowledge 🚩
    • Python magic method 🚩
  • Genome Science Data
    • NGS Data Formats and Tools 🚩
      • SAM/BAM 🚩
      • BED 🚩
      • GTF 🚩
      • Bigwig / Bigbed 🚩
      • VCF / BCF 🚩
    • The Python Packages 🚩
  • Data visualization
    • Matplotlib Basics 🚩
    • Seaborn Basics 🚩
    • Interactive Data Visualization 🚩
  • Use R in Python
    • Why? 🚩
    • rpy2 🚩
  • Gotchas
    • Check whether package X is installed
    • BAM to FASTQ
    • Genomic Websites
Powered by GitBook
On this page
  • Stage I - Understand the Technologies
  • Stage II - Library Preparation
  • The additional sequence added onto the original DNA fragment
  • The fragment length
  • Stage III - Sequencing
  • Illumina Sequencing by Synthesis
  • Pair-end sequencing
  • Stage IV - Process Raw Reads
  • The BCL file
  • The FASTQ file
  • Trim and filtering
  • Stage V - Give the reads a location (Alignment or Mapping)
  • The SAM/BAM format
  • SAM/BAM QC
  • The Reference
  • Stage VI - Get the processed "raw" data from located reads
  • How to choose among tools?
  • Do not reinvent the wheels

Was this helpful?

  1. UNDERSTAND RAW DATA

Stages of Genome Data Generation

PreviousAbout pythonNextFrom Bulk To Single Cell

Last updated 5 years ago

Was this helpful?

Regardless of what technology or what programming langue you care about, let's discuss some principles of genome data generation first. I will not talk about the routine processing in the later chapters. Everything here is just making sure you understand how the "raw input" files is generated before python based analysis.

Stage I - Understand the Technologies

Almost every month, the CNS, Nature Methods, and Nature Biotechnology will publish some papers that report "new sequencing technologies" for specific purposes. Although we could easily name hundreds of sequencing technologies today, they are not that different.

Here I try to organize them in two ways. Once having the organization in mind, you won't be afraid of the tons of "new terms" you met in genomic papers. They eventually based on things mentioned below:

Organization via Biological Question

This nice plot comes from , a project about building encyclopedia of DNA elements. Based on this plot, here are some technologies worth remembering:

  • Genome: Just sequencing the DNA sequence itself to identify mutations compared to the reference genome. This includes whole genome sequencing and exon-seq. The later use a fluid chip to capture exon region DNAs to only sequence this 2% of the genome only, representing a category of money-saving technologies called targeted sequencing.

Organization via Coverage Or Mutation

We can regroup the leading technologies into two categories:

  1. These technologies care about coverage at particular genome locations (RNA-seq, ATAC-seq, ChIP-seq, HiC). They are different in that their reads come from different types of features in the genome -- due to various treatments during library preparation. Analysis related to these technologies can also be aggregated into these categories:

    • Determine whether a particular region has more reads than the background noise -- active/expressed or not

    • In a specific region, determine whether reads number in one sample is different from the other sample (such as differentially expressed gene analysis)

    • De-novel determination of new genome features. For example, we can define or refine gene annotation using the RNA-seq.

  2. These technologies care about the actual sequence and mutation of the sequence (genome-seq, exon-seq, WGBS-seq). Coverage information is also vital because high coverage means high confidence. However, coverage is not the final output of these technologies. The final output is the fraction of reads with mutation compare to the total covered reads in the same genome locus. Analysis related to these technologies can be summarized as:

    • Confidently determine real mutations over technical errors in sequencing. A.K.A. Call Variants.

    • For specific mutation, compare whether it more frequently occurs in one sample versus the other samples.

Stage II - Library Preparation

The library preparation is the most diverse stage among different technologies. Most new technology papers are indeed talking about new ways of preparing library, which allow them to

  • Sequencing a different subset of the genome, epigenome, or transcriptome, to study specified biological questions. For example, total RNA-seq sequencing sequence all the RNA molecules, poly-A captured RNA-seq sequencing mRNAs.

  • Increasing throughput, efficiency, decreasing cost, etc.

The additional sequence added onto the original DNA fragment

One of the primary purposes of library preparation is adding additional sequences to your input DNA fragments:

  • Adapter: during library preparation, there must be a step adding adapter sequences to the ends of DNA fragments. Only DNA fragments with adapters can be captured by complementary adapters on the Illumina flowcell, thus being sequenced. The adapters can be added via ligation or PCR.

  • Barcode: The barcode is the predefined sample-specific sequence incorporated into somewhere (usually in conjunction with the adapter) in the DNA fragments during library preparation. The barcode is used to distinguish different samples, reads with the same barcode will be assign to the same sample. It's used in almost all sequencing experiments because usually, single sample does not fulfill the capacity of the Illumina machine. We need to multiplex samples' DNA fragments together to sequence them. The barcodes can be added via ligation or PCR.

These artificial sequences must be removed before you start any biological analysis, see stage IV.

  • Only the gray part in the middle is actual DNA fragment (~120 bp) contains biological information

  • The colored regions (~30bp in total) are used for amplifying genome DNA, barcodes, and adapter sequences.

  • The library preparation is all about adding those parts onto DNA fragments via PCR and ligation.

  • We then use the barcode information to demultiplex samples and remove all of them before further analysis.

The fragment length

Stage III - Sequencing

Illumina Sequencing by Synthesis

Pair-end sequencing

Pair-end sequencing is most common nowadays. The above video explained how Illumina machine uses end-specific adapter and bridge PCR to sequence both ends of the DNA fragments, which gives you two fastq files called "R1" and "R2".

Fragments vs Reads

We call the single read (either R1 or R2) a read, and we call paired R1 and R2 a fragment because they should represent the original input DNA fragment. Whether R1 and R2 are overlapped depends on your read length and DNA fragment length.

In addition to illumina sequencing, the third generation sequencing will become popular in coming years. I do want to start learning those, with a real project, some times later, but I will not talk about them in this book.

Stage IV - Process Raw Reads

The BCL file

  • Generate the reads with their base quality score from the BCL file, output FASTQ format.

  • Demultiplex samples from the same sequencing run using the barcode information.

Usually, this step is done by the service provider; you don't need to do that. And BCL file is larger than the FASTQ file, so not used for data transfer.

The FASTQ file

We call the bcl2fastq output FASTQ files unprocessed raw files, which is the one needed by GEO for publication related data sharing. In the raw FASTQ files:

  • All reads should have the same length

  • R1 and R2 files should have the same number of reads

  • Paired reads occur with same order in R1 and R2 files

Trim and filtering

The usual FASTQ file QC includes:

  • removing low-quality bases, if the quality score is low, that base call might introduce artificial mutation, impact mapping rate and further analysis

  • removing additional barcode and adapters, they might not be recognized fully by bcl2fastq if there are technical errors.

  • Filtering reads that are too short after the above two steps.

All the adapter and barcode sequences must be removed from your reads before mapping. They are artificial.

For most cases, R1 & R2 files need to be processed together. For pair-end mapping, if R1 is filtered out, so does R2. After this, the R1 R2 still have the same number of reads and the same read order.

Stage V - Give the reads a location (Alignment or Mapping)

# how to give your reads a location

if has_reference_genome:
    if rna_seq:
        if has_reference_gene_annotation:
            if only_care_about_transcript_abundance:
                run_kallisto() # or run_salmon()
            else:
                # you care about the actual genome location
                # you want the BAM file for genome browser and other purpose
                # you want to define posible noval splicing variants
                run_star()
                # and you need to run other feature counts softwares bellow
        else:
            # no reference gene annotation, not covered in this book
            run_de_noval_transcriptome_assembly()
    else:
        # all other technologies, DNA-seq, epigenomic-seq
        # run_bwa(), run_bowtie(), run_bowtie2(), run_hisat2() ...
        run_one_of_the_aligner()
else:
    # no reference genome, not covered in this book
    run_de_noval_genome_assembly()

The SAM/BAM format

Except for Kallisto and Salmon that do not provide you genome location and mismatch information, all other aligners will produce SAM/BAM files. Key points are:

  • The FASTQ file contains the named reads with their sequence and sequence base quality.

  • The SAM file contains everything from the FASTQ file plus genome location information and mismatch information.

  • The BAM file is a binarized and compressed version of SAM.

  • The BAM file is better for data storage and further analysis.

SAM/BAM QC

SAM/BAM file also needs QC, mainly:

  • Remove PCR duplicates.

  • Remove non-uniquely mapped reads.

The Reference

It is VERY IMPORTANT to keep track of your reference genome and your gene annotation version. I created a standalone reference directory in my home directory ~/ref/ with its subdirectories organized by species and data sources. And I recorded every change I made in that directory. All informations will go into the methods section of my paper.

Never make custom changes to any reference files; consistency is the key to the mental health of bioinformaticians.

Stage VI - Get the processed "raw" data from located reads

Processed raw data refer to the direct input of custom analysis using python. It is technology-specific:

  • For RNA-seq, the processed raw data is "read counts in each gene/transcript of each sample", without any normalization or filtering

  • For ATAC-seq and ChIP-seq, the processed raw data can be just the BAM file, because there is usually no "reference annotation" for regulatory elements, we just de novo define the regulatory elements (a.k.a. peaks) via the reads coverage.

  • For HiC, the processed raw data are BEDPE. It records two genomic regions in a bed-like format, representing two anchors of the DNA-DNA interaction.

  • For mutation-based technologies (genome-seq, exon-seq, and WGBS-seq), there is a standard format called VCF/BCF, but WGBS-seq usually uses other lightweight custom table formats. Nevertheless, the information in processed raw data is "number of reads having the mutation (numerator) and total number of reads (denominator) in the same genome loci".

How to choose among tools?

The quick answer is don't spent too much time choosing similar tools. As long as you are not writing a technical benchmarking paper, they are not that different. Find a highly cited paper that used the same technology as you do and use whatever they use, with same parameters.

Do not reinvent the wheels

If you are new to genome science, remember that the analysis infrastructure for most common sequencing technologies is rather mature. If you feel like any step during preprocessing needs you to write a bunch of custom code to accomplish, that usually means you haven't found the right tool. Try search Pubmed, google, and youtube first.

Transcriptome: RNA-seq (and its numerous variations focusing on , here is a )

:

Protein - DNA interaction: ChIP-seq (using antibody targeting transcription factor, to study where that protein bind to DNA, here is a )

Histone modification: Also ChIP-seq (using antibody targeting specific histone modifications, here is a ).

DNA methylation: WGBS-seq (using a chemical called bisulfite to create artificial DNA mutations that representing unmethylated cytosine in DNA. It studies the DNA methylation status through detecting these mutations. My lab published the first , and studying DNA methylation is also my main project - but now at the single-cell level.)

Chromatin accessibility: ATAC-seq (using a Transposes called Tn5 that can insert a DNA fragment with specific adapter sequences into open regions of the DNA to break it, then sequence those fragments with the adapter to get the open information. Open means active. Here is a ).

Multiplexing different samples or cells together via adding barcode sequences to the ends of DNA fragments. This is a critical step in single-cell sequencing, but also common in bulk-seq. Fancy barcoding techniques can multiplex .

To give you a feeling of adapter and barcodes, below is the read structure from . No need to understand the details, but do notice that:

The Illumina sequencer requires input DNA fragments having similar lengths, which is usually selected via magnetic beads (see this ). Each library has a fragment length distribution. They must be measured and reported together with the sequencing data.

visualizes the Illumina sequencing, explains why the adapter is needed and what is pair-end sequencing.

Check out the two fancy technologies and of 3rd generation sequencing.

The BCL file is indeed the raw file produced by the sequencer. A sequencer is just a camera machine taking thousands of photos of the flowcell. The BCL file contains raw base calls from the images. The FASTQ file is generated from the BCL file with a software called provided by the Illumina. It does two things:

The raw fastq file usually needs to be cleaned. An excellent tool to profile and monitor the cleaning process is the , and the standard tool for actual cleaning is the . For mature sequencing technologies, using is a good option to save your time. Usually, the trimming and filtering steps only remove small portion (likely < 10%), if you lost lots of reads after this step, something must went wrong in the sequencing.

After getting the clean FASTQ files, the next stage is to give your reads location information, a.k.a mapping/alignment. These two words were interchangeable, until a special type of mapper for RNA-seq been developed a few years ago ( or ). So now, things differ as follow:

Both doable via Samtools, as .

Kallisto does generate genome BAM file for genome browser .

Unlike FASTQ and SAM/BAM format, the processed data are usually in custom CSV/TSV format (except the VCF format) or just a data matrix in an . This is where python comes in to help. You need to start writing your own code finally!

different parts of the transcriptome
video
Epigenome
video
video
human bulk WGBS-seq paper in 2009
video
hundreds of thousands of cells in one experiment
technology used in my project
video
This GREAT video
NANOPORE
PacBio
Bcl2fastq
FastQC
cutadapt
trim_galore
Kallisto
Salmon
explained later
if needed
HDF5 file
the ENCODE project
Schematic of Reads Profiles in Each Main Technology