Essential Python For Genome Science
bulk RNA-seq


Last updated 5 years ago


This book aims to explain coding skills related to data cleaning and visualization, and I chose the datasets to represent typical data structures. You don't need to worry about whether the biological question or the species fits your project, because the programming skills apply broadly.

The first dataset I will use in this book is a bulk RNA-seq dataset downloaded from the ENCODE project, containing 16 bulk polyA-plus RNA-seq experiments. The data come from a larger project that profiled many other tissues in the same way, but for simplicity, I will only use the forebrain tissue as an example.

I provide all my code for preprocessing this dataset, from downloading the FASTQ files to the salmon quant results. All steps were done with python and shell commands in Jupyter notebooks. I want to emphasize two good practices for doing analysis:

  1. Document everything within the Jupyter notebook to allow good reproducibility.

  2. Create a metadata table at each step to give each set of files meaningful annotations, instead of relying on a super long file name that contains everything.

Packages used in the preprocessing steps:

  • Jupyter notebook: record all your python code, markdown notes, and shell commands in the same notebook, together with all the results and visualizations.

  • Python packages I used:

    • subprocess: run any shell command from python

    • pathlib: handle file paths

    • pandas: the "excel" in python

  • Other RNA-seq software:

    • Trim Galore: FASTQ QC and trimming, built on two other packages, FastQC and Cutadapt

    • salmon: mapping and quantifying RNA-seq reads
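To give a feeling for the three python packages listed above, here is a minimal sketch that touches each of them once. The file name, column names, and values are hypothetical and are not taken from the notebooks. The child process is the python interpreter itself, so the example runs without any RNA-seq software installed.

```python
import subprocess
import sys
from pathlib import Path

import pandas as pd

# pathlib: build and inspect paths without manual string handling.
# The file name below is hypothetical.
fastq_path = Path("fastq") / "forebrain_E10.5_rep1.fastq.gz"
print(fastq_path.name)    # forebrain_E10.5_rep1.fastq.gz
print(fastq_path.parent)  # fastq

# subprocess: run a command and capture what it prints.
# In real use the list would hold a shell tool and its arguments;
# here the python interpreter stands in for it.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes to str
    check=True,           # raise an error if the command fails
)
print(result.stdout.strip())

# pandas: keep sample annotations in a table (hypothetical columns).
metadata = pd.DataFrame(
    {
        "sample": ["E10.5_rep1", "E10.5_rep2"],
        "time_point": ["E10.5", "E10.5"],
        "replicate": [1, 2],
    }
)
print(metadata)
```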

Experiment Design

The samples are forebrain tissues from eight time points during mouse embryo development. Each time point has two biological replicates.

Development Time Point | Replicate 1 | Replicate 2
---------------------- | ----------- | -----------
E10.5                  | -           | -
E11.5                  | -           | -
E12.5                  | -           | -
E13.5                  | -           | -
E14.5                  | -           | -
E15.5                  | -           | -
E16.5                  | -           | -
P0                     | -           | -

I chose this dataset because:

  • It has good data quality, and all time points have two replicates.

  • We can do a simple differential analysis between two time points, to mimic a case-control study.

  • We can also do time-series analysis, taking advantage of the ordered developmental time points.

The preprocessing steps

Some quick notes before you read the Jupyter notebooks:

  1. Jupyter notebook supports markdown, so I can easily write my explanations in a rich format.

  2. When a line in a cell starts with "!", it is interpreted as a shell command and executed via your system's default shell, which provides a convenient way to put python and shell script together.

  3. Other cells are all python code and output.

The python code in the Jupyter notebooks is not fully explained; it is provided as a record of how I did the work. I will recap much of it and explain it in the following chapters.

Step 1. Download data, rename the files and make a metadata table

In this notebook, I did the following steps:

  1. Download all the FASTQ files from ENCODE.

  2. Prepare a metadata table that contains all necessary sample information for each downloaded file.

  3. Rename the files with meaningful names (the original names are IDs meant for computers, not for humans).
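The metadata-then-rename idea in the steps above can be sketched as follows. This is not the code from the notebook: the file IDs, column names, and naming pattern are made up for illustration, and the "downloads" are empty files created in a temporary directory so the sketch is safe to run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)

    # Stand-ins for downloaded files named by opaque IDs (hypothetical IDs).
    for file_id in ["FILE0001", "FILE0002"]:
        (data_dir / f"{file_id}.fastq.gz").touch()

    # One row per file, holding the sample information (hypothetical columns).
    metadata = pd.DataFrame(
        {
            "file_id": ["FILE0001", "FILE0002"],
            "time_point": ["E10.5", "E10.5"],
            "replicate": [1, 2],
        }
    )

    # Derive a meaningful file name from the annotation columns.
    metadata["new_name"] = (
        "forebrain_"
        + metadata["time_point"]
        + "_rep"
        + metadata["replicate"].astype(str)
        + ".fastq.gz"
    )

    # Rename each file according to its row in the table.
    for row in metadata.itertuples():
        old_path = data_dir / f"{row.file_id}.fastq.gz"
        old_path.rename(data_dir / row.new_name)

    renamed = sorted(path.name for path in data_dir.iterdir())

    # Save the table next to the files so later steps can read it back.
    metadata.to_csv(data_dir / "fastq_metadata.csv", index=False)

print(renamed)
```

Keeping the original ID as a column means the renamed files can always be traced back to their source records.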

Step 2. FASTQ QC

In this notebook, I did the following steps:

  1. The FASTQ file information comes from the metadata table from step 1. I also made another metadata table for the trimmed FASTQ files.

Step 3. Salmon Mapping

In this notebook, I did the following steps:

  1. The trimmed FASTQ file information comes from the metadata table from step 2. I also made another metadata table for the salmon output.

The Raw Processed Data

After mapping, I provide the raw processed data for you to start the analysis, including:

  1. 16 raw salmon quant outputs, one for each RNA-seq sample

  2. A metadata table that contains experiment design information

  3. The original GTF file (GENCODE mouse vM24) used to build the salmon index. I will work with it more in the following pandas chapters.
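As a small preview of what a raw salmon quant output looks like in pandas, here is a sketch that reads a tab-separated table with the column names salmon writes to its quant.sf file (Name, Length, EffectiveLength, TPM, NumReads). The two rows are toy values typed in for illustration, not real results from this dataset, and they are read from an in-memory string rather than from the repository.

```python
import io

import pandas as pd

# Toy content in the layout of a salmon quant.sf file.
# Transcript names and numbers are made up for illustration.
toy_quant_sf = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx_A\t1000\t850.0\t10.5\t120.0\n"
    "tx_B\t2000\t1850.0\t0.0\t0.0\n"
)

# For a real file, pass its path instead of the StringIO object.
quant = pd.read_csv(io.StringIO(toy_quant_sf), sep="\t", index_col="Name")

print(quant)
print(quant["TPM"])  # one expression value per transcript
```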

Get this dataset

# if you haven't cloned the github repo yet
git clone https://github.com/lhqing/py_genome_sci_book.git
# if you already have, run git pull inside the repo to update it
cd py_genome_sci_book
git pull

# the data and notebooks are located here
cd data/DevFB

# the gene annotation GTF is located here (relative to data/DevFB)
cd ../../ref/GENCODEvM24

FASTQ files and the salmon index are not included in the GitHub repository, because they are too large and not needed for further analysis.

In the time point names, E10.5 means embryonic day 10.5 and P0 means postnatal day 0 (the day of birth).

I provide the whole preprocessing workflow so you can get a sense of how to combine python and shell script for preprocessing, and how to document everything in a Jupyter notebook. The actual computation was done on a server, and you don't need to repeat it. I've provided the raw processed data (the raw salmon results) for all further steps, which are doable on most laptops.

In Step 2, I did FASTQ trimming and QC using trim_galore on each FASTQ file from step 1. I also wrote a parallel computation example using the python built-in concurrent.futures package to speed up the computation.
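A parallel run over many FASTQ files with concurrent.futures might look like the sketch below. It is not the notebook's code. The helper names and file names are hypothetical, and the trim_galore flags shown in build_trim_command are an assumption that should be checked against `trim_galore --help`. So that the sketch runs without Trim Galore installed, the commands actually executed are harmless stand-ins that call the python interpreter.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_trim_command(fastq_path, out_dir):
    """Return the command list for one FASTQ file (flags are an assumption)."""
    return ["trim_galore", "--fastqc", "--output_dir", str(out_dir), str(fastq_path)]


def run_command(command):
    """Run one command, raise on failure, and return its stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical file names.
fastq_files = ["sample_a.fastq.gz", "sample_b.fastq.gz", "sample_c.fastq.gz"]

# Stand-in commands for the demo. With Trim Galore installed, use
# build_trim_command(name, "trimmed") for each file instead.
demo_commands = {
    name: [sys.executable, "-c", f"print('done {name}')"] for name in fastq_files
}

results = {}
# Threads are enough here: each task only waits for an external process.
with ThreadPoolExecutor(max_workers=2) as executor:
    future_to_name = {
        executor.submit(run_command, command): name
        for name, command in demo_commands.items()
    }
    # Collect results in whatever order the jobs finish.
    for future in as_completed(future_to_name):
        results[future_to_name[future]] = future.result()

print(results)
```

Setting max_workers caps how many jobs run at once, which matters on a shared server where each job may itself use several cores.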

In Step 3, I ran salmon quant on each sample. There are 16 samples but 25 trimmed FASTQ files. All samples were single-end sequenced, so they were all mapped in single-end mode. Some samples have 2 FASTQ files, which need to be provided to salmon quant together. In the code, I explained how this could be easily done with python (and the pandas.DataFrame).

All files are included in the GitHub repository of this book.
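One way to collect several FASTQ files per sample is a pandas groupby on the metadata table, sketched below. The table contents, the index path, and the output paths are hypothetical. The salmon options used (-i for the index, -l A for automatic library type detection, -r for unmated reads, -o for the output directory) reflect my understanding of the salmon command line and should be checked against `salmon quant --help`. The sketch only builds the command strings and does not run salmon.

```python
import shlex

import pandas as pd

# Hypothetical metadata for trimmed FASTQ files: one row per file.
trimmed = pd.DataFrame(
    {
        "sample": ["E10.5_rep1", "E10.5_rep1", "E10.5_rep2"],
        "fastq_path": [
            "trimmed/a_1.fq.gz",
            "trimmed/a_2.fq.gz",
            "trimmed/b_1.fq.gz",
        ],
    }
)

# Collect all FASTQ files that belong to the same sample into one list.
fastq_per_sample = trimmed.groupby("sample")["fastq_path"].apply(list)

commands = {}
for sample, fastq_list in fastq_per_sample.items():
    command = [
        "salmon", "quant",
        "-i", "salmon_index",     # hypothetical index location
        "-l", "A",                # let salmon infer the library type
        "-r", *fastq_list,        # all single-end files of this sample together
        "-o", f"quant/{sample}",  # hypothetical output directory
    ]
    commands[sample] = shlex.join(command)

for sample, command in commands.items():
    print(sample, "->", command)
```

Because the grouping comes from the table and not from parsing file names, a sample with one file and a sample with two files are handled by the same loop.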
