bulk RNA-seq
This book aims to teach coding skills related to data cleaning and visualization, and I chose the datasets to represent typical data structures. You don't need to worry about whether the biological question or the species fits your project, because the programming skills are highly transferable.
The first dataset I use in this book is a bulk RNA-seq dataset containing 16 polyA-plus RNA-seq experiments. All of the data come from a larger project that profiled many other tissues in the same way, but for simplicity, I will only use the forebrain tissue as an example.
I provide all my code for preprocessing this dataset, from downloading the FASTQ files to the salmon quant results. All steps were done with Python and shell commands inside Jupyter Notebooks. I want to emphasize two good practices for analysis:
Everything is documented within the Jupyter notebook, which allows good reproducibility.
I create a metadata table at each step to give each set of files meaningful annotations, instead of relying on a super long file name that encodes everything (see the sketch after this list).
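For example, a minimal sketch of such a metadata table built with pandas (the column names and file names here are hypothetical, not the exact ones used in this book):

```python
import pandas as pd

# Hypothetical per-step metadata table: one row per file, with meaningful
# annotations instead of encoding everything in a long file name.
metadata = pd.DataFrame(
    {
        "sample_id": ["forebrain_E10.5_rep1", "forebrain_E10.5_rep2"],
        "tissue": ["forebrain", "forebrain"],
        "time_point": ["E10.5", "E10.5"],
        "replicate": [1, 2],
        "fastq_path": [
            "fastq/forebrain_E10.5_rep1.fastq.gz",
            "fastq/forebrain_E10.5_rep2.fastq.gz",
        ],
    }
)

# Save the table alongside the data so every later step can look files up
# by annotation rather than by parsing file names.
metadata.to_csv("fastq_metadata.csv", index=False)
```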
Packages used in the preprocessing steps:
Jupyter notebook: records all your Python code, markdown notes, and shell commands in the same notebook, together with all the results and visualizations.
Python packages I used:
subprocess: run any shell command from Python (see the example after this list)
pathlib: handle path-related operations
pandas: the "Excel" of Python.
Other RNA-seq software:
Trim Galore: FASTQ QC and trimming, built on top of two other packages, fastqc and cutadapt
salmon: mapping and quantifying RNA reads
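To illustrate how subprocess and pathlib fit together in a notebook, here is a hedged sketch of running a shell command from Python (the file path is made up):

```python
import subprocess
from pathlib import Path

# Hypothetical FASTQ file; substitute your own path.
fastq = Path("fastq/forebrain_E10.5_rep1.fastq.gz")

# Run a shell pipeline from Python; check=True raises an error if it fails.
result = subprocess.run(
    f"gzip -cd {fastq} | head -n 4",  # print the first read record
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```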
The samples are forebrain tissues from eight time points during mouse embryo development. Each time point has two biological replicates.
| Development Time Point | Replicate 1 | Replicate 2 |
| --- | --- | --- |
| E10.5 | - | - |
| E11.5 | - | - |
| E12.5 | - | - |
| E13.5 | - | - |
| E14.5 | - | - |
| E15.5 | - | - |
| E16.5 | - | - |
| P0 | - | - |
I chose this dataset because:
It has good data quality, and all time points have two replicates.
We can do a simple differential analysis between two time points, mimicking a case-control study.
We can also do time-series analysis, taking advantage of the ordered developmental time points.
In this notebook, I did the following steps:
Download all the FASTQ files from ENCODE.
Prepare a metadata table that contains all necessary sample information for each downloaded file.
Rename the files with meaningful names (the original names are opaque IDs meant for computers, not humans; see the sketch below).
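A hedged sketch of the renaming step, assuming a metadata table like the one above with a file_accession column (both column names are hypothetical):

```python
import pandas as pd
from pathlib import Path

# Hypothetical metadata table mapping ENCODE file accessions to meaningful names.
metadata = pd.read_csv("fastq_metadata.csv")

for _, row in metadata.iterrows():
    # e.g. "ENCFFxxxxxx.fastq.gz" -> "forebrain_E10.5_rep1.fastq.gz"
    old_path = Path("fastq") / f"{row['file_accession']}.fastq.gz"
    new_path = Path("fastq") / f"{row['sample_id']}.fastq.gz"
    if old_path.exists():
        old_path.rename(new_path)
```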
In this notebook, I trimmed the FASTQ files. The FASTQ file information comes from the metadata table made in step 1, and I made another metadata table for the trimmed FASTQ files.
In this notebook, I ran salmon quant on each sample. The trimmed FASTQ file information comes from the metadata table made in step 2, and I made another metadata table for the salmon output.
After mapping, I provide you the raw processed data to start the analysis, including:
16 salmon quant raw outputs, one for each RNA-seq sample
A metadata table that contains experiment design information
The original GTF file (GENCODE mouse vM24) used to build the salmon index. I will play with it more in the following pandas chapters (see the loading sketch after this list).
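As a small preview of those chapters, here is one way to load a GTF file with pandas (a sketch, assuming the annotation file has been downloaded and decompressed; the file name is illustrative):

```python
import pandas as pd

# GTF is a tab-separated format with 9 fixed columns; lines starting with '#'
# are comments.
gtf_columns = [
    "chrom", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]
gtf = pd.read_csv(
    "gencode.vM24.annotation.gtf",  # illustrative file name
    sep="\t",
    comment="#",
    header=None,
    names=gtf_columns,
)
print(gtf["feature"].value_counts())  # genes, transcripts, exons, ...
```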
E10.5 means embryonic day 10.5; P0 means postnatal day 0. See a reference on mouse embryonic development for a bit more background.
I provide the whole preprocessing workflow so you can get a sense of how preprocessing is done with Python and shell scripts, and how a Jupyter notebook documents everything. The actual computation was done on a server and doesn't need to be repeated. I've provided the processed raw data (the raw salmon results) for all further steps, which are doable on most laptops.
I did FASTQ trimming and QC with Trim Galore on each FASTQ file from step 1. I also wrote a parallel computation example using the Python built-in concurrent.futures package to speed up the computation.
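A minimal sketch of that pattern, assuming trim_galore is on your PATH (the input and output directories are illustrative): run one Trim Galore job per FASTQ file in a process pool.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def trim_one(fastq_path):
    """Run Trim Galore (adapter trimming + FastQC) on a single FASTQ file."""
    cmd = f"trim_galore --fastqc --output_dir fastq_trimmed {fastq_path}"
    subprocess.run(cmd, shell=True, check=True)
    return fastq_path


if __name__ == "__main__":
    fastq_files = sorted(Path("fastq").glob("*.fastq.gz"))  # illustrative dir

    # Each trimming job is an independent process, so they parallelize cleanly.
    with ProcessPoolExecutor(max_workers=4) as executor:
        for finished in executor.map(trim_one, fastq_files):
            print(f"Done: {finished}")
```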
I ran salmon quant on each sample. There are 16 samples but 25 trimmed FASTQ files: all samples were sequenced single-end, so they were all mapped in single-end mode, but some samples have two FASTQ files, which need to be provided to salmon quant together. In the notebook, I explain how this can be done easily with Python (and pandas.DataFrame).
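A hedged sketch of that grouping trick (the metadata columns and paths are hypothetical; salmon's single-end mode takes multiple FASTQ files after -r):

```python
import subprocess

import pandas as pd

# Hypothetical metadata for the trimmed FASTQ files: one row per file,
# where several rows can share the same sample_id.
metadata = pd.read_csv("trimmed_fastq_metadata.csv")

for sample_id, sample_rows in metadata.groupby("sample_id"):
    # All FASTQ files belonging to one sample are passed to salmon together.
    fastq_list = " ".join(sample_rows["fastq_path"])
    cmd = (
        "salmon quant "
        "-i salmon_index "   # pre-built salmon index directory
        "-l A "              # let salmon infer the library type
        f"-r {fastq_list} "  # single-end reads; -r accepts multiple files
        f"-o salmon_quant/{sample_id}"
    )
    subprocess.run(cmd, shell=True, check=True)
```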
All files are included with this book.