🎉Case Study: Mapping bulk RNA-seq reads with salmon

In this book, we use a bulk RNA-seq dataset from the developing mouse forebrain as the running example. The GitHub repo already provides the quantified salmon tables. On this page, we use a small subset of the data to reproduce those salmon quant tables with the exact same process.

The input is 25 FASTQ files (truncated for demo purposes) downloaded from ENCODE that belong to 16 samples. The output is 16 salmon quant tables, each containing the transcript-level quantification for one sample.

How to follow this case study?

Download or update the GitHub repository. The files for this case study are located in py_genome_sci_book/analysis/salmon_demo.
The analysis contains four main steps, each associated with a Jupyter notebook in that directory.
On GitHub, you can see the executed version of these notebooks. You should be able to execute each of them in your local environment and compare your results with the online version.

I prepared 25 small FASTQ files for this demo, each containing only 10,000 reads. The reduced file size lets you go through all of these steps in about 30 min.

Main Steps and Notebooks

Step 1. Create Salmon Index

See notebook here.

In this step, we will create a salmon index for the mouse GENCODE transcriptome annotation.
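As a rough sketch of what the indexing call can look like from Python: the snippet below only assembles a `salmon index` command and runs it if salmon and the FASTA file are actually present. The file name `gencode.transcripts.fa.gz`, the output directory `salmon_index`, and the helper `build_salmon_index_cmd` are hypothetical placeholders, not the names used in the notebook.

```python
import shutil
import subprocess
from pathlib import Path


def build_salmon_index_cmd(transcript_fasta, index_dir, kmer=31, threads=4):
    """Return a `salmon index` command as a list of arguments."""
    return [
        "salmon", "index",
        "-t", str(transcript_fasta),  # transcriptome FASTA
        "-i", str(index_dir),         # directory where the index is written
        "-k", str(kmer),              # k-mer size
        "-p", str(threads),           # number of threads
        "--gencode",                  # split GENCODE-style FASTA names at the first '|'
    ]


# Hypothetical paths; replace them with the ones used in the Step 1 notebook.
fasta = Path("gencode.transcripts.fa.gz")
cmd = build_salmon_index_cmd(fasta, "salmon_index")
print(" ".join(cmd))

# Only run the command when salmon is installed and the FASTA exists.
if shutil.which("salmon") and fasta.exists():
    subprocess.run(cmd, check=True)
```

Building the command as a list and printing it before running makes it easy to copy the exact command into your notes.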

Step 2. Prepare Input FASTQ files

See notebook here.

In this step, we will go through the 25 FASTQ files, give them meaningful names via soft links, and create a metadata table for them.
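The renaming idea can be sketched as follows. The accessions, sample names, and column names are made up for illustration, and the "raw" files are empty placeholders created in a temporary directory, so the snippet runs without any real data. The notebook itself may build the table with pandas; the standard-library `csv` module is used here only to keep the sketch dependency-free.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records: one row per downloaded FASTQ file.
records = [
    {"accession": "ENCFF000AAA", "sample": "forebrain_sampleA", "read": "R1"},
    {"accession": "ENCFF000AAB", "sample": "forebrain_sampleA", "read": "R2"},
    {"accession": "ENCFF000AAC", "sample": "forebrain_sampleB", "read": "R1"},
]

work_dir = Path(tempfile.mkdtemp())
raw_dir = work_dir / "raw"      # where the downloaded files would live
link_dir = work_dir / "fastq"   # where the renamed soft links go
raw_dir.mkdir()
link_dir.mkdir()

for rec in records:
    raw_path = raw_dir / f"{rec['accession']}.fastq.gz"
    raw_path.touch()  # empty placeholder standing in for a real download
    link_path = link_dir / f"{rec['sample']}_{rec['read']}.fastq.gz"
    link_path.symlink_to(raw_path.resolve())  # soft link with a meaningful name
    rec["raw_path"] = str(raw_path)
    rec["fastq_path"] = str(link_path)

# Save the metadata table next to the linked files.
metadata_path = link_dir / "metadata.csv"
with open(metadata_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

print(metadata_path.read_text())
```

Soft links keep the original downloads untouched, so the mapping between the original file name and the meaningful name stays recoverable from the metadata table.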

Step 3. Trim FASTQ

See notebook here.

In this step, we will trim the FASTQ files using trim_galore. This step generates a new set of 25 trimmed FASTQ files, so we will also update the metadata table.
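A minimal sketch of assembling the trimming command, assuming each call handles either one single-end file or one R1/R2 pair. The helper `build_trim_galore_cmd` and the file names are hypothetical, and the options in the notebook may differ.

```python
import shutil
import subprocess
from pathlib import Path


def build_trim_galore_cmd(fastqs, out_dir):
    """Return a trim_galore command for one single-end file or one read pair."""
    if len(fastqs) not in (1, 2):
        raise ValueError("expected one single-end file or one R1/R2 pair")
    cmd = ["trim_galore", "--output_dir", str(out_dir)]
    if len(fastqs) == 2:
        cmd.append("--paired")  # trim both mates together and keep them in sync
    return cmd + [str(f) for f in fastqs]


# Hypothetical inputs: one paired-end sample and one single-end sample.
paired_cmd = build_trim_galore_cmd(
    ["fastq/sampleA_R1.fastq.gz", "fastq/sampleA_R2.fastq.gz"], "trimmed"
)
single_cmd = build_trim_galore_cmd(["fastq/sampleB_R1.fastq.gz"], "trimmed")
print(" ".join(paired_cmd))
print(" ".join(single_cmd))

# Only run when trim_galore is installed and the input files exist.
for cmd, inputs in [(paired_cmd, paired_cmd[3:]), (single_cmd, single_cmd[2:])]:
    if shutil.which("trim_galore") and all(Path(p).exists() for p in inputs):
        subprocess.run(cmd, check=True)
```

After trimming, list the output directory to see the names trim_galore produced and record those paths in the metadata table, rather than assuming them.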

Step 4. Salmon mapping

See notebook here.

In this step, we will use salmon quant to quantify the 25 trimmed FASTQ files into 16 salmon quant tables. These tables have the same layout as the ones provided in our real dataset, but the numbers are mostly zero because the input FASTQ files here are much smaller. For any further analysis, please refer to the real dataset.
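Because several FASTQ files can belong to one sample, the files have to be grouped by sample before quantification. The sketch below shows one way to do that grouping and build one `salmon quant` command per sample. The metadata rows, column names, and paths are hypothetical, and the assumption that files are labeled `R1`/`R2` may not match the real metadata table.

```python
from collections import defaultdict

# Hypothetical metadata rows: one row per trimmed FASTQ file.
metadata = [
    {"sample": "sampleA", "read": "R1", "trimmed_path": "trimmed/sampleA_R1.fq.gz"},
    {"sample": "sampleA", "read": "R2", "trimmed_path": "trimmed/sampleA_R2.fq.gz"},
    {"sample": "sampleB", "read": "R1", "trimmed_path": "trimmed/sampleB_R1.fq.gz"},
]


def build_salmon_quant_cmds(metadata, index_dir, out_root, threads=4):
    """Group FASTQ files by sample and return one salmon quant command per sample."""
    by_sample = defaultdict(lambda: defaultdict(list))
    for row in metadata:
        by_sample[row["sample"]][row["read"]].append(row["trimmed_path"])

    cmds = {}
    for sample, reads in by_sample.items():
        cmd = [
            "salmon", "quant",
            "-i", str(index_dir),  # index built in Step 1
            "-l", "A",             # let salmon infer the library type
            "-p", str(threads),
        ]
        if reads.get("R2"):
            cmd += ["-1", *reads["R1"], "-2", *reads["R2"]]  # paired-end
        else:
            cmd += ["-r", *reads["R1"]]                      # single-end
        cmd += ["-o", f"{out_root}/{sample}"]                # one output directory per sample
        cmds[sample] = cmd
    return cmds


cmds = build_salmon_quant_cmds(metadata, "salmon_index", "quant")
for sample, cmd in cmds.items():
    print(sample, "->", " ".join(cmd))
```

Each output directory then contains a `quant.sf` table with the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`.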

My suggestions:

When doing analysis, keep your files organized into sub-directories;
Always keep a metadata table next to your data files, and include all necessary information in that table;
Document all steps in a Jupyter Notebook, or at least in files that record the commands for each step. Doing analysis is just like doing a wet lab experiment 🧪: we need to write down what we did 📘📝
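One small habit that supports these suggestions is checking that the metadata table and the files on disk still agree. The sketch below is only an illustration: the column name `fastq_path`, the helper `missing_files`, and the demo files are hypothetical, and paths are assumed to be relative to the directory that holds the metadata table.

```python
import csv
import tempfile
from pathlib import Path


def missing_files(metadata_csv, path_column="fastq_path"):
    """Return the metadata rows whose file does not exist on disk."""
    metadata_csv = Path(metadata_csv)
    missing = []
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Paths are interpreted relative to the metadata table's directory.
            if not (metadata_csv.parent / row[path_column]).exists():
                missing.append(row)
    return missing


# Demo with placeholder files: a.fastq.gz exists, b.fastq.gz does not.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.fastq.gz").touch()
with open(demo_dir / "metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample", "fastq_path"])
    writer.writeheader()
    writer.writerow({"sample": "sampleA", "fastq_path": "a.fastq.gz"})
    writer.writerow({"sample": "sampleB", "fastq_path": "b.fastq.gz"})

print([row["sample"] for row in missing_files(demo_dir / "metadata.csv")])
# prints ['sampleB']
```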


