Essential Python For Genome Science

Case Study: Mapping bulk RNA-seq reads with salmon

Last updated 4 years ago

In this book, we use bulk RNA-seq data from the developing mouse forebrain as an example. In the GitHub repo, I have already provided the quantified salmon tables. On this page, we will use a small subset of the data to reproduce these salmon quant tables using the exact same process.

The input is 25 FASTQ files (truncated for demo purposes) downloaded from ENCODE, belonging to 16 samples; the output is 16 salmon quant tables containing transcript-level quantification for each sample.

How to follow this case study?

  1. Download or update the GitHub repository; the files for this case study are located in py_genome_sci_book/analysis/salmon_demo

  2. The analysis contains four main steps, each associated with a Jupyter notebook in that directory

  3. On GitHub, you can see the executed versions of these notebooks. You should be able to execute each of them in your local environment and compare the results with the online version.

I prepared 25 small FASTQ files (each containing only 10,000 reads, for demo purposes) for this demo. The reduced file size allows you to go through all of these steps in about 30 minutes.

Main Steps and Notebooks

Step 1. Create Salmon Index

In this step, we will create a salmon index for the mouse GENCODE transcriptome annotation.
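As a rough sketch of what this step boils down to (the notebook in the repo remains the authoritative version), the index can be built by calling `salmon index` through `subprocess`. The file and directory names below are hypothetical placeholders, and `kmer=31` and `threads=4` are just illustrative values. The `--gencode` flag tells salmon that the FASTA headers are in GENCODE format.

```python
import shutil
import subprocess
from pathlib import Path


def build_salmon_index_cmd(transcript_fasta, index_dir, kmer=31, threads=4):
    """Return the `salmon index` command as a list of arguments."""
    return [
        "salmon", "index",
        "-t", str(transcript_fasta),  # transcriptome FASTA (placeholder path)
        "-i", str(index_dir),         # directory where the index is written
        "-k", str(kmer),              # k-mer size
        "-p", str(threads),           # number of threads
        "--gencode",                  # headers are in GENCODE format
    ]


cmd = build_salmon_index_cmd(
    Path("gencode.transcripts.fa.gz"),  # hypothetical name, not from the repo
    Path("salmon_index"),               # hypothetical output directory
)
print(" ".join(cmd))

# Only run when salmon is installed and the FASTA is actually present.
if shutil.which("salmon") is not None and Path(cmd[3]).exists():
    subprocess.run(cmd, check=True)
```

Building the command as a list first makes it easy to print and inspect before anything is executed.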

Step 2. Prepare Input FASTQ files

In this step, we will go through these 25 FASTQ files, give them meaningful names via soft links, and create a metadata table for them.
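The idea of this step could be sketched roughly as follows, using only the standard library. The raw file names, the sample names, and the `{sample}_{read}.fastq.gz` naming scheme are all hypothetical and only serve the illustration; the notebook in the repo may organize things differently. The demo runs in a temporary directory so it does not touch any real data.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["sample", "read", "raw_path", "fastq_path"]


def link_fastq_files(name_map, raw_dir, link_dir, metadata_path):
    """Soft-link raw FASTQ files under new names and write a metadata CSV.

    name_map maps each raw file name to a dict with 'sample' and 'read'.
    """
    raw_dir, link_dir = Path(raw_dir), Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for raw_name, info in sorted(name_map.items()):
        raw_path = (raw_dir / raw_name).resolve()
        link_path = link_dir / f"{info['sample']}_{info['read']}.fastq.gz"
        if link_path.is_symlink():
            link_path.unlink()  # replace a stale link so reruns are safe
        link_path.symlink_to(raw_path)
        records.append({
            "sample": info["sample"],
            "read": info["read"],
            "raw_path": str(raw_path),
            "fastq_path": str(link_path),
        })
    with open(metadata_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return records


# Demo with empty placeholder files (made-up names, not real ENCODE files).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    name_map = {
        "raw_file_1.fastq.gz": {"sample": "sampleA_rep1", "read": "R1"},
        "raw_file_2.fastq.gz": {"sample": "sampleA_rep1", "read": "R2"},
    }
    for raw_name in name_map:
        (raw / raw_name).touch()
    records = link_fastq_files(
        name_map, raw, Path(tmp) / "fastq", Path(tmp) / "metadata.csv"
    )
    linked_names = sorted(p.name for p in (Path(tmp) / "fastq").iterdir())

print(linked_names)
```

Soft links keep the original downloads untouched while giving every later step a readable file name, and the metadata table records how the two are connected.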

Step 3. Trim FASTQ

In this step, we will trim the FASTQ files using trim_galore. This step generates a new set of 25 trimmed FASTQ files, so we will also update the metadata table.
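A minimal sketch of how the trimming call might be assembled in Python is shown below. It assumes that a sample has either one FASTQ file (single-end) or two (paired-end); the file names and the output directory are hypothetical, and the options actually used in the notebook may differ.

```python
import shutil
import subprocess


def build_trim_galore_cmd(fastq_paths, output_dir, cores=1):
    """Return a trim_galore command for one single-end or paired-end sample."""
    fastq_paths = [str(p) for p in fastq_paths]
    if len(fastq_paths) not in (1, 2):
        raise ValueError("expected 1 (single-end) or 2 (paired-end) FASTQ files")
    cmd = [
        "trim_galore",
        "--cores", str(cores),
        "--output_dir", str(output_dir),
    ]
    if len(fastq_paths) == 2:
        cmd.append("--paired")  # trim both mates together
    return cmd + fastq_paths


# Hypothetical file names, following the soft-link naming idea from Step 2.
single_cmd = build_trim_galore_cmd(["sampleB_rep1_R1.fastq.gz"], "trimmed")
paired_cmd = build_trim_galore_cmd(
    ["sampleA_rep1_R1.fastq.gz", "sampleA_rep1_R2.fastq.gz"], "trimmed"
)
print(" ".join(single_cmd))
print(" ".join(paired_cmd))

# Run only if trim_galore is installed; input files must exist for it to work.
if shutil.which("trim_galore") is not None:
    subprocess.run(paired_cmd, check=False)
```

After the run, it is safer to list the output directory and record the trimmed file paths in the metadata table than to guess the output names.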

Step 4. Salmon mapping

My suggestions:

  1. When doing analysis, keep your files organized into sub-directories;

  2. Always keep a metadata table beside your data files, and include all necessary information in that table;

  3. Document all steps in Jupyter notebooks, or at least in files that record the commands for each step. Doing analysis is just like doing a wet lab experiment 🧪: we need to write down what we did 📘

In this step, we will use salmon quant to quantify the 25 trimmed FASTQ files into 16 salmon quant tables. These tables have the same layout as the ones provided in our real dataset, but the numbers are mostly zero because we used much smaller FASTQ files as input here. For any further analysis, please refer to the real dataset.
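For the salmon mapping step, a hedged sketch of the `salmon quant` call and of a quick check on the resulting `quant.sf` table could look like this. The paths and sample names are hypothetical, `-l A` asks salmon to infer the library type, and the small table at the end contains made-up numbers that only illustrate the column layout.

```python
import csv
import io


def build_salmon_quant_cmd(index_dir, output_dir, reads_1, reads_2=None, threads=4):
    """Return a `salmon quant` command for one sample.

    reads_1 / reads_2 are lists of FASTQ paths; leave reads_2 as None
    for single-end samples.
    """
    cmd = [
        "salmon", "quant",
        "-i", str(index_dir),   # index created in Step 1
        "-l", "A",              # let salmon infer the library type
        "-p", str(threads),     # number of threads
        "-o", str(output_dir),  # salmon writes quant.sf into this directory
    ]
    if reads_2 is None:
        cmd += ["-r", *map(str, reads_1)]
    else:
        cmd += ["-1", *map(str, reads_1), "-2", *map(str, reads_2)]
    return cmd


def count_detected(quant_file):
    """Return (transcripts with NumReads > 0, total transcripts)."""
    rows = list(csv.DictReader(quant_file, delimiter="\t"))
    detected = sum(float(row["NumReads"]) > 0 for row in rows)
    return detected, len(rows)


# Hypothetical inputs; real names come from the metadata table.
single_cmd = build_salmon_quant_cmd(
    "salmon_index", "quant/sampleB_rep1", ["sampleB_rep1_trimmed.fq.gz"]
)
paired_cmd = build_salmon_quant_cmd(
    "salmon_index", "quant/sampleA_rep1",
    ["sampleA_rep1_1.fq.gz"], ["sampleA_rep1_2.fq.gz"],
)
print(" ".join(paired_cmd))

# Toy table with made-up numbers, only to show the quant.sf column layout.
toy_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t0.0\t0.0\n"
    "tx2\t2000\t1800.0\t12.5\t3.0\n"
)
detected, total = count_detected(toy_quant)
print(detected, total)
```

With the truncated demo FASTQ files, a check like `count_detected` on a real `quant.sf` should confirm that most transcripts have zero reads.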
