Demultiplex

Related Commands

# demultiplex the random index, generate cell-level FASTQ files
yap demultiplex

Purpose

The bcl2fastq command only demultiplexes the PCR index. Therefore, each set of raw FASTQ files still contains reads from several cells (8 cells in V1; 64 cells in V2). This step further demultiplexes on the random index at the 5' end of R1, generating cell-level R1 and R2 FASTQ files.
The random index is trimmed after demultiplexing. The random index name appears in the FASTQ file name, where it is combined with the preceding information to form the cell ID.
This step also prepares Snakefiles that contain all the commands for mapping (using snakemake).
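To make the idea concrete, here is a minimal Python sketch of demultiplexing on a 5' index: each read pair is assigned to a cell by the first bases of R1, and those bases are then trimmed. This is only an illustration, not how yap does it internally (yap relies on cutadapt). The index names and sequences in `RANDOM_INDEX` are made up, and reads are plain strings rather than FASTQ records.

```python
from collections import defaultdict

# Hypothetical index table (name -> sequence). The real index sequences
# are not listed in this document; these two are invented for illustration.
RANDOM_INDEX = {"A1": "ACGATCAG", "A2": "TCGAGAGT"}


def demultiplex(read_pairs, index_table):
    """Group (r1, r2) pairs by the 5' prefix of R1 and trim that prefix."""
    by_seq = {seq: name for name, seq in index_table.items()}
    index_len = len(next(iter(by_seq)))  # assumes all indexes share one length
    cells = defaultdict(list)
    for r1, r2 in read_pairs:
        name = by_seq.get(r1[:index_len])
        if name is None:
            # No exact match: keep the read untrimmed under "unknown".
            cells["unknown"].append((r1, r2))
        else:
            cells[name].append((r1[index_len:], r2))
    return dict(cells)
```

Only exact matches are handled here; a real demultiplexer would normally also tolerate sequencing errors in the index.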

Input

FASTQ file sets created by Illumina bcl2fastq.
  • For MiSeq, each set has two files (R1 & R2 from one lane).
  • For NovaSeq, each set has 2 * N_lane files, where N_lane depends on the flowcell used and how it was loaded.

Demultiplex

yap demultiplex -h
usage: yap demultiplex [-h] --fastq_pattern FASTQ_PATTERN --output_dir
                       OUTPUT_DIR --config_path CONFIG_PATH --cpu CPU

optional arguments:
  -h, --help            show this help message and exit

Required inputs:
  --fastq_pattern FASTQ_PATTERN, -fq FASTQ_PATTERN
                        FASTQ files with wildcard to match all bcl2fastq
                        results, pattern with wildcard must be quoted.
                        (default: None)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Pipeline output directory, will be created
                        recursively. (default: None)
  --config_path CONFIG_PATH, -config CONFIG_PATH
                        Path to the mapping config, see 'yap default-mapping-
                        config' about how to generate this file. (default:
                        None)
  --cpu CPU, -j CPU     Number of cores to use. Note that the demultiplex step
                        will only use at most 16 cores. (default: None)
  • Demultiplexing takes several minutes for MiSeq files and several hours for NovaSeq FASTQ files (~100 GB/h with 16 cores).
  • This command creates many files simultaneously. To reduce the burden on the file system, the maximum number of CPUs is capped at 16.
  • Remember to quote the FASTQ pattern with "..." like this; otherwise the shell expands the wildcard and causes an error: --fastq_pattern "path/pattern/to/your/bcl2fastq/results/fastq.gz"
  • An error will occur if the output_dir already exists.
For Ecker Lab users: do not run this step on the DDN drive, because cutadapt demultiplex (which yap is based on) constantly raises errors there. To be safe, do not run mapping on the DDN drive either.
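The notes above can be turned into a small pre-flight check. The following Python sketch is not part of yap, and `build_demultiplex_command` is a hypothetical helper name. It only uses the four options shown in the help text: it checks that the pattern matches at least one file, refuses an existing output directory, and caps the CPU count at 16.

```python
import glob
import os


def build_demultiplex_command(fastq_pattern, output_dir, config_path, cpu):
    """Validate the inputs and return an argv list for `yap demultiplex`."""
    if not glob.glob(fastq_pattern):
        raise FileNotFoundError(f"No FASTQ files match {fastq_pattern!r}")
    if os.path.exists(output_dir):
        # yap demultiplex raises an error if output_dir already exists.
        raise FileExistsError(f"{output_dir!r} already exists")
    return [
        "yap", "demultiplex",
        "--fastq_pattern", fastq_pattern,  # kept as one unexpanded argument
        "--output_dir", output_dir,
        "--config_path", config_path,
        "--cpu", str(min(cpu, 16)),  # the step uses at most 16 cores
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` bypasses the shell, so the wildcard reaches yap unexpanded. That has the same effect as quoting the pattern on the command line.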

Output

  • The random index sequence will be removed from the reads
    • 6bp removed from R1 5' in V1 indexed libraries.
    • 8bp removed from R1 5' in V2 indexed libraries.
  • Each cell will have two FASTQ files in the output directory, with a fixed name pattern:
    • {cell_id}-R1.fq.gz for R1
    • {cell_id}-R2.fq.gz for R2
  • Files are organized in the following structure; a minimal example is also attached below.
output_dir
├── CEMBA200709_9E_4-1-E18  # a set of FASTQ files
│   ├── fastq
│   │   ├── {cell1}-R1.fq.gz
│   │   ├── {cell1}-R2.fq.gz
│   │   ├── {cell2}-R1.fq.gz
│   │   ├── {cell2}-R2.fq.gz
│   │   ├── ...
│   │   └── skipped  # some FASTQ files may be skipped due to too few or too many reads
│   │       └── ...
│   └── Snakefile  # command files for snakemake
├── (Other sets)
│   ├── fastq
│   │   ├── ...
│   │   └── skipped
│   │       └── ...
│   └── Snakefile
├── mapping_config.ini  # mapping config copied here
├── snakemake  # this only occurs when using yap on the Ecker Lab server
│   ├── qsub  # job script for qsub on gale
│   └── sbatch  # job script for sbatch on stampede2
└── stats  # place for summary stats
    ├── demultiplex.stats.csv
    ├── fastq_dataframe.csv
    └── UIDTotalCellInputReadPairs.csv
snmC-seq3.miseq.V2.tgz (4MB): snmC-seq3, MiSeq, V2 barcode, demultiplex results
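Given the layout above, one way to collect the cell-level files for downstream checks is to walk each set's fastq directory and pair R1 with R2 by name. This is a minimal sketch: `list_cell_fastq_pairs` is a hypothetical helper, not a yap function, and it assumes only the `{cell_id}-R1.fq.gz` / `{cell_id}-R2.fq.gz` naming described in this section.

```python
from pathlib import Path


def list_cell_fastq_pairs(output_dir):
    """Map cell_id -> (R1 path, R2 path) for every complete pair."""
    suffix = "-R1.fq.gz"
    pairs = {}
    # The pattern stops at <set>/fastq/, so files under skipped/ are ignored.
    for r1 in sorted(Path(output_dir).glob(f"*/fastq/*{suffix}")):
        cell_id = r1.name[: -len(suffix)]
        r2 = r1.with_name(f"{cell_id}-R2.fq.gz")
        if r2.exists():
            pairs[cell_id] = (r1, r2)
    return pairs
```

Cells that lack an R2 file are silently dropped here; depending on the use case, raising an error for incomplete pairs may be preferable.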