Demultiplex
Related Commands
Purpose
The bcl2fastq command only demultiplexes the PCR index, so each set of raw FASTQ files still contains reads from several cells (8 cells in V1; 64 cells in V2). This step further demultiplexes the random index at the 5' end of R1, generating cell-level R1 and R2 FASTQ files.
The random index is trimmed after demultiplexing. The random index name appears in the FASTQ file name, where it combines with the preceding information to form the cell id.
This step also prepares Snakefiles that contain all the commands for mapping (using snakemake).
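The random-index demultiplexing described above can be illustrated with a minimal Python sketch. It is not the actual implementation (which is based on cutadapt): the index sequences and names in `EXAMPLE_INDEXES` are hypothetical placeholders, and exact matching is a simplification made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical index table: sequences and names are placeholders,
# not the real V1 or V2 random indexes.
EXAMPLE_INDEXES: Dict[str, str] = {
    "ACGATG": "idxA",
    "TCGTAC": "idxB",
}


def assign_read(
    seq: str, qual: str, indexes: Dict[str, str]
) -> Optional[Tuple[str, str, str]]:
    """Match the 5' end of an R1 read against the index table.

    Returns (index_name, trimmed_seq, trimmed_qual), or None when the
    leading bases match no index. Exact matching only (an assumption).
    """
    index_len = len(next(iter(indexes)))  # 6 bp for V1, 8 bp for V2
    name = indexes.get(seq[:index_len])
    if name is None:
        return None
    # Trim the index from both the sequence and its quality string.
    return name, seq[index_len:], qual[index_len:]
```

In this sketch the returned index name is what would be appended to the existing file-level information to form the cell id.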
Input
FASTQ file sets created by Illumina bcl2fastq.
For MiSeq, each set has two files (R1 and R2 from one lane).
For NovaSeq, each set has 2 * N_lane files, where N_lane depends on the flowcell used and how it was loaded.
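To check that every set has the expected 2 * N_lane files before demultiplexing, something like the following sketch may help. It assumes the default bcl2fastq file naming (`{sample}_S{n}_L{lane}_{R1|R2}_001.fastq.gz`) and treats the sample name as the set identifier; both are assumptions, and the function names are made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Default bcl2fastq naming: {sample}_S{n}_L{lane}_{R1|R2}_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def group_by_set(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group FASTQ file names by sample name (assumed to identify a set)."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match:  # silently skip files that do not follow the naming scheme
            sets[match.group("sample")].append(name)
    return {sample: sorted(files) for sample, files in sets.items()}


def incomplete_sets(sets: Dict[str, List[str]], n_lane: int) -> List[str]:
    """Return the sets that do not contain exactly 2 * n_lane files."""
    return sorted(s for s, files in sets.items() if len(files) != 2 * n_lane)
```

For a MiSeq run, `n_lane` would be 1; for NovaSeq it depends on the flowcell and loading, as noted above.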
Demultiplex
It takes several minutes to demultiplex MiSeq files and several hours to demultiplex NovaSeq FASTQ files (~100 GB/h with 16 cores).
This command creates many files simultaneously. To reduce the burden on the file system, I set the max CPU to 16.
Remember to wrap the FASTQ pattern in double quotes ("...") as shown here; otherwise the shell expands the wildcard before the command sees it, which causes an error:
--fastq_pattern "path/pattern/to/your/bcl2fastq/results/*fastq.gz"
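Before running the command, it can be useful to preview which files a pattern would match. The sketch below uses Python's standard `glob` module and assumes the pattern is an ordinary shell-style glob; how the pipeline itself expands the pattern is not shown here, and `preview_pattern` is a hypothetical helper name.

```python
import glob
from typing import List


def preview_pattern(fastq_pattern: str) -> List[str]:
    """List the files matched by a shell-style glob pattern.

    Raises ValueError when nothing matches, which usually means the
    path or the wildcard is wrong.
    """
    matches = sorted(glob.glob(fastq_pattern))
    if not matches:
        raise ValueError(f"No files match pattern: {fastq_pattern!r}")
    return matches
```

Because the pattern is passed as a Python string, no shell expansion happens, which mirrors what quoting achieves on the command line.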
An error will occur if the output_dir already exists.
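Since an existing output directory causes an error, a small pre-flight check can catch the problem before a long run is launched. This is only a sketch of such a check in Python; `check_output_dir_is_new` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def check_output_dir_is_new(output_dir: str) -> Path:
    """Return the output path, or raise if it already exists.

    The directory is NOT created here; this only verifies that the
    demultiplex step would be free to create it.
    """
    path = Path(output_dir)
    if path.exists():
        raise FileExistsError(
            f"{path} already exists; choose a new output_dir or remove it first."
        )
    return path
```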
For Ecker Lab users: do not run this step on the DDN drive. cutadapt demultiplex (which yap is based on) constantly raises errors on the DDN drive. To be safe, do not run mapping on the DDN drive either.
Output
The random index sequence is removed from the reads:
6 bp are removed from the 5' end of R1 in V1 indexed libraries.
8 bp are removed from the 5' end of R1 in V2 indexed libraries.
Each cell will have two FASTQ files in the output directory, with a fixed name pattern:
{cell_id}-R1.fq.gz for R1
{cell_id}-R2.fq.gz for R2
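The fixed name pattern makes it straightforward to verify that every cell has both files. The following sketch searches the output directory recursively (an assumption, since the exact directory layout is not reproduced here) and reports cell ids that are missing either R1 or R2; the function names are hypothetical.

```python
from pathlib import Path
from typing import List, Set, Tuple

R1_SUFFIX = "-R1.fq.gz"
R2_SUFFIX = "-R2.fq.gz"


def cell_fastq_names(cell_id: str) -> Tuple[str, str]:
    """Build the expected R1 and R2 file names for one cell."""
    return f"{cell_id}{R1_SUFFIX}", f"{cell_id}{R2_SUFFIX}"


def _cell_ids(directory: Path, suffix: str) -> Set[str]:
    # Strip the fixed suffix to recover the cell id from each file name.
    return {p.name[: -len(suffix)] for p in directory.rglob(f"*{suffix}")}


def find_unpaired_cells(output_dir: str) -> List[str]:
    """Return cell ids that have only one of the two expected files."""
    directory = Path(output_dir)
    r1_ids = _cell_ids(directory, R1_SUFFIX)
    r2_ids = _cell_ids(directory, R2_SUFFIX)
    return sorted(r1_ids ^ r2_ids)  # symmetric difference: in one set only
```

An empty list from `find_unpaired_cells` means every cell found in the directory has both an R1 and an R2 file.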
Files are organized in the following structure; a minimal example is also attached below.