Demultiplex
Related Commands
# demultiplex the random index, generate cell-level FASTQ files
yap demultiplex
Purpose
The bcl2fastq command only demultiplexes the PCR index, so each set of raw FASTQ files still contains reads from several cells (8 cells in V1; 64 cells in V2). This step further demultiplexes the random index at the 5' end of R1, generating cell-level R1 and R2 FASTQ files.
The random index is trimmed from the reads after demultiplexing. The random index name appears in the FASTQ file name and combines with the previous information to form the cell ID.
This step also prepares Snakefiles that contain all the commands for mapping (using snakemake).
Input
FASTQ file sets created by Illumina bcl2fastq.
For MiSeq, each set has two files (R1 and R2 from one lane).
For NovaSeq, each set has 2 * N_lane files, where N_lane depends on the flowcell used and how it was loaded.
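As a quick sanity check before demultiplexing, the sketch below groups FASTQ files into sets and confirms that every lane has both R1 and R2. It assumes bcl2fastq's default naming scheme ({sample}_S{n}_L{lane}_{R1|R2}_001.fastq.gz). The function names and the example file names are hypothetical and are not part of yap.

```python
import re
from collections import defaultdict

# Assumption: bcl2fastq's default naming, {sample}_S{n}_L{lane}_{R1|R2}_001.fastq.gz
NAME_RE = re.compile(
    r"^(?P<set>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def group_fastq_sets(file_names):
    """Map each FASTQ set name to {lane: set of read types found}."""
    sets = defaultdict(lambda: defaultdict(set))
    for name in file_names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # not an R1/R2 file under the assumed naming scheme
        sets[match["set"]][match["lane"]].add(match["read"])
    return sets


def check_sets(file_names):
    """Return {set name: number of lanes}; raise if a lane lacks R1 or R2."""
    lane_counts = {}
    for set_name, lanes in group_fastq_sets(file_names).items():
        for lane, reads in lanes.items():
            if reads != {"R1", "R2"}:
                raise ValueError(
                    f"{set_name} lane {lane} is incomplete: {sorted(reads)}"
                )
        lane_counts[set_name] = len(lanes)
    return lane_counts


# Hypothetical two-lane set: 2 * N_lane = 4 files
example_files = [
    "CEMBA200709_9E_4-1-E18_S1_L001_R1_001.fastq.gz",
    "CEMBA200709_9E_4-1-E18_S1_L001_R2_001.fastq.gz",
    "CEMBA200709_9E_4-1-E18_S1_L002_R1_001.fastq.gz",
    "CEMBA200709_9E_4-1-E18_S1_L002_R2_001.fastq.gz",
]
print(check_sets(example_files))
```

A MiSeq set would report one lane under the same check.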
Demultiplex
yap demultiplex -h
usage: yap demultiplex [-h] --fastq_pattern FASTQ_PATTERN --output_dir
OUTPUT_DIR --config_path CONFIG_PATH --cpu CPU
optional arguments:
-h, --help show this help message and exit
Required inputs:
--fastq_pattern FASTQ_PATTERN, -fq FASTQ_PATTERN
FASTQ files with wildcard to match all bcl2fastq
results, pattern with wildcard must be quoted.
(default: None)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
Pipeline output directory, will be created
recursively. (default: None)
--config_path CONFIG_PATH, -config CONFIG_PATH
Path to the mapping config, see 'yap default-mapping-
config' about how to generate this file. (default:
None)
--cpu CPU, -j CPU Number of cores to use. Note that the demultiplex step
will only use at most 16 cores. (default: None)
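The help text requires the wildcard pattern to be quoted. The sketch below assembles a command line from the four required options and lets shlex add the quoting. All paths are hypothetical placeholders, and the helper function is an illustration rather than part of yap.

```python
import shlex


def build_demultiplex_command(fastq_pattern, output_dir, config_path, cpu):
    """Return a shell-safe command string for `yap demultiplex`."""
    args = [
        "yap", "demultiplex",
        "--fastq_pattern", fastq_pattern,  # quoted by shlex.join if it has a wildcard
        "--output_dir", output_dir,
        "--config_path", config_path,
        "--cpu", str(cpu),
    ]
    return shlex.join(args)


command = build_demultiplex_command(
    fastq_pattern="/path/to/bcl2fastq/*.fastq.gz",  # hypothetical path
    output_dir="/path/to/output_dir",               # hypothetical path
    config_path="/path/to/mapping_config.ini",      # hypothetical path
    cpu=16,  # the help text says at most 16 cores are used in this step
)
print(command)
```

The printed string can be pasted into a terminal or a job script. Quoting keeps the shell from expanding the wildcard before yap sees it.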
For Ecker Lab users: do not run this step on the DDN drive. cutadapt demultiplexing (which yap is based on) constantly raises errors there. To be safe, do not run mapping on the DDN drive either.
Output
The random index sequence is removed from the reads:
6 bp are removed from the 5' end of R1 in V1 indexed libraries.
8 bp are removed from the 5' end of R1 in V2 indexed libraries.
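To illustrate what the trimming means for a single R1 read, here is a minimal sketch that splits off the leading random index using the lengths stated above. It is not how yap does it: the actual step is performed by cutadapt, which also has to match the index against the known index sequences. The function name and the example read are hypothetical.

```python
# Random index lengths at the 5' end of R1, as stated above
RANDOM_INDEX_LENGTH = {"V1": 6, "V2": 8}


def split_random_index(sequence, quality, version):
    """Return (random index, trimmed sequence, trimmed quality) for one R1 read."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    index_length = RANDOM_INDEX_LENGTH[version]
    if len(sequence) < index_length:
        raise ValueError("read is shorter than the random index")
    return (
        sequence[:index_length],
        sequence[index_length:],
        quality[index_length:],
    )


# Hypothetical 12 bp read with uniform quality scores
index, trimmed_seq, trimmed_qual = split_random_index(
    "ACGTACGGTTAA", "IIIIIIIIIIII", "V1"
)
print(index, trimmed_seq, trimmed_qual)
```

The quality string is trimmed by the same amount so that the FASTQ record stays consistent.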
Each cell will have two FASTQ files in the output directory, with a fixed name pattern:
{cell_id}-R1.fq.gz for R1
{cell_id}-R2.fq.gz for R2
Files are organized in the following structure; a minimal example is shown below.
output_dir
├── CEMBA200709_9E_4-1-E18  # a set of FASTQ files
│   ├── fastq
│   │   ├── {cell1}-R1.fq.gz
│   │   ├── {cell1}-R2.fq.gz
│   │   ├── {cell2}-R1.fq.gz
│   │   ├── {cell2}-R2.fq.gz
│   │   ├── ...
│   │   └── skipped  # some FASTQ files may be skipped for having too few or too many reads
│   │       └── ...
│   └── Snakefile  # command file for snakemake
├── (Other sets)
│   ├── fastq
│   │   ├── ...
│   │   └── skipped
│   │       └── ...
│   └── Snakefile
├── mapping_config.ini  # mapping config copied here
├── snakemake  # only present when using yap on the Ecker Lab server
│   ├── qsub  # job script for qsub on gale
│   └── sbatch  # job script for sbatch on stampede2
└── stats  # place for summary stats
    ├── demultiplex.stats.csv
    ├── fastq_dataframe.csv
    └── UIDTotalCellInputReadPairs.csv
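Given this layout, the cell-level FASTQ pairs can be collected with a short directory walk. The sketch below assumes only the structure and name pattern described above: a set is any subdirectory containing a Snakefile, and files under skipped are ignored. The function name and the demo set and cell names are hypothetical.

```python
import tempfile
from pathlib import Path

R1_SUFFIX = "-R1.fq.gz"
R2_SUFFIX = "-R2.fq.gz"


def list_cells(output_dir):
    """Return {set name: {cell_id: (R1 path, R2 path)}} for a yap output_dir."""
    result = {}
    # A set directory is identified by the Snakefile placed next to fastq/
    for snakefile in sorted(Path(output_dir).glob("*/Snakefile")):
        set_dir = snakefile.parent
        cells = {}
        # Non-recursive glob, so files under fastq/skipped are not included
        for r1 in sorted((set_dir / "fastq").glob("*" + R1_SUFFIX)):
            cell_id = r1.name[: -len(R1_SUFFIX)]
            r2 = r1.with_name(cell_id + R2_SUFFIX)
            if r2.exists():
                cells[cell_id] = (r1, r2)
        result[set_dir.name] = cells
    return result


# Demo on a throwaway directory with hypothetical set and cell names
with tempfile.TemporaryDirectory() as tmp:
    fastq_dir = Path(tmp) / "set_A" / "fastq"
    (fastq_dir / "skipped").mkdir(parents=True)
    (fastq_dir.parent / "Snakefile").touch()
    for name in ("cell1-R1.fq.gz", "cell1-R2.fq.gz", "skipped/cell2-R1.fq.gz"):
        (fastq_dir / name).touch()
    demo_cells = list_cells(tmp)
    print({set_name: sorted(cells) for set_name, cells in demo_cells.items()})
```

Directories without a Snakefile, such as stats and snakemake, are skipped because they do not match the glob.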