Generate MCDS

What is MCDS file? See here.

Input

  • AllcPaths.tsv: automatically generated by yap summary and stored at {output_dir}/stats.

  • If you want to count features based on a bed file (gene body bed, etc.), you need to provide a BED file with four columns, and you need to bgzip + tabix this bed file.

    1. chrom

    2. start

    3. end

    4. region id, MUST be unique to each region.

  • Chrom size file: this file is needed for determining which chromosomes to include in the MCDS file, see the documentation below.

Generate MCDS

Generate MCDS is only a single command with allcools mcds. This is the documentation, see an example command I usually use below.

$ allcools mcds -h
usage: allcools generate-mcds [-h] --allc_table ALLC_TABLE --output_prefix
                              OUTPUT_PREFIX --chrom_size_path CHROM_SIZE_PATH
                              --mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
                              [--split_strand]
                              [--bin_sizes BIN_SIZES [BIN_SIZES ...]]
                              [--region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]]
                              [--region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]]
                              [--cov_cutoff COV_CUTOFF] [--cpu CPU]
                              [--max_per_mcds MAX_PER_MCDS]
                              [--cell_chunk_size CELL_CHUNK_SIZE]
                              [--dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}]
                              [--binarize]

optional arguments:
  -h, --help            show this help message and exit
  --split_strand        If true, Watson (+) and Crick (-) strands will be
                        count separately (default: False)
  --bin_sizes BIN_SIZES [BIN_SIZES ...]
                        Fix-size genomic bins can be defined by bin_sizes and
                        chrom_size_path. Space separated sizes of genome bins,
                        each size will be count separately. (default: None)
  --region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]
                        Arbitrary genomic regions can be defined in several
                        BED files to count on. Space separated paths to each
                        BED files, the fourth column of BED file should be
                        unique id of the region. (default: None)
  --region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]
                        Space separated names for each BED file provided in
                        region_bed_paths. (default: None)
  --cov_cutoff COV_CUTOFF
                        Max cov filter for a single site in ALLC. Sites with
                        cov > cov_cutoff will be skipped. (default: 9999)
  --cpu CPU             Number of processes to use in parallel. (default: 1)
  --max_per_mcds MAX_PER_MCDS
                        Maximum number of ALLC files to aggregate into 1 MCDS,
                        if number of ALLC provided > max_per_mcds, will
                        generate MCDS in chunks, with same prefix provided.
                        (default: 3072)
  --cell_chunk_size CELL_CHUNK_SIZE
                        Size of cell chunk in parallel aggregation. Do not
                        have any effect on results. Large chunksize needs
                        large memory. (default: 100)
  --dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}
                        Data type of MCDS count matrix. Default is np.uint32.
                        For single cell feature count, this can be set to
                        np.uint16 [0, 65536] to decrease file size. The values
                        exceed min/max will be clipped while keep the mc/cov
                        same, and a warning will be sent. (default: uint32)
  --binarize            If set, binarize each single site in each individual
                        ALLC file. This means each cytosine will only
                        contribute at most 1 cov and 0/1 mc, this is suitable
                        to account for single cell ALLC R1 R2 overlap issue,
                        Only use this on single cell ALLC, not bulk ALLC.
                        (default: False)

required arguments:
  --allc_table ALLC_TABLE
                        Contain all the ALLC file information in 2 columns: 1.
                        file_uid, 2. file_path. No header (default: None)
  --output_prefix OUTPUT_PREFIX
                        Output prefix of the MCDS (default: None)
  --chrom_size_path CHROM_SIZE_PATH
                        Path to UCSC chrom size file. This can be generated
                        from the genome fasta or downloaded via UCSC
                        fetchChromSizes tools. All ALLCools functions will
                        refer to this file whenever possible to check for
                        chromosome names and lengths, so it is crucial to use
                        a chrom size file consistent to the reference fasta
                        file ever since mapping. ALLCools functions will not
                        change or infer chromosome names. (default: None)
  --mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
                        Space separated mC context patterns to extract from
                        ALLC. The context length should be the same as ALLC
                        file context. Context pattern follows IUPAC nucleotide
                        code, e.g. N for ATCG, H for ATC, Y for CT. (default:
                        None)

Example command

allcools generate-mcds \
--allc_table AllcPaths.tsv \
--output_prefix /output_dir/lib_name \  # you don't need to add .mcds, will add automatically
--chrom_size_path chrom_size_file \
--mc_contexts CGN CHN \  # H means ATC, N means ATCG, this means both mCG and mCH will be counted for each feature
--bin_sizes 100000 \  # MCDS will generate separate array counting non-overlapping genomic bins at this length, no bed files needed for non-overlapping genomic bins.
--region_bed_paths gene_body.bed.gz \  # MCDS will generate separate array counting regions in each of the bed file(s) 
--region_bed_names gene \  # name of the bed file(s)
--cov_cutoff 2 \  # for single cell ALLC, skip rows having > 2 cov
--cpu 20 \  # mcds can run in parallel

Run time and memory

The speed and memory usage of MCDS depends on the number of cells and the number of features used. In the above example, the speed is about 20 cells / CPU / Hour. Usually, for a 3000 cell library, it finished in several hours. The MCDS generation is memory intensive, it's probably safe to give each CPU 5 GB memory, and no less than 40 GB in total.

Output

A single netCDF4 based MCDS file contains all the cell-by-feature raw count arrays. If you provide multiple feature sets (e.g., both genomic bins and gene body), they will all be saved in this MCDS file, while you can easily select them using ALLCools and/or xarray.

Last updated