AllcPaths.tsv: automatically generated by yap summary and stored at {output_dir}/stats.
If you want to count features based on a bed file (gene body bed, etc.), you need to provide a BED file with four columns, and you need to bgzip + tabix this bed file.
chrom
start
end
region id, MUST be unique to each region.
Chrom size file: this file is needed for determining which chromosomes to include in the MCDS file, see the documentation below.
Generate MCDS
Generate MCDS is only a single command with allcools mcds. This is the documentation, see an example command I usually use below.
$ allcools mcds -h
usage: allcools generate-mcds [-h] --allc_table ALLC_TABLE --output_prefix
OUTPUT_PREFIX --chrom_size_path CHROM_SIZE_PATH
--mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
[--split_strand]
[--bin_sizes BIN_SIZES [BIN_SIZES ...]]
[--region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]]
[--region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]]
[--cov_cutoff COV_CUTOFF] [--cpu CPU]
[--max_per_mcds MAX_PER_MCDS]
[--cell_chunk_size CELL_CHUNK_SIZE]
[--dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}]
[--binarize]
optional arguments:
-h, --help show this help message and exit
--split_strand If true, Watson (+) and Crick (-) strands will be
count separately (default: False)
--bin_sizes BIN_SIZES [BIN_SIZES ...]
Fix-size genomic bins can be defined by bin_sizes and
chrom_size_path. Space separated sizes of genome bins,
each size will be count separately. (default: None)
--region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]
Arbitrary genomic regions can be defined in several
BED files to count on. Space separated paths to each
BED files, the fourth column of BED file should be
unique id of the region. (default: None)
--region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]
Space separated names for each BED file provided in
region_bed_paths. (default: None)
--cov_cutoff COV_CUTOFF
Max cov filter for a single site in ALLC. Sites with
cov > cov_cutoff will be skipped. (default: 9999)
--cpu CPU Number of processes to use in parallel. (default: 1)
--max_per_mcds MAX_PER_MCDS
Maximum number of ALLC files to aggregate into 1 MCDS,
if number of ALLC provided > max_per_mcds, will
generate MCDS in chunks, with same prefix provided.
(default: 3072)
--cell_chunk_size CELL_CHUNK_SIZE
Size of cell chunk in parallel aggregation. Do not
have any effect on results. Large chunksize needs
large memory. (default: 100)
--dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}
Data type of MCDS count matrix. Default is np.uint32.
For single cell feature count, this can be set to
np.uint16 [0, 65536] to decrease file size. The values
exceed min/max will be clipped while keep the mc/cov
same, and a warning will be sent. (default: uint32)
--binarize If set, binarize each single site in each individual
ALLC file. This means each cytosine will only
contribute at most 1 cov and 0/1 mc, this is suitable
to account for single cell ALLC R1 R2 overlap issue,
Only use this on single cell ALLC, not bulk ALLC.
(default: False)
required arguments:
--allc_table ALLC_TABLE
Contain all the ALLC file information in 2 columns: 1.
file_uid, 2. file_path. No header (default: None)
--output_prefix OUTPUT_PREFIX
Output prefix of the MCDS (default: None)
--chrom_size_path CHROM_SIZE_PATH
Path to UCSC chrom size file. This can be generated
from the genome fasta or downloaded via UCSC
fetchChromSizes tools. All ALLCools functions will
refer to this file whenever possible to check for
chromosome names and lengths, so it is crucial to use
a chrom size file consistent to the reference fasta
file ever since mapping. ALLCools functions will not
change or infer chromosome names. (default: None)
--mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
Space separated mC context patterns to extract from
ALLC. The context length should be the same as ALLC
file context. Context pattern follows IUPAC nucleotide
code, e.g. N for ATCG, H for ATC, Y for CT. (default:
None)
Example command
allcools generate-mcds \
--allc_table AllcPaths.tsv \
--output_prefix /output_dir/lib_name \ # you don't need to add .mcds, will add automatically
--chrom_size_path chrom_size_file \
--mc_contexts CGN CHN \ # H means ATC, N means ATCG, this means both mCG and mCH will be counted for each feature
--bin_sizes 100000 \ # MCDS will generate separate array counting non-overlapping genomic bins at this length, no bed files needed for non-overlapping genomic bins.
--region_bed_paths gene_body.bed.gz \ # MCDS will generate separate array counting regions in each of the bed file(s)
--region_bed_names gene \ # name of the bed file(s)
--cov_cutoff 2 \ # for single cell ALLC, skip rows having > 2 cov
--cpu 20 \ # mcds can run in parallel
Run time and memory
The speed and memory usage of MCDS depends on the number of cells and the number of features used. In the above example, the speed is about 20 cells / CPU / Hour. Usually, for a 3000 cell library, it finished in several hours. The MCDS generation is memory intensive, it's probably safe to give each CPU 5 GB memory, and no less than 40 GB in total.
Output
A single netCDF4 based MCDS file contains all the cell-by-feature raw count arrays. If you provide multiple feature sets (e.g., both genomic bins and gene body), they will all be saved in this MCDS file, while you can easily select them using ALLCools and/or xarray.