Generate MCDS
What is MCDS file? See here.

Input

  • AllcPaths.tsv: automatically generated by yap summary and stored at {output_dir}/stats.
  • If you want to count features based on a bed file (gene body bed, etc.), you need to provide a BED file with four columns, and you need to bgzip + tabix this bed file.
    1. 1.
      chrom
    2. 2.
      start
    3. 3.
      end
    4. 4.
      region id, MUST be unique to each region.
  • Chrom size file: this file is needed for determining which chromosomes to include in the MCDS file, see the documentation below.

Generate MCDS

Generate MCDS is only a single command with allcools mcds. This is the documentation, see an example command I usually use below.
1
$ allcools mcds -h
2
usage: allcools generate-mcds [-h] --allc_table ALLC_TABLE --output_prefix
3
OUTPUT_PREFIX --chrom_size_path CHROM_SIZE_PATH
4
--mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
5
[--split_strand]
6
[--bin_sizes BIN_SIZES [BIN_SIZES ...]]
7
[--region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]]
8
[--region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]]
9
[--cov_cutoff COV_CUTOFF] [--cpu CPU]
10
[--max_per_mcds MAX_PER_MCDS]
11
[--cell_chunk_size CELL_CHUNK_SIZE]
12
[--dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}]
13
[--binarize]
14
15
optional arguments:
16
-h, --help show this help message and exit
17
--split_strand If true, Watson (+) and Crick (-) strands will be
18
count separately (default: False)
19
--bin_sizes BIN_SIZES [BIN_SIZES ...]
20
Fix-size genomic bins can be defined by bin_sizes and
21
chrom_size_path. Space separated sizes of genome bins,
22
each size will be count separately. (default: None)
23
--region_bed_paths REGION_BED_PATHS [REGION_BED_PATHS ...]
24
Arbitrary genomic regions can be defined in several
25
BED files to count on. Space separated paths to each
26
BED files, the fourth column of BED file should be
27
unique id of the region. (default: None)
28
--region_bed_names REGION_BED_NAMES [REGION_BED_NAMES ...]
29
Space separated names for each BED file provided in
30
region_bed_paths. (default: None)
31
--cov_cutoff COV_CUTOFF
32
Max cov filter for a single site in ALLC. Sites with
33
cov > cov_cutoff will be skipped. (default: 9999)
34
--cpu CPU Number of processes to use in parallel. (default: 1)
35
--max_per_mcds MAX_PER_MCDS
36
Maximum number of ALLC files to aggregate into 1 MCDS,
37
if number of ALLC provided > max_per_mcds, will
38
generate MCDS in chunks, with same prefix provided.
39
(default: 3072)
40
--cell_chunk_size CELL_CHUNK_SIZE
41
Size of cell chunk in parallel aggregation. Do not
42
have any effect on results. Large chunksize needs
43
large memory. (default: 100)
44
--dtype {uint8,uint16,uint32,uint64,int8,int16,int32,int64,bool}
45
Data type of MCDS count matrix. Default is np.uint32.
46
For single cell feature count, this can be set to
47
np.uint16 [0, 65536] to decrease file size. The values
48
exceed min/max will be clipped while keep the mc/cov
49
same, and a warning will be sent. (default: uint32)
50
--binarize If set, binarize each single site in each individual
51
ALLC file. This means each cytosine will only
52
contribute at most 1 cov and 0/1 mc, this is suitable
53
to account for single cell ALLC R1 R2 overlap issue,
54
Only use this on single cell ALLC, not bulk ALLC.
55
(default: False)
56
57
required arguments:
58
--allc_table ALLC_TABLE
59
Contain all the ALLC file information in 2 columns: 1.
60
file_uid, 2. file_path. No header (default: None)
61
--output_prefix OUTPUT_PREFIX
62
Output prefix of the MCDS (default: None)
63
--chrom_size_path CHROM_SIZE_PATH
64
Path to UCSC chrom size file. This can be generated
65
from the genome fasta or downloaded via UCSC
66
fetchChromSizes tools. All ALLCools functions will
67
refer to this file whenever possible to check for
68
chromosome names and lengths, so it is crucial to use
69
a chrom size file consistent to the reference fasta
70
file ever since mapping. ALLCools functions will not
71
change or infer chromosome names. (default: None)
72
--mc_contexts MC_CONTEXTS [MC_CONTEXTS ...]
73
Space separated mC context patterns to extract from
74
ALLC. The context length should be the same as ALLC
75
file context. Context pattern follows IUPAC nucleotide
76
code, e.g. N for ATCG, H for ATC, Y for CT. (default:
77
None)
Copied!

Example command

1
allcools generate-mcds \
2
--allc_table AllcPaths.tsv \
3
--output_prefix /output_dir/lib_name \ # you don't need to add .mcds, will add automatically
4
--chrom_size_path chrom_size_file \
5
--mc_contexts CGN CHN \ # H means ATC, N means ATCG, this means both mCG and mCH will be counted for each feature
6
--bin_sizes 100000 \ # MCDS will generate separate array counting non-overlapping genomic bins at this length, no bed files needed for non-overlapping genomic bins.
7
--region_bed_paths gene_body.bed.gz \ # MCDS will generate separate array counting regions in each of the bed file(s)
8
--region_bed_names gene \ # name of the bed file(s)
9
--cov_cutoff 2 \ # for single cell ALLC, skip rows having > 2 cov
10
--cpu 20 \ # mcds can run in parallel
Copied!

Run time and memory

The speed and memory usage of MCDS depends on the number of cells and the number of features used. In the above example, the speed is about 20 cells / CPU / Hour. Usually, for a 3000 cell library, it finished in several hours. The MCDS generation is memory intensive, it's probably safe to give each CPU 5 GB memory, and no less than 40 GB in total.

Output

A single netCDF4 based MCDS file contains all the cell-by-feature raw count arrays. If you provide multiple feature sets (e.g., both genomic bins and gene body), they will all be saved in this MCDS file, while you can easily select them using ALLCools and/or xarray.