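Chen and Yang Lab: Multifurcating Developmental Cell Lineage Tree Alignment

Project description

mDELTA

mDELTA is an algorithm for Multifurcating Developmental cEll Lineage Tree Alignment. In essence, it compares two rooted, unordered, tip-labeled trees and finds the best global/local correspondence between their nodes. The mDELTA program is designed to analyze developmental cell lineage trees reconstructed from single-cell DNA barcoding (for example by scGESTALT or SMALT; greater cell coverage is expected to yield more meaningful mDELTA alignments).
Apart from operating on cell lineage trees instead of biological sequences, mDELTA is conceptually similar to sequence alignment. It helps quantify the similarity between lineage trees, separate consensus from variation, find recurring motifs, and facilitate comparative/evolutionary analyses.
This repository also includes Python/R scripts for statistical analysis and visualization of mDELTA results, which aid their biological interpretation.
mDELTA was developed by Jingyu Chen under the supervision of Prof. Jian-Rong Yang at Zhongshan School of Medicine, Sun Yat-sen University, China.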

Quick start

Installation

pip install modelta

This installs mDELTA and its prerequisites.

Run the program

mDELTA.py -h

Align a tree with itself and output the top three local alignments

mDELTA.py ExampleFile/tree.nwk ExampleFile/tree.nwk -t 3

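Required packages

pandas: DataFrame-based structure for the score matrix.
numpy: required for many of the computations.
munkres: the Kuhn-Munkres algorithm, used to find the optimal (maximum) assignment over the score matrix.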
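
To illustrate what the assignment step computes, here is a minimal sketch; it is not mDELTA's own code. It brute-forces the best one-to-one pairing of child nodes for a small score matrix, whereas the Kuhn-Munkres algorithm provided by munkres reaches the same optimum in polynomial time. The function name best_pairing is hypothetical, pruning is ignored, and the 3x3 matrix holds the scores between the children of node '0' taken from the self-alignment example shown later in this README.

```python
from itertools import permutations

def best_pairing(scores):
    """Return (best_total, pairs) for a square score matrix.

    Brute force over all permutations; suitable for tiny matrices only.
    """
    n = len(scores)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if best_total is None or total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))

# Scores between children '0,0', '0,1', '0,2' of node '0' in both trees
child_scores = [
    [9.0, -1.0, -1.0],
    [-1.0, 3.0, -1.0],
    [-1.0, -1.0, 6.0],
]
total, pairs = best_pairing(child_scores)
print(total, pairs)  # 18.0 [(0, 0), (1, 1), (2, 2)]
```

The total of 18.0 agrees with the entry for node '0' against node '0' in the example score matrix.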

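Optional packages

tqdm: shows progress during the computation stages.
multiprocess: computing P values requires shuffling the original sequence and recomputing many times, so using multiple processes can substantially reduce waiting time.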

Install from source

(1) Offline
Step 1: $ git clone https://github.com/Chenjy0212/modelta.git
Step 2: $ cd modelta, then run "python setup.py install"
(2) Online
$ pip install git+https://github.com/Chenjy0212/modelta.git@main

For Python coders ↓

You can use this package in your own Python code. For example, in a Jupyter notebook:

import modelta
from pprint import pprint
# The first two arguments are required; all others are optional
example = modelta.scoremat(TreeSeqFile = 'ExampleFile/tree.nwk',
                           TreeSeqFile2 = 'ExampleFile/tree.nwk',
                           Name2TypeFile = 'ExampleFile/Name2Type.csv',
                           Name2TypeFile2 = 'ExampleFile/Name2Type.csv',
                           ScoreDictFile = 'ExampleFile/TypeTypeScore.csv',
                           top = 3,
                           notebook = 1,
                           overlap = 5,
                           )
# Pretty-print the returned dictionary
pprint(example)

Result

Matrix Node: |██████████| 121/121 100%
121/121 [00:00<00:00, 2573.11it/s]
{'TopScoreList': [{'Root1_label': 'root',
                   'Root1_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)),a);',
                   'Root1_node': '(((a,b,c),d,(e,f)),a)',
                   'Root1_prune': [],
                   'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)),a);',
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': [],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 21.0,
                   'col': 10,
                   'row': 10},
                  {'Root1_label': '0',
                   'Root1_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)));',
                   'Root1_node': '((a,b,c),d,(e,f))',
                   'Root1_prune': [],
                   'Root1_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)));',
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': ['1'],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 17.0,
                   'col': 10,
                   'row': 9},
                  {'Root1_label': '0,0',
                   'Root1_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)),(a,b,c));',
                   'Root1_node': '(a,b,c)',
                   'Root1_prune': [],
                   'Root1_seq': '(a1,a2,a3)',
                   'Root2_label': '0',
                   'Root2_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)),(a,b,c));',
                   'Root2_node': '((a,b,c),d,(e,f))',
                   'Root2_prune': ['0,1', '0,2,0', '0,2,1'],
                   'Root2_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Score': 6.0,
                   'col': 9,
                   'row': 7}],
 'matrix': Root2  0,0,0  0,0,1  0,0,2  0,1  0,2,0  0,2,1    1  0,0  0,2     0  root
Root1                                                                   
0,0,0    3.0    2.0    2.0  1.0    1.0    0.0  3.0  1.0  0.0  -1.0  -1.0
0,0,1    2.0    3.0    1.0 -1.0   -1.0   -1.0  2.0  1.0 -1.0  -1.0  -1.0
0,0,2    2.0    1.0    3.0 -1.0   -1.0   -1.0  2.0  1.0 -1.0  -1.0  -1.0
0,1      1.0   -1.0   -1.0  3.0    0.0   -1.0  1.0 -1.0 -1.0  -1.0  -1.0
0,2,0    1.0   -1.0   -1.0  0.0    3.0   -1.0  1.0 -1.0  2.0  -1.0  -1.0
0,2,1    0.0   -1.0   -1.0 -1.0   -1.0    3.0  0.0 -1.0  2.0  -1.0  -1.0
1        3.0    2.0    2.0  1.0    1.0    0.0  3.0  1.0  0.0  -1.0  -1.0
0,0      1.0    1.0    1.0 -1.0   -1.0   -1.0  1.0  9.0 -1.0   6.0   5.0
0,2      0.0   -1.0   -1.0 -1.0    2.0    2.0  0.0 -1.0  6.0   2.0   1.0
0       -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  6.0  2.0  18.0  17.0
root    -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  5.0  1.0  17.0  21.0,
 'score_dict': {'a_a': 3.0,
                'a_b': 2.0,
                'a_c': 2.0,
                'a_d': 1.0,
                'a_e': 1.0,
                'a_f': 0.0,
                'b_a': 2.0,
                'b_b': 3.0,
                'b_c': 1.0,
                'b_d': -1.0,
                'b_e': -1.0,
                'b_f': -1.0,
                'c_a': 2.0,
                'c_b': 1.0,
                'c_c': 3.0,
                'c_d': -1.0,
                'c_e': -1.0,
                'c_f': -1.0,
                'd_a': 1.0,
                'd_b': -1.0,
                'd_c': -1.0,
                'd_d': 3.0,
                'd_e': 0.0,
                'd_f': -1.0,
                'e_a': 1.0,
                'e_b': -1.0,
                'e_c': -1.0,
                'e_d': 0.0,
                'e_e': 3.0,
                'e_f': -1.0,
                'f_a': 0.0,
                'f_b': -1.0,
                'f_c': -1.0,
                'f_d': -1.0,
                'f_e': -1.0,
                'f_f': 3.0},
 'tree1_leaves_celltype': 'a;b;c;d;e;f;a',
 'tree1_leaves_label': '0,0,0;0,0,1;0,0,2;0,1;0,2,0;0,2,1;1',
 'tree1_leaves_nodename': 'a1;a2;a3;a4;a5;a6;a1',
 'tree2_leaves_celltype': 'a;b;c;d;e;f;a',
 'tree2_leaves_label': '0,0,0;0,0,1;0,0,2;0,1;0,2,0;0,2,1;1',
 'tree2_leaves_nodename': 'a1;a2;a3;a4;a5;a6;a1'}
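
The node labels in the output above (for example '0,0,1') look like the path of child indices leading from the root to each node. The following minimal sketch reproduces the leaf labels under that assumption. parse_newick and leaf_labels are hypothetical helpers, not part of modelta, and the parser handles only branch-length-free Newick strings with unlabeled internal nodes.

```python
def parse_newick(text):
    """Parse a branch-length-free Newick string into nested lists of tip names."""
    text = text.strip().rstrip(';')
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == '(':
            pos += 1  # skip '('
            children = [parse_node()]
            while text[pos] == ',':
                pos += 1  # skip ','
                children.append(parse_node())
            pos += 1  # skip ')'
            return children
        start = pos
        while pos < len(text) and text[pos] not in ',()':
            pos += 1
        return text[start:pos]

    return parse_node()

def leaf_labels(node, path=()):
    """Yield (label, tip_name) pairs, where label is the child-index path."""
    if isinstance(node, str):
        yield ','.join(map(str, path)), node
    else:
        for i, child in enumerate(node):
            yield from leaf_labels(child, path + (i,))

tree = parse_newick('(((a1,a2,a3),a4,(a5,a6)),a1);')
labels = list(leaf_labels(tree))
print(';'.join(label for label, _ in labels))
# 0,0,0;0,0,1;0,0,2;0,1;0,2,0;0,2,1;1
```

The printed string matches the 'tree1_leaves_label' entry of the example result.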

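Parameters

Parameters marked with * are required; all others are optional.

TreeSeqFile & TreeSeqFile2: [ path/filename * ] Cell lineage tree files with branch-length information removed. For the expected format, see ExampleFile/tree.nwk
mv: [ float, default = 2. ] Matching score between identical nodes; typically used when ScoreDictFile is left at its default.
pv: [ float, default = -1. ] Pruning score between different nodes.
top: [ int > 0, default = 0 ] Number of top-scoring entries to select from the score matrix. With the default value, the result is: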

{'T1root_T2root': [{'Root1_label': 'root',
                    'Root1_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root1_match_tree': '(((a,b,c),d,(e,f)),a);',
                    'Root1_node': '(((a,b,c),d,(e,f)),a)',
                    'Root1_prune': [],
                    'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Root2_label': 'root',
                    'Root2_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root2_match_tree': '(((a,b,c),d,(e,f)),a);',
                    'Root2_node': '(((a,b,c),d,(e,f)),a)',
                    'Root2_prune': [],
                    'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Score': 21.0,
                    'col': 10,
                    'row': 10}],

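notebook: [ bool, default = False ] Whether the code is run in a Jupyter notebook environment.
Tqdm: [ bool, default = True ] Whether to display a progress bar.
overlap: [ int > 0, default = 0 ] Among local results, a later alignment may not share X% or more of its node pairs with earlier results.
merge: [ int > 0, default = 0 ] Merge internal nodes when pruning.

For qualitative calculation:

Name2TypeFile & Name2TypeFile2: [ path/filename * ] Maps tree node names to types. For the expected format, see ExampleFile/Name2Type.csv
ScoreDictFile: [ path/filename, default = '' ] Defines the matching scores between nodes. For the expected format, see ExampleFile/socrefile.csv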

The matching score between nodes is determined according to the "ScoreDictFile" file.
If the file is empty, only the same nodes are taken for pairing, and the default matching score is 2 (float)

node: a <-> a = 2.(custom)
      b <-> b = 3.(custom)
      a <-> b = ?(custom)
The higher the score, the stronger the similarity
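
As a rough illustration of the lookup described above, the sketch below loads a three-column score table into a dictionary keyed 'x_y' (the key style seen in the 'score_dict' output) and falls back to mv for identical types. The helper names, the absence of a header row, and the choice to return None for unlisted pairs of different types are all assumptions, not modelta's documented behavior.

```python
import csv
import io

def load_pair_scores(handle):
    """Read rows of (type in tree 1, type in tree 2, score) into a dict keyed 'x_y'."""
    scores = {}
    for type1, type2, value in csv.reader(handle):
        scores[f"{type1}_{type2}"] = float(value)
    return scores

def match_score(scores, type1, type2, mv=2.0):
    """Return the listed score, else mv for identical types, else None."""
    key = f"{type1}_{type2}"
    if key in scores:
        return scores[key]
    if type1 == type2:
        return mv  # default matching score for identical nodes
    return None  # assumption: unlisted pairs of different types are not paired

# Hypothetical file content: three columns, no header row
table = load_pair_scores(io.StringIO("a,a,3\na,b,2\nb,b,3\n"))
print(match_score(table, 'a', 'b'))  # 2.0
print(match_score({}, 'a', 'a'))     # 2.0
print(match_score({}, 'a', 'b'))     # None
```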

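For quantitative calculation:

ScoreDictFile: [ path/filename * ] Defines the matching scores between nodes. For the expected format, see ExampleFile/Qscorefile.csv
Name2TypeFile & Name2TypeFile2: [ path/filename or no input ] Maps tree node names to types. For the expected format, see ExampleFile/Name2Type.csv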

The matching score between nodes is determined according to the "ScoreDictFile" file.
The file is required. You can modify the score of the same node by modifying parameter "mv"

   Gene0  Gene1  Gene2  
a    1      2      3  
b    2      3      4

node: (1-2)**2 + (2-3)**2 + (3-4)**2 #Euclidean distance
Then get the final score according to the smoothing function. 
The lower the score, the stronger the similarity
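
The arithmetic in the example above can be checked with a few lines of Python. Note that the expression as written is the sum of squared differences (3 for rows a and b); the Euclidean distance in the strict sense is its square root. The smoothing function that turns the distance into a final score is not specified here, so this sketch does not attempt to reproduce it.

```python
import math

# Feature vectors from the table above (rows a and b)
features = {
    'a': [1, 2, 3],
    'b': [2, 3, 4],
}

# Sum of squared differences, as in the expression above
squared = sum((x - y) ** 2 for x, y in zip(features['a'], features['b']))
# Euclidean distance (square root of the sum); math.dist needs Python 3.8+
distance = math.dist(features['a'], features['b'])

print(squared)             # 3
print(round(distance, 3))  # 1.732
```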

P-value calculation

modelta.pvalue(times = 3, 
               topscorelist = example['TopScoreList'], 
               ScoreDictFile='',
               CPUs = 50, 
               mv = 2, 
               pv = -1)

Result

 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.05it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.38it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.45it/s]
[[3.0, 4.0, 0.0, 14.0], [4.0, 5.0, 3.0, 11.0], [5.0, 0.0, 1.0, 11.0]]

The returned values are the matching scores obtained over `times` permutations for the `top` highest-scoring alignments.
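
How modelta converts permutation scores into a P value is not spelled out here. One common convention, shown below as an assumption rather than the package's actual formula, is the empirical P value: the fraction of permuted scores at least as large as the observed score, with one added to both numerator and denominator so the estimate is never zero. The function name and the null scores in the example are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of permutation scores >= observed, with add-one correction."""
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)

# Hypothetical null scores from three permutations, observed score 21.0
print(empirical_p_value(21.0, [12.0, 9.0, 15.0]))  # 0.25
```

With only three permutations the smallest attainable value is 0.25, so a larger `times` is needed for a finer-grained estimate.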

Parameters

Parameters marked with * are required; all others are optional.

times: [ int > 0 * ] Number of times the original sequence is shuffled, for example:

times = 3 #Randomly disrupt the nodes, but the structure remains unchanged
(((a,b,c),d,(e,f)),a) -> (((a,b,c),d,(e,f)),a)
                      -> (((a,c,d),b,(a,f)),e)
                      -> (((e,f,a),d,(b,c)),a)

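topscorelist: [ example['TopScoreList'] * ] The list of top scores obtained earlier.
CPUs: [ int > 0, default = 50 ] Multiprocessing can greatly reduce waiting time. The default pool size is 50, but it is limited by local resources to at most the number of local CPU cores minus 1.
The parameters mv, pv, notebook, Tqdm, and overlap are described in detail above.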
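
The shuffling illustrated for `times` (tip labels reassigned at random while the tree structure stays fixed) can be sketched as follows. This is a minimal illustration, not modelta's implementation; shuffle_tips is a hypothetical helper that works directly on a branch-length-free Newick string.

```python
import random
import re

def shuffle_tips(newick, rng=random):
    """Randomly reassign tip labels while keeping the bracket structure intact."""
    tips = re.findall(r"[^(),;]+", newick)
    shuffled = tips[:]
    rng.shuffle(shuffled)
    replacement = iter(shuffled)
    # Replace each tip, in order, with the next label from the shuffled list
    return re.sub(r"[^(),;]+", lambda _: next(replacement), newick)

rng = random.Random(0)  # fixed seed so the run is repeatable
original = "(((a,b,c),d,(e,f)),a)"
for _ in range(3):  # times = 3
    print(shuffle_tips(original, rng))
```

Each printed tree has the same parentheses and the same multiset of labels as the original; only the placement of the labels changes.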

For general users ↓

Quick start

Run the program

mDELTA.py -h

Windows

mDELTA.py ./ExampleFile/tree.nwk ./ExampleFile/tree.nwk -t 3

Linux

./mDELTA.py ../ExampleFile/tree.nwk ../ExampleFile/tree.nwk -t 3

Help

Windows: $mDELTA.py -h
 Linux:  $./mDELTA.py -h

usage: mDELTA [-h] [-nt NAME2TYPEFILE] [-nt2 NAME2TYPEFILE2] [-sd SCOREDICTFILE] [-t TOP] [-ma MAV] [-mi MIV] [-p PV] [-T TQDM]
              [-n NOTEBOOK] [-P PERM] [-a ALG] [-c CPUS] [-o OUTPUT] [-x DIFF] [-mg MERGE]
              TreeSeqFile TreeSeqFile2

Multifurcating Developmental cEll Lineage Tree Alignment (mDELTA)

positional arguments:
  TreeSeqFile           [path/filename] A text file storing cell lineage tree #1 in newick format. Tips can be labeled by name or
                        cell type. Branch lengths should be removed.
  TreeSeqFile2          [path/filename] A text file storing cell lineage tree #2 in newick format. Tips can be labeled by name or
                        cell type. Branch lengths should be removed.

optional arguments:
  -h, --help            show this help message and exit
  -nt NAME2TYPEFILE, --Name2TypeFile NAME2TYPEFILE
                        [path/filename] List of correspondence between tip name and cell type for cell lineage tree #1.
  -nt2 NAME2TYPEFILE2, --Name2TypeFile2 NAME2TYPEFILE2
                        [path/filename] List of correspondence between tip name and cell type for cell lineage tree #2.
  -sd SCOREDICTFILE, --ScoreDictFile SCOREDICTFILE
                        [path/filename] A comma-delimited text file used to determine similarity scores between cells. If there
                        are exactly three columns, they will be interpreted as (1) the cell (name or type) in Tree #1, (2) the
                        cell in Tree #2, and (3) the similarity score. If otherwise, the first column will be interpreted as the
                        cell (name or type) and the remaining columns as features of the cell (e.g. expression of a gene). The
                        similarity scores will be estimated between all pairs of cells based on the Euclidean distance calculated
                        using all the features. Overrides `-ma` and `-mi`.
  -t TOP, --top TOP     [int > 0] Performs local (instead of global) alignment, and output the top NUM local alignments with the
                        highest score (e.g. `-t 10`). In the case of global alignment, this parameter should be omitted.
  -ma MAV, --mav MAV    [float]
  -mi MIV, --miv MIV    [float] Shorthand for a simple matching score scheme, where the matching score between a pair of the same
                        cell types is MAV and all other pairs are MIV. (e.g. `-ma 2 -mi -2`). Overridden by `-sd`.
  -p PV, --pv PV        [float] The score for pruning a tip of the tree (e.g. `-p -2`). Default to -1.
  -T TQDM, --Tqdm TQDM  [0(off) or 1(on)] Toggle for the progress bar.
  -n NOTEBOOK, --notebook NOTEBOOK
                        [0(off) or 1(on)] Toggle for the jupyter notebook environment.
  -P PERM, --PERM PERM  [int > 0] Toggle for the statistical significance. For each observed alignment, the aligned trees will be
                        permuted PERM times to generate a null distribution of alignment scores, with which a P value can be
                        calculated for the observed alignment score.
  -a ALG, --Alg ALG     [KM / GA] Use Kuhn-Munkres or Greedy Algorithm to find the optimal alignment score.
  -c CPUS, --CPUs CPUS  [int > 0] Number of processes for multi-processing. Defaults to 50, limited to at most the number of
                        local CPU cores - 1.
  -o OUTPUT, --output OUTPUT
                        [path/filename] Output filename
  -x DIFF, --diff DIFF  [int > 0] Alignment must consist of a minimum of DIFF% aligned cell pairs that are different from
                        previous (better) local alignments in order to be considered as another new alignment (e.g. `-x 20`
                        means 20 percent).
  -mg MERGE, --merge MERGE
                        [float] This is the scaling factor for calculating the score of merging an internal node (e.g. -mg -1),
                        which is multiplied by the number of tips of the internal node to be merged. Default to 0.

More details on https://github.com/Chenjy0212/modelta

Citation

If you use this project in your research, please cite it.

@misc{modelta2022,
    author = {Jingyu Chen},
    title = {mDELTA: Multifuricating Developmental cEll Lineage Tree Alignment},
    year = {2022},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Chenjy0212/modelta}},
}