String comparison in Python, but not Levenshtein distance (I think)

Question

I found a crude string comparison in a paper I am reading. It is done as follows:

The equation they use is given below (extracted from the paper, with small wording changes to make it more general and readable). Since the authors' description is not very clear, I have first tried to explain it in my own words, using an example from the authors.

For example, for the two sequences ABCDE and BCEFA, there are two possible graphs:

graph 1) connects B with B, C with C and E with E

graph 2) connects A with A

I cannot connect A with A while I am connecting the other three (graph 1), since that would produce crossing lines (imagine drawing lines between B-B, C-C and E-E); that is, the line linking A-A would cross the lines linking B-B, C-C and E-E. So these two sequences result in two possible graphs: one has 3 connections (BB, CC and EE) and the other has only one (AA). I then calculate the score d as given by the equation below.
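The "no crossing lines" rule can be made concrete with a small sketch. It is only my reading of the description above, not code from the paper: a link is a pair (i, j) with s[i] == t[j], and a set of links is non-crossing when, sorted by position in the first string, the positions in the second string are strictly increasing too. The function names are my own.

```python
def identity_links(s, t):
    """All (i, j) index pairs where s[i] and t[j] are the same character."""
    return [(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b]


def non_crossing_configurations(s, t):
    """Every non-empty set of links in which no two lines cross."""
    links = sorted(identity_links(s, t))
    configs = []

    def extend(config, start):
        for k in range(start, len(links)):
            i, j = links[k]
            # A new link fits only if it lies strictly to the right of the
            # last link in both strings (no crossing, no shared character).
            if not config or (i > config[-1][0] and j > config[-1][1]):
                new_config = config + [links[k]]
                configs.append(new_config)
                extend(new_config, k + 1)

    extend([], 0)
    return configs


def maximal_configurations(s, t):
    """Configurations that are not contained in a larger non-crossing one."""
    configs = [frozenset(c) for c in non_crossing_configurations(s, t)]
    return [sorted(c) for c in configs if not any(c < other for other in configs)]


print(maximal_configurations("ABCDE", "BCEFA"))
# [[(0, 4)], [(1, 0), (2, 1), (4, 2)]]
```

For ABCDE and BCEFA this gives exactly the two graphs described above: the single A-A link, and the B-B, C-C, E-E links together.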

Consequently, to define the degree of similarity between two penta-strings we calculate the distance d between them. Aligning the two penta-strings, we look for all the identities between their characters, wherever these may be located. If each identity is represented by a link between both penta-strings, we define a graph for this pair. We call any part of this graph a configuration.

Next, we retain all of those configurations in which there is no character cross pairing (the meaning is explained in my example above, i.e., there are no crossings of links between identical characters, and only those graphs are retained). Each of these is then evaluated as a function of the number p of characters related to the graph, the shifting Δi for the corresponding pairs and the gap δij between connected characters of each penta-string. The minimum value is chosen as characteristic and is called distance d:

d = Min(50 - 10p + ΣΔi + Σδij)

Although very rough, this measure is generally in good agreement with the qualitative eye guided estimation. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 = (50 - 30 + 1 + 2).
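To show where the ambiguity sits, here is a sketch of one possible reading of the formula. Everything in it beyond the formula itself is my assumption: Δi is taken as |i - j| for each link, δij as the number of unlinked characters between consecutive linked characters (summed over both strings), and the empty configuration is allowed (score 50). The function names are hypothetical.

```python
def non_crossing_chains(s, t):
    """Yield every non-crossing set of identity links, including the empty one."""
    links = sorted((i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b)

    def extend(chain, start):
        yield chain
        for k in range(start, len(links)):
            i, j = links[k]
            # Keep only links strictly to the right in both strings.
            if not chain or (i > chain[-1][0] and j > chain[-1][1]):
                yield from extend(chain + [links[k]], k + 1)

    yield from extend([], 0)


def score(chain):
    """50 - 10p + sum of shifts + sum of gaps, under the assumptions above."""
    p = len(chain)
    # Assumed shift: how far each linked character moved between the strings.
    shift = sum(abs(i - j) for i, j in chain)
    # Assumed gap: unlinked characters between consecutive links, both strings.
    gap = sum((i2 - i1 - 1) + (j2 - j1 - 1)
              for (i1, j1), (i2, j2) in zip(chain, chain[1:]))
    return 50 - 10 * p + shift + gap


def distance(s, t):
    """Minimum score over all non-crossing configurations."""
    return min(score(chain) for chain in non_crossing_chains(s, t))


print(distance("abcde", "abcfg"))  # 20
print(distance("abcde", "abfcg"))  # 22
```

This reading reproduces the paper's 20 for abcde and abcfg, but gives 22 (50 - 30 + 1 + 1) rather than 23 for abcde and abfcg, so at least one of my assumptions about Δi or δij must differ from the authors' definition.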

I am confused about how to go about implementing this. Any suggestions would be much appreciated.

I tried Levenshtein distance and also simple sequence alignment as used in protein sequence comparison. The link to the paper is: http://peds.oxfordjournals.org/content/16/2/103.long

I could not find any information on the first author, Alain Figureau, and my emails to M.A. Soto have not been answered (as of today).

Thanks


Answer

Just after the text block you cite, there is a reference to a previous paper by the same authors: Secondary Structure of Proteins and Three-dimensional Pattern Recognition. I think it is worth looking into if there is no other explanation of the distance (I'm not at work, so I don't have access to the full document).

Otherwise, you can also try to contact the authors directly. Alain Figureau seems to be an old-school French researcher with no contact details whatsoever (no webpage, no e-mail, no "social networking", ...), so I advise contacting M.A. Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: an experiment's procedure has to be crystal clear in order to be repeatable; that is part of the scientific method in research.
