Convert to log n search - Python 3.7

2022-01-25 python file duplicates compare search

Problem Description

I have this code that works great and does what I want, however it does it in linear form, which is way too slow for the size of my data files, so I want to convert it to log. I tried this code and many others posted here, but still had no luck getting it to work. I will post both sets of code and give examples of what I expect.

import fileinput

'''This code runs fine and does what I expect, removing lines from the big
file that also appear in the small file; however, it runs in linear time.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')

'''This code is my attempt at conversion to a log function.'''


def log_search(small, big):
    first = 0
    last = len(big.txt) - 1        # 'big' is a string, so big.txt raises AttributeError
    while first <= last:
        mid = (first + last) / 2   # float division; binary search needs integer //
        if str(mid) == small.txt:  # compares the *index* mid, not the value at that index
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
        for line in fileinput.input('big.txt', inplace=True):
            if line.rstrip() not in exclude:
                print(line, end='')
            else:
                print('')
            return log_search(small, big)  # returns (and recurses) on the very first line

  1. The big file has millions of lines of int data.
  2. The small file has hundreds of lines of int data.
  3. Compare the data and remove the duplicates from the big file, but leave their lines blank (see the illustration below).
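
For illustration (the question gives no actual sample data, so these values are made up; a blank cell means the line has been blanked out):

    small.txt    big.txt (before)    big.txt (after)
    3            1                   1
    7            3
                 5                   5
                 7
                 9                   9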

Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem in the wrong way. My attempt at converting it to log runs without error but does nothing.


Solution

I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, has a complexity of O(b), with b being the number of lines in big.txt.

What you seem to be trying to do is reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and then removing/overwriting it.

This would work well if all the lines were in a list with random access to any element, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
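
If all the lines did fit into a sorted in-memory list, the standard library's bisect module would give exactly this O(log b) lookup. A minimal sketch, assuming big.txt is sorted numerically (file and variable names are illustrative):

import bisect

# Load the sorted numbers once; this costs O(b) time and memory.
with open('big.txt') as f:
    numbers = [int(line) for line in f]

def contains(sorted_numbers, num):
    # bisect_left returns the leftmost insertion point for num;
    # num is present only if the element at that index equals it.
    i = bisect.bisect_left(sorted_numbers, num)
    return i < len(sorted_numbers) and sorted_numbers[i] == num

print(contains(numbers, 42))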

So, yes, theoretically it can be done in O(s*log b), if you do the following:

  • Implement a binary search that searches not over lines but over the characters of the big file
    • For each position, backtrack to the preceding line break, then read the line to get its number
    • Depending on the comparison, continue in the lower or upper half, as usual for binary search

    On my system, reading and writing a file with 10 million lines of numbers took only 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course it may depend on how often you have to do this operation.
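
    A rough sketch of how such a comparison could be timed (file names are placeholders; the numbers above are from the answerer's machine):

    import time
    import fileinput

    # Plain one-pass copy of the file.
    start = time.perf_counter()
    with open('big.txt') as fin, open('copy.txt', 'w') as fout:
        for line in fin:
            fout.write(line)
    print(f"plain copy: {time.perf_counter() - start:.1f}s")

    # Same rewrite via fileinput with inplace=True; print is
    # redirected into the file while the loop runs.
    start = time.perf_counter()
    with fileinput.input('big.txt', inplace=True) as f:
        for line in f:
            print(line, end='')
    print(f"fileinput rewrite: {time.perf_counter() - start:.1f}s")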

    Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with a matching number of - characters (not just a blank line; that is impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.

    import os
    import random

    def getlineat(f, pos):
        # Walk backwards from pos to the previous line break and
        # return the offset where the line containing pos starts;
        # the file is left positioned at that offset.
        pos = f.seek(pos)
        while pos > 0 and f.read(1) != "\n":
            pos = f.seek(pos-1)
        return pos+1 if pos > 0 else 0

    def bsearch(f, num):
        # Binary search over *character* positions: pick a byte in
        # the middle, snap to the start of its line, read the line,
        # and compare its value to num.
        lower = 0
        upper = os.stat(f.name).st_size - 1
        while lower <= upper:
            mid = (lower + upper) // 2
            pos = getlineat(f, mid)
            line = f.readline()
            if not line: break # end of file
            val = int(line)
            if val == num:
                return (pos, len(line.strip()))
            elif num < val:
                upper = mid - 1
            elif num > val:
                lower = mid + 1
        return (-1, -1)

    def overwrite(filename, to_remove):
        with open(filename, "r+") as f:
            # Find all positions first, then overwrite in a second
            # pass, so the overwriting cannot confuse later searches.
            positions = [bsearch(f, n) for n in to_remove]
            for n, (pos, length) in sorted(zip(to_remove, positions)):
                print(n, pos) # debug output: number and its offset (-1 if absent)
                if pos != -1:
                    f.seek(pos)
                    f.write("-" * length)

    to_remove = [random.randint(-500, 1500) for _ in range(10)]
    overwrite("test.txt", to_remove)
    

    This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search would have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.

    Update: I also tested it with a file of random numbers from 0 to 100,000,000 in sorted order (944 MB), overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*log b), at least on my system (the complexity of file.seek may depend on the file system, file type, etc.).

    The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.
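
    A sketch of that generalization (the bsearch_by name and the value_function parameter are assumptions based on the suggestion above; it reuses getlineat and os from the earlier block):

    def bsearch_by(f, key, value_function=int):
        # Like bsearch above, but the caller supplies the function
        # that maps a raw line to its comparable sort key.
        lower = 0
        upper = os.stat(f.name).st_size - 1
        while lower <= upper:
            mid = (lower + upper) // 2
            pos = getlineat(f, mid)
            line = f.readline()
            if not line:
                break  # ran past the end of the file
            val = value_function(line)
            if val == key:
                return (pos, len(line.strip()))
            elif key < val:
                upper = mid - 1
            else:
                lower = mid + 1
        return (-1, -1)

    # E.g. for a CSV file sorted by its first column:
    # bsearch_by(f, "alice", lambda line: line.split(",")[0])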
