重复附加到一个大列表(Python 2.6.6)

2022-01-22 00:00:00 python list performance append

问题描述

我有一个项目,我正在通过串行端口从微控制器读取 ASCII 值(如下所示:AA FF BA 11 43 CF 等)输入很快(38 个两个字符集/秒).我正在接受此输入并将其附加到所有测量的运行列表中.

I have a project where I am reading in ASCII values from a microcontroller through a serial port (looks like this : AA FF BA 11 43 CF etc) The input is coming in quickly (38 two character sets / second). I'm taking this input and appending it to a running list of all measurements.

大约 5 小时后,我的列表已增长到大约 855000 个条目.

After about 5 hours, my list has grown to ~ 855000 entries.

我了解到,列表越大,列表操作就越慢.我的意图是让这个测试运行 24 小时,这应该会产生大约 300 万个结果.

I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.

有没有比 list.append() 更有效、更快的方法来追加到列表?

Is there a more efficient, faster way to append to a list then list.append()?

谢谢大家.


解决方案

我了解到,列表越大,列表操作就越慢.

I'm given to understand that the larger a list becomes, the slower list operations become.

一般情况下并非如此.Python 中的列表,尽管有名称,但不是链表,而是数组.数组上有一些 O(n) 操作(例如复制和搜索),但您似乎没有使用任何这些操作.作为一个经验法则:如果它被广泛使用和惯用,一些聪明的人就会去选择一种聪明的方式来做这件事.list.append 是一个广泛使用的内置函数(底层 C 函数也用于其他地方,例如列表推导式).如果有更快的方法,它已经在使用了.

That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.

当您检查 源代码时,您将看到, 列表是过度分配的,即当它们被调整大小时,它们为一个项目分配了比需要更多的空间,因此可以附加下一个 n 项目而不需要另一个调整大小(即 O(n)).增长不是恒定的,它与列表大小成正比,因此随着列表变大,调整大小变得越来越少.这是 listobject.c:list_resize 中确定过度分配的片段:

As you will see when you inspect the source code, lists are overallocating, i.e. when they are resized, they allocate more than needed for one item so the next n items can be appended without need to another resize (which is O(n)). The growth isn't constant, it is proportional with the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:

/* This over-allocates proportional to the list size, making room
 * for additional growth.  The over-allocation is mild, but is
 * enough to give linear-time amortized behavior over a long
 * sequence of appends() in the presence of a poorly-performing
 * system realloc().
 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 */
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

正如 Mark Ransom 所指出的,较旧的 Python 版本(<2.7、3.0)存在一个错误,导致 GC 破坏了这一点.如果你有这样的 Python 版本,你可能想要禁用 gc.如果你不能因为你产生了太多的垃圾(这会导致重新计数),那么你就不走运了.

As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that make the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't because you generate too much garbage (that slips refcounting), you're out of luck though.

相关文章