Why does concatenation of DataFrames get exponentially slower?

Problem description

I have a function which processes a DataFrame, largely to bin the data into buckets and create a binary matrix of features in a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data with this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:

import numpy as np
import pandas as pd

chunks = (len(df) // 10000) + 1
df_list = np.array_split(df, chunks)

pd.get_dummies(df[col]) will automatically create new columns based on the contents of df[col], and these are likely to differ for each df in df_list.
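A side note on the differing columns: when the chunks produce different dummy columns, pd.concat takes the union of the columns and fills the missing entries with NaN, which can then be filled with 0 to keep the matrix binary. A minimal sketch of this behaviour (the category values here are made up purely for illustration):

import pandas as pd

chunk_a = pd.DataFrame({"col": ["red", "blue"]})
chunk_b = pd.DataFrame({"col": ["blue", "green"]})

dummies_a = pd.get_dummies(chunk_a["col"], dtype=int)   # columns: blue, red
dummies_b = pd.get_dummies(chunk_b["col"], dtype=int)   # columns: blue, green

# pd.concat unions the columns; missing entries become NaN,
# so fill them with 0 to keep the matrix binary.
combined = pd.concat([dummies_a, dummies_b], axis=0).fillna(0).astype(int)
print(combined)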

After processing, I concatenate the DataFrames back together using:

import datetime

super_x = pd.DataFrame()   # accumulators start out empty
super_y = pd.DataFrame()

for i, df_chunk in enumerate(df_list):
    print("chunk", i)
    x, y = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print(datetime.datetime.utcnow())

The processing time of the first chunk is perfectly acceptable, but it grows with each subsequent chunk! This has nothing to do with preprocess_data(df_chunk), as there is no reason for it to slow down. Is this increase in time a result of the call to pd.concat()?

See the log below:

chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693 
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!

Open to any other suggestions in Python!


Solution

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and the data from the old DataFrames has to be copied into it. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|-----------|---------------------|-----------|------------------|
| 0         | 0                   | 1         | 1                |
| 1         | 1                   | 1         | 2                |
| 2         | 2                   | 1         | 3                |
| ...       | ...                 | ...       | ...              |
| N-1       | N-1                 | 1         | N                |

Since 1 + 2 + 3 + ... + N = N(N+1)/2, O(N**2) copies are required to complete the loop.
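A small sketch of this effect (the chunk size and count are arbitrary choices for illustration; the point is that the per-iteration time grows even though every chunk is the same size):

import time
import numpy as np
import pandas as pd

# Every chunk is identical in size, yet each iteration gets slower because the
# accumulated result has to be copied again on every pd.concat call.
chunks = [pd.DataFrame(np.random.rand(10000, 20)) for _ in range(30)]

result = pd.DataFrame()
for i, chunk in enumerate(chunks):
    start = time.time()
    result = pd.concat([result, chunk], axis=0)
    print("chunk", i, round(time.time() - start, 4), "seconds")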

Now consider:

super_x = []
for i, df_chunk in enumerate(df_list):
    x, y = preprocess_data(df_chunk)
    super_x.append(x)                  # O(1) list append, no DataFrame copying
super_x = pd.concat(super_x, axis=0)   # single concatenation after the loop

Appending to a list is an O(1) operation and does not require copying. There is now a single call to pd.concat after the loop is done. This call to pd.concat requires N copies, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
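Applied to the loop from the question, the same pattern would look roughly like this (preprocess_data and df_list are the names from the question; it is assumed that preprocess_data returns two DataFrames):

x_parts, y_parts = [], []
for i, df_chunk in enumerate(df_list):
    x, y = preprocess_data(df_chunk)   # assumed to return two DataFrames
    x_parts.append(x)                  # O(1) list append, no copying
    y_parts.append(y)

# One concatenation per output instead of one per chunk.
super_x = pd.concat(x_parts, axis=0)
super_y = pd.concat(y_parts, axis=0)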
