pandas 和 numpy 的意思不同
问题描述
我有一个 MEMS IMU,我一直在其上收集数据,我正在使用 pandas 从中获取一些统计数据.每个周期收集 6 个 32 位浮点数.对于给定的收集运行,数据速率是固定的.数据速率在 100Hz 和 1000Hz 之间变化,采集时间长达 72 小时.数据保存在一个平面二进制文件中.我是这样读取数据的:
I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:
import numpy as np
import pandas as pd
dataType=np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('e','<f4')])
df=pd.DataFrame(np.fromfile('FILENAME',dataType))
df['c'].mean()
-9.880581855773926
x=df['c'].values
x.mean()
-9.8332081
-9.833 是正确的结果.我可以创建一个类似的结果,有人应该能够以这种方式重复:
-9.833 is the correct result. I can create a similar result that someone should be able to repeat this way:
import numpy as np
import pandas as pd
x=np.random.normal(-9.8,.05,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-9.859579086303711
x.mean()
-9.8000648778888628
我在 linux 和 windows、AMD 和 Intel 处理器、Python 2.7 和 3.5 上重复了这一点.我难住了.我究竟做错了什么?得到这个:
I've repeated this on linux and windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:
x=np.random.normal(-9.,.005,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-8.999998092651367
x.mean()
-9.0000075889406528
我可以接受这种差异.它处于 32 位浮点数精度的极限.
I could accept this difference. It's at the limit of the precision of 32 bit floats.
没关系.我在星期五写了这篇文章,今天早上我遇到了解决方案.这是由于大量数据而加剧的浮点精度问题.我需要以这种方式在创建数据帧时将数据转换为 64 位浮点数:
NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:
df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')
如果其他人遇到类似问题,我会离开帖子.
I'll leave the post should anyone else run into a similar issue.
解决方案
短版:
之所以不同,是因为 pandas
在调用 mean
操作时使用了 bottleneck
(如果已安装),而不是仅仅依赖numpy
.bottleneck
可能被使用,因为它似乎比 numpy
更快(至少在我的机器上),但以精度为代价.它们恰好匹配 64 位版本,但在 32 位土地上有所不同(这是有趣的部分).
Short version:
The reason it's different is because pandas
uses bottleneck
(if it's installed) when calling the mean
operation, as opposed to just relying on numpy
. bottleneck
is presumably used since it appears to be faster than numpy
(at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).
仅通过检查这些模块的源代码很难判断发生了什么(它们非常复杂,即使对于像 mean
这样的简单计算,数值计算也很困难).最好使用调试器来避免大脑编译和那些类型的错误.调试器不会在逻辑上出错,它会准确地告诉你发生了什么.
It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean
, turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic, it'll tell you exactly what's going on.
这是我的一些堆栈跟踪(值略有不同,因为没有 RNG 种子):
Here's some of my stack trace (values differ slightly since no seed for RNG):
可以重现(Windows):
>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029
numpy
的版本没有什么特别之处.pandas
版本有点古怪.
Nothing extraordinary going on with numpy
's version. It's the pandas
version that's a little wacky.
我们来看看df['x'].mean()
:
>>> def test_it_2():
... import pdb; pdb.set_trace()
... df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308 if we have an ndarray as a value, then simply perform the operation,
2309 otherwise delegate to the object
2310
2311 """
2312 -> delegate = self._values
2313 if isinstance(delegate, np.ndarray):
2314 # Validate that 'axis' is consistent with Series's single axis.
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315 self._get_axis_number(axis)
2316 if numeric_only:
2317 raise NotImplementedError('Series.{0} does not implement '
2318 'numeric_only.'.format(name))
2319 with np.errstate(all='ignore'):
2320 -> return op(delegate, skipna=skipna, **kwds)
2321
2322 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323 numeric_only=numeric_only,
2324 filter_type=filter_type, **kwds)
所以我们找到了问题所在,但现在事情变得有点奇怪:
So we found the trouble spot, but now things get kind of weird:
(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029
注意 delegate.mean()
和 np.nanmean
输出 -9.0000029
类型为 float32
,not -9.0
不像 pandas
nanmean
那样.稍微摸索一下,您可以在 pandas.core.nanops
中找到 pandas
nanmean
的源代码.有趣的是,它实际上看起来应该首先匹配numpy
.我们来看看pandas
nanmean
:
Note that delegate.mean()
and np.nanmean
output -9.0000029
with type float32
, not -9.0
as pandas
nanmean
does. With a bit of poking around, you can find the source to pandas
nanmean
in pandas.core.nanops
. Interestingly, it actually appears like it should be matching numpy
at first. Let's have a look at pandas
nanmean
:
(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("
")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
dtype_sum = dtype_max
dtype_count = np.float64
if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
dtype_sum = np.float64
elif is_float_dtype(dtype):
dtype_sum = dtype
dtype_count = dtype
count = _get_counts(mask, axis, dtype=dtype_count)
the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
if axis is not None and getattr(the_sum, 'ndim', False):
the_mean = the_sum / count
ct_mask = count == 0
if ct_mask.any():
the_mean[ct_mask] = np.nan
else:
the_mean = the_sum / count if count > 0 else np.nan
return _wrap_results(the_mean, dtype)
这是 bottleneck_switch
装饰器的(短)版本:
Here's a (short) version of the bottleneck_switch
decorator:
import bottleneck as bn
...
class bottleneck_switch(object):
def __init__(self, **kwargs):
self.kwargs = kwargs
def __call__(self, alt):
bn_name = alt.__name__
try:
bn_func = getattr(bn, bn_name)
except (AttributeError, NameError): # pragma: no cover
bn_func = None
...
if (_USE_BOTTLENECK and skipna and
_bn_ok_dtype(values.dtype, bn_name)):
result = bn_func(values, axis=axis, **kwds)
这是用 alt
作为 pandas
nanmean
函数调用的,所以 bn_name
是 'nanmean'
,这是从 bottleneck
模块中获取的属性:
This is called with alt
as the pandas
nanmean
function, so bn_name
is 'nanmean'
, and this is the attr that's grabbed from the bottleneck
module:
(Pdb) l
93 result = np.empty(result_shape)
94 result.fill(0)
95 return result
96
97 if (_USE_BOTTLENECK and skipna and
98 -> _bn_ok_dtype(values.dtype, bn_name)):
99 result = bn_func(values, axis=axis, **kwds)
100
101 # prefer to treat inf/-inf as NA, but must compute the fun
102 # twice :(
103 if _has_infs(result):
(Pdb) n
> d:anaconda3libsite-packagespandascore
anops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0
假设 bottleneck_switch()
装饰器一秒钟不存在.我们实际上可以看到,手动单步执行这个函数(没有 bottleneck
)调用会得到与 numpy
相同的结果:
Pretend that bottleneck_switch()
decorator doesn't exist for a second. We can actually see that calling that manually stepping through this function (without bottleneck
) will get you the same result as numpy
:
(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029
但是,如果您安装了 bottleneck
,则永远不会调用它.取而代之的是,bottleneck_switch()
装饰器使用 bottleneck
的版本对 nanmean
函数进行爆破.这就是差异所在(有趣的是,它匹配 float64
案例):
That never gets called, though, if you have bottleneck
installed. Instead, the bottleneck_switch()
decorator instead blasts over the nanmean
function with bottleneck
's version. This is where the discrepancy lies (interestingly it matches on the float64
case, though):
(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807
bottleneck
据我所知,仅用于速度.我假设他们正在使用他们的 nanmean
函数采取某种快捷方式,但我没有深入研究它(有关此主题的详细信息,请参阅@ead 的答案).你可以看到它通常比 numpy
通过他们的基准测试快一点:https://github.com/kwgoodman/bottleneck.显然,要为这种速度付出的代价就是精度.
bottleneck
is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean
function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy
by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.
瓶颈真的更快吗?
确实看起来像(至少在我的机器上).
Sure looks like it (at least on my machine).
In [1]: import numpy as np; import pandas as pd
In [2]: x=np.random.normal(-9.8,.05,size=900000)
In [3]: y_32 = x.astype(np.float32)
In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop
In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 µs per loop
pandas
在这里引入一个标志可能会很好(一个用于速度,另一个用于更好的精度,默认是速度,因为这是当前的实现).有些用户更关心计算的准确性,而不是计算速度.
It might be nice for pandas
to introduce a flag here (one for speed, the other for better precision, default is for speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.
HTH.
相关文章