PANDAS.READ_EXCEL()输出';溢出错误:日期值超出范围';而不存在日期列
问题描述
我正在尝试将一个大的Excel文件(400k X 40)导入到Pandas DataFrame。虽然它在我的本地机器上运行得很好,但当它被移植到一台Linux服务器上时,它就崩溃了,该服务器的Python=3.7、Pandas=1.2.4和Openpyxl=3.0.7。在当地,我有稍微老一点的套餐。我已经尝试了有关类型和日期的所有参数星座:
df = pd.read_excel(fpath)
df = pd.read_excel(fpath, dtype=str, parse_dates=['the_only_actual_date_column']) # all dates are within 2017
df = pd.read_excel(fpath, dtype={k:str for k in column_names})
df = pd.read_excel(fpath, converters={k:str for k in column_names})
但都不起作用,我总是收到相同的错误:
OverflowError: date value out of range
我怀疑某个地方的随机单元格被视为日期,但如果我声明将所有内容都视为字符串,日期值怎么可能是一个问题呢?即使我只对一列指定usecols
,而该列绝对不是日期,也会发生同样的情况。
编辑:完整错误消息
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-12-d38de141bf91> in <module>
----> 1 fu = pd.read_excel(fpath, nrows=10)
/opt/tljh/user/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
297 )
298 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299 return func(*args, **kwargs)
300
301 return wrapper
/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
365 skipfooter=skipfooter,
366 convert_float=convert_float,
--> 367 mangle_dupe_cols=mangle_dupe_cols,
368 )
369 finally:
/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
1188 convert_float=convert_float,
1189 mangle_dupe_cols=mangle_dupe_cols,
-> 1190 **kwds,
1191 )
1192
/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
490 sheet = self.get_sheet_by_index(asheetname)
491
--> 492 data = self.get_sheet_data(sheet, convert_float)
493 usecols = maybe_convert_usecols(usecols)
494
/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_openpyxl.py in get_sheet_data(self, sheet, convert_float)
546 data: List[List[Scalar]] = []
547 last_row_with_data = -1
--> 548 for row_number, row in enumerate(sheet.rows):
549 converted_row = [self._convert_cell(cell, convert_float) for cell in row]
550 if not all(cell == "" for cell in converted_row):
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_read_only.py in _cells_by_row(self, min_col, min_row, max_col, max_row, values_only)
77 data_only=self.parent.data_only, epoch=self.parent.epoch,
78 date_formats=self.parent._date_formats)
---> 79 for idx, row in parser.parse():
80 if max_row is not None and idx > max_row:
81 break
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse(self)
153 element.clear()
154 elif tag_name == ROW_TAG:
--> 155 row = self.parse_row(element)
156 element.clear()
157 yield row
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse_row(self, row)
284 self.row_dimensions[str(self.row_counter)] = attrs
285
--> 286 cells = [self.parse_cell(el) for el in row]
287 return self.row_counter, cells
288
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in <listcomp>(.0)
284 self.row_dimensions[str(self.row_counter)] = attrs
285
--> 286 cells = [self.parse_cell(el) for el in row]
287 return self.row_counter, cells
288
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse_cell(self, element)
205 try:
206 value = from_excel(
--> 207 value, self.epoch, timedelta=style_id in self.timedelta_formats
208 )
209 except ValueError:
/opt/tljh/user/lib/python3.7/site-packages/openpyxl/utils/datetime.py in from_excel(value, epoch, timedelta)
120 if 0 < value < 60 and epoch == WINDOWS_EPOCH:
121 day += 1
--> 122 return epoch + datetime.timedelta(days=day) + diff
123
124
OverflowError: date value out of range
解决方案
如果您在使用PANDA读取EXCEL/csv文件时遇到此问题,则请检查您的EXCEL/csv文件,任何一列或多列都将具有类似#的值,这意味着值为负数ex:(-11111)或日期太长而无法放入单元格
import pandas as pd
df = pd.read_excel(file path,dtype='string') #this will convert all the column
type to string
or if you want to convert specific column then
df = pd.read_excel(file path,converters={'column name':str})
change the dtype according your need
df = pd.read_excel(fpath)
df = pd.read_excel(fpath, dtype='string')
you dont need to loop through each column to change the dtype, you can do it once
相关文章