PANDAS.READ_EXCEL()输出'溢出错误:日期值超出范围'而不存在日期列

2022-04-07 00:00:00 python pandas openpyxl

问题描述

我正在尝试将一个大的Excel文件(400k X 40)导入到Pandas DataFrame。虽然它在我的本地机器上运行得很好,但当它被移植到一台Linux服务器上时,它就崩溃了,该服务器的Python=3.7、Pandas=1.2.4和Openpyxl=3.0.7。在当地,我有稍微老一点的套餐。我已经尝试了有关类型和日期的所有参数星座:

df = pd.read_excel(fpath)
df = pd.read_excel(fpath, dtype=str, parse_dates=['the_only_actual_date_column']) # all dates are within 2017
df = pd.read_excel(fpath, dtype={k:str for k in column_names})
df = pd.read_excel(fpath, converters={k:str for k in column_names})

但都不起作用,我总是收到相同的错误:

OverflowError: date value out of range

我怀疑某个地方的随机单元格被视为日期,但如果我声明将所有内容都视为字符串,日期值怎么可能是一个问题呢?即使我只对一列指定usecols,而该列绝对不是日期,也会发生同样的情况。

编辑:完整错误消息

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-12-d38de141bf91> in <module>
----> 1 fu = pd.read_excel(fpath, nrows=10)

/opt/tljh/user/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    365             skipfooter=skipfooter,
    366             convert_float=convert_float,
--> 367             mangle_dupe_cols=mangle_dupe_cols,
    368         )
    369     finally:

/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
   1188             convert_float=convert_float,
   1189             mangle_dupe_cols=mangle_dupe_cols,
-> 1190             **kwds,
   1191         )
   1192 

/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    490                 sheet = self.get_sheet_by_index(asheetname)
    491 
--> 492             data = self.get_sheet_data(sheet, convert_float)
    493             usecols = maybe_convert_usecols(usecols)
    494 

/opt/tljh/user/lib/python3.7/site-packages/pandas/io/excel/_openpyxl.py in get_sheet_data(self, sheet, convert_float)
    546         data: List[List[Scalar]] = []
    547         last_row_with_data = -1
--> 548         for row_number, row in enumerate(sheet.rows):
    549             converted_row = [self._convert_cell(cell, convert_float) for cell in row]
    550             if not all(cell == "" for cell in converted_row):

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_read_only.py in _cells_by_row(self, min_col, min_row, max_col, max_row, values_only)
     77                                  data_only=self.parent.data_only, epoch=self.parent.epoch,
     78                                  date_formats=self.parent._date_formats)
---> 79         for idx, row in parser.parse():
     80             if max_row is not None and idx > max_row:
     81                 break

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse(self)
    153                 element.clear()
    154             elif tag_name == ROW_TAG:
--> 155                 row = self.parse_row(element)
    156                 element.clear()
    157                 yield row

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse_row(self, row)
    284             self.row_dimensions[str(self.row_counter)] = attrs
    285 
--> 286         cells = [self.parse_cell(el) for el in row]
    287         return self.row_counter, cells
    288 

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in <listcomp>(.0)
    284             self.row_dimensions[str(self.row_counter)] = attrs
    285 
--> 286         cells = [self.parse_cell(el) for el in row]
    287         return self.row_counter, cells
    288 

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/worksheet/_reader.py in parse_cell(self, element)
    205                     try:
    206                         value = from_excel(
--> 207                             value, self.epoch, timedelta=style_id in self.timedelta_formats
    208                         )
    209                     except ValueError:

/opt/tljh/user/lib/python3.7/site-packages/openpyxl/utils/datetime.py in from_excel(value, epoch, timedelta)
    120     if 0 < value < 60 and epoch == WINDOWS_EPOCH:
    121         day += 1
--> 122     return epoch + datetime.timedelta(days=day) + diff
    123 
    124 

OverflowError: date value out of range

解决方案

如果您在使用PANDA读取EXCEL/csv文件时遇到此问题,则请检查您的EXCEL/csv文件,任何一列或多列都将具有类似#的值,这意味着值为负数ex:(-11111)或日期太长而无法放入单元格

import pandas as pd

df = pd.read_excel(file path,dtype='string') #this will convert all the column 
     type to string

or if you want to convert specific column then 

df = pd.read_excel(file path,converters={'column name':str})

change the dtype according your need 

df = pd.read_excel(fpath)
df = pd.read_excel(fpath, dtype='string')

you dont need to loop through each column to change the dtype, you can do it once

相关文章