Automatically detecting whether a CSV file has headers

2021-12-29 00:00:00 csv algorithm automation php

Short question: How do I automatically detect whether a CSV file has headers in the first row?

Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.

I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).

I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:

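  1. The headers contain numeric data for some reason
  2. The first few rows (or large portions of the CSV) are blank
  3. The headers and data look too similar to tell apart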

If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".

I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

Recommended answer

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however: for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:

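  • The first row has columns that are not strings, or are empty
  • The first row's columns are not all unique
  • The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)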
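
The three heuristics above could be sketched roughly as follows. This is only an illustration in Python (the logic is small enough to port to PHP directly), not a definitive implementation. The function names, the `DATE_LIKE` pattern, and the choice to skip leading blank rows are all assumptions made for the example; a real parser would probably want to tune the pattern and might prefer a warning over a hard yes/no.

```python
import csv
import io
import re

# Assumed pattern for "xx-xx-xx"-style values (dates and similar); widen as needed.
DATE_LIKE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")


def is_number(cell):
    """Return True if the cell parses as a number.

    Note: float() also accepts strings such as "nan" and "inf".
    """
    try:
        float(cell)
        return True
    except ValueError:
        return False


def first_row_looks_like_data(row):
    """Return the heuristics (as short labels) suggesting the row is NOT a header."""
    cells = [c.strip() for c in row]
    reasons = []
    # Heuristic 1: a cell is empty or is not a plain string (here: numeric).
    if any(c == "" or is_number(c) for c in cells):
        reasons.append("empty or non-string cell")
    # Heuristic 2: the cells are not all unique.
    if len(set(cells)) != len(cells):
        reasons.append("duplicate cells")
    # Heuristic 3: a cell looks like a date or a similar common data format.
    if any(DATE_LIKE.match(c) for c in cells):
        reasons.append("date-like cell")
    return reasons


def has_header(text):
    """Best guess: True if the first non-blank row looks like a header.

    Returns None when the input has no non-blank rows, so the caller can
    warn or fail instead of guessing.
    """
    for row in csv.reader(io.StringIO(text)):
        if any(c.strip() for c in row):  # skip leading blank rows (assumption)
            return not first_row_looks_like_data(row)
    return None
```

Calling `has_header()` on the first few kilobytes of a file is cheap, since only the first non-blank row is inspected. Returning the list of reasons from `first_row_looks_like_data()` leaves room for emitting a warning when the guess rests on a single weak signal.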
