如何将数据从 ASCII (ISO/IEC 8859-1) 导入 Rails/PGSQL 数据库?

我正在尝试使用来自美国农业部的数据:http://www.ars.usda.gov/Services/docs.htm?docid=23634

I am trying to use data from the USDA found here: http://www.ars.usda.gov/Services/docs.htm?docid=23634

ASCII (8.6Mb) - 此文件包含 ASCII (ISO/IEC 8859-1) 分隔文件中的 SR26 数据.这些表以关系格式组织,最好与关系数据库管理系统 (RDBMS) 一起使用,这将允许您形成自己的数据库查询并生成自定义报告.

ASCII (8.6Mb) - This file contains the SR26 data in ASCII (ISO/IEC 8859-1), delimited files. These tables are organized in a relational format, and are best used with a relational database management system (RDBMS), which will allow you to form your own queries of the database and generate custom reports.

我是这样操作数据的新手,我想我想以 CSV 格式获取它,也许吧?但是,那么我可能会失去这些关系,所以也许我应该直接使用 PGSQL.不知道如何处理这个.

I am new to manipulating data like this and think I would like to get it in CSV, maybe? But, then I might lose the relationships so maybe I should go right to PGSQL. Not sure how to approach this.

寻求指导,谢谢.

推荐答案

zip 包含多个文件:

The zip contains a number of files:

  inflating: DATA_SRC.txt            
  inflating: DATSRCLN.txt            
  inflating: DERIV_CD.txt            
  inflating: FD_GROUP.txt            
  inflating: FOOD_DES.txt            
  inflating: FOOTNOTE.txt            
  inflating: LANGDESC.txt            
  inflating: LANGUAL.txt             
  inflating: NUT_DATA.txt            
  inflating: NUTR_DEF.txt            
  inflating: sr26_doc.pdf            
  inflating: SRC_CD.txt              
  inflating: WEIGHT.txt         

每一个似乎都是一种奇怪的几乎类似于 CSV 的格式,例如NUTR_DEF.txt:

each of which appears to be in a bizarre almost-CSV-like format, e.g. NUTR_DEF.txt:

~287~^~g~^~GALS~^~Galactose~^~2~^~2100~
~291~^~g~^~FIBTG~^~Fiber, total dietary~^~1~^~1200~

加上 sr26_doc.pdf,文档.

所以你需要在这里做的是为数据库创建 SQL 表定义 - 每个输入文件一个表.为此,您需要 CREATE TABLE 命令;请参阅 PostgreSQL 文档.

So what you need to do here is create SQL table definitions for the database - with one table for each input file. You need the CREATE TABLE command for this; see the PostgreSQL documentation.

PDF 的第 35 页应该对您有所帮助 - 图 1.美国农业部国家营养数据库标准参考文件之间的关系".以下页面描述了文件格式,告诉您每列的含义.您可以根据此描述编写 CREATE TABLE 语句.

Page 35 of the PDF should help you - "Figure 1. Relationships among files in the USDA National Nutrient Database for Standard Reference". The following pages describe the file formats, telling you what each column means. You can write CREATE TABLE statements based on this description.

这是一个示例,对于 FOOD_DES.txt(食物描述),第一个条目.

Here's an example, for FOOD_DES.txt (food description), the first entry.

CREATE TABLE food_des (
    "NDB_No"      varchar(5) NOT NULL PRIMARY KEY,
    "FdGrp_Cd"    varchar(4) NOT NULL,
    "Long_Desc"   varchar(200) NOT NULL,
    "Shrt_Desc"   varchar(60) NOT NULL,
    "ComName"     varchar(100),
    "ManufacName" varchar(65),
    "Survey"      varchar(1),
    "Ref_desc"    varchar(135),
    "Refuse"      smallint,
    "SciName"     varchar(65),
    "N_Factor"    NUMERIC(4,2),
    "Pro_Factor"  NUMERIC(4,2),
    "Fat_Factor"  NUMERIC(4,2),
    "CHO_Factor"  NUMERIC(4,2)
);

这是描述的一个非常字面的副本.这不是我设计桌子的方式

That's a pretty literal copy of the description. It's not how I'd design the table

我使用 NUMERIC 任意精度十进制浮点类型来确保非整数数字类型的准确性.如果性能比准确性更重要,您可以使用 float4 代替.

I've used NUMERIC arbitrary-precision decimal floating point types for accuracy in non-integer numeric types. If performance is more important than accuracy, you can use float4 instead.

对于关系,您使用 FOREIGN KEY 约束 - 只需 colname coltype REFERENCES othertable(othercol) 就足以创建一个.

For relationships, you use FOREIGN KEY constraints - just colname coltype REFERENCES othertable(othercol) is sufficient to create one.

重要:我将列名用双引号括起来以保留与定义中相同的名称.这意味着您在引用它们时必须始终双引号,例如SELECT "NDB_No" FROM food_des; .如果您不想要那样,请不要使用双引号 - 或选择不同的名称.你不必拘泥于他们使用的笨拙的缩写列名,写起来也很合理:

Important: I double quoted the column names to preserve the same name as in the definitions. That means you have to always double quote them when you refer to them, e.g. SELECT "NDB_No" FROM food_des; . If you don't want that, just leave off the double quotes - or pick different names. You don't have to stick to the clumsy abbreviated column names they used, and it's quite reasonable to write:

CREATE TABLE food_description (
    ndb_no              varchar(5) NOT NULL PRIMARY KEY,
    foodgroup_code      varchar(4) NOT NULL,
    long_description    varchar(200) NOT NULL,
    short_description   varchar(60) NOT NULL,
    common_name         varchar(100),
    manufacturer_name   varchar(65),

等等.同样,如果您正在使用 Rails,您可以将表定义转换为遵循 Rails 的约定,尤其是当您打算通过 Rails 进行数据加载时.

etc. Similarly, if you're working with Rails, you can convert the table definitions to follow Rails's conventions, especially if you then intend to do the data loading via Rails.

如果这些是合理的、合理的分隔文件,那么您可以使用 psql 命令 copy 或 PgAdmin-III 的导入"选项加载每个表.

If these were sane, sensible delimited files you could then just load each table using the psql command copy, or PgAdmin-III's "import" option.

它实际上是 CSV,他们刚刚决定使用完全奇怪的分隔符和引号字符.通过 psql 导入:

It is actually CSV, they've just decided to use totally bizarre delimiter and quote chars. Import via psql with:

copy food_des FROM 'FOOD_DES.txt' (FORMAT CSV, DELIMITER '^', QUOTE '~');

或您用来与 PostgreSQL 对话的任何工具中的等效工具.

or the equivalent in whatever tool you use to talk to PostgreSQL.

结果是一个看起来很合理的表格:

The results are a sensible looking table:

craig=> select * from food_des limit 2;
 NDB_No | FdGrp_Cd |         Long_Desc          |        Shrt_Desc         | ComName | ManufacName | Survey | Ref_desc | Refuse | SciName | N_Factor | Pro_Factor | Fat_Factor | CHO_Factor 
--------+----------+----------------------------+--------------------------+---------+-------------+--------+----------+--------+---------+----------+------------+------------+------------
 01001  | 0100     | Butter, salted             | BUTTER,WITH SALT         |         |             | Y      |          |      0 |         |     6.38 |       4.27 |       8.79 |       3.87
 01002  | 0100     | Butter, whipped, with salt | BUTTER,WHIPPED,WITH SALT |         |             | Y      |          |      0 |         |     6.38 |       4.27 |       8.79 |       3.87
(2 rows)

同样,如果使用 Rails,您可以使用任何您想要的 Rails CSV 库并批量加载到模型中.

Similarly, if using Rails you can use whatever Rails CSV library you want and bulk-load into models.

相关文章