从 SQLlite 数据库中读取许多表并在 R 中组合
我正在使用一个输出结果数据库的程序.我有数百个结构相同的数据库,我想将它们组合成一个大数据库.我最感兴趣的是每个数据库中的 1 个表.我不太使用数据库/sql,但它会简化过程中的其他步骤,跳过输出 csv.
I'm working with a program that outputs a database of results. I have hundreds of these databases that are all identical in structure and I'd like to combine them into ONE big database. I'm mostly interested in 1 table from each database. I don't work with databases/sql very much, but it would simplify other steps in the process, to skip outputting a csv.
以前我是通过导出一个 csv 并使用这些步骤来组合所有 csv 来做到这一点的:
Previously I did this by exporting a csv and used these steps to combine all csvs:
library(DBI)
library(RSQLite)
library(dplyr)
csv_locs<- list.files(newdir, recursive = TRUE, pattern="*.csv", full.names = TRUE)
pic_dat <- do.call("rbind", lapply(csv_locs,
FUN=function(files){data.table::fread(files, data.table = FALSE)}))
如何用sql类型的数据库表来做这个??
我基本上是拉出第一张桌子,然后用一个循环连接其余的桌子.
How to do this with sql type database tables??
I'm basically pulling out the first table, then joining on the rest with a loop.
db_locs <- list.files(directory, recursive = TRUE, pattern="*.ddb", full.names = TRUE)
# first table
con1<- DBI::dbConnect(RSQLite::SQLite(), db_locs [1])
start <- tbl(con1, "DataTable")
# open connection to location[i], get table, union, disconnect; repeat.
for(i in 2:length(db_locs )){
con <- DBI::dbConnect(RSQLite::SQLite(), db_locs[i])
y <- tbl(con, "DataTable")
start <- union(start, y, copy=TRUE)
dbDisconnect(con)
}
这特别慢!好吧,公平地说,它的大数据和 csv 也很慢.
老实说,我想我写了最慢的方法来做到这一点:) 我无法让 do.call/lapply 选项在这里工作,但也许我遗漏了一些东西.
This is exceptionally slow! Well, to be fair, its large data and the csv one is also slow.
I think I honestly wrote the slowest possible way to do this :) I could not get the do.call/lapply option to work here, but maybe I'm missing something.
推荐答案
这看起来类似于迭代rbind
帧",因为每次你这样做union
,它会将整个表复制到一个新对象中(未经证实,但这是我的直觉).这可能对少数人有效,但扩展性很差.我建议您将所有表收集到一个列表中,并在最后调用 data.table::rbindlist
一次,然后插入到一个表中.
This looks similar to "iterative rbind
ing of frames", in that each time you do this union
, it will copy the entire table into a new object (unconfirmed, but that's my gut feeling). This might work well for a few but scales very poorly. I suggest you collect all tables in a list and call data.table::rbindlist
once at the end, then insert into a table.
没有你的数据,我会设计一个情况.并且因为我不完全确定每个 sqlite3 文件是否只有一个表,所以我将为每个数据库添加两个表.如果您只有一个,则解决方案会很容易简化.
Without your data, I'll contrive a situation. And because I'm not entirely certain if you have just one table per sqlite3 file, I'll add two tables per database. If you only have one, the solution simplifies easily.
for (i in 1:3) {
con <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars_%d.sqlite3", i))
DBI::dbWriteTable(con, "mt1", mtcars[1:3,1:3])
DBI::dbWriteTable(con, "mt2", mtcars[4:5,4:7])
DBI::dbDisconnect(con)
}
(lof <- list.files(pattern = "*.sqlite3", full.names = TRUE))
# [1] "./mtcars_1.sqlite3" "./mtcars_2.sqlite3" "./mtcars_3.sqlite3"
现在我将遍历它们并读取表格的内容
Now I'll iterate over each them and read the contents of a table
allframes <- lapply(lof, function(fn) {
con <- DBI::dbConnect(RSQLite::SQLite(), fn)
mt1 <- tryCatch(DBI::dbReadTable(con, "mt1"),
error = function(e) NULL)
mt2 <- tryCatch(DBI::dbReadTable(con, "mt2"),
error = function(e) NULL)
DBI::dbDisconnect(con)
list(mt1 = mt1, mt2 = mt2)
})
allframes
# [[1]]
# [[1]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
# [[1]]$mt2
# hp drat wt qsec
# 1 110 3.08 3.215 19.44
# 2 175 3.15 3.440 17.02
# [[2]]
# [[2]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
### ... repeated
从这里开始,只需将它们组合在 R 中并写入新数据库.虽然您可以使用 do.call(rbind,...)
或 dplyr::bind_rows
,但您已经提到了 data.table
所以我会坚持下去:
From here, just combine them in R and write to a new database. While you can use do.call(rbind,...)
or dplyr::bind_rows
, you already mentioned data.table
so I'll stick with that:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all.sqlite3")
DBI::dbWriteTable(con, "mt1", data.table::rbindlist(lapply(allframes, `[[`, 1)))
DBI::dbWriteTable(con, "mt2", data.table::rbindlist(lapply(allframes, `[[`, 2)))
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
DBI::dbDisconnect(con)
如果您不能一次将它们全部加载到 R 中,则将它们实时附加到表中:
In the event that you can't load them all into R at one time, then append them to the table in real-time:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all2.sqlite3")
for (fn in lof) {
con2 <- DBI::dbConnect(RSQLite::SQLite(), fn)
mt1 <- tryCatch(DBI::dbReadTable(con2, "mt1"), error = function(e) NULL)
if (!is.null(mt1)) DBI::dbWriteTable(con, "mt1", mt1, append = TRUE)
mt2 <- tryCatch(DBI::dbReadTable(con2, "mt2"), error = function(e) NULL)
if (!is.null(mt1)) DBI::dbWriteTable(con, "mt2", mt2, append = TRUE)
DBI::dbDisconnect(con2)
}
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
这不会受到您正在经历的迭代放缓的影响.
This doesn't suffer the iterative-slowdown that you're experiencing.
相关文章