估算缺失数据,同时强制相关系数保持不变
问题描述
考虑以下(excel)数据集:
Consider the following (excel) dataset:
m | r
----|------
2.0 | 3.3
0.8 |
| 4.0
1.3 |
2.1 | 5.2
| 2.3
| 1.9
2.5 |
1.2 | 3.0
2.0 | 2.6
我的目标是使用以下条件填充缺失值:
将上述两列之间的成对相关性表示为 R(约 0.68).将在空单元格填充后的相关性表示为 R*.填写表格,使 (R - R*)^2 = 0.也就是说,我想保持数据的相关结构不变.
Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.
到目前为止,我已经使用 Matlab 完成了:
So far I have done it using Matlab:
clear all;
m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;
rho = corr(m,r,'rows','pairwise');
x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);
SOL = fmincon(f,x0,[],[],[],[],lb)
函数my_correl
在哪里:
function X = my_correl(x,rho)
sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);
X = (rho - rho_num/rho_den)^2;
end
此函数手动计算相关性,其中每个缺失数据都是一个变量x(i)
.
This function computes the correlation manually, where every missing data is a variable x(i)
.
问题:我的实际数据集有超过 20,000 个观察值.我无法手动创建该 rho 公式.
The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.
如何填写我的数据集?
注意 1:我愿意使用 Python、Julia 或 R 等替代语言.Matlab 这只是我的默认语言.
Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.
注意 2:答案将获得 100 分奖励.承诺从现在开始.
Note 2: a 100 points bounty will be awarded to the answer. Promise from now.
解决方案
这就是我的处理方式,提供了 R 中的实现:
This is how I would approach it, with an implementation in R provided:
对于缺失数据点的插补没有唯一的解决方案,使得完整(插补)数据的成对相关性等于不完整数据的成对相关性.因此,为了找到一个好的"解决方案而不仅仅是任何"解决方案,我们可以引入一个额外的标准,即完整的估算数据也应该与原始数据共享相同的线性回归.这使我们采用了一种相当简单的方法.
There is not a unique solution for imputing the missing data points, such that the pairwise correlation of the complete (imputed) data is equal to the pairwise correlation of the incomplete data. So to find a 'good' solution rather than just 'any' solution, we can introduce an additional criteria that the complete imputed data should also share the same linear regression with the original data. This leads us to a fairly simple approach.
- 计算原始数据的线性回归模型.
- 找出恰好位于这条回归线上的缺失值的插补值.
- 为这条回归线周围的估算值生成随机散布的残差
- 缩放插补残差以强制完整插补数据的相关性等于原始数据的相关性
R 中这样的解决方案:
A solution like this in R:
library(data.table)
set.seed(123)
rho = cor(dt$m,dt$r,'pairwise')
# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])
# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m]
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]
# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)]
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)]
# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2
}
# find the value of x that minimises the discrepencay of imputed versus original correlation
fit = optimize(obj, c(-5,5), dt, rho)
x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2 # check that rho2 is approximately equal to rho
作为最终检查,计算完整估算数据的线性回归并绘制以显示回归线与原始数据相同.请注意,下图是针对如下所示的大型数据集,以演示在大型数据上使用此方法.
As a final check, calculate linear regression of the complete imputed data and plot to show that regression line is same as for original data. Note that the plot below was for the large data set shown below, to demonstrate use of this method on large data.
fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)
数据
来自 OP 示例的原始玩具数据:
original toy data from OP example:
dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2),
r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)),
.Names = c("m", "r"), row.names = c(NA, -10L),
class = c("data.table", "data.frame"))
一个更大的数据集来演示大数据
A larger data set to demonstrate on big data
dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]
相关文章