基于 R base 语法的 50 幅图之一

2022-04-08 00:00:00 数据 软件包 颜色 分组 据点

本文无意引发论战,不管是 matplotlibggplot2,还是 base plot 我都没有偏见,我初接触 R 时,ggplot2 还不火,到处都是 base plot 的教程,后来 ggplot 火了,我也跟着去 ggplot 了,见证了 ggplot 的大的升级,再后来 tidyverse 特别火,我也很迷恋了一阵。我也经常使用 tidyverse 的包,只是近特别喜欢 base 的语法而已。语言是工具,哪个工具用好了,都能是大神,例如同样的太祖长拳,在乔峰手里就是不一样了。本文就是针对工具的学习。

看着容易实现起来就难了,有些图懒得做的和下面链接的完全一样(主要是觉得设定不符合我的审美,大部分细节能接受的我都改的一样了),有些是改起来费劲,我就不想浪费太多时间了,有些图用 base 的软件包太费劲了,不值得为无关紧要的细节再添加太多代码,我已经为像原作者的图大动干戈了,于是就用的基于 base 语法的软件包。

年初有篇很火的 matplotlib 做了 50 幅图的帖子:https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/ 被人翻译了中文,写的很棒,不知为什么,近又在微信里看到了推送,恰逢我近很迷恋 R base plot 的语法,于是准备不自量力,暂时中断假设检验,来写一篇 base plot 版本的 50 幅图。其中有的图用现成的 ggplot 会非常容易,我不反对使用,只是近在强制练习 base plot。

由于 github 连接不怎么稳定的原因,我把所有的数据都下载到本地读取。因为 RColorBrewer 在颜色的细节上做的确实不错,我们在一些图形中使用它的颜色来显示。

library(RColorBrewer)library(data.table)library(scales)

Correlation

Scatterplot

这幅图很简单,就是按照 category 这个分组数据来作图,使用不同的颜色。

midwest <- read.csv("data/midwest_filter.csv")# the same no. will use same colors# R 4.0 stringasfactors = falsecat_level <- unique(midwest$category)col_n <- as.numeric(as.factor(midwest$category))uni_col <- length(unique(midwest$category))cat_col <- colorRampPalette(brewer.pal(8, "Set1"))(uni_col)
plot( midwest$area, midwest$poptotal, pch = 20, col = cat_col[col_n], xlab = 'Area', ylab = 'Population', xlim = c(0.0, 0.1), ylim = c(, 90000), axes = FALSE, frame.plot = TRUE, main = "Scatterplot of Midwest Area vs Population")
axis(1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis(2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend("topright", legend = sort(cat_level), pch = 20, col = cat_col, bty = "n")

试了几个颜色,感觉使用这个颜色的划分,区分度还高一些。

Bubble plot with Encircling

这个图本身没什么特别的,只是多了一个边界,将某些重要的点圈起来,数据还是使用的上面的。看了一下, legend 中应该没有特别的只改图形大小,而不改变字符的控制,而且既然每个分组大小并不一致(dot_size并为分组),似乎大小不一致也不合理,看数据点有重叠,因此将使用的颜色的 alpha 通道设置了一下,略微透明,应该更清晰。

# use scale to make alpha transparency# because of the overlap col_alpha <- alpha(cat_col, 0.5)cat_size <- midwest$dot_size / 100
# data for borderpoly_data <- midwest[which(midwest$state == 'IN'),]
circle_data <- function(x, y){ index <- chull(x, y) circle_frame <- data.frame(xpoint = x[index], ypoint = y[index])} # only get the convex hull data points# polygonpath do the plotencircle_data <- circle_data(poly_data$area, poly_data$poptotal)
plot( midwest$area, midwest$poptotal, pch = 20, col = col_alpha[col_n], cex = cat_size, xlab = 'Area', ylab = 'Population', xlim = c(0.0, 0.1), ylim = c(, 90000), axes = FALSE, frame.plot = TRUE, main = "Scatterplot of Midwest Area vs Population")polypath(encircle_data$xpoint, encircle_data$ypoint, border = "red", lwd = 3)axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8,)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend( "topright", legend = sort(cat_level), pch = 20, col = col_alpha, bty = "n")

Scatter plot with linear regression line of best fit

这个图没什么特别的,只不过是使用 ggplot2 会秒杀所有的方法。一点一点的从头开始使用 base plot 作图,可以利用 predict.lm 来求置信区间,其实我觉得使用的 displ 范围应该是数据的范围,但是看到的所有的作图上,都是使用了更广的范围,作为参考的区间无可厚非,作为预测来讲,当然不能随便扩大模型的范围。

mpg <- data.table(ggplot2::mpg)cyl4 <- mpg[cyl == 4]cyl8 <- mpg[cyl == 8]fit_cyl4 <- lm(data = cyl4, hwy ~ displ)fit_cyl8 <- lm(data = cyl8, hwy ~ displ)
new_df <- data.frame(displ = seq(, 8, by = 0.1))#newdata must be a data frame contain the name of displcy4_bound <- predict(fit_cyl4, newdata = new_df, interval = "confidence")cy8_bound <- predict(fit_cyl8, newdata = new_df, interval = "confidence")
# make the alpha of confidence intervalbluea = alpha("blue", 0.3)orangea = alpha("orange", 0.3)
plot.new()box()plot.window( xlim = c(, 8), ylim = c(, 50),)
points(cyl4$displ, cyl4$hwy, col = "blue", cex = 2.5, pch = 20)points(cyl8$displ, cyl8$hwy, col = "orange", cex = 2.5, pch = 20)
abline(fit_cyl4$coefficients[1], fit_cyl4$coefficients[2], col = "blue", lwd = 3)abline(fit_cyl8$coefficients[1], fit_cyl8$coefficients[2], col = "orange", lwd = 3)
polygon(c(new_df$displ, rev(new_df$displ)), c(cy4_bound[, 2], rev(cy4_bound[, 3])), col = bluea, border = bluea)polygon(c(new_df$displ, rev(new_df$displ)), c(cy8_bound[, 2], rev(cy8_bound[, 3])), col = orangea, border = orangea )
axis( 1, lwd = 1, col = 'gray', tick = FALSE, cex.axis = 0.8)mtext('dspl', side = 1, font = 2, line = 3)axis( 2, lwd = 1, col = 'gray', tick = FALSE, cex.axis = 0.8)mtext('hwy', side =2, font = 2, line = 3)
title('Scatterplot with line of best fit grouped by number of cylinders', cex.main = 1)legend( "topright", legend = c('cyl = 4', 'cyl = 8'), pch = 20, col = c("blue", "orange"), bty = 'n')

Each regression line in its own column

这个图是上面的翻版,其实比较简单,就是将一幅图分成两幅图来做,选择不同吧,作者原来就是想演示 sns 想做到这一点很容易。

oldpar <- par()
layout(matrix(c(1, 2), 1, 2, byrow = TRUE))par(mar = c(5.1, 4.1, 4.1, 1.1))plot( cyl4$displ, cyl4$hwy, col = "blue", cex = 2.5, pch = 20, xlim = c(, 7.5), ylim = c(, 50), xlab = 'displ', ylab = 'hwy', tck = )abline(fit_cyl4$coefficients[1], fit_cyl4$coefficients[2], col = "blue", lwd = 3)polygon(c(new_df$displ, rev(new_df$displ)), c(cy4_bound[, 2], rev(cy4_bound[, 3])), col = bluea, border = bluea)legend( "top", legend = 'cyl = 4', pch = 20, col = "blue", bty = 'n')
par(mar = c(5.1, 1.1, 4.1, 4.1))plot( cyl8$displ, cyl8$hwy, col = "blue", cex = 2.5, pch = 20, xlim = c(, 7.5), ylim = c(, 50), xlab = 'displ', ylab = NA, tck = )
abline(fit_cyl8$coefficients[1], fit_cyl8$coefficients[2], col = "blue", lwd = 3)polygon(c(new_df$displ, rev(new_df$displ)), c(cy8_bound[, 2], rev(cy8_bound[, 3])), col = bluea, border = bluea)legend( "top", legend = 'cyl = 8', pch = 20, col = "blue", bty = 'n')
par(oldpar)

Jittering with stripplot

这也是散点图的一种,只不过有的数据有严重的重合,人为的使这些数据点有偏离(加点噪音),这就叫 jitter。原文这幅图的数据点使用了不同的颜色,我不知道他使用的哪个分组,代码中没看到这一项,不过这不重要,我就以 cyl 为分组上色好了。其中,jitter 默认的添加噪音程度可能太保守,如果加大后数据点分离加重,那么意味着我们可以修改一下这个权重。在这里 ggplot2 更有优势。

cat_level <- unique(mpg$cyl)col_n <- as.numeric(as.factor(mpg$cyl))uni_col <- length(unique(mpg$cyl))cat_col <- brewer.pal(4, "Set1")
plot( mpg$cty, jitter(mpg$hwy, factor = 5), pch = 20, col = cat_col[col_n], xlab = 'cty', ylab = 'hwy', xlim = c(8, 36), ylim = c(10, 50), axes = FALSE, frame.plot = TRUE, main = "use jitter to avoid overlapping of points")
axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend( "topleft", legend = sort(cat_level), title = 'cyl', pch = 20, col = cat_col, bty = "n")

Counts Plot

上面为了展示给数据加噪音,尽管对理解可能有影响,但毕竟是改动了数据,另外一个不改动数据的方法就是使得相同的数值(重叠多)的数据计数,依据这个来改变该位置点的大小。在这里我们只需要先对重叠的点进行记数就好了,dplyr 或 data.table 都有该功能,这里使用 data.table

mpg_size <- mpg[, n := .N, by = c("hwy", 'cty')]size_n <- (mpg_size$n+10)/8
plot( mpg_size$cty, mpg_size$hwy, pch = 20, cex = size_n, col = cat_col[col_n], xlab = 'cty', ylab = 'hwy', xlim = c(8, 36), ylim = c(10, 50), axes = FALSE, frame.plot = TRUE, main = "counts the number of overlap points")
axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend( "topleft", legend = sort(cat_level), title = 'cyl', pch = 20, col = cat_col, bty = "n")

Marginal Histogram

这个在图形的边缘显示其他图形的技巧,这里用的是频度直方图。我这里仍然不知道分组(我没找到 scatter 的 col 选项),我随便用一个 drv 进行颜色分组。

oldpar <- par()
cat_level <- unique(mpg$drv)col_n <- as.numeric(as.factor(mpg$drv))uni_col <- length(unique(mpg$drv))cat_col <- brewer.pal(3, "Set1")
par(fig = c(, 0.75, , 0.75), new = TRUE)plot( mpg$displ, mpg$hwy, pch = 20, col = cat_col[col_n], xlab = 'dispal', ylab = 'hwy', axes = FALSE, frame.plot = TRUE,)
axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)
par(fig = c(, 0.75, 0.45, 1), new = TRUE)xhist <- hist(mpg$displ, breaks = 45, plot = FALSE)barplot(xhist$count, axes = FALSE, col = '#FF0099')par(fig = c(0.6, 1, , 0.75), new = TRUE)yhist <- hist(mpg$hwy, breaks = 40, plot = FALSE)barplot(yhist$count, horiz = TRUE, axes = FALSE, col = '#FF0099')mtext( "scatter plot with histograms", side = 3, outer = TRUE, line = -1)mtext("displ vs hwy", side = 3, outer = TRUE, line = -2)
par(oldpar)

Marginal Boxplot

这个和上面的图没有差别,改个 boxplot 即可。另外,我略微改了一下这两幅图,没有将边缘的图像放在大图的底部,我觉得放在上部更符合审美。

oldpar <- par()
cat_level <- unique(mpg$drv)col_n <- as.numeric(as.factor(mpg$drv))uni_col <- length(unique(mpg$drv))cat_col <- brewer.pal(3, "Set1")
par(fig = c(, 0.75, , 0.75), new = TRUE)plot( mpg$displ, mpg$hwy, pch = 20, col = cat_col[col_n], xlab = 'dispal', ylab = 'hwy', axes = FALSE, frame.plot = TRUE,)
axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)
par(fig = c(, 0.75, 0.45, 1), new = TRUE)boxplot(mpg$displ, axes = FALSE, col = 'darkblue', horizontal = TRUE)par(fig = c(0.6, 1, , 0.75), new = TRUE)boxplot(mpg$hwy, axes = FALSE, col = 'darkblue')mtext( "scatter plot with histograms", side = 3, outer = TRUE, line = -1)mtext("displ vs hwy", side = 3, outer = TRUE, line = -2)
par(oldpar)

Correllogram

这个图看似复杂,其实比较容易实现,有很多现成的包可以做这个事情,也很简单,但是基于 base 的似乎没有完美的简单的base package 的解决方案。尽管前面我尽量避免使用现成的包,但目前来看,我只能使用基于 base plot 的软件包来做这个图了。

library(corrplot)mtc <- read.csv("data/mtcars.csv")
heat_mtc <- cor(mtc[,1:12])col <- colorRampPalette(c("darkred", "white", "darkgreen"))(40)corrplot(heat_mtc, method="color", col=col, type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE )

Pairwise Plot

这个图形的实现,相比上图来讲,实现起来容易,就是代码太多,还是使用现成的软件包,这个也有很多,例如 CorrrPerformanceAnalyticsGGally,以及我今天使用的 psych

library(psych)pairs.panels(iris, scale=TRUE)

参考资料

Koncevicius., Karolis. 2020. “R Base Plotting Without Wrappers.” http://karolis.koncevicius.lt/posts/r_base_plotting_without_wrappers/.


相关文章