基于 R base 语法的 50 幅图之一

2022-04-08 00:00:00 数据软件包颜色分组据点

本文无意引发论战，不管是 matplotlib、ggplot2，还是 base plot 我都没有偏见，我初接触 R 时，ggplot2 还不火，到处都是 base plot 的教程，后来 ggplot 火了，我也跟着去 ggplot 了，见证了 ggplot 的大的升级，再后来 tidyverse 特别火，我也很迷恋了一阵。我也经常使用 tidyverse 的包，只是近特别喜欢 base 的语法而已。语言是工具，哪个工具用好了，都能是大神，例如同样的太祖长拳，在乔峰手里就是不一样了。本文就是针对工具的学习。

看着容易实现起来就难了，有些图懒得做的和下面链接的完全一样（主要是觉得设定不符合我的审美，大部分细节能接受的我都改的一样了），有些是改起来费劲，我就不想浪费太多时间了，有些图用 base 的软件包太费劲了，不值得为无关紧要的细节再添加太多代码，我已经为像原作者的图大动干戈了，于是就用的基于 base 语法的软件包。

年初有篇很火的 matplotlib 做了 50 幅图的帖子：https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/ 被人翻译了中文，写的很棒，不知为什么，近又在微信里看到了推送，恰逢我近很迷恋 R base plot 的语法，于是准备不自量力，暂时中断假设检验，来写一篇 base plot 版本的 50 幅图。其中有的图用现成的 ggplot 会非常容易，我不反对使用，只是近在强制练习 base plot。

由于 github 连接不怎么稳定的原因，我把所有的数据都下载到本地读取。因为 RColorBrewer 在颜色的细节上做的确实不错，我们在一些图形中使用它的颜色来显示。

library(RColorBrewer)library(data.table)library(scales)

Correlation

Scatterplot

这幅图很简单，就是按照 category 这个分组数据来作图，使用不同的颜色。

midwest <- read.csv("data/midwest_filter.csv")# the same no. will use same colors# R 4.0 stringasfactors = falsecat_level <- unique(midwest$category)col_n <- as.numeric(as.factor(midwest$category))uni_col <- length(unique(midwest$category))cat_col <- colorRampPalette(brewer.pal(8, "Set1"))(uni_col)

plot( midwest$area, midwest$poptotal, pch = 20, col = cat_col[col_n], xlab = 'Area', ylab = 'Population', xlim = c(0.0, 0.1), ylim = c(, 90000), axes = FALSE, frame.plot = TRUE, main = "Scatterplot of Midwest Area vs Population")

axis(1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis(2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend("topright", legend = sort(cat_level), pch = 20, col = cat_col, bty = "n")

试了几个颜色，感觉使用这个颜色的划分，区分度还高一些。

Bubble plot with Encircling

这个图本身没什么特别的，只是多了一个边界，将某些重要的点圈起来，数据还是使用的上面的。看了一下， legend 中应该没有特别的只改图形大小，而不改变字符的控制，而且既然每个分组大小并不一致（dot_size并为分组），似乎大小不一致也不合理，看数据点有重叠，因此将使用的颜色的 alpha 通道设置了一下，略微透明，应该更清晰。

# use scale to make alpha transparency# because of the overlap col_alpha <- alpha(cat_col, 0.5)cat_size <- midwest$dot_size / 100

# data for borderpoly_data <- midwest[which(midwest$state == 'IN'),]

circle_data <- function(x, y){ index <- chull(x, y) circle_frame <- data.frame(xpoint = x[index], ypoint = y[index])} # only get the convex hull data points# polygonpath do the plotencircle_data <- circle_data(poly_data$area, poly_data$poptotal)

plot( midwest$area, midwest$poptotal, pch = 20, col = col_alpha[col_n], cex = cat_size, xlab = 'Area', ylab = 'Population', xlim = c(0.0, 0.1), ylim = c(, 90000), axes = FALSE, frame.plot = TRUE, main = "Scatterplot of Midwest Area vs Population")polypath(encircle_data$xpoint, encircle_data$ypoint, border = "red", lwd = 3)axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8,)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend( "topright", legend = sort(cat_level), pch = 20, col = col_alpha, bty = "n")

Scatter plot with linear regression line of best fit

这个图没什么特别的，只不过是使用 ggplot2 会秒杀所有的方法。一点一点的从头开始使用 base plot 作图，可以利用 predict.lm 来求置信区间，其实我觉得使用的 displ 范围应该是数据的范围，但是看到的所有的作图上，都是使用了更广的范围，作为参考的区间无可厚非，作为预测来讲，当然不能随便扩大模型的范围。

mpg <- data.table(ggplot2::mpg)cyl4 <- mpg[cyl == 4]cyl8 <- mpg[cyl == 8]fit_cyl4 <- lm(data = cyl4, hwy ~ displ)fit_cyl8 <- lm(data = cyl8, hwy ~ displ)

new_df <- data.frame(displ = seq(, 8, by = 0.1))#newdata must be a data frame contain the name of displcy4_bound <- predict(fit_cyl4, newdata = new_df, interval = "confidence")cy8_bound <- predict(fit_cyl8, newdata = new_df, interval = "confidence")

# make the alpha of confidence intervalbluea = alpha("blue", 0.3)orangea = alpha("orange", 0.3)

plot.new()box()plot.window( xlim = c(, 8), ylim = c(, 50),)

points(cyl4$displ, cyl4$hwy, col = "blue", cex = 2.5, pch = 20)points(cyl8$displ, cyl8$hwy, col = "orange", cex = 2.5, pch = 20)

abline(fit_cyl4$coefficients[1], fit_cyl4$coefficients[2], col = "blue", lwd = 3)abline(fit_cyl8$coefficients[1], fit_cyl8$coefficients[2], col = "orange", lwd = 3)

polygon(c(new_df$displ, rev(new_df$displ)), c(cy4_bound[, 2], rev(cy4_bound[, 3])), col = bluea, border = bluea)polygon(c(new_df$displ, rev(new_df$displ)), c(cy8_bound[, 2], rev(cy8_bound[, 3])), col = orangea, border = orangea )

axis( 1, lwd = 1, col = 'gray', tick = FALSE, cex.axis = 0.8)mtext('dspl', side = 1, font = 2, line = 3)axis( 2, lwd = 1, col = 'gray', tick = FALSE, cex.axis = 0.8)mtext('hwy', side =2, font = 2, line = 3)

title('Scatterplot with line of best fit grouped by number of cylinders', cex.main = 1)legend( "topright", legend = c('cyl = 4', 'cyl = 8'), pch = 20, col = c("blue", "orange"), bty = 'n')

Each regression line in its own column

这个图是上面的翻版，其实比较简单，就是将一幅图分成两幅图来做，选择不同吧，作者原来就是想演示 sns 想做到这一点很容易。

oldpar <- par()

layout(matrix(c(1, 2), 1, 2, byrow = TRUE))par(mar = c(5.1, 4.1, 4.1, 1.1))plot( cyl4$displ, cyl4$hwy, col = "blue", cex = 2.5, pch = 20, xlim = c(, 7.5), ylim = c(, 50), xlab = 'displ', ylab = 'hwy', tck = )abline(fit_cyl4$coefficients[1], fit_cyl4$coefficients[2], col = "blue", lwd = 3)polygon(c(new_df$displ, rev(new_df$displ)), c(cy4_bound[, 2], rev(cy4_bound[, 3])), col = bluea, border = bluea)legend( "top", legend = 'cyl = 4', pch = 20, col = "blue", bty = 'n')

par(mar = c(5.1, 1.1, 4.1, 4.1))plot( cyl8$displ, cyl8$hwy, col = "blue", cex = 2.5, pch = 20, xlim = c(, 7.5), ylim = c(, 50), xlab = 'displ', ylab = NA, tck = )

abline(fit_cyl8$coefficients[1], fit_cyl8$coefficients[2], col = "blue", lwd = 3)polygon(c(new_df$displ, rev(new_df$displ)), c(cy8_bound[, 2], rev(cy8_bound[, 3])), col = bluea, border = bluea)legend( "top", legend = 'cyl = 8', pch = 20, col = "blue", bty = 'n')

par(oldpar)

Jittering with stripplot

这也是散点图的一种，只不过有的数据有严重的重合，人为的使这些数据点有偏离（加点噪音），这就叫 jitter。原文这幅图的数据点使用了不同的颜色，我不知道他使用的哪个分组，代码中没看到这一项，不过这不重要，我就以 cyl 为分组上色好了。其中，jitter 默认的添加噪音程度可能太保守，如果加大后数据点分离加重，那么意味着我们可以修改一下这个权重。在这里 ggplot2 更有优势。

cat_level <- unique(mpg$cyl)col_n <- as.numeric(as.factor(mpg$cyl))uni_col <- length(unique(mpg$cyl))cat_col <- brewer.pal(4, "Set1")

plot( mpg$cty, jitter(mpg$hwy, factor = 5), pch = 20, col = cat_col[col_n], xlab = 'cty', ylab = 'hwy', xlim = c(8, 36), ylim = c(10, 50), axes = FALSE, frame.plot = TRUE, main = "use jitter to avoid overlapping of points")

axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)legend( "topleft", legend = sort(cat_level), title = 'cyl', pch = 20, col = cat_col, bty = "n")

Counts Plot

上面为了展示给数据加噪音，尽管对理解可能有影响，但毕竟是改动了数据，另外一个不改动数据的方法就是使得相同的数值（重叠多）的数据计数，依据这个来改变该位置点的大小。在这里我们只需要先对重叠的点进行记数就好了，dplyr 或 data.table 都有该功能，这里使用 data.table。

mpg_size <- mpg[, n := .N, by = c("hwy", 'cty')]size_n <- (mpg_size$n+10)/8

plot( mpg_size$cty, mpg_size$hwy, pch = 20, cex = size_n, col = cat_col[col_n], xlab = 'cty', ylab = 'hwy', xlim = c(8, 36), ylim = c(10, 50), axes = FALSE, frame.plot = TRUE, main = "counts the number of overlap points")

Marginal Histogram

这个在图形的边缘显示其他图形的技巧，这里用的是频度直方图。我这里仍然不知道分组（我没找到 scatter 的 col 选项），我随便用一个 drv 进行颜色分组。

oldpar <- par()

cat_level <- unique(mpg$drv)col_n <- as.numeric(as.factor(mpg$drv))uni_col <- length(unique(mpg$drv))cat_col <- brewer.pal(3, "Set1")

par(fig = c(, 0.75, , 0.75), new = TRUE)plot( mpg$displ, mpg$hwy, pch = 20, col = cat_col[col_n], xlab = 'dispal', ylab = 'hwy', axes = FALSE, frame.plot = TRUE,)

axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)

par(fig = c(, 0.75, 0.45, 1), new = TRUE)xhist <- hist(mpg$displ, breaks = 45, plot = FALSE)barplot(xhist$count, axes = FALSE, col = '#FF0099')par(fig = c(0.6, 1, , 0.75), new = TRUE)yhist <- hist(mpg$hwy, breaks = 40, plot = FALSE)barplot(yhist$count, horiz = TRUE, axes = FALSE, col = '#FF0099')mtext( "scatter plot with histograms", side = 3, outer = TRUE, line = -1)mtext("displ vs hwy", side = 3, outer = TRUE, line = -2)

par(oldpar)

Marginal Boxplot

这个和上面的图没有差别，改个 boxplot 即可。另外，我略微改了一下这两幅图，没有将边缘的图像放在大图的底部，我觉得放在上部更符合审美。

oldpar <- par()

cat_level <- unique(mpg$drv)col_n <- as.numeric(as.factor(mpg$drv))uni_col <- length(unique(mpg$drv))cat_col <- brewer.pal(3, "Set1")

par(fig = c(, 0.75, , 0.75), new = TRUE)plot( mpg$displ, mpg$hwy, pch = 20, col = cat_col[col_n], xlab = 'dispal', ylab = 'hwy', axes = FALSE, frame.plot = TRUE,)

axis( 1, lwd = 1, las = , col = 'gray', tick = FALSE, cex.axis = 0.8)axis( 2, lwd = 1, las = 2, col = 'gray', tick = FALSE, cex.axis = 0.8)

par(fig = c(, 0.75, 0.45, 1), new = TRUE)boxplot(mpg$displ, axes = FALSE, col = 'darkblue', horizontal = TRUE)par(fig = c(0.6, 1, , 0.75), new = TRUE)boxplot(mpg$hwy, axes = FALSE, col = 'darkblue')mtext( "scatter plot with histograms", side = 3, outer = TRUE, line = -1)mtext("displ vs hwy", side = 3, outer = TRUE, line = -2)

par(oldpar)

Correllogram

这个图看似复杂，其实比较容易实现，有很多现成的包可以做这个事情，也很简单，但是基于 base 的似乎没有完美的简单的base package 的解决方案。尽管前面我尽量避免使用现成的包，但目前来看，我只能使用基于 base plot 的软件包来做这个图了。

library(corrplot)mtc <- read.csv("data/mtcars.csv")

heat_mtc <- cor(mtc[,1:12])col <- colorRampPalette(c("darkred", "white", "darkgreen"))(40)corrplot(heat_mtc, method="color", col=col, type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE )

Pairwise Plot

这个图形的实现，相比上图来讲，实现起来容易，就是代码太多，还是使用现成的软件包，这个也有很多，例如 Corrr, PerformanceAnalytics，GGally，以及我今天使用的 psych。

library(psych)pairs.panels(iris, scale=TRUE)

参考资料

Koncevicius., Karolis. 2020. “R Base Plotting Without Wrappers.” http://karolis.koncevicius.lt/posts/r_base_plotting_without_wrappers/.

相关文章