Set Up a Deep Learning Platform and Build a Neural Network Model in 10 Minutes (R + H2O)

2019-07-04 00:00:00 Tags: setup, neural network, deep learning

Original article: Zalando’s images classification using H2O with R

What follows is only my own reading of the original article and my attempt to reproduce it.

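What is H2O?

H2O is a machine learning platform written in Java, with interfaces for R, Python and other languages. It ships with many built-in algorithms, deep learning among them:

1. Supervised learning: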

Deep Learning (Neural Networks)
Distributed Random Forest (DRF)
Generalized Linear Model (GLM)
Gradient Boosting Machine (GBM)
Naive Bayes Classifier
Stacked Ensembles
XGBoost

2. Unsupervised learning:

Generalized Low Rank Models (GLRM)
K-Means Clustering
Principal Component Analysis (PCA)

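The example data used here is the Fashion-MNIST dataset on Kaggle:

https://www.kaggle.com/zalando-research/fashionmnist

If you already have RStudio, installing H2O is very simple (if not, see https://www.rstudio.com/ to install R and RStudio).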

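# Install the required packages; tidyverse bundles many useful packages such as dplyr and ggplot2
#install.packages('gridExtra')
#install.packages("tidyverse")
#install.packages('h2o') # install h2o

H2O runs on the JVM and needs a JDK. Note that the H2O release used here (3.14.0.3) does not support Java 9, the newest Java release when this was written, so install Java 8.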
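To check which Java the machine will hand to H2O before calling h2o.init(), the first line printed by `java -version` can be parsed. The following is a minimal sketch in base R. The helper name java_major_version is hypothetical and not part of the h2o package, and it assumes the usual quoted version string such as "1.8.0_151".

```r
# Hypothetical helper: extract the major Java version from the first line
# printed by `java -version`.
java_major_version <- function(version_line) {
  # Pull out the quoted version string, e.g. "1.8.0_151" or "9.0.1"
  quoted <- regmatches(version_line, regexpr('"[^"]+"', version_line))
  if (length(quoted) == 0) return(NA_integer_)
  parts <- strsplit(gsub('"', "", quoted), "[._]")[[1]]
  # Versions up to Java 8 are reported as 1.x; later ones as x.y.z
  major <- if (parts[1] == "1" && length(parts) > 1) parts[2] else parts[1]
  as.integer(major)
}

# On a machine with Java on the PATH, the version line could be read with
# (java prints its version to stderr, so stderr is captured as well):
# version_line <- system2("java", "-version", stdout = TRUE, stderr = TRUE)[1]
version_line <- 'java version "1.8.0_151"'  # the line shown by h2o.init() in this article
java_major_version(version_line)
```

A result of 8 matches the Java 8 requirement stated above; a result of 9 or higher would suggest switching JDKs before starting this H2O release.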

> library(gridExtra)
> library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages --------------------------------------------------------------------------------------------------------------
combine(): dplyr, gridExtra
filter():  dplyr, stats
lag():     dplyr, stats
> library(h2o)

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------
Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc

> h2o.init() # needs a JDK; this version does not support Java 9, Java 8 is recommended
H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\stone\AppData\Local\Temp\Rtmp8IixjQ/h2o_stone_started_from_r.out
    C:\Users\stone\AppData\Local\Temp\Rtmp8IixjQ/h2o_stone_started_from_r.err

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 seconds 980 milliseconds 
    H2O cluster version:        3.14.0.3 
    H2O cluster version age:    1 month and 11 days  
    H2O cluster name:           H2O_started_from_R_stone_mgc364 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.75 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.2 (2017-09-28) 

> # Import the data
> fmnist_train <- h2o.importFile(path = "fashion-mnist_train.csv",
+ destination_frame = "fmnist_train",
+ col.types=c("factor", rep("int", 784)))
  |=================================================================================================================================| 100%
> fmnist_test <- h2o.importFile(path = "fashion-mnist_test.csv",
+ destination_frame = "fmnist_test",
+ col.types=c("factor", rep("int", 784)))
  |=================================================================================================================================| 100%

Check that the import succeeded:

> h2o.ls()
           key
1  fmnist_test
2 fmnist_train

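# Before modelling, take a look at what the source data looks like
# Pixel coordinates for a 28 x 28 image: x runs left to right, y runs top to bottom
pixel_grid <- expand.grid(1:28, 28:1)
xy_axis <- data.frame(x = pixel_grid[, 1], y = pixel_grid[, 2])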
plot_theme <- list(
                  raster = geom_raster(hjust = 0, vjust = 0),
                  gradient_fill = scale_fill_gradient(low = "white", high = "black", guide = "none"),
                  theme = theme(axis.line = element_blank(),
                  axis.text = element_blank(),
                  axis.ticks = element_blank(),
                  axis.title = element_blank(),
                  panel.background = element_blank(),
                  panel.border = element_blank(),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  plot.background = element_blank())
                  )
sample_plots <- sample(1:nrow(fmnist_train),100) %>% map(~ {
                  plot_data <- cbind(xy_axis, fill = as.data.frame(t(fmnist_train[.x, -1]))[,1])
                  ggplot(plot_data, aes(x, y, fill = fill)) + plot_theme
                  })
do.call("grid.arrange", c(sample_plots, ncol = 10, nrow = 10))

A random sample of 100 images, to get a feel for the data:

[Figure: 10 × 10 grid of randomly sampled Fashion-MNIST images]
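The coordinate grid in the plotting code works because expand.grid varies its first argument fastest. Assuming the CSV stores the 784 pixel columns row by row from the top-left corner (the layout described on the Kaggle dataset page), pixel k then lands at the right place in the picture. A small base R sketch of that mapping, where pixel_xy is a hypothetical helper introduced only for illustration:

```r
# Same coordinate grid as xy_axis in the plotting code
grid <- expand.grid(x = 1:28, y = 28:1)

# The first argument varies fastest, so consecutive pixels walk along a row
grid[c(1, 2, 28, 29, 784), ]

# Hypothetical helper: plot coordinates of the k-th pixel column (1-based)
pixel_xy <- function(k) c(x = grid$x[k], y = grid$y[k])

pixel_xy(1)    # top-left corner: x = 1, y = 28
pixel_xy(29)   # first pixel of the second image row: x = 1, y = 27
```

Using 28:1 for y flips the vertical axis, so the first image row is drawn at the top rather than at the bottom.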

Now it is H2O's turn to shine:

# Try a quick first model
fmnist_nn_1 <- h2o.deeplearning(x = 2:785,
                                y = "label",
                                training_frame = fmnist_train,
                                distribution = "multinomial",
                                model_id = "fmnist_nn_1",
                                l2 = 0.4,
                                ignore_const_cols = FALSE,  # keep constant pixel columns so every neuron has all 784 weights
                                hidden = 10,
                                export_weights_and_biases = TRUE)

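The export_weights_and_biases parameter of the deep learning model is set to TRUE so that the weights and biases of the network can be retrieved. The hidden-layer neurons can then be visualized: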

weights_nn_1 <- as.data.frame(h2o.weights(fmnist_nn_1, 1))
biases_nn_1 <- as.vector(h2o.biases(fmnist_nn_1, 1))
neurons_plots <- 1:10 %>% map(~ {
              plot_data <- cbind(xy_axis, fill = t(weights_nn_1[.x,]) + biases_nn_1[.x])
              colnames(plot_data)[3] <- "fill"
              ggplot(plot_data, aes(x, y, fill = fill)) + plot_theme
              })
do.call("grid.arrange", c(neurons_plots, ncol = 3, nrow = 4))
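The plotting code above treats each hidden neuron as a 28 × 28 picture: row i of the first-layer weight matrix holds one weight per input pixel, and the neuron's bias is added to every pixel as a constant shift. A minimal base R sketch of that reshaping, using random numbers as a stand-in for the real weights (the names weights, biases and neuron_image are made up for this illustration):

```r
set.seed(1)
n_hidden <- 10   # hidden = 10 in the model above
n_pixels <- 784  # 28 x 28 input pixels

# Stand-ins for as.data.frame(h2o.weights(...)) and as.vector(h2o.biases(...))
weights <- matrix(rnorm(n_hidden * n_pixels), nrow = n_hidden)
biases <- rnorm(n_hidden)

# Reshape neuron i into a 28 x 28 matrix, filling row by row like the image
neuron_image <- function(i) {
  matrix(weights[i, ] + biases[i], nrow = 28, ncol = 28, byrow = TRUE)
}

img <- neuron_image(3)
dim(img)

# Optional base-graphics view (the article itself uses ggplot2):
# image(t(img[28:1, ]), col = gray.colors(64), axes = FALSE)
```

With the real model, pixels with large positive weights are the ones that push the neuron towards activation, which is why some neurons end up looking like garment outlines.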

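[Figure: weights of the 10 hidden neurons, each drawn as a 28 × 28 image]

Some of the neurons look a lot like shirts, shoes and similar items. Next, look at the model's error rate on the test set: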

> h2o.confusionMatrix(fmnist_nn_1, fmnist_test)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
          0   1    2    3    4    5   6    7   8    9  Error             Rate
0       834   6   31   70    2   19  10    3  25    0 0.1660 =    166 / 1,000
1         7 935   22   30    1    3   2    0   0    0 0.0650 =     65 / 1,000
2        37   0  686    8  170   32  53    0  14    0 0.3140 =    314 / 1,000
3        61  18   11  860   16   13  19    0   2    0 0.1400 =    140 / 1,000
4         4   8  156   56  709   16  48    0   3    0 0.2910 =    291 / 1,000
5         1   0    0    1    0  838   0  118   1   41 0.1620 =    162 / 1,000
6       308   4  241   43  189   52 139    0  23    1 0.8610 =    861 / 1,000
7         0   0    0    0    0  107   0  810   0   83 0.1900 =    190 / 1,000
8         8   0   26   10    4   40   0   19 888    5 0.1120 =    112 / 1,000
9         0   0    0    0    0   45   0   63   0  892 0.1080 =    108 / 1,000
Totals 1260 971 1173 1078 1091 1165 271 1013 956 1022 0.2409 = 2,409 / 10,000

Not very strong: 2,409 of the 10,000 test images are misclassified, i.e. an accuracy of 75.91%.
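The 75.91% figure follows directly from the confusion matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. A small base R sketch with the counts copied from the output above (the variable names are my own):

```r
classes <- as.character(0:9)

# Rows are actual classes, columns are predicted classes
cm <- matrix(c(
  834,   6,  31,  70,   2,  19,  10,   3,  25,   0,
    7, 935,  22,  30,   1,   3,   2,   0,   0,   0,
   37,   0, 686,   8, 170,  32,  53,   0,  14,   0,
   61,  18,  11, 860,  16,  13,  19,   0,   2,   0,
    4,   8, 156,  56, 709,  16,  48,   0,   3,   0,
    1,   0,   0,   1,   0, 838,   0, 118,   1,  41,
  308,   4, 241,  43, 189,  52, 139,   0,  23,   1,
    0,   0,   0,   0,   0, 107,   0, 810,   0,  83,
    8,   0,  26,  10,   4,  40,   0,  19, 888,   5,
    0,   0,   0,   0,   0,  45,   0,  63,   0, 892
), nrow = 10, byrow = TRUE,
   dimnames = list(actual = classes, predicted = classes))

accuracy <- sum(diag(cm)) / sum(cm)            # 7591 / 10000
per_class_error <- 1 - diag(cm) / rowSums(cm)  # matches the Error column

accuracy
names(which.max(per_class_error))              # hardest class

# Most frequent wrong predictions for class 6 (column 7 is class 6 itself)
sort(cm["6", -7], decreasing = TRUE)[1:2]
```

Class 6 stands out with an error of 0.861; its row shows it is mostly predicted as class 0 (308 times) and class 2 (241 times). In the Fashion-MNIST labelling, 6 is Shirt, 0 is T-shirt/top and 2 is Pullover, so the model is mixing up garments with similar silhouettes.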

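In fact, H2O's deep learning function has more than 70 parameters; for the sake of speed I used fewer than 10 above. The original author tried the parameters below and reached an accuracy of 91.6%. I attempted the same, but my machine sat at 100% CPU and took about 5 minutes to complete 1% of the run, so I did not continue. H2O also has a GPU version, which interested readers may want to try.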

# h2o.deeplearning has a great many parameters; do not run the model below casually, it can be very slow and keep the CPU at 100%
fmnist_nn_final <- h2o.deeplearning(x = 2:785,
                                    y = "label",
                                    training_frame = fmnist_train,
                                    distribution = "multinomial",
                                    model_id = "fmnist_nn_final",
                                    activation = "RectifierWithDropout",
                                    hidden=c(1000, 1000, 2000),
                                    epochs = 180,
                                    adaptive_rate = FALSE,
                                    rate=0.01,
                                    rate_annealing = 1.0e-6,
                                    rate_decay = 1.0,
                                    momentum_start = 0.4,
                                    momentum_ramp = 384000,
                                    momentum_stable = 0.98, 
                                    input_dropout_ratio = 0.22,
                                    l1 = 1.0e-5,
                                    max_w2 = 15.0, 
                                    initial_weight_distribution = "Normal",
                                    initial_weight_scale = 0.01,
                                    nesterov_accelerated_gradient = TRUE,
                                    loss = "CrossEntropy",
                                    fast_mode = TRUE,
                                    diagnostics = TRUE,
                                    ignore_const_cols = TRUE,
                                    force_load_balance = TRUE,
                                    seed = 3.656455e+18)
h2o.confusionMatrix(fmnist_nn_final, fmnist_test)
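If the full configuration is too slow on a laptop, one option is to cap the run rather than abandon it. The sketch below is an untested suggestion and does not come from the original article: it uses a smaller network together with the h2o.deeplearning arguments epochs, max_runtime_secs, stopping_rounds, stopping_metric and stopping_tolerance, with values picked arbitrarily for illustration. The model id fmnist_nn_quick is likewise made up. It should train faster, but it should not be expected to reach the accuracy reported for the full model.

```r
# Assumed, untuned settings for a shorter run; none of these values
# come from the original article
quick_args <- list(
  hidden = c(200, 200),                  # much smaller than c(1000, 1000, 2000)
  epochs = 10,                           # instead of 180
  max_runtime_secs = 600,                # hard cap of 10 minutes
  stopping_rounds = 3,                   # stop early once the metric stalls
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  seed = 42
)

# Train only when the h2o package and the training frame from earlier exist
if (requireNamespace("h2o", quietly = TRUE) && exists("fmnist_train")) {
  fmnist_nn_quick <- do.call(h2o::h2o.deeplearning, c(
    list(x = 2:785,
         y = "label",
         training_frame = fmnist_train,
         model_id = "fmnist_nn_quick"),
    quick_args))
  print(h2o::h2o.confusionMatrix(fmnist_nn_quick, fmnist_test))
}
```

Keeping the settings in a list makes it easy to change one value at a time and rerun, which is useful when each run takes minutes.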

I did not actually run fmnist_nn_final myself. The confusion matrix reported in the original article is:

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
          0    1    2    3    4   5   6    7    8   9  Error            Rate
0       898    0   14   15    1   1  66    0    5   0 0.1020  = 102 / 1 000
1         2  990    2    6    0   0   0    0    0   0 0.0100   = 10 / 1 000
2        12    1  875   13   60   1  35    0    3   0 0.1250  = 125 / 1 000
3        16   11    8  925   23   1  14    0    2   0 0.0750   = 75 / 1 000
4         1    0   61   21  885   0  30    0    2   0 0.1150  = 115 / 1 000
5         0    0    1    0    0 964   0   24    1  10 0.0360   = 36 / 1 000
6       131    2   66   22   50   0 722    0    7   0 0.2780  = 278 / 1 000
7         0    0    0    0    0  10   0  963    0  27 0.0370   = 37 / 1 000
8         4    1    4    1    1   2   3    2  981   1 0.0190   = 19 / 1 000
9         0    0    0    0    0   6   0   37    0 957 0.0430   = 43 / 1 000
Totals 1064 1005 1031 1003 1020 985 870 1026 1001 995 0.0840 = 840 / 10 000
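Putting the two Error columns side by side shows where the larger network gains the most. A base R sketch with the per-class error rates copied from the two confusion matrices above (variable names are mine):

```r
classes <- as.character(0:9)

# Error column of the quick model (fmnist_nn_1) and of the final model
error_quick <- c(0.166, 0.065, 0.314, 0.140, 0.291, 0.162, 0.861, 0.190, 0.112, 0.108)
error_final <- c(0.102, 0.010, 0.125, 0.075, 0.115, 0.036, 0.278, 0.037, 0.019, 0.043)
names(error_quick) <- names(error_final) <- classes

# Each class has 1,000 test images, so overall accuracy is 1 minus the mean error
accuracy_quick <- 1 - mean(error_quick)
accuracy_final <- 1 - mean(error_final)
c(quick = accuracy_quick, final = accuracy_final)

# Drop in error rate per class, largest first
improvement <- error_quick - error_final
sort(improvement, decreasing = TRUE)[1:3]
```

Class 6 improves the most, from 0.861 to 0.278, yet it remains the hardest class in both models.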

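    Original author: Stone Shi
    Original URL: https://zhuanlan.zhihu.com/p/30679384
    This article is reposted from the web for the purpose of sharing knowledge; if it infringes any rights, please contact the blogger to have it removed.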
