A hands-on guide to iceberg - trino on k8s
Preface
This post sets up an iceberg data-lake system as a foundation for exploring iceberg's features. Trino with its built-in iceberg connector is relatively simple to work with, and there is already a ready-made project on github to borrow from; we only need to make a few adaptations. The whole project runs on Alibaba Cloud; later, if needed, I may build similar demos on Huawei Cloud or Tencent Cloud.
The figure below shows the basic architecture of an iceberg-based data lake.
trino is a big-data query engine that achieves low-latency, high-concurrency queries through long-running processes and fully in-memory computation; it is typically used for near-real-time workloads ranging from seconds to a few minutes. There are plenty of introductions online, so I won't go into detail here.
The figure below is roughly the design of the whole demo project:
We use one low-spec ECS instance as the single entry point to the public network.
Request several pay-as-you-go burstable Alibaba Cloud instances (very cheap, ideal for a demo), say two 4-core/8G machines to start, then create the K8S cluster.
For simplicity, object storage is a binary minio installation on the public-facing host (convenient for browsing the stored data directly over the web).
In the K8S cluster we first install mariadb, which in the iceberg data-lake design mainly stores the object-storage path of the current snapshot.
Then we install the metastore to manage metadata.
Finally, we install trino and we're done.
Code repository
gitee.com/nativesailo…
Note: the yaml in this post may not be the final version; refer to the repository for the authoritative configs.
This post has two parts: the first is environment setup; the second uses trino for a first taste of storing data in the iceberg data lake.
Environment setup
Enter Little A
Request a cloud host
Reference: juejin.cn/post/684516…
Prepare one ECS server; we'll call it Little A.
Connect to the host
Log in with the pem key
Install git
sudo yum install git
Download the code from github
Search "trino-on-k8s github" to find the project link: github.com/joshuarobin…
Clone the code to the host
git clone github.com/joshuarobin…
Read the README
Main deployment flow
- Build Docker image for Hive Metastore
- Deploy Hive Metastore: MariaDB (pvc and deployment), init-schemas, Metastore
- Deploy Trino services (coordinator, workers, and cli)
Install minio-server on Little A
Download minio
wget dl.min.io/server/mini…
Make it executable
chmod +x minio
Set the minio access key
export MINIO_ACCESS_KEY=minioadmin
Set the minio secret key
export MINIO_SECRET_KEY=minioadmin
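Newer MinIO releases deprecate these two variables (the startup log below says as much); on a recent build the equivalents are:
export MINIO_ROOT_USER=minioadmin
export MINIO_ROOT_PASSWORD=minioadmin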
Start minio in the background
Pin the Console port to 9090
nohup ./minio server --console-address :9090 /data/minio/ > /data/minio/minio.log 2>&1 &
Start minio in the foreground
Pin the Console port to 9090
[root@iZ8vbgn00yu75o270lb2azZ code]# ./minio server --console-address :9090 /data/minio/
WARNING: MINIO_ACCESS_KEY and MINIO_SECRET_KEY are deprecated.
Please use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD
API: http://172.28.166.95:9000 http://127.0.0.1:9000
RootUser: minioadmin
RootPass: minioadmin
Console: http://172.28.166.95:9090 http://127.0.0.1:9090
RootUser: minioadmin
RootPass: minioadmin
Command-line: https://docs.min.io/docs/minio-client-quickstart-guide
$ mc alias set myminio http://172.28.166.95:9000 minioadmin minioadmin
Documentation: https://docs.min.io
Access
Use Little A's public IP 39.103.234.55 with port 9090 for web access.
Install minio-client on Little A
Download the mc client
wget http://dl.minio.org.cn/client/mc/release/linux-amd64/mc
Make it executable
chmod +x mc
./mc --help
Add the host alias
./mc config host add minio http://172.28.166.95:9000 minioadmin minioadmin --api s3v4
Browse the storage
./mc ls minio/datalake
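If the datalake bucket doesn't exist yet, this ls will fail; a quick sketch to create it first (the bucket name matches the warehouse path used later in metastore-site.xml):
./mc mb minio/datalake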
Set up the K8S cluster
Request two burstable cloud hosts
Reference for creating burstable instances: help.aliyun.com/document_de…
Pay-as-you-go; 2-core/4G burstable performance instances;
No public IP; use the same security group as Little A;
Use the same key pair as Little A;
Click Create.
Create the K8S cluster
ACK managed edition; standard edition; pay-as-you-go;
No SNAT; the economy spec for API Server access; do not use an EIP to expose the API Server;
Select the two ECS instances just created as Worker instances; choose CentOS 7.9 as the OS;
For components keep only Ingress, with a private-network load balancer of the economy spec (this post doesn't actually use the cluster's Ingress, so you can skip installing the Ingress component);
Click Create Cluster and you're done.
Install the kubectl client on Little A
yum install -y kubectl
Configure cluster access for Little A
vi $HOME/.kube/config, paste in the cluster's kubeconfig, and save.
[root@iZ8vbgn00yu75o270lb2azZ nfs]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-zhangjiakou.172.28.166.120 Ready <none> 27m v1.18.8-aliyun.1
cn-zhangjiakou.172.28.166.121 Ready <none> 27m v1.18.8-aliyun.1
Install NFS
Enter the trino-on-k8s directory and look at maria_pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: maria-pv-claim
spec:
storageClassName: pure-block
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
As you can see, MariaDB relies on a StorageClass for its data storage, so we first define one in the cluster.
References:
blog.csdn.net/huwh_/artic…
www.cnblogs.com/allmdzz/p/1…
Create the NFS share on Little A
Reference: using self-managed NFS for persistent storage on TKE
- Install the NFS server
dnf install nfs-utils
- Start nfs-server
systemctl start nfs-server && systemctl enable nfs-server
- Create the shared directory
mkdir /data/nfs
- Configure the export
vi /etc/exports
/data/nfs *(insecure,rw,async,no_root_squash)
- Apply the exports
exportfs -rv
- Verify (a mount test from a worker follows below)
showmount -e
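Optionally, verify the export from one of the workers before handing it to K8S; a hedged sketch (assumes nfs-utils is installed there and /mnt is free):
ssh worker1
sudo yum install -y nfs-utils
sudo mount -t nfs 172.28.166.95:/data/nfs /mnt
ls /mnt && sudo umount /mnt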
Create the namespace
kubectl create ns datalake
Create the ServiceAccount and bind permissions
rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: nfs-client-provisioner
namespace: datalake
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: nfs-client-provisioner-runner
rules:
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: run-nfs-client-provisioner
subjects:
- kind: ServiceAccount
name: nfs-client-provisioner
namespace: datalake
roleRef:
kind: ClusterRole
name: nfs-client-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: leader-locking-nfs-client-provisioner
namespace: datalake
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: leader-locking-nfs-client-provisioner
subjects:
- kind: ServiceAccount
name: nfs-client-provisioner
namespace: datalake
roleRef:
kind: Role
name: leader-locking-nfs-client-provisioner
apiGroup: rbac.authorization.k8s.io
Apply rbac.yaml (the datalake namespace was created above):
kubectl apply -f rbac.yaml -n datalake
Create the StorageClass
nfs-StorageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: pure-block
provisioner: qgg-nfs-storage
parameters:
archiveOnDelete: "false"
The provisioner name must match the PROVISIONER_NAME environment variable in the provisioner deployment;
metadata.name in nfs-StorageClass.yaml must match the storageClassName specified in maria_pvc.yaml.
kubectl apply -f nfs-StorageClass.yaml -n datalake
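A quick check that the class registered:
kubectl get storageclass pure-block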
Create an Alibaba Cloud private registry
I created a registry under the namespace sailor-datalake; it's straightforward, so details are omitted.
Images
Since the K8S cluster cannot reach the Internet, the images have to be pulled elsewhere, re-tagged, and pushed to the private (VPC-internal) registry.
- Pull the nfs-client-provisioner image on Little A
docker pull quay.io/external_storage/nfs-client-provisioner:latest
- Tag the image
docker tag quay.io/external_storage/nfs-client-provisioner:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
- Log in to the registry
docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
Note: fill in the account and address of your own Alibaba Cloud container registry.
- Push the image to the registry
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
Configure docker login on the K8S worker nodes
Set up access to the worker nodes
- Configure hosts
Open the config file:
vi /etc/hosts
Add each worker node's internal IP and the name you give it:
172.28.166.xxx worker1
172.28.166.yyy worker2
- Remove stale host keys (after a worker restart or rebuild, delete the old entries recorded in /root/.ssh/known_hosts)
- SSH access
ssh-agent bash
ssh-add -k Happy_123.pem
ssh worker1
or
ssh -i Happy_123.pem worker1
Run docker login on the worker nodes
I found that registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest, the image I pushed to my own registry, would always fail to pull for the nfs-client-provisioner deployment due to a permission problem, leaving the pod unable to start. None of the other images had this issue.
The workaround: log in to each worker node, run docker login manually and pull the image onto the node, then set the nfs-client-provisioner deployment's imagePullPolicy to IfNotPresent. I'll cover the root cause in another post.
ssh worker1
sudo docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
sudo docker login --username=悟滋味的人生 registry-vpc.cn-zhangjiakou.aliyuncs.com
Happy@123
sudo docker pull registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
Note: use the VPC-internal address (registry-vpc).
Add the workers to Little A's security group
Note: alternatively you can request the ECS instances first (already in Little A's security group) and then create the K8S cluster, selecting those instances during creation. Be very careful not to select Little A itself when choosing ECS nodes, or Little A will be re-imaged and all your data and configuration will be lost!
Create the NFS provisioner
nfs-provisioner.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nfs-client-provisioner
labels:
app: nfs-client-provisioner
namespace: datalake
spec:
replicas: 1
selector:
matchLabels:
app: nfs-client-provisioner
strategy:
type: Recreate
template:
metadata:
labels:
app: nfs-client-provisioner
spec:
serviceAccountName: nfs-client-provisioner
containers:
- name: nfs-client-provisioner
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/nfs-client-provisioner:latest
volumeMounts:
- name: nfs-client-root
mountPath: /persistentvolumes
env:
- name: PROVISIONER_NAME
value: qgg-nfs-storage
- name: NFS_SERVER
value: 172.28.166.95
- name: NFS_PATH
value: /data/nfs
volumes:
- name: nfs-client-root
nfs:
server: 172.28.166.95
path: /data/nfs
The NFS mount path is the /data/nfs directory created on Little A earlier;
since the NFS server runs on Little A, NFS_SERVER and the volume's server address must both be Little A's private IP;
PROVISIONER_NAME is the provisioner name and must match the provisioner field in nfs-StorageClass.yaml.
kubectl apply -f nfs-provisioner.yaml -n datalake
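Before moving on, you can verify end-to-end dynamic provisioning with a throwaway claim; a minimal sketch (test-claim is just a scratch name I'm choosing here):
cat <<EOF | kubectl apply -n datalake -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  storageClassName: pure-block
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
# The claim should become Bound, and a matching directory should appear under /data/nfs on Little A
kubectl get pvc test-claim -n datalake
kubectl delete pvc test-claim -n datalake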
Install MariaDB
Apply the pvc
Change the storage size in the pvc file to 5Gi:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: maria-pv-claim
spec:
storageClassName: pure-block
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
Apply it:
kubectl apply -f maria_pvc.yaml -n datalake
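The claim should go to Bound once the provisioner reacts; a quick check:
kubectl get pvc maria-pv-claim -n datalake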
Install docker on Little A
yum -y install docker
systemctl start docker
systemctl status docker
Images
As before, the cluster cannot reach the Internet, so pull the image, re-tag it, and push it to the private (VPC-internal) registry.
- Pull the mariadb server image on Little A
docker pull mariadb/server:latest
- Tag the image
docker tag mariadb/server:latest registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest
- Log in to the registry
docker login --username=XXX registry-vpc.cn-zhangjiakou.aliyuncs.com
Note: fill in the account and address of your own Alibaba Cloud container registry.
- Push the image to the registry
docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest
Install the MariaDB service
cd /opt/code/trino-on-k8s
vi maria_deployment.yaml
Edit the yaml:
---
apiVersion: v1
kind: Service
metadata:
name: metastore-db
spec:
ports:
- port: 13306
targetPort: 3306
selector:
app: mysql
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql
spec:
selector:
matchLabels:
app: mysql
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql
spec:
containers:
- name: mariadb
image: "registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/mariadb-server:latest"
env:
- name: MYSQL_ROOT_PASSWORD
value: Happy_123
ports:
- containerPort: 3306
name: mysql
volumeMounts:
- name: mariadb-for-hive
mountPath: /var/lib/mysql
resources:
requests:
memory: "1G"
cpu: 0.5
volumes:
- name: mariadb-for-hive
persistentVolumeClaim:
claimName: maria-pv-claim
Changes: the image address; the mariadb root password; and the resource requests, lowered to small values.
kubectl apply -f maria_deployment.yaml -n datalake
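A quick sanity check that the database pod and its Service are up:
kubectl get pod -n datalake -l app=mysql
kubectl get svc metastore-db -n datalake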
Inspect the metastore_db database
- Install the mysql client on Little A
yum install mysql -y
- Expose a NodePort from the cluster
vi maria_deployment.yaml
Modify the Service in the yaml, adding NodePort 32005:
---
apiVersion: v1
kind: Service
metadata:
name: metastore-db
spec:
type: NodePort
ports:
- port: 13306
targetPort: 3306
nodePort: 32005
selector:
app: mysql
- Query the database
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# mysql -h worker1 -P 32005 -u root -p
Enter password: Happy_123
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 26
Server version: 10.5.10-MariaDB-1:10.5.10+maria~bionic mariadb.org binary distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| metastore_db |
| mysql |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)
MariaDB [(none)]> use metastore_db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [metastore_db]> show tables;
mysql -h worker1 -P 32005 -u root -p
mysql -h 172.28.166.120 -P 32005 -u root -p
172.28.166.120 is the IP of any one of the nodes; 32005 is the NodePort we set for the MariaDB service.
Install the metastore
Build the image with the build script
Edit the build script:
cd /opt/code/trino-on-k8s/hive_metastore
vi build_image.sh
Set the image repository and TAG, and pass the datalake namespace when creating the configmap:
set -e
REPONAME=registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake
TAG=metastore:v1.0.1
docker build -t $TAG .
# Tag and push to the public docker repository.
docker tag $TAG $REPONAME/$TAG
docker push $REPONAME/$TAG
# Update configmaps
kubectl create configmap metastore-cfg --dry-run --from-file=metastore-site.xml --from-file=core-site.xml -n datalake -o yaml | kubectl apply -f - -n datalake
Initialize the schema
vi hive-initschema.yaml
Change the password and the image name; -passWord must match the MYSQL_ROOT_PASSWORD (Happy_123) set in maria_deployment.yaml (the yaml below still shows the upstream placeholder mypass).
apiVersion: batch/v1
kind: Job
metadata:
name: hive-initschema
spec:
template:
spec:
containers:
- name: hivemeta
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
command: ["/opt/hive-metastore/bin/schematool"]
args: ["--verbose" ,"-initSchema" , "-dbType", "mysql" , "-userName", "root",
"-passWord", "mypass" , "-url", "jdbc:mysql://metastore-db:13306/metastore_db?createDatabaseIfNotExist=true"]
restartPolicy: Never
backoffLimit: 4
Apply it:
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl apply -f hive-initschema.yaml -n datalake
job.batch/hive-initschema created
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME READY STATUS RESTARTS AGE
hive-initschema-9zd75 0/1 Completed 0 14s
mysql-8466d8694c-mmlqk 1/1 Running 0 75m
nfs-client-provisioner-75bc5bc89f-xmbw8 1/1 Running 0 37m
Modify core-site.xml
The S3 settings in the original core-site.xml are incomplete, so we add a few more.
See trino.io/docs/curren…
<configuration>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>http://172.28.166.95:9000</value>
</property>
<property>
<name>hive.s3a.aws-access-key</name>
<value>minioadmin</value>
</property>
<property>
<name>hive.s3a.aws-secret-key</name>
<value>minioadmin</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>minioadmin</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>minioadmin</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
</configuration>
Set fs.s3a.endpoint to Little A's private IP plus minio's port.
Modify metastore-site.xml
<configuration>
<property>
<name>metastore.task.threads.always</name>
<value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
</property>
<property>
<name>metastore.expression.proxy</name>
<value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306/metastore_db</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Happy_123</value>
</property>
<property>
<name>metastore.warehouse.dir</name>
<value>s3a://datalake/warehouse/</value>
</property>
<property>
<name>metastore.thrift.port</name>
<value>9083</value>
</property>
</configuration>
Changes: the database password, and the namespace in the connection URL (now datalake).
Run the build script again to update the configmap.
Install the metastore service
- Edit the yaml
---
apiVersion: v1
kind: Service
metadata:
name: metastore
spec:
ports:
- port: 9083
selector:
app: metastore
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: metastore
spec:
selector:
matchLabels:
app: metastore
strategy:
type: Recreate
template:
metadata:
labels:
app: metastore
spec:
containers:
- name: metastore
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/metastore:v1.0.1
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: my-s3-keys
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: my-s3-keys
key: secret-key
ports:
- containerPort: 9083
volumeMounts:
- name: metastore-cfg-vol
mountPath: /opt/hive-metastore/conf/metastore-site.xml
subPath: metastore-site.xml
- name: metastore-cfg-vol
mountPath: /opt/hadoop/etc/hadoop/core-site.xml
subPath: core-site.xml
command: ["/opt/hive-metastore/bin/start-metastore"]
args: ["-p", "9083"]
resources:
requests:
memory: "1G"
cpu: 0.5
imagePullPolicy: Always
volumes:
- name: metastore-cfg-vol
configMap:
name: metastore-cfg
- Add a Secret config file, my-s3-keys.yaml
apiVersion: v1
kind: Secret
metadata:
name: my-s3-keys
type: Opaque
data:
access-key: bWluaW9hZG1pbg==
secret-key: bWluaW9hZG1pbg==
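The two values are just the base64-encoded minio credentials; you can reproduce them like this (the -n matters, or a trailing newline ends up inside the key):
echo -n 'minioadmin' | base64   # prints bWluaW9hZG1pbg==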
Apply:
kubectl apply -f my-s3-keys.yaml -n datalake
kubectl apply -f metastore.yaml -n datalake
Install trino
Modify the trino configuration
vi trino-cfgs.yaml
Edit the yaml:
---
kind: ConfigMap
apiVersion: v1
metadata:
name: trino-configs
data:
jvm.config: |-
-server
-Xmx2G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+UseGCOverheadLimit
-XX:+HeapDumpOnOutOfMemoryError
-XX:ReservedCodeCacheSize=512M
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
config.properties.coordinator: |-
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=200GB
query.max-memory-per-node=0.2GB
query.max-total-memory-per-node=0.6GB
query.max-stage-count=200
task.writer-count=4
discovery-server.enabled=true
discovery.uri=http://trino:8080
config.properties.worker: |-
coordinator=false
http-server.http.port=8080
query.max-memory=200GB
query.max-memory-per-node=0.2GB
query.max-total-memory-per-node=0.6GB
query.max-stage-count=200
task.writer-count=4
discovery.uri=http://trino:8080
node.properties: |-
node.environment=test
spiller-spill-path=/tmp
max-spill-per-node=4TB
query-max-spill-per-node=1TB
hive.properties: |-
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore:9083
hive.allow-drop-table=true
hive.max-partitions-per-scan=1000000
hive.s3.endpoint=172.28.166.95
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
hive.s3.max-connections=100
iceberg.properties: |-
connector.name=iceberg
hive.metastore.uri=thrift://metastore:9083
hive.max-partitions-per-scan=1000000
hive.s3.endpoint=172.28.166.95
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
hive.s3.max-connections=100
mysql.properties: |-
connector.name=mysql
connection-url=jdbc:mysql://metastore-db.datalake.svc.cluster.local:13306
connection-user=root
connection-password=Happy_123
Apply:
kubectl apply -f trino-cfgs.yaml -n datalake
Edit trino.yaml:
---
apiVersion: v1
kind: Service
metadata:
name: trino
spec:
ports:
- port: 8080
selector:
app: trino-coordinator
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: trino-coordinator
spec:
selector:
matchLabels:
app: trino-coordinator
strategy:
type: Recreate
template:
metadata:
labels:
app: trino-coordinator
spec:
containers:
- name: trino
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
ports:
- containerPort: 8080
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: my-s3-keys
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: my-s3-keys
key: secret-key
volumeMounts:
- name: trino-cfg-vol
mountPath: /etc/trino/jvm.config
subPath: jvm.config
- name: trino-cfg-vol
mountPath: /etc/trino/config.properties
subPath: config.properties.coordinator
- name: trino-cfg-vol
mountPath: /etc/trino/node.properties
subPath: node.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/hive.properties
subPath: hive.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/iceberg.properties
subPath: iceberg.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/mysql.properties
subPath: mysql.properties
resources:
requests:
memory: "1G"
cpu: 0.5
imagePullPolicy: Always
volumes:
- name: trino-cfg-vol
configMap:
name: trino-configs
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: trino-worker
spec:
serviceName: trino-worker
replicas: 1
selector:
matchLabels:
app: trino-worker
template:
metadata:
labels:
app: trino-worker
spec:
securityContext:
fsGroup: 1000
containers:
- name: trino
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
ports:
- containerPort: 8080
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: my-s3-keys
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: my-s3-keys
key: secret-key
volumeMounts:
- name: trino-cfg-vol
mountPath: /etc/trino/jvm.config
subPath: jvm.config
- name: trino-cfg-vol
mountPath: /etc/trino/config.properties
subPath: config.properties.worker
- name: trino-cfg-vol
mountPath: /etc/trino/node.properties
subPath: node.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/hive.properties
subPath: hive.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/iceberg.properties
subPath: iceberg.properties
- name: trino-cfg-vol
mountPath: /etc/trino/catalog/mysql.properties
subPath: mysql.properties
- name: trino-tmp-data
mountPath: /tmp
resources:
requests:
memory: "1G"
cpu: 0.5
imagePullPolicy: Always
volumes:
- name: trino-cfg-vol
configMap:
name: trino-configs
volumeClaimTemplates:
- metadata:
name: trino-tmp-data
spec:
storageClassName: pure-block
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 40Gi
---
apiVersion: v1
kind: Pod
metadata:
name: trino-cli
spec:
containers:
- name: trino-cli
image: registry-vpc.cn-zhangjiakou.aliyuncs.com/sailor-datalake/trino:361
command: ["tail", "-f", "/dev/null"]
imagePullPolicy: Always
restartPolicy: Always
Changes: the image name; the trino-tmp-data volume size set to 40Gi; worker replicas set to 1; and small resource requests for the worker and coordinator.
Apply trino.yaml:
kubectl apply -f trino.yaml -n datalake
[root@iZ8vbgn00yu75o270lb2azZ trino-on-k8s]# kubectl get pod -n datalake
NAME READY STATUS RESTARTS AGE
hive-initschema-9zd75 0/1 Completed 0 71m
metastore-7df56797cc-j7dcd 1/1 Running 0 11m
mysql-8466d8694c-mmlqk 1/1 Running 0 146m
nfs-client-provisioner-75bc5bc89f-xmbw8 1/1 Running 0 108m
trino-cli 1/1 Running 0 64s
trino-coordinator-54f78c7984-zqf8t 1/1 Running 0 64s
trino-worker-0 0/1 Pending 0 64s
trino-worker-0 is Pending because the cluster is short on resources; add another ECS worker and join it to the cluster.
Binary client
wget https://repo1.maven.org/maven2/io/trino/trino-cli/361/trino-cli-361-executable.jar
mv trino-cli-361-executable.jar trino
chmod +x trino
# Connect to trino; port 8080 matches the port in config.properties
./trino --server localhost:8080 --catalog iceberg
Test
kubectl exec -it trino-cli -n datalake -- trino --server trino:8080 --catalog iceberg --schema default
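The CLI also supports one-off statements via --execute, handy for a quick smoke test:
kubectl exec -it trino-cli -n datalake -- trino --server trino:8080 --execute 'SHOW CATALOGS'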
Access the trino UI
Trino runs inside the cluster, and the cluster has no public access, so the only way to reach the UI is through Little A. The plan: install nginx on Little A and proxy it to the NodePort exposed for trino.
Install nginx on Little A
Reference: blog.csdn.net/lxh_worldpe…
Add port 32001 to the security group (32001 is trino's NodePort; see the service sketch below).
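Note that trino.yaml only defines a ClusterIP service, so the NodePort mentioned here has to be created separately; a minimal sketch of such a service (trino-nodeport is a name I'm choosing, not from the repo):
apiVersion: v1
kind: Service
metadata:
  name: trino-nodeport
spec:
  type: NodePort
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 32001
  selector:
    app: trino-coordinator
Apply it like the other yamls: kubectl apply -f trino-nodeport.yaml -n datalake.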
Configure nginx
Listen on port 32001 (the choice is arbitrary; here it simply matches trino's NodePort) and point proxy_pass at a cluster node IP plus trino's NodePort:
cd /usr/local/nginx/conf
vim nginx.conf
server {
listen 32001;
server_name localhost;
location / {
proxy_read_timeout 300;
proxy_connect_timeout 300;
proxy_redirect off;
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Frame-Options SAMEORIGIN;
proxy_pass http://172.28.166.135:32001;
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
root /usr/share/nginx/html;
}
}
See also www.cnblogs.com/fps2tao/p/9…
nginx needs a reload for config changes to take effect:
/usr/local/nginx/sbin/nginx -t        # check that the modified config is valid
/usr/local/nginx/sbin/nginx -s reload # reload it
Web access
http://39.103.234.55:32001/ui/
Using trino to try out iceberg
Create a table
CREATE TABLE iceberg.default.events (level VARCHAR, event_time TIMESTAMP(6), message VARCHAR, call_stack ARRAY(VARCHAR)) WITH (partitioning= ARRAY['day(event_time)']);
Insert data
A single row first:
INSERT INTO iceberg.default.events VALUES('ERROR', timestamp '2021-04-01 12:00:00.000001', 'Oh noes', ARRAY ['Exception in thread "main" java.lang.NullPointerException']);
Then three rows in one statement:
INSERT INTO iceberg.default.events
VALUES
(
  'ERROR',
  timestamp '2021-04-01 12:00:00.000001',
  'Oh noes',
  ARRAY ['Exception in thread "main" java.lang.NullPointerException']
),
(
  'ERROR',
  timestamp '2021-04-02 15:55:55.555555',
  'Double oh noes',
  ARRAY ['Exception in thread "main" java.lang.NullPointerException']
),
(
  'WARN',
  timestamp '2021-04-02 00:00:11.112222',
  'Maybeh oh noes?',
  ARRAY ['Bad things could be happening??']
);
Query the data
SELECT * FROM iceberg.default.events WHERE event_time < timestamp '2021-04-02';
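Iceberg also exposes its metadata as system tables you can query from trino; for example, the snapshot history (the $snapshots suffix must be double-quoted):
kubectl exec -it trino-cli -n datalake -- trino --server trino:8080 --execute 'SELECT snapshot_id, committed_at, operation FROM iceberg.default."events$snapshots"'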
Inspect the data stored in MariaDB
MariaDB [(none)]> select * from metastore_db.TABLE_PARAMS;
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
| TBL_ID | PARAM_KEY | PARAM_VALUE |
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
| 1 | EXTERNAL | TRUE |
| 1 | metadata_location | s3a://datalake/warehouse/events/metadata/00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json |
| 1 | numFiles | 6 |
| 1 | previous_metadata_location | s3a://datalake/warehouse/events/metadata/00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json |
| 1 | table_type | iceberg |
| 1 | totalSize | 13167 |
| 1 | transient_lastDdlTime | 1633849635 |
+--------+----------------------------+---------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)
Inspect the data stored in minio
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/metadata
[2021-10-10 14:32:25 CST] 1.9KiB 00000-1b2ca292-85c7-435c-b884-5218d74d2605.metadata.json
[2021-10-10 15:02:04 CST] 1.9KiB 00000-21e6307c-8b63-41f6-987c-fbc86de000f9.metadata.json
[2021-10-10 15:07:14 CST] 1.9KiB 00000-4f86466c-c60c-4b2a-b2c0-5a8b6fb76e65.metadata.json
[2021-10-10 13:45:22 CST] 1.9KiB 00000-b88a389c-e3ca-491b-bd9e-e901a01f82fe.metadata.json
[2021-10-10 13:58:33 CST] 1.9KiB 00000-fe989767-841a-4412-a2e3-bc094e148ab1.metadata.json
[2021-10-10 15:09:51 CST] 2.8KiB 00001-9bfdcfaa-6e00-425e-aee7-6d4128cc6cfb.metadata.json
[2021-10-10 15:09:51 CST] 5.9KiB a10aa346-3b55-437b-93a3-20219f7d8ad1-m0.avro
[2021-10-10 15:07:14 CST] 3.3KiB snap-1941033720141809751-1-2b0c70bb-ab8d-4db8-b377-9722e4926e08.avro
[2021-10-10 15:09:51 CST] 3.4KiB snap-4869152224340281922-1-a10aa346-3b55-437b-93a3-20219f7d8ad1.avro
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data
[2021-10-10 23:49:32 CST] 0B event_time_day=2021-04-01/
[root@iZ8vbgn00yu75o270lb2azZ code]# ./mc ls minio/datalake/warehouse/events/data/event_time_day=2021-04-01/
Stopping the cluster
Stop the ECS instances
1. In stopped, no-billing mode, compute resources (vCPU and memory), fixed public IPs, and bandwidth are no longer billed.
2. Still billed: system disks, data disks, elastic public IPs and fixed-bandwidth plans, and paid images.
Stop the load balancer
After stopping everything I found the load balancer is still billed, and the ECS instances are billed at only about 25%. Still a decent saving.
Restarting the cluster
Start the ECS instances
Start the load balancer
Once both are started, the instances in the cluster all come back up.
Related posts
A hands-on guide to iceberg - trino_on_k8s
A hands-on guide to iceberg - spend_wisely
References
- Iceberg Connector
trino.io/docs/curren…
- StarBurst
Starburst Enterprise for Presto is a commercial edition of the Presto open-source distributed SQL query engine, used to find and analyze data residing in various distributed data sources.
The StarBurst blog has many high-quality articles on using trino; recommended.
In particular, the Trino On Ice series walks through many interesting iceberg experiments with trino.
blog.starburst.io/trino-on-ic…
- Cluster setup
juejin.cn/post/684516… juejin.cn/post/684790…