Greenplum: the road to optimizing bulk write and update performance

2023-03-17

Background
Greenplum offers three main ways to load data:

insert
copy in
external tables via gpfdist

In theory, performance ranks: gpfdist > copy in > insert (batch) > insert.

1 Installing the required tools
1.1 pgbench
Install pgbench via yum:

yum install -y https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7.9-x86_64/

yum install postgresql10
yum install postgresql10-server
yum install -y postgresql10-contrib
The packages install to /usr/pgsql-10/bin by default, so add that directory to PATH:

vim /etc/profile

# add this line
export PATH=$PATH:/usr/pgsql-10/bin

source /etc/profile
Create a test database on Greenplum:

create database pgbenchdb;
Run the initialization:

pgbench -i -s 5 pgbenchdb  -- initialize; creates 500,000 rows in pgbench_accounts
Afterwards, count the rows in pgbench_accounts; 500,000 rows means the load succeeded.
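pgbench sizes pgbench_accounts at 100,000 rows per unit of scale factor (-s), which is where the 500,000 figure comes from:

```python
# pgbench creates 100,000 pgbench_accounts rows per unit of scale factor (-s).
ROWS_PER_SCALE = 100_000

def expected_accounts(scale: int) -> int:
    return scale * ROWS_PER_SCALE

print(expected_accounts(5))  # 500000
```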

2 Benchmark tests
2.1 Test preparation
1. Create the test table tbl in the pgbenchdb database:

CREATE TABLE public.tbl (
id bigserial NOT NULL,
crt_time timestamp NULL,
sensorid int4 NULL,
sensorloc point NULL,
info text NULL
)
WITH (
autovacuum_enabled=on,
autovacuum_vacuum_threshold=1,
autovacuum_vacuum_cost_delay=0
)
DISTRIBUTED BY (id);
2. Create insert.sql under /home/gpadmin; it inserts one random row into the table:

insert into public.tbl (crt_time,sensorid,info) values ( clock_timestamp(),trunc(random()*500000),substring(md5(random()::text),1,8) );
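The statement builds each row from clock_timestamp(), a random sensorid below 500,000, and the first 8 hex characters of an md5 of a random value. The same row shape, sketched client-side for illustration (not part of the benchmark itself):

```python
import hashlib
import random
from datetime import datetime

def random_row():
    # Mirrors insert.sql: clock_timestamp(), trunc(random()*500000),
    # substring(md5(random()::text), 1, 8)
    crt_time = datetime.now()
    sensorid = int(random.random() * 500_000)
    info = hashlib.md5(str(random.random()).encode()).hexdigest()[:8]
    return crt_time, sensorid, info
```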
3. Create read.sql under /home/gpadmin; it reads a random row from the table:

select * from public.tbl where sensorid = trunc(random()*500000);
2.2 Running the tests
2.2.1 Write tests
Before each test, empty the table:

truncate table public.tbl
1. Connection test (-C opens a new connection per transaction): 224 clients, 8 threads, 8 transactions per client

pgbench -f /home/gpadmin/insert.sql -c 224 -C -j 8 -t 8 pgbenchdb
Result:

starting vacuum...end.
transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 224
number of threads: 8
number of transactions per client: 8
number of transactions actually processed: 1792/1792
latency average = 2308.419 ms
tps = 97.036123 (including connections establishing)
tps = 97.929852 (excluding connections establishing)
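As a sanity check on these numbers: with -c clients each waiting one latency period per transaction, throughput is roughly clients divided by latency, which matches the reported tps:

```python
# Reported by pgbench for the 224-client run above.
clients = 224
latency_ms = 2308.419

# tps ~= clients / latency (in seconds)
tps = clients / (latency_ms / 1000.0)
print(round(tps, 3))  # close to the reported 97.036 tps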
2. 64 clients, 100k rows written

pgbench -f /home/gpadmin/insert.sql -c 64 -C -j 8 -t 1563 pgbenchdb
Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 1563
number of transactions actually processed: 98733/100032
latency average = 649.590 ms
tps = 98.523657 (including connections establishing)
tps = 100.359377 (excluding connections establishing)
3. 128 clients, 100k rows written

pgbench -f /home/gpadmin/insert.sql -c 128 -C -j 8 -t 782 pgbenchdb
Result:

starting vacuum...end.
transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 128
number of threads: 8
number of transactions per client: 782
number of transactions actually processed: 100096/100096
latency average = 1316.215 ms
tps = 97.248556 (including connections establishing)
tps = 98.091034 (excluding connections establishing)
4. 64 clients, 1M rows written

pgbench -f /home/gpadmin/insert.sql -c 64 -C -j 8 -t 15625 pgbenchdb
Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 15625
number of transactions actually processed: 1000000/1000000
latency average = 769.468 ms
tps = 83.174383 (including connections establishing)
tps = 84.614156 (excluding connections establishing)
5. 128 clients, 1M rows written

pgbench -f /home/gpadmin/insert.sql -c 128 -C -j 8 -t 7813 pgbenchdb
Result:

transaction type: /home/gpadmin/insert.sql
scaling factor: 1
query mode: simple
number of clients: 128
number of threads: 8
number of transactions per client: 7813
number of transactions actually processed: 1000064/1000064
latency average = 1389.130 ms
tps = 92.144034 (including connections establishing)
tps = 92.931354 (excluding connections establishing)
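The -t values used above come from dividing the target row count by the client count and rounding up; pgbench then runs clients × t transactions in total, which is why the "actually processed" counts (100,032, 100,096, 1,000,064) slightly overshoot the round targets:

```python
import math

def txns_per_client(target_rows: int, clients: int) -> int:
    # pgbench -t takes transactions per client, so round up
    # to reach at least target_rows total.
    return math.ceil(target_rows / clients)

print(txns_per_client(100_000, 64))     # 1563 -> 64 * 1563 = 100032 total
print(txns_per_client(100_000, 128))    # 782  -> 128 * 782 = 100096 total
print(txns_per_client(1_000_000, 64))   # 15625
print(txns_per_client(1_000_000, 128))  # 7813
```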
2.2.2 Read test
1. 64 clients, 100k reads

pgbench -f /home/gpadmin/read.sql -c 64 -C -j 8 -t 1563 pgbenchdb
Result:

transaction type: /home/gpadmin/read.sql
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 8
number of transactions per client: 1563
number of transactions actually processed: 100032/100032
latency average = 4188.179 ms
tps = 15.281105 (including connections establishing)
tps = 15.371656 (excluding connections establishing)
Read performance is far too poor to be worth testing further. (A likely contributor: random() is volatile and is re-evaluated for every row, so the WHERE clause cannot use an index and every query scans the whole table; pinning the value with a pgbench script variable, e.g. \set v random(0, 499999) then WHERE sensorid = :v, would give a more meaningful read test.)

2.2.3 Update test
3 Optimization
3.1 Write optimization
1. copy in

A Java demo that uses CopyManager to import and export Greenplum data: "Greenplum copy in Java utility class" (CarsonBigData's CSDN blog)
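The same idea as the linked CopyManager demo, sketched in Python: stream rows through COPY ... FROM STDIN instead of issuing per-row INSERTs. The helper names here are our own, and actually executing the COPY would go through a driver call such as psycopg2's cursor.copy_expert (connection details omitted):

```python
import io

def build_copy_sql(table: str, columns: list[str]) -> str:
    # COPY bypasses per-statement parse/plan overhead, which is where
    # batched INSERTs lose most of their time on Greenplum.
    return f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"

def rows_to_csv(rows) -> io.StringIO:
    # Serialize rows into an in-memory CSV buffer for the COPY stream.
    buf = io.StringIO()
    for row in rows:
        buf.write(",".join(str(v) for v in row) + "\n")
    buf.seek(0)
    return buf

# With psycopg2 the buffer would be streamed like:
#   cur.copy_expert(build_copy_sql("public.tbl",
#                                  ["crt_time", "sensorid", "info"]), buf)
```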

2. gpfdist

"Greenplum: Spark script development with the Greenplum-Spark Connector, and the pitfalls along the way" (CarsonBigData's CSDN blog)

3.2 Update optimization
Approach: merge the incremental data and the target table's data into a temporary table, then overwrite the target table.

-- insert/update rows from the business database: tmp_incr_data

-- today's incoming new rows: tmp_update_data

-- rows in the target table that need no update: tmp_not_update_data

-- target table: dest_table, partitioned by day

-- primary key column: ID

1. Extract the day's not-updated rows into a temporary table:

CREATE TABLE tmp_not_update_data AS
SELECT a.*
FROM dest_table a
LEFT JOIN tmp_incr_data b
ON a.id = b.id
WHERE b.id IS NULL;
2. Truncate the day's partition in the target table (assuming dest_table is range-partitioned by day; Greenplum's partition-for-value form):

ALTER TABLE dest_table TRUNCATE PARTITION FOR ('2022-07-28');
3. Insert the day's not-updated rows back into the target table:

INSERT INTO dest_table
SELECT * FROM tmp_not_update_data;
4. Insert the day's updated rows into the target table:

INSERT INTO dest_table
SELECT * FROM tmp_update_data;
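The four steps implement a left-anti-join merge: keep the target rows whose id is absent from the increment, then append the incoming versions. A minimal Python simulation of that set logic (dicts keyed by id stand in for the tables; illustrative only):

```python
def merge_partition(dest_rows: dict, incr_rows: dict) -> dict:
    # Step 1: target rows NOT touched by the increment
    # (LEFT JOIN ... WHERE b.id IS NULL).
    not_updated = {k: v for k, v in dest_rows.items() if k not in incr_rows}
    # Steps 2-4: truncate the partition, re-insert the untouched rows,
    # then insert the incoming (new + updated) rows.
    return {**not_updated, **incr_rows}

dest = {1: "old-1", 2: "old-2", 3: "old-3"}
incr = {2: "new-2", 4: "new-4"}
print(merge_partition(dest, incr))
# id 2 is replaced, id 4 is new, ids 1 and 3 survive unchanged
```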
To be continued...

Source: https://blog.csdn.net/Carson073/article/details/125929390
