HashData 数据库故障处理(一)

2022-02-16 00:00:00 集群 数据库 报错 信息 防火墙

项目背景

在混合使用的环境中,特别是测试环境。如果数据库长时间没有使用,再次使用的时候可能会遇到数据库不能使用的情况,导致这个情况出现的原因:

在没有正常关闭 HashData 数据库的情况下进行了集群中某些机器的重启。
服务器时间未做 ntp 时钟同步,时差过大。
修改了防火墙信息。
人为误操作,导致数据库进程不正常关闭。
此文仅面对种情况。

报错信息
在长时间未使用数据库,再次使用数据库的时候,很大概率的情况下会出现某个节点的数据库进程已经非正常关闭了。所以我们可以使用 gpstop + gpstart 来确定一下数据库是否能够正常启动。明显在此处我们的数据库已经不能正常启动了,报错信息如下:

[gpadmin@master ~]$ gpstop -a
20180911:09:36:42:000760 gpstop:master:gpadmin-[INFO]:-Starting gpstop with args: -a
20180911:09:36:42:000760 gpstop:master:gpadmin-[INFO]:-Gathering information and validating the environment...
20180911:09:36:42:000760 gpstop:master:gpadmin-[ERROR]:-gpstop error: postmaster.pid file does not exist. is Greenplum instance already stopped?
[gpadmin@master ~]$ gpstart -a
20180911:09:36:52:000780 gpstart:master:gpadmin-[INFO]:-Starting gpstart with args: -m -a
20180911:09:36:52:000780 gpstart:master:gpadmin-[INFO]:-Gathering information and validating the environment...
20180911:09:36:52:000780 gpstart:master:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.3.0+dev.375.gbb4df85 build dev'
20180911:09:36:53:000780 gpstart:master:gpadmin-[INFO]:-Greenplum Catalog Version: '301705051'
20180911:09:36:53:000780 gpstart:master:gpadmin-[INFO]:-Clearing Master instance lock files
20180911:09:36:53:000780 gpstart:master:gpadmin-[CRITICAL]:-gpstart failed. (Reason='[Errno 1] Operation not permitted: '/tmp/.s.PGSQL.5432'') exiting...
1
2
3
4
5
6
7
8
9
10
11
解决方案
根据报错信息 (Reason='[Errno 1] Operation not permitted: '/tmp/.s.PGSQL.5432'') exiting... 能够明显的看到是 5432 端口被占用了。所以我们看一下是不是存在 postgres 进程。

[root@master ~]# ps aux | grep postgres
postgres 956 0.0 0.0 358184 15260 ? Ss Aug29 0:02 /usr/pgsql-9.6/bin/postmaster -D /var/lib/pgsql/9.6/data/
postgres 1028 0.0 0.0 213184 1712 ? Ss Aug29 0:00 postgres: logger process
postgres 1061 0.0 0.0 358184 1924 ? Ss Aug29 0:00 postgres: checkpointer process
postgres 1062 0.0 0.0 358184 2728 ? Ss Aug29 0:20 postgres: writer process
postgres 1064 0.0 0.0 358184 1700 ? Ss Aug29 0:20 postgres: wal writer process
postgres 1065 0.0 0.0 358588 2776 ? Ss Aug29 0:10 postgres: autovacuum launcher process
postgres 1066 0.0 0.0 213184 1772 ? Ss Aug29 0:08 postgres: stats collector process
root 1249 0.0 0.0 112712 940 pts/0 S+ 09:44 0:00 grep --color=auto postgres
1
2
3
4
5
6
7
8
9
从上面的信息可以看到 master 节点上面存在一个 postgres 数据库。经过确认把 postgres 数据库关闭。要注意使用正确的指令集。

[root@master ~]# find / -name pg_ctl
/opt/gpsql-v2.x-centos7-opt-1262/bin/pg_ctl
/usr/pgsql-9.6/bin/pg_ctl
[root@master ~]# su - postgres
Last login: Tue Sep 11 09:43:05 CST 2018 on pts/0
-bash-4.2$ echo $PGDATA
/var/lib/pgsql/9.6/data
-bash-4.2$ export PATH=/usr/pgsql-9.6/bin:$PATH
-bash-4.2$ pg_ctl stop -D $PGDATA -m immediate
waiting for server to shut down.... done
server stopped
-bash-4.2$ exit
1
2
3
4
5
6
7
8
9
10
11
12
至此我们的环境就已经处理好了。可以使用 gpstart -a 来试一把启动 HashData 数据库集群了。

[gpadmin@master ~]$ gpstart -a
20180911:09:48:35:001756 gpstart:master:gpadmin-[INFO]:-Starting gpstart with args: -a
20180911:09:48:35:001756 gpstart:master:gpadmin-[INFO]:-Gathering information and validating the environment...
20180911:09:48:36:001756 gpstart:master:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.3.0+dev.375.gbb4df85 build dev'
20180911:09:48:36:001756 gpstart:master:gpadmin-[INFO]:-Greenplum Catalog Version: '301705051'
20180911:09:48:37:001756 gpstart:master:gpadmin-[INFO]:-Starting Master instance in admin mode
20180911:09:48:39:001756 gpstart:master:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20180911:09:48:39:001756 gpstart:master:gpadmin-[INFO]:-Obtaining Segment details from master...
20180911:09:48:40:001756 gpstart:master:gpadmin-[INFO]:-Setting new master era
20180911:09:48:40:001756 gpstart:master:gpadmin-[INFO]:-Master Started...
20180911:09:48:41:001756 gpstart:master:gpadmin-[INFO]:-Heap checksum setting is consistent across the cluster
20180911:09:48:41:001756 gpstart:master:gpadmin-[INFO]:-Shutting down master
20180911:09:48:46:001756 gpstart:master:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
............
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-Process results...
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:- Successful segment starts = 7
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:- Failed segment starts = 0
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-Successfully started 7 of 7 segment instances
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:48:58:001756 gpstart:master:gpadmin-[INFO]:-Starting Master instance master directory /data0/hashdata/master/gpseg-1
20180911:09:49:00:001756 gpstart:master:gpadmin-[INFO]:-Command pg_ctl reports Master master instance active
20180911:09:49:00:001756 gpstart:master:gpadmin-[INFO]:-No standby master configured. skipping...
20180911:09:49:00:001756 gpstart:master:gpadmin-[INFO]:-Database successfully started
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
果断启动成功。

至此并不算是结束,我们要知道数据库是否正常,并不是他启动了就OK了。我们要登进去创建一个表,插入几条数据,查询一下,然后删除数据。这样才能基本上确认数据库能够正常使用。当然在 HashData 2.X 的版本中,由于存在对象存储的功能,所以我们要进行 AO 表的测试。另外还要检测一下读写对象存储的进程是否正常,在此这一部分就不做赘述了。

下面我们要登录进去做测试了。

报错信息
果然验证还是有意义的。报错信息如下:

[gpadmin@master ~]$ psql -d postgres
psql: FATAL: DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1513)
1
2
gpstate 看一下没有什么问题啊

20180911:09:49:36:001945 gpstate:master:gpadmin-[INFO]:-Starting gpstate with args: -a
20180911:09:49:37:001945 gpstate:master:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.3.0+dev.375.gbb4df85 build dev'
20180911:09:49:37:001945 gpstate:master:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.3.0+dev.375.gbb4df85 build dev) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit compiled on May 14 2018 18:33:09'
20180911:09:49:38:001945 gpstate:master:gpadmin-[INFO]:-Obtaining Segment details from master...
20180911:09:49:38:001945 gpstate:master:gpadmin-[INFO]:-Gathering data from segments...
....
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-Greenplum instance status summary
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Master instance = Active
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Master standby = No master standby configured
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total segment instance count from metadata = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Primary Segment Status
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total primary segments = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total primary segment valid (at master) = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total primary segment failures (at master) = 0
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of postmaster.pid files missing = 0
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of postmaster.pid files found = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of postmaster.pid PIDs missing = 0
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of postmaster.pid PIDs found = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of /tmp lock files missing = 0
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number of /tmp lock files found = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number postmaster processes missing = 0
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Total number postmaster processes found = 7
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Mirror Segment Status
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:- Mirrors not configured on this array
20180911:09:49:42:001945 gpstate:master:gpadmin-[INFO]:-----------------------------------------------------

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
此处是经验 ^_^
检查集群中机器的防火墙:

[gpadmin@master ~]$ systemctl status firewalld.service
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
[gpadmin@master ~]$ service iptables status
Redirecting to /bin/systemctl status iptables.service
● iptables.service - IPv4 firewall with iptables
Loaded: loaded (/usr/lib/systemd/system/iptables.service; disabled; vendor preset: disabled)
Active: active (exited) since Thu 2018-08-30 09:06:51 CST; 1 weeks 5 days ago
Main PID: 12270 (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CGroup: /system.slice/iptables.service
1
2
3
4
5
6
7
8
9
10
11
12
13
14
很 nice 集群开启了 iptables 防火墙。在很多环境中可能在作部署安装的过程中没有开机禁用防火墙,导致在服务器机器重启的情况下,数据库集群不能使用。我们把防火墙关闭一下。

[root@master ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@master ~]# systemctl disable iptables.service
[root@master ~]# source /opt/gpsql/greenplum_path.sh
[root@master ~]# systemctl stop iptables.service
[root@master ~]# gpssh -f hostfile_seg
=> systemctl stop iptables.service
[segment2]
[segment5]
[segment3]
[segment7]
[segment1]
[segment6]
[segment4]
=> systemctl disable iptables.service
[segment2]
[segment5]
[segment3]
[segment7]
[segment1]
[segment6]
[segment4]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
然后在进行一些上面提到的建表,插入,查询,删除。此处就略去了。

总结
对于 HashData 数据库来说,在可控情况下,不要在集群未关闭的情况下,进行集群中机器的重启操作。也尽量不要开启防火墙。
————————————————
版权声明:本文为CSDN博主「GaryZhang000」的原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接及本声明。
原文链接:https://blog.csdn.net/zhang50303/article/details/82657217

相关文章