【DB宝91】PG高可用之主从流复制+keepalived 的高可用

vlambda
2022-02-23

【DB宝91】PG高可用之主从流复制+keepalived 的高可用

简介

常见的高可用架构：

pg + keepalived：https://www.xmmup.com/pggaokeyongzhizhucongliufuzhikeepalived-degaokeyong.html
pg + pgpool：pgpool-II和pgpoolAdmin的使用
pg + repmgr：PostgreSQL高可用之repmgr（1主2从+1witness）+Pgpool-II实现主从切换+读写分离+负载均衡
pg + Patroni + etcd：PG高可用集群之Patroni + etcd + HAProxy + keepalived + Prometheus + Grafana监控部署

通过keepalived 来实现 PostgreSQL 数据库的主从自动切换，以达到高可用。当主节点宕机时，从节点可自动切换为主节点，继续对外提供服务。

在这一方案中Keepalived程序主要用来探测PostgreSQL主库是否存活，如果Keepalived主节点或主库故障，Keepalived备节点将接管VIP 并日激活流复制备库．从而实现高可用。

环境架构

IP地址	操作系统	主机名	角色	端口	说明
172.72.6.6	CentOS 7.6	lhrpg66	主库	5433	安装postgesql 13.3 + keepalived v1.3.5
172.72.6.7	CentOS 7.6	lhrpg67	从库	5433	安装postgesql 13.3 + keepalived v1.3.5
172.72.6.8			VIP		在pg66和pg67之间进行漂移

配置主从流复制

参考：PG12高可用之1主2从流复制环境搭建及切换测试

主机环境准备

 1-- 创建PG高可用环境专用网络
 2docker network create --subnet=172.72.6.0/24 pg-network
 3
 4-- 申请主机
 5docker rm -f lhrpg66
 6docker run -d --name lhrpg66 -h lhrpg66 \
 7  -p 64306:5433 --net=pg-network --ip 172.72.6.6 \
 8  -v /sys/fs/cgroup:/sys/fs/cgroup \
 9  --privileged=true lhrbest/lhrpgall:2.0 \
10  /usr/sbin/init
11
12docker rm -f lhrpg67
13docker run -d --name lhrpg67 -h lhrpg67 \
14  -p 64307:5433 --net=pg-network --ip 172.72.6.7 \
15  -v /sys/fs/cgroup:/sys/fs/cgroup \
16  --privileged=true lhrbest/lhrpgall:2.0 \
17  /usr/sbin/init
18
19
20[root@docker35 ~]# docker ps
21CONTAINER ID   IMAGE                  COMMAND            CREATED          STATUS          PORTS                                         NAMES
22a320c1882201   lhrbest/lhrpgall:2.0   "/usr/sbin/init"   23 seconds ago   Up 21 seconds   0.0.0.0:64307->5433/tcp, :::64307->5433/tcp   lhrpg67
23e9c67922b0e8   lhrbest/lhrpgall:2.0   "/usr/sbin/init"   26 seconds ago   Up 24 seconds   0.0.0.0:64306->5433/tcp, :::64306->5433/tcp   lhrpg66

注意：该容器已安装PostgreSQL 13.3，故只需要配置主从即可，安装用户为pg13。

需要关闭其它版本的pg：

systemctl stop pg11
systemctl stop pg12
systemctl stop pg94
systemctl stop pg96
systemctl stop postgresql-13.service

systemctl disable pg11
systemctl disable pg12
systemctl disable pg94
systemctl disable pg96
systemctl disable postgresql-13.service

主库放开防火墙

1cat  << EOF > /pg13/pgdata/pg_hba.conf
2# TYPE  DATABASE    USER    ADDRESS       METHOD
3local     all       all                    trust
4host      all       all   127.0.0.1/32     trust
5host      all       all    0.0.0.0/0        md5
6host   replication  all    0.0.0.0/0        md5
7EOF

👉 注意添加replication

主库创建复制用户

1su - pg13
2psql
3create user replhr password 'lhr' replication;

👉 创建用户需要加上replication选项。

在从库对主库进行备份

1mkdir /bk
2chown pg13.postgres /bk
3
4su - pg13
5pg_basebackup -h 172.72.6.6 -p 5433 -U replhr -l bk20220210 -F p -P -R -D /bk

执行完成后，会产生文件standby.signal，如下：

 1[pg13@lhrpg67 ~]$ pg_basebackup -h 172.72.6.6 -p 5433 -U replhr -l bk20220210 -F p -P -R -D /bk
 2Password: 
 3WARNING:  skipping special file "./.s.PGSQL.5433"
 4WARNING:  skipping special file "./.s.PGSQL.5433"
 523411/23411 kB (100%), 1/1 tablespace
 6[pg13@lhrpg67 ~]$ ll /bk
 7total 260
 8-rw------- 1 pg13 postgres    209 Feb 10 16:27 backup_label
 9-rw------- 1 pg13 postgres 135710 Feb 10 16:27 backup_manifest
10drwx------ 5 pg13 postgres   4096 Feb 10 16:27 base
11-rw------- 1 pg13 postgres     33 Feb 10 16:27 current_logfiles
12drwx------ 2 pg13 postgres   4096 Feb 10 16:27 global
13drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_commit_ts
14drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_dynshmem
15-rw------- 1 pg13 postgres    243 Feb 10 16:27 pg_hba.conf
16-rw------- 1 pg13 postgres   1636 Feb 10 16:27 pg_ident.conf
17drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_log
18drwx------ 4 pg13 postgres   4096 Feb 10 16:27 pg_logical
19drwx------ 4 pg13 postgres   4096 Feb 10 16:27 pg_multixact
20drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_notify
21drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_replslot
22drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_serial
23drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_snapshots
24drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_stat
25drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_stat_tmp
26drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_subtrans
27drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_tblspc
28drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_twophase
29-rw------- 1 pg13 postgres      3 Feb 10 16:27 PG_VERSION
30drwx------ 3 pg13 postgres   4096 Feb 10 16:27 pg_wal
31drwx------ 2 pg13 postgres   4096 Feb 10 16:27 pg_xact
32-rw------- 1 pg13 postgres    314 Feb 10 16:27 postgresql.auto.conf
33-rw------- 1 pg13 postgres  28184 Feb 10 16:27 postgresql.conf
34-rw------- 1 pg13 postgres      0 Feb 10 16:27 standby.signal

👉 在PG12之前，-R备份结束之后会自动生成recovery.conf文件，用来做流复制判断主从同步的信息。但是从PG12开始，这个文件已经不需要了。只需要在参数文件postgresql.conf中配置primary_conninfo参数即可。

还原从库

1-- 关闭从库，删除从库的数据文件，并且将备份文件覆盖从库的数据文件
2pg_ctl stop
3cp -r /bk/* /pg13/pgdata/
4

修改从库primary_conninfo参数

1cat >> /pg13/pgdata/postgresql.conf <<"EOF"
2
3primary_conninfo = 'host=172.72.6.6 port=5433 user=replhr password=lhr'
4
5EOF

启动从库

1pg_ctl start

主库进程：

 1[root@lhrpg66 /]# ps -ef|grep pg13
 2pg13      1053     0  0 16:16 ?        00:00:00 /pg13/pg13/bin/postgres -D /pg13/pgdata -p 5433
 3pg13      1054  1053  0 16:16 ?        00:00:00 postgres: logger 
 4pg13      1056  1053  0 16:16 ?        00:00:00 postgres: checkpointer 
 5pg13      1057  1053  0 16:16 ?        00:00:00 postgres: background writer 
 6pg13      1058  1053  0 16:16 ?        00:00:00 postgres: walwriter 
 7pg13      1059  1053  0 16:16 ?        00:00:00 postgres: autovacuum launcher 
 8pg13      1060  1053  0 16:16 ?        00:00:00 postgres: stats collector 
 9pg13      1061  1053  0 16:16 ?        00:00:00 postgres: logical replication launcher 
10pg13      1827  1053  0 16:29 ?        00:00:00 postgres: walsender replhr 172.72.6.7(51662) streaming 0/3000148
11root      2024   551  0 16:32 pts/0    00:00:00 grep --color=auto pg13

从库进程：

1[root@lhrpg67 /]# ps -ef|grep pg13
2pg13      1900     0  0 16:29 ?        00:00:00 /pg13/pg13/bin/postgres
3pg13      1901  1900  0 16:29 ?        00:00:00 postgres: logger 
4pg13      1902  1900  0 16:29 ?        00:00:00 postgres: startup recovering 000000010000000000000003
5pg13      1903  1900  0 16:29 ?        00:00:00 postgres: checkpointer 
6pg13      1904  1900  0 16:29 ?        00:00:00 postgres: background writer
7pg13      1905  1900  0 16:29 ?        00:00:00 postgres: stats collector
8pg13      1906  1900  0 16:29 ?        00:00:00 postgres: walreceiver streaming 0/3000148
9root      2186   540  0 16:32 pts/0    00:00:00 grep --color=auto pg13

查询复制状态

 1-- 主库查看wal日志发送状态
 2select * from pg_stat_replication;
 3
 4-- 从库查看wal日志接收状态
 5select * from pg_stat_wal_receiver;
 6
 7
 8-- 也可以通过该名称查看
 9pg_controldata  | grep state
10
11-- 也可以查看这个，主库是f代表false ；备库是t，代表true
12select pg_is_in_recovery();

主库查询复制状态：

 1C:\Users\lhrxxt>psql -U postgres -h 192.168.66.35 -p 64306
 2Password for user postgres:
 3psql (14.0, server 13.3)
 4Type "help" for help.
 5
 6postgres=# select * from pg_stat_replication;
 7 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
 8-----+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 9 623 |    16430 | replhr  | walreceiver      | 172.72.6.7  |                 |       51676 | 2022-02-10 16:37:28.351635+08 |              | streaming | 0/3000060 | 0/3000060 | 0/3000060 | 0/3000060  |           |           |            |             0 | async      | 2022-02-10 16:38:58.583056+08
10(1 row)
11
12
13postgres=# \x
14Expanded display is on.
15postgres=# select * from pg_stat_replication;
16-[ RECORD 1 ]----+------------------------------
17pid              | 623
18usesysid         | 16430
19usename          | replhr
20application_name | walreceiver
21client_addr      | 172.72.6.7
22client_hostname  |
23client_port      | 51676
24backend_start    | 2022-02-10 16:37:28.351635+08
25backend_xmin     |
26state            | streaming
27sent_lsn         | 0/3000060
28write_lsn        | 0/3000060
29flush_lsn        | 0/3000060
30replay_lsn       | 0/3000060
31write_lag        |
32flush_lag        |
33replay_lag       |
34sync_priority    | 0
35sync_state       | async
36reply_time       | 2022-02-10 16:39:18.625797+08
37

pg_stat_replication是一个视图，主要用于监控PG流复制情况。在这个系统视图中每个记录只代表一个slave。因此，可以看到谁处于连接状态，在做什么任务。pg_stat_replication也是检查slave是否处于连接状态的一个好方法。

每个字段代码的含义：

• pid: 这代表负责流连接的wal_sender进程的进程ID。例如“postgres: walsender replhr 172.72.6.3(40056) streaming 0/7000148”。

• usesysid: 每个内部用户都有一个独一无二的编号。该系统的工作原理很像UNIX。usesysid 是 (PostgreSQL) 用户连接到系统的唯一标识符。

• usename: (不是用户名, 注意少了 r)，它存储与用户相关的 usesysid 的名字。这是客户端放入到连接字符串中的东西。

• application_name:这是同步复制的通常设置。它可以通过连接字符串传递到master。

• client_hostname: 除了客户端的IP，您还可以这样做，通过它的主机名来标识客户端。您可以通过master上的postgresql.conf中的log_hostname启用DNS反向查找。

• client_port: 这是客户端用来和WALsender进行通信使用的TPC端口号。如果不本地UNIX套接字被使用了将显示-1。

• backend_start: 它告诉我们slave什么时间创建了流连接。

• state: 此列告诉我们数据的连接状态。如果事情按计划进行，它应该包含流信息。

• sent_lsn:这代表发送到连接的最后的事务日志的位置。已经通过网络发送了多少WAL？

• write_lsn: 这是写到standby系统磁盘上最后的事务日志位置。已向操作系统发送了多少WAL？（尚未 flushing）

• flush_lsn: 这是被刷新到standby系统的最后位置。（这里注意写和刷新之间的区别。写并不意味着刷新。）已经有多少WAL已 flush 到磁盘？

• replay_lsn: 这是slave上重放的最后的事务日志位置。已重放了多少WAL，因此对查询可见？

• sync_priority: 这个字段是唯一和同步复制相关的。每次同步复制将会选择一个优先权 —sync_priority—会告诉您选择了那个优先权。

• sync_state: 最后您会看到slave在哪个状态。这个状态可以是async, sync, or potential。当有一个带有较高优先权的同步slave时，PostgreSQL会把slave 标记为 potential。

人们经常说 pg_stat_replication 视图是primary 端的，这是不对的。该视图的作用是揭示有关wal sender 进程的信息。换句话说：如果你正在运行级联复制，该视图意味着在 secondary 复制到其他slaves 的时候， secondary 端的 pg_stat_replication 上的也会显示entries ( 条目 )

从库查询wal日志接收状态：

 1C:\Users\lhrxxt>psql -U postgres -h 192.168.66.35 -p 64307
 2Password for user postgres:
 3psql (14.0, server 13.3)
 4Type "help" for help.
 5
 6postgres=# select * from pg_stat_wal_receiver;
 7 pid |  status   | receive_start_lsn | receive_start_tli | written_lsn | flushed_lsn | received_tli |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time        | slot_name | sender_host | sender_port |                                                                                                                                  conninfo
 8-----+-----------+-------------------+-------------------+-------------+-------------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------+-----------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 9 614 | streaming | 0/3000000         |                 1 | 0/3000060   | 0/3000060   |            1 | 2022-02-10 16:39:28.647917+08 | 2022-02-10 16:39:28.648022+08 | 0/3000060      | 2022-02-10 16:37:28.355389+08 |           | 172.72.6.6  |        5433 | user=replhr password=******** channel_binding=disable dbname=replication host=172.72.6.6 port=5433 fallback_application_name=walreceiver sslmode=disable sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres target_session_attrs=any
10(1 row)
11
12
13postgres=# \x
14Expanded display is on.
15postgres=# select * from pg_stat_wal_receiver;
16-[ RECORD 1 ]---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
17pid                   | 614
18status                | streaming
19receive_start_lsn     | 0/3000000
20receive_start_tli     | 1
21written_lsn           | 0/3000060
22flushed_lsn           | 0/3000060
23received_tli          | 1
24last_msg_send_time    | 2022-02-10 16:39:28.647917+08
25last_msg_receipt_time | 2022-02-10 16:39:28.648022+08
26latest_end_lsn        | 0/3000060
27latest_end_time       | 2022-02-10 16:37:28.355389+08
28slot_name             |
29sender_host           | 172.72.6.6
30sender_port           | 5433
31conninfo              | user=replhr password=******** channel_binding=disable dbname=replication host=172.72.6.6 port=5433 fallback_application_name=walreceiver sslmode=disable sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres target_session_attrs=any
32
33
34postgres=#
35

PostgreSQL数据库配置

在主库创建表sr_delay，后续Keepalived每探测一次会刷新这张表的last_alive字段为当前探测时间，这张表用来判断主备延迟，数据库故障切换时会用到这张表。

直接在主库执行，备库会自动同步：

1create table sr_delay(id int4, last_alive timestamp(0) without time zone);
2INSERT INTO sr_delay VALUES(1,now()) ;
3
4postgres=> select * from sr_delay;
5 id |     last_alive
6----+---------------------
7  1 | 2022-02-11 09:02:52
8(1 row)

配置keepalived

安装

1yum install -y keepalived

配置keepalived

以下所有脚本在主备库都创建：

keepalived.conf

 1cat > /etc/keepalived/keepalived.conf <<"EOF"
 2! Configuration File for keepalived
 3global_defs {
 4   router_id lhrpg
 5}
 6
 7vrrp_script check_pg_alived {
 8   script "/etc/keepalived/check_pg.sh"
 9   interval 10
10   fall 3
11}
12
13vrrp_instance VI_1 {
14    state BACKUP
15    nopreempt
16    interface eth0
17    virtual_router_id 10
18    priority 100
19    advert_int 1
20    authentication {
21        auth_type PASS
22        auth_pass lhr
23    }
24
25    track_script {
26        check_pg_alived
27    }
28
29    virtual_ipaddress {
30        172.72.6.8/32 dev eth0 label eth0:1
31    }
32
33   smtp_alert
34      notify_master /etc/keepalived/failover.sh
35      # notify_fault /etc/keepalived/fault.sh
36}
37
38EOF

以上是Keepalived主节点的配置，Keepalived备节点的priority参数改成90，其余参数配置一样。

check_pg.sh

对主从PG 状态进行监控，监控脚本 check_pg.sh：

 1cat > /etc/keepalived/check_pg.sh <<"EOF"
 2#!/bin/bash
 3export PGDATABASE=postgres
 4export PGPORT=5433
 5export PGUSER=postgres
 6export PGHOME=/pg13/pg13
 7export PATH=$PGHOME/bin:$PATH:.
 8
 9PGIP=127.0.0.1
10LOGFILE=/etc/keepalived/pg_keepalived.log
11
12#pg_port_status=`lsof -i :$PGPORT | grep LISTEN | wc -l`
13#pg_port_status=`ps -ef | grep LISTEN | wc -l`
14SQL1='SELECT pg_is_in_recovery from pg_is_in_recovery();'
15SQL2='update sr_delay set last_alive = now() where id =1;'
16SQL3='SELECT 1;'
17
18
19#此脚本不检查备库存活状态，如果是备库则退出
20db_role=`echo $SQL1  | $PGHOME/bin/psql -h $PGIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w`
21#if [ $pg_port_status -lt 1 ];then
22#    echo -e `date +"%F %T"` 'Error: The postgreSQL is not running,please check the postgreSQL server status!' >> $LOGFILE
23#    exit 1
24#fi
25
26if [ $db_role == 't' ]; then
27   echo -e `date +"%F %T"` 'Attention: the current database is standby DB!' >> $LOGFILE
28   exit 0
29fi
30
31
32# 判断主库是否可用,主库更新状态
33echo $SQL3 | $PGHOME/bin/psql -h $PGIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
34
35if [ $? -eq 0 ]; then
36   echo $SQL2 | $PGHOME/bin/psql -h $PGIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
37   echo -e `date +"%F %T"` 'Success: update the master sr_delay successed!' >> $LOGFILE
38   exit 0
39else
40   echo -e `date +"%F %T"` 'Error: Is the server is running?' >> $LOGFILE
41   exit 1
42fi
43
44
45EOF
46

此脚本每隔10秒执行一次，执行频率由keepalived.conf配置文件中interval参数设置,脚本主要作用为:
① 检测主库是否存活。
② 更新sr_delay表last_alive字段为当前探测时间。

③ 若主库不可用，则应该关闭主库的keepalived服务。

failover.sh

主库挂掉后，keepalived调用执行切换脚本 failover.sh进行主备切换，Keepalived备节点激活成主节点后触发notify_master参数定义的/etc/keepalived/scripts/failover.sh脚本。

 1cat > /etc/keepalived/failover.sh <<"EOF"
 2#!/bin/bash
 3export PGPORT=5433
 4export PGUSER=postgres
 5export PG_OS_USER=pg13
 6export PGDATA=/pg13/pgdata
 7export PGDBNAME=postgres
 8export PGHOME=/pg13/pg13
 9export PATH=$PGHOME/bin:$PATH:.
10
11PGIP=127.0.0.1
12LOGFILE=/etc/keepalived/pg_keepalived.log
13
14# 主备数据库同步时延，单位为秒，这里设置延迟5分钟
15sr_allowed_delay_time=300
16
17SQL1='select pg_is_in_recovery  from pg_is_in_recovery();'
18SQL2="select last_alive as delay_time from sr_delay where now()- last_alive < interval '$sr_allowed_delay_time';"
19
20db_role=`echo $SQL1 | $PGHOME/bin/psql -h $PGIP  -p $PGPORT -U $PGUSER -d $PGDBNAME -At -w`
21db_sr_delaytime=`echo $SQL2 | $PGHOME/bin/psql -h $PGIP -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w`
22
23SWITCH_COMMAND='pg_ctl promote -D $PGDATA'
24
25# 如果为备库，且延迟在指定时间范围内则切换为主库
26if [ $db_role == f ]; then
27   echo -e `date +"%F %T"` 'Attention: The current postgreSQL DB is master database,cannot switched!' >> $LOGFILE
28   exit 0
29elif [[ $db_role == t ]] && [[ $db_sr_delaytime ]]; then
30   echo -e `date +"%F %T"` 'Attention: The current database is statndby, ready to switch master database!' >> $LOGFILE
31   su - $PG_OS_USER -c "$SWITCH_COMMAND"
32   db_role=`echo $SQL1 | $PGHOME/bin/psql -h $PGIP  -p $PGPORT -U $PGUSER -d $PGDBNAME -At -w`
33
34   if [ $db_role == f ]; then
35        echo -e `date +"%F %T"` 'success: The current standby database successed to switched the primary PG database !' >> $LOGFILE
36        exit 0
37   else
38        echo -e `date +"%F %T"` 'Error: the standby database failed to switch the primary PG database ! Pelease checked it!' >> $LOGFILE
39        exit 1
40   fi
41fi
42
43EOF

当异步流复制主库故障时，流复制的备库延迟时间在指定范围内才进行主备切换，如果备库延迟时间超出指定范围则不进行主备切换。

赋权

1chmod +x /etc/keepalived/*.sh

启动keepalived

 1systemctl start keepalived
 2systemctl enable keepalived
 3systemctl status keepalived
 4
 5
 6# keepalived启动报错：IPVS: Can't initialize ipvs: Protocol not available
 7lsmod | grep ip_vs
 8modprobe ip_vs
 9modprobe ip_vs_wrr
10lsmod | grep ip_vs
11# 如果是容器，那么宿主机也需要加载ip_vs模块。
12
13
14-- 使用VIP连接
15psql  -h 172.72.6.8 -U postgres -p 5433 -c 'SELECT inet_server_addr(),pg_is_in_recovery()'

查看状态：

 1-- 启动之后可以看到VIP 172.72.6.8已经在主节点了
 2[root@lhrpg66 log]# ifconfig
 3eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
 4        inet 172.72.6.6  netmask 255.255.255.0  broadcast 172.72.6.255
 5        ether 02:42:ac:48:06:06  txqueuelen 0  (Ethernet)
 6        RX packets 20149  bytes 25983425 (24.7 MiB)
 7        RX errors 0  dropped 0  overruns 0  frame 0
 8        TX packets 21118  bytes 26526579 (25.2 MiB)
 9        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
10
11eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
12        inet 172.72.6.8  netmask 255.255.255.255  broadcast 0.0.0.0
13        ether 02:42:ac:48:06:06  txqueuelen 0  (Ethernet)
14
15lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
16        inet 127.0.0.1  netmask 255.0.0.0
17        loop  txqueuelen 1000  (Local Loopback)
18        RX packets 16184  bytes 5223332 (4.9 MiB)
19        RX errors 0  dropped 0  overruns 0  frame 0
20        TX packets 16184  bytes 5223332 (4.9 MiB)
21        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
22
23
24-- 主节点日志
25[root@lhrpg66 ~]# tailf /etc/keepalived/log/pg_keepalived.log 
262022-02-11 14:49:28 Success: update the master sr_delay successed!
272022-02-11 14:49:38 Success: update the master sr_delay successed!
282022-02-11 14:49:48 Success: update the master sr_delay successed!
292022-02-11 14:49:58 Success: update the master sr_delay successed!
302022-02-11 14:50:08 Success: update the master sr_delay successed!
312022-02-11 14:50:18 Success: update the master sr_delay successed!
322022-02-11 14:50:28 Success: update the master sr_delay successed!
332022-02-11 14:50:38 Success: update the master sr_delay successed!
34
35-- 备节点日志
36[root@lhrpg67 ~]# tailf /etc/keepalived/log/pg_keepalived.log 
372022-02-11 14:49:44 Attention: the current database is standby DB!
382022-02-11 14:49:54 Attention: the current database is standby DB!
392022-02-11 14:50:04 Attention: the current database is standby DB!
402022-02-11 14:50:14 Attention: the current database is standby DB!
412022-02-11 14:50:24 Attention: the current database is standby DB!
422022-02-11 14:50:34 Attention: the current database is standby DB!
432022-02-11 14:50:44 Attention: the current database is standby DB!
442022-02-11 14:50:54 Attention: the current database is standby DB!
45
46
47-- 主从库的表sr_delay时间更新
48postgres=> select * from sr_delay;
49 id |     last_alive
50----+---------------------
51  1 | 2022-02-11 10:37:01
52(1 row)

验证高可用

1-- 判断是否同步
2select * from sr_delay;
3
4-- 判断主备库
5select * from pg_is_in_recovery();

停止主库的数据库服务

1pg_ctl stop

查看日志：

 1-- 主库
 22022-02-11 15:26:08 Success: update the master sr_delay successed!
 32022-02-11 15:26:18 Success: update the master sr_delay successed!
 42022-02-11 15:26:28 Success: update the master sr_delay successed!
 52022-02-11 15:26:38 Error: Is the server is running?
 62022-02-11 15:26:48 Error: Is the server is running?
 72022-02-11 15:26:58 Error: Is the server is running?
 82022-02-11 15:27:08 Error: Is the server is running?
 9
10-- 备库
112022-02-11 15:26:34 Attention: the current database is standby DB!
122022-02-11 15:26:44 Attention: the current database is standby DB!
132022-02-11 15:26:54 Attention: the current database is standby DB!
142022-02-11 15:27:00 Attention: The current database is statndby, ready to switch master database!
152022-02-11 15:27:00 success: The current standby database successed to switched the primary PG database !
162022-02-11 15:27:04 Success: update the master sr_delay successed!
172022-02-11 15:27:14 Success: update the master sr_delay successed!
182022-02-11 15:27:24 Success: update the master sr_delay successed!

在主库关闭后，检测3次，即30秒之后，发生切换，此时备库切换为主库，VIP也在此时漂移到备库：

 1[root@lhrpg67 /]# ifconfig
 2eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
 3        inet 172.72.6.7  netmask 255.255.255.0  broadcast 172.72.6.255
 4        ether 02:42:ac:48:06:07  txqueuelen 0  (Ethernet)
 5        RX packets 3005  bytes 33264390 (31.7 MiB)
 6        RX errors 0  dropped 0  overruns 0  frame 0
 7        TX packets 2393  bytes 286938 (280.2 KiB)
 8        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
 9
10eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
11        inet 172.72.6.8  netmask 255.255.255.255  broadcast 0.0.0.0
12        ether 02:42:ac:48:06:07  txqueuelen 0  (Ethernet)
13
14lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
15        inet 127.0.0.1  netmask 255.0.0.0
16        loop  txqueuelen 1000  (Local Loopback)
17        RX packets 5797  bytes 1549768 (1.4 MiB)
18        RX errors 0  dropped 0  overruns 0  frame 0
19        TX packets 5797  bytes 1549768 (1.4 MiB)
20        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

此时，可以修复源主库：

1touch /pg13/pgdata/standby.signal
2pg_ctl start

启动后，可以看到源主库作为新备库存在，新主备架构同步正常。

关闭主库的OS

1 docker stop lhrpg67

过程和“停止主库的数据库服务”基本一样，主库又回到了lhrpg66。

关闭主库的keepalived

若关闭主库的keepalived进程，那么，备库会立马进行切换为主库：

12022-02-11 15:44:11 Attention: the current database is standby DB!
22022-02-11 15:44:20 Attention: The current database is statndby, ready to switch master database!
32022-02-11 15:44:21 Attention: the current database is standby DB!
42022-02-11 15:44:21 success: The current standby database successed to switched the primary PG database !
52022-02-11 15:44:31 Success: update the master sr_delay successed!
62022-02-11 15:44:41 Success: update the master sr_delay successed!

关闭备库的数据库或keepalived服务或OS

经过测试，无论是关闭备库的pg数据库，还是关闭备库的keepalived，亦或关闭备库的OS，然后重启相关服务，对这个架构没有影响。

总结

1、挂掉的主库若重启作为主库的话，在启动之前需要配置standby.signal文件，若启动之前没有配置standby.signal文件，而是以主库的角色启动，那么需要重新关闭，然后按照如下步骤修复，否则启动过程会报错“requested timeline 2 is not a child of this server's history”：

1wal_log_hints = 'on'
2pg_rewind --target-pgdata=/pg13/pgdata --source-server='host=172.72.6.6 port=5433 user=postgres dbname=postgres password=lhr'
3touch standby.signal
4修改primary_conninfo

2、总体感受，keepalived可维护性不强，并不建议生产环境使用。

3、keepalived只是解决了高可用中的单点故障问题（故障可以自动切换），并不提供负载均衡、读写分离等特性。

vlambda博客
学习文章列表