介绍
repmgr(Replication Manager)是一个开源工具,用于管理 PostgreSQL 集群的复制和故障转移。在本教程中,我们将学习如何设置和配置集群,以实现自动故障转移。
前提条件
必须在主服务器和备用服务器上安装以下软件:
• PostgreSQL
• repmgr(与已安装的 PostgreSQL 主要版本匹配)
• 在网络层面,与 PostgreSQL 端口(默认值:5432)建立的连接,必须能够双向通信。
安装 PostgreSQL
使用 PostgreSQL 安装创建两个集群/服务器。您可以按照下面链接中的 PostgreSQL 说明,使用 PostgreSQL 的 PGDG 仓库软件包进行安装。为了命名约定,我们将主服务器和备用服务器视为两个服务器。
https://wiki.postgresql.org/wiki/YUM_Installation
yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum -y install epel-release yum-utils
yum-config-manager --enable pgdg12
yum install postgresql12-server postgresql12
/usr/pgsql-12/bin/postgresql-12-setup initdb
注意:在备用服务器上不需要执行上述集群初始化步骤。
systemctl enable --now postgresql-12
systemctl status postgresql-12
安装 repmgr
您需要在主服务器和备用服务器上安装 repmgr。
yum -y install repmgr12*
配置 PostgreSQL
在主服务器上,必须初始化 PostgreSQL 实例,并且已经启动运行。可能需要调整以下复制相关设置:
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica' or 'logical'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
创建用户
创建一个专用的 PostgreSQL 超级用户帐户,和一个用于 repmgr 元数据的数据库:
create user repmgr;
create database repmgr with owner repmgr;
配置 pg_hba.conf
确保 repmgr 用户在 pg_hba.conf 中具有适当的权限,并且可以在复制模式下进行连接;pg_hba.conf 应包含类似于以下内容的条目:
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 192.168.1.0/24 trust
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 192.168.1.0/24 trust
注意:请根据您的网络配置,调整上述设置。
配置 repmgr 文件
在主服务器上创建包含以下条目的 repmgr.conf:
cluster='failovertest'
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data/'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
注意:我们用于 repmgr 的用户,必须可以访问其中的路径和文件。
注册主服务器
用 repmgr 注册主服务器:
$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register
WARNING: the following problems were found in the configuration file:
parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
然后检查集群的状态:
$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
parameter "cluster" is deprecated and will be ignored
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
构建/克隆备用服务器
在备用服务器上创建 repmgr.conf 文件:
$ cat repmgr.conf
node_id=2
node_name=node2
conninfo='host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
注意:在上述命令中,对于主机 IP 信息,我们需要指定备用服务器的 IP。在此示例中,172.16.140.137 是对应的备用服务器。
现在,我们可以执行试运行,并测试我们的配置是否正确:
$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met
如果没有问题,可以开始克隆:
$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/var/lib/pgsql/12/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
pg_basebackup -l "repmgr base backup" -D /var/lib/pgsql/12/data -h 172.16.140.135 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"
注册备用服务器
用 repmgr 注册备用服务器:
$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register
INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered
$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 1 | host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2
启动 repmgrd 守护进程
为了启用自动故障转移,我们现在需要在主从和见证服务器上,启动 repmgrd 守护进程:
例如:
$ /usr/pgsql-12/bin/repmgrd -f /var/lib/pgsql/repmgr.conf
[2020-02-23 20:44:43] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2020-02-23 20:44:43] [INFO] connecting to database "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2020-02-23 20:44:43] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] "connection_check_type" set to "ping"
[2020-02-23 20:44:43] [NOTICE] monitoring cluster primary "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] child node "node2" (ID: 2) is attached
我们还可以检查集群的事件:
$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+--------------------+----+---------------------+----------------------------------------------------------------------------------------
2 | node2 | repmgrd_start | t | 2020-02-23 20:46:26 | monitoring connection to upstream node "node1" (ID: 1)
1 | node1 | repmgrd_start | t | 2020-02-23 20:44:43 | monitoring cluster primary "node1" (ID: 1)
2 | node2 | standby_register | t | 2020-02-23 20:39:24 | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)
1 | node1 | primary_register | t | 2020-02-23 20:39:11 | existing primary record updated
2 | node2 | standby_register | t | 2020-02-23 20:38:25 | standby registration succeeded; upstream node ID is 1
2 | node2 | standby_unregister | t | 2020-02-23 20:37:56 |
2 | node2 | standby_register | t | 2020-02-23 20:12:23 | standby registration succeeded; upstream node ID is 1
2 | node2 | standby_clone | t | 2020-02-23 20:09:25 | cloned from host "172.16.140.135", port 5432; backup method: pg_basebackup; --force: N
1 | node1 | primary_register | t | 2020-02-23 19:57:11 |
1 | node1 | cluster_created | t | 2020-02-23 19:57:11 |
现在,如果主服务器发生故障,repmgrd 将检测到无法访问主服务器,然后激活下一个可用的服务器,并执行自动故障转移。
日志消息会如下面所示:
[2020-02-23 20:51:28] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2020-02-23 20:52:40] [WARNING] unable to ping "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-23 20:52:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
.
.
.
[2020-02-23 20:53:30] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-23 20:53:30] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.140.135 fallback_application_name=repmgr"
[2020-02-23 20:53:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-23 20:53:30] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-23 20:53:30] [INFO] 0 active sibling nodes registered
[2020-02-23 20:53:30] [INFO] primary and this node have the same location ("default")
[2020-02-23 20:53:30] [INFO] no other sibling nodes - we win by default
[2020-02-23 20:53:30] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-23 20:53:30] [INFO] promote_command is:
"/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file"
[2020-02-23 20:53:30] [NOTICE] promoting standby to primary
[2020-02-23 20:53:30] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2020-02-23 20:53:30] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-23 20:53:31] [NOTICE] STANDBY PROMOTE successful
[2020-02-23 20:53:31] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2020-02-23 20:53:31] [INFO] 0 followers to notify
[2020-02-23 20:53:31] [INFO] switching to primary monitoring mode
[2020-02-23 20:53:31] [NOTICE] monitoring cluster primary "node2" (ID: 2)