Description of problem:
While performing failover testing, repmgr10.service fails to start after a cfme-db appliance reboot.

Version-Release number of selected component (if applicable):
cfme-rhos-5.11.6.0-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up replication between the Primary Database-Only Appliance and the Standby Database-Only Appliance.
2. For failover testing, reboot the Primary Database-Only Appliance. After the node is up, the service "repmgr10.service" fails to start automatically, and hence failover does not work.
3. After manually starting "repmgr10.service", failover works and the former "Standby Database-Only Appliance" is now the "Primary Database-Only Appliance".
4. Perform failover testing again: repeat the same steps on the current "Primary Database-Only Appliance" node by rebooting it. "repmgr10.service" again fails to start automatically.

Actual results:
[root@movl-cfmedb2 ~]# systemctl status repmgr10.service
● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

Expected results:
The service "repmgr10.service" should start automatically after reboot, and failover should work.

Additional info:
- The "repmgr10.service" unit file has dependencies configured on postgresql-10.service. We observed that "repmgr10.service" starts before "postgresql.service" or at the same time, which results in "repmgr10.service" failing to start.
[root@cfmedb2 ~]# systemctl status postgresql.service repmgr10.service
● postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1547 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql (code=exited, status=0/SUCCESS)
 Main PID: 1553 (postmaster)
    Tasks: 7 (limit: 101365)
   Memory: 87.6M
   CGroup: /system.slice/postgresql.service
           ├─1553 /usr/bin/postmaster -D /var/lib/pgsql/data
           ├─1567 postgres: logger process
           ├─1568 postgres: startup process recovering 0000000300000000000000DA
           ├─1570 postgres: checkpointer process
           ├─1571 postgres: writer process
           ├─1572 postgres: stats collector process
           └─1573 postgres: wal receiver process streaming 0/DA554E48

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting PostgreSQL database server...
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG: listening on IPv4 address "0.0.0.0", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG: listening on IPv6 address "::", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG: redirecting log output to logging collector process
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:HINT: Future log output will appear in directory "log".
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Started PostgreSQL database server.

● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

- We found a workaround, which is drafted in KCS article https://access.redhat.com/solutions/5219591
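
One way to confirm the mismatch described under "Additional info" is to list the unit names that the repmgr10 unit file refers to and compare them with the units that actually exist. The helper below is only a sketch: `unit_refs` is a hypothetical name, it assumes the directives are written as `Key=value` with no spaces around `=`, and it looks at only four directive types. The real contents of the shipped unit file are not reproduced in this report.

```shell
#!/bin/sh
# unit_refs (hypothetical helper): print the unit names referenced by the
# ordering/dependency directives of a systemd unit file, one per line.
# $1: path to the unit file, e.g. /usr/lib/systemd/system/repmgr10.service
unit_refs() {
    grep -E '^(After|Requires|Wants|BindsTo)=' "$1" \
        | cut -d= -f2- \
        | tr ' ' '\n' \
        | grep -v '^$' \
        | sort -u
}
```

On an affected appliance, running `unit_refs /usr/lib/systemd/system/repmgr10.service` would be expected to print postgresql-10.service, while `systemctl list-unit-files 'postgresql*'` shows only postgresql.service. An ordering dependency on a unit that does not exist has no effect, which would be consistent with both services starting at 21:13:50 in the logs above.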
Ah, the bug here is actually that the service name is incorrect in the repmgr systemd service file. This is mentioned in the KB, but not in the BZ.

> The repmgr10.service has dependencies configured in it's unit file to postgresql-10.service whereas name of the postgresql service is postgresql.service.

That change is all that's required to fix this.
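
For appliances that cannot be updated right away, the same correction can be approximated with a systemd drop-in instead of editing the packaged unit file. This is a sketch under stated assumptions, not the KCS workaround (whose text is not reproduced here) and not the change shipped in the fix: the function name `write_repmgr_dropin`, the drop-in file name `10-postgresql-order.conf`, and the choice of `Wants=` are all illustrative.

```shell
#!/bin/sh
# write_repmgr_dropin (hypothetical helper): create a drop-in that orders
# repmgr10.service after the PostgreSQL unit that actually exists.
# $1: systemd configuration root (assumed to be /etc/systemd/system on the appliance)
write_repmgr_dropin() {
    dropin_dir="$1/repmgr10.service.d"
    mkdir -p "$dropin_dir" || return 1
    # After=/Wants= are list settings, so these lines add to the packaged
    # unit's values rather than replacing them.
    cat > "$dropin_dir/10-postgresql-order.conf" <<'EOF'
[Unit]
After=postgresql.service
Wants=postgresql.service
EOF
}
```

Usage would be along the lines of `write_repmgr_dropin /etc/systemd/system && systemctl daemon-reload`, followed by a reboot to check that repmgr10.service now starts after postgresql.service.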
On CFME 5.11.7.3, rebooting an appliance with a configured failover manager (repmgr10) results in the repmgr10 service failing to start. On CFME 5.11.8.0, the same reboot no longer prevents repmgr10 from starting, and the rebooted node is able to take over. Seems to be fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: CloudForms 5.0.8 security, bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4134
RHCFQE-14630 https://github.com/ManageIQ/integration_tests/pull/10327