Bug 1856470

Summary: repmgr10.service is failing to start on cfme db appliance reboot
Product: Red Hat CloudForms Management Engine
Reporter: Chinmay Paradkar <cparadka>
Component: Replication
Assignee: Nick Carboni <ncarboni>
Status: CLOSED ERRATA
QA Contact: Jaroslav Henner <jhenner>
Severity: medium
Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Priority: high
Version: 5.11.6
CC: dmetzger, jhenner, mshriver, ncarboni, ngupta, obarenbo, simaishi
Target Milestone: GA
Keywords: ZStream
Target Release: 5.11.8
Flags: simaishi: cfme-5.11.z+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: 5.11.8.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-30 14:01:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: Bug
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:

Description Chinmay Paradkar 2020-07-13 17:52:33 UTC
Description of problem:
During failover testing, repmgr10.service fails to start when a cfme-db appliance is rebooted.

Version-Release number of selected component (if applicable):
cfme-rhos-5.11.6.0-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up replication between the Primary Database-Only Appliance and the Standby Database-Only Appliance.

2. For failover testing, reboot the Primary Database-Only Appliance. After the node is up, the service "repmgr10.service" fails to start automatically, so failover does not work.

3. After "repmgr10.service" is started manually, failover works and the former "Standby Database-Only Appliance" becomes the "Primary Database-Only Appliance".

4. Repeat the failover test on the current "Primary Database-Only Appliance" by rebooting that node. "repmgr10.service" again fails to start automatically.

Actual results:
[root@movl-cfmedb2 ~]# systemctl status repmgr10.service
● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

Expected results:
The service "repmgr10.service" should start automatically after reboot, and failover should work.

Additional info:
- The "repmgr10.service" unit file declares its dependency on postgresql-10.service. We observed that "repmgr10.service" starts before "postgresql.service" or at the same time, which results in "repmgr10.service" failing to start.


[root@cfmedb2 ~]# systemctl status postgresql.service repmgr10.service
● postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1547 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql (code=exited, status=0/SUCCESS)
 Main PID: 1553 (postmaster)
    Tasks: 7 (limit: 101365)
   Memory: 87.6M
   CGroup: /system.slice/postgresql.service
           ├─1553 /usr/bin/postmaster -D /var/lib/pgsql/data
           ├─1567 postgres: logger process
           ├─1568 postgres: startup process   recovering 0000000300000000000000DA
           ├─1570 postgres: checkpointer process
           ├─1571 postgres: writer process
           ├─1572 postgres: stats collector process
           └─1573 postgres: wal receiver process   streaming 0/DA554E48

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting PostgreSQL database server...
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on IPv4 address "0.0.0.0", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on IPv6 address "::", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  redirecting log output to logging collector process
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:HINT:  Future log output will appear in directory "log".
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Started PostgreSQL database server.

● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

- We found a workaround which is drafted in KCS article https://access.redhat.com/solutions/5219591
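
A minimal sketch of how the ordering dependency described above could be checked. The helper name after_units is hypothetical; it only parses After= lines from a unit file and does not query systemd itself.

```shell
# Hypothetical helper: print every unit named in the After= lines of a unit file.
# Usage: after_units /usr/lib/systemd/system/repmgr10.service
after_units() {
    # After= may appear several times and may hold several space-separated names.
    sed -n 's/^After=//p' "$1" | tr ' ' '\n' | sed '/^$/d'
}
```

On an affected appliance, running after_units against /usr/lib/systemd/system/repmgr10.service would be expected to list postgresql-10.service, which can then be compared with the unit that is actually installed (postgresql.service, per the status output above). "systemctl show -p After repmgr10.service" gives systemd's own view of the same setting.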

Comment 4 Nick Carboni 2020-07-16 19:23:56 UTC
Ah, the bug here is actually that the PostgreSQL service name is incorrect in the repmgr systemd service file. This is mentioned in the KB, but not in the BZ.

> The repmgr10.service has dependencies configured in it's unit file to postgresql-10.service whereas name of the postgresql service is postgresql.service.

That change is all that's required to fix this.
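
For appliances that cannot be updated yet, the same ordering could in principle be added locally with a systemd drop-in. This is only a sketch under assumptions: the function name and the drop-in file name (10-postgresql-order.conf) are hypothetical, and this is not confirmed to be either the KCS workaround or the change shipped in the fix, which corrects the name in the packaged unit file.

```shell
# Sketch: write a drop-in that orders repmgr10.service after postgresql.service.
# The function name and the drop-in file name are hypothetical choices.
write_repmgr_dropin() {
    # $1: systemd configuration root, normally /etc/systemd/system
    dropin_dir="$1/repmgr10.service.d"
    mkdir -p "$dropin_dir" || return 1
    # After= entries in a drop-in are added to those in the packaged unit file.
    cat > "$dropin_dir/10-postgresql-order.conf" <<'EOF'
[Unit]
After=postgresql.service
EOF
}
```

A possible invocation is "write_repmgr_dropin /etc/systemd/system && systemctl daemon-reload", followed by a reboot to confirm that repmgr10.service now starts after postgresql.service.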

Comment 8 Jaroslav Henner 2020-09-16 09:38:46 UTC
On CFME 5.11.7.3, rebooting the appliance with the failover manager (repmgr10) configured causes the repmgr10 service to fail to start.
On CFME 5.11.8.0, the same reboot does not prevent repmgr10 from starting, and the rebooted node is able to take over.

Seems to be fixed.

Comment 12 errata-xmlrpc 2020-09-30 14:01:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: CloudForms 5.0.8 security, bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4134

Comment 13 Jaroslav Henner 2020-11-03 16:11:40 UTC
RHCFQE-14630
https://github.com/ManageIQ/integration_tests/pull/10327