Bug 1856470 - repmgr10.service is failing to start on cfme db appliance reboot
Summary: repmgr10.service is failing to start on cfme db appliance reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Replication
Version: 5.11.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: GA
Target Release: 5.11.8
Assignee: Nick Carboni
QA Contact: Jaroslav Henner
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-13 17:52 UTC by Chinmay Paradkar
Modified: 2023-12-15 18:26 UTC (History)

Fixed In Version: 5.11.8.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-30 14:01:07 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:
simaishi: cfme-5.11.z+



Description Chinmay Paradkar 2020-07-13 17:52:33 UTC
Description of problem:
During failover testing, repmgr10.service fails to start when a cfme-db appliance is rebooted.

Version-Release number of selected component (if applicable):
cfme-rhos-5.11.6.0-1.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up replication between a Primary Database-Only Appliance and a Standby Database-Only Appliance.

2. To test failover, reboot the Primary Database-Only Appliance. After the node comes back up, the service "repmgr10.service" fails to start automatically, so failover does not work.

3. After "repmgr10.service" is started manually, failover works and the former "Standby Database-Only Appliance" becomes the "Primary Database-Only Appliance".

4. Repeat the failover test by rebooting the current "Primary Database-Only Appliance" node. "repmgr10.service" again fails to start automatically.

Actual results:
[root@movl-cfmedb2 ~]# systemctl status repmgr10.service
● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 movl-cfmedb2.auranocnorth.airtel.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

Expected results:
The service "repmgr10.service" should start automatically after reboot, and failover should work.

Additional info:
- The unit file for "repmgr10.service" declares dependencies on postgresql-10.service. We observed that "repmgr10.service" starts before "postgresql.service" or at the same time, which causes "repmgr10.service" to fail to start.


[root@cfmedb2 ~]# systemctl status postgresql.service repmgr10.service
● postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1547 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql (code=exited, status=0/SUCCESS)
 Main PID: 1553 (postmaster)
    Tasks: 7 (limit: 101365)
   Memory: 87.6M
   CGroup: /system.slice/postgresql.service
           ├─1553 /usr/bin/postmaster -D /var/lib/pgsql/data
           ├─1567 postgres: logger process
           ├─1568 postgres: startup process   recovering 0000000300000000000000DA
           ├─1570 postgres: checkpointer process
           ├─1571 postgres: writer process
           ├─1572 postgres: stats collector process
           └─1573 postgres: wal receiver process   streaming 0/DA554E48

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting PostgreSQL database server...
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on IPv4 address "0.0.0.0", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on IPv6 address "::", port 5432
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:LOG:  redirecting log output to logging collector process
Jul 13 21:13:50 cfmedb2.example.com postmaster[1553]: 2020-07-13 11:43:50 EDT::5f0c8136.611:@:[1553]:HINT:  Future log output will appear in directory "log".
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Started PostgreSQL database server.

● repmgr10.service - A replication manager, and failover management tool for PostgreSQL
   Loaded: loaded (/usr/lib/systemd/system/repmgr10.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-07-13 21:13:50 IST; 22s ago
  Process: 1546 ExecStart=/usr/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose (code=exited, status=6)

Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Starting A replication manager, and failover management tool for PostgreSQL...
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] using provided configuration file "/etc/repmgr/10/repmgr.conf"
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] repmgrd (repmgr 4.0.6) starting up
Jul 13 21:13:50 cfmedb2.example.com repmgrd[1546]: [2020-07-13 21:13:50] [NOTICE] redirecting logging output to "/var/log/repmgr/repmgrd.log"
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Control process exited, code=exited status=6
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: repmgr10.service: Failed with result 'exit-code'.
Jul 13 21:13:50 cfmedb2.example.com systemd[1]: Failed to start A replication manager, and failover management tool for PostgreSQL.

- We found a workaround, which is documented in KCS article https://access.redhat.com/solutions/5219591

Comment 4 Nick Carboni 2020-07-16 19:23:56 UTC
Ah, the bug here is actually that the PostgreSQL service name is incorrect in the repmgr systemd service file. This is mentioned in the KB, but not in the BZ.

> The repmgr10.service has dependencies configured in it's unit file to postgresql-10.service whereas name of the postgresql service is postgresql.service.

That change is all that's required to fix this.
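
To illustrate the mismatch described above, here is a minimal Python sketch that lists the units named in a unit file's dependency directives and reports any that are not installed. The sample unit text, the choice of directives (After, Requires, Wants, BindsTo), and the set of installed units are assumptions made for illustration; the actual contents of the repmgr10.service file are not shown in this report.

```python
# Directives whose values are lists of other unit names.
DEPENDENCY_KEYS = {"After", "Requires", "Wants", "BindsTo"}


def referenced_units(unit_text):
    """Return the unit names listed in dependency/ordering directives."""
    names = set()
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Unit].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() in DEPENDENCY_KEYS:
            names.update(value.split())
    return names


def missing_units(unit_text, installed_units):
    """Return referenced unit names that are not in installed_units."""
    return sorted(referenced_units(unit_text) - set(installed_units))


# Hypothetical unit content: the report only states that the file refers
# to postgresql-10.service, not which directive carries that reference.
SAMPLE_UNIT = """\
[Unit]
Description=A replication manager, and failover management tool for PostgreSQL
After=network.target postgresql-10.service
"""

# Hypothetical set of units present on the appliance.
INSTALLED = {"network.target", "postgresql.service"}

print(missing_units(SAMPLE_UNIT, INSTALLED))
```

With the assumed sample, the only name reported is postgresql-10.service. systemd ignores an ordering directive that names a unit which is not loaded, so repmgrd is not ordered after PostgreSQL, which is consistent with both services starting in the same second in the logs above.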

Comment 8 Jaroslav Henner 2020-09-16 09:38:46 UTC
On CFME 5.11.7.3, rebooting the appliance with a configured failover manager (repmgr10) causes the repmgr10 service to fail to start.
On CFME 5.11.8.0, the same reboot does not prevent repmgr10 from starting, and the rebooted node is able to take over.

Seems to be fixed.
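
A post-reboot check like the one described above could be scripted roughly as follows. This is only a sketch: the helper name unit_is_active and the injectable run parameter are hypothetical and are not taken from the linked integration tests. It relies on `systemctl is-active --quiet`, which exits with status 0 when the unit is active.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active` reports the unit as active.

    `run` is injectable so the check can be exercised without systemd
    (hypothetical convenience, not part of any existing tool).
    """
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0
```

On a rebooted appliance, calling unit_is_active("repmgr10.service") would be expected to return False on an affected version and True on a fixed one.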

Comment 12 errata-xmlrpc 2020-09-30 14:01:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: CloudForms 5.0.8 security, bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4134

Comment 13 Jaroslav Henner 2020-11-03 16:11:40 UTC
RHCFQE-14630
https://github.com/ManageIQ/integration_tests/pull/10327

