Bug 2051912 - Some of the services failed to start after satellite restore
Summary: Some of the services failed to start after satellite restore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Installation
Version: 6.10.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: 6.11.0
Assignee: Evgeni Golov
QA Contact: Lukas Pramuk
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-08 10:40 UTC by Devendra Singh
Modified: 2022-07-05 14:33 UTC
4 users

Fixed In Version: foreman-installer-3.1.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-05 14:32:53 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Foreman Issue Tracker 33744 (Normal, Closed): workaround for pulpcore-content PostgreSQL connection issue (last updated 2022-02-09 07:45:26 UTC)
Foreman Issue Tracker 34394 (Normal, New): Restore of Katello 4.3 leaves the system in not fully functional state (last updated 2022-02-08 14:26:58 UTC)
Red Hat Product Errata RHSA-2022:5498 (last updated 2022-07-05 14:33:06 UTC)

Description Devendra Singh 2022-02-08 10:40:36 UTC
Description of problem: Some of the services failed to start after satellite restore


Version-Release number of selected component (if applicable):
6.10.z 

How reproducible:
2/2

Steps to Reproduce:
1. Spin up a RHEL7 machine with 32 GB of RAM and an 8-core CPU.
2. Copy the 6.10.z backup data onto the spawned machine.
3. Clone the master branch of satellite-clone (https://github.com/RedHatSatellite/satellite-clone.git).
4. Update the configuration file to run the Satellite restore (a rough sketch of steps 3-5 follows below).
5. Run the Satellite restore.
6. The satellite-clone playbook fails at the very end because some of the services fail to start.
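
A sketch of what steps 3-5 look like in practice; the vars file, inventory, and playbook names are assumptions taken from the upstream satellite-clone README, not from this report:

# git clone https://github.com/RedHatSatellite/satellite-clone.git
# cd satellite-clone
# cp satellite-clone-vars.sample.yml satellite-clone-vars.yml
(edit satellite-clone-vars.yml so the backup directory variable points at the copied 6.10.z backup data, and put the target host in the inventory file)
# ansible-playbook -i inventory satellite-clone-playbook.yml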
 
fatal: [xyz.com]: FAILED! => {"changed": true, "cmd": ["hammer", "ping"], "delta": "0:00:06.382378", "end": "2022-02-07 13:29:05.027946", "msg": "non-zero return code", "rc": 1, "start": "2022-02-07 13:28:58.645568", "stderr": "", "stderr_lines": [], "stdout": "database:         \n    Status:          ok\n    Server Response: Duration: 0ms\nkatello_agent:    \n    Status:          ok\n    message:         0 Processed, 0 Failed\n    Server Response: Duration: 0ms\ncandlepin:        \n    Status:          ok\n    Server Response: Duration: 29ms\ncandlepin_auth:   \n    Status:          ok\n    Server Response: Duration: 29ms\ncandlepin_events: \n    Status:          ok\n    message:         0 Processed, 0 Failed\n    Server Response: Duration: 0ms\nkatello_events:   \n    Status:          ok\n    message:         0 Processed, 0 Failed\n    Server Response: Duration: 0ms\npulp3:            \n    Status:          ok\n    Server Response: Duration: 85ms\npulp3_content:    \n    Status:          FAIL\n    Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.\nforeman_tasks:    \n    Status:          FAIL\n    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status", "stdout_lines": ["database:         ", "    Status:          ok", "    Server Response: Duration: 0ms", "katello_agent:    ", "    Status:          ok", "    message:         0 Processed, 0 Failed", "    Server Response: Duration: 0ms", "candlepin:        ", "    Status:          ok", "    Server Response: Duration: 29ms", "candlepin_auth:   ", "    Status:          ok", "    Server Response: Duration: 29ms", "candlepin_events: ", "    Status:          ok", "    message:         0 Processed, 0 Failed", "    Server Response: Duration: 0ms", "katello_events:   ", "    Status:          ok", "    message:         0 Processed, 0 Failed", "    Server Response: Duration: 0ms", "pulp3:            ", "    Status:          ok", "    Server Response: Duration: 85ms", "pulp3_content:    ", "    Status:          FAIL", "    Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.", "foreman_tasks:    ", "    Status:          FAIL", "    Server Response: Message: some executors are not responding, check 
Actual results:

Some of the services failed to start after satellite restore

Expected results:

Satellite restore should be completed successfully.

Additional info:

The satellite-clone run failed because of this, but I tried restarting the services manually and they started fine.
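
The exact restart command used isn't recorded in this report; a plausible equivalent is restarting everything at once with satellite-maintain:

# satellite-maintain service restart

or restarting only the two affected units (dynflow-sidekiq instance names vary between versions):

# systemctl restart pulpcore-content dynflow-sidekiq@orchestrator dynflow-sidekiq@worker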

Comment 2 Evgeni Golov 2022-02-08 14:26:59 UTC
the output, in readable form:

database:         
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:    
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:        
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth:   
    Status:          ok
    Server Response: Duration: 29ms
candlepin_events: 
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:   
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
pulp3:            
    Status:          ok
    Server Response: Duration: 85ms
pulp3_content:    
    Status:          FAIL
    Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status"

For the tasks part we already have https://projects.theforeman.org/issues/34394; let's see what pulp is unhappy about.
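
A generic way to see what pulpcore-content logged around the failure (not taken from the attached debug):

# systemctl status pulpcore-content
# journalctl -u pulpcore-content --since "2022-02-07 13:00"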

Comment 3 Evgeni Golov 2022-02-08 14:38:23 UTC
interestingly, the attached debug has the following hammer ping output:

COMMAND> hammer ping

database:         
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:    
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:        
    Status:          ok
    Server Response: Duration: 46ms
candlepin_auth:   
    Status:          ok
    Server Response: Duration: 35ms
candlepin_events: 
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:   
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:            
    Status:          ok
    Server Response: Duration: 111ms
pulp3_content:    
    Status:          ok
    Server Response: Duration: 64ms
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status


so pulp is actually happy there?

Comment 4 Evgeni Golov 2022-02-08 14:51:30 UTC
observations from talking to Devendra: this happened with two databases (call them S and B):
on S, only foreman_tasks was failing
on B, the output of "hammer ping" during the run shows pulp3_content *and* foreman_tasks failing, but a later generated foreman debug only shows foreman_tasks

so it looks like we're not waiting long enough for the pulp case (or not watching for some event)

Comment 6 Evgeni Golov 2022-02-08 16:01:08 UTC
so looking in more detail at the debug,

PostgreSQL is running "since Mon 2022-02-07 13:16:10 EST", Pulpcore "since Mon 2022-02-07 13:13:29 EST" and dynflow "since Mon 2022-02-07 13:14:28 EST"

so there is something that doesn't survive the restart of PostgreSQL nicely (bad), but I would actually expect us to restart all relevant services *together*

the PSQL restart is also in the installer log:
2022-02-07 13:16:09 [DEBUG ] [configure] Executing: '/bin/systemctl restart -- postgresql'

this kinda implies the fix from https://github.com/theforeman/puppet-pulpcore/commit/440a8e054248f271f3c0b022844a07a1bac580b2 isn't working?!
and indeed, I do not see a relationship in the log
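
For reference, the start times quoted above can be compared directly on the box with a generic systemd query (unit instance name assumed, it varies between versions):

# systemctl show -p ActiveEnterTimestamp postgresql pulpcore-content dynflow-sidekiq@orchestrator

If the services had been restarted together with (or after) PostgreSQL, its timestamp would not be the newest of the three.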

Comment 7 Evgeni Golov 2022-02-09 07:48:29 UTC
Thanks to Devendra, we got a reproducer machine and after looking around I realized this is failing to restore a 6.10, not a 7.0 (should have been obvious yesterday, but alas).

6.10 doesn't have the above-mentioned puppet-pulpcore patch, so it's failing there. I've attached Foreman Redmine #33744 to this BZ, but ultimately this needs to be fixed in a 6.10.z, as that part doesn't actually affect 7.0.

Still hunting the foreman_tasks part, as that seems to apply to both 6.10 and 7.0.

Comment 8 Evgeni Golov 2022-02-09 08:34:32 UTC
Anyone up for a history lesson? Sit down and have some tea!

It seems we have two services that don't survive a PostgreSQL disconnect well: pulpcore-content (see https://pulp.plan.io/issues/9276, fixed in 3.14.9+, 3.16.1+, 3.17.0+) and dynflow-sidekiq.
For pulpcore-content we knew this and implemented a workaround that's present in 7.0 (https://projects.theforeman.org/issues/33744), but not in the 6.10 that's to be restored here. (Technically, 7.0 doesn't need the workaround, as it ships pulpcore 3.16.1.)
For dynflow-sidekiq I'm still digging.

Now, you'll rightfully ask "but this BZ is against satellite-clone?!", and yes, that is where Devendra found the effect of those services.
But satellite-clone is merely triggering a restore via foreman-maintain,
which since https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b triggers an additional installer run,
which in turn triggers a reload of the PostgreSQL config and *that* disconnects existing clients.

So ultimately we need to fix the services (as happened for pulpcore), but we can deploy workarounds in the installer.
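
Put differently, the trigger should be reproducible without a full restore, roughly like this (my shorthand for the chain above, not a sequence from the report):

# systemctl restart postgresql
# hammer ping

after which pulp3_content and foreman_tasks report FAIL (on an unpatched 6.10) until their services are restarted.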

Comment 10 Evgeni Golov 2022-02-10 11:15:06 UTC
So I tried a restore with https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b reverted (more precisely, with the "Procedures::Installer::Upgrade" step removed), and dynflow remains happy.

Which kinda underlines the finding of what the *trigger* is, but not why dynflow doesn't survive a restart of the DB.

Comment 12 Bryan Kearney 2022-02-17 16:06:45 UTC
Upstream bug assigned to egolov

Comment 13 Bryan Kearney 2022-02-17 16:06:47 UTC
Upstream bug assigned to egolov

Comment 14 Bryan Kearney 2022-02-22 12:05:33 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/33744 has been resolved.

Comment 15 Lukas Pramuk 2022-03-30 20:20:49 UTC
VERIFIED.

@Satellite 6.11.0 Snap14
foreman-installer-3.1.2-1.el7sat.noarch

by a simplified reproducer based on comment#0 (avoiding satellite-clone; just backup & restore on the same machine):

1) Install 6.11.0 Snap11 (before the fix)

2) Create an offline backup

# satellite-maintain backup offline /var/backup

3) Restore the backup

# satellite-maintain restore -y /var/backup/satellite-backup-*

4) Check Satellite status

# hammer ping
...
foreman_tasks:    
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

>>> reproduced the original issue; now we can verify the fix with confidence

5) Upgrade to 6.11.0 Snap14 (after the fix)

6) Restore the backup 

# satellite-maintain restore -y /var/backup/satellite-backup-*

7) Check Satellite status

# hammer ping
...
foreman_tasks:    
    Status:          ok
    Server Response: Duration: 12ms

>>> after the DB restore, foreman_tasks is in good shape

Comment 18 errata-xmlrpc 2022-07-05 14:32:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5498

