Description of problem:
Some of the services failed to start after a Satellite restore.

Version-Release number of selected component (if applicable):
6.10.z

How reproducible:
2/2

Steps to Reproduce:
1. Spin up a RHEL 7 machine with 32 GB of RAM and an 8-core CPU.
2. Copy the 6.10.z backup data to the spawned machine.
3. Clone the master branch of satellite-clone (https://github.com/RedHatSatellite/satellite-clone.git).
4. Update the configuration file to run the Satellite restore.
5. Run the Satellite restore.
6. The satellite-clone playbook fails at the very end because some of the services fail to start:

fatal: [xyz.com]: FAILED! => {"changed": true, "cmd": ["hammer", "ping"], "delta": "0:00:06.382378", "end": "2022-02-07 13:29:05.027946", "msg": "non-zero return code", "rc": 1, "start": "2022-02-07 13:28:58.645568", "stderr": "", "stderr_lines": [], "stdout": "database: \n Status: ok\n Server Response: Duration: 0ms\nkatello_agent: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\ncandlepin: \n Status: ok\n Server Response: Duration: 29ms\ncandlepin_auth: \n Status: ok\n Server Response: Duration: 29ms\ncandlepin_events: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\nkatello_events: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\npulp3: \n Status: ok\n Server Response: Duration: 85ms\npulp3_content: \n Status: FAIL\n Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.\nforeman_tasks: \n Status: FAIL\n Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status", "stdout_lines": ["database: ", " Status: ok", " Server Response: Duration: 0ms", "katello_agent: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "candlepin: ", " Status: ok", " Server Response: Duration: 29ms", "candlepin_auth: ", " Status: ok", " Server Response: Duration: 29ms", "candlepin_events: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "katello_events: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "pulp3: ", " Status: ok", " Server Response: Duration: 85ms", "pulp3_content: ", " Status: FAIL", " Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.", "foreman_tasks: ", " Status: FAIL", " Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status"]}

Actual results:
Some of the services failed to start after the Satellite restore.

Expected results:
The Satellite restore should complete successfully.

Additional info:
The satellite-clone run failed because of that, but when I restarted the services manually, they started fine.
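To make the "restarted the services manually" step from the Additional info reproducible, a minimal sketch of that recovery, assuming the stock Satellite 6.10 tooling (satellite-maintain wraps the whole service stack):

  # restart all Satellite services, which covers pulpcore-content
  # and the dynflow-sidekiq instances as well
  satellite-maintain service restart

  # re-check health afterwards; hammer ping exits non-zero if anything FAILs
  hammer ping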
the output, in readable form:

database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
pulp3:
    Status:          ok
    Server Response: Duration: 85ms
pulp3_content:
    Status:          FAIL
    Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

For the tasks part we have https://projects.theforeman.org/issues/34394 already; let's see what pulp is unhappy about.
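One way to see what pulp is unhappy about is Pulp 3's own status API, which reports the running content apps directly; a sketch, assuming the API is served on the default vhost (-k skips CA validation):

  # "online_content_apps" in the response is what the pulp3_content
  # check is effectively looking at
  curl -sk "https://$(hostname -f)/pulp/api/v3/status/" | python3 -m json.tool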
interestingly, the attached debug has the following hammer ping output:

COMMAND> hammer ping

database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 46ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 35ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:
    Status:          ok
    Server Response: Duration: 111ms
pulp3_content:
    Status:          ok
    Server Response: Duration: 64ms
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

so pulp is actually happy there?
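For the remaining foreman_tasks failure, the executors behind it run as systemd template units; a quick check (the instance names being the usual 6.10 defaults is an assumption here):

  # list the sidekiq-based dynflow executors and their state
  systemctl --no-pager status 'dynflow-sidekiq@*'

  # the status endpoint the error message points at (may require an
  # authenticated session, depending on the setup)
  curl -sk "https://$(hostname -f)/foreman_tasks/dynflow/status"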
observations from talking to Devendra: this happened with two databases (call them S and B):
- on S, only foreman_tasks was failing
- on B, the output of "hammer ping" during the run shows pulp3_content *and* foreman_tasks failing, but a later generated foreman-debug only shows foreman_tasks

so it looks like we're not waiting long enough in the pulp case (or not watching for some event); something like the polling sketch below would catch that
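A sketch of such a watch, polling hammer ping from a second shell while the restore runs, so transient FAILs like the pulp3_content one get logged even when a later foreman-debug no longer shows them (the 30s interval is arbitrary):

  while true; do
    date
    # -B1 also prints the service name line preceding each "Status: FAIL"
    hammer ping | grep -B1 FAIL
    sleep 30
  done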
so looking in more detail at the debug: PostgreSQL is running "since Mon 2022-02-07 13:16:10 EST", Pulpcore "since Mon 2022-02-07 13:13:29 EST" and dynflow "since Mon 2022-02-07 13:14:28 EST"

so there is something that doesn't survive the restart of PostgreSQL nicely (bad), but I would actually expect us to restart all relevant services *together*

the PSQL restart is also in the installer log:

2022-02-07 13:16:09 [DEBUG ] [configure] Executing: '/bin/systemctl restart -- postgresql'

this kinda implies the fix from https://github.com/theforeman/puppet-pulpcore/commit/440a8e054248f271f3c0b022844a07a1bac580b2 isn't working?! and indeed, I do not see such a relationship in the log
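For reference, the timestamps above can be pulled straight from systemd, and a firing notify relationship should leave a "Scheduling refresh" trace in the installer log; a sketch, where the dynflow instance names and the exact Puppet log wording are assumptions:

  # compare when each unit last (re)started
  systemctl show -p ActiveEnterTimestamp postgresql pulpcore-content \
      dynflow-sidekiq@orchestrator dynflow-sidekiq@worker

  # if the puppet-pulpcore relationship were in effect, Puppet should have
  # scheduled a refresh of the pulpcore services right after the restart
  grep -i 'refresh of Service\[pulpcore' /var/log/foreman-installer/satellite.log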
Thanks to Devendra, we got a reproducer machine, and after looking around I realized this is failing to restore a 6.10, not a 7.0 (should have been obvious yesterday, but alas). 6.10 doesn't have the above-mentioned puppet-pulpcore patch, so it's failing there. I've attached Foreman Redmine #33744 to this BZ, but ultimately this needs to be fixed in a 6.10.z, as that part doesn't actually affect 7.0. Still hunting the foreman_tasks part, as that seems to apply to both 6.10 and 7.0.
Anyone up for a history lesson? Sit down and have some tea!

It seems we have two services that don't survive a PostgreSQL disconnect well: pulpcore-content (see https://pulp.plan.io/issues/9276, fixed in 3.14.9+, 3.16.1+, 3.17.0+) and dynflow-sidekiq.

For pulpcore-content we knew this and implemented a workaround that's present in 7.0 (https://projects.theforeman.org/issues/33744), but not in the 6.10 that's to-be-restored here. (Technically, 7.0 doesn't need the workaround, as it has pulpcore 3.16.1.) For dynflow-sidekiq I'm still digging.

Now, you'll rightfully ask "but this BZ is against satellite-clone?!", and yes, this is where the effect of those services was found by Devendra. But satellite-clone is merely triggering a restore via foreman-maintain, which since https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b triggers an additional installer run, which in turn triggers a reload of the PostgreSQL config, and *that* disconnects existing clients.

So ultimately we need to fix the services (like it happened for pulpcore), but we can deploy workarounds in the installer.
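Given that chain, the trigger should be reproducible without satellite-clone or foreman-maintain at all; a sketch, assuming a plain PostgreSQL restart disconnects clients the same way the extra installer run does:

  # disconnect all existing clients, the way the installer run does
  systemctl restart postgresql
  sleep 60
  # on an unpatched 6.10, expect pulp3_content and/or foreman_tasks to FAIL
  hammer ping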
So I tried a restore with https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b reverted (more precisely, with the "Procedures::Installer::Upgrade" step removed), and dynflow remains happy. Which kinda underlines the finding of what the *trigger* is, but not why dynflow doesn't survive a restart of the DB.
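For anyone wanting to repeat that experiment, a rough sketch; note the actual test just removed the Procedures::Installer::Upgrade step from the installed copy rather than reverting properly:

  git clone https://github.com/theforeman/foreman_maintain
  cd foreman_maintain
  # drop the commit that added the extra installer run after restore
  git revert 6089aa60271d1036604c135c89ec5263da5d198b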
Upstream bug assigned to egolov
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/33744 has been resolved.
VERIFIED.

@Satellite 6.11.0 Snap14
foreman-installer-3.1.2-1.el7sat.noarch

by the simplified reproducer based on comment#0 (avoiding satellite-clone; just backup & restore on the same machine):

1) Install 6.11.0 Snap11 (before the fix)

2) Create an offline backup
# satellite-maintain backup offline /var/backup

3) Restore the backup
# satellite-maintain restore -y /var/backup/satellite-backup-*

4) Check Satellite status
# hammer ping
...
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

>>> reproduced the original issue, now we can verify the fix with confidence

5) Upgrade to 6.11.0 Snap14 (after the fix)

6) Restore the backup
# satellite-maintain restore -y /var/backup/satellite-backup-*

7) Check Satellite status
# hammer ping
...
foreman_tasks:
    Status:          ok
    Server Response: Duration: 12ms

>>> after the DB restore, foreman_tasks are in good shape
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5498