Description of problem:
Some of the services failed to start after a Satellite restore.

Version-Release number of selected component (if applicable):
6.10.z

How reproducible:
2/2

Steps to Reproduce:
1. Spin up a RHEL 7 machine with 32 GB of RAM and an 8-core CPU.
2. Copy the 6.10.z backup data to the spawned machine.
3. Clone the master branch of satellite-clone (https://github.com/RedHatSatellite/satellite-clone.git).
4. Update the configuration file to run the Satellite restore.
5. Run the Satellite restore.
6. The satellite-clone playbook fails at the very end because some of the services fail to start:

fatal: [xyz.com]: FAILED! => {"changed": true, "cmd": ["hammer", "ping"], "delta": "0:00:06.382378", "end": "2022-02-07 13:29:05.027946", "msg": "non-zero return code", "rc": 1, "start": "2022-02-07 13:28:58.645568", "stderr": "", "stderr_lines": [], "stdout": "database: \n Status: ok\n Server Response: Duration: 0ms\nkatello_agent: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\ncandlepin: \n Status: ok\n Server Response: Duration: 29ms\ncandlepin_auth: \n Status: ok\n Server Response: Duration: 29ms\ncandlepin_events: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\nkatello_events: \n Status: ok\n message: 0 Processed, 0 Failed\n Server Response: Duration: 0ms\npulp3: \n Status: ok\n Server Response: Duration: 85ms\npulp3_content: \n Status: FAIL\n Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.\nforeman_tasks: \n Status: FAIL\n Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status", "stdout_lines": ["database: ", " Status: ok", " Server Response: Duration: 0ms", "katello_agent: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "candlepin: ", " Status: ok", " Server Response: Duration: 29ms", "candlepin_auth: ", " Status: ok", " Server Response: Duration: 29ms", "candlepin_events: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "katello_events: ", " Status: ok", " message: 0 Processed, 0 Failed", " Server Response: Duration: 0ms", "pulp3: ", " Status: ok", " Server Response: Duration: 85ms", "pulp3_content: ", " Status: FAIL", " Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.", "foreman_tasks: ", " Status: FAIL", " Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status"]}

Actual results:
Some of the services failed to start after the Satellite restore.

Expected results:
The Satellite restore should complete successfully.

Additional info:
The satellite-clone run failed because of that, but when I restarted the services manually, they started fine.
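To make the "restarted the services manually" step from the Additional info reproducible, a minimal sketch of that recovery, assuming the stock Satellite 6.10 tooling (satellite-maintain wraps the whole service stack):

  # restart all Satellite services, which covers pulpcore-content
  # and the dynflow-sidekiq instances as well
  satellite-maintain service restart

  # re-check health afterwards; hammer ping exits non-zero if anything FAILs
  hammer ping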
the output, in readable form:

database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 29ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
pulp3:
    Status:          ok
    Server Response: Duration: 85ms
pulp3_content:
    Status:          FAIL
    Server Response: Message: No pulpcore content apps are running at https://xyz.com/pulp/api/v3.
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

For the tasks part we have https://projects.theforeman.org/issues/34394 already; let's see what pulp is unhappy about.
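One way to see what pulp is unhappy about is Pulp 3's own status API, which reports the running content apps directly; a sketch, assuming the API is served on the default vhost (-k skips CA validation):

  # "online_content_apps" in the response is what the pulp3_content
  # check is effectively looking at
  curl -sk "https://$(hostname -f)/pulp/api/v3/status/" | python3 -m json.tool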
interestingly, the attached debug has the following hammer ping output:

COMMAND> hammer ping

database:
    Status:          ok
    Server Response: Duration: 0ms
katello_agent:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
candlepin:
    Status:          ok
    Server Response: Duration: 46ms
candlepin_auth:
    Status:          ok
    Server Response: Duration: 35ms
candlepin_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 0ms
katello_events:
    Status:          ok
    message:         0 Processed, 0 Failed
    Server Response: Duration: 1ms
pulp3:
    Status:          ok
    Server Response: Duration: 111ms
pulp3_content:
    Status:          ok
    Server Response: Duration: 64ms
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

so pulp is actually happy there?
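For the remaining foreman_tasks failure, the executors behind it run as systemd template units; a quick check (the instance names being the usual 6.10 defaults is an assumption here):

  # list the sidekiq-based dynflow executors and their state
  systemctl --no-pager status 'dynflow-sidekiq@*'

  # the status endpoint the error message points at (may require an
  # authenticated session, depending on the setup)
  curl -sk "https://$(hostname -f)/foreman_tasks/dynflow/status"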
observations from talking to Devendra: this happened with two databases (call them S and B):
- on S, only foreman_tasks was failing
- on B, the output of "hammer ping" during the run shows pulp3_content *and* foreman_tasks failing, but a later generated foreman-debug only shows foreman_tasks

so it looks like we're not waiting long enough in the pulp case (or not watching for some event); something like the polling sketch below would catch that
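A sketch of such a watch, polling hammer ping from a second shell while the restore runs, so transient FAILs like the pulp3_content one get logged even when a later foreman-debug no longer shows them (the 30s interval is arbitrary):

  while true; do
    date
    # -B1 also prints the service name line preceding each "Status: FAIL"
    hammer ping | grep -B1 FAIL
    sleep 30
  done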
so looking in more detail at the debug: PostgreSQL is running "since Mon 2022-02-07 13:16:10 EST", Pulpcore "since Mon 2022-02-07 13:13:29 EST" and dynflow "since Mon 2022-02-07 13:14:28 EST"

so there is something that doesn't survive the restart of PostgreSQL nicely (bad), but I would actually expect us to restart all relevant services *together*

the PSQL restart is also in the installer log:

2022-02-07 13:16:09 [DEBUG ] [configure] Executing: '/bin/systemctl restart -- postgresql'

this kinda implies the fix from https://github.com/theforeman/puppet-pulpcore/commit/440a8e054248f271f3c0b022844a07a1bac580b2 isn't working?! and indeed, I do not see such a relationship in the log
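For reference, the timestamps above can be pulled straight from systemd, and a firing notify relationship should leave a "Scheduling refresh" trace in the installer log; a sketch, where the dynflow instance names and the exact Puppet log wording are assumptions:

  # compare when each unit last (re)started
  systemctl show -p ActiveEnterTimestamp postgresql pulpcore-content \
      dynflow-sidekiq@orchestrator dynflow-sidekiq@worker

  # if the puppet-pulpcore relationship were in effect, Puppet should have
  # scheduled a refresh of the pulpcore services right after the restart
  grep -i 'refresh of Service\[pulpcore' /var/log/foreman-installer/satellite.log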
Thanks to Devendra, we got a reproducer machine, and after looking around I realized this is failing to restore a 6.10, not a 7.0 (should have been obvious yesterday, but alas). 6.10 doesn't have the above-mentioned puppet-pulpcore patch, so it's failing there. I've attached Foreman Redmine #33744 to this BZ, but ultimately this needs to be fixed in a 6.10.z, as that part doesn't actually affect 7.0. Still hunting the foreman_tasks part, as that seems to apply to both 6.10 and 7.0.
Anyone up for a history lesson? Sit down and have some tea!

It seems we have two services that don't survive a PostgreSQL disconnect well: pulpcore-content (see https://pulp.plan.io/issues/9276, fixed in 3.14.9+, 3.16.1+, 3.17.0+) and dynflow-sidekiq.

For pulpcore-content we knew this and implemented a workaround that's present in 7.0 (https://projects.theforeman.org/issues/33744), but not in the 6.10 that's to-be-restored here. (Technically, 7.0 doesn't need the workaround, as it has pulpcore 3.16.1.) For dynflow-sidekiq I'm still digging.

Now, you'll rightfully ask "but this BZ is against satellite-clone?!", and yes, this is where the effect of those services was found by Devendra. But satellite-clone is merely triggering a restore via foreman-maintain, which since https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b triggers an additional installer run, which in turn triggers a reload of the PostgreSQL config, and *that* disconnects existing clients.

So ultimately we need to fix the services (like it happened for pulpcore), but we can deploy workarounds in the installer.
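Given that chain, the trigger should be reproducible without satellite-clone or foreman-maintain at all; a sketch, assuming a plain PostgreSQL restart disconnects clients the same way the extra installer run does:

  # disconnect all existing clients, the way the installer run does
  systemctl restart postgresql
  sleep 60
  # on an unpatched 6.10, expect pulp3_content and/or foreman_tasks to FAIL
  hammer ping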
So I tried a restore with https://github.com/theforeman/foreman_maintain/commit/6089aa60271d1036604c135c89ec5263da5d198b reverted (more precisely, with the "Procedures::Installer::Upgrade" step removed), and dynflow remains happy. Which kinda underlines the finding of what the *trigger* is, but not why dynflow doesn't survive a restart of the DB.
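For anyone wanting to repeat that experiment, a rough sketch; note the actual test just removed the Procedures::Installer::Upgrade step from the installed copy rather than reverting properly:

  git clone https://github.com/theforeman/foreman_maintain
  cd foreman_maintain
  # drop the commit that added the extra installer run after restore
  git revert 6089aa60271d1036604c135c89ec5263da5d198b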
Upstream bug assigned to egolov
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/33744 has been resolved.
VERIFIED.

@Satellite 6.11.0 Snap14
foreman-installer-3.1.2-1.el7sat.noarch

by the simplified reproducer based on comment#0 (avoiding satellite-clone; just backup & restore on the same machine):

1) Install 6.11.0 Snap11 (before the fix)

2) Create an offline backup
# satellite-maintain backup offline /var/backup

3) Restore the backup
# satellite-maintain restore -y /var/backup/satellite-backup-*

4) Check Satellite status
# hammer ping
...
foreman_tasks:
    Status:          FAIL
    Server Response: Message: some executors are not responding, check /foreman_tasks/dynflow/status

>>> reproduced the original issue, now we can verify the fix with confidence

5) Upgrade to 6.11.0 Snap14 (after the fix)

6) Restore the backup
# satellite-maintain restore -y /var/backup/satellite-backup-*

7) Check Satellite status
# hammer ping
...
foreman_tasks:
    Status:          ok
    Server Response: Duration: 12ms

>>> after the DB restore, foreman_tasks are in good shape
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.11 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5498