Bug 1738498

Summary: Collecting /var/lib/qpidd in Procedures::Backup::ConfigFiles can cause an incoherent backup to be created
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Satellite Maintain
Assignee: Amit Upadhye <aupadhye>
Status: CLOSED DUPLICATE
QA Contact: Lucie Vrtelova <lvrtelov>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.5.0
CC: apatel, aupadhye, jpathan, kgaikwad, ofalk, wclark
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-14 11:12:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Hotfix RPM none

Description Pavel Moravec 2019-08-07 10:33:08 UTC
Description of problem:
This is just a theoretical use case / scenario, but I can come up with a specific reproducer (esp. for QE).

Assume foreman-maintain backup (either online or offline) runs while qpidd is changing the content of its durable queues, as triggered by activity such as:
- a Content Host is registered or unregistered, so a pulp.agent.* queue is created or deleted
- a pulp task is created or changes status (so the resource_manager or reserved_resource_worker-* queues change content)
- a candlepin event is received from candlepin or consumed by the LOCE task
- a few other activities affecting the pulp.task or celery queues

There is a concurrency bug as follows:
- foreman-maintain executes Procedures::Backup::ConfigFiles at a very early stage, so /var/lib/qpidd (listed among the pulp config_files) is archived while services are still running
- next, some activity described in the previous paragraph happens, changing the content of /var/lib/qpidd
- only now is Satellite put into maintenance mode and its services stopped

IMHO /var/lib/qpidd should be collected at the same stage as /var/lib/pulp (BUT in either case, i.e. even with --skip-pulp-content, and only at that stage of the backup process), since /var/lib/qpidd is not static configuration but changing data that should be collected while services are stopped.
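
To make the ordering problem concrete, here is a toy model in Python. It is only a sketch: the function name, the in-memory list standing in for /var/lib/qpidd, and the queue entries are all hypothetical and have nothing to do with the real foreman-maintain code. It compares archiving the store before the services stop (the current order) with archiving it afterwards (the proposed order).

```python
# Toy model only: nothing here is a real foreman-maintain API.

def run_backup(qpidd_early):
    """Return (archived qpidd state, qpidd state when services stopped)."""
    qpidd_store = ["pulp.task: msg-1"]  # stands in for /var/lib/qpidd
    archived = None

    if qpidd_early:
        # Current order: archived with the config files, services still running.
        archived = list(qpidd_store)

    # Activity that lands between the config-file step and the service stop.
    qpidd_store.append("pulp.agent.host-42: queue created")

    # Services stop here; from now on the store no longer changes.
    at_stop = list(qpidd_store)

    if not qpidd_early:
        # Proposed order: archived at the same stage as /var/lib/pulp.
        archived = list(qpidd_store)

    return archived, at_stop


early, final = run_backup(qpidd_early=True)
print("early archive matches stopped state:", early == final)  # False

late, final = run_backup(qpidd_early=False)
print("late archive matches stopped state:", late == final)    # True
```

In the early case the archive misses the queue change that the databases, which are backed up after the stop, may already reflect. That mismatch is the incoherence this BZ is about.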

(This BZ applies even after https://bugzilla.redhat.com/show_bug.cgi?id=1673908 is fixed. Sadly, I realized this scenario only now; it is possible that the code fix for bz1673908 will become redundant after this fix :( )


Version-Release number of selected component (if applicable):
Sat6.5


How reproducible:
??? with some probability


Steps to Reproduce:
1. Register many content hosts and start goferd on them concurrently, or create many pulp tasks concurrently
2. Meanwhile, call foreman-maintain backup (online or offline, it doesn't matter)
3. Once the backup stops services, stop the activity from 1.
4. Once backup completes, compare content of backed-up /var/lib/qpidd with real /var/lib/qpidd
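
For step 4, one possible way to do the comparison is to hash every file in both trees. The Python sketch below assumes the backed-up /var/lib/qpidd has already been extracted to some directory; the helper names are made up for illustration, and symlinks are simply skipped.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlinks are ignored in this sketch
            sha = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large journal files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests


def compare_trees(backup_root, live_root):
    """Report files that exist on one side only or whose content differs."""
    backup = tree_digests(backup_root)
    live = tree_digests(live_root)
    common = set(backup) & set(live)
    return {
        "only_in_backup": sorted(set(backup) - set(live)),
        "only_in_live": sorted(set(live) - set(backup)),
        "changed": sorted(p for p in common if backup[p] != live[p]),
    }
```

A call such as compare_trees("/tmp/extracted/var/lib/qpidd", "/var/lib/qpidd"), where the first path is a hypothetical extraction location, returns three empty lists when the trees are identical. The live tree should be read while qpidd is not running, otherwise it can change during the comparison.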


Actual results:
4. shows a difference (while a comparison of e.g. postgres data shows no diff). That could mean an incoherent backup has been created.


Expected results:
4. to show no diff


Additional info:
The incoherent backup might not matter, but it also can. E.g. a pulp task can be lost, a candlepin event can be lost, or a pulp.agent.* queue can be orphaned or, conversely, not created.

In all such cases there is a workaround (trigger a new pulp task, run katello:reimport, align the pulp.agent.* queues with the DB (there is a KCS for that)), so the current behaviour is not fatal. It just hinders identifying and working around those issues when recovering from an incoherent backup.

Comment 5 wclark 2019-09-24 22:10:11 UTC
Created attachment 1618746 [details]
Hotfix RPM

A hotfix RPM has been created; see the attachment above.

Installation instructions:

# rpm -Uvh rubygem-foreman_maintain-0.3.5-3.HOTFIXRHBZ1673908.el7sat.noarch.rpm

This hotfix resolves BZ1673908 as well.

Comment 8 Amit Upadhye 2021-04-14 11:12:19 UTC

*** This bug has been marked as a duplicate of bug 1673908 ***