Bug 2090271

Summary: Manifest refresh randomly fails with "No such file or directory" when having multiple dynflow workers
Product: Red Hat Satellite
Reporter: Pavel Moravec <pmoravec>
Component: Subscription Management
Assignee: Adam Ruzicka <aruzicka>
Status: CLOSED ERRATA
QA Contact: Cole Higgins <chiggins>
Severity: medium
Priority: unspecified
Version: 6.10.5
CC: aruzicka, egolov, osousa, pmendezh, sraut
Target Milestone: 6.11.1
Keywords: EasyFix, Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tfm-rubygem-katello-4.3.0.43-1
Clones: 2093408, 2106092
Last Closed: 2022-07-27 17:27:09 UTC
Type: Bug

Description Pavel Moravec 2022-05-25 13:12:46 UTC
Description of problem:
Manifest refresh randomly fails on a Satellite with multiple dynflow workers with error:

Error: No such file or directory @ rb_sysopen - /tmp/0.7851943882678857.zip

The reason is *tricky*:
- the ManifestRefresh task determines the filename for the new manifest file as /tmp/#{rand}.zip
- the UpstreamExport dynflow step is asked to export the new manifest to that file
- the subsequent Import dynflow step is asked to read the file and process the update further

The dynflow steps can be processed by different dynflow workers, which run as separate systemd services. And, sadly for us, each service gets its own private temp directory (systemd's PrivateTmp), like:

/tmp/systemd-private-4f8b157ce7c040f4b27e7ecbba68aa22-dynflow-sidekiq/tmp/

So, when the UpstreamExport step is executed by one dynflow worker, it puts the zip file into that worker's own private temp. And if we are unlucky, the Import step is picked up by another worker, which cannot find the file in its own private temp /o\ .

With 3 dynflow workers, this means there is only a 1/3 probability that the Import step lands on the same worker as the export, i.e. that the manifest refresh succeeds.


We need to use a static/shared tmp location instead.
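
To make the failure mode concrete, here is a minimal Ruby sketch of the pattern described above (class and method names are hypothetical, not the actual Katello code). Run in a single process it works; once the two steps execute in separate systemd services with PrivateTmp, "/tmp" resolves to a different private directory in each process and the read fails with Errno::ENOENT, i.e. "No such file or directory @ rb_sysopen - ...":

# Hypothetical sketch of the fragile pattern, not the real Katello code.
require 'tmpdir'

class UpstreamExportStep
  # Runs in worker process A; with PrivateTmp enabled, /tmp here is A's
  # private temp directory.
  def run(path)
    File.binwrite(path, 'fake manifest zip contents')
  end
end

class ImportStep
  # May run in worker process B; B's private /tmp does not contain the
  # file written by A, so File.binread raises Errno::ENOENT there.
  def run(path)
    File.binread(path)
  end
end

path = File.join(Dir.tmpdir, "#{rand}.zip")   # e.g. /tmp/0.7851943882678857.zip
UpstreamExportStep.new.run(path)
ImportStep.new.run(path)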


Version-Release number of selected component (if applicable):
Sat 6.10.5


How reproducible:
Roughly 2 out of 3 attempts fail with 3 dynflow workers


Steps to Reproduce:
1. Set up Satellite with 3 dynflow workers, e.g. per https://access.redhat.com/solutions/5695311
2. Import a manifest
3. Repeatedly refresh it:
hammer subscription refresh-manifest --organization-id=1


Actual results:
Step 3 randomly fails with the error:
Error: No such file or directory @ rb_sysopen - /tmp/0.7851943882678857.zip

In such a case, the zip file can be spotted under the private temp dir of one worker's service, e.g.:
/tmp/systemd-private-4f8b157ce7c040f4b27e7ecbba68aa22-dynflow-sidekiq/tmp/0.7851943882678857.zip


Expected results:
manifest refresh to always succeed


Additional info:

Comment 1 Adam Ruzicka 2022-05-25 14:20:07 UTC
We could either use a different temporary directory (~foreman/tmp maybe?) or make all the workers run in the same mount namespace using JoinsNamespaceOf[1] in the service definition. Depending on which route we take, the fix will need to go either to katello or to foreman; either way I'm not sure about the right component.

[1] - https://www.freedesktop.org/software/systemd/man/systemd.unit.html#JoinsNamespaceOf=
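
For the JoinsNamespaceOf route, a drop-in along these lines could work (purely illustrative; the instance names dynflow-sidekiq@worker-1/worker-2 are assumptions, and JoinsNamespaceOf= only makes the units share /tmp when PrivateTmp= is enabled on both sides):

# /etc/systemd/system/dynflow-sidekiq@worker-2.service.d/shared-tmp.conf
# Illustrative drop-in, not a configuration we ship today.
[Unit]
# Join the mount namespace of worker-1 so both workers see the same /tmp.
JoinsNamespaceOf=dynflow-sidekiq@worker-1.service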

Comment 2 Adam Ruzicka 2022-05-25 14:27:32 UTC
Created redmine issue https://projects.theforeman.org/issues/34957 from this bug

Comment 3 Evgeni Golov 2022-05-25 14:33:19 UTC
I guess the "correct" solution depends on which part of this we consider a bug ;-)

Is the general answer "dynflow workers should be able to exchange data via the filesystem", then they need either be in the same namespace (JoinNamespaceOf above) or explicitly have a way to say "store this data for sharing" (in Rails.root/tmp, or somewhere else).

Is the general answer "dynflow workers should be as isolated as possible, but this specific katello workflow needs it" then this workflow should write to Rails.root/tmp or similar

Comment 4 Bryan Kearney 2022-05-26 16:04:46 UTC
Upstream bug assigned to aruzicka

Comment 6 Bryan Kearney 2022-05-27 16:04:44 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/34957 has been resolved.

Comment 15 errata-xmlrpc 2022-07-27 17:27:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Satellite 6.11.1 Async Bug Fix Update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5742