Description of problem:

When upgrading from 4.3.11, hosted-engine --deploy --restore fails with a permission error on a GUID-named file under a temporary directory, after running for a long time.

Version-Release number of selected component (if applicable):
4.5.0

How reproducible:
At will

Steps to Reproduce:
1. Run an engine-backup on a 4.3.11 RHVM (example command at the end of this report). Store the backup somewhere convenient.
2. Set up a new Fibre Channel LUN to hold the new hosted engine.
3. Install VDSM on a RHEL 8.6 system to turn it into a hypervisor.
4. Run hosted-engine --deploy --restore with the backup created above.

Actual results:
It runs for more than half an hour and then fails with a permission error, several steps after setting up the new Fibre Channel storage domain.

Expected results:
It should run to completion and create a new hosted engine.

Additional info:
I'll attach the output and log. Here is the critical part.

. . .
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Restart fapolicyd service]
[ INFO ] skipping: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Copy configuration archive to storage]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["dd", "bs=20480", "count=1", "oflag=direct", "if=/var/tmp/localvm0k3z_k73/2cdb117c-93d4-4a1e-b5da-4f95e230bd4b", "of=/rhev/data-center/mnt/blockSD/f8e2740b-d342-44d3-ac3b-deb626798402/images/e08590f5-bebb-4b3b-b3e4-b3fb3bf144eb/2cdb117c-93d4-4a1e-b5da-4f95e230bd4b"], "delta": "0:00:00.002387", "end": "2022-07-15 15:01:49.610596", "msg": "non-zero return code", "rc": 1, "start": "2022-07-15 15:01:49.608209", "stderr": "dd: failed to open '/var/tmp/localvm0k3z_k73/2cdb117c-93d4-4a1e-b5da-4f95e230bd4b': Permission denied", "stderr_lines": ["dd: failed to open '/var/tmp/localvm0k3z_k73/2cdb117c-93d4-4a1e-b5da-4f95e230bd4b': Permission denied"], "stdout": "", "stdout_lines": []}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO ] Stage: Clean up
. . .
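For completeness, the backup in step 1 was the usual engine-backup run on the old 4.3.11 manager, something along these lines (the file names here are just examples):

  # On the 4.3.11 RHV Manager:
  engine-backup --mode=backup \
      --file=/root/engine-backup-4.3.11.tar.gz \
      --log=/root/engine-backup-4.3.11.log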
We just tried it without the --restore. It failed the same way. I'll change this BZ title.

The command:

hosted-engine --deploy --config-append=/tmp/engine-answers.conf --generate-answer=/tmp/restore-engine-answers.conf

And the failure with a little bit of context around it:

. . .
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Restart fapolicyd service]
[ INFO ] skipping: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Copy configuration archive to storage]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["dd", "bs=20480", "count=1", "oflag=direct", "if=/var/tmp/localvm289fhjs6/919f7105-b327-4392-9258-c6bf42d2b637", "of=/rhev/data-center/mnt/blockSD/1b4cd36c-0bb0-49f6-8079-8372b9799969/images/7456ecca-c17a-4d54-9bdd-6bdc5ada5e8a/919f7105-b327-4392-9258-c6bf42d2b637"], "delta": "0:00:00.002318", "end": "2022-07-15 16:44:09.347352", "msg": "non-zero return code", "rc": 1, "start": "2022-07-15 16:44:09.345034", "stderr": "dd: failed to open '/var/tmp/localvm289fhjs6/919f7105-b327-4392-9258-c6bf42d2b637': Permission denied", "stderr_lines": ["dd: failed to open '/var/tmp/localvm289fhjs6/919f7105-b327-4392-9258-c6bf42d2b637': Permission denied"], "stdout": "", "stdout_lines": []}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO ] Stage: Clean up
[ INFO ] Cleaning temporary resources
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Execute just a specific set of steps]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Force facts gathering]
. . .
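For what it's worth, the read the task attempts can be checked by hand outside Ansible, starting with who owns the temporary file it is trying to copy:

  ls -ldn /var/tmp/localvm289fhjs6
  ls -ln /var/tmp/localvm289fhjs6/919f7105-b327-4392-9258-c6bf42d2b637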
Digging deeper: it looks like the Ansible task switches to user vdsm, but root owns the file(s) in question, and that leads to the permission problem. The host's umask is 0027 - and since a umask strips permission bits rather than granting them, 0027 removes all access for "other" users. I'll bet that's our problem...
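To illustrate what that umask does (a minimal sketch, assuming the deploy inherits the shell's 0027 umask and the copy task runs dd as vdsm, as the traceback suggests; the path below is made up):

  # As root, with the restrictive umask in effect:
  umask 0027
  mkdir -p /var/tmp/demo-localvm
  touch /var/tmp/demo-localvm/disk-image   # created root:root, mode 0640
  ls -l /var/tmp/demo-localvm/disk-image

  # vdsm is neither the owner nor in root's group, so the read fails
  # just like the Ansible task did:
  sudo -u vdsm dd if=/var/tmp/demo-localvm/disk-image of=/dev/null bs=512 count=1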
That was the problem. User root owned the file and its permission bits were 640, so user vdsm could not read it. The customer changed the umask to a more permissive value, retried the deploy, and it ran to completion. The root cause was an overly strict umask.
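For anyone who hits this before picking up the fixed packages, the workaround amounts to relaxing the umask in the shell that runs the deploy. The 0022 value below is my assumption of what "more permissive" meant here:

  umask 0022
  hosted-engine --deploy --config-append=/tmp/engine-answers.conf --generate-answer=/tmp/restore-engine-answers.conf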
Seems like a duplicate of bug 2089332, which was fixed in ovirt-ansible-collection-2.0.4-1. Can you please check the ovirt-ansible-collection version?
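For reference, the installed version can be checked on the host with a plain rpm query:

  rpm -q ovirt-ansible-collection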
Looks like it's an older version:

  ovirt-ansible-collection.noarch  2.0.3-1.el8ev

It's an offline installation from local repositories. But wouldn't a reposync of the SP1 RHV repositories grab the latest ovirt-ansible-collection?
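For reference, the sync on their side would look roughly like this (the repo id is a guess at whatever their local mirror uses for the RHV 4.4 host channel, so substitute the real one):

  dnf reposync --repoid=rhv-4-mgmt-agent-for-rhel-8-x86_64-rpms \
      --download-path=/srv/local-repos --newest-only
  createrepo_c /srv/local-repos/rhv-4-mgmt-agent-for-rhel-8-x86_64-rpms

As long as the newer collection has actually shipped to that channel when they sync, --newest-only should pull it in.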
Ah - I just checked the 4.4 SP1 Package Manifest on the download site at https://access.redhat.com/downloads/content/415/ver=4.4/rhel---8/4.4/x86_64/product-software

Looks like SP1 shipped with ovirt-ansible-collection-2.0.3-1.el8ev.noarch. I'll bet ovirt-ansible-collection-2.0.4-1.el8ev.noarch with the bugfix ships with batch 1, coming in a few days.

- Greg
(In reply to Greg Scott from comment #11)
> Ah - I just checked the 4.4 SP1 Package Manifest on the download site at
> https://access.redhat.com/downloads/content/415/ver=4.4/rhel---8/4.4/x86_64/product-software
>
> Looks like SP1 shipped with ovirt-ansible-collection-2.0.3-1.el8ev.noarch.
> I'll bet ovirt-ansible-collection-2.0.4-1.el8ev.noarch with the bugfix ships
> with batch 1, coming in a few days.
>
> - Greg

You are right, Greg:

RHV 4.4 SP1 contains ovirt-ansible-collection-2.0.3: https://errata.devel.redhat.com/advisory/84835
RHV 4.4 SP1 Batch 1 contains ovirt-ansible-collection-2.1.0: https://errata.devel.redhat.com/advisory/96101

So could you please retest with the latest RHV 4.4 SP1 Batch 1 packages?
> So could you please retest with the latest RHV 4.4 SP1 Batch 1 packages?

I'll ask the customer, but I'm not sure it will be possible. They modified their umask and moved forward. Let me see what we can do.
I'm closing it as a duplicate of bug 2089332; please reopen if this reappears.

*** This bug has been marked as a duplicate of bug 2089332 ***