Description of problem:
Overcloud deployment (update) is failing due to a missing file, and the following error is seen in the failures list:
~~~
"OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'"
~~~

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Failure to process a file because it was removed while the permission change was being executed.

Expected results:
Should not fail.

Additional info:
OK, managed to get a full traceback out of the sosreports, and though it's not 100% clear which task the error is coming from, I think the code/root cause is obvious enough:

~~~
INFO:nova_statedir:Ownership of /var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c already 42436:42436
stderr: Traceback (most recent call last):
  File "/docker-config-scripts/nova_statedir_ownership.py", line 169, in <module>
    NovaStatedirOwnershipManager('/var/lib/nova').run()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 159, in run
    self._walk(self.statedir)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 137, in _walk
    self._walk(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 132, in _walk
    pathinfo = PathManager(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 31, in __init__
    self._update()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 34, in _update
    statinfo = os.stat(self.path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'

to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0a191486-091c-4d69-9701-997f51bdf431_playbook.retry
~~~

So it looks like we don't handle the case of a file disappearing between when we list the contents of /var/lib/nova and when we attempt to poke its permissions/selinux labels.
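A minimal sketch of how that race could be tolerated: catch ENOENT in the stat and treat the path as gone. The class and file names below mirror nova_statedir_ownership.py, but this is an illustration of the approach, not the actual merged fix, and the attribute names are assumptions:

```python
import errno
import os
import stat


class PathManager(object):
    """Sketch of an ENOENT-tolerant PathManager; the real script's
    attributes may differ, this only illustrates the approach."""

    def __init__(self, path):
        self.path = path
        self.exists = True
        self._update()

    def _update(self):
        try:
            statinfo = os.stat(self.path)
        except OSError as e:
            # The path was deleted between listdir() and stat();
            # mark it gone instead of crashing the whole walk.
            if e.errno != errno.ENOENT:
                raise
            self.exists = False
            return
        self.is_dir = stat.S_ISDIR(statinfo.st_mode)
        self.uid = statinfo.st_uid
        self.gid = statinfo.st_gid
```

Callers in _walk would then check pathinfo.exists before attempting any chown/relabel on the entry.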
By the way, this is what I had to do to get the stack trace out of the sosreports:

~~~
$ zgrep -l '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213' -R *
0400-rh_o1mwsc01.mgmt_2020-06-05.log
$
~~~

Open that file, search for 'OSError', find that it's in the massive value (including literal '\n' escapes) of a 'deploy_stdout' field in a REST API response, and realize I need to process it as follows to make it readable:

~~~
$ echo -e `grep deploy_stdout 0400-rh_o1mwsc01.mgmt_2020-06-05.log` | less
~~~

I was then able to scroll through the output and isolate the paragraph I pasted above. If the full traceback (instead of just the isolated error) had been present in the bug report, all this could have been avoided.
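For reference, a rough Python equivalent of the `echo -e` trick above (render_escaped is a hypothetical helper, not part of any shipped tooling):

```python
def render_escaped(blob):
    """Undo the literal \\n / \\" escape sequences that the
    deploy_stdout value carries in the log, roughly the way
    `echo -e` expands them, yielding readable multi-line text."""
    return blob.encode('ascii', 'backslashreplace').decode('unicode_escape')
```

Feeding the grepped deploy_stdout line through this turns the one-line JSON-escaped blob back into a scrollable traceback.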
It doesn't look like this can be backported easily to Queens as I get a merge conflict when trying to cherry-pick ...
~~~
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: f8cb2e2c-54e5-4d9a-ad75-57cc0a9247f3
status: UPDATE_FAILED
status_reason: |
  Error: resources[20]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |
  ...
    "  File \"/docker-config-scripts/nova_statedir_ownership.py\", line 126, in _walk",
    "    for f in os.listdir(top):",
    "OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/b8d64704-d120-4f92-99b3-7b954056d8d2'"
  ]
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0dba793d-7677-4127-9205-8a8b92081039_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=5    changed=2    unreachable=0    failed=1
~~~
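This is the same class of race, but here it is os.listdir() itself that trips over a deleted instance directory. A walk that tolerates a directory vanishing mid-recursion could look like this (safe_walk is a hypothetical helper sketching the approach, not the shipped patch):

```python
import errno
import os


def safe_walk(top, visit):
    """Visit every entry under 'top', ignoring any directory that
    disappears while the walk is in progress."""
    try:
        entries = os.listdir(top)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        return  # the directory itself vanished; nothing to do
    for name in entries:
        pathname = os.path.join(top, name)
        # isdir() returns False for paths that have already vanished,
        # so a deleted entry simply falls through to visit().
        if os.path.isdir(pathname) and not os.path.islink(pathname):
            safe_walk(pathname, visit)
        visit(pathname)
```

The per-path fix-up passed in as visit (the chown/relabel step) still needs its own ENOENT handling, since the entry can also disappear between listdir() and the visit call.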
(In reply to David Hill from comment #8)
> It doesn't look like this can be backported easily to Queens as I get a
> merge conflict when trying to cherry-pick ...

Yeah, we don't have the podman selinux relabel workaround on stable/queens (as that was docker), so it won't be a clean cherry-pick - I'll propose the patch now.
Thanks, many customers are hitting this issue by now, so it'll be a great thing to have, as we're all manually patching this in the meantime.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4388