Bug 1846683 - overcloud deployment deploy (update) is failing due to missing file
Summary: overcloud deployment deploy (update) is failing due to missing file
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: z13
Target Release: 13.0 (Queens)
Assignee: Rajesh Tailor
QA Contact: Archit Modi
URL:
Whiteboard:
Depends On:
Blocks: 1873160
Reported: 2020-06-12 16:35 UTC by David Hill
Modified: 2024-10-01 16:39 UTC
CC List: 8 users

Fixed In Version: openstack-tripleo-heat-templates-8.4.1-67.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1873160
Environment:
Last Closed: 2020-10-28 18:23:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 733906 0 None MERGED Avoid failing on deleted file 2021-02-16 06:36:42 UTC
OpenStack gerrit 748306 0 None MERGED Avoid failing on deleted file 2021-02-16 06:36:42 UTC
OpenStack gerrit 748307 0 None MERGED Avoid failing on deleted file 2021-02-16 06:36:43 UTC
OpenStack gerrit 748317 0 None MERGED Avoid failing on deleted file 2021-02-16 06:36:43 UTC
Red Hat Issue Tracker OSP-3673 0 None None None 2022-08-23 10:26:25 UTC
Red Hat Knowledge Base (Solution) 5345141 0 None None None 2020-08-24 19:28:31 UTC
Red Hat Product Errata RHBA-2020:4388 0 None None None 2020-10-28 18:24:09 UTC

Description David Hill 2020-06-12 16:35:11 UTC
Description of problem:
overcloud deployment deploy (update) is failing due to a missing file; the following error appears in the failures list:

~~~

            "OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'"
~~~


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
The deployment fails on a file because it was removed while the permission change was being executed.

Expected results:
Should not fail

Additional info:

Comment 6 Artom Lifshitz 2020-06-19 17:01:53 UTC
OK, I managed to get a full traceback out of the sosreports, and though it's not 100% clear which task the error is coming from, I think the code/root cause is obvious enough:

~~~
INFO:nova_statedir:Ownership of /var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c already 42436:42436
stderr: Traceback (most recent call last):
  File "/docker-config-scripts/nova_statedir_ownership.py", line 169, in <module>
    NovaStatedirOwnershipManager('/var/lib/nova').run()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 159, in run
    self._walk(self.statedir)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 137, in _walk
    self._walk(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 132, in _walk
    pathinfo = PathManager(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 31, in __init__
    self._update()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 34, in _update
    statinfo = os.stat(self.path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'

        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0a191486-091c-4d69-9701-997f51bdf431_playbook.retry
~~~

So it looks like we don't handle the case of a file disappearing between when we list the contents of /var/lib/nova and when we attempt to poke its permissions/selinux labels.
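The upstream fix linked above ("Avoid failing on deleted file") is essentially about tolerating ENOENT at that point. A minimal sketch of the idea, using illustrative names rather than the actual nova_statedir_ownership.py code:

```python
import errno
import os

def safe_stat(path):
    """Stat a path, returning None if it vanished after being listed.

    Entries under /var/lib/nova (e.g. snapshot mounts) can be deleted
    concurrently, so ENOENT here is expected and must not abort the walk.
    """
    try:
        return os.stat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return None  # entry disappeared between listdir and stat; skip it
        raise  # any other error is still fatal

def walk_statedir(top):
    """Recursively visit entries under top, skipping ones that vanish."""
    for name in os.listdir(top):
        pathname = os.path.join(top, name)
        if safe_stat(pathname) is None:
            continue
        # ... chown / selinux relabel would happen here ...
        if os.path.isdir(pathname) and not os.path.islink(pathname):
            walk_statedir(pathname)
```

With a guard like this, a file deleted mid-walk is simply skipped instead of crashing the whole deployment step.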

By the way, this is what I had to do to get the stack trace out of the sosreports:

$ zgrep -l '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213' -R *
0400-rh_o1mwsc01.mgmt_2020-06-05.log
$

Open that file, search for 'OSError', find that it's in the massive value (including literal '\n' escapes) of a 'deploy_stdout' field in a REST API response, and realize I need to process it as follows to make it readable:

$ echo -e `grep deploy_stdout 0400-rh_o1mwsc01.mgmt_2020-06-05.log` | less

I was then able to scroll through the output, and isolate the paragraph I pasted above. If the full traceback (instead of just the isolated error) had just been present in the bug report, all this could have been avoided.
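For what it's worth, the unescaping that `echo -e` does for those literal escape sequences can also be done in Python; a rough sketch, not part of any shipped tooling:

```python
import codecs

def unescape_deploy_stdout(raw):
    r"""Decode literal backslash escapes (\n, \", \\) in a deploy_stdout
    value into readable multi-line text, similar to `echo -e`."""
    return codecs.decode(raw, "unicode_escape")
```

Note that `unicode_escape` assumes mostly-ASCII input, which holds for these ansible logs.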

Comment 8 David Hill 2020-08-24 19:34:52 UTC
It doesn't look like this can be backported easily to Queens, as I get a merge conflict when trying to cherry-pick ...

Comment 9 David Hill 2020-08-24 19:36:32 UTC
~~~
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: f8cb2e2c-54e5-4d9a-ad75-57cc0a9247f3
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[20]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "  File \"/docker-config-scripts/nova_statedir_ownership.py\", line 126, in _walk",
            "    for f in os.listdir(top):",
            "OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/b8d64704-d120-4f92-99b3-7b954056d8d2'"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0dba793d-7677-4127-9205-8a8b92081039_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=5    changed=2    unreachable=0    failed=1
~~~
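This second traceback shows the same race at a different call site: here an instance directory was removed between being discovered and being listed, so os.listdir itself raised. Guarding that call follows the same pattern; a sketch with an illustrative name, not the shipped patch:

```python
import errno
import os

def safe_listdir(top):
    """List a directory, returning [] if it was deleted concurrently.

    Instance directories under /var/lib/nova/instances can disappear
    while the ownership walk is running, so ENOENT is non-fatal here.
    """
    try:
        return os.listdir(top)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return []  # directory vanished; nothing left to process
        raise
```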

Comment 11 Ollie Walsh 2020-08-26 21:20:05 UTC
(In reply to David Hill from comment #8)
> It doesn't look like this can be backported easily to Queens, as I get a
> merge conflict when trying to cherry-pick ...

Yeah, we don't have the podman selinux relabel workaround on stable/queens (Queens uses docker), so it won't be a clean cherry-pick. I'll propose the patch now.

Comment 12 David Hill 2020-08-26 22:23:12 UTC
Thanks. Many customers are hitting this issue by now, so it'll be great to have; we're all patching this manually at the moment.

Comment 31 errata-xmlrpc 2020-10-28 18:23:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388

