Description of problem:
Overcloud deployment (update) is failing due to a missing file, and the following error is seen in the failures list:
~~~
"OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'"
~~~

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Failure to process a file because it was removed while the permission change was being executed.

Expected results:
Should not fail.

Additional info:
OK, managed to get a full traceback out of the sosreports, and though it's not 100% clear which task the error is coming from, I think the code/root cause is obvious enough:

~~~
INFO:nova_statedir:Ownership of /var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c already 42436:42436
stderr: Traceback (most recent call last):
  File "/docker-config-scripts/nova_statedir_ownership.py", line 169, in <module>
    NovaStatedirOwnershipManager('/var/lib/nova').run()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 159, in run
    self._walk(self.statedir)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 137, in _walk
    self._walk(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 132, in _walk
    pathinfo = PathManager(pathname)
  File "/docker-config-scripts/nova_statedir_ownership.py", line 31, in __init__
    self._update()
  File "/docker-config-scripts/nova_statedir_ownership.py", line 34, in _update
    statinfo = os.stat(self.path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'

to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0a191486-091c-4d69-9701-997f51bdf431_playbook.retry
~~~

So it looks like we don't handle the case of a file disappearing between when we list the contents of /var/lib/nova and when we attempt to poke its permissions/selinux labels.
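A minimal sketch of how that race could be tolerated: catch ENOENT in the stat and treat the path as gone. The class and file names below mirror nova_statedir_ownership.py, but this is an illustration of the approach, not the actual merged fix, and the attribute names are assumptions:

```python
import errno
import os
import stat


class PathManager(object):
    """Sketch of an ENOENT-tolerant PathManager; the real script's
    attributes may differ, this only illustrates the approach."""

    def __init__(self, path):
        self.path = path
        self.exists = True
        self._update()

    def _update(self):
        try:
            statinfo = os.stat(self.path)
        except OSError as e:
            # The path was deleted between listdir() and stat();
            # mark it gone instead of crashing the whole walk.
            if e.errno != errno.ENOENT:
                raise
            self.exists = False
            return
        self.is_dir = stat.S_ISDIR(statinfo.st_mode)
        self.uid = statinfo.st_uid
        self.gid = statinfo.st_gid
```

Callers in _walk would then check pathinfo.exists before attempting any chown/relabel on the entry.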
By the way, this is what I had to do to get the stack trace out of the sosreports:

~~~
$ zgrep -l '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213' -R *
0400-rh_o1mwsc01.mgmt_2020-06-05.log
$
~~~

Open that file, search for 'OSError', find that it's in the massive value (including literal '\n' escapes) of a 'deploy_stdout' field in a REST API response, and realize I need to process it as follows to make it readable:

~~~
$ echo -e `grep deploy_stdout 0400-rh_o1mwsc01.mgmt_2020-06-05.log` | less
~~~

I was then able to scroll through the output and isolate the paragraph I pasted above. If the full traceback (instead of just the isolated error) had been present in the bug report, all this could have been avoided.
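For reference, a rough Python equivalent of the `echo -e` trick above (render_escaped is a hypothetical helper, not part of any shipped tooling):

```python
def render_escaped(blob):
    """Undo the literal \\n / \\" escape sequences that the
    deploy_stdout value carries in the log, roughly the way
    `echo -e` expands them, yielding readable multi-line text."""
    return blob.encode('ascii', 'backslashreplace').decode('unicode_escape')
```

Feeding the grepped deploy_stdout line through this turns the one-line JSON-escaped blob back into a scrollable traceback.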
It doesn't look like this can be backported easily to Queens as I get a merge conflict when trying to cherry-pick ...
~~~
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: f8cb2e2c-54e5-4d9a-ad75-57cc0a9247f3
status: UPDATE_FAILED
status_reason: |
  Error: resources[20]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |
  ...
    "  File \"/docker-config-scripts/nova_statedir_ownership.py\", line 126, in _walk",
    "    for f in os.listdir(top):",
    "OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/b8d64704-d120-4f92-99b3-7b954056d8d2'"
  ]
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0dba793d-7677-4127-9205-8a8b92081039_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=5    changed=2    unreachable=0    failed=1
~~~
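This is the same class of race, but here it is os.listdir() itself that trips over a deleted instance directory. A walk that tolerates a directory vanishing mid-recursion could look like this (safe_walk is a hypothetical helper sketching the approach, not the shipped patch):

```python
import errno
import os


def safe_walk(top, visit):
    """Visit every entry under 'top', ignoring any directory that
    disappears while the walk is in progress."""
    try:
        entries = os.listdir(top)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        return  # the directory itself vanished; nothing to do
    for name in entries:
        pathname = os.path.join(top, name)
        # isdir() returns False for paths that have already vanished,
        # so a deleted entry simply falls through to visit().
        if os.path.isdir(pathname) and not os.path.islink(pathname):
            safe_walk(pathname, visit)
        visit(pathname)
```

The per-path fix-up passed in as visit (the chown/relabel step) still needs its own ENOENT handling, since the entry can also disappear between listdir() and the visit call.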
(In reply to David Hill from comment #8)
> It doesn't look like this can be backported easily to Queens as I get a
> merge conflict when trying to cherry-pick ...

Yeah, we don't have the podman selinux relabel workaround on stable/queens (as that was docker), so it won't be a clean cherry-pick - I'll propose the patch now.
Thanks, many customers are hitting this issue by now, so it'll be a great thing to have, as we're all manually patching this in the meantime.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4388