Bug 1846683

Summary: overcloud deployment deploy (update) is failing due to missing file
Product: Red Hat OpenStack Reporter: David Hill <dhill>
Component: openstack-tripleo-heat-templates    Assignee: Rajesh Tailor <ratailor>
Status: CLOSED ERRATA QA Contact: Archit Modi <amodi>
Severity: low Docs Contact:
Priority: low    
Version: 13.0 (Queens)    CC: egallen, emacchi, jbeaudoi, mburns, owalsh, rlondhe, sawaghma, sbauza
Target Milestone: z13    Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.4.1-67.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1873160 (view as bug list) Environment:
Last Closed: 2020-10-28 18:23:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1873160    

Description David Hill 2020-06-12 16:35:11 UTC
Description of problem:
overcloud deployment deploy (update) is failing due to missing file and the following error is seen in the failures list:

~~~

            "OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'"
~~~


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
The permission/ownership change fails because a file is deleted while the change is being executed.

Expected results:
The deployment should not fail when a file disappears during the ownership walk.

Additional info:

Comment 6 Artom Lifshitz 2020-06-19 17:01:53 UTC
OK, managed to get a full traceback out of the sosreports, and though it's not 100% clear which task the error is coming from, I think the code/root cause is obvious enough:

 \"INFO:nova_statedir:Ownership of /var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-
434a-a58e-4468120aa98c already 42436:42436\", 
 \"stderr: Traceback (most recent call last):\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 169, in <module>\", 
 \" NovaStatedirOwnershipManager('/var/lib/nova').run()\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 159, in run\", 
 \" self._walk(self.statedir)\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 137, in _walk\", 
 \" self._walk(pathname)\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 132, in _walk\", 
 \" pathinfo = PathManager(pathname)\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 31, in __init__\", 
 \" self._update()\", 
 \" File \\"/docker-config-scripts/nova_statedir_ownership.py\\", line 34, in _update\", 
 \" statinfo = os.stat(self.path)\", 
 \"OSError: [Errno 2] No such file or directory: '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213'\"
 ]
}
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0a191486-091c-4d69-9701-997f51bdf431_playbook.retry

So it looks like we don't handle the case of a file disappearing between when we list the contents of /var/lib/nova and when we attempt to poke its permissions/SELinux labels.
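
For illustration, a guard along these lines around the stat call would let the ownership walk tolerate paths that vanish mid-walk. This is only a sketch under that assumption; safe_stat is an illustrative name, and the actual fix in nova_statedir_ownership.py may be structured differently:

~~~
import errno
import os


def safe_stat(path):
    """Return os.stat(path), or None if the path vanished under us.

    Illustrative sketch only; the shipped fix may differ.
    """
    try:
        return os.stat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # The file/dir was removed between listing and stat'ing; skip it.
            return None
        raise
~~~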

By the way, this is what I had to do to get the stack trace out of the sosreports:

$ zgrep -l '/var/lib/nova/triliovault-mounts/MTAuMjUzLjEzMC42NTovcHhlbmdjb3JlMDFfdHJpbGlvX3dzYzAx/.snapshot/daily.2020-05-30_0010/contego_tasks/snapshot_47674bc0-2472-434a-a58e-4468120aa98c/4b96867f-c149-4254-a27c-96dc6ead9109_1586511213' -R *
0400-rh_o1mwsc01.mgmt_2020-06-05.log
$

Open that file, search for 'OSError', find that it's inside the massive value (including literal '\n' escapes) of a 'deploy_stdout' field in a REST API response, and realize I need to process it as follows to make it readable:

$ echo -e `grep deploy_stdout 0400-rh_o1mwsc01.mgmt_2020-06-05.log` | less

I was then able to scroll through the output and isolate the paragraph pasted above. If the full traceback (instead of just the isolated error) had been included in the bug report, all of this could have been avoided.
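
If it helps anyone else digging through similar sosreports, the same unescaping can be done with a few lines of Python. This is just an illustrative equivalent of the echo -e trick above, not part of any tooling:

~~~
# Hypothetical helper: print the escaped deploy_stdout lines readably,
# turning the literal \n and \" escapes back into real newlines and quotes.
import sys

with open(sys.argv[1]) as f:
    for line in f:
        if "deploy_stdout" in line:
            print(line.replace('\\n', '\n').replace('\\"', '"'))
~~~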

Comment 8 David Hill 2020-08-24 19:34:52 UTC
It doesn't look like this can be backported easily to Queens, as I get a merge conflict when trying to cherry-pick ...

Comment 9 David Hill 2020-08-24 19:36:32 UTC
~~~
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: f8cb2e2c-54e5-4d9a-ad75-57cc0a9247f3
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[20]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "  File \"/docker-config-scripts/nova_statedir_ownership.py\", line 126, in _walk",
            "    for f in os.listdir(top):",
            "OSError: [Errno 2] No such file or directory: '/var/lib/nova/instances/b8d64704-d120-4f92-99b3-7b954056d8d2'"
        ]
    }
        to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0dba793d-7677-4127-9205-8a8b92081039_playbook.retry

    PLAY RECAP *********************************************************************
    localhost                  : ok=5    changed=2    unreachable=0    failed=1
~~~
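
This second traceback shows the same race hitting os.listdir() in _walk rather than os.stat(), so the fix presumably needs to tolerate ENOENT on the directory-listing side as well. A minimal sketch of that guard, assuming the same skip-on-ENOENT approach (illustrative only, not the actual patch):

~~~
import errno
import os


def safe_listdir(top):
    """List a directory, returning [] if it disappeared mid-walk.

    Illustrative only; the shipped fix may handle this differently.
    """
    try:
        return os.listdir(top)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Directory was removed between discovery and listing; skip it.
            return []
        raise
~~~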

Comment 11 Ollie Walsh 2020-08-26 21:20:05 UTC
(In reply to David Hill from comment #8)
> It doesn't look like this can be backported easily to Queens, as I get a
> merge conflict when trying to cherry-pick ...

Yeah, we don't have the podman SELinux relabel workaround on stable/queens (that release still uses docker), so it won't be a clean cherry-pick. I'll propose the patch now.

Comment 12 David Hill 2020-08-26 22:23:12 UTC
Thanks, many customers are hitting this issue by now, so it'll be great to have; right now we're all patching this manually.

Comment 31 errata-xmlrpc 2020-10-28 18:23:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388