Bug 1930565 - Host upgrade failed in imgbased but RHVM shows upgrade successful
Summary: Host upgrade failed in imgbased but RHVM shows upgrade successful
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.5
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ovirt-4.4.6
: 4.4.6
Assignee: Dana
QA Contact: peyu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-19 07:35 UTC by peyu
Modified: 2021-06-01 13:23 UTC (History)
21 users (show)

Fixed In Version: ovirt-engine-4.4.6.4
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 13:22:12 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
engine logs from /var/log/ovirt-engine/host-deploy (225.87 KB, application/zip)
2021-02-20 03:53 UTC, peyu
no flags Details
/var/log/ovirt-eigine (19.48 MB, application/gzip)
2021-03-03 02:33 UTC, peyu
no flags Details
host logs from /var/log (6.75 MB, application/gzip)
2021-03-03 02:54 UTC, peyu
no flags Details
Part1 of the engine log /var/log (2.97 MB, application/zip)
2021-03-04 02:17 UTC, peyu
no flags Details
Part2 of engine log /var/log (19.49 MB, application/zip)
2021-03-04 02:26 UTC, peyu
no flags Details
logs of /var/log/ovirt-engine after replacing main.yml (7.38 MB, application/gzip)
2021-04-08 10:13 UTC, peyu
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2179 0 None None None 2021-06-01 13:23:04 UTC
oVirt gerrit 114190 0 master MERGED engine: add cluster version's value to ansible command config variables 2021-04-09 12:45:58 UTC

Description peyu 2021-02-19 07:35:54 UTC
Description of problem:
When the RHVH upgrade failed, RHVM did not capture the failure information and still showed that "Upgrade was successful and host will be rebooted.".

Version-Release number of selected component (if applicable):
RHVM: 4.4.5.5-0.13.el8ev
RHVH: redhat-virtualization-host-4.4.5-20210215.0.el8_3

How reproducible:
100%

Steps to Reproduce:
1. Install RHVH-4.4-20210202.0-RHVH-x86_64-dvd1.iso
2. Add the host to RHVM
3. Login to RHVH, setup local repos and point to "redhat-virtualization-host-4.4.5-20210215.0.el8_3"
4.  Create an LV with the LV name of latest build (so that imgbased will fail during the creation of LV)
   # lvcreate -V 2G --thin -n rhvh-4.4.5.3-0.20210215.0+1 {vg-name}/{pool-name}
   # lvs
5. Upgrade the host via RHVM
6. Focus on host status and /var/log/imgbased.log

Actual results:
1. The host upgrade failed in imgbased.log, which is what we expect. 
2. However, RHVM shows "Upgrade was successful and host will be rebooted." and then the status of the host becomes "NonResponsive".

Expected results:
When RHVH upgrade failed in imgbased, it should be reflected as an error event in the RHVM portal.

Additional info:
The error in /var/log/imgbased.log QE made is as follows:
~~~~
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/__main__.py", line 53, in <module>
    CliApplication()
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/__init__.py", line 82, in CliApplication
    app.hooks.emit("post-arg-parse", args)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/hooks.py", line 120, in emit
    cb(self.context, *args)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/plugins/update.py", line 75, in post_argparse
    six.reraise(*exc_info)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/plugins/update.py", line 66, in post_argparse
    base, _ = LiveimgExtractor(app.imgbase).extract(args.FILENAME)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/plugins/update.py", line 148, in extract
    "%s" % size, nvr)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/plugins/update.py", line 128, in add_base_with_tree
    new_layer_lv = self.imgbase.add_layer(new_base)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/imgbase.py", line 209, in add_layer
    new_lv = self._add_lvm_snapshot(prev_lv, new_layer.lv_name)
  File "/tmp/tmp.xIXTME6RBg/usr/lib/python3.6/site-packages/imgbased/imgbase.py", line 230, in _add_lvm_snapshot
    raise RuntimeError("Failed to create a new layer")
RuntimeError: Failed to create a new layer
~~~~

Comment 2 peyu 2021-02-20 03:53:44 UTC
Created attachment 1758335 [details]
engine logs from /var/log/ovirt-engine/host-deploy

Comment 4 peyu 2021-03-03 02:33:18 UTC
Created attachment 1760325 [details]
/var/log/ovirt-eigine

Comment 5 peyu 2021-03-03 02:54:09 UTC
Created attachment 1760327 [details]
host logs from /var/log

Comment 6 Yedidyah Bar David 2021-03-03 09:10:25 UTC
It seems like some tasks, including e.g. 'Delete yum_updates file from host', but also 'Prepare NGN host for upgrade', are called more than once. Dana, can you please check why?

Perhaps some things ran in parallel unintentionally, including checking for '/var/imgbased/.image-updated' before 'yum update' finished. Not sure.

Comment 7 Yedidyah Bar David 2021-03-03 09:10:57 UTC
Please attach all of /var/log from the engine. Thanks.

Comment 8 peyu 2021-03-04 02:17:52 UTC
Created attachment 1760564 [details]
Part1 of the engine log /var/log

The compressed size of /var/log exceeds the upper limit of bugzilla, so I divided /var/log into two parts, Part1 does not contain /var/log/ovirt-engine, and /var/log/ovirt-engine is in Part2

Comment 9 peyu 2021-03-04 02:26:21 UTC
Created attachment 1760565 [details]
Part2 of engine log /var/log

Comment 10 Yedidyah Bar David 2021-03-04 06:02:50 UTC
Thanks! Restoring needinfo on Dana.

Comment 11 Dana 2021-03-04 08:52:15 UTC
This issue has an open bug and is currently under investigation
https://bugzilla.redhat.com/show_bug.cgi?id=1917707

Comment 12 Yedidyah Bar David 2021-03-04 10:10:59 UTC
Thanks. Closing as duplicate for now. Please reopen if needed.

*** This bug has been marked as a duplicate of bug 1917707 ***

Comment 13 Dana 2021-04-04 10:37:51 UTC
reopening as bug 1917707 seems to be caused an ansible-runner issue, which causes events to appear several times in the audit log

Comment 14 Yedidyah Bar David 2021-04-07 09:12:41 UTC
In my previous investigation (leading to comment 6), it seemed like imgbased and the %post of redhat-virtualization-host-image-update worked as expected, and the bug (not noticing that they failed) happened in the ansible code calling them. So moving to Dana for now.

Comment 15 Dana 2021-04-07 09:37:53 UTC
While looking at host deploy log I noticed 2 things:

1. Whereas audit log shows some tasks several times (comment 13), in host upgrade log, tasks 'Check if image-updated file exists', ' Verify image was updated successfully', 'Configure LVM filter' are listed twice, and the results of 'Configure LVM filter' indicate that these are separate runs:
first one output contains: "stdout" : "ok: [10.73.73.91]"
second run output contains: "stdout" : "Analyzing host...\nLVM filter is already configured for Vdsm"

2. The task 'Verify image was updated successfully' (https://github.com/oVirt/ovirt-engine/blob/a65cf0eae8858ab2278c3f537dc427e3ff20eba7/packaging/ansible-runner-service-project/project/roles/ovirt-host-upgrade/tasks/main.yml#L64) which was expected to fail was skipped, although, in ansible-runner-service log, the values of those params are (can be also seen in host upgrade log):  
node_host: [2021-02-20 19:36:52,487 ] {'node_host': True}
image_pkg_updated: [2021-02-20 19:38:37,088] {'image_pkg_updated': True}
image_updated_file: [2021-02-20 19:38:37,806] 'res': {'changed': False, 'stat': {'exists': False}, 'invocation': {'module_args': {'path': '/var/imgbased/.image-updated', 
host_deploy_cluster_version was reported to be 4.5

I'm looking more in Ansible Runner to see what might cause [1]. 
Pengshan, is it possible to re-run with a file that should be replaced in the engine so that I can see whether there was a change in any of these params?

Comment 16 peyu 2021-04-08 01:55:47 UTC
(In reply to Dana from comment #15)
> While looking at host deploy log I noticed 2 things:
> 
> 1. Whereas audit log shows some tasks several times (comment 13), in host
> upgrade log, tasks 'Check if image-updated file exists', ' Verify image was
> updated successfully', 'Configure LVM filter' are listed twice, and the
> results of 'Configure LVM filter' indicate that these are separate runs:
> first one output contains: "stdout" : "ok: [10.73.73.91]"
> second run output contains: "stdout" : "Analyzing host...\nLVM filter is
> already configured for Vdsm"
> 
> 2. The task 'Verify image was updated successfully'
> (https://github.com/oVirt/ovirt-engine/blob/
> a65cf0eae8858ab2278c3f537dc427e3ff20eba7/packaging/ansible-runner-service-
> project/project/roles/ovirt-host-upgrade/tasks/main.yml#L64) which was
> expected to fail was skipped, although, in ansible-runner-service log, the
> values of those params are (can be also seen in host upgrade log):  
> node_host: [2021-02-20 19:36:52,487 ] {'node_host': True}
> image_pkg_updated: [2021-02-20 19:38:37,088] {'image_pkg_updated': True}
> image_updated_file: [2021-02-20 19:38:37,806] 'res': {'changed': False,
> 'stat': {'exists': False}, 'invocation': {'module_args': {'path':
> '/var/imgbased/.image-updated', 
> host_deploy_cluster_version was reported to be 4.5
> 
> I'm looking more in Ansible Runner to see what might cause [1]. 
> Pengshan, is it possible to re-run with a file that should be replaced in
> the engine so that I can see whether there was a change in any of these
> params?

Yes, I can re-run, please tell me how to replace the file.

Comment 17 peyu 2021-04-08 10:13:03 UTC
Created attachment 1770191 [details]
logs of /var/log/ovirt-engine after replacing main.yml

logs of /var/log/ovirt-engine after replacing /usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-upgrade/tasks/main.yml

Comment 19 peyu 2021-04-20 04:30:59 UTC
Pending on the new buid of ovirt-engine-4.4.6.4 to verify this bug.

Comment 20 peyu 2021-04-27 04:00:04 UTC
QE verified this bug.

Verified version:
RHVM: 4.4.6.5-0.17.el8ev
RHVH: redhat-virtualization-host-4.4.6-20210426.0.el8_4

Steps to Reproduce:
1. Install redhat-virtualization-host-4.4.5-20210330.0.el8_3
2. Add the host to RHVM
3. Login to RHVH, setup local repos and point to "redhat-virtualization-host-4.4.6-20210426.0.el8_4"
4.  Create an LV with the LV name of latest build (so that imgbased will fail during the creation of LV)
   # lvcreate -V 2G --thin -n rhvh-4.4.6.1-0.20210426.0+1 {vg-name}/{pool-name}
   # lvs
5. Upgrade the host via RHVM
6. Focus on host status and /var/log/imgbased.log

Actual results:
1. Host upgrade failed in imgbased.log, which is what we expect. 
2. Host status changed to "InstallFailed" on RHVM. But in "Events", there is no related error message of "Failed to upgrade Host xxx".

Comment 21 peyu 2021-04-27 04:47:14 UTC
Will “Actual results 2” affect the functionality?

Comment 22 Dana 2021-04-27 07:13:50 UTC
1. what do you see in 'events'?
2. in 'hosts'- what is the host's status? does it show 'installation failed'?

Comment 23 peyu 2021-04-27 08:15:26 UTC
Sorry, I made a mistake on Comment 20.
Verified this bug again.

Verified version:
RHVM: 4.4.6.5-0.17.el8ev
RHVH: redhat-virtualization-host-4.4.6-20210426.0.el8_4

Steps to Reproduce:
1. Install redhat-virtualization-host-4.4.5-20210330.0.el8_3
2. Add the host to RHVM
3. Login to RHVH, setup local repos and point to "redhat-virtualization-host-4.4.6-20210426.0.el8_4"
4.  Create an LV with the LV name of latest build (so that imgbased will fail during the creation of LV)
   # lvcreate -V 2G --thin -n rhvh-4.4.6.1-0.20210426.0+1 {vg-name}/{pool-name}
   # lvs
5. Upgrade the host via RHVM
6. Focus on host status and /var/log/imgbased.log

Actual results:
1. Host upgrade failed in imgbased.log, which is what we expect. 
2. Host status changed to "InstallFailed" on RHVM. 
3. In "Events", there is a related error message of "Failed to upgrade Host 0427-03 (User: admin@internal-authz).".

So move bug status to "VERIFIED".

Comment 27 errata-xmlrpc 2021-06-01 13:22:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager security update (ovirt-engine) [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2179


Note You need to log in before you can comment on or make changes to this bug.