Bug 2181772 - Upgrading RHV-H from 4.4.10 to 4.5.3-202302150956_8.6 breaks with an obscure vdsm error, which breaks subsequent re-installations.
Summary: Upgrading RHV-H from 4.4.10 to 4.5.3-202302150956_8.6 breaks with an obscure ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-host
Version: 4.5.3
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Asaf Rachmani
QA Contact: peyu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-25 20:08 UTC by Greg Scott
Modified: 2023-05-18 10:57 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-18 10:57:54 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:



Description Greg Scott 2023-03-25 20:08:06 UTC
Description of problem:
Upgrading RHV-H from 4.4.10 to 4.5.3-202302150956_8.6 breaks with an obscure vdsm error, which breaks subsequent re-installations

Version-Release number of selected component (if applicable):
4.5.3

How reproducible:
At will

Steps to Reproduce:
1. Start with a working 4.4-SP1 RHVM and RHV-H 4.4 hosts.
2. From the RHVM GUI, graphically update a RHV-H host.
3. RHVM reports an upgrade failure.
4. Reboot the host.
5. RHVM incorrectly reports the host is up to date.
6. From the host, try dnf reinstall redhat-virtualization-host-image-update.
7. The post-installation script fails, but dnf still reports a successful upgrade.

Actual results:
After several minutes, the RHVM GUI reports an upgrade failure. The host goes non-responsive. On the host, "nodectl check" shows a VDSM problem; further digging points to the sebool VDSM module. The vdsmd service no longer starts. Fix the VDSM problem (see below) and reboot the host. RHVM then incorrectly shows the host as up to date. The failed upgrade also leaves behind a stale LV and two loop mounts on the host. The stale LV breaks subsequent dnf reinstall attempts from an ssh session on the host, but dnf reports a successful upgrade anyway.

Expected results:
A host upgrade from the GUI should not break VDSM. Upgrade failures should not leave behind artifacts that break subsequent re-installations. And if an upgrade fails, neither dnf nor RHVM should report success.

Additional info:

On the RHV-H host, work around the VDSM problem like this:

[root@rhva2020 tmp]# semodule -i /usr/share/selinux/packages/ovirt-vmconsole/ovirt_vmconsole.pp
[root@rhva2020 tmp]# vdsm-tool configure --module sebool
[root@rhva2020 tmp]# systemctl start vdsmd.service

Clean up the stale LV like this:

[root@rhva2020 tmp]# lvremove rhvh/rhvh-4.5.3.4-0.20230215.0+1
Do you really want to remove active logical volume rhvh/rhvh-4.5.3.4-0.20230215.0+1? [y/n]: y
  Logical volume "rhvh-4.5.3.4-0.20230215.0+1" successfully removed.
[root@rhva2020 tmp]#

And get rid of the loop mounts like this:

[root@rhvb2020 tmp]# df -h | grep loop
/dev/loop1                                       3.9G  3.3G  455M  88% /tmp/tmp.nxHntHklwn
[root@rhva2020 tmp]# umount /dev/loop1
[root@rhva2020 tmp]#
Run umount twice, because the failed upgrade leaves two loop mounts, and then remove the temporary directories the loops were mounted on (a consolidated sketch follows).
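As a convenience, here is a minimal cleanup sketch under the assumption that the only /dev/loopN mounts on the host are the leftovers from the failed upgrade (the mount-point names are whatever mktemp generated, so they will differ per host):

# Unmount every leftover loop mount and remove its temporary mount directory.
for mp in $(df --output=source,target | awk '/^\/dev\/loop/ {print $2}'); do
    umount "$mp"
    rmdir "$mp"
done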

********************************************************

Here is output from a failed upgrade with the recovery from above to work around the problem.

[root@rhva2020 log]# date
Sat Mar 25 17:46:43 UTC 2023
[root@rhva2020 log]# # a minute after starting the upgrade
[root@rhva2020 log]# nodectl check
Status: OK
Bootloader ... OK
  Layer boot entries ... OK
  Valid boot entries ... OK
Mount points ... OK
  Separate /var ... OK
  Discard is used ... OK
Basic storage ... OK
  Initialized VG ... OK
  Initialized Thin Pool ... OK
  Initialized LVs ... OK
Thin storage ... OK
  Checking available space in thinpool ... OK
  Checking thinpool auto-extend ... OK
vdsmd ... OK
[root@rhva2020 log]#
[root@rhva2020 log]#
[root@rhva2020 log]# date
Sat Mar 25 17:52:09 UTC 2023
[root@rhva2020 log]# nodectl check
Status: OK
Bootloader ... OK
  Layer boot entries ... OK
  Valid boot entries ... OK
Mount points ... OK
  Separate /var ... OK
  Discard is used ... OK
Basic storage ... OK
  Initialized VG ... OK
  Initialized Thin Pool ... OK
  Initialized LVs ... OK
Thin storage ... OK
  Checking available space in thinpool ... OK
  Checking thinpool auto-extend ... OK
vdsmd ... OK
[root@rhva2020 log]# lvs
  LV                           VG   Attr       LSize   Pool   Origin                     Data%  Meta%  Move Log Cpy%Sync Convert
  home                         rhvh Vwi-aotz--   1.00g pool00                            1.04                                      
  pool00                       rhvh twi-aotz-- 162.05g                                   10.30  2.52                                
  rhvh-4.4.10.1-0.20220208.0   rhvh Vri---tz-k 125.05g pool00                                                                      
  rhvh-4.4.10.1-0.20220208.0+1 rhvh Vwi-aotz-- 125.05g pool00 rhvh-4.4.10.1-0.20220208.0 3.46                                      
  rhvh-4.4.3.2-0.20201210.0    rhvh Vri---tz-k 125.05g pool00                                                                      
  rhvh-4.4.3.2-0.20201210.0+1  rhvh Vwi-a-tz-- 125.05g pool00 rhvh-4.4.3.2-0.20201210.0  3.01                                      
  rhvh-4.5.3.4-0.20230215.0    rhvh Vri-a-tz-k 125.05g pool00                            2.62                                      
  rhvh-4.5.3.4-0.20230215.0+1  rhvh Vwi-aotz-- 125.05g pool00 rhvh-4.5.3.4-0.20230215.0  2.62                                      
  root                         rhvh Vri---tz-k 125.05g pool00                                                                      
  swap                         rhvh -wi-ao---- <23.29g                                                                              
  tmp                          rhvh Vwi-aotz--   1.00g pool00                            3.49                                      
  var                          rhvh Vwi-aotz--  15.00g pool00                            13.48                                      
  var_crash                    rhvh Vwi-aotz--  10.00g pool00                            0.11                                      
  var_log                      rhvh Vwi-aotz--   8.00g pool00                            4.22                                      
  var_log_audit                rhvh Vwi-aotz--   2.00g pool00                            2.64                                      
[root@rhva2020 log]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
devtmpfs                                          32G     0   32G   0% /dev
tmpfs                                             32G   16K   32G   1% /dev/shm
tmpfs                                             32G  914M   31G   3% /run
tmpfs                                             32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/rhvh-rhvh--4.4.10.1--0.20220208.0+1  125G  5.1G  120G   5% /
/dev/mapper/rhvh-tmp                            1014M   40M  975M   4% /tmp
/dev/mapper/rhvh-home                           1014M   40M  975M   4% /home
/dev/mapper/rhvh-var                              15G  2.1G   13G  14% /var
/dev/sda1                                        976M  437M  473M  49% /boot
/dev/mapper/rhvh-var_log                         8.0G  377M  7.7G   5% /var/log
/dev/mapper/rhvh-var_crash                        10G  105M  9.9G   2% /var/crash
/dev/mapper/rhvh-var_log_audit                   2.0G   88M  2.0G   5% /var/log/audit
tmpfs                                            6.3G     0  6.3G   0% /run/user/0
/dev/loop1                                       3.9G  3.3G  455M  88% /tmp/mnt.hFITg
/dev/loop0                                       1.1G  1.1G     0 100% /tmp/mnt.JlHR2
/dev/mapper/rhvh-rhvh--4.5.3.4--0.20230215.0+1   125G  4.2G  121G   4% /tmp/mnt.MJSQC

****************

After the RHVM GUI reported a failed upgrade:
- Note the leftover LV and loop mounts.
- VDSM is broken.

[root@rhva2020 log]#
[root@rhva2020 log]# lvs
  LV                           VG   Attr       LSize   Pool   Origin                     Data%  Meta%  Move Log Cpy%Sync Convert
  home                         rhvh Vwi-aotz--   1.00g pool00                            1.04                                      
  pool00                       rhvh twi-aotz-- 162.05g                                   10.26  2.52                                
  rhvh-4.4.10.1-0.20220208.0   rhvh Vri---tz-k 125.05g pool00                                                                      
  rhvh-4.4.10.1-0.20220208.0+1 rhvh Vwi-aotz-- 125.05g pool00 rhvh-4.4.10.1-0.20220208.0 3.46                                      
  rhvh-4.4.3.2-0.20201210.0    rhvh Vri---tz-k 125.05g pool00                                                                      
  rhvh-4.4.3.2-0.20201210.0+1  rhvh Vwi-a-tz-- 125.05g pool00 rhvh-4.4.3.2-0.20201210.0  3.01                                      
  rhvh-4.5.3.4-0.20230215.0+1  rhvh Vwi-a-tz-- 125.05g pool00                            2.62                                      
  root                         rhvh Vri---tz-k 125.05g pool00                                                                      
  swap                         rhvh -wi-ao---- <23.29g                                                                              
  tmp                          rhvh Vwi-aotz--   1.00g pool00                            3.50                                      
  var                          rhvh Vwi-aotz--  15.00g pool00                            13.55                                      
  var_crash                    rhvh Vwi-aotz--  10.00g pool00                            0.11                                      
  var_log                      rhvh Vwi-aotz--   8.00g pool00                            4.22                                      
  var_log_audit                rhvh Vwi-aotz--   2.00g pool00                            2.65                                      
[root@rhva2020 log]# nodectl check
Status: WARN
Bootloader ... OK
  Layer boot entries ... OK
  Valid boot entries ... OK
Mount points ... OK
  Separate /var ... OK
  Discard is used ... OK
Basic storage ... OK
  Initialized VG ... OK
  Initialized Thin Pool ... OK
  Initialized LVs ... OK
Thin storage ... OK
  Checking available space in thinpool ... OK
  Checking thinpool auto-extend ... OK
vdsmd ... BAD
[root@rhva2020 log]#
[root@rhva2020 log]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
devtmpfs                                          32G     0   32G   0% /dev
tmpfs                                             32G   16K   32G   1% /dev/shm
tmpfs                                             32G  914M   31G   3% /run
tmpfs                                             32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/rhvh-rhvh--4.4.10.1--0.20220208.0+1  125G  5.1G  120G   5% /
/dev/mapper/rhvh-tmp                            1014M   40M  975M   4% /tmp
/dev/mapper/rhvh-home                           1014M   40M  975M   4% /home
/dev/mapper/rhvh-var                              15G  2.1G   13G  14% /var
/dev/sda1                                        976M  437M  473M  49% /boot
/dev/mapper/rhvh-var_log                         8.0G  375M  7.7G   5% /var/log
/dev/mapper/rhvh-var_crash                        10G  105M  9.9G   2% /var/crash
/dev/mapper/rhvh-var_log_audit                   2.0G   87M  2.0G   5% /var/log/audit
tmpfs                                            6.3G     0  6.3G   0% /run/user/0
/dev/loop1                                       3.9G  3.3G  455M  88% /tmp/tmp.f5Bwzc2F6q
[root@rhva2020 log]# cd /tmp
[root@rhva2020 tmp]# umount /dev/loop1
[root@rhva2020 tmp]# df -h
Filesystem                                       Size  Used Avail Use% Mounted on
devtmpfs                                          32G     0   32G   0% /dev
tmpfs                                             32G   16K   32G   1% /dev/shm
tmpfs                                             32G  914M   31G   3% /run
tmpfs                                             32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/rhvh-rhvh--4.4.10.1--0.20220208.0+1  125G  5.1G  120G   5% /
/dev/mapper/rhvh-tmp                            1014M   40M  975M   4% /tmp
/dev/mapper/rhvh-home                           1014M   40M  975M   4% /home
/dev/mapper/rhvh-var                              15G  2.1G   13G  14% /var
/dev/sda1                                        976M  437M  473M  49% /boot
/dev/mapper/rhvh-var_log                         8.0G  375M  7.7G   5% /var/log
/dev/mapper/rhvh-var_crash                        10G  105M  9.9G   2% /var/crash
/dev/mapper/rhvh-var_log_audit                   2.0G   87M  2.0G   5% /var/log/audit
tmpfs                                            6.3G     0  6.3G   0% /run/user/0
/dev/loop0                                       1.1G  1.1G     0 100% /tmp/tmp.f5Bwzc2F6q
[root@rhva2020 tmp]# umount /dev/loop0
[root@rhva2020 tmp]# lvremove rhvh/rhvh-4.5.3.4-0.20230215.0+1
Do you really want to remove active logical volume rhvh/rhvh-4.5.3.4-0.20230215.0+1? [y/n]: y
  Logical volume "rhvh-4.5.3.4-0.20230215.0+1" successfully removed.
[root@rhva2020 tmp]#
[root@rhva2020 tmp]# semodule -i /usr/share/selinux/packages/ovirt-vmconsole/ovirt_vmconsole.pp
[root@rhva2020 tmp]# vdsm-tool configure --module sebool

Checking configuration status...


Running configure...

Done configuring modules to VDSM.
[root@rhva2020 tmp]#
[root@rhva2020 tmp]# systemctl start vdsmd.service
[root@rhva2020 tmp]# nodectl check
Status: OK
Bootloader ... OK
  Layer boot entries ... OK
  Valid boot entries ... OK
Mount points ... OK
  Separate /var ... OK
  Discard is used ... OK
Basic storage ... OK
  Initialized VG ... OK
  Initialized Thin Pool ... OK
  Initialized LVs ... OK
Thin storage ... OK
  Checking available space in thinpool ... OK
  Checking thinpool auto-extend ... OK
vdsmd ... OK
[root@rhva2020 tmp]# yum reinstall redhat-virtualization-host-image-update
Updating Subscription Management repositories.
Last metadata expiration check: 0:19:39 ago on Sat 25 Mar 2023 05:47:42 PM UTC.
Dependencies resolved.
====================================================================================================================================
 Package                                       Architecture Version                       Repository                           Size
====================================================================================================================================
Reinstalling:
 redhat-virtualization-host-image-update       x86_64       4.5.3-202302150956_8.6        rhvh-4-for-rhel-8-x86_64-rpms       1.0 G

Transaction Summary
====================================================================================================================================

Total size: 1.0 G
Installed size: 1.0 G
Is this ok [y/N]: y
Downloading Packages:
[SKIPPED] redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64.rpm: Already downloaded
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                            1/1
  Running scriptlet: redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      1/2
  Reinstalling     : redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      1/2
  Running scriptlet: redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      1/2
  Cleanup          : redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      2/2
  Verifying        : redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      1/2
  Verifying        : redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64                                      2/2
Installed products updated.
Unpersisting: redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64.rpm

Reinstalled:
  redhat-virtualization-host-image-update-4.5.3-202302150956_8.6.x86_64

Complete!
[root@rhva2020 tmp]# reboot
login as: root
root@10.10.10.21's password:
Web console: https://rhva2020.mgmt.local:9090/ or https://10.10.10.21:9090/

Last failed login: Sat Mar 25 18:21:22 UTC 2023 from 10.10.10.115 on ssh:notty
There was 1 failed login attempt since the last successful login.
Last login: Sat Mar 25 17:55:36 2023 from 10.10.10.20

  node status: OK
  See `nodectl check` for more information

Admin Console: https://10.10.11.21:9090/ or https://10.10.10.21:9090/

[root@rhva2020 ~]# more /etc/redhat-release
Red Hat Enterprise Linux release 8.6
[root@rhva2020 ~]#

Comment 3 Greg Scott 2023-03-25 21:14:27 UTC
A question came up about my selinux settings. My RHV-H hosts have been set to enforcing for years.

[root@rhvb2020 ~]# cat /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted


[root@rhvb2020 ~]#

Comment 4 Michal Skrivanek 2023-03-29 16:07:25 UTC
2023-03-25 17:53:13,520 [DEBUG] (MainThread) Calling: (['restorecon', '-Rv', '/var/tmp/'],) {'close_fds': True, 'stderr': -2}
2023-03-25 17:53:13,560 [DEBUG] (MainThread) Exception! b'restorecon: Could not set context for /var/tmp/insights-client:  Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-archive-406barzz:  Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-archive-406barzz/insights-rhva2020.mgmt.local-20230325010202.tar.gz:  Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-client-egg-release:  Invalid argument\n'

I wonder if those files are just temporary files that aren't supposed to be there. Probably.
It could be that they're present on all connected systems; I don't think we test that at all.
It sounds like an Insights problem really (or SELinux policies that lack the correct rule for these Insights files).
I don't think we can do much about that for 4.4.10. We should perhaps check a connected-system upgrade on a more recent version.
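For anyone who wants to confirm whether the policy actually has a rule for these files, something along these lines should show the current versus expected contexts (a diagnostic sketch using standard SELinux tools, not part of the upgrade flow; the paths come from the error above):

# Show the labels currently on the offending files...
ls -lZ /var/tmp/insights-client/
# ...and what the loaded policy thinks they should be.
matchpathcon /var/tmp/insights-client /var/tmp/insights-client/insights-client-egg-release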

Comment 5 Greg Scott 2023-03-29 17:31:52 UTC
> I don't think we can do much about that for 4.4.10. We should check a
> connected system upgrade in more recent version perhaps.

But this isn't a 4.4.10 issue - that RPM with the scripts that run the upgrade is part of 4.5.1. The update sees something it doesn't expect and just fails. An updated 4.5.1 z-stream could fix that, right?

Comment 6 Greg Scott 2023-03-29 17:34:13 UTC
Aw nuts, you can't edit comments. I should have said 4.5.3, not 4.5.1.

Comment 7 Michal Skrivanek 2023-03-30 08:09:26 UTC
(In reply to Greg Scott from comment #5)
> > I don't think we can do much about that for 4.4.10. We should check a
> > connected system upgrade in more recent version perhaps.
> 
> But this isn't a 4.4.10 issue - that RPM with the scripts that run the
> upgrade is part of 4.5.1. The update sees something it doesn't expect and
> just fails. An updated 4.5.1 z-stream could fix that, right?

The script that runs is from the new layer, yes, but it's a blanket restorecon over the whole of /var/tmp. The problem is in the old layer's Insights files; they're likely missing the right rule, and that makes restorecon explode. I don't know if ignoring the error code is the best way forward, as it may just hide issues. I mean... ok, "rm -rf /var/tmp/insights-client" would probably be ok...
Can you give it a check/try, if you have other hosts to upgrade?
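If it helps, a quick check along these lines on another host before retrying the GUI upgrade should tell you whether it would hit the same failure (a sketch; restorecon here only resets labels under /var/tmp, which is what the upgrade script does anyway):

# A non-zero exit status here suggests the upgrade's restorecon step would fail too.
restorecon -Rv /var/tmp/
echo "restorecon exit status: $?"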

Comment 9 Greg Scott 2023-03-30 13:36:23 UTC
> Can you give it a check/try, if you have other hosts to upgrade?

I already did both of my hosts. I have a junk one I haven't powered on in several months. I'll check to see what's on it.

Ya know - if that's the root problem - restorecon explodes because of bogus files in /var/tmp - then it seems okay to report the error with a reasonable error message and fail, without telling the world that it applied the update. Or, if dnf returns success anyway even when a script inside fails, then say something in the error message about running dnf reinstall {packagename} from the host. And clean up the leftovers from the failure, so the reinstall doesn't also blow up.

Comment 10 peyu 2023-03-31 06:30:08 UTC
QE tried to reproduce this bug, but it was not reproduced.

Test version:
RHVM: 4.5.2.4-0.1.el8ev
RHVH: Upgrade RHVH from rhvh-4.4.10.1-0.20220208.0+1 to rhvh-4.5.3.4-0.20230215.0+1

Test steps:
1. Install RHVH-4.4-20220208.0-RHVH-x86_64-dvd1.iso
2. Login to the host, set up local repo and point to "redhat-virtualization-host-4.5.3-202302150956_8.6"
3. Add host to RHVM
4. Upgrade host via RHVM GUI
5. Focus on the host status after upgrade

Test results:
The RHVH upgrade is successful, and the status of the host in RHVM is "Up".


Additional info:
~~~~~~
# imgbase w
You are on rhvh-4.5.3.4-0.20230215.0+1

# imgbase layout
rhvh-4.4.10.1-0.20220208.0
 +- rhvh-4.4.10.1-0.20220208.0+1
rhvh-4.5.3.4-0.20230215.0
 +- rhvh-4.5.3.4-0.20230215.0+1

vdsmd is active after upgrade.
# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-03-31 04:56:26 UTC; 1h 19min ago
  Process: 5565 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 8201 (vdsmd)
    Tasks: 42 (limit: 820699)
   Memory: 159.5M
   CGroup: /system.slice/vdsmd.service
           └─8201 /usr/bin/python3 /usr/libexec/vdsm/vdsmd

Mar 31 04:56:24 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running prepare_transient_repository
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running syslog_available
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running nwfilter
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running dummybr
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running tune_system
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running test_space
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running test_lo
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com systemd[1]: Started Virtual Desktop Server Manager.
Mar 31 04:56:29 dell-per7425-03.lab.eng.pek2.redhat.com vdsm[8201]: WARN MOM not available. Error: [Errno 2] No such file or directo>
Mar 31 04:56:29 dell-per7425-03.lab.eng.pek2.redhat.com vdsm[8201]: WARN MOM not available, KSM stats will be missing. Error:
~~~~~~

Comment 11 Greg Scott 2023-03-31 22:17:35 UTC
Which leads to the million-dollar question: what's different about my hosts vs. the QE ones? I had forgotten this earlier - my 398-day certificates expired again on March 24. I renewed my RHVM certificates with engine-setup and upgraded my RHVM to the latest at the same time. Here is my RHVM version - a little bit newer than the QE one.

[root@rhvm2020 ~]#
[root@rhvm2020 ~]# rpm -qa | grep rhvm-4
rhvm-4.5.3.7-1.el8ev.noarch
[root@rhvm2020 ~]#

After RHVM came back alive, I used the RHVM GUI to renew my 4.4 host certificates. They all came back to green. And then I upgraded them.

I don't remember enrolling either of my RHV-H hosts with Insights, but I forget lots of things. This is after the upgrade, but the upgrade process does an rsync from the old to new layers, so this insights-client directory might be relevant. Per @Michal's analysis above, maybe that's the difference.


[root@rhva2020 tmp]# pwd
/var/tmp
[root@rhva2020 tmp]# ls
abrt             systemd-private-88212f5c517643ccae0c7efb9521cc08-chronyd.service-Vc8Hnm
insights-client  systemd-private-88212f5c517643ccae0c7efb9521cc08-systemd-resolved.service-Sz75pt
[root@rhva2020 tmp]#
[root@rhva2020 tmp]# ls -al -R insights-client/
insights-client/:
total 4
drwx------. 3 root root  74 Mar 25 01:03 .
drwxrwxrwt. 6 root root 208 Mar 25 18:15 ..
drwx------. 2 root root  64 Mar 25 01:03 insights-archive-406barzz
-rw-r--r--. 1 root root   8 Mar 25 18:10 insights-client-egg-release

insights-client/insights-archive-406barzz:
total 348
drwx------. 2 root root     64 Mar 25 01:03 .
drwx------. 3 root root     74 Mar 25 01:03 ..
-rw-r--r--. 1 root root 355446 Mar 25 01:03 insights-rhva2020.mgmt.local-20230325010202.tar.gz
[root@rhva2020 tmp]#
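To rule enrollment in or out, checking the Insights registration state on the host should be enough (a sketch; insights-client provides --status for exactly this, though the exact output wording may vary by version):

# Reports whether this host is registered with Red Hat Insights.
insights-client --status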

Comment 12 Greg Scott 2023-04-02 03:44:00 UTC
I think I have a revised set of steps to reproduce the problem. From the host, do:

1. cd /var/tmp
2. ls -al (nothing exciting yet)
3. Run insights-client. (Abort it or let it finish, doesn't matter.)
4. ls -al again. Note a new directory named insights-client, dated right now.
From the RHVM GUI, try the upgrade again. This time, vdsm should break and the upgrade should fail. Subsequent upgrade attempts from the RHVM GUI will claim to complete, but they won't perform any upgrade.

I'll attach a copy of imgbased.log from this host, named twelvetesthost. The action should all be from April 1, 2023.

Comment 14 Greg Scott 2023-04-02 04:15:16 UTC
Looks like this is where the first attempt goes off the rails.

2023-04-01 21:14:43,722 [DEBUG] (MainThread) Calling: (['mount', '/dev/rhvh/var_tmp', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,477 [DEBUG] (MainThread) Running: ['restorecon', '-Rv', '/var/tmp/']
2023-04-01 21:14:44,477 [DEBUG] (MainThread) Calling: (['restorecon', '-Rv', '/var/tmp/'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,516 [DEBUG] (MainThread) Exception! b'restorecon: Could not set context for /var/tmp/insights-client:  Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-client-egg-release:  Invalid argument\n'
2023-04-01 21:14:44,517 [DEBUG] (MainThread) Calling: (['umount', '-l', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,553 [DEBUG] (MainThread) Calling: (['rmdir', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,556 [DEBUG] (MainThread) Calling: (['umount', '-l', '/etc'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,567 [DEBUG] (MainThread) Calling: (['umount', '-l', '/tmp/mnt.b4RZG'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,578 [DEBUG] (MainThread) Calling: (['rmdir', '/tmp/mnt.b4RZG'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,581 [ERROR] (MainThread) Failed to migrate etc

After fixing VDSM, the second attempt starts at 21:51. This one runs to completion without error and the node reboots. But when it comes back up, it's still on RHEL 8.5, so the upgrade didn't work. But now when the GUI checks for upgrades, it says the host is up to date. So the next attempt is yum reinstall redhat-virtualization-host-image-update. This runs smoothly and after a reboot, the node is now on 8.6.
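For anyone checking whether another host hit the same thing, the signature above is easy to spot in the imgbased log (a sketch, assuming the usual /var/log/imgbased.log location):

# Show the restorecon failure lines with one line of context on either side.
grep -B1 -A1 'Could not set context' /var/log/imgbased.log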

Comment 18 Michal Skrivanek 2023-05-18 10:57:54 UTC
No capacity to handle this. Suspecting the insights-client SELinux rules may be broken; hard to say. It looks rare enough to ignore, with two potential workarounds (sketched below):
- running restorecon -Rv /var/tmp manually prior to the upgrade
- cleaning up /var/tmp/insights-client, which would fix the problem
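Roughly, either of these run on the host before kicking off the upgrade from the GUI (an untested sketch of the two workarounds above):

# Option 1: re-apply the expected SELinux labels under /var/tmp up front.
restorecon -Rv /var/tmp
# Option 2: remove the directory whose labels trip up restorecon.
rm -rf /var/tmp/insights-client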

