Bug 2181772
| Summary: | Upgrading RHV-H from 4.4.10 to 4.5.3-202302150956_8.6 breaks with an obscure vdsm error, which breaks subsequent re-installations. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Greg Scott <gscott> |
| Component: | ovirt-host | Assignee: | Asaf Rachmani <arachman> |
| Status: | CLOSED WONTFIX | QA Contact: | peyu |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5.3 | CC: | arachman, cshao, lleistne, mavital, michal.skrivanek, mperina, pelauter, peyu, sbonazzo, weiwang, yaniwang |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-18 10:57:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Node | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Greg Scott
2023-03-25 20:08:06 UTC
A question came up about my selinux settings. My RHV-H hosts have been set to enforcing for years.

~~~~~~
[root@rhvb2020 ~]# cat /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@rhvb2020 ~]#
~~~~~~

The failing restorecon call from the log:

~~~~~~
2023-03-25 17:53:13,520 [DEBUG] (MainThread) Calling: (['restorecon', '-Rv', '/var/tmp/'],) {'close_fds': True, 'stderr': -2}
2023-03-25 17:53:13,560 [DEBUG] (MainThread) Exception! b'restorecon: Could not set context for /var/tmp/insights-client: Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-archive-406barzz: Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-archive-406barzz/insights-rhva2020.mgmt.local-20230325010202.tar.gz: Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-client-egg-release: Invalid argument\n'
~~~~~~
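"Invalid argument" from restorecon generally means the kernel or loaded policy rejected the context being set. A hedged diagnostic sketch (my addition, not from the report; ls -Z and matchpathcon are standard on RHV-H):

~~~~~~
# Compare the label currently on the directory with the one policy expects;
# a mismatch or an unknown type here would explain the EINVAL.
ls -dZ /var/tmp/insights-client
matchpathcon /var/tmp/insights-client
~~~~~~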
I wonder if those files are just temporary files that aren't supposed to be there. Probably.
It could be that it's there on all connected systems; I do not think we test that at all.
Sounds like an insights problem, really (or selinux policies that do not have the correct rule for these insights files).
I don't think we can do much about that for 4.4.10. We should check a connected system upgrade in more recent version perhaps.
> I don't think we can do much about that for 4.4.10. We should check a
> connected system upgrade in more recent version perhaps.
But this isn't a 4.4.10 issue - that RPM with the scripts that run the upgrade is part of 4.5.1. The update sees something it doesn't expect and just fails. An updated 4.5.1 z-stream could fix that, right?
Aw nuts, you can't edit comments. I should have said 4.5.3, not 4.5.1.

(In reply to Greg Scott from comment #5)
> > I don't think we can do much about that for 4.4.10. We should check a
> > connected system upgrade in more recent version perhaps.
>
> But this isn't a 4.4.10 issue - that RPM with the scripts that run the
> upgrade is part of 4.5.1. The update sees something it doesn't expect and
> just fails. An updated 4.5.1 z-stream could fix that, right?

The script that runs is from the new layer, yes, but it's a blanket restorecon over the whole /var/tmp. The problem is in the old layer's insights. Likely they're missing the right rule, and it makes restorecon explode. I don't know if ignoring the error code is the best way forward, it may just hide issues. I mean... ok, "rm -rf /var/tmp/insights-client" would probably be ok... Can you give it a check/try, if you have other hosts to upgrade?

> Can you give it a check/try, if you have other hosts to upgrade?
I already did both of my hosts. I have a junk one I haven't powered on in several months. I'll check to see what's on it.
Ya know - if that's the root problem - restorecon explodes because of bogus files in /var/tmp - then it seems like it's okay to report the error with a reasonable error message and then fail, without telling the world that it applied the update. Or, if dnf returns success anyway even when a script inside fails, then say something in the error message about doing a dnf reinstall {packagename} from the host. And clean up the leftovers from the failure, so the reinstall doesn't also blow up.
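What Greg describes could look something like this inside the upgrade scripting - an illustrative bash sketch of the requested behavior, not actual imgbased code (the error text and package name are taken from this report):

~~~~~~
# Hypothetical handling: fail loudly instead of claiming the update applied.
if ! restorecon -Rv /var/tmp/; then
    echo "ERROR: relabeling /var/tmp failed; the update was NOT applied." >&2
    echo "Clean up /var/tmp (e.g. the insights-client leftovers) and retry, or run" >&2
    echo "'dnf reinstall redhat-virtualization-host-image-update' from the host." >&2
    exit 1
fi
~~~~~~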
QE tried to reproduce this bug, but could not.
Test version:
RHVM: 4.5.2.4-0.1.el8ev
RHVH: Upgrade RHVH from rhvh-4.4.10.1-0.20220208.0+1 to rhvh-4.5.3.4-0.20230215.0+1
Test steps:
1. Install RHVH-4.4-20220208.0-RHVH-x86_64-dvd1.iso
2. Login to the host, set up local repo and point to "redhat-virtualization-host-4.5.3-202302150956_8.6"
3. Add host to RHVM
4. Upgrade host via RHVM GUI
5. Focus on the host status after upgrade
Test results:
The RHVH upgrade is successful, and the status of the host in RHVM is "Up".
Additional info:
~~~~~~
# imgbase w
You are on rhvh-4.5.3.4-0.20230215.0+1
# imgbase layout
rhvh-4.4.10.1-0.20220208.0
+- rhvh-4.4.10.1-0.20220208.0+1
rhvh-4.5.3.4-0.20230215.0
+- rhvh-4.5.3.4-0.20230215.0+1
~~~~~~

vdsmd is active after upgrade.

~~~~~~
# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-03-31 04:56:26 UTC; 1h 19min ago
Process: 5565 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
Main PID: 8201 (vdsmd)
Tasks: 42 (limit: 820699)
Memory: 159.5M
CGroup: /system.slice/vdsmd.service
└─8201 /usr/bin/python3 /usr/libexec/vdsm/vdsmd
Mar 31 04:56:24 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running prepare_transient_repository
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running syslog_available
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running nwfilter
Mar 31 04:56:25 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running dummybr
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running tune_system
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running test_space
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com vdsmd_init_common.sh[5565]: vdsm: Running test_lo
Mar 31 04:56:26 dell-per7425-03.lab.eng.pek2.redhat.com systemd[1]: Started Virtual Desktop Server Manager.
Mar 31 04:56:29 dell-per7425-03.lab.eng.pek2.redhat.com vdsm[8201]: WARN MOM not available. Error: [Errno 2] No such file or directo>
Mar 31 04:56:29 dell-per7425-03.lab.eng.pek2.redhat.com vdsm[8201]: WARN MOM not available, KSM stats will be missing. Error:
~~~~~~
Which leads to the $million question: what's different about my hosts vs. the QE ones? I had forgotten this earlier - my 398-day certificates expired again on March 24. I renewed my RHVM certificates with engine-setup, and upgraded my RHVM to the latest at the same time. Here is my RHVM version - a little bit newer than the QE one.

~~~~~~
[root@rhvm2020 ~]# rpm -qa | grep rhvm-4
rhvm-4.5.3.7-1.el8ev.noarch
[root@rhvm2020 ~]#
~~~~~~

After RHVM came back alive, I used the RHVM GUI to renew my 4.4 host certificates. They all came back to green. And then I upgraded them.

I don't remember enrolling either of my RHV-H hosts with Insights, but I forget lots of things. This is after the upgrade, but the upgrade process does an rsync from the old to new layers, so this insights-client directory might be relevant. Per @Michal's analysis above, maybe that's the difference.

~~~~~~
[root@rhva2020 tmp]# pwd
/var/tmp
[root@rhva2020 tmp]# ls
abrt             systemd-private-88212f5c517643ccae0c7efb9521cc08-chronyd.service-Vc8Hnm
insights-client  systemd-private-88212f5c517643ccae0c7efb9521cc08-systemd-resolved.service-Sz75pt
[root@rhva2020 tmp]# ls -al -R insights-client/
insights-client/:
total 4
drwx------. 3 root root  74 Mar 25 01:03 .
drwxrwxrwt. 6 root root 208 Mar 25 18:15 ..
drwx------. 2 root root  64 Mar 25 01:03 insights-archive-406barzz
-rw-r--r--. 1 root root   8 Mar 25 18:10 insights-client-egg-release

insights-client/insights-archive-406barzz:
total 348
drwx------. 2 root root     64 Mar 25 01:03 .
drwx------. 3 root root     74 Mar 25 01:03 ..
-rw-r--r--. 1 root root 355446 Mar 25 01:03 insights-rhva2020.mgmt.local-20230325010202.tar.gz
[root@rhva2020 tmp]#
~~~~~~

I think I have a revised set of steps to reproduce the problem. From the host, do:

1. cd /var/tmp
2. ls -al (nothing exciting)
3. insights-client (abort it or let it finish, doesn't matter)
4. ls -al again. Note a new directory named insights-client, dated right now.
5. From the RHVM GUI, try the upgrade again. This time, vdsm should break and the upgrade should fail.

Subsequent upgrade attempts from the RHVM GUI will claim to complete, but they won't perform any upgrade. I'll attach a copy of imgbased.log from this host, named twelvetesthost. The action should all be from April 1, 2023. Looks like this is where the first attempt goes off the rails:
~~~~~~
2023-04-01 21:14:43,722 [DEBUG] (MainThread) Calling: (['mount', '/dev/rhvh/var_tmp', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,477 [DEBUG] (MainThread) Running: ['restorecon', '-Rv', '/var/tmp/']
2023-04-01 21:14:44,477 [DEBUG] (MainThread) Calling: (['restorecon', '-Rv', '/var/tmp/'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,516 [DEBUG] (MainThread) Exception! b'restorecon: Could not set context for /var/tmp/insights-client: Invalid argument\nrestorecon: Could not set context for /var/tmp/insights-client/insights-client-egg-release: Invalid argument\n'
2023-04-01 21:14:44,517 [DEBUG] (MainThread) Calling: (['umount', '-l', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,553 [DEBUG] (MainThread) Calling: (['rmdir', '/tmp/mnt.lZ4hQ'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,556 [DEBUG] (MainThread) Calling: (['umount', '-l', '/etc'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,567 [DEBUG] (MainThread) Calling: (['umount', '-l', '/tmp/mnt.b4RZG'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,578 [DEBUG] (MainThread) Calling: (['rmdir', '/tmp/mnt.b4RZG'],) {'close_fds': True, 'stderr': -2}
2023-04-01 21:14:44,581 [ERROR] (MainThread) Failed to migrate etc
~~~~~~
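For clarity, the logged sequence is roughly equivalent to the following shell (a sketch; /tmp/mnt.lZ4hQ and /tmp/mnt.b4RZG are the temporary mount points imgbased generated for this run):

~~~~~~
mount /dev/rhvh/var_tmp /tmp/mnt.lZ4hQ            # mount the new layer's var_tmp volume
restorecon -Rv /var/tmp/                          # fails: "Could not set context ... Invalid argument"
umount -l /tmp/mnt.lZ4hQ; rmdir /tmp/mnt.lZ4hQ    # cleanup
umount -l /etc
umount -l /tmp/mnt.b4RZG; rmdir /tmp/mnt.b4RZG
# -> imgbased logs "Failed to migrate etc" and the upgrade aborts
~~~~~~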
After fixing VDSM, the second attempt starts at 21:51. This one runs to completion without error and the node reboots. But when it comes back up, it's still on RHEL 8.5, so the upgrade didn't work. But now when the GUI checks for upgrades, it says the host is up to date. So the next attempt is yum reinstall redhat-virtualization-host-image-update. This runs smoothly and after a reboot, the node is now on 8.6.
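The recovery that worked, as commands run from the host (both commands appear elsewhere in this report; on RHV-H 4.5, yum is an alias for dnf):

~~~~~~
# Force a reinstall of the image-update package, then verify the layer after reboot.
yum reinstall redhat-virtualization-host-image-update
reboot
imgbase w    # should now report the 4.5.3 layer
~~~~~~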
No capacity to handle this. Suspecting the insights-client selinux rules may be broken, hard to say. It looks rare enough to ignore, with two potential workarounds, either of which would fix the problem:
- running restorecon -Rv /var/tmp manually prior to upgrade
- cleaning up /var/tmp/insights-client
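The two workarounds, as shell to run on the host before starting the upgrade from the RHVM GUI:

~~~~~~
# Workaround 1: relabel /var/tmp up front so the upgrade's blanket restorecon succeeds.
restorecon -Rv /var/tmp
# Workaround 2 (alternative): remove the leftover insights-client artifacts entirely.
rm -rf /var/tmp/insights-client
~~~~~~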