Description of problem:
===
After an upgrade of RHV-H to the latest redhat-virtualization-host-image-update-4.1-20180410, it is no longer possible to migrate VMs to the upgraded host. The investigation leads to an exception shown on the source hypervisor and the error message:

libvirt: Lock Driver error : Failed to open socket to sanlock daemon: Permission denied

The "Permission denied" is most likely due to wrong SELinux boolean settings, e.g. sanlock_use_fusefs, sanlock_use_nfs and virt_use_sanlock are all "off".

Version-Release number of selected component (if applicable):
===
redhat-virtualization-host-image-update-4.1-20180410, but older versions are most likely affected as well.

How reproducible:
===
Low ratio - only some hypervisors have the problem.

Steps to Reproduce:
===
Currently it is hard to say how to reproduce this.

Actual results:
===
VMs can't be migrated to an upgraded hypervisor.

Expected results:
===
After an upgrade, everything works properly.

Additional info:
===
I believe we have an intermittent bug which is triggered by some specific circumstances. This problem was reported more than a year ago for a different RHV-H version: https://bugzilla.redhat.com/show_bug.cgi?id=1375546
This time we have much more data - a customer is upgrading many hypervisors; for some of them the upgrade is fine, for a few there is an issue.
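To check whether a given hypervisor is affected, the booleans quoted above can be queried directly. A minimal sketch, assuming the standard SELinux tooling on the host:

# Query the sanlock-related SELinux booleans mentioned above; on the
# affected hypervisors all three were reported as "off".
getsebool sanlock_use_fusefs sanlock_use_nfs virt_use_sanlock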
I have another instance of this issue with redhat-release-virtualization-host-4.1-10.5.el7.x86_64. I can provide additional data if it can be useful.
I never saw the original bug, unfortunately, because it never made it to RHVH. Given the dates of the upgrades, I'm guessing that this is a behavior change in 7.5.

RHVH does actually go through the RPM %post scripts when updated, to check whether any of:

restorecon
semodule
semanage
fixfiles
chcon

are invoked anywhere, and we re-invoke them on the new image.

There are a couple of possibilities:

1) A change to the RPM %post scripts means some command we weren't looking for was called.

2) A behavior change in SELinux inside nsenter happened, and our script is silently failing. This happened with rpm late in the 7.5 cycle ('rpm' inside chroots or nsenter now requires /dev/urandom inside the chroot).

3) This isn't part of the default policy. An SELinux rebase in 7.4 meant that we actually can't migrate the SELinux policy as-is: /etc/selinux/targeted/active/modules is not migrated, since it's not binary compatible. This means that upgrades take the policy from the new image. We expect this to be OK, because packages should set the right booleans/contexts on their own, but it's possible that they do not.

I'll try to reproduce so I can isolate this.
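As an illustration of the kind of check described above, a simplified sketch (not the actual imgbased code) that scans installed packages' scriptlets for those commands could look like this:

# Simplified sketch, not the actual imgbased implementation: scan each
# installed package's scriptlets for SELinux-related commands so they
# can be re-invoked on the new image.
for pkg in $(rpm -qa); do
    if rpm -q --scripts "$pkg" | grep -qE 'restorecon|semodule|semanage|fixfiles|chcon'; then
        echo "$pkg: scriptlets call SELinux tools; re-run them on the new image"
    fi
done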
A customer confirmed that changing the booleans resolved his problem on 2 hypervisors. Moreover, a KCS article has been created and attached.
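The exact commands the customer ran are not recorded in this report; a plausible form of the workaround, assuming the booleans are changed persistently with setsebool, is:

# Assumed workaround (the customer's exact commands are not recorded
# here): persistently enable the sanlock-related booleans found "off".
setsebool -P sanlock_use_fusefs=on sanlock_use_nfs=on virt_use_sanlock=on

# Verify the change.
getsebool sanlock_use_fusefs sanlock_use_nfs virt_use_sanlock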
David, any ideas? Sounds like a familiar issue? (see comment 3 specifically)
Comment 4 sounds like the problem is resolved. Comment 5 sounds like there's still a problem, but I don't understand anything in it since I've never really touched SELinux before.
A comment on one of the cases helped me run this down. I still don't have a reproducer, but I'm reasonably sure this is the root cause:

# semanage permissive -a sanlock_t
libsepol.context_from_record: type ovirt_vmconsole_host_port_t is not defined (No such file or directory).
libsepol.context_from_record: could not create context structure (Invalid argument).
libsepol.port_from_record: could not create port structure for range 2223:2223 (tcp) (Invalid argument).
libsepol.sepol_port_modify: could not load port range 2223 - 2223 (tcp) (Invalid argument).
libsemanage.dbase_policydb_modify: could not modify record value (Invalid argument).
libsemanage.semanage_base_merge_components: could not merge local modifications into policy (Invalid argument).
OSError: Invalid argument

Essentially, this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1463584, and resolving that will also resolve this. This appeared on 4.1.10 because that was the first version we installed ovirt-vmconsole on. Since ovirt-vmconsole did not require selinux-policy-targeted, it was installed first on the image inside brew, and the policy was never inserted.

"vdsm-tool configure --force" (which always runs on RHVH) tries to configure these booleans. It cannot, because the policy is invalid.

In looking at the behavior to see whether there was an SELinux change between 7.4 and 7.5 also affecting us (because we DO try to re-run anything with "semodule", "restorecon", "chcon", or a number of other SELinux commands in RPM %postinstall scripts on imgbased updates), I also learned that "nsenter --root=/tmp/foo --wd=/ getenforce" will report that SELinux is disabled. This also means that:

if /usr/sbin/selinuxenabled; then semodule -i "/usr/share/selinux/ovirt-vmconsole.pp"; fi

will never be executed, though imgbased diligently tries to run it anyway.

/usr/sbin/selinuxenabled (and getenforce, and others) check for the existence of:

/etc/selinux/config
/proc
/sys/fs/selinux

rbind-ing /sys into the filesystem root allows "getenforce" to show that it's enforcing. This is potentially risky for other containers, but we can trust that any RPMs installed on RHVH/Node images have RPM scripts we want to run anyway.
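A minimal sketch of the nsenter behavior described above (/tmp/foo stands in for the mounted root of the new image layer, as in the example command):

# Inside the new image's root, selinuxfs is not visible, so getenforce
# reports "Disabled" and scriptlet guards like
# "if /usr/sbin/selinuxenabled; then ...; fi" are silently skipped.
nsenter --root=/tmp/foo --wd=/ getenforce

# rbind-ing /sys (which carries /sys/fs/selinux) into the new root lets
# getenforce and selinuxenabled see the real enforcing state again.
mount --rbind /sys /tmp/foo/sys
nsenter --root=/tmp/foo --wd=/ getenforce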
QE also can NOT reproduce this issue.

Test version:
# imgbase layout
rhvh-4.1-0.20171207.0
 +- rhvh-4.1-0.20171207.0+1
rhvh-4.1-0.20180410.0
 +- rhvh-4.1-0.20180410.0+1

Test steps:
1. Install host1 and host2 (rhvh-4.1-0.20171207.0), register them to RHVM in the same cluster, and create vm1 on host1.
2. Migrate vm1 from host1 to host2 successfully, then migrate vm1 back to host1.
3. Upgrade host2 from rhvh-4.1-0.20171207.0 to rhvh-4.1-0.20180410.0.
4. Migrate vm1 again from host1 to host2.

Test results:
After step 4, vm1 can be migrated from host1 to host2 successfully.
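The steps above do not record how the update in step 3 was applied; for reference, a command-line sketch of a typical RHVH upgrade (it can also be driven from the RHV Manager UI, and the repository setup here is assumed):

# Pull the new image layer via the image-update package and let imgbased
# install it (assumes the update repository is already configured).
yum update redhat-virtualization-host-image-update

# Confirm that the new layer is present.
imgbase layout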
Olimp, QE can not reproduce this bug, so could you please help verify it once the new 4.1.11 build is available? Thanks!
Lowering the severity according to #c21.
Moving, since there will not be another 4.1
Test version:
rhvh-4.1-0.20171207.0
rhvh-4.2.3.1-0.20180531.0 (imgbased-1.0.17-0.1.el7ev.noarch)

Test steps:
1. Install host1 and host2 (rhvh-4.1-0.20171207.0), register them to RHVM in the same cluster, and create vm1 on host1.
2. Migrate vm1 from host1 to host2 successfully, then migrate vm1 back to host1.
3. Upgrade host2 from rhvh-4.1-0.20171207.0 to rhvh-4.2.3.1-0.20180531.0.
4. Migrate vm1 again from host1 to host2.
5. Upgrade host1 from rhvh-4.1-0.20171207.0 to rhvh-4.2.3.1-0.20180531.0.
6. Migrate vm1 again from host1 to host2.

Test results:
After steps 4 and 6, vm1 can be migrated from host1 to host2 successfully without errors, so the bug is fixed; changing bug status to VERIFIED. Please re-open this bug if the socket issue can still be reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1820