Description of problem:
===
After an upgrade of RHV-H to the latest redhat-virtualization-host-image-update-4.1-20180410, it is no longer possible to migrate VMs to the upgraded host. The investigation leads to an exception shown on the source hypervisor and the error message:

libvirt: Lock Driver error : Failed to open socket to sanlock daemon: Permission denied

The "Permission denied" is most likely due to wrong SELinux boolean settings, e.g. sanlock_use_fusefs, sanlock_use_nfs and virt_use_sanlock are all "off".

Version-Release number of selected component (if applicable):
===
redhat-virtualization-host-image-update-4.1-20180410, but older versions are most likely affected as well.

How reproducible:
===
Low ratio - only some hypervisors have the problem.

Steps to Reproduce:
===
Currently it is hard to say how to reproduce this.

Actual results:
===
VMs can't be migrated to an upgraded hypervisor.

Expected results:
===
After an upgrade, everything works properly.

Additional info:
===
I believe we have an intermittent bug which is triggered by some specific circumstances. This problem was reported more than a year ago for a different RHV-H version: https://bugzilla.redhat.com/show_bug.cgi?id=1375546
This time we have much more data - a customer is upgrading many hypervisors; for some of them the upgrade is fine, for a few there is an issue.
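To check whether a given hypervisor is affected, the booleans quoted above can be queried directly. A minimal sketch, assuming the standard SELinux tooling on the host:

# Query the sanlock-related SELinux booleans mentioned above; on the
# affected hypervisors all three were reported as "off".
getsebool sanlock_use_fusefs sanlock_use_nfs virt_use_sanlock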
I have another instance of this issue with redhat-release-virtualization-host-4.1-10.5.el7.x86_64. I can provide additional data if it can be useful.
I never saw the original bug, unfortunately, because it never made it to RHVH. Given the dates of the upgrades, I'm guessing that this is a behavior change in 7.5.

RHVH does actually go through the RPM %post scripts when updated, to check whether any of:

restorecon
semodule
semanage
fixfiles
chcon

are invoked anywhere, and we re-invoke them on the new image.

There are a couple of possibilities:

1) A change to the RPM %post scripts means some command we weren't looking for was called.

2) A behavior change in SELinux inside nsenter happened, and our script is silently failing. This happened with rpm late in the 7.5 cycle ('rpm' inside chroots or nsenter now requires /dev/urandom inside the chroot).

3) This isn't part of the default policy. An SELinux rebase in 7.4 meant that we actually can't migrate the SELinux policy as-is: /etc/selinux/targeted/active/modules is not migrated, since it's not binary compatible. This means that upgrades take the policy from the new image. We expect this to be OK, because packages should set the right booleans/contexts on their own, but it's possible that they do not.

I'll try to reproduce so I can isolate this.
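As an illustration of the kind of check described above, a simplified sketch (not the actual imgbased code) that scans installed packages' scriptlets for those commands could look like this:

# Simplified sketch, not the actual imgbased implementation: scan each
# installed package's scriptlets for SELinux-related commands so they
# can be re-invoked on the new image.
for pkg in $(rpm -qa); do
    if rpm -q --scripts "$pkg" | grep -qE 'restorecon|semodule|semanage|fixfiles|chcon'; then
        echo "$pkg: scriptlets call SELinux tools; re-run them on the new image"
    fi
done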
A customer confirmed that changing the booleans resolved his problem on 2 hypervisors. Moreover, a KCS article has been created and attached.
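The exact commands the customer ran are not recorded in this report; a plausible form of the workaround, assuming the booleans are changed persistently with setsebool, is:

# Assumed workaround (the customer's exact commands are not recorded
# here): persistently enable the sanlock-related booleans found "off".
setsebool -P sanlock_use_fusefs=on sanlock_use_nfs=on virt_use_sanlock=on

# Verify the change.
getsebool sanlock_use_fusefs sanlock_use_nfs virt_use_sanlock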
David, any ideas? Sounds like a familiar issue? (see comment 3 specifically)
Comment 4 sounds like the problem is resolved. Comment 5 sounds like there's still a problem, but I don't understand anything in it since I've never really touched SELinux before.
A comment on one of the cases helped me run this down. I still don't have a reproducer, but I'm reasonably sure this is the root cause:

# semanage permissive -a sanlock_t
libsepol.context_from_record: type ovirt_vmconsole_host_port_t is not defined (No such file or directory).
libsepol.context_from_record: could not create context structure (Invalid argument).
libsepol.port_from_record: could not create port structure for range 2223:2223 (tcp) (Invalid argument).
libsepol.sepol_port_modify: could not load port range 2223 - 2223 (tcp) (Invalid argument).
libsemanage.dbase_policydb_modify: could not modify record value (Invalid argument).
libsemanage.semanage_base_merge_components: could not merge local modifications into policy (Invalid argument).
OSError: Invalid argument

Essentially, this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1463584, and resolving that will also resolve this. This appeared on 4.1.10 because that was the first version we installed ovirt-vmconsole on. Since ovirt-vmconsole did not require selinux-policy-targeted, it was installed first on the image inside brew, and the policy was never inserted.

"vdsm-tool configure --force" (which always runs on RHVH) tries to configure these booleans. It cannot, because the policy is invalid.

In looking at the behavior to see whether there was an SELinux change between 7.4 and 7.5 also affecting us (because we DO try to re-run anything with "semodule", "restorecon", "chcon", or a number of other SELinux commands in RPM %postinstall scripts on imgbased updates), I also learned that "nsenter --root=/tmp/foo --wd=/ getenforce" will report that SELinux is disabled. This also means that:

if /usr/sbin/selinuxenabled; then semodule -i "/usr/share/selinux/ovirt-vmconsole.pp"; fi

will never be executed, though imgbased diligently tries to run it anyway.

/usr/sbin/selinuxenabled (and getenforce, and others) check for the existence of:

/etc/selinux/config
/proc
/sys/fs/selinux

rbind-ing /sys into the filesystem root allows "getenforce" to show that it's enforcing. This is potentially risky for other containers, but we can trust that any RPMs installed on RHVH/Node images have RPM scripts we want to run anyway.
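A minimal sketch of the nsenter behavior described above (/tmp/foo stands in for the mounted root of the new image layer, as in the example command):

# Inside the new image's root, selinuxfs is not visible, so getenforce
# reports "Disabled" and scriptlet guards like
# "if /usr/sbin/selinuxenabled; then ...; fi" are silently skipped.
nsenter --root=/tmp/foo --wd=/ getenforce

# rbind-ing /sys (which carries /sys/fs/selinux) into the new root lets
# getenforce and selinuxenabled see the real enforcing state again.
mount --rbind /sys /tmp/foo/sys
nsenter --root=/tmp/foo --wd=/ getenforce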
QE also can NOT reproduce this issue.

Test version:
# imgbase layout
rhvh-4.1-0.20171207.0
 +- rhvh-4.1-0.20171207.0+1
rhvh-4.1-0.20180410.0
 +- rhvh-4.1-0.20180410.0+1

Test steps:
1. Install host1 and host2 (rhvh-4.1-0.20171207.0), register them to RHVM in the same cluster, and create vm1 on host1.
2. Migrate vm1 from host1 to host2 successfully, then migrate vm1 back to host1.
3. Upgrade host2 from rhvh-4.1-0.20171207.0 to rhvh-4.1-0.20180410.0.
4. Migrate vm1 again from host1 to host2.

Test results:
After step 4, vm1 can be migrated from host1 to host2 successfully.
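The steps above do not record how the update in step 3 was applied; for reference, a command-line sketch of a typical RHVH upgrade (it can also be driven from the RHV Manager UI, and the repository setup here is assumed):

# Pull the new image layer via the image-update package and let imgbased
# install it (assumes the update repository is already configured).
yum update redhat-virtualization-host-image-update

# Confirm that the new layer is present.
imgbase layout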
Olimp, QE can not reproduce this bug, so could you please help verify it once the new 4.1.11 build is available? Thanks!
Lowering the severity according to #c21.
Moving, since there will not be another 4.1
Test version:
rhvh-4.1-0.20171207.0
rhvh-4.2.3.1-0.20180531.0 (imgbased-1.0.17-0.1.el7ev.noarch)

Test steps:
1. Install host1 and host2 (rhvh-4.1-0.20171207.0), register them to RHVM in the same cluster, and create vm1 on host1.
2. Migrate vm1 from host1 to host2 successfully, then migrate vm1 back to host1.
3. Upgrade host2 from rhvh-4.1-0.20171207.0 to rhvh-4.2.3.1-0.20180531.0.
4. Migrate vm1 again from host1 to host2.
5. Upgrade host1 from rhvh-4.1-0.20171207.0 to rhvh-4.2.3.1-0.20180531.0.
6. Migrate vm1 again from host1 to host2.

Test results:
After steps 4 and 6, vm1 can be migrated from host1 to host2 successfully without errors, so the bug is fixed; changing bug status to VERIFIED. Please re-open this bug if the socket issue can still be reproduced.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1820