Bug 832199

Summary: vdsmd init script times out due to lengthy semanage operation
Product: [Retired] oVirt
Component: vdsm
Version: 3.1 RC
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NEXTRELEASE
Severity: urgent
Priority: urgent
Reporter: Adam Litke <alitke>
Assignee: Douglas Schilling Landgraf <dougsland>
QA Contact: Haim <hateya>
CC: abaron, acathrow, bazulay, dyasny, iheim, lvroyce, mgoldboi, oschreib, yeylon, ykaul
Target Milestone: ---
Target Release: ---
Whiteboard: infra
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-07-06 20:33:09 UTC
Bug Blocks: 822145

Description Adam Litke 2012-06-14 20:16:55 UTC
Description of problem:
When starting vdsm on an unconfigured Fedora 17 system (libvirt not configured for vdsm), the systemd init script times out during a lengthy semanage operation.

Version-Release number of selected component (if applicable):
vdsm-4.10.0-1.fc17.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Stop vdsmd and libvirtd
2. Clear vdsm libvirt configuration

LCONF=/etc/libvirt/libvirtd.conf
QCONF=/etc/libvirt/qemu.conf
LDCONF=/etc/sysconfig/libvirtd
QLCONF=/etc/libvirt/qemu-sanlock.conf
by_vdsm="by vdsm"
by_vdsm_vers="4.9.6"
start_conf_section="## beginning of configuration section ${by_vdsm}"
end_conf_section="## end of configuration section ${by_vdsm}"
remove_vdsm_conf() {
    # Remove the vdsm-managed configuration section and any line tagged "by vdsm"
    sed -i --copy -e "/${start_conf_section}/,/${end_conf_section}/d" \
        -e "/$by_vdsm/d" "$@"
}
remove_vdsm_conf $LCONF $QCONF $LDCONF $QLCONF

3. Start vdsmd
  
Actual results:

> service vdsmd start
Redirecting to /bin/systemctl  start vdsmd.service
Job failed. See system journal and 'systemctl status' for details.

from journalctl:
Jun 14 14:54:36 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[7725]: Configuring libvirt for vdsm...
Jun 14 14:54:36 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[7725]: Thu Jun 14 14:54:36 CDT 2012: Before restorecon /etc/logrotate.d/libvirtd
Jun 14 14:54:36 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[7725]: Thu Jun 14 14:54:36 CDT 2012: After restorecon /etc/logrotate.d/libvirtd
Jun 14 14:54:36 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[7725]: Thu Jun 14 14:54:36 CDT 2012: Before semanage
Jun 14 14:55:01 bc1cn7-mgmt.phx.austin.ibm.com /USR/SBIN/CROND[7797]: (root) CMD (/usr/sbin/logrotate /etc/logrotate.d/libvirtd)
Jun 14 14:56:06 bc1cn7-mgmt.phx.austin.ibm.com systemd[1]: vdsmd.service operation timed out. Terminating.
Jun 14 14:56:06 bc1cn7-mgmt.phx.austin.ibm.com systemd[1]: Unit vdsmd.service entered failed state.

(Note that I have added some debugging output to my version of the script to show some timings)...


Expected results:

Service should start correctly.

Additional info:

The script times out while executing the following block:
    # vdsm makes extensive use of nfs-exported images
    /usr/sbin/semanage  boolean -m -S targeted -F /dev/stdin  << _EOF
virt_use_nfs=1
virt_use_sanlock=1
_EOF


This bug is onerous because the next time you try to start vdsm it will think everything has been configured and will skip the configure_libvirt function.  This will allow vdsmd to start, but some things (such as SELinux) will not have been successfully configured.  I would suggest marking the configuration complete only after all steps have completed.  Perhaps a stamp in the /etc/vdsm directory would be better for this.
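
A minimal sketch of that stamp-file idea (hypothetical paths and function names, not the actual systemd-vdsmd code):

    # Hypothetical sketch only.
    CONF_STAMP=/etc/vdsm/.configured          # assumed stamp location

    configure_host() {
        configure_libvirt &&
        configure_selinux &&                  # the slow semanage/setsebool step
        touch "$CONF_STAMP"                   # mark complete only if every step succeeded
    }

    if [ ! -f "$CONF_STAMP" ]; then
        configure_host || exit 1              # fail loudly rather than start half-configured
    fi

With a scheme like this, a timeout during the SELinux step leaves the stamp absent, so the next start retries the configuration instead of silently skipping it.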

Comment 1 Itamar Heim 2012-06-14 20:24:38 UTC
duplicate of bug 831921?

Comment 2 Adam Litke 2012-06-14 20:44:08 UTC
(In reply to comment #1)
> duplicate of bug 831921?

No.  831921 times out while waiting for a dependent service (iscsid), while this bug times out because of a long-running operation in the init script that has no business being there.  Note that work is underway to remove these things from the initscript and place them in the daemon's own startup code.  Until then, we have a pretty bad bug that will hit nearly all users that are not using ovirt-node.

Comment 3 Dan Kenigsberg 2012-06-16 20:31:33 UTC
Adam, I'm not sure I agree that system preparation should be placed in the daemon's startup code. Could systemd be convinced to give vdsm more time?

Workaround that may help ovirt users: run

  /lib/systemd/systemd-vdsmd reconfigure

manually before starting the vdsmd service. After this step succeeds, starting the service will not attempt to modify the SELinux policy.
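
If extending the timeout is the preferred route, one way to do it would be roughly as follows (a sketch only; the exact option names and the 5-minute value depend on the systemd version shipped with Fedora 17):

    # Sketch: run a private copy of the unit with a longer start timeout.
    cp /usr/lib/systemd/system/vdsmd.service /etc/systemd/system/vdsmd.service
    # In the [Service] section of the copy, raise the timeout, e.g.:
    #   TimeoutSec=300
    systemctl daemon-reload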

Comment 4 Ofer Schreiber 2012-06-18 13:46:51 UTC
I'm not sure how severe this issue is.
1. Will it happen every time (e.g., every first time you run vdsm), or just in a special case?
2. What happens when it fails? Will SELinux prevent vdsm from running if it's enabled and enforcing?

I'm trying to understand whether this bug is an oVirt 3.1 blocker or not.

Comment 5 Adam Litke 2012-06-18 14:30:40 UTC
(In reply to comment #4)
> I'm not sure how severe this issue is.
> 1. Will it happen every time (e.g., every first time you run vdsm), or just
> in a special case?

This will happen for all new installations (non ovirt-node).  I think it is a release blocker for this reason.

> 2. What happens when it fails? Will SELinux prevent vdsm from running if
> it's enabled and enforcing?

When it fails, vdsm will only be partially configured.  As the current code is written, the configuration steps that are not completed are the SELinux policy alterations.  Until the user manually intervenes (systemd-vdsmd reconfigure) vdsm will be unable to access images on NFS storage.

> I'm trying to understand whether this bug is an oVirt 3.1 blocker or not.

Yes, I believe so.

Comment 6 Ofer Schreiber 2012-06-18 14:38:12 UTC
Added as an oVirt 3.1 blocker, thanks.

Comment 7 Dan Kenigsberg 2012-06-18 16:06:26 UTC
Douglas, Adam, do you have any idea how to fix this (short of moving host configuration into vdsm startup code)? Maybe we can extend systemd's timeout?

Comment 8 Douglas Schilling Landgraf 2012-06-18 20:17:38 UTC
Hi Adam, 

I have been trying to reproduce the report here and couldn't (with SELinux enabled or disabled).

Is it a clean Fedora 17 installation that you are using? Also, could you please share 'systemd-analyze blame' output after the failure? What does your getenforce say?

My test data:

repo
======
# cat /etc/yum.repos.d/ovirt-beta.repo 
[ovirt-beta]
name=Nightly builds of the oVirt project
baseurl=http://ovirt.org/releases/beta/fedora/$releasever
enabled=1
skip_if_unavailable=1
gpgcheck=0

vdsm
=====
# rpm -qa | grep -i vdsm
vdsm-python-4.10.0-1.fc17.x86_64
vdsm-cli-4.10.0-1.fc17.noarch
vdsm-4.10.0-1.fc17.x86_64
vdsm-xmlrpc-4.10.0-1.fc17.noarch

First scenario:
=================

* Installed clean Fedora 17 (Minimum installation)

# getenforce 
Enforcing

* Added the above host to oVirt Engine
* After a host reboot, the host became UP

Second scenario:
=========================
* 1. Stop vdsmd and libvirtd
* 2. Clear vdsm libvirt configuration

# getenforce 
Enforcing

# service vdsmd status (indeed, took some time. See my systemd-analyze below)
Redirecting to /bin/systemctl  status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
	  Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
	  Active: activating (start) since Mon, 18 Jun 2012 15:44:49 -0400; 34s ago
	Main PID: 1461 (code=killed, signal=TERM); Control: 1775 (systemd-vdsmd)
	  CGroup: name=systemd:/system/vdsmd.service
		  ├ 1775 /bin/sh /lib/systemd/systemd-vdsmd start
		  └ 1891 /usr/bin/python -Es /usr/sbin/semanage boolean -m -S targeted -F /dev/stdin


# service vdsmd status
Redirecting to /bin/systemctl  status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
	  Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
	  Active: active (running) since Mon, 18 Jun 2012 15:59:05 -0400; 1min 32s ago
	 Process: 2440 ExecStart=/lib/systemd/systemd-vdsmd start (code=exited, status=0/SUCCESS)
	Main PID: 2719 (respawn)
	  CGroup: name=systemd:/system/vdsmd.service
		  ├ 2719 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
		  ├ 2722 /usr/bin/python /usr/share/vdsm/vdsm
		  ├ 2744 /usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.py f9ce2222-8d8f-485b-bb88-541a9aeea014 2722
		  └ 2745 /usr/bin/python /usr/share/vdsm/supervdsmServer.py f9ce2222-8d8f-485b-bb88-541a9aeea014 2722

Jun 18 15:59:04 localhost.localdomain setsebool[2571]: The virt_use_nfs policy boolean was changed to on by root
Jun 18 15:59:04 localhost.localdomain setsebool[2572]: The virt_use_sanlock policy boolean was changed to on by root
Jun 18 15:59:05 localhost.localdomain systemd-vdsmd[2440]: Starting sanlock...
Jun 18 15:59:05 localhost.localdomain systemd-vdsmd[2440]: Starting iscsid:
Jun 18 15:59:05 localhost.localdomain systemd-vdsmd[2440]: Starting libvirtd (via systemctl):  [  OK  ]
Jun 18 15:59:05 localhost.localdomain systemd-vdsmd[2440]: Starting up vdsm daemon:
Jun 18 15:59:05 localhost.localdomain runuser[2716]: pam_unix(runuser:session): session opened for user vdsm by (uid=0)
Jun 18 15:59:05 localhost.localdomain runuser[2716]: pam_unix(runuser:session): session closed for user vdsm
Jun 18 15:59:05 localhost.localdomain systemd-vdsmd[2440]: [27B blob data]
Jun 18 15:59:08 localhost.localdomain vdsm[2722]: vdsm vds ERROR Unable to load the rest server module. Please make sure it is installed.


# systemd-analyze blame  (*with selinux enabled)
 53629ms vdsmd.service
 10880ms network.service
  3093ms fedora-loadmodules.service
  1876ms iscsid.service

# systemd-analyze blame  (*without selinux enabled)
 51802ms vdsmd.service
 12316ms network.service
  4533ms iscsid.service


Thanks!

Comment 9 Adam Litke 2012-06-20 12:16:13 UTC
(In reply to comment #7)
> Douglas, Adam, do you have any idea how to fix this (short of moving host
> configuration into vdsm startup code)? Maybe we can extend systemd's timeout?

How about doing the SELinux configuration during install rather than during service startup?
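
A rough illustration of that idea as an RPM %post fragment (illustrative only, not the patch that was eventually proposed; see comment 11 for the actual gerrit change):

    # Illustrative %post sketch: set the booleans once, at package install time.
    %post
    if /usr/sbin/selinuxenabled; then
        /usr/sbin/setsebool -P virt_use_nfs=1 virt_use_sanlock=1 || :
    fi

Doing the persistent boolean change, and its slow policy rebuild, at install time keeps it out of the service start path entirely.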

Comment 10 Adam Litke 2012-06-20 12:53:13 UTC
(In reply to comment #8)
> Hi Adam, 
> 
> I have been trying to reproduce the report here and couldn't (with SELinux
> enabled or disabled).
> 
> Is it a clean Fedora 17 installation that you are using? Also, could you
> please share 'systemd-analyze blame' output after the failure? What does
> your getenforce say?


My Test Data:
=============

It was a clean Fedora install.  But now I am back on vdsm HEAD.  This should not matter for this bug since the code in question is the same.


Repo
====
$ cat /etc/yum.repos.d/ovirt.repo 
[ovirt]
name=oVirt
baseurl=http://www.ovirt.org/releases/beta/fedora/17
enabled=1
gpgcheck=0

vdsm* 
=====
rpm -qa | grep -i vdsm
vdsm-tests-4.10.0-0.12.git4399aa5.fc17.noarch
vdsm-4.10.0-0.12.git4399aa5.fc17.x86_64
vdsm-rest-4.10.0-0.12.git4399aa5.fc17.noarch
vdsm-xmlrpc-4.10.0-0.12.git4399aa5.fc17.noarch
vdsm-cli-4.10.0-0.12.git4399aa5.fc17.noarch
vdsm-python-4.10.0-0.12.git4399aa5.fc17.x86_64
* Originally reported based on the oVirt 3.1 beta, but I am collecting the requested info from my dev environment now.  This should not invalidate the results.

I suspect your first scenario could mask the problem because the lines that fail to execute when the script times out are:
    /usr/sbin/setsebool virt_use_nfs on
    /usr/sbin/setsebool virt_use_sanlock on

    /sbin/initctl restart libvirtd 2>/dev/null || :
These would all get done as a result of the reboot.

I am focusing on scenario 2:
============================
$ getenforce
Enforcing

$ sudo service vdsmd start
[sudo] password for aglitke: 
Redirecting to /bin/systemctl  start vdsmd.service
Job failed. See system journal and 'systemctl status' for details.

# service vdsmd status
Redirecting to /bin/systemctl  status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
	  Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
	  Active: failed (Result: timeout) since Wed, 20 Jun 2012 07:43:49 -0500; 3min 43s ago
	 Process: 10554 ExecStart=/lib/systemd/systemd-vdsmd start (code=killed, signal=TERM)
	Main PID: 14498 (code=killed, signal=TERM)
	  CGroup: name=systemd:/system/vdsmd.service

Jun 20 07:42:19 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[10554]: Note: Fo...
Jun 20 07:42:19 bc1cn7-mgmt.phx.austin.ibm.com systemd-vdsmd[10554]: Configur...

(The above is not very helpful at all I'm afraid -- ahh systemd)

systemd-analyze blame does not contain any information for vdsmd.service, probably because it failed.

Comment 11 Douglas Schilling Landgraf 2012-06-21 18:46:12 UTC
Hi Adam,

     Unfortunately, I cannot reproduce your report. However, I do believe it can happen, and I agree that we can move the SELinux setup to the spec file. Can you please review/verify the following patch: http://gerrit.ovirt.org/#/c/5600/

Thanks!

Comment 12 Douglas Schilling Landgraf 2012-08-17 18:18:05 UTC
*** Bug 831921 has been marked as a duplicate of this bug. ***