Bug 1222564 - regression for EL7: spmprotect always reboot when fencing vdsm on systemd
Summary: regression for EL7: spmprotect always reboot when fencing vdsm on systemd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Nir Soffer
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1265177
 
Reported: 2015-05-18 14:22 UTC by Yaniv Bronhaim
Modified: 2016-03-09 19:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned To: 1265177
Environment:
Last Closed: 2016-03-09 19:40:05 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 43211 0 master ABANDONED spmprotect: Unbreak safelease when using systemd 2016-03-08 09:16:14 UTC
oVirt gerrit 45963 0 master MERGED safelease: Unbreak safelease on systemd Never
oVirt gerrit 46057 0 master MERGED safelease: Increase spmprotect timeouts Never
oVirt gerrit 46331 0 ovirt-3.6 MERGED safelease: Unbreak safelease on systemd Never
oVirt gerrit 46332 0 ovirt-3.6 MERGED safelease: Increase spmprotect timeouts Never

Description Yaniv Bronhaim 2015-05-18 14:22:45 UTC
On systemd hosts the vdsmd.pid file is not produced, but the spmprotect script depends on that file to fence vdsm. The fence method cannot work as it is.

The code should use the systemctl command to retrieve the pid and use it instead of the current `cat vdsmd.pid`.
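
A minimal sketch of the suggested direction (not the actual patch; the pid file path and unit name below are assumptions):

# Broken lookup on EL7: systemd does not create this pid file
VDSM_PID=$(cat /var/run/vdsm/vdsmd.pid 2>/dev/null)

# Possible systemd-based lookup, asking systemd for the service's main PID
VDSM_PID=$(systemctl show -p MainPID vdsmd.service | cut -d= -f2)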

Comment 3 Nir Soffer 2015-05-25 12:06:23 UTC
This is indeed a regression, but it is not a 3.5 regression. This has been broken since we started supporting systemd on Fedora and EL 7.

Comment 4 Nir Soffer 2015-06-25 10:15:08 UTC
This affects only storage domain v1, which is still supported for backward
compatibility.

Comment 5 Allon Mureinik 2015-06-28 13:49:54 UTC
(In reply to Nir Soffer from comment #4)
> This affects only storage domain v1, which is still supported for backward
> compatibility.
Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
Having a 3.0 DC with a RHEV >= 3.5 host isn't really interesting, IMHO.

Comment 6 Yaniv Lavi 2015-06-29 09:49:59 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Nir Soffer from comment #4)
> > This affects only storage domain v1, which is still supported for backward
> > compatibility.
> Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
> Having a 3.0 DC with a RHEV >= 3.5 host isn't really interesting, IMHO.

done. Happy to hear soft fencing works on 3.5.

Comment 7 Yaniv Bronhaim 2015-07-15 08:44:50 UTC
The patch is ready - Nir, please help verify it or state how it can be verified properly...

Comment 8 Yaniv Bronhaim 2015-08-10 17:03:18 UTC
This is a storage bug that needs to be tested with storage functional flows. Dima might be able to help, as he did with the patch after Nir requested assistance, but verification should be handled by the storage team. I would use the testing environment that found the bug in the first place; we certainly can't ack it ourselves, as we are not familiar with the full usage of the spmprotect script.

Comment 9 Nir Soffer 2015-08-31 11:38:50 UTC
(In reply to Yaniv Bronhaim from comment #8)
> This is a storage bug that needs to be tested with storage functional flows.

Yaniv, I'm verifying the attached patch, no action needed on your side.

Comment 10 Nir Soffer 2015-09-11 20:00:49 UTC
Because spmprotect fails to get the vdsm pid, it fails to do a clean shutdown of the SPM, and then fails to terminate and kill vdsm. Finally, it reboots the machine.

This affects only legacy DCs using storage domain format 3.0.
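
A rough illustration of the escalation described above, assuming a pid-file based lookup like the current script uses (this is not the real spmprotect code):

VDSM_PID=$(cat /var/run/vdsm/vdsmd.pid 2>/dev/null)   # empty on systemd hosts

if [ -n "$VDSM_PID" ]; then
    # Clean SPM shutdown would go here, escalating to SIGTERM and then SIGKILL
    # if vdsm survives.
    kill -TERM "$VDSM_PID"
    sleep 10
    kill -0 "$VDSM_PID" 2>/dev/null && kill -KILL "$VDSM_PID"
else
    # With no pid, none of the earlier steps can run, so the script falls
    # through to the last resort and reboots the machine.
    reboot -f
fi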

Comment 11 Nir Soffer 2015-09-22 08:08:33 UTC
How to verify this:

1. create data center v 3.0
2. create cluster v 3.4
3. add host
4. create storage domain v1
5. wait until spm is up
6. now check maintenance/activation (no regression)
7. block access to storage
8. watch vdsm being killed without reboot (previously host would reboot after this)
9. unblock access to storage
10. watch vdsm become spm again (no regression)

Repeat for nfs and iscsi (cannot mix domains in dc in this version).
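
For steps 7 and 9, one way to block and later restore access to the storage server from the host is with iptables; a sketch, where the storage address is a placeholder:

STORAGE_IP=192.0.2.10                          # placeholder storage server address
iptables -A OUTPUT -d "$STORAGE_IP" -j DROP    # step 7: block access to storage
# ... wait for spmprotect to kill vdsm; the host should not reboot ...
iptables -D OUTPUT -d "$STORAGE_IP" -j DROP    # step 9: unblock access to storage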

Comment 14 Kevin Alon Goldblatt 2015-11-09 12:51:55 UTC
(In reply to Nir Soffer from comment #11)
> How to verify this:
> 
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1
> 5. wait until spm is up
> 6. now check maintenance/activation (no regression)
> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Please clarify:
------------------------------------
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1

> 5. wait until spm is up
((wouldn't the SPM be up before you could create an SD in step 4? Should this step be before step 3?))

> 6. now check maintenance/activation (no regression)
((Check for setting the host into maintenance and then reactivating?))

> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
((How long do we need to wait until this happens?))

> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Comment 15 Nir Soffer 2015-11-09 15:18:05 UTC
(In reply to Kevin Alon Goldblatt from comment #14)
> (In reply to Nir Soffer from comment #11)
> > How to verify this:
> > 
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> > 5. wait until spm is up
> > 6. now check maintenance/activation (no regression)
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> > 9. unblock access to storage
> > 10. watch vdsm become spm again (no regression)
> > 
> > Repeat for nfs and iscsi (cannot mix domains in dc in this version).
> 
> Please clarify:
> ------------------------------------
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> 
> > 5. wait until spm is up
> ((wouldn't the SPM be up before you could create an SD in step 4? Should
> this step be before step 3?))

No. Until you have storage, the host is up, but it is not the SPM.

> > 6. now check maintenance/activation (no regression)
> ((Check for setting the host into maintenance and then reactivating?))
> 
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> ((How long do we need to wait until this happens?))

Vdsm will be killed about 20 seconds after safelease fails to renew the lease.

See the patch commit message; it describes the flow precisely.
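
For step 8, one possible way to watch this from another shell on the host (assuming the vdsmd systemd unit):

journalctl -u vdsmd.service -f     # follow vdsm logs while lease renewal fails
systemctl status vdsmd.service     # confirm vdsm was killed rather than stopped cleanly
uptime                             # confirm the host did not reboot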

Comment 16 Kevin Alon Goldblatt 2015-11-09 15:31:18 UTC
Verified using the following code:
-----------------------------------
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

Verified using the following scenario:
---------------------------------------
Steps to reproduce:
1. Create a DC with V3.0
2. Create a Cluster with V3.4
3. Add a host
4. Create an SD (iscsi/nfs)
5. Wait until the host becomes SPM
6. Verify that the host can be put into maintenance and then activated successfully
7. Block access to the storage using iptables
8. Verify that VDSM is killed and that the host is not rebooted
9. Unblock access to the storage
10. Verify that the host becomes SPM again

Moving to VERIFY!

Comment 18 errata-xmlrpc 2016-03-09 19:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

