Bug 1222564

Summary: regression for EL7: spmprotect always reboots when fencing vdsm on systemd
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yaniv Bronhaim <ybronhei>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Priority: high
Docs Contact:
Version: 3.5.0
CC: amureini, bazulay, gklein, lpeer, lsurette, nsoffer, tnisan, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.0-rc
Keywords: Regression, ZStream
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1265177
Environment:
Last Closed: 2016-03-09 19:40:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1265177

Description Yaniv Bronhaim 2015-05-18 14:22:45 UTC
systemctl doesn't produce a vdsmd.pid file, but the spmprotect script depends on that file to fence vdsm. The fence method cannot work as it is.

The code should use the systemctl command to retrieve the pid instead of the current `cat vdsmd.pid`.
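
For illustration, a minimal sketch of that approach in Python, assuming the unit is named vdsmd.service (the real fix belongs in the spmprotect script; this only shows the systemctl query):

import subprocess

def get_vdsm_pid(unit="vdsmd.service"):
    # "systemctl show -p MainPID <unit>" prints a single line like "MainPID=1234".
    out = subprocess.check_output(["systemctl", "show", "-p", "MainPID", unit])
    pid = int(out.decode().strip().split("=", 1)[1])
    if pid == 0:
        # systemd reports MainPID=0 when the unit is not running.
        raise RuntimeError("%s is not running" % unit)
    return pid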

Comment 3 Nir Soffer 2015-05-25 12:06:23 UTC
This is indeed a regression, but it is not a 3.5 regression. This has been broken since we started supporting systemd on Fedora and EL 7.

Comment 4 Nir Soffer 2015-06-25 10:15:08 UTC
This affects only storage domain v1, which is still supported for backward compatibility.

Comment 5 Allon Mureinik 2015-06-28 13:49:54 UTC
(In reply to Nir Soffer from comment #4)
> This effects only storage domain v1, which is still supported for backward
> compatibility.
Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
Having a 3.0 DC with a RHEV >= 3.5 host isn't really interesting, IMHO.

Comment 6 Yaniv Lavi 2015-06-29 09:49:59 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Nir Soffer from comment #4)
> > This effects only storage domain v1, which is still supported for backward
> > compatibility.
> Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
> Having a 3.0 DC in with a RHEV>=3.5 host isn't really interesting, IMHO.

Done. Happy to hear that soft fencing works on 3.5.

Comment 7 Yaniv Bronhaim 2015-07-15 08:44:50 UTC
The patch is ready. Nir, please help verify it, or state how it can be verified properly...

Comment 8 Yaniv Bronhaim 2015-08-10 17:03:18 UTC
This is a storage bug that needs to be tested with storage functional flows. Dima might be able to help, as he did with the patch after Nir requested assistance, but verification should be handled by the storage team. I would use the testing environment that found the issue in the first place; we certainly can't ack it ourselves, as we are not familiar with the full usage of the spmprotect script.

Comment 9 Nir Soffer 2015-08-31 11:38:50 UTC
(In reply to Yaniv Bronhaim from comment #8)
> This is a storage bug that needs to be tested with storage functional flows.

Yaniv, I'm verifying the attached patch, no action needed on your side.

Comment 10 Nir Soffer 2015-09-11 20:00:49 UTC
Because spmprotect fails to get the vdsm pid, it fails to do a clean shutdown of the SPM, and then fails to terminate and kill vdsm. Finally, it reboots the machine.

This affects only legacy DCs using storage domain format 3.0.
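
For context, a rough Python sketch of that escalation; everything here is illustrative (the real spmprotect is a shell script with its own commands and timeouts), but it shows why nothing before the reboot step can work without a valid PID:

import os
import signal
import subprocess
import time

def wait_for_exit(pid, timeout):
    # Poll until the process disappears or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the process exists
        except OSError:
            return True
        time.sleep(1)
    return False

def fence_vdsm(pid):
    # 1. Ask vdsm to shut down cleanly and give up the SPM role.
    os.kill(pid, signal.SIGTERM)
    if wait_for_exit(pid, timeout=10):
        return
    # 2. Force-kill vdsm.
    os.kill(pid, signal.SIGKILL)
    if wait_for_exit(pid, timeout=10):
        return
    # 3. Last resort: reboot the host to protect the storage pool.
    subprocess.call(["reboot", "-f"])

In the broken flow the PID lookup returns nothing, so the kill attempts cannot succeed and the script falls straight through to the reboot step.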

Comment 11 Nir Soffer 2015-09-22 08:08:33 UTC
How to verify this:

1. create data center v 3.0
2. create cluster v 3.4
3. add host
4. create storage domain v1
5. wait until spm is up
6. check now maintenance/activation (no regression)
7. block access to storage (see the iptables sketch after this list)
8. watch vdsm being killed without reboot (previously host would reboot after this)
9. unblock access to storage
10. watch vdsm become spm again (no regression)

Repeat for nfs and iscsi (cannot mix domains in dc in this version).
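
A sketch of one way to do steps 7 and 9 above; the storage server address is a placeholder and the commands must run as root on the host under test:

import subprocess

STORAGE_ADDR = "10.0.0.42"  # placeholder: the NFS or iSCSI server backing the domain

def block_storage():
    # Drop all outgoing traffic from this host to the storage server.
    subprocess.check_call(
        ["iptables", "-A", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"])

def unblock_storage():
    # Remove the rule added above.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"])

Step 8 should follow within roughly 20 seconds of safelease failing to renew the lease; with the fix the host stays up instead of rebooting.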

Comment 14 Kevin Alon Goldblatt 2015-11-09 12:51:55 UTC
(In reply to Nir Soffer from comment #11)
> How to verify this:
> 
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1
> 5. wait until spm is up
> 6. check now maintenance/activation (no regression)
> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Please clarify:
------------------------------------
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1

> > 5. wait until spm is up
((Wouldn't the SPM be up before you could create a SD in step 4? Should this step be before step 3?))

> 6. check now maintenance/activation (no regression)
((Check by setting the host into maintenance and then reactivating?))

> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
((How long do we need to wait until this happens?))

> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Comment 15 Nir Soffer 2015-11-09 15:18:05 UTC
(In reply to Kevin Alon Goldblatt from comment #14)
> (In reply to Nir Soffer from comment #11)
> > How to verify this:
> > 
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> > 5. wait until spm is up
> > 6. check now maintenance/activation (no regression)
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> > 9. unblock access to storage
> > 10. watch vdsm become spm again (no regression)
> > 
> > Repeat for nfs and iscsi (cannot mix domains in dc in this version).
> 
> Please clarify:
> ------------------------------------
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> 
> > 5. wait until spm is up (
> (wouln't the spm be up before you could create a SD in step 4? Should this
> step be before step 3?))

No. Until you have storage, the host is up, but it is not the SPM.

> > 6. check now maintenance/activation (no regression)
> ((Check for setting the host into maintenance and the reactivating?)
> 
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> ((How long do we need to wait till this happens?

Vdsm will be killed about 20 seconds after safelease fails to renew the lease.

See the patch commit message; it describes the flow precisely.

Comment 16 Kevin Alon Goldblatt 2015-11-09 15:31:18 UTC
Verified using the following code:
-----------------------------------
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

Verified using the following scenario:
---------------------------------------
Steps to reproduce:
1. Create a DC with V3.0
2. Create a Cluster with V3.4
3. Add a host
4. Create a SD (iscsi/nfs)
5. Wait till the host becomes SPM
6. Verify that the host can be put into maintenance and then activated successfully
7. Block access to the storage using iptables
8. Verify that VDSM is killed and that the host is not rebooted
9. Unblock access to the storage
10. Verify that the host becomes SPM again

Moving to VERIFY!

Comment 18 errata-xmlrpc 2016-03-09 19:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html