Bug 1222564 - regression for EL7: spmprotect always reboot when fencing vdsm on systemd
Summary: regression for EL7: spmprotect always reboot when fencing vdsm on systemd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Nir Soffer
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1265177
 
Reported: 2015-05-18 14:22 UTC by Yaniv Bronhaim
Modified: 2016-03-09 19:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned To: 1265177
Environment:
Last Closed: 2016-03-09 19:40:05 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 43211 0 master ABANDONED spmprotect: Unbreak safelease when using systemd 2016-03-08 09:16:14 UTC
oVirt gerrit 45963 0 master MERGED safelease: Unbreak safelease on systemd Never
oVirt gerrit 46057 0 master MERGED safelease: Increase spmprotect timeouts Never
oVirt gerrit 46331 0 ovirt-3.6 MERGED safelease: Unbreak safelease on systemd Never
oVirt gerrit 46332 0 ovirt-3.6 MERGED safelease: Increase spmprotect timeouts Never

Description Yaniv Bronhaim 2015-05-18 14:22:45 UTC
On systemd hosts the vdsmd.pid file is not produced, but the spmprotect script depends on that file to fence vdsm. The fence method cannot work as it is.

The code should use the systemctl command to retrieve the pid and use it instead of the current `cat vdsmd.pid`.
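
A minimal sketch of the suggested direction (not the actual patch; the pid file path and unit name below are assumptions):

# Broken lookup on EL7: systemd does not create this pid file
VDSM_PID=$(cat /var/run/vdsm/vdsmd.pid 2>/dev/null)

# Possible systemd-based lookup, asking systemd for the service's main PID
VDSM_PID=$(systemctl show -p MainPID vdsmd.service | cut -d= -f2)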

Comment 3 Nir Soffer 2015-05-25 12:06:23 UTC
This is indeed a regression, but it is not a 3.5 regression. This has been broken since we started supporting systemd on Fedora and EL 7.

Comment 4 Nir Soffer 2015-06-25 10:15:08 UTC
This affects only storage domain v1, which is still supported for backward
compatibility.

Comment 5 Allon Mureinik 2015-06-28 13:49:54 UTC
(In reply to Nir Soffer from comment #4)
> This affects only storage domain v1, which is still supported for backward
> compatibility.
Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
Having a 3.0 DC with a RHEV >= 3.5 host isn't really interesting, IMHO.

Comment 6 Yaniv Lavi 2015-06-29 09:49:59 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Nir Soffer from comment #4)
> > This affects only storage domain v1, which is still supported for backward
> > compatibility.
> Yaniv - Given this analysis, I'm fine with pushing it off to 3.5.5.
> Having a 3.0 DC with a RHEV >= 3.5 host isn't really interesting, IMHO.

done. Happy to hear soft fencing works on 3.5.

Comment 7 Yaniv Bronhaim 2015-07-15 08:44:50 UTC
The patch is ready - Nir, please help verify it or state how it can be verified properly...

Comment 8 Yaniv Bronhaim 2015-08-10 17:03:18 UTC
This is a storage bug that needs to be tested with storage functional flows. Dima might be able to help, as he did with the patch after Nir requested assistance, but verification should be handled by the storage team. I would use the testing environment that found the bug in the first place; we certainly can't ack it ourselves, as we are not familiar with the full usage of the spmprotect script.

Comment 9 Nir Soffer 2015-08-31 11:38:50 UTC
(In reply to Yaniv Bronhaim from comment #8)
> This is a storage bug that needs to be tested with storage functional flows.

Yaniv, I'm verifying the attached patch, no action needed on your side.

Comment 10 Nir Soffer 2015-09-11 20:00:49 UTC
Because spmprotect fails to get the vdsm pid, it fails to do a clean shutdown of the SPM, and then fails to terminate and kill vdsm. Finally, it reboots the machine.

This affects only legacy DCs using storage domain format 3.0.
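
A rough illustration of the escalation described above, assuming a pid-file based lookup like the current script uses (this is not the real spmprotect code):

VDSM_PID=$(cat /var/run/vdsm/vdsmd.pid 2>/dev/null)   # empty on systemd hosts

if [ -n "$VDSM_PID" ]; then
    # Clean SPM shutdown would go here, escalating to SIGTERM and then SIGKILL
    # if vdsm survives.
    kill -TERM "$VDSM_PID"
    sleep 10
    kill -0 "$VDSM_PID" 2>/dev/null && kill -KILL "$VDSM_PID"
else
    # With no pid, none of the earlier steps can run, so the script falls
    # through to the last resort and reboots the machine.
    reboot -f
fi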

Comment 11 Nir Soffer 2015-09-22 08:08:33 UTC
How to verify this:

1. create data center v 3.0
2. create cluster v 3.4
3. add host
4. create storage domain v1
5. wait until spm is up
6. now check maintenance/activation (no regression)
7. block access to storage
8. watch vdsm being killed without reboot (previously host would reboot after this)
9. unblock access to storage
10. watch vdsm become spm again (no regression)

Repeat for nfs and iscsi (cannot mix domains in dc in this version).
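
For steps 7 and 9, one way to block and later restore access to the storage server from the host is with iptables; a sketch, where the storage address is a placeholder:

STORAGE_IP=192.0.2.10                          # placeholder storage server address
iptables -A OUTPUT -d "$STORAGE_IP" -j DROP    # step 7: block access to storage
# ... wait for spmprotect to kill vdsm; the host should not reboot ...
iptables -D OUTPUT -d "$STORAGE_IP" -j DROP    # step 9: unblock access to storage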

Comment 14 Kevin Alon Goldblatt 2015-11-09 12:51:55 UTC
(In reply to Nir Soffer from comment #11)
> How to verify this:
> 
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1
> 5. wait until spm is up
> 6. now check maintenance/activation (no regression)
> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Please clarify:
------------------------------------
> 1. create data center v 3.0
> 2. create cluster v 3.4
> 3. add host
> 4. create storage domain v1

> 5. wait until spm is up
((wouldn't the SPM be up before you could create an SD in step 4? Should this step be before step 3?))

> 6. now check maintenance/activation (no regression)
((Check for setting the host into maintenance and then reactivating?))

> 7. block access to storage
> 8. watch vdsm being killed without reboot (previously host would reboot
> after this)
((How long do we need to wait until this happens?))

> 9. unblock access to storage
> 10. watch vdsm become spm again (no regression)
> 
> Repeat for nfs and iscsi (cannot mix domains in dc in this version).

Comment 15 Nir Soffer 2015-11-09 15:18:05 UTC
(In reply to Kevin Alon Goldblatt from comment #14)
> (In reply to Nir Soffer from comment #11)
> > How to verify this:
> > 
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> > 5. wait until spm is up
> > 6. now check maintenance/activation (no regression)
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> > 9. unblock access to storage
> > 10. watch vdsm become spm again (no regression)
> > 
> > Repeat for nfs and iscsi (cannot mix domains in dc in this version).
> 
> Please clarify:
> ------------------------------------
> > 1. create data center v 3.0
> > 2. create cluster v 3.4
> > 3. add host
> > 4. create storage domain v1
> 
> > 5. wait until spm is up
> ((wouldn't the SPM be up before you could create an SD in step 4? Should
> this step be before step 3?))

No. Until you have storage, the host is up, but it is not the SPM.

> > 6. now check maintenance/activation (no regression)
> ((Check for setting the host into maintenance and then reactivating?))
> 
> > 7. block access to storage
> > 8. watch vdsm being killed without reboot (previously host would reboot
> > after this)
> ((How long do we need to wait until this happens?))

Vdsm will be killed about 20 seconds after safelease fails to renew the lease.

See the patch commit message; it describes the flow precisely.
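
For step 8, one possible way to watch this from another shell on the host (assuming the vdsmd systemd unit):

journalctl -u vdsmd.service -f     # follow vdsm logs while lease renewal fails
systemctl status vdsmd.service     # confirm vdsm was killed rather than stopped cleanly
uptime                             # confirm the host did not reboot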

Comment 16 Kevin Alon Goldblatt 2015-11-09 15:31:18 UTC
Verified using the following code:
-----------------------------------
vdsm-4.17.10.1-0.el7ev.noarch
rhevm-3.6.0.3-0.1.el6.noarch

Verified using the following scenario:
---------------------------------------
Steps to reproduce:
1. Create a DC with V3.0
2. Create a Cluster with V3.4
3. Add a host
4. Create an SD (iscsi/nfs)
5. Wait until the host becomes SPM
6. Verify that the host can be put into maintenance and then activated successfully
7. Block access to the storage using iptables
8. Verify that VDSM is killed and that the host is not rebooted
9. Unblock access to the storage
10. Verify that the host becomes SPM again

Moving to VERIFY!

Comment 18 errata-xmlrpc 2016-03-09 19:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

