Bug 1245284 - [Sanlock] Host reboot after sanlock.service restart, network is lost and unrecoverable
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: 4.17.14
Assignee: Nir Soffer
QA Contact: Lukas Svaty
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2015-07-21 16:26 UTC by Ori Gofen
Modified: 2019-04-28 10:45 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-26 15:36:15 UTC
oVirt Team: Storage
Embargoed:
danken: ovirt-3.6.z?
danken: ovirt-4.0.0?
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack?


Attachments
Logs (5.80 MB, application/x-gzip), 2015-07-21 16:26 UTC, Ori Gofen
logs (14.98 MB, application/x-gzip), 2015-08-05 08:33 UTC, Ori Gofen

Description Ori Gofen 2015-07-21 16:26:36 UTC
Created attachment 1054459 [details]
Logs

Description of problem:
Had 3 hypervisors on a 3.6 data center with 1 FC domain, 1 NFS domain, one VM
with a direct LUN as a disk, and several other VMs.
After restarting sanlock.service, all of the hypervisors rebooted; 2 of them
(10.35.77.10, 10.35.77.2) came back without the ovirtmgmt network:

[root@purple-vds2 ~]# ls  /etc/sysconfig/network-scripts/
ifcfg-enp0s29f0u2  ifdown-isdn      ifup-bnep   ifup-routes
ifcfg-enp11s0f0    ifdown-post      ifup-eth    ifup-sit
ifcfg-enp11s0f1    ifdown-ppp       ifup-ib     ifup-Team
ifcfg-lo           ifdown-routes    ifup-ippp   ifup-TeamPort
ifdown             ifdown-sit       ifup-ipv6   ifup-tunnel
ifdown-bnep        ifdown-Team      ifup-isdn   ifup-wireless
ifdown-eth         ifdown-TeamPort  ifup-plip   init.ipv6-global
ifdown-ib          ifdown-tunnel    ifup-plusb  network-functions
ifdown-ippp        ifup             ifup-post   network-functions-ipv6
ifdown-ipv6        ifup-aliases     ifup-ppp

Furthermore, the ifcfg-ovirtmgmt script was wiped out, so in order to regain connectivity I had to log in to the server via management and recreate the script manually.
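
For reference, the file I had to recreate looks roughly like the minimal sketch
below (placeholder values only; the real addressing and bridge options depend on
the host, and this is not the exact file vdsm generates):

# /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt  (minimal sketch, placeholder values)
DEVICE=ovirtmgmt
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=dhcp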

This behaviour was reproduced twice; the time it took me to recover the data center was almost one day.

Version-Release number of selected component (if applicable):

[root@purple-vds2 ~]# rpm -qa | grep vdsm
vdsm-python-4.17.0-1054.git562e711.el7.noarch
vdsm-jsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-4.17.0-1054.git562e711.el7.noarch
vdsm-infra-4.17.0-1054.git562e711.el7.noarch
vdsm-xmlrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-yajsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-cli-4.17.0-1054.git562e711.el7.noarch
[root@purple-vds2 ~]# rpm -qa | grep sanlock
sanlock-python-3.2.2-2.el7.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
sanlock-3.2.2-2.el7.x86_64
sanlock-lib-3.2.2-2.el7.x86_64

ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Restart sanlock.service on all 3 hypervisors, starting with the SPM hypervisor.
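For example, on each host:

# run on the SPM hypervisor first, then on each remaining hypervisor
systemctl restart sanlock.service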

Actual results:
The whole data center is down for a day.

Expected results:
The hosts should lose their storage connections and become Non Operational, then after a reasonable interval regain connectivity and move back to the Up state; the SPM should be restored automatically!

Additional info:

Comment 1 Ido Barkan 2015-07-22 11:05:41 UTC
Ori, you are using a non-fresh master that was indeed unstable in terms of network restoration at the commit you are using. Can you upgrade to the latest master and try to reproduce?

Comment 2 Ori Gofen 2015-07-22 13:06:00 UTC
sure, will do

Comment 3 Ori Gofen 2015-07-22 13:20:57 UTC
Well, it seems that I'm using the latest master:
Installed:
  ovirt-release-master.noarch 0:001-0.9.master    

Otherwise, please correct me.

Comment 4 Ido Barkan 2015-07-23 06:08:44 UTC
Ori, this might be the latest build but not the latest code, and the git hash points to an unstable era of network persistence. Please try to reproduce using the latest nightly build, because many fixes in this area are missing in your deployment.

Comment 5 Ori Gofen 2015-07-23 15:18:45 UTC
Well, to clear things up: QE opens or verifies bugs based on rhev-ci official builds only. I will try to reproduce on the (real) current master and post results ASAP.

Comment 6 Ori Gofen 2015-08-05 08:33:31 UTC
Created attachment 1059376 [details]
logs

Reproduced on:
ovirt-engine-3.6.0-0.0.master.20150726172446.git65db93d.el6.noarch
vdsm-4.17.0-1201.git7ba0684.el7.noarch

Comment 7 Nir Soffer 2015-10-08 12:18:27 UTC
(In reply to Ori Gofen from comment #0)
> Description of problem:
> Had 3 hypervisors on a 3.6 datacenter with 1 fc domain 1 nfs and one VM with
> Dlun as disk and several other vms.
> after sanlock.service restart, all of the hypervisors rebooted, 2 of them
> (10.35.77.10, 10.35.77.2) came back without ovirtmgmt network:

What do you mean by "sanlock.service restart"? Please share the exact
command line you are using for "restarting" sanlock.

sanlock should fail to stop if it is holding the spm resource; killing it
will lead to reboot by the system watchdog.

Please try to reproduce this by rebooting the hosts directly, since the
issue does not seem to be related to sanlock.

If you can stop sanlock on the spm using systemctl stop sanlock, please
open a sanlock bug for this; it should not stop in this case.
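
For reference, you can check whether sanlock is currently holding lockspaces or
resources before trying to stop it (standard sanlock CLI; output format varies
by version):

sanlock client status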

Comment 8 Ori Gofen 2015-10-12 14:12:09 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Ori Gofen from comment #0)
> > Description of problem:
> > Had 3 hypervisors on a 3.6 datacenter with 1 fc domain 1 nfs and one VM with
> > Dlun as disk and several other vms.
> > after sanlock.service restart, all of the hypervisors rebooted, 2 of them
> > (10.35.77.10, 10.35.77.2) came back without ovirtmgmt network:
> 
> What do you mean by "sanlock.service restart"? Please share the exact
> command line you are using for "restarting" sanlock.

systemctl restart sanlock.service

> sanlock should fail to stop if it is holding the spm resource; killing it
> will lead to reboot by the system watchdog.
As mentioned, no kill signal was sent.

> Please try to reproduce this by rebooting the hosts directly, since the
> issue does not seem to be related to sanlock.
Our resources are occupied with Cinder integration testing; if you can provide me an engine with 3 hypervisors, it would be helpful.

> If you can stop sanlock on the spm using systemctl stop sanlock, please
> open a sanlock bug for this; it should not stop in this case.
systemctl stop sanlock.service causes the SPM machine to reboot.

Nir, is there any documentation about the 'sanlock should fail to stop' issue?

Comment 9 Dan Kenigsberg 2015-10-22 09:58:29 UTC
Moving the issue to storage, as the network problem has been solved in 3.5.4.

Comment 10 Nir Soffer 2015-10-26 11:57:50 UTC
(In reply to Ori Gofen from comment #8)
> (In reply to Nir Soffer from comment #7)
> > (In reply to Ori Gofen from comment #0)
> > > Description of problem:
> > > Had 3 hypervisors on a 3.6 datacenter with 1 fc domain 1 nfs and one VM with
> > > Dlun as disk and several other vms.
> > > after sanlock.service restart, all of the hypervisors rebooted, 2 of them
> > > (10.35.77.10, 10.35.77.2) came back without ovirtmgmt network:
> > 
> > What do you mean by "sanlock.service restart"? Please share the exact
> > command line you are using for "restarting" sanlock.
> 
> systemctl restart sanlock.service

I discussed this with Federico; this is a known issue in sanlock
when working with systemd.

Please open a sanlock bug: it should not allow stopping or restarting
the sanlock service while sanlock is holding a resource.

However, this operation is not supported, and that should be documented
in the RHEV docs; the administrator should not try to stop the sanlock
service.
If this is not documented, please open a documentation bug.

Currently, when you stop the sanlock service or kill sanlock, the watchdog
may reboot your host; this is expected behavior.

> Nir is there any documentation about 'sanlock should fail to stop etc.'
> issue?

I don't know of any docs; maybe David can add more info.

Comment 11 David Teigland 2015-10-26 15:15:08 UTC
If sanlock refuses to stop, then I believe systemd kills sanlock.  This was a new "feature" of systemd that we didn't want.  I fixed this quite a long time ago in sanlock's native systemd unit file by adding "SendSIGKILL=no".  I don't believe this works with the init-wrapper unit file that sanlock installs by default, but you could try it.

This is one of multiple reasons that the native systemd unit file may be better than the default.  In 7.2 I've begun installing the native systemd unit files in
/usr/share/doc/sanlock-3.2.4 so that people who have problems with the defaults can try the native ones.  For 7.3 I may propose installing the native systemd unit files by default.
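
If you want to try that, one possible sketch is a systemd drop-in carrying the
setting for the default unit (untested; as noted above it may not take effect
with the init-wrapper unit file, and the drop-in file name here is just an
example):

# create a drop-in for the installed sanlock.service; any *.conf name works
mkdir -p /etc/systemd/system/sanlock.service.d
cat > /etc/systemd/system/sanlock.service.d/sendsigkill.conf <<'EOF'
[Service]
SendSIGKILL=no
EOF
systemctl daemon-reload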

Comment 12 Nir Soffer 2015-10-26 15:36:15 UTC
(In reply to David Teigland from comment #11)
> If sanlock refuses to stop, then I believe systemd kills sanlock.  This was
> a new "feature" of systemd that we didn't want.  I fixed this quite a long
> time ago in sanlock's native systemd unit file by adding "SendSIGKILL=no". 
> I don't believe this works with the init-wrapper unit file that sanlock
> installs by default, but you could try it.
> 
> This is one of multiple reasons that the native systemd unit file may be
> better than the default.  In 7.2 I've begun installing the native systemd
> unit files in
> /usr/share/doc/sanlock-3.2.4 so that people who have problems with the
> defaults can try the native ones.  For 7.3 I may propose installing the
> native systemd unit files by default.

Thanks David!

I think this is the correct solution. I suggest that you open a sanlock bug
to track this. We will want to consume the version using native
systemd files when it is available.

Closing for now, as the behavior is expected with the current sanlock version.

