Created attachment 1054459 [details]
Logs

Description of problem:
Had 3 hypervisors on a 3.6 datacenter with 1 FC domain, 1 NFS domain, one VM with a direct LUN as a disk, and several other VMs.

After a sanlock.service restart, all of the hypervisors rebooted. Two of them (10.35.77.10, 10.35.77.2) came back without the ovirtmgmt network:

[root@purple-vds2 ~]# ls /etc/sysconfig/network-scripts/
ifcfg-enp0s29f0u2  ifdown-isdn      ifup-bnep   ifup-routes
ifcfg-enp11s0f0    ifdown-post      ifup-eth    ifup-sit
ifcfg-enp11s0f1    ifdown-ppp       ifup-ib     ifup-Team
ifcfg-lo           ifdown-routes    ifup-ippp   ifup-TeamPort
ifdown             ifdown-sit       ifup-ipv6   ifup-tunnel
ifdown-bnep        ifdown-Team      ifup-isdn   ifup-wireless
ifdown-eth         ifdown-TeamPort  ifup-plip   init.ipv6-global
ifdown-ib          ifdown-tunnel    ifup-plusb  network-functions
ifdown-ippp        ifup             ifup-post   network-functions-ipv6
ifdown-ipv6        ifup-aliases     ifup-ppp

Furthermore, the ifcfg-ovirtmgmt script had been wiped out, so in order to regain connectivity I had to log in to the server via the management interface and recreate the script manually.

I reproduced this behaviour twice; the time it took me to recover the datacenter was almost a full day.

Version-Release number of selected component (if applicable):
[root@purple-vds2 ~]# rpm -qa | grep vdsm
vdsm-python-4.17.0-1054.git562e711.el7.noarch
vdsm-jsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-4.17.0-1054.git562e711.el7.noarch
vdsm-infra-4.17.0-1054.git562e711.el7.noarch
vdsm-xmlrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-yajsonrpc-4.17.0-1054.git562e711.el7.noarch
vdsm-cli-4.17.0-1054.git562e711.el7.noarch

[root@purple-vds2 ~]# rpm -qa | grep sanlock
sanlock-python-3.2.2-2.el7.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
sanlock-3.2.2-2.el7.x86_64
sanlock-lib-3.2.2-2.el7.x86_64

ovirt-engine-3.6.0-0.0.master.20150627185750.git6f063c1.el6.noarch

How reproducible:
2/2

Steps to Reproduce:
1. Restart sanlock.service on all 3 hypervisors, starting with the SPM hypervisor.

Actual results:
The whole datacenter is gone for a day.

Expected results:
The hosts should lose their storage connections and become non-operational, regain connectivity after a reasonable interval, and shift to state Up; the SPM should be restored automatically.

Additional info:
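A minimal sketch of the reproduction flow, under the assumptions above (the SPM host is restarted first; host addresses are the ones from this report):

  # On each hypervisor, starting with the SPM host:
  systemctl restart sanlock.service

  # Observed (buggy) outcome: the host is rebooted by the watchdog and may
  # come back without its ovirtmgmt ifcfg script:
  ls /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt   # missing after reboot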
Ori, you are using a non-fresh master that was indeed unstable in terms of network restoration at the commit you are using. Can you upgrade to the latest master and try to reproduce?
Sure, will do.
Well, it seems that I'm using the latest master:

Installed: ovirt-release-master.noarch 0:001-0.9.master

Otherwise, please correct me.
Ori, this might be the latest build but not the latest code, and the git hash points to an unstable era of network persistence. Please try to reproduce using the latest nightly build, because many fixes in this area are missing in your deployment.
Well, to clear things up: QE opens and verifies bugs based on official rhev-ci builds only. I will try to reproduce on the (real) current master and post the results ASAP.
Created attachment 1059376 [details]
logs

Reproduced on:
ovirt-engine-3.6.0-0.0.master.20150726172446.git65db93d.el6.noarch
vdsm-4.17.0-1201.git7ba0684.el7.noarch
(In reply to Ori Gofen from comment #0)
> Description of problem:
> Had 3 hypervisors on a 3.6 datacenter with 1 FC domain, 1 NFS domain, one
> VM with a direct LUN as a disk, and several other VMs.
>
> After a sanlock.service restart, all of the hypervisors rebooted. Two of
> them (10.35.77.10, 10.35.77.2) came back without the ovirtmgmt network.

What do you mean by "sanlock.service restart"? Please share the exact command line you are using for "restarting" sanlock.

sanlock should fail to stop if it is holding the SPM resource; killing it will lead to a reboot by the system watchdog.

Please try to reproduce this by rebooting the hosts directly, since the issue does not seem to be related to sanlock.

If you can stop sanlock on the SPM using "systemctl stop sanlock", please open a sanlock bug for this; it should not stop in this case.
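To check whether sanlock is actually holding the SPM lease before attempting a stop, something like this should work (a sketch; the lockspace/resource names shown are illustrative only, the real ones depend on the storage domain UUIDs):

  # Dump the sanlock daemon state, including held lockspaces and resources:
  sanlock client status

  # On the SPM host, expect an "r" (resource) line for the SPM lease in
  # addition to the "s" (lockspace) lines; non-SPM hosts should show only
  # "s" lines. Illustrative example output:
  #   s <domain-uuid>:1:/dev/<domain-uuid>/ids:0
  #   r <domain-uuid>:SDM:/dev/<domain-uuid>/leases:1048576:12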
(In reply to Nir Soffer from comment #7)
> What do you mean by "sanlock.service restart"? Please share the exact
> command line you are using for "restarting" sanlock.

systemctl restart sanlock.service

> sanlock should fail to stop if it is holding the SPM resource; killing it
> will lead to a reboot by the system watchdog.

As I mentioned, no kill signal was sent.

> Please try to reproduce this by rebooting the hosts directly, since the
> issue does not seem to be related to sanlock.

Our resources are currently occupied with Cinder integration testing; if you can provide me an engine with 3 hypervisors, it will be helpful.

> If you can stop sanlock on the SPM using "systemctl stop sanlock", please
> open a sanlock bug for this; it should not stop in this case.

systemctl stop sanlock.service causes the SPM machine to reboot.

Nir, is there any documentation about the 'sanlock should fail to stop' behaviour?
Moving issue to storage, as the network problem has been solved in 3.5.4.
(In reply to Ori Gofen from comment #8)
> (In reply to Nir Soffer from comment #7)
> > What do you mean by "sanlock.service restart"? Please share the exact
> > command line you are using for "restarting" sanlock.
>
> systemctl restart sanlock.service

I discussed this with Federico; this is a known issue in sanlock when working with systemd. Please open a sanlock bug: it should not allow a stop or restart of the sanlock service while sanlock is holding a resource.

However, this operation is not supported, and that should be documented in the RHEV docs; the administrator should not try to stop the sanlock service. If this is not documented, please open a documentation bug.

Currently, when you stop the sanlock service or kill sanlock, the watchdog may reboot your host; this is expected behavior.

> Nir, is there any documentation about the 'sanlock should fail to stop'
> behaviour?

I don't know of any docs; maybe David can add more info.
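For what it's worth, if sanlock really must be stopped for maintenance, a safer flow (a sketch, assuming the host can be emptied of its leases first) is:

  # Put the host into Maintenance from the engine first, so vdsm releases
  # its lockspaces and the SPM lease, then verify nothing is held:
  sanlock client status           # should show no "r" (resource) lines
  systemctl stop sanlock.service  # safe only once no leases are held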
If sanlock refuses to stop, then I believe systemd kills sanlock. This was a new "feature" of systemd that we didn't want. I fixed this quite a long time ago in sanlock's native systemd unit file by adding "SendSIGKILL=no". I don't believe this works with the init-wrapper unit file that sanlock installs by default, but you could try it.

This is one of multiple reasons that the native systemd unit file may be better than the default. In 7.2 I've begun installing the native systemd unit files in /usr/share/doc/sanlock-3.2.4 so that people who have problems with the defaults can try the native ones. For 7.3 I may propose installing the native systemd unit files by default.
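For reference, the relevant part of a native unit file looks roughly like this (an illustrative sketch, not the exact file shipped by any particular sanlock build; the key directive is SendSIGKILL=no):

  # /usr/lib/systemd/system/sanlock.service (sketch)
  [Unit]
  Description=Shared Storage Lease Manager
  After=syslog.target wdmd.service

  [Service]
  Type=forking
  ExecStart=/usr/sbin/sanlock daemon
  # Without this, systemd SIGKILLs the daemon on "systemctl stop/restart"
  # even while it holds leases, which triggers a watchdog reboot:
  SendSIGKILL=no

  [Install]
  WantedBy=multi-user.target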
(In reply to David Teigland from comment #11)
> If sanlock refuses to stop, then I believe systemd kills sanlock. This was
> a new "feature" of systemd that we didn't want. I fixed this quite a long
> time ago in sanlock's native systemd unit file by adding "SendSIGKILL=no".
> I don't believe this works with the init-wrapper unit file that sanlock
> installs by default, but you could try it.
>
> This is one of multiple reasons that the native systemd unit file may be
> better than the default. In 7.2 I've begun installing the native systemd
> unit files in /usr/share/doc/sanlock-3.2.4 so that people who have
> problems with the defaults can try the native ones. For 7.3 I may propose
> installing the native systemd unit files by default.

Thanks David! I think this is the correct solution. I suggest that you open a sanlock bug to track this. We will want to consume the version using the native systemd files when it is available.

Closing for now, as the behavior is expected with the current sanlock version.