Bug 1278369
| Summary: | Wdmd fails to start after an upgrade: "other wdmd not cleanly stopped, shm_open error 17" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nir Soffer <nsoffer> |
| Component: | sanlock | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.3 | CC: | acanan, agk, alukiano, amureini, bmcclain, bugs, cluster-maint, coughlan, gklein, jbrassow, mavital, msivak, nsoffer, sbonazzo, snagar, stirabos, teigland, tlavigne, ykaul |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1278130 | | |
| : | 1287131 (view as bug list) | Environment: | |
| Last Closed: | 2016-11-04 04:34:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 128713, 1154205, 1285700, 1295577 | | |
| Attachments: | | | |
Description
Nir Soffer
2015-11-05 10:59:22 UTC
Created attachment 1090020 [details]
/var/log/messages and /var/log/sanlock.log
# rpm -qa | grep sanlock
libvirt-lock-sanlock-1.2.17-13.el7.x86_64
sanlock-lib-3.2.4-1.el7.x86_64
sanlock-python-3.2.4-1.el7.x86_64
sanlock-3.2.4-1.el7.x86_64

The issue seems to be here: yum update started at about 18:28

Nov 4 18:28:46 alma05 yum[29969]: Updated: glibc-2.17-105.el7.x86_64
...

sanlock was in the update:

Nov 4 18:29:36 alma05 yum[29969]: Updated: sanlock-3.2.4-1.el7.x86_64

and got stopped here:

Nov 4 18:34:06 alma05 systemd: Stopping Shared Storage Lease Manager...
Nov 4 18:34:06 alma05 systemd-sanlock: Sending stop signal sanlock (615): [ OK ]

but then:

Nov 4 18:35:49 alma05 wdmd[635]: client dead ci 4 fd 11 pid 615 renewal 4716 expire 4796 sanlock_hosted-engine:2
Nov 4 18:35:49 alma05 systemd-wdmd: Sending stop signal wdmd (635): [ OK ]
Nov 4 18:36:00 alma05 systemd-wdmd: Waiting for wdmd (635) to stop:[FAILED]
Nov 4 18:36:00 alma05 systemd: wdmd.service: control process exited, code=exited status=1

and then:

Nov 4 18:37:30 alma05 systemd: wdmd.service stop-sigterm timed out. Killing.
Nov 4 18:37:30 alma05 systemd: wdmd.service: main process exited, code=killed, status=9/KILL
Nov 4 18:37:30 alma05 systemd: Unit wdmd.service entered failed state.
Nov 4 18:37:30 alma05 systemd: wdmd.service failed.
Nov 4 18:37:30 alma05 systemd-wdmd: Starting wdmd: [ OK ]
Nov 4 18:37:30 alma05 wdmd[20480]: wdmd started S0 H1 G179
Nov 4 18:37:30 alma05 wdmd[20480]: other wdmd not cleanly stopped, shm_open error 17

At the end:

[root@alma05 ~]# systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled; vendor preset: disabled)
   Active: active (running) since mer 2015-11-04 18:35:49 IST; 18h ago
 Main PID: 20723 (sanlock)

but

[root@alma05 ~]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since mer 2015-11-04 18:37:30 IST; 19h ago
 Main PID: 635 (code=killed, signal=KILL)

sanlock cannot really work without wdmd, so there is probably also a systemd dependency issue.

It looks like sanlock was being used at the time. Is that correct?

I recommend cleanly stopping both sanlock and wdmd before upgrading sanlock, or at least ensure that they are both unused. If they are unused, then I believe that they can be restarted by the package update. If they fail to restart, e.g. because they are being used, then systemd uses SIGKILL on them, which follows the "host reset" recovery path. See bug 1245284 for the same issue.

I hope to switch to the native systemd unit files for sanlock and wdmd in 7.3, which prevent systemd from using SIGKILL.
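For context on why systemd ends up using SIGKILL in the log above: a daemon that holds leases typically catches SIGTERM but refuses to act on it while leases are active, so the unit's stop-sigterm timeout expires and systemd escalates. The following is a minimal illustrative C sketch of that pattern; it is not the actual sanlock or wdmd source, and resources_in_use() is a hypothetical stand-in for the daemon's real lease accounting.

```c
/*
 * Minimal sketch (not the sanlock/wdmd source) of why a lease daemon may
 * outlive SIGTERM: the handler only records the request, and the main loop
 * refuses to exit while leases are active, so systemd's stop timeout
 * eventually escalates to SIGKILL.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested;

static void term_handler(int sig)
{
    (void)sig;
    stop_requested = 1;          /* defer; do not exit from the handler */
}

/* Hypothetical stand-in for "are any lockspaces/leases still held?" */
static int resources_in_use(void)
{
    return 1;                    /* pretend a lease is still active */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = term_handler;
    sigaction(SIGTERM, &sa, NULL);

    for (;;) {
        if (stop_requested && !resources_in_use()) {
            printf("no active leases, exiting cleanly\n");
            break;
        }
        if (stop_requested)
            printf("SIGTERM received but leases active, not exiting\n");
        sleep(1);                /* real daemon work would happen here */
    }
    return 0;
}
```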
(In reply to David Teigland from comment #5)
> It looks like sanlock was being used at the time. Is that correct?
>
> I recommend cleanly stopping both sanlock and wdmd before upgrading sanlock,
> or at least ensure that they are both unused. If they are unused, then I
> believe that they can be restarted by the package update.

I tried to reproduce that sequence: I correctly put the host into maintenance mode from the engine, and then I checked sanlock status and it's still up:

[root@c71het20151028 ~]# systemctl status sanlock
sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 1h 32min ago
  Process: 785 ExecStart=/lib/systemd/systemd-sanlock start (code=exited, status=0/SUCCESS)
 Main PID: 826 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─826 sanlock daemon -U sanlock -G sanlock
           └─827 sanlock daemon -U sanlock -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-sanlock[785]: Starting sanlock: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Shared Storage Lease Manager.

Do you mean that you've reproduced the problem or you can't reproduce the problem?

I assume that putting the host into maintenance mode should make it unused, and that you do this prior to updating packages. So, the sanlock and wdmd daemons should either be stopped or unused, allowing them to be restarted during the package update.

After putting the host into maintenance mode, please run:
systemctl status sanlock
systemctl status wdmd
sanlock status
wdmd -d
pgrep sanlock
pgrep wdmd

Then try a package update. Then check for errors and repeat the commands above.

(In reply to David Teigland from comment #7)
> Do you mean that you've reproduced the problem or you can't reproduce the
> problem?

I reproduced the environment.

> I assume that putting the host into maintenance mode should make it unused,
> and that you do this prior to updating packages. So, the sanlock and wdmd
> daemons should either be stopped or unused, allowing them to be restarted
> during the package update.
>
> After putting the host into maintenance mode, please run:
> systemctl status sanlock
> systemctl status wdmd
> sanlock status
> wdmd -d
> pgrep sanlock
> pgrep wdmd

The host is in maintenance (I put it in maintenance from the engine):

[root@c71het20151028 ~]# systemctl status sanlock
sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 3h 4min ago
  Process: 785 ExecStart=/lib/systemd/systemd-sanlock start (code=exited, status=0/SUCCESS)
 Main PID: 826 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─826 sanlock daemon -U sanlock -G sanlock
           └─827 sanlock daemon -U sanlock -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-sanlock[785]: Starting sanlock: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Shared Storage Lease Manager.

[root@c71het20151028 ~]# systemctl status wdmd
wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 3h 4min ago
  Process: 784 ExecStart=/lib/systemd/systemd-wdmd start (code=exited, status=0/SUCCESS)
 Main PID: 833 (wdmd)
   CGroup: /system.slice/wdmd.service
           └─833 wdmd -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-wdmd[784]: Loading the softdog kernel module: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd-wdmd[784]: Starting wdmd: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain wdmd[833]: wdmd started S0 H1 G179
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Watchdog Multiplexing Daemon.
Nov 05 15:02:53 c71het20151028.localdomain wdmd[833]: /dev/watchdog0 armed with fire_timeout 60

[root@c71het20151028 ~]# sanlock status
daemon f4ab3842-de72-4119-beab-37405a06567a.c71het2015
p -1 helper
p -1 listener
p -1 status
p 3091

[root@c71het20151028 ~]# wdmd -d
wdmd 833 socket_gid 179 high_priority 1 now 11058 last_keepalive 11057 last_closeunclean 0 allow_scripts 0 kill_script_sec 0
client 0 name signal pid 0 fd 6 dead 0 ref 0 now 11058 renewal 0 expire 0
client 1 name listen pid 0 fd 7 dead 0 ref 0 now 11058 renewal 0 expire 0
client 2 name dump pid 0 fd 9 dead 0 ref 0 now 11058 renewal 0 expire 0

[root@c71het20151028 ~]# pgrep sanlock
826
827
[root@c71het20151028 ~]# pgrep wdmd
833

but in my case (centos 7.1 with upstream code) the upgrade doesn't pull a sanlock upgrade as it did on the reporter's host (rhel7.2).

Resolving Dependencies
--> Running transaction check
---> Package device-mapper-multipath.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package device-mapper-multipath.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package device-mapper-multipath-libs.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package device-mapper-multipath-libs.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package gmp.x86_64 1:6.0.0-11.el7 will be updated
---> Package gmp.x86_64 1:6.0.0-12.el7_1 will be an update
---> Package ipxe-roms-qemu.noarch 0:20130517-6.gitc4bce43.el7 will be updated
---> Package ipxe-roms-qemu.noarch 0:20130517-7.gitc4bce43.el7 will be an update
---> Package kernel.x86_64 0:3.10.0-229.20.1.el7 will be installed
---> Package kernel-tools.x86_64 0:3.10.0-229.14.1.el7 will be updated
---> Package kernel-tools.x86_64 0:3.10.0-229.20.1.el7 will be an update
---> Package kernel-tools-libs.x86_64 0:3.10.0-229.14.1.el7 will be updated
---> Package kernel-tools-libs.x86_64 0:3.10.0-229.20.1.el7 will be an update
---> Package kpartx.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package kpartx.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package libcacard-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package libcacard-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package libnl3.x86_64 0:3.2.21-8.el7 will be updated
---> Package libnl3.x86_64 0:3.2.21-9.el7_1 will be an update
---> Package libnl3-cli.x86_64 0:3.2.21-8.el7 will be updated
---> Package libnl3-cli.x86_64 0:3.2.21-9.el7_1 will be an update
---> Package libvirt-client.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-client.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-config-nwfilter.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-config-nwfilter.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-interface.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-interface.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-network.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-network.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-nodedev.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-nodedev.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-nwfilter.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-nwfilter.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-qemu.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-qemu.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-secret.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-secret.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-storage.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-storage.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-kvm.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-kvm.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-lock-sanlock.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-lock-sanlock.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package nspr.x86_64 0:4.10.8-1.el7_1 will be updated
---> Package nspr.x86_64 0:4.10.8-2.el7_1 will be an update
---> Package nss.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-sysinit.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss-sysinit.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-tools.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss-tools.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-util.x86_64 0:3.19.1-3.el7_1 will be updated
---> Package nss-util.x86_64 0:3.19.1-4.el7_1 will be an update
---> Package otopi.noarch 0:1.3.2-1.el7.centos will be updated
---> Package otopi.noarch 0:1.4.0-1.el7.centos will be an update
---> Package ovirt-engine-sdk-python.noarch 0:3.5.5.0-1.el7.centos will be updated
---> Package ovirt-engine-sdk-python.noarch 0:3.6.0.3-1.el7.centos will be an update
---> Package ovirt-host-deploy.noarch 0:1.3.2-1.el7.centos will be updated
---> Package ovirt-host-deploy.noarch 0:1.4.0-1.el7.centos will be an update
---> Package ovirt-hosted-engine-ha.noarch 0:1.2.7.2-1.el7.centos will be updated
---> Package ovirt-hosted-engine-ha.noarch 0:1.3.2.1-1.el7.centos will be an update
---> Package ovirt-hosted-engine-setup.noarch 0:1.2.6.1-1.el7.centos will be updated
---> Package ovirt-hosted-engine-setup.noarch 0:1.3.0-1.el7.centos will be an update
--> Processing Dependency: ovirt-setup-lib >= 1.0.0 for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
--> Processing Dependency: socat for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
--> Processing Dependency: ovirt-vmconsole-host for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
---> Package python-lockfile.noarch 1:0.9.1-4.el7 will be updated
---> Package python-lockfile.noarch 1:0.9.1-4.el7.centos will be an update
---> Package qemu-guest-agent.x86_64 10:2.1.0-4.el7 will be updated
---> Package qemu-guest-agent.x86_64 10:2.3.0-4.el7 will be an update
---> Package qemu-img-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-img-ev.x86_64 10:2.3.0-29.1.el7 will be an update
--> Processing Dependency: libtcmalloc.so.4()(64bit) for package: 10:qemu-img-ev-2.3.0-29.1.el7.x86_64
---> Package qemu-kvm-common-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-common-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package qemu-kvm-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package qemu-kvm-tools-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-tools-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package rsyslog.x86_64 0:7.4.7-7.el7_0 will be updated
---> Package rsyslog.x86_64 0:7.4.7-7.el7_1.1 will be an update
---> Package rsyslog-mmjsonparse.x86_64 0:7.4.7-7.el7_0 will be updated
---> Package rsyslog-mmjsonparse.x86_64 0:7.4.7-7.el7_1.1 will be an update
---> Package seabios-bin.noarch 0:1.7.5-8.el7 will be updated
---> Package seabios-bin.noarch 0:1.7.5-11.el7 will be an update
---> Package seavgabios-bin.noarch 0:1.7.5-8.el7 will be updated
---> Package seavgabios-bin.noarch 0:1.7.5-11.el7 will be an update
---> Package selinux-policy.noarch 0:3.13.1-23.el7_1.18 will be updated
---> Package selinux-policy.noarch 0:3.13.1-23.el7_1.21 will be an update
---> Package selinux-policy-targeted.noarch 0:3.13.1-23.el7_1.18 will be updated
---> Package selinux-policy-targeted.noarch 0:3.13.1-23.el7_1.21 will be an update
---> Package vdsm.x86_64 0:4.16.27-0.el7.centos will be updated
---> Package vdsm.noarch 0:4.17.10.1-0.el7.centos will be an update
--> Processing Dependency: vdsm-infra = 4.17.10.1-0.el7.centos for package: vdsm-4.17.10.1-0.el7.centos.noarch
--> Processing Dependency: safelease >= 1.0-5 for package: vdsm-4.17.10.1-0.el7.centos.noarch
--> Processing Dependency: ovirt-vmconsole >= 1.0.0-0 for package: vdsm-4.17.10.1-0.el7.centos.noarch
---> Package vdsm-cli.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-cli.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-jsonrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-jsonrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-python.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-python.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-xmlrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-xmlrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-yajsonrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-yajsonrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
--> Running transaction check
---> Package gperftools-libs.x86_64 0:2.4-7.el7 will be installed
--> Processing Dependency: libunwind.so.8()(64bit) for package: gperftools-libs-2.4-7.el7.x86_64
---> Package ovirt-setup-lib.noarch 0:1.0.0-1.el7.centos will be installed
---> Package ovirt-vmconsole.noarch 0:1.0.0-1.el7.centos will be installed
---> Package ovirt-vmconsole-host.noarch 0:1.0.0-1.el7.centos will be installed
---> Package safelease.x86_64 0:1.0-5.el7 will be installed
---> Package socat.x86_64 0:1.7.2.2-5.el7 will be installed
---> Package vdsm-infra.noarch 0:4.17.10.1-0.el7.centos will be installed
--> Running transaction check
---> Package libunwind.x86_64 0:1.1-10.el7 will be installed
--> Finished Dependency Resolution

(In reply to Simone Tiraboschi from comment #9)
> but in my case (centos 7.1 with upstream code) the upgrade doesn't pull a
> sanlock upgrade as it did on the reporter's host (rhel7.2).
The original reporter got:

Nov 4 18:29:36 alma05 yum[29969]: Updated: sanlock-3.2.4-1.el7.x86_64

So it doesn't reproduce on my env (centos 7.1):

Installed Packages
sanlock.x86_64 3.2.2-2.el7 @base

> [root@c71het20151028 ~]# sanlock status
> daemon f4ab3842-de72-4119-beab-37405a06567a.c71het2015
> p -1 helper
> p -1 listener
> p -1 status
> p 3091
> [root@c71het20151028 ~]# wdmd -d
> wdmd 833 socket_gid 179 high_priority 1 now 11058 last_keepalive 11057
> last_closeunclean 0 allow_scripts 0 kill_script_sec 0
> client 0 name signal pid 0 fd 6 dead 0 ref 0 now 11058 renewal 0 expire 0
> client 1 name listen pid 0 fd 7 dead 0 ref 0 now 11058 renewal 0 expire 0
> client 2 name dump pid 0 fd 9 dead 0 ref 0 now 11058 renewal 0 expire 0

Both daemons are unused, so restarting them from this state should be fine. Given the various sanlock and wdmd errors I see above, there must have been some problem in moving the host into maintenance mode, i.e. the host was not really in maintenance mode when the update happened. So, the problem was not in updating and restarting sanlock/wdmd, but prior to that, when the host failed to cleanly enter maintenance.

Artyom, do you remember whether that host correctly entered maintenance mode, from what the user can see in the engine web UI?

I put the host into maintenance via the engine, and also checked via hosted-engine --vm-status that the host has state local maintenance. I can provide you details about the second host and the engine; you will see that the host appears under maintenance.

Artyom, can you reproduce this?
1. Have a working setup with an active spm
2. Put the spm host into maintenance
3. yum update -> updating the sanlock package
Expected: sanlock and wdmd restart normally after the update

Have 3.5 HE environment: two hosts, one nfs storage domain, one vm
1) put SPM host to local maintenance
2) host have state maintenance under engine
3) upgrade host from 3.5 to 3.6
4) sanlock up, but

[root@alma05 ~]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-11-10 15:57:51 IST; 2min 44s ago
 Main PID: 20241 (code=exited, status=255)

Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Starting Watchdog Multiplexing Daemon...
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd-wdmd[20234]: Starting wdmd: [ OK ]
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: wdmd started S0 H1 G179
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: could not set RR|RESET_ON_FORK priority 99 err 1
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: other wdmd not cleanly stopped, shm_open error 17
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Started Watchdog Multiplexing Daemon.
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: wdmd.service: main process exited, code=exited, status=255/n/a
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Unit wdmd.service entered failed state.
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: wdmd.service failed.

(In reply to Artyom from comment #15)
> 2) host have state maintenance under engine

Is the hosted engine vm running at this point? The hosted engine vm is holding a resource on shared storage. You *cannot* upgrade sanlock or wdmd when they maintain a resource.
You must stop the hosted engine vm and shut down the hosted engine agents/broker so nobody is using sanlock *before* you update sanlock. Please contact msivak for info about stopping hosted engine processes for an upgrade.

It should be as simple as service ovirt-ha-agent stop. The broker does not care about sanlock.

(In reply to Martin Sivák from comment #17)
> It should be as simple as service ovirt-ha-agent stop. The broker does not
> care about sanlock.

Martin, the qemu process running the engine vm is holding a resource, right? Will stopping ovirt-ha-agent shut down the engine vm? Or maybe we should migrate this vm to another host before upgrading a host? Do we have a documented procedure for upgrading an HE setup?

Right, if the VM runs locally then it needs to be migrated away too to free the sanlock daemon. Shutting down the agent does not do that automatically. The normal upgrade procedure needed just yum update and a service restart, but it seems it is getting a bit more complicated.

So it seems that the root cause here is upgrading sanlock while a process on the host is holding a resource, which is unsupported. Reducing severity since this usage is unsupported.

Typically this causes the host to reboot, because sanlock is killed and cannot pet the watchdog (via wdmd), so the watchdog reboots the machine.

David, can you explain what happened here on the sanlock/wdmd side? Do you need more information from us on reproducing this failure on the sanlock side?

I was asking the same question. SIGKILL of wdmd should cause the watchdog to reset the host (whether wdmd is in use or not). I'd like to see /var/log/messages from the time that happened.

Artyom, can you provide /var/log/messages showing the timeframe when you upgraded the machine, as you described in comment 15?

David, did you check /var/log/messages in attachment 1090020 [details]?

I didn't notice that, thanks. The messages are concerning:

Nov 4 18:36:40 alma05 wdmd[635]: /dev/watchdog closed unclean
Nov 4 18:36:40 alma05 kernel: hpwdt: Unexpected close, not stopping watchdog!

At this point, the watchdog should fire and reset the machine in 60 seconds, i.e. at 18:37:40. But the machine continues to operate and generate log messages right through that time and beyond. The watchdog seems to have done nothing at all.

Can we run tests on this machine to debug why the watchdog behaves this way? If so, the first thing to try is using the hpwdt: start the wdmd daemon, then kill -9 the wdmd process, and see if the machine is reset in 60 seconds. A second test is to remove hpwdt, modprobe softdog, start wdmd, kill -9 wdmd, and see if the machine is reset.

I've also noticed this in the /var/log/messages:

Nov 4 17:17:15 alma05 kernel: hpwdt 0000:02:00.0: HP Watchdog Timer Driver: NMI decoding initialized, allow kernel dump: ON (default = 0/OFF)
Nov 4 17:17:15 alma05 kernel: hpwdt 0000:02:00.0: HP Watchdog Timer Driver: 1.3.3, timer margin: 30 seconds (nowayout=0).

I'm not familiar with the NMI aspect of watchdog drivers or the "allow kernel dump" feature, both of which seem somewhat dubious. I need to investigate further what these are about, if they subvert the required watchdog behavior, and if they might explain what we're seeing.
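The behavior described above rests on the standard Linux watchdog device API: whoever opens /dev/watchdog must keep petting it, an unclean close leaves the timer armed so the machine resets once the fire timeout expires, and writing the magic 'V' character before close is the defined way to disarm it. Below is a minimal C sketch of that contract for context; it is illustrative only, not wdmd's actual code, and the device path and timings are assumptions. Run it only on a disposable test machine.

```c
/* Minimal sketch of the Linux watchdog API contract that wdmd relies on.
 * Illustrative only; not the wdmd source. An unclean exit of this program
 * will reset the host once the timer expires. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);   /* opening arms the timer */
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 60;                           /* like "fire_timeout 60" */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (int i = 0; i < 5; i++) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);          /* "pet" the watchdog */
        sleep(10);
    }

    /* Clean shutdown: the magic close. Without writing 'V' before close(),
     * the driver logs something like "Unexpected close, not stopping
     * watchdog!" and the machine resets when the timeout expires. If the
     * process is SIGKILLed, the close happens implicitly with no 'V', so a
     * reset is expected -- unless something else (like the probe described
     * later in this bug) writes 'V' and disarms it first. */
    if (write(fd, "V", 1) != 1)
        perror("magic close write");
    close(fd);
    return 0;
}
```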
(In reply to Artyom from comment #15)
> 2) host have state maintenance under engine

This means that the engine VM has been migrated somewhere else. So, at this point, nothing should prevent the upgrade.

(In reply to Simone Tiraboschi from comment #25)
> (In reply to Artyom from comment #15)
> > 2) host have state maintenance under engine
>
> This means that the engine VM has been migrated somewhere else.
> So, at this point, nothing should prevent the upgrade.

To make sure that nobody is using sanlock at this point, can you reproduce the situation before the upgrade (without upgrading), and share the output of:

for i in `seq 6`; do
    date
    sanlock client status -D
    echo
    sleep 10
done

Created attachment 1092765 [details]
output of script
Output of script, also I have now 3.5 environment, if you want to check additional details, before I will upgrade it.
(In reply to Artyom from comment #27)
> Created attachment 1092765 [details]
> output of script
>
> Output of script, also I have now 3.5 environment, if you want to check
> additional details, before I will upgrade it.

Thanks Artyom. According to the output, sanlock has 2 lockspaces, one for hosted engine and one for vdsm. This is not expected in maintenance mode and needs further investigation. Would you open another hosted engine bug for this? In the other bug:
- Describe exactly what you did to reach this state (engine ui, commands run on the host, etc.)
- Attach the output of this script
- Attach the vdsm log showing the entire flow since hosted engine was installed
- Attach the sanlock log
- Attach /var/log/messages

In reference to a question elsewhere: If you do 'systemctl stop sanlock' while lockspaces exist, sanlock will ignore the SIGTERM and refuse to exit. After some time, systemd then uses SIGKILL, which causes an unclean exit of sanlock. This unclean exit will cause wdmd to stop pinging /dev/watchdog, which will lead to a host reset from the watchdog (if the watchdog driver is working correctly).

There is one alternative way to forcibly shut down sanlock, by running 'sanlock shutdown -f 1'. (systemd does not use this method.) The -f 1 causes the sanlock daemon to:
1) kill any pids using resources in existing lockspaces, then
2) remove/leave those lockspaces when the resource leases are gone (if no processes were using any resources, then the lockspaces are just removed immediately), then
3) exit

Looking at the hpwdt.c source, it's clear that "allow_kdump" stops the watchdog just like a normal stop. This behavior breaks any application like sanlock which depends on controlling the liveness of the watchdog device directly through the defined watchdog API. I will need access to the hardware to test and confirm this. (If confirmed, I will have to add a special check to wdmd to disallow this hpwdt allow_kdump setting.)

Done, sorry that it took so much time; I had a problem with storage: https://bugzilla.redhat.com/show_bug.cgi?id=1282187

The watchdog did not fire when expected because the wdmd restart procedure is unintentionally disabling the watchdog with the "magic V character". When wdmd is started/restarted, the init script first probes all the watchdog devices to find which one works. The probe function opens/closes each device to check it, and before closing, turns off the watchdog with the magic character. This probe turns off the watchdog that's about to fire.

I need to fix the wdmd probe function so that it does not interfere with any watchdog state that might already exist. The wdmd daemon already avoids this using shm, but the probe function is missing this same logic. (When things are running as expected, wdmd should never be closed uncleanly in the first place. If vdsm cleanly enters maintenance mode, this issue will not appear.)
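To illustrate the shm-based check referenced above, and the "shm_open error 17" message that started this bug (error 17 is EEXIST): a daemon can create a named POSIX shared-memory object with O_CREAT|O_EXCL at startup and unlink it only on clean shutdown, so a leftover object on the next start means the previous instance died uncleanly. The sketch below is an approximation under that assumption; the object name and messages are illustrative, not wdmd's actual implementation.

```c
/* Sketch of unclean-stop detection via a POSIX shm object, in the spirit
 * of wdmd's check. Object name and behavior are illustrative only.
 * Build with: cc unclean.c -lrt (the -lrt is needed on older glibc). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define SHM_NAME "/example_wdmd_shm"   /* hypothetical name */

/* Call at daemon startup: creates the marker, or reports a leftover one. */
static int startup_marker(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0 && errno == EEXIST) {
        /* Previous instance never removed its marker, i.e. it was killed
         * or crashed: the counterpart of the log line
         * "other wdmd not cleanly stopped, shm_open error 17". */
        fprintf(stderr, "other instance not cleanly stopped, shm_open error %d\n",
                errno);
        return -1;
    }
    if (fd < 0) {
        fprintf(stderr, "shm_open failed: %s\n", strerror(errno));
        return -1;
    }
    return fd;
}

/* Call only on the clean shutdown path, so an unclean exit leaves evidence. */
static void clean_shutdown_marker(void)
{
    shm_unlink(SHM_NAME);
}

int main(void)
{
    int fd = startup_marker();
    if (fd < 0)
        return 1;
    /* ... daemon main loop would run here ... */
    clean_shutdown_marker();
    return 0;
}
```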
Verified on sanlock-3.2.4-2.el7_2.x86_64

1) Start with host RHEL7.1 - sanlock-3.2.2-2.el7.x86_64
2) Get two sanlock locks:

sanlock client status
daemon c094d046-98f8-4ee6-a714-bca06544be77.rose05.qa.
p -1 helper
p -1 listener
p -1 status
p 32347
s c8b5968c-68e8-4ed6-b309-163fc9e148bb:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/c8b5968c-68e8-4ed6-b309-163fc9e148bb/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/c8b5968c-68e8-4ed6-b309-163fc9e148bb/24ee7f20-f2eb-4734-9741-ead1eb66bcc2/b209bc7f-3154-430f-b9ef-0d3bd584e589:0

3) Update host to RHEL7.2 - sanlock-3.2.4-2.el7_2.x86_64
4) Check that the sanlock and wdmd services are up after the update

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2356.html