Bug 1278369
| Summary: | Wdmd fails to start after an upgrade: "other wdmd not cleanly stopped, shm_open error 17" | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Nir Soffer <nsoffer> |
| Component: | sanlock | Assignee: | David Teigland <teigland> |
| Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.3 | CC: | acanan, agk, alukiano, amureini, bmcclain, bugs, cluster-maint, coughlan, gklein, jbrassow, mavital, msivak, nsoffer, sbonazzo, snagar, stirabos, teigland, tlavigne, ykaul |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1278130 | | |
| : | 1287131 (view as bug list) | Environment: | |
| Last Closed: | 2016-11-04 04:34:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 128713, 1154205, 1285700, 1295577 | | |
| Attachments: | | | |
Description
Nir Soffer
2015-11-05 10:59:22 UTC
Created attachment 1090020 [details]
/var/log/messages and /var/log/sanlock.log
# rpm -qa | grep sanlock
libvirt-lock-sanlock-1.2.17-13.el7.x86_64
sanlock-lib-3.2.4-1.el7.x86_64
sanlock-python-3.2.4-1.el7.x86_64
sanlock-3.2.4-1.el7.x86_64

The issue seems to be here: yum update started at about 18:28

Nov 4 18:28:46 alma05 yum[29969]: Updated: glibc-2.17-105.el7.x86_64
...

sanlock was in the update:

Nov 4 18:29:36 alma05 yum[29969]: Updated: sanlock-3.2.4-1.el7.x86_64

and got stopped here:

Nov 4 18:34:06 alma05 systemd: Stopping Shared Storage Lease Manager...
Nov 4 18:34:06 alma05 systemd-sanlock: Sending stop signal sanlock (615): [ OK ]

but then:

Nov 4 18:35:49 alma05 wdmd[635]: client dead ci 4 fd 11 pid 615 renewal 4716 expire 4796 sanlock_hosted-engine:2
Nov 4 18:35:49 alma05 systemd-wdmd: Sending stop signal wdmd (635): [ OK ]
Nov 4 18:36:00 alma05 systemd-wdmd: Waiting for wdmd (635) to stop:[FAILED]
Nov 4 18:36:00 alma05 systemd: wdmd.service: control process exited, code=exited status=1

and then:

Nov 4 18:37:30 alma05 systemd: wdmd.service stop-sigterm timed out. Killing.
Nov 4 18:37:30 alma05 systemd: wdmd.service: main process exited, code=killed, status=9/KILL
Nov 4 18:37:30 alma05 systemd: Unit wdmd.service entered failed state.
Nov 4 18:37:30 alma05 systemd: wdmd.service failed.
Nov 4 18:37:30 alma05 systemd-wdmd: Starting wdmd: [ OK ]
Nov 4 18:37:30 alma05 wdmd[20480]: wdmd started S0 H1 G179
Nov 4 18:37:30 alma05 wdmd[20480]: other wdmd not cleanly stopped, shm_open error 17

At the end:

[root@alma05 ~]# systemctl status sanlock
● sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled; vendor preset: disabled)
   Active: active (running) since mer 2015-11-04 18:35:49 IST; 18h ago
 Main PID: 20723 (sanlock)

but

[root@alma05 ~]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since mer 2015-11-04 18:37:30 IST; 19h ago
 Main PID: 635 (code=killed, signal=KILL)

sanlock cannot really work without wdmd, so there is probably also a systemd dependency issue.

It looks like sanlock was being used at the time. Is that correct?

I recommend cleanly stopping both sanlock and wdmd before upgrading sanlock, or at least ensure that they are both unused. If they are unused, then I believe that they can be restarted by the package update. If they fail to restart, e.g. because they are being used, then systemd uses SIGKILL on them, which follows the "host reset" recovery path. See bug 1245284 for the same issue.

I hope to switch to the native systemd unit files for sanlock and wdmd in 7.3, which prevent systemd from using SIGKILL.
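For context on why systemd ends up using SIGKILL in the log above: a daemon that holds leases typically catches SIGTERM but refuses to act on it while leases are active, so the unit's stop-sigterm timeout expires and systemd escalates. The following is a minimal illustrative C sketch of that pattern; it is not the actual sanlock or wdmd source, and resources_in_use() is a hypothetical stand-in for the daemon's real lease accounting.

```c
/*
 * Minimal sketch (not the sanlock/wdmd source) of why a lease daemon may
 * outlive SIGTERM: the handler only records the request, and the main loop
 * refuses to exit while leases are active, so systemd's stop timeout
 * eventually escalates to SIGKILL.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested;

static void term_handler(int sig)
{
    (void)sig;
    stop_requested = 1;          /* defer; do not exit from the handler */
}

/* Hypothetical stand-in for "are any lockspaces/leases still held?" */
static int resources_in_use(void)
{
    return 1;                    /* pretend a lease is still active */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = term_handler;
    sigaction(SIGTERM, &sa, NULL);

    for (;;) {
        if (stop_requested && !resources_in_use()) {
            printf("no active leases, exiting cleanly\n");
            break;
        }
        if (stop_requested)
            printf("SIGTERM received but leases active, not exiting\n");
        sleep(1);                /* real daemon work would happen here */
    }
    return 0;
}
```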
(In reply to David Teigland from comment #5)
> It looks like sanlock was being used at the time. Is that correct?
>
> I recommend cleanly stopping both sanlock and wdmd before upgrading sanlock,
> or at least ensure that they are both unused. If they are unused, then I
> believe that they can be restarted by the package update.

I tried to reproduce that sequence: I correctly put the host into maintenance mode from the engine, and then I checked sanlock status and it's still up:

[root@c71het20151028 ~]# systemctl status sanlock
sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 1h 32min ago
  Process: 785 ExecStart=/lib/systemd/systemd-sanlock start (code=exited, status=0/SUCCESS)
 Main PID: 826 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─826 sanlock daemon -U sanlock -G sanlock
           └─827 sanlock daemon -U sanlock -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-sanlock[785]: Starting sanlock: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Shared Storage Lease Manager.

Do you mean that you've reproduced the problem or you can't reproduce the problem?

I assume that putting the host into maintenance mode should make it unused, and that you do this prior to updating packages. So, the sanlock and wdmd daemons should either be stopped or unused, allowing them to be restarted during the package update.

After putting the host into maintenance mode, please run:
systemctl status sanlock
systemctl status wdmd
sanlock status
wdmd -d
pgrep sanlock
pgrep wdmd

Then try a package update. Then check for errors and repeat the commands above.

(In reply to David Teigland from comment #7)
> Do you mean that you've reproduced the problem or you can't reproduce the
> problem?

I reproduced the environment.

> I assume that putting the host into maintenance mode should make it unused,
> and that you do this prior to updating packages. So, the sanlock and wdmd
> daemons should either be stopped or unused, allowing them to be restarted
> during the package update.
>
> After putting the host into maintenance mode, please run:
> systemctl status sanlock
> systemctl status wdmd
> sanlock status
> wdmd -d
> pgrep sanlock
> pgrep wdmd

The host is in maintenance (I put it in maintenance from the engine):

[root@c71het20151028 ~]# systemctl status sanlock
sanlock.service - Shared Storage Lease Manager
   Loaded: loaded (/usr/lib/systemd/system/sanlock.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 3h 4min ago
  Process: 785 ExecStart=/lib/systemd/systemd-sanlock start (code=exited, status=0/SUCCESS)
 Main PID: 826 (sanlock)
   CGroup: /system.slice/sanlock.service
           ├─826 sanlock daemon -U sanlock -G sanlock
           └─827 sanlock daemon -U sanlock -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-sanlock[785]: Starting sanlock: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Shared Storage Lease Manager.

[root@c71het20151028 ~]# systemctl status wdmd
wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled)
   Active: active (running) since Thu 2015-11-05 15:02:53 CET; 3h 4min ago
  Process: 784 ExecStart=/lib/systemd/systemd-wdmd start (code=exited, status=0/SUCCESS)
 Main PID: 833 (wdmd)
   CGroup: /system.slice/wdmd.service
           └─833 wdmd -G sanlock

Nov 05 15:02:53 c71het20151028.localdomain systemd-wdmd[784]: Loading the softdog kernel module: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain systemd-wdmd[784]: Starting wdmd: [ OK ]
Nov 05 15:02:53 c71het20151028.localdomain wdmd[833]: wdmd started S0 H1 G179
Nov 05 15:02:53 c71het20151028.localdomain systemd[1]: Started Watchdog Multiplexing Daemon.
Nov 05 15:02:53 c71het20151028.localdomain wdmd[833]: /dev/watchdog0 armed with fire_timeout 60

[root@c71het20151028 ~]# sanlock status
daemon f4ab3842-de72-4119-beab-37405a06567a.c71het2015
p -1 helper
p -1 listener
p -1 status
p 3091

[root@c71het20151028 ~]# wdmd -d
wdmd 833 socket_gid 179 high_priority 1 now 11058 last_keepalive 11057 last_closeunclean 0 allow_scripts 0 kill_script_sec 0
client 0 name signal pid 0 fd 6 dead 0 ref 0 now 11058 renewal 0 expire 0
client 1 name listen pid 0 fd 7 dead 0 ref 0 now 11058 renewal 0 expire 0
client 2 name dump pid 0 fd 9 dead 0 ref 0 now 11058 renewal 0 expire 0

[root@c71het20151028 ~]# pgrep sanlock
826
827
[root@c71het20151028 ~]# pgrep wdmd
833

but in my case (centos 7.1 with upstream code) the upgrade doesn't pull a sanlock upgrade as it did on the reporter's host (rhel7.2).

Resolving Dependencies
--> Running transaction check
---> Package device-mapper-multipath.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package device-mapper-multipath.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package device-mapper-multipath-libs.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package device-mapper-multipath-libs.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package gmp.x86_64 1:6.0.0-11.el7 will be updated
---> Package gmp.x86_64 1:6.0.0-12.el7_1 will be an update
---> Package ipxe-roms-qemu.noarch 0:20130517-6.gitc4bce43.el7 will be updated
---> Package ipxe-roms-qemu.noarch 0:20130517-7.gitc4bce43.el7 will be an update
---> Package kernel.x86_64 0:3.10.0-229.20.1.el7 will be installed
---> Package kernel-tools.x86_64 0:3.10.0-229.14.1.el7 will be updated
---> Package kernel-tools.x86_64 0:3.10.0-229.20.1.el7 will be an update
---> Package kernel-tools-libs.x86_64 0:3.10.0-229.14.1.el7 will be updated
---> Package kernel-tools-libs.x86_64 0:3.10.0-229.20.1.el7 will be an update
---> Package kpartx.x86_64 0:0.4.9-77.el7_1.1 will be updated
---> Package kpartx.x86_64 0:0.4.9-77.el7_1.2 will be an update
---> Package libcacard-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package libcacard-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package libnl3.x86_64 0:3.2.21-8.el7 will be updated
---> Package libnl3.x86_64 0:3.2.21-9.el7_1 will be an update
---> Package libnl3-cli.x86_64 0:3.2.21-8.el7 will be updated
---> Package libnl3-cli.x86_64 0:3.2.21-9.el7_1 will be an update
---> Package libvirt-client.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-client.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-config-nwfilter.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-config-nwfilter.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-interface.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-interface.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-network.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-network.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-nodedev.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-nodedev.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-nwfilter.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-nwfilter.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-qemu.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-qemu.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-secret.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-secret.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-driver-storage.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-driver-storage.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-daemon-kvm.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-daemon-kvm.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package libvirt-lock-sanlock.x86_64 0:1.2.8-16.el7_1.4 will be updated
---> Package libvirt-lock-sanlock.x86_64 0:1.2.8-16.el7_1.5 will be an update
---> Package nspr.x86_64 0:4.10.8-1.el7_1 will be updated
---> Package nspr.x86_64 0:4.10.8-2.el7_1 will be an update
---> Package nss.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-sysinit.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss-sysinit.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-tools.x86_64 0:3.19.1-5.el7_1 will be updated
---> Package nss-tools.x86_64 0:3.19.1-7.el7_1.2 will be an update
---> Package nss-util.x86_64 0:3.19.1-3.el7_1 will be updated
---> Package nss-util.x86_64 0:3.19.1-4.el7_1 will be an update
---> Package otopi.noarch 0:1.3.2-1.el7.centos will be updated
---> Package otopi.noarch 0:1.4.0-1.el7.centos will be an update
---> Package ovirt-engine-sdk-python.noarch 0:3.5.5.0-1.el7.centos will be updated
---> Package ovirt-engine-sdk-python.noarch 0:3.6.0.3-1.el7.centos will be an update
---> Package ovirt-host-deploy.noarch 0:1.3.2-1.el7.centos will be updated
---> Package ovirt-host-deploy.noarch 0:1.4.0-1.el7.centos will be an update
---> Package ovirt-hosted-engine-ha.noarch 0:1.2.7.2-1.el7.centos will be updated
---> Package ovirt-hosted-engine-ha.noarch 0:1.3.2.1-1.el7.centos will be an update
---> Package ovirt-hosted-engine-setup.noarch 0:1.2.6.1-1.el7.centos will be updated
---> Package ovirt-hosted-engine-setup.noarch 0:1.3.0-1.el7.centos will be an update
--> Processing Dependency: ovirt-setup-lib >= 1.0.0 for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
--> Processing Dependency: socat for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
--> Processing Dependency: ovirt-vmconsole-host for package: ovirt-hosted-engine-setup-1.3.0-1.el7.centos.noarch
---> Package python-lockfile.noarch 1:0.9.1-4.el7 will be updated
---> Package python-lockfile.noarch 1:0.9.1-4.el7.centos will be an update
---> Package qemu-guest-agent.x86_64 10:2.1.0-4.el7 will be updated
---> Package qemu-guest-agent.x86_64 10:2.3.0-4.el7 will be an update
---> Package qemu-img-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-img-ev.x86_64 10:2.3.0-29.1.el7 will be an update
--> Processing Dependency: libtcmalloc.so.4()(64bit) for package: 10:qemu-img-ev-2.3.0-29.1.el7.x86_64
---> Package qemu-kvm-common-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-common-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package qemu-kvm-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package qemu-kvm-tools-ev.x86_64 10:2.1.2-23.el7_1.9.1 will be updated
---> Package qemu-kvm-tools-ev.x86_64 10:2.3.0-29.1.el7 will be an update
---> Package rsyslog.x86_64 0:7.4.7-7.el7_0 will be updated
---> Package rsyslog.x86_64 0:7.4.7-7.el7_1.1 will be an update
---> Package rsyslog-mmjsonparse.x86_64 0:7.4.7-7.el7_0 will be updated
---> Package rsyslog-mmjsonparse.x86_64 0:7.4.7-7.el7_1.1 will be an update
---> Package seabios-bin.noarch 0:1.7.5-8.el7 will be updated
---> Package seabios-bin.noarch 0:1.7.5-11.el7 will be an update
---> Package seavgabios-bin.noarch 0:1.7.5-8.el7 will be updated
---> Package seavgabios-bin.noarch 0:1.7.5-11.el7 will be an update
---> Package selinux-policy.noarch 0:3.13.1-23.el7_1.18 will be updated
---> Package selinux-policy.noarch 0:3.13.1-23.el7_1.21 will be an update
---> Package selinux-policy-targeted.noarch 0:3.13.1-23.el7_1.18 will be updated
---> Package selinux-policy-targeted.noarch 0:3.13.1-23.el7_1.21 will be an update
---> Package vdsm.x86_64 0:4.16.27-0.el7.centos will be updated
---> Package vdsm.noarch 0:4.17.10.1-0.el7.centos will be an update
--> Processing Dependency: vdsm-infra = 4.17.10.1-0.el7.centos for package: vdsm-4.17.10.1-0.el7.centos.noarch
--> Processing Dependency: safelease >= 1.0-5 for package: vdsm-4.17.10.1-0.el7.centos.noarch
--> Processing Dependency: ovirt-vmconsole >= 1.0.0-0 for package: vdsm-4.17.10.1-0.el7.centos.noarch
---> Package vdsm-cli.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-cli.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-jsonrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-jsonrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-python.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-python.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-xmlrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-xmlrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
---> Package vdsm-yajsonrpc.noarch 0:4.16.27-0.el7.centos will be updated
---> Package vdsm-yajsonrpc.noarch 0:4.17.10.1-0.el7.centos will be an update
--> Running transaction check
---> Package gperftools-libs.x86_64 0:2.4-7.el7 will be installed
--> Processing Dependency: libunwind.so.8()(64bit) for package: gperftools-libs-2.4-7.el7.x86_64
---> Package ovirt-setup-lib.noarch 0:1.0.0-1.el7.centos will be installed
---> Package ovirt-vmconsole.noarch 0:1.0.0-1.el7.centos will be installed
---> Package ovirt-vmconsole-host.noarch 0:1.0.0-1.el7.centos will be installed
---> Package safelease.x86_64 0:1.0-5.el7 will be installed
---> Package socat.x86_64 0:1.7.2.2-5.el7 will be installed
---> Package vdsm-infra.noarch 0:4.17.10.1-0.el7.centos will be installed
--> Running transaction check
---> Package libunwind.x86_64 0:1.1-10.el7 will be installed
--> Finished Dependency Resolution

(In reply to Simone Tiraboschi from comment #9)
> but in my case (centos 7.1 with upstream code) the upgrade doesn't pull a
> sanlock upgrade as it did on the reporter's host (rhel7.2).
The original reporter got:

Nov 4 18:29:36 alma05 yum[29969]: Updated: sanlock-3.2.4-1.el7.x86_64

So it doesn't reproduce on my env (centos 7.1):

Installed Packages
sanlock.x86_64 3.2.2-2.el7 @base

> [root@c71het20151028 ~]# sanlock status
> daemon f4ab3842-de72-4119-beab-37405a06567a.c71het2015
> p -1 helper
> p -1 listener
> p -1 status
> p 3091
> [root@c71het20151028 ~]# wdmd -d
> wdmd 833 socket_gid 179 high_priority 1 now 11058 last_keepalive 11057
> last_closeunclean 0 allow_scripts 0 kill_script_sec 0
> client 0 name signal pid 0 fd 6 dead 0 ref 0 now 11058 renewal 0 expire 0
> client 1 name listen pid 0 fd 7 dead 0 ref 0 now 11058 renewal 0 expire 0
> client 2 name dump pid 0 fd 9 dead 0 ref 0 now 11058 renewal 0 expire 0

Both daemons are unused, so restarting them from this state should be fine. Given the various sanlock and wdmd errors I see above, there must have been some problem in moving the host into maintenance mode, i.e. the host was not really in maintenance mode when the update happened. So, the problem was not in updating and restarting sanlock/wdmd, but prior to that, when the host failed to cleanly enter maintenance.

Artyom, do you remember whether that host correctly entered maintenance mode, from what the user can see in the engine web UI?

I put the host into maintenance via the engine, and also checked via hosted-engine --vm-status that the host has state local maintenance. I can provide you details about the second host and the engine; you will see that the host appears under maintenance.

Artyom, can you reproduce this?
1. Have a working setup with an active spm
2. Put the spm host into maintenance
3. yum update -> updating the sanlock package
Expected: sanlock and wdmd restart normally after the update

Have 3.5 HE environment: two hosts, one nfs storage domain, one vm
1) put SPM host to local maintenance
2) host have state maintenance under engine
3) upgrade host from 3.5 to 3.6
4) sanlock up, but

[root@alma05 ~]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-11-10 15:57:51 IST; 2min 44s ago
 Main PID: 20241 (code=exited, status=255)

Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Starting Watchdog Multiplexing Daemon...
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd-wdmd[20234]: Starting wdmd: [ OK ]
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: wdmd started S0 H1 G179
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: could not set RR|RESET_ON_FORK priority 99 err 1
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com wdmd[20241]: other wdmd not cleanly stopped, shm_open error 17
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Started Watchdog Multiplexing Daemon.
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: wdmd.service: main process exited, code=exited, status=255/n/a
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: Unit wdmd.service entered failed state.
Nov 10 15:57:51 alma05.qa.lab.tlv.redhat.com systemd[1]: wdmd.service failed.

(In reply to Artyom from comment #15)
> 2) host have state maintenance under engine

Is the hosted engine vm running at this point? The hosted engine vm is holding a resource on shared storage. You *cannot* upgrade sanlock or wdmd when they maintain a resource.
You must stop the hosted engine vm and shut down the hosted engine agents/broker so nobody is using sanlock *before* you update sanlock. Please contact msivak for info about stopping hosted engine processes for an upgrade.

It should be as simple as service ovirt-ha-agent stop. The broker does not care about sanlock.

(In reply to Martin Sivák from comment #17)
> It should be as simple as service ovirt-ha-agent stop. The broker does not
> care about sanlock.

Martin, the qemu process running the engine vm is holding a resource, right? Will stopping ovirt-ha-agent shut down the engine vm? Or maybe we should migrate this vm to another host before upgrading a host? Do we have a documented procedure for upgrading an HE setup?

Right, if the VM runs locally then it needs to be migrated away too to free the sanlock daemon. Shutting down the agent does not do that automatically. The normal upgrade procedure needed just yum update and a service restart, but it seems it is getting a bit more complicated.

So it seems that the root cause here is upgrading sanlock while a process on the host is holding a resource, which is unsupported. Reducing severity since this usage is unsupported.

Typically this causes the host to reboot, because sanlock is killed and cannot pet the watchdog (via wdmd), so the watchdog reboots the machine.

David, can you explain what happened here on the sanlock/wdmd side? Do you need more information from us on reproducing this failure on the sanlock side?

I was asking the same question. SIGKILL of wdmd should cause the watchdog to reset the host (whether wdmd is in use or not). I'd like to see /var/log/messages from the time that happened.

Artyom, can you provide /var/log/messages showing the timeframe when you upgraded the machine, as you described in comment 15?

David, did you check /var/log/messages in attachment 1090020 [details]?

I didn't notice that, thanks. The messages are concerning:

Nov 4 18:36:40 alma05 wdmd[635]: /dev/watchdog closed unclean
Nov 4 18:36:40 alma05 kernel: hpwdt: Unexpected close, not stopping watchdog!

At this point, the watchdog should fire and reset the machine in 60 seconds, i.e. at 18:37:40. But the machine continues to operate and generate log messages right through that time and beyond. The watchdog seems to have done nothing at all.

Can we run tests on this machine to debug why the watchdog behaves this way? If so, the first thing to try is using the hpwdt: start the wdmd daemon, then kill -9 the wdmd process, and see if the machine is reset in 60 seconds. A second test is to remove hpwdt, modprobe softdog, start wdmd, kill -9 wdmd, and see if the machine is reset.

I've also noticed this in the /var/log/messages:

Nov 4 17:17:15 alma05 kernel: hpwdt 0000:02:00.0: HP Watchdog Timer Driver: NMI decoding initialized, allow kernel dump: ON (default = 0/OFF)
Nov 4 17:17:15 alma05 kernel: hpwdt 0000:02:00.0: HP Watchdog Timer Driver: 1.3.3, timer margin: 30 seconds (nowayout=0).

I'm not familiar with the NMI aspect of watchdog drivers or the "allow kernel dump" feature, both of which seem somewhat dubious. I need to investigate further what these are about, if they subvert the required watchdog behavior, and if they might explain what we're seeing.
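The behavior described above rests on the standard Linux watchdog device API: whoever opens /dev/watchdog must keep petting it, an unclean close leaves the timer armed so the machine resets once the fire timeout expires, and writing the magic 'V' character before close is the defined way to disarm it. Below is a minimal C sketch of that contract for context; it is illustrative only, not wdmd's actual code, and the device path and timings are assumptions. Run it only on a disposable test machine.

```c
/* Minimal sketch of the Linux watchdog API contract that wdmd relies on.
 * Illustrative only; not the wdmd source. An unclean exit of this program
 * will reset the host once the timer expires. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);   /* opening arms the timer */
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }

    int timeout = 60;                           /* like "fire_timeout 60" */
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);

    for (int i = 0; i < 5; i++) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);          /* "pet" the watchdog */
        sleep(10);
    }

    /* Clean shutdown: the magic close. Without writing 'V' before close(),
     * the driver logs something like "Unexpected close, not stopping
     * watchdog!" and the machine resets when the timeout expires. If the
     * process is SIGKILLed, the close happens implicitly with no 'V', so a
     * reset is expected -- unless something else (like the probe described
     * later in this bug) writes 'V' and disarms it first. */
    if (write(fd, "V", 1) != 1)
        perror("magic close write");
    close(fd);
    return 0;
}
```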
(In reply to Artyom from comment #15)
> 2) host have state maintenance under engine

This means that the engine VM has been migrated somewhere else. So, at this point, nothing should prevent the upgrade.

(In reply to Simone Tiraboschi from comment #25)
> (In reply to Artyom from comment #15)
> > 2) host have state maintenance under engine
>
> This means that the engine VM has been migrated somewhere else.
> So, at this point, nothing should prevent the upgrade.

To make sure that nobody is using sanlock at this point, can you reproduce the situation before the upgrade (without upgrading), and share the output of:

for i in `seq 6`; do
    date
    sanlock client status -D
    echo
    sleep 10
done

Created attachment 1092765 [details]
output of script
Output of script, also I have now 3.5 environment, if you want to check additional details, before I will upgrade it.
(In reply to Artyom from comment #27)
> Created attachment 1092765 [details]
> output of script
>
> Output of script, also I have now 3.5 environment, if you want to check
> additional details, before I will upgrade it.

Thanks Artyom. According to the output, sanlock has 2 lockspaces, one for hosted engine and one for vdsm. This is not expected in maintenance mode and needs further investigation. Would you open another hosted engine bug for this? In the other bug:
- Describe exactly what you did to reach this state (engine ui, commands run on the host, etc.)
- Attach the output of this script
- Attach the vdsm log showing the entire flow since hosted engine was installed
- Attach the sanlock log
- Attach /var/log/messages

In reference to a question elsewhere: If you do 'systemctl stop sanlock' while lockspaces exist, sanlock will ignore the SIGTERM and refuse to exit. After some time, systemd then uses SIGKILL, which causes an unclean exit of sanlock. This unclean exit will cause wdmd to stop pinging /dev/watchdog, which will lead to a host reset from the watchdog (if the watchdog driver is working correctly).

There is one alternative way to forcibly shut down sanlock, by running 'sanlock shutdown -f 1'. (systemd does not use this method.) The -f 1 causes the sanlock daemon to:
1) kill any pids using resources in existing lockspaces, then
2) remove/leave those lockspaces when the resource leases are gone (if no processes were using any resources, then the lockspaces are just removed immediately), then
3) exit

Looking at the hpwdt.c source, it's clear that "allow_kdump" stops the watchdog just like a normal stop. This behavior breaks any application like sanlock which depends on controlling the liveness of the watchdog device directly through the defined watchdog API. I will need access to the hardware to test and confirm this. (If confirmed, I will have to add a special check to wdmd to disallow this hpwdt allow_kdump setting.)

Done, sorry that it took so much time; I had a problem with storage: https://bugzilla.redhat.com/show_bug.cgi?id=1282187

The watchdog did not fire when expected because the wdmd restart procedure is unintentionally disabling the watchdog with the "magic V character". When wdmd is started/restarted, the init script first probes all the watchdog devices to find which one works. The probe function opens/closes each device to check it, and before closing, turns off the watchdog with the magic character. This probe turns off the watchdog that's about to fire.

I need to fix the wdmd probe function so that it does not interfere with any watchdog state that might already exist. The wdmd daemon already avoids this using shm, but the probe function is missing this same logic. (When things are running as expected, wdmd should never be closed uncleanly in the first place. If vdsm cleanly enters maintenance mode, this issue will not appear.)
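To illustrate the shm-based check referenced above, and the "shm_open error 17" message that started this bug (error 17 is EEXIST): a daemon can create a named POSIX shared-memory object with O_CREAT|O_EXCL at startup and unlink it only on clean shutdown, so a leftover object on the next start means the previous instance died uncleanly. The sketch below is an approximation under that assumption; the object name and messages are illustrative, not wdmd's actual implementation.

```c
/* Sketch of unclean-stop detection via a POSIX shm object, in the spirit
 * of wdmd's check. Object name and behavior are illustrative only.
 * Build with: cc unclean.c -lrt (the -lrt is needed on older glibc). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define SHM_NAME "/example_wdmd_shm"   /* hypothetical name */

/* Call at daemon startup: creates the marker, or reports a leftover one. */
static int startup_marker(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0 && errno == EEXIST) {
        /* Previous instance never removed its marker, i.e. it was killed
         * or crashed: the counterpart of the log line
         * "other wdmd not cleanly stopped, shm_open error 17". */
        fprintf(stderr, "other instance not cleanly stopped, shm_open error %d\n",
                errno);
        return -1;
    }
    if (fd < 0) {
        fprintf(stderr, "shm_open failed: %s\n", strerror(errno));
        return -1;
    }
    return fd;
}

/* Call only on the clean shutdown path, so an unclean exit leaves evidence. */
static void clean_shutdown_marker(void)
{
    shm_unlink(SHM_NAME);
}

int main(void)
{
    int fd = startup_marker();
    if (fd < 0)
        return 1;
    /* ... daemon main loop would run here ... */
    clean_shutdown_marker();
    return 0;
}
```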
Verified on sanlock-3.2.4-2.el7_2.x86_64

1) Start with host RHEL7.1 - sanlock-3.2.2-2.el7.x86_64
2) Get two sanlock locks:

sanlock client status
daemon c094d046-98f8-4ee6-a714-bca06544be77.rose05.qa.
p -1 helper
p -1 listener
p -1 status
p 32347
s c8b5968c-68e8-4ed6-b309-163fc9e148bb:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/c8b5968c-68e8-4ed6-b309-163fc9e148bb/dom_md/ids:0
s hosted-engine:1:/var/run/vdsm/storage/c8b5968c-68e8-4ed6-b309-163fc9e148bb/24ee7f20-f2eb-4734-9741-ead1eb66bcc2/b209bc7f-3154-430f-b9ef-0d3bd584e589:0

3) Update host to RHEL7.2 - sanlock-3.2.4-2.el7_2.x86_64
4) Check that the sanlock and wdmd services are up after the update

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2356.html