Bug 1488338 - SPM host is not moving to Non-Operational status when blocking its access to the storage domain.
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Maor
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks: 1518562
 
Reported: 2017-09-05 08:01 UTC by Nisim Simsolo
Modified: 2018-01-01 13:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-01 13:12:50 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
engine log (5.98 MB, text/plain) - 2017-09-05 08:07 UTC, Nisim Simsolo
SPM host vdsm.log (3.24 MB, text/plain) - 2017-09-05 08:08 UTC, Nisim Simsolo
non SPM host vdsm.log (7.89 MB, text/plain) - 2017-09-05 08:08 UTC, Nisim Simsolo
Events view (145.49 KB, image/png) - 2017-11-28 21:06 UTC, Martin Perina
engine logs - new (678.16 KB, application/x-xz) - 2017-11-28 21:07 UTC, Martin Perina
dev-18 logs new (268.56 KB, application/x-xz) - 2017-11-28 21:07 UTC, Martin Perina
dev-21 logs new (338.07 KB, application/x-xz) - 2017-11-28 21:08 UTC, Martin Perina
sanlock log (652.38 KB, text/plain) - 2017-11-29 08:36 UTC, Nisim Simsolo

Description Nisim Simsolo 2017-09-05 08:01:48 UTC
Description of problem:
When blocking the SPM host's NFS access using iptables, the host status in webadmin cycles repeatedly through Connecting -> Initializing -> Non Responsive.
In this case the host should move to Non-Operational status.
When blocking a non-SPM host the same way, its status changes to Non-Operational as expected.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170903205106.gitb17261a.el7.centos
sanlock-3.5.0-1.el7.x86_64
vdsm-4.20.3-8.gitd4eb30e.el7.centos.x86_64
libvirt-client-3.2.0-14.el7.x86_64
qemu-kvm-ev-2.6.0-28.el7.10.1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have 2 hosts under the same cluster.
2. Block the SPM host's access to storage with an iptables rule (for example: iptables -I INPUT -s <NFS IP> -j DROP; see the sketch below)
3. Observe host status.
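
For convenience, a minimal reproduction helper for step 2 (an illustrative sketch only, not part of the original report; NFS_SERVER_IP is a placeholder, and the script assumes it runs as root on the SPM host):

import subprocess

NFS_SERVER_IP = "192.0.2.10"  # placeholder: replace with the NFS storage server IP

def block_nfs():
    # Step 2: drop all inbound traffic from the NFS server on the SPM host.
    subprocess.check_call(["iptables", "-I", "INPUT", "-s", NFS_SERVER_IP, "-j", "DROP"])

def unblock_nfs():
    # Cleanup: remove the rule to restore storage access after the test.
    subprocess.check_call(["iptables", "-D", "INPUT", "-s", NFS_SERVER_IP, "-j", "DROP"])

if __name__ == "__main__":
    block_nfs()
    print("NFS access blocked; watch the SPM host status in webadmin (step 3).")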

Actual results:
The host status changes repeatedly: Connecting -> Initializing -> Non Responsive.

Expected results:
Host status should change to Non-Operational.

Additional info:
vdsm.log and engine.log (2017-09-05 10:35:42,051+03) are attached.

Comment 1 Nisim Simsolo 2017-09-05 08:07:23 UTC
Created attachment 1322107 [details]
engine log

Comment 2 Nisim Simsolo 2017-09-05 08:08:12 UTC
Created attachment 1322108 [details]
SPM host vdsm.log

Comment 3 Nisim Simsolo 2017-09-05 08:08:57 UTC
Created attachment 1322109 [details]
non SPM host vdsm.log

Comment 4 Allon Mureinik 2017-09-11 07:09:49 UTC
I wonder if we broke something with the changes in the host acquisition flow.

Comment 5 Maor 2017-09-12 22:29:47 UTC
(In reply to Allon Mureinik from comment #4)
> I wonder if we broke something with the changes in the host acquisition flow.

It should not be related, since the recent changes concerned attach/detach and force remove; this scenario is more likely related to the monitoring of the host.

From the logs it looks like the connection between the host and the engine is timing out, which would explain the Non-Responsive status.
The engine sends GetStatsVDS to the host and gets "Heartbeat exceeded" every time (see [1]).

I found a very similar bug that describes the same reproduction steps and the same non-responsive behavior, also caused by "Heartbeat exceeded":
   https://bugzilla.redhat.com/1326009

[1]

2017-09-05 10:44:37,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-47) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded

2017-09-05 10:49:22,686+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-3) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded

2017-09-05 10:51:49,951+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-22) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded

2017-09-05 10:54:17,835+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:54:17,836+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [] Command 'GetStatsVDSCommand(HostName = host_mixed_2, VdsIdAndVdsVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b', vds='Host[host_mixed_2,d8750c3a-47b0-4664-89aa-bcaedff85b3b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

2017-09-05 10:56:46,274+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-13) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded

2017-09-05 10:59:15,212+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-35) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded

2017-09-05 11:01:44,352+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-7) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 11:01:44,353+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-7) [] Command 'GetStatsVDSCommand(HostName = host_mixed_2, VdsIdAndVdsVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b', vds='Host[host_mixed_2,d8750c3a-47b0-4664-89aa-bcaedff85b3b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

2017-09-05 11:02:55,747+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [50e59616] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_2, VdsIdVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
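
For reference, a minimal log-scanning sketch (not part of the original analysis) that counts these failures in the attached engine log; the local file path is an assumption:

import re
from collections import Counter

# Matches audit lines such as:
#   "VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded"
PATTERN = re.compile(r"VDSM (\S+) command (\w+) failed: Heartbeat exceeded")

counts = Counter()
with open("engine.log") as log:  # assumed local copy of the attached engine log
    for line in log:
        match = PATTERN.search(line)
        if match:
            counts[match.groups()] += 1

for (host, command), n in counts.most_common():
    print("%s %s: %d heartbeat failures" % (host, command, n))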

Comment 6 Allon Mureinik 2017-09-13 08:31:06 UTC
(In reply to Maor from comment #5)
> From the logs it looks like the connection between the HOST and ENGINE is
> timing out, which would explain the Non-Responsive status.
> the engine sends GetStatsVDS to the Host and gets "Heartbeat exceeded" every
> time (see [1])
How can blocking the connection between the SPM and the master SD cause this? We're missing something here...

Comment 7 Maor 2017-09-14 09:37:57 UTC
Hi Martin,

Basically, getStats should only reflect whether the host is responsive, and should not be affected by storage issues; yet it seems that getStats gets "Heartbeat exceeded" once the storage domain becomes unreachable.
Do you have any insight about this?

Comment 8 Martin Perina 2017-09-21 12:59:40 UTC
I can see a lot of communication issues between the engine and the hosts in engine.log, so are you sure the mentioned iptables rule doesn't affect communication between the engine and the host?

Comment 9 Nisim Simsolo 2017-09-24 07:46:17 UTC
Yes, the DROP iptables rule I applied on the SPM host specified the NFS storage IP as the source.

Comment 10 Yaniv Kaul 2017-11-16 08:03:16 UTC
Is this an oVirt GA blocker? If so, what is the status of this BZ? Do we have an RCA?

Comment 11 Maor 2017-11-16 10:26:27 UTC
Martin,

Based on Comment 9, should this be an infra bug?

Comment 13 Martin Perina 2017-11-28 14:46:45 UTC
(In reply to Maor from comment #11)
> Martin,
> 
> Based on Comment 9, should this be an infra bug?

Not sure at the moment, we will need to go over logs and try to reproduce. Leaving needinfo on me for now

Comment 14 Maor 2017-11-28 18:21:13 UTC
(In reply to Martin Perina from comment #13)
> (In reply to Maor from comment #11)
> > Martin,
> > 
> > Based on Comment 9, should this be an infra bug?
> 
> Not sure at the moment, we will need to go over logs and try to reproduce.
> Leaving needinfo on me for now

I will try to reproduce this once again on my laptop to see whether I can find any clues.

Comment 15 Martin Perina 2017-11-28 21:06:00 UTC
Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):

1. Making dev-18 an SPM
2. Blocking access to NFS server on dev-18 around 21:20
3. At 21:21:54 received Heartbeat exceeded on SpmStatusVds
4. At the same time 21:21:54 received Heartbeat exceeded on GetVdsStats
5. Right afterwards dev-18 status changed to Connecting
6. At 21:22:44 received GetCapabilitiesVDS failed: client closed
7. At 21:24:51 dev-18 status switched to Up
8. At 21:25:02 dev-21 was selected as SPM
9. At 21:26:19 dev-18 was switched to NonOperational as it could not access storage

Full events list in events.png

Looking at the dev-18 VDSM logs, I can see that VDSM was killed by sanlock at 21:22:27 and started again at 21:22:41.

So it seems that VDSM was stuck and unable to send heartbeats; that is why the host was switched to Connecting. Once VDSM was killed and restarted, the engine was able to communicate with it again and changed the status to Up.

So the question is: can we do anything on the VDSM side so that it does not block and keeps sending heartbeats as required? (A rough sketch of the idea is below.)

Piotr, could you please take a look and check whether I have missed anything?
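
To illustrate the question above, a rough sketch (not vdsm's actual code) of keeping heartbeats on a dedicated thread so that a worker stuck on storage cannot stall them; the send() callable is hypothetical:

import threading
import time

def start_heartbeat(send, interval=10.0):
    # Emit a heartbeat every `interval` seconds from a dedicated daemon
    # thread. `send` is a hypothetical non-blocking callable that queues a
    # heartbeat frame for the engine; because this thread never touches
    # storage, a worker hung on a dead NFS mount cannot delay it.
    def loop():
        while True:
            send()
            time.sleep(interval)

    thread = threading.Thread(target=loop, name="heartbeat", daemon=True)
    thread.start()
    return thread

# Example usage with a stand-in sender:
# start_heartbeat(lambda: print("heartbeat"), interval=10.0)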

Comment 16 Martin Perina 2017-11-28 21:06:47 UTC
Created attachment 1360076 [details]
Events view

Comment 17 Martin Perina 2017-11-28 21:07:27 UTC
Created attachment 1360078 [details]
engine logs - new

Comment 18 Martin Perina 2017-11-28 21:07:54 UTC
Created attachment 1360079 [details]
dev-18 logs new

Comment 19 Martin Perina 2017-11-28 21:08:19 UTC
Created attachment 1360080 [details]
dev-21 logs new

Comment 20 Piotr Kliczewski 2017-11-28 22:17:28 UTC
The engine called getSpmStatus:

2017-11-28 21:20:47,609+0100 INFO  (jsonrpc/2) [vdsm.api] START getSpmStatus(spUUID=u'58d38fb0-cee3-4241-af41-021d6e7df82d', options=None) from=::ffff:10.40.200.18,57214, task_id=9663baee-48a1-43ec-8815-60f5838bdf9e (api:46)

and the task was blocked:

2017-11-28 21:21:47,611+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/2 running <Task <JsonRpcTask {'params': {u'storagepoolID': u'58d38fb0-cee3-4241-af41-021d6e7df82d'}, 'jsonrpc': '2.0', 'method': u'StoragePool.getSpmStatus', 'id': u'43ce1f4e-1393-4547-bea6-7d92e009534d'} at 0x371fc90> timeout=60, duration=60 at 0x371fad0> task#=53 at 0x2cb8510> (executor:358)

and the timeout occurred:

2017-11-28 21:21:48,606+0100 ERROR (monitor/ce36911) [storage.Monitor] Error checking domain ce36911f-2ad2-423b-84c1-92a7b404d384 (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 688, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 243, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 481, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
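
For context, a simplified sketch of the pattern behind this traceback (not the real ioprocess API): run statvfs with a deadline so a hung NFS mount surfaces as a Timeout instead of blocking the monitor indefinitely. ioprocess does this in a separate helper process, since a thread stuck in a kernel NFS call cannot be reclaimed.

import errno
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

class Timeout(Exception):
    pass

_pool = ThreadPoolExecutor(max_workers=1)

def timed_statvfs(path, timeout=60):
    # Run os.statvfs() in a helper thread and give the caller a deadline;
    # mirrors the "Timeout: Connection timed out" raised above.
    future = _pool.submit(os.statvfs, path)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise Timeout(os.strerror(errno.ETIMEDOUT))

# e.g. timed_statvfs("/rhev/data-center/mnt/<nfs-server>:<export>/<sd-uuid>")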

This task potentially blocks vdsm, since from the engine side no responses arrived after that specific call. Here is the corresponding engine log:

2017-11-28 21:17:05,100+01 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engine-Thread-598) [5a877fa5] Discovered no tasks on Storage Pool 'dc1'
2017-11-28 21:21:19,102+01 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-666) [] domain 'ce36911f-2ad2-423b-84c1-92a7b404d384:data' in problem 'PROBLEMATIC'. vds: 'dev-18'
2017-11-28 21:21:54,495+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM dev-18 command SpmStatusVDS failed: Heartbeat exceeded 

Nir, Maor, is it possible to do something about it?

Comment 21 Nir Soffer 2017-11-28 23:17:17 UTC
(In reply to Piotr Kliczewski from comment #20)
> and the timeout occurred:
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 688,
> in selftest
>     self.oop.os.statvfs(self.domaindir)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line
> 243, in statvfs
>     return self._iop.statvfs(path)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 481,
> in statvfs
>     resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448,
> in _sendCommand
>     raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out

This means this storage domain is not accessible at this point, and the ioprocess
call timed out.

This may also be the reason that the getSpmStatus call hung, but this error is
not related to the task thread.

> It is potentially the task which blocks vdsm since from the engine side no
> responses arrived after that specific call. Here is corresponding engine log:

I don't think the task can block vdsm.

> 2017-11-28 21:21:54,495+01 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] EVENT_ID:
> VDS_BROKER_COMMAND_FAILURE(10,802), VDSM dev-18 command SpmStatusVDS failed:
> Heartbeat exceeded 
> 
> Nir, Maor is it possible to do something about it?

getSpmStatus may access sanlock via the sanlock binding. If we have the same kind
of error in the sanlock binding as the one we had in the libvirt Python binding,
it could explain this.

But this is relevant only if the entire vdsm process is actually blocked - do you
see any evidence? For example, no entries in the vdsm log for 30 seconds?

Comment 22 Nir Soffer 2017-11-28 23:44:59 UTC
(In reply to Martin Perina from comment #15)
> Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):
> 
> 2. Blocking access to NFS server on dev-18 around 21:20
> 3. At 21:21:54 received Heartbeat exceeded on SpmStatusVds
> 4. At the same time 21:21:54 received Heartbeat exceeded on GetVdsStats

What happened on the host from 21:20 up to 21:21:54? Do we see any gap in the log
that would tell us the entire process was blocked?

> Looking at dev-18 VDSM logs I can see that VDSM has been killed by sanlock
> at 21:22:27 and started again at 21:22:41

When you block storage on the spm, sanlock cannot update the host id, and vdsm
will be killed by sanlock after 80 seconds.

Are engine and vdsm clocks synchronized?
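
As a side note, a worked sketch of where the 80 seconds comes from, assuming sanlock's default 10-second I/O timeout (the multipliers are sanlock defaults and may differ if io_timeout is tuned):

IO_TIMEOUT = 10                    # seconds, sanlock default
RENEWAL_INTERVAL = 2 * IO_TIMEOUT  # host id lease renewed roughly every 20 s
RENEWAL_FAIL = 8 * IO_TIMEOUT      # lease holder (vdsm) killed after 80 s

print("renewal every %ds; vdsm killed ~%ds after storage is blocked"
      % (RENEWAL_INTERVAL, RENEWAL_FAIL))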

Comment 23 Nir Soffer 2017-11-28 23:48:09 UTC
Nisim, do you have sanlock.log from this host? For any issue that may be related
to storage, we must have sanlock.log.

Comment 24 Martin Perina 2017-11-29 08:22:05 UTC
(In reply to Nir Soffer from comment #22)
> (In reply to Martin Perina from comment #15)
> > Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):
> > 
> > 2. Blocking access to NFS server on dev-18 around 21:20
> > 3. At 21:21:54 received Heartbeat exceeded on SpmStatusVds
> > 4. At the same time 21:21:54 received Heartbeat exceeded on GetVdsStats
> 
> What happened on the host from 21:20 up to 21:21:54? Do we see any hole in
> the log
> that can tell us that the entire process was blocked?

I can't see any time gap bigger than 10 seconds in the VDSM log from dev-18, but I'm not an expert on VDSM; all logs are attached: https://bugzilla.redhat.com/attachment.cgi?id=1360079

> 
> > Looking at dev-18 VDSM logs I can see that VDSM has been killed by sanlock
> > at 21:22:27 and started again at 21:22:41
> 
> When you block storage on the spm, sanlock cannot update the host id, and
> vdsm
> will be killed by sanlock after 80 seconds.

That's OK, but it also means the host will be switched to Connecting and back to Up once VDSM starts again (which is what happened above). The strange thing is that we lost communication (stopped receiving heartbeats) even before VDSM was killed by sanlock:

At 21:21:54 we got "Heartbeat exceeded", which means we had received no response from VDSM since 21:21:24. According to the logs, sanlock killed VDSM at 21:22:27, so we have about a minute during which VDSM seemed locked/non-responsive from the communication point of view.
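
For clarity, the arithmetic behind that one-minute window (a sketch only; the timestamps are taken from the logs quoted above, and the 30 s heartbeat window is the one implied in this comment):

from datetime import datetime, timedelta

FMT = "%H:%M:%S"
heartbeat_exceeded = datetime.strptime("21:21:54", FMT)  # engine reports Heartbeat exceeded
sanlock_kill = datetime.strptime("21:22:27", FMT)        # sanlock kills VDSM (dev-18 log)

# 30 s heartbeat window: 21:21:54 minus 30 s = 21:21:24, the last known response.
last_response = heartbeat_exceeded - timedelta(seconds=30)
silent = sanlock_kill - last_response

print("last VDSM response around %s" % last_response.time())
print("VDSM silent for ~%ds before sanlock killed it" % silent.total_seconds())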

> 
> Are engine and vdsm clocks synchronized?

Yes

Comment 25 Nisim Simsolo 2017-11-29 08:36:19 UTC
Created attachment 1360238 [details]
sanlock log

Comment 28 Maor 2018-01-01 13:12:50 UTC
When the master storage domain is blocked from the host, sanlock should reboot the host after 80 seconds; once it is up again, it should become Non-Operational and the other host in the DC should become the SPM.
If the host does not come up again after the reboot, it should stay Non-Responsive, and the user should choose "confirm reboot" so a new SPM can be elected.

Based on Comment 15 and my own reproduction, this seems to follow the anticipated behavior.
After discussing this with Nir and Nisim, I'm closing the bug as WorksForMe. Please feel free to re-open if you see otherwise.

