Description of problem:
When blocking NFS access of the SPM host using iptables, the host status in webadmin changed cyclically: connecting -> initializing -> non-responsive. In such a case the host should move to Non-Operational status. When blocking a non-SPM host, the host status changed to Non-Operational as expected.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170903205106.gitb17261a.el7.centos
sanlock-3.5.0-1.el7.x86_64
vdsm-4.20.3-8.gitd4eb30e.el7.centos.x86_64
libvirt-client-3.2.0-14.el7.x86_64
qemu-kvm-ev-2.6.0-28.el7.10.1.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have 2 hosts under the same cluster.
2. Block access of the SPM host to storage using an iptables rule (for example: iptables -I INPUT -s <NFS IP> -j DROP)
3. Observe the host status.

Actual results:
The host changes its status repeatedly: connecting -> initializing -> non-responsive.

Expected results:
The host status should change to Non-Operational.

Additional info:
vdsm.log and engine.log (2017-09-05 10:35:42,051+03) attached.
Created attachment 1322107 [details] engine log
Created attachment 1322108 [details] SPM host vdsm.log
Created attachment 1322109 [details] non SPM host vdsm.log
I wonder if we broke something with the changes in the host acquisition flow.
(In reply to Allon Mureinik from comment #4)
> I wonder if we broke something with the changes in the host acquisition flow.

It should not be related, since the recent changes were about attach/detach and force remove; this scenario should be related to the host monitoring.

From the logs it looks like the connection between the host and the engine is timing out, which would explain the Non-Responsive status: the engine sends GetStatsVDS to the host and gets "Heartbeat exceeded" every time (see [1]).

I found a very similar bug which describes the same reproduction steps and the same non-responsiveness behavior the reporter encountered, caused by "Heartbeat exceeded": https://bugzilla.redhat.com/1326009

[1]
2017-09-05 10:44:37,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-47) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:49:22,686+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-3) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:51:49,951+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-22) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:54:17,835+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:54:17,836+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-9) [] Command 'GetStatsVDSCommand(HostName = host_mixed_2, VdsIdAndVdsVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b', vds='Host[host_mixed_2,d8750c3a-47b0-4664-89aa-bcaedff85b3b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-09-05 10:56:46,274+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-13) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 10:59:15,212+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-35) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 11:01:44,352+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-7) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command GetStatsVDS failed: Heartbeat exceeded
2017-09-05 11:01:44,353+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-7) [] Command 'GetStatsVDSCommand(HostName = host_mixed_2, VdsIdAndVdsVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b', vds='Host[host_mixed_2,d8750c3a-47b0-4664-89aa-bcaedff85b3b]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
2017-09-05 11:02:55,747+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler9) [50e59616] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_2, VdsIdVDSCommandParametersBase:{hostId='d8750c3a-47b0-4664-89aa-bcaedff85b3b'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
(In reply to Maor from comment #5)
> From the logs it looks like the connection between the HOST and ENGINE is
> timing out, which would explain the Non-Responsive status.
> the engine sends GetStatsVDS to the Host and gets "Heartbeat exceeded" every
> time (see [1])

How can blocking the connection between the SPM and the master SD cause this? We're missing something here...
Hi Martin,

Basically getStats should only reflect whether the host is responsive or not, and should not be affected by storage issues; however, it seems that getStats gets "Heartbeat exceeded" once the storage domain becomes unreachable.

Do you have any insight into this?
I can see a lot of communication issues between the engine and the hosts in engine.log, so are you sure that the mentioned iptables rule doesn't affect communication between the engine and the host?
Yes, the drop iptables rule I applied on the SPM host used only the NFS storage IP.
Is this an oVirt GA blocker? If so, what is the status of the BZ? Do we have an RCA?
Martin, Based on Comment 9, should this be an infra bug?
(In reply to Maor from comment #11)
> Martin,
>
> Based on Comment 9, should this be an infra bug?

Not sure at the moment; we will need to go over the logs and try to reproduce. Leaving needinfo on me for now.
(In reply to Martin Perina from comment #13)
> (In reply to Maor from comment #11)
> > Martin,
> >
> > Based on Comment 9, should this be an infra bug?
>
> Not sure at the moment, we will need to go over logs and try to reproduce.
> Leaving needinfo on me for now

I will try to reproduce this once more on my laptop to see if I can find any clues.
Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):

1. Making dev-18 the SPM
2. Blocking access to the NFS server on dev-18 around 21:20
3. At 21:21:54 received "Heartbeat exceeded" on SpmStatusVDS
4. At the same time, 21:21:54, received "Heartbeat exceeded" on GetVdsStats
5. Right afterwards the dev-18 status changed to Connecting
6. At 21:22:44 received "GetCapabilitiesVDS failed: client closed"
7. At 21:24:51 the dev-18 status switched to Up
8. At 21:25:02 dev-21 was selected as SPM
9. At 21:26:19 dev-18 was switched to NonOperational as it cannot access storage

The full events list is in events.png.

Looking at the dev-18 VDSM logs, I can see that VDSM was killed by sanlock at 21:22:27 and started again at 21:22:41. So it seems that VDSM was stuck and not able to send heartbeats; that's why the host was switched to Connecting, and once VDSM was killed and restarted, the engine was able to communicate with VDSM and changed the status to Up.

So the question is: can we do anything on the VDSM side to avoid blocking and send heartbeats as required? Piotr, could you please take a look in case I've missed anything?
Created attachment 1360076 [details] Events view
Created attachment 1360078 [details] engine logs - new
Created attachment 1360079 [details] dev-18 logs new
Created attachment 1360080 [details] dev-21 logs new
The engine called getSpmStatus:

2017-11-28 21:20:47,609+0100 INFO (jsonrpc/2) [vdsm.api] START getSpmStatus(spUUID=u'58d38fb0-cee3-4241-af41-021d6e7df82d', options=None) from=::ffff:10.40.200.18,57214, task_id=9663baee-48a1-43ec-8815-60f5838bdf9e (api:46)

and the task was blocked:

2017-11-28 21:21:47,611+0100 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/2 running <Task <JsonRpcTask {'params': {u'storagepoolID': u'58d38fb0-cee3-4241-af41-021d6e7df82d'}, 'jsonrpc': '2.0', 'method': u'StoragePool.getSpmStatus', 'id': u'43ce1f4e-1393-4547-bea6-7d92e009534d'} at 0x371fc90> timeout=60, duration=60 at 0x371fad0> task#=53 at 0x2cb8510> (executor:358)

and the timeout occurred:

2017-11-28 21:21:48,606+0100 ERROR (monitor/ce36911) [storage.Monitor] Error checking domain ce36911f-2ad2-423b-84c1-92a7b404d384 (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 688, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 243, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 481, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

It is potentially the task that blocks vdsm, since from the engine side no responses arrived after that specific call.
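To illustrate the "Worker blocked" warning above, here is a minimal, hypothetical sketch (not vdsm's actual Executor code, and class/function names are made up for illustration) of how an executor-style watchdog can detect a worker whose blocking call, such as statvfs() on an unreachable NFS mount, has exceeded its allowed duration:

```python
# Illustrative sketch of executor-style "worker blocked" detection.
# Assumption: the real vdsm Executor works differently in detail; this
# only demonstrates the general pattern of timing a worker thread.
import threading
import time


class Worker:
    def __init__(self, name, func, timeout):
        self.name = name
        self.func = func
        self.timeout = timeout       # max seconds before "blocked"
        self.started = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        self.started = time.monotonic()
        self.func()                  # the potentially blocking call

    def start(self):
        self._thread.start()

    def is_blocked(self):
        # A worker counts as blocked when it is still running past
        # its timeout without having completed.
        if self.started is None or not self._thread.is_alive():
            return False
        return time.monotonic() - self.started > self.timeout


def slow_statvfs():
    # Stand-in for a statvfs() call hanging on a blocked NFS server.
    time.sleep(5)


worker = Worker("jsonrpc/2", slow_statvfs, timeout=0.1)
worker.start()
time.sleep(0.3)                      # give the watchdog time to notice
if worker.is_blocked():
    print("WARN [Executor] Worker blocked: %s" % worker.name)
```

The key point for this bug is that the watchdog only warns; the blocked worker thread itself keeps occupying the jsonrpc slot until the underlying I/O times out or the process is killed.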
Here is the corresponding engine log:

2017-11-28 21:17:05,100+01 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engine-Thread-598) [5a877fa5] Discovered no tasks on Storage Pool 'dc1'
2017-11-28 21:21:19,102+01 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-666) [] domain 'ce36911f-2ad2-423b-84c1-92a7b404d384:data' in problem 'PROBLEMATIC'. vds: 'dev-18'
2017-11-28 21:21:54,495+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM dev-18 command SpmStatusVDS failed: Heartbeat exceeded

Nir, Maor, is it possible to do something about it?
(In reply to Piotr Kliczewski from comment #20)
> and the timeout occurred:
> File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 688,
> in selftest
> self.oop.os.statvfs(self.domaindir)
> File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line
> 243, in statvfs
> return self._iop.statvfs(path)
> File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 481,
> in statvfs
> resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
> File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448,
> in _sendCommand
> raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out

This means the storage domain is not accessible at this point, and the ioprocess call timed out. This may also be the reason that the getSpmStatus call hung, but this error is not related to the task thread.

> It is potentially the task which blocks vdsm since from the engine side no
> responses arrived after that specific call. Here is corresponding engine log:

I don't think the task can block vdsm.

> 2017-11-28 21:21:54,495+01 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engineScheduled-Thread-26) [] EVENT_ID:
> VDS_BROKER_COMMAND_FAILURE(10,802), VDSM dev-18 command SpmStatusVDS failed:
> Heartbeat exceeded
>
> Nir, Maor is it possible to do something about it?

getSpmStatus may access sanlock via the sanlock binding. If we have the same error in the sanlock binding as the error we had in the libvirt Python binding, it could explain this. But this is relevant only if the entire vdsm process is actually blocked; do you see any evidence of that? For example, no entries in vdsm.log for 30 seconds?
(In reply to Martin Perina from comment #15)
> Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):
>
> 2. Blocking access to NFS server on dev-18 around 21:20
> 3. At 21:21:54 received Heartbeat exceeded on SpmStatusVds
> 4. At the same time 21:21:54 received Heartbeat exceeded on GetVdsStats

What happened on the host from 21:20 up to 21:21:54? Do we see any hole in the log that can tell us that the entire process was blocked?

> Looking at dev-18 VDSM logs I can see that VDSM has been killed by sanlock
> at 21:22:27 and started again at 21:22:41

When you block storage on the SPM, sanlock cannot update the host id, and vdsm will be killed by sanlock after 80 seconds.

Are the engine and vdsm clocks synchronized?
Nisim, do you have sanlock.log from this host? For any issue that may be related to storage, we must have sanlock.log.
(In reply to Nir Soffer from comment #22)
> (In reply to Martin Perina from comment #15)
> > Hmm, so I've been able to reproduce the issue with 2 hosts (dev-18, dev-21):
> >
> > 2. Blocking access to NFS server on dev-18 around 21:20
> > 3. At 21:21:54 received Heartbeat exceeded on SpmStatusVds
> > 4. At the same time 21:21:54 received Heartbeat exceeded on GetVdsStats
>
> What happened on the host from 21:20 up to 21:21:54? Do we see any hole in
> the log that can tell us that the entire process was blocked?

I can't see any time hole bigger than 10 seconds in the VDSM log from dev-18, but I'm not an expert on VDSM; all logs are attached: https://bugzilla.redhat.com/attachment.cgi?id=1360079

> > Looking at dev-18 VDSM logs I can see that VDSM has been killed by sanlock
> > at 21:22:27 and started again at 21:22:41
>
> When you block storage on the spm, sanlock cannot update the host id, and
> vdsm will be killed by sanlock after 80 seconds.

That's OK, but it also means that the host will be switched to Connecting and back to Up once VDSM starts up (this is what happened above). But the strange thing is that we lost communication (stopped receiving heartbeats) even before VDSM was killed by sanlock: at 21:21:54 we got "Heartbeat exceeded", which means we got no response from VDSM since 21:21:24. And according to the logs sanlock killed VDSM at 21:22:27, so we have about a minute during which VDSM seemed locked/non-responsive from the communication point of view.

> Are engine and vdsm clocks synchronized?

Yes
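The timing argument above ("Heartbeat exceeded" at 21:21:54 means no response since 21:21:24) follows from how heartbeat-exceeded detection works. A hypothetical sketch of the engine-side check (class names are illustrative, not oVirt's actual vdsm-jsonrpc-java classes; a 30-second interval is assumed from the timestamps quoted in this comment):

```python
# Illustrative sketch of heartbeat-exceeded detection: the connection
# is declared broken when no message has arrived within the heartbeat
# interval. Assumption: intervals are shortened here for demonstration;
# the timestamps in this bug imply roughly a 30s interval.
import time


class HeartbeatTracker:
    def __init__(self, interval):
        self.interval = interval
        self.last_message = time.monotonic()

    def on_message(self):
        # Any response or heartbeat from vdsm resets the timer.
        self.last_message = time.monotonic()

    def exceeded(self):
        # True once the peer has been silent longer than the interval.
        return time.monotonic() - self.last_message > self.interval


tracker = HeartbeatTracker(interval=0.2)
tracker.on_message()
assert not tracker.exceeded()    # vdsm still responding

time.sleep(0.3)                  # vdsm stuck: no responses, no heartbeats
assert tracker.exceeded()        # engine would report "Heartbeat exceeded"
```

Under this model, an error raised at 21:21:54 with a 30s interval places the last successful message at about 21:21:24, which is why vdsm appears unresponsive a full minute before sanlock kills it at 21:22:27.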
Created attachment 1360238 [details] sanlock log
When blocking the master storage domain from the host, sanlock should reboot the host after 80 seconds. Once it is up again, it should become Non-Operational and the other host in the DC should become the SPM. If the host is not up again after the reboot, it should stay Non-Responsive and the user should pick "confirm reboot" to elect a new SPM.

Based on Comment 15 and my own reproduction, this follows the anticipated behavior. After discussing this with Nir and Nissim, I'm closing the bug as WorksForMe. Please feel free to re-open if you see otherwise.