+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1499056 +++
======================================================================

Description of problem:

Not sure if this needs to be solved in vdsm or better handled by the engine; assigning to vdsm for initial evaluation.

When the engine issues a ConnectStorageServer command, VDSM uses mount.nfs to mount ISO/Export Domains. The engine expects this to complete within 180 seconds (vdsTimeout). However, mount(2) can take much longer than 180 seconds to return under some conditions.

This results in the host going unresponsive and the DC being completely down when the NFS server is unreachable and the engine is restarted (initializing the hosts), or when the environment is completely powered off and then brought up again.

In the absence of the NFS domains, in a specific case, the hosts go into a loop as below:

2017-10-03 19:09:00,738-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START, ConnectStorageServerVDSCommand(.....), log id: 607cbf68

... 3 min vdsTimeout ...

2017-10-03 19:12:01,166-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [2a0ebe37] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM compute1 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

2017-10-03 19:12:03,630-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START
2017-10-03 19:12:03,807-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] FINISH

2017-10-03 19:12:03,987-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [73d5515a] EVENT_ID: VDS_DETECTED(13), Correlation ID: 31a1dfc3, Call Stack: null, Custom Event ID: -1, Message: Status of host compute1 was set to Up.

2017-10-03 19:12:04,057-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-43) [2a0ebe37] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host compute1 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

On the VDSM side we can see mount.nfs takes 4min21s, which is more than vdsTimeout (3min). The worker is blocked, and only after 4min21s does vdsm return the response to the engine, which has already timed out.

2017-10-03 19:06:43,911-0400 INFO (jsonrpc/7) [vdsm.api] START connectStorageServer(....)
2017-10-03 19:11:04,227-0400 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2384)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2381, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 381, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 163, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 155, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 190, in mount
    timeout=timeout, cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (32, ';mount.nfs: Connection timed out\n')

Now, in this specific case, mount.nfs takes all this time to respond because:

* By default, mount.nfs retries for 2 minutes. That is, if mount(2) returns with an error within 2 minutes, it tries again.
* But if the mount(2) call itself takes more than 2 minutes, the userspace mount doesn't return within 2 minutes. It's a retry parameter, not a hard timeout.
* mount(2) takes ~4 minutes to return when the host can resolve the next hop (ARP) for the NFS server, but the NFS server returns neither an ICMP error nor a response to the TCP SYN. This is the TCP timeout. It can be hit in at least two cases:

  Case 1: Host on network A, NFS server on network B. The host reaches network A's gateway (next hop resolved) and the TCP SYN is sent, but there is no response from the NFS server.

  Case 2: NFS server on the same network as the host, but the NFS server simply drops the TCP SYN (silent-drop firewall rule).

The point is: if the host kernel sends the TCP SYN and receives neither an error nor an ICMP reply, mount(2) takes 4 minutes to return, bound by the TCP timeout.

Version-Release number of selected component (if applicable):
rhevm-4.1.3.5-0.1.el7.noarch
vdsm-4.19.20-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:

The easiest way is to silently DROP outgoing IP traffic to the NFS server:

# iptables -I OUTPUT 1 -d <NFS_IP> -j DROP
# time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out

real    4m17.504s
user    0m0.004s
sys     0m0.001s

Note that the condition is that mount(2) doesn't return quickly. In the case below it would try for 2 minutes, probably dozens of mount(2) calls, and then give up:

# date && time mount.nfs -o timeo=600,retrans=6 -v <NFS_IP>:/tmp /mnt
Fri Oct 6 10:26:46 AEST 2017
mount.nfs: timeout set for Fri Oct 6 10:28:46 2017
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
^C
real    0m9.742s
user    0m0.000s
sys     0m0.004s

Usually we would expect it to behave like the above: mount(2) returns quickly because it receives a proper error from the network layer instead of waiting for the TCP timeout (4m).

When mount(2) hangs on the TCP timeout, vdsm ultimately exceeds vdsTimeout for ConnectStorageServer, and this seems to break host initialization.
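For reference, a quick way to check whether an environment hits this bad case (mount(2) stuck for longer than the engine is willing to wait) without sitting through the full TCP timeout is to wrap the reproduction above in coreutils timeout(1). This is only a diagnostic sketch; <NFS_IP> and /mnt are the same placeholders as above:

# Give mount at most 180s, i.e. the engine's vdsTimeout.
# Exit status 124 means mount(2) was still stuck when the limit expired,
# which is exactly the case that makes ConnectStorageServer time out.
# (A mount blocked inside the kernel may ignore SIGTERM, hence --kill-after.)
timeout --kill-after=10 180 mount <NFS_IP>:/ /mnt
echo "exit status: $?"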
Actual results:
RHV doesn't go up, the DC is down, hosts doing an up/not-responding dance.

Expected results:
RHV up.

(Originally by Germano Veit Michel)
This is the same behavior as any other domain, pushing out to 4.3 to rethink. (Originally by amureini)
(In reply to Allon Mureinik from comment #2)
> This is the same behavior as any other domain, pushing out to 4.3 to rethink.

Is it? I thought a vdsTimeout exception for connectStorageServer on a Data Domain would move the host to Non-Operational. Might be wrong though...

(Originally by Germano Veit Michel)
A stuck mount can be solved by letting someone else wait for the mount and returning a timeout early. The engine can then modify the domain or host status considering the domain's status on other hosts.

One option is to let systemd do the mount using mount units. If the mount gets stuck, it is now systemd's problem. Vdsm can wait for mount events and report to the engine when the mount has completed. See https://gerrit.ovirt.org/#/c/56516/ for an example using mount events.

Using persistent mount units, a host can recover the mounts automatically after a reboot - this will simplify hosted engine flows.

(Originally by Nir Soffer)
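A rough sketch of that idea using a transient unit, assuming systemd-run is available on the host; the unit name, NFS export and mount point are made-up placeholders, and this is not how the gerrit patch above works (that patch listens for mount events), only an illustration that the wait can be delegated:

# Let a transient systemd unit own the (possibly stuck) mount, so no vdsm
# worker thread blocks inside connectStorageServer waiting for mount(2).
mkdir -p '/rhev/data-center/mnt/nfs.example.com:_export'
systemd-run --unit=example-nfs-mount \
    /usr/bin/mount -t nfs -o soft,timeo=600,retrans=6 \
    nfs.example.com:/export '/rhev/data-center/mnt/nfs.example.com:_export'

# The caller polls (or watches) the unit instead of blocking: it stays
# "active" while mount is stuck, ends up "inactive" on success and
# "failed" if mount exits with an error.
systemctl is-active example-nfs-mount.service
journalctl -u example-nfs-mount.service --no-pager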
(In reply to Nir Soffer from comment #11)
> A stuck mount can be solved by letting someone else wait for the mount and
> returning a timeout early. The engine can then modify the domain or host
> status considering the domain's status on other hosts.
>
> One option is to let systemd do the mount using mount units. If the mount
> gets stuck, it is now systemd's problem. Vdsm can wait for mount events and
> report to the engine when the mount has completed. See
> https://gerrit.ovirt.org/#/c/56516/ for an example using mount events.
>
> Using persistent mount units, a host can recover the mounts automatically
> after a reboot - this will simplify hosted engine flows.

Allon, should we move it back to 4.3?

(Originally by ylavi)
Returning old needinfo (Originally by Raz Tamir)
Tested with the following code:
-----------------------------------------
rhevm-3.6.13.1-0.1.el6.noarch
vdsm-4.17.44-2.el7ev.noarch

Tested with the following scenario:
---------------------------------------
1. On a system with NFS, iSCSI and an ISO domain
2. Ran "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP" from the host to the ISO domain

>>>> The data center was set to Non Responsive

Engine logs:
-------------------------------
2018-03-03 22:33:44,459 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-65) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Client close
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSGenericException: VDSNetworkException: Client close, log id: 8478bdf
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Client close
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:32) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:161) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData.proceedStoragePoolStats(IrsProxyData.java:336) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed(IrsProxyData.java:238) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor134.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
2018-03-03 22:33:44,483 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-65) [] IRS failover failed - can't allocate vds server
2018-03-03 22:33:44,498 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[dea07dac-f1a7-4a94-ac7b-ed697b635276=<POOL, ACTION_TYPE_FAILED_OBJECT_LOCKED>]'}'
2018-03-03 22:33:44,511 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: 1fed50d9-ed6c-4adf-bcbc-37487de67bb0 Type: VDS
...
2018-03-03 22:33:54,498 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-64) [113f6f10] Storage Pool 'dea07dac-f1a7-4a94-ac7b-ed697b635276' - Updating Storage Domain 'fb39c7f3-7bed-4946-970a-ff3c0656726d' status from 'Active' to 'Unknown', reason: null
2018-03-03 22:33:54,503 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) [113f6f10] Correlation ID: 113f6f10, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center golden_env_mixed. Setting status to Non Responsive.
2018-03-03 22:33:54,518 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-64) [113f6f10] START, ConnectStoragePoolVDSCommand(HostName = host_mixed_1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vdsId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276', masterVersion='1'}), log id: 671a00e8
...
2018-03-03 22:36:20,618 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-65) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,621 INFO [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-65) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Failed getting vds stats, vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
2018-03-03 22:36:20,624 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) [] Failed to refresh VDS, network error, continuing, vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,624 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,624 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-24) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-63) [] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,627 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-63) [39df4684] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: dea07dac-f1a7-4a94-ac7b-ed697b635276 Type: StoragePool

(Originally by Kevin Goldblatt)
Created attachment 1403592 [details]
engine and vdsm logs

Added logs

(Originally by Kevin Goldblatt)
According to Kevin's reproduction (which is not the same issue as the bug itself), this seems like a regression. (Originally by Raz Tamir)
The ISO domain used in the test reported in comment #22 was on the engine, so Non Responsive is the desired behavior. Upon blocking the connection to the storage server where the ISO domain resides, only the ISO domain becomes inactive. (Originally by Elad Ben Aharon)
(In reply to Elad from comment #25)
> The ISO domain used in the test reported in comment #22 was on the engine,
> so Non Responsive is the desired behavior.
> Upon blocking the connection to the storage server where the ISO domain
> resides, only the ISO domain becomes inactive.

Logs?

(Originally by amureini)
Created attachment 1403741 [details]
logs-3.6-only_iso_domain_inactive

vdsm.log.1.xz:

Thread-52183::ERROR::2018-03-04 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 648, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 452, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 419, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-52183::INFO::2018-03-04 11:38:55,687::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 8c5e702c-486e-4bc4-9bbb-0155218abf14 became INVALID

engine.log:

ISO domain is unreachable:

2018-03-04 11:42:01,809 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (ajp-/127.0.0.1:8702-1) [] FINISH, GetFileStatsVDSCommand, log id: d97256f
2018-03-04 11:42:01,809 WARN [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)

ISO domain is being deactivated:

2018-03-04 11:43:22,542 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-6-thread-35) [762c9a06] Running command: DeactivateStorageDomainCommand internal: true. Entities affected : ID: 8c5e702c-486e-4bc4-9bbb-0155218abf14 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

The data center remains up; only the ISO domain is affected and becomes inactive.

There seems to be a bug in 3.6 for GetImagesListQuery: an NPE is thrown. Not sure if it's a known issue.
2018-03-04 11:42:01,809 WARN [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Query 'GetImagesListQuery' failed: null
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.removeFileStatsForComplyingFileNames(IsoDomainListSyncronizer.java:699) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateAllFileListFromVDSM(IsoDomainListSyncronizer.java:679) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateFileList(IsoDomainListSyncronizer.java:305) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomainFileForStoragePool(IsoDomainListSyncronizer.java:289) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomain(IsoDomainListSyncronizer.java:346) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshRepos(IsoDomainListSyncronizer.java:212) [bll.jar:]

(Originally by Elad Ben Aharon)
I don't see the regression here. The bug is about the behavior when the mount command for the ISO domain gets stuck for more than 180 seconds.

Did we reproduce a stuck mount in 3.6? Do we have different behavior when mounting the ISO domain is stuck?

(Originally by Nir Soffer)
(In reply to Elad from comment #28)
> Thread-52183::ERROR::2018-03-04 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
> Traceback (most recent call last):

This means the connectStorageServer succeeded, and we are monitoring the domain. This bug is about the case when connectStorageServer timed out.

(Originally by Nir Soffer)
Nir, how can we simulate this?
Isn't blocking the connection to the NFS server enough?

(Originally by Elad Ben Aharon)
(In reply to Elad from comment #31)
> Nir, how can we simulate this?
> Isn't blocking the connection to the NFS server enough?

It should be, but if the mount command did not get stuck, we did not reproduce the issue reported in this bug.

(Originally by Nir Soffer)
Nir, maybe Germano can provide us with a way of reproducing a stuck mount. (Originally by Elad Ben Aharon)
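For reference, the silent-drop recipe from comment #0 is the simplest known way to make mount(2) itself get stuck; a rejected or unroutable destination makes it fail fast and retry instead. <NFS_IP> is a placeholder:

# Drop traffic to the NFS server so the TCP SYN gets no reply at all;
# mount(2) then blocks on the TCP timeout (~4 minutes, well over vdsTimeout).
iptables -I OUTPUT 1 -d <NFS_IP> -j DROP
time mount <NFS_IP>:/ /mnt

# By contrast, a REJECT rule returns an ICMP error, so mount(2) fails fast
# and mount.nfs just retries for its 2-minute retry window:
#   iptables -I OUTPUT 1 -d <NFS_IP> -j REJECT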
I see the exact same behavior in 4.2 as I saw in 4.1 in comment #0. No heartbeat exceeded, just vdsTimeout and an up/down dance forever. The host cannot activate.

1. Engine sends connectStorageServer

2018-03-05 12:42:16,037+11 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]

2. vdsTimeout on connectStorageServer (3min)

2018-03-05 12:45:16,966+11 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] ... execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

3. Engine reconnects to host

2018-03-05 12:45:17,021+11 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to ...

4. Host is set to Up

2018-03-05 12:45:18,422+11 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [773ff1ff] EVENT_ID: VDS_DETECTED(13), Status of host ... was set to Up.

5. Then quickly switches to not responding

2018-03-05 12:45:18,658+11 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-91979) [675de76a] Host '....' is not responding.

6. Engine connects again and sends another connectStorageServer.

The up/not-responding dance continues; repeat steps 1-5 again, and again, and again.

The only improvement is that now I can clearly see the blocked worker on the vdsm side trying to do the mount (thanks!), but that only helps diagnosing the issue, it doesn't solve it of course.

Versions used:
rhvm-4.2.0-0.6.el7.noarch
vdsm-4.20.17-1.el7ev.x86_64

(Originally by Germano Veit Michel)
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Project 'ovirt-engine'/Component 'vdsm' mismatch] For more info please contact: rhv-devops
Verified with the following code:
-----------------------------------------
ovirt-engine-4.2.4-0.1.el7.noarch
vdsm-4.20.28-1.el7ev.x86_64

Verified with the following scenario:
-----------------------------------------
Steps:
- Data center with at least two hosts
- Configure an ISO storage domain
- Move one of the hosts to Maintenance
- Block, with iptables, the connectivity to the ISO domain from the host in maintenance only
- Activate the host

>>>>> The host moves to the Up state.

Moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2072