Bug 1499056 - Unreachable ISO/Export SD prevents hosts from activating
Summary: Unreachable ISO/Export SD prevents hosts from activating
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.3
Hardware: x86_64
OS: Linux
Importance: high urgent
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Fred Rolland
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1541529 1578763
 
Reported: 2017-10-06 00:32 UTC by Germano Veit Michel
Modified: 2021-12-10 15:18 UTC (History)
CC List: 19 users

Fixed In Version: ovirt-engine-4.3.0_rc
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1578763 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:35:59 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine and vdsm logs (896.54 KB, application/x-gzip)
2018-03-03 21:33 UTC, Kevin Alon Goldblatt
logs-3.6-only_iso_domain_inactive (1.95 MB, application/x-gzip)
2018-03-04 12:18 UTC, Elad


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1580243 0 high CLOSED A host with Unreachable Data SD is not moving to non-operational when brought up (but remains in a loop between non-resp... 2022-03-13 15:50:28 UTC
Red Hat Knowledge Base (Solution) 3208571 0 None None None 2017-10-06 03:53:49 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:29 UTC
oVirt gerrit 89823 0 'None' MERGED engine: Ignore failure connecting ISO and Export SD 2020-10-21 01:51:33 UTC
oVirt gerrit 91274 0 'None' MERGED engine: Ignore failure connecting ISO and Export SD 2020-10-21 01:51:34 UTC

Internal Links: 1580243

Description Germano Veit Michel 2017-10-06 00:32:32 UTC
Description of problem:

Not sure if this needs to be solved in vdsm or better handled by the engine; assigning to vdsm for initial evaluation.

When the engine issues a ConnectStorageServer command, VDSM uses mount.nfs to mount ISO/Export Domains. The engine expects this to complete in 180 seconds (vdsTimeout). However, mount(2) can take much longer than 180 seconds to return in some conditions.

This results in the host going unresponsive and the DC being completely down when the NFS server is unreachable and the engine is restarted (re-initializing the hosts), or when the environment is completely powered off and then brought up again. In the absence of the NFS domains, in a specific case, the hosts go into a loop like the one below:

2017-10-03 19:09:00,738-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START, ConnectStorageServerVDSCommand(.....), log id: 607cbf68

... 3 min vdsTimeout ...

2017-10-03 19:12:01,166-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [2a0ebe37] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM compute1 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

2017-10-03 19:12:03,630-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START
2017-10-03 19:12:03,807-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] FINISH

2017-10-03 19:12:03,987-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [73d5515a] EVENT_ID: VDS_DETECTED(13), Correlation ID: 31a1dfc3, Call Stack: null, Custom Event ID: -1, Message: Status of host compute1 was set to Up.

2017-10-03 19:12:04,057-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-43) [2a0ebe37] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host compute1 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

On the VDSM side we can see mount.nfs takes 4m21s, which is more than vdsTimeout (3 min). The worker is blocked, and only after 4m21s does vdsm return the response to the engine, which has already timed out.

2017-10-03 19:06:43,911-0400 INFO  (jsonrpc/7) [vdsm.api] START connectStorageServer(....)

2017-10-03 19:11:04,227-0400 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2384)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2381, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 381, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 163, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 155, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 190, in mount
    timeout=timeout, cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (32, ';mount.nfs: Connection timed out\n')

Now, about the specific case where mount.nfs takes all this time to respond:
* The default mount.nfs will retry for 2 minutes. That is, if mount(2) returns with an error within 2 minutes, it tries again.
* But if the mount(2) call itself takes more than 2 minutes, then the userspace mount doesn't return within 2 minutes. retry is a retry parameter, not a hard timeout.
* mount(2) takes ~4m to return in case it is able to resolve the nexthop (ARP) for the NFS server, but the NFS server returns neither an ICMP error nor a response to the TCP SYN. This is the TCP timeout.

This can be hit in at least two cases:

Case 1: Host on network A, NFS Server on net B. Host reaches network A gateway (nexthop resolved), TCP SYN sent. No response from NFS Server.

Case 2: NFS Server on same network as Host, but NFS server simply drops the TCP SYN (silent drop firewall rule).

The point is that if the host kernel sends the TCP SYN and receives neither an error reply nor an ICMP message, mount(2) will take ~4 minutes to return, due to the TCP timeout.
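For illustration, a minimal Python 3 sketch of the kind of hard caller-side deadline that the mount.nfs retry option does not give you. This is a hypothetical helper, not vdsm code; the function name and the 180s constant are assumptions taken from the description above:

# Hypothetical sketch only -- not vdsm code. Enforce a hard deadline on the
# whole mount attempt, since mount.nfs retry only bounds the retry window,
# not a single mount(2) call stuck in the TCP handshake.
import subprocess

VDS_TIMEOUT = 180  # engine-side vdsTimeout, in seconds

def mount_with_deadline(server, export, mountpoint, deadline=VDS_TIMEOUT - 30):
    """Try an NFS mount, but give up well before the engine's timeout."""
    cmd = ["mount", "-t", "nfs", "%s:%s" % (server, export), mountpoint]
    try:
        subprocess.run(cmd, check=True, timeout=deadline)
        return True
    except subprocess.TimeoutExpired:
        # mount(2) is probably still waiting on the TCP timeout; fail fast so
        # the caller can report "storage unreachable" instead of blocking.
        # If mount is stuck in uninterruptible sleep, the kill issued by
        # subprocess may itself take a while to complete.
        return False
    except subprocess.CalledProcessError:
        return False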

Version-Release number of selected component (if applicable):
rhevm-4.1.3.5-0.1.el7.noarch
vdsm-4.19.20-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
Easiest way is to silently DROP outgoing IP to the NFS server:

# iptables -I OUTPUT 1 -d <NFS_IP> -j DROP
# time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out

real    4m17.504s
user    0m0.004s
sys    0m0.001s

Note that the trigger condition is mount(2) not returning quickly. In the case below it would retry for 2 minutes, probably making dozens of mount(2) calls, and then give up.

# date && time mount.nfs -o timeo=600,retrans=6 -v <NFS_IP>:/tmp /mnt
Fri Oct  6 10:26:46 AEST 2017
mount.nfs: timeout set for Fri Oct  6 10:28:46 2017
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
^C
real    0m9.742s
user    0m0.000s
sys     0m0.004s

Usually we would expect it to behave like the above: mount(2) returns quickly because it receives a proper error from the network layer instead of waiting out the TCP timeout (~4m). When mount(2) hangs on the TCP timeout, vdsm ends up exceeding vdsTimeout before it can return ConnectStorageServer, and this seems to break host initialization.

Actual results:
RHV doesn't come up, the DC is down, and the hosts keep doing the up/not-responding dance.

Expected results:
RHV Up

Comment 2 Allon Mureinik 2017-10-15 10:06:10 UTC
This is the same behavior as any other domain, pushing out to 4.3 to rethink.

Comment 3 Germano Veit Michel 2017-10-15 23:27:41 UTC
(In reply to Allon Mureinik from comment #2)
> This is the same behavior as any other domain, pushing out to 4.3 to rethink.

Is it? I thought a vdsTimeout exception for connectStorageServer on a Data Domain would move the host to Non-Operational. I might be wrong though...

Comment 11 Nir Soffer 2018-02-22 19:52:44 UTC
A stuck mount can be solved by letting someone else wait for the mount and
returning a timeout early. The engine can then modify the domain or host
status considering the domain status on other hosts.

One option is to let systemd do the mount using mount units. If the mount
gets stuck, it is now systemd's problem. Vdsm can wait for mount events, and
report to the engine when the mount has completed. See
https://gerrit.ovirt.org/#/c/56516/ for an example using mount events.

Using persistent mount units, a host can recover the mounts automatically
after a reboot - this would simplify hosted engine flows.
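As a rough illustration of the first idea above, a minimal Python sketch (hypothetical, not the vdsm implementation): hand the blocking mount to a worker thread so the RPC handler can wait with a short deadline and report "pending" instead of blocking past vdsTimeout.

# Hypothetical sketch -- not vdsm code. The mount runs in a worker thread and
# the caller waits only up to a deadline; a stuck mount no longer blocks the
# RPC worker.
import subprocess
import threading

class AsyncMount(object):
    def __init__(self, cmd):
        self._cmd = cmd                 # e.g. ["mount", "-t", "nfs", ...]
        self._done = threading.Event()
        self._rc = None

    def start(self):
        t = threading.Thread(target=self._run)
        t.daemon = True                 # a stuck mount must not block shutdown
        t.start()

    def _run(self):
        self._rc = subprocess.call(self._cmd)
        self._done.set()

    def wait(self, timeout):
        """Return 'done', 'failed' or 'pending' after at most `timeout` seconds."""
        if not self._done.wait(timeout):
            return "pending"
        return "done" if self._rc == 0 else "failed"

A connectStorageServer handler built this way could answer the engine well inside the 180 second window, and the engine could keep the domain inactive (or leave the host status unchanged) until a later report says the mount completed on that host.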

Comment 13 Yaniv Lavi 2018-02-26 13:26:39 UTC
(In reply to Nir Soffer from comment #11)
> A stuck mount can be solved by letting someone else wait for the mount and
> returning a timeout early. The engine can then modify the domain or host
> status considering the domain status on other hosts.
> 
> One option is to let systemd do the mount using mount units. If the mount
> gets stuck, it is now systemd's problem. Vdsm can wait for mount events, and
> report to the engine when the mount has completed. See
> https://gerrit.ovirt.org/#/c/56516/ for an example using mount events.
> 
> Using persistent mount units, a host can recover the mounts automatically
> after a reboot - this would simplify hosted engine flows.

Allon, should we move it back to 4.3?

Comment 16 Raz Tamir 2018-02-26 14:30:49 UTC
Returning old needinfo

Comment 22 Kevin Alon Goldblatt 2018-03-03 21:31:54 UTC
Tested with the following code:
-----------------------------------------
rhevm-3.6.13.1-0.1.el6.noarch
vdsm-4.17.44-2.el7ev.noarch

Tested with the following scenario:
---------------------------------------

1. On a system with nfs, iscsi and ISO domain
2. Ran "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP" from the host to the ISO domain >>>> The data center was set to Non responsive


Engine logs:
-------------------------------
2018-03-03 22:33:44,459 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-65) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Client close
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSGenericException: VDSNetworkException: Client close, log id: 8478bdf
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Client close
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:32) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:161) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData.proceedStoragePoolStats(IrsProxyData.java:336) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed(IrsProxyData.java:238) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor134.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2018-03-03 22:33:44,483 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-65) [] IRS failover failed - can't allocate vds server
2018-03-03 22:33:44,498 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[dea07dac-f1a7-4a94-ac7b-ed697b635276=<POOL, ACTION_TYPE_FAILED_OBJECT_LOCKED>]'}'
2018-03-03 22:33:44,511 INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: 1fed50d9-ed6c-4adf-bcbc-37487de67bb0 Type: VDS

.
.
.
2018-03-03 22:33:54,498 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-64) [113f6f10] Storage Pool 'dea07dac-f1a7-4a94-ac7b-ed697b635276' - Updating Storage Domain 'fb39c7f3-7bed-4946-970a-ff3c0656726d' status from 'Active' to 'Unknown', reason: null
2018-03-03 22:33:54,503 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) [113f6f10] Correlation ID: 113f6f10, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center golden_env_mixed. Setting status to Non Responsive.
2018-03-03 22:33:54,518 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-64) [113f6f10] START, ConnectStoragePoolVDSCommand(HostName = host_mixed_1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vdsId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276', masterVersion='1'}), log id: 671a00e8
.
.
.
2018-03-03 22:36:20,618 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-65) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,621 INFO  [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-65) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) []  Failed getting vds stats,  vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) [] Failed to refresh VDS, network error, continuing, vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-24) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-63) [] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,627 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-63) [39df4684] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: dea07dac-f1a7-4a94-ac7b-ed697b635276 Type: StoragePool

Comment 23 Kevin Alon Goldblatt 2018-03-03 21:33:13 UTC
Created attachment 1403592 [details]
engine and vdsm logs

Added logs

Comment 24 Raz Tamir 2018-03-04 09:23:32 UTC
According to Kevin's reproduction, which is not the same issue as the bug itself, this seems like a regression.

Comment 25 Elad 2018-03-04 10:09:57 UTC
The ISO domain used in the test reported in comment #22 was on the engine, so Non Responsive is the expected behavior. Upon blocking the connection to the storage server where the ISO domain resides, only the ISO domain becomes inactive.

Comment 27 Allon Mureinik 2018-03-04 10:18:38 UTC
(In reply to Elad from comment #25)
> The ISO domain used in the test reported in comment #22 was on the engine,
> so Non Responsive is the expected behavior. Upon blocking the connection to
> the storage server where the ISO domain resides, only the ISO domain
> becomes inactive.
Logs?

Comment 28 Elad 2018-03-04 12:18:42 UTC
Created attachment 1403741 [details]
logs-3.6-only_iso_domain_inactive

vdsm.log.1.xz:


Thread-52183::ERROR::2018-03-04 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 648, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 452, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 419, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-52183::INFO::2018-03-04 11:38:55,687::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 8c5e702c-486e-4bc4-9bbb-0155218abf14 became INVALID


engine.log:

ISO domain is unreachable:

2018-03-04 11:42:01,809 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (ajp-/127.0.0.1:8702-1) [] FINISH, GetFileStatsVDSCommand, log id: d97256f
2018-03-04 11:42:01,809 WARN  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)


ISO domain is being deactivated:

2018-03-04 11:43:22,542 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-6-thread-35) [762c9a06] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 8c5e702c-486e-4bc4-9bbb-0155218abf14 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN


The data center remains up, only the ISO domain is affected and becomes inactive.


There seems to be a bug in 3.6 in GetImagesListQuery: an NPE is thrown. Not sure if it's a known issue.


2018-03-04 11:42:01,809 WARN  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Query 'GetImagesListQuery' failed: null
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.removeFileStatsForComplyingFileNames(IsoDomainListSyncronizer.java:699) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateAllFileListFromVDSM(IsoDomainListSyncronizer.java:679) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateFileList(IsoDomainListSyncronizer.java:305) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomainFileForStoragePool(IsoDomainListSyncronizer.java:289) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomain(IsoDomainListSyncronizer.java:346) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshRepos(IsoDomainListSyncronizer.java:212) [bll.jar:]

Comment 29 Nir Soffer 2018-03-04 12:26:52 UTC
I don't see the regression here. The bug is about the behavior when the mount
command for the ISO domain gets stuck for more than 180 seconds.

Did we reproduce a stuck mount in 3.6? Do we see different behavior when mounting
the ISO domain gets stuck?

Comment 30 Nir Soffer 2018-03-04 12:28:44 UTC
(In reply to Elad from comment #28)
> Thread-52183::ERROR::2018-03-04
> 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error
> monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
> Traceback (most recent call last):

This means connectStorageServer succeeded, and we are monitoring the domain.

This bug is about the case when connectStorageServer timed out.

Comment 31 Elad 2018-03-04 12:32:03 UTC
Nir, how can we simulate this? 
Isn't blocking connection to the NFS server enough?

Comment 32 Nir Soffer 2018-03-04 12:38:59 UTC
(In reply to Elad from comment #31)
> Nir, how can we simulate this? 
> Isn't blocking connection to the NFS server enough?

It should, but if the mount command did not get stuck, we did not reproduce the
issue reported in this bug.

Comment 34 Elad 2018-03-04 12:43:31 UTC
Nir, maybe Germano can provide us a way of reproducing a stuck mount.

Comment 36 Germano Veit Michel 2018-03-05 02:00:53 UTC
I see the exact same behavior in 4.2 as I saw in 4.1 in comment #0. No heartbeat exceeded, just vdsTimeout and an up/down dance forever. The host cannot activate.

1. Engine sends connectStorageServer

2018-03-05 12:42:16,037+11 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]

2. vdsTimeout on connectStorageServer (3min)

2018-03-05 12:45:16,966+11 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] ... execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

3. Engine reconnects to host

2018-03-05 12:45:17,021+11 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to ...

4. Host is set to Up

2018-03-05 12:45:18,422+11 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Threa
d-62) [773ff1ff] EVENT_ID: VDS_DETECTED(13), Status of host ... was set to Up.

5. Then quickly switches to not responding 

2018-03-05 12:45:18,658+11 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-91979) [675de76a] Host '....' is not responding.

6. Engine connects again and sends another connectStorageServer. The Up/Not-Responding dance continues: steps 1-5 repeat again and again.

The only improvement is that now I can clearly see the blocked worker on the vdsm side trying to do the mount (thanks!), but that only helps with diagnosing the issue; it doesn't solve it, of course.

Versions used:
rhvm-4.2.0-0.6.el7.noarch
vdsm-4.20.17-1.el7ev.x86_64

Comment 51 Germano Veit Michel 2018-05-21 01:14:01 UTC
Hi Fred, 

I took a look at the patch and wondered if the same would happen with a data domain.
It does reproduce with data domains, same dance.

2018-05-21 10:43:33,055+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] START, ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 7e750511

2018-05-21 10:46:35,123+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [73c7e015] Command 'ConnectStorageServerVDSCommand(HostName = ovirt-h1, StorageServerConnectionManagementVDSParameters:{hostId='1628bc78-276c-45ca-b7e6-e6bcbfe77273', storagePoolId='6a8adb90-36e7-11e8-872c-001a4a61183e', storageType='ISCSI', connectionList='[StorageServerConnections:{id='7b82e212-40bd-4f0a-9ed3-fa9ab274f122', connection='10.64.24.33', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', connection='10.64.24.45', iqn='iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

VDSM took 256s to respond:
2018-05-21 10:47:49,356+1000 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 256.30 seconds (__init__:573)

Blocked worker:
2018-05-21 10:47:33,058+1000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'7b82e212-40bd-4f0a-9ed3-fa9ab274f122', u'connection': u'10.64.24.33', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a739', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}, {u'id': u'f00b96bd-edbe-4bc2-9b18-87e97ddc43c2', u'connection': u'10.64.24.45', u'iqn': u'iqn.2003-01.org.linux-iscsi.gsslab-24-33.x8664:sn.c2d0f339a738', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u'3260'}], u'storagepoolID': u'6a8adb90-36e7-11e8-872c-001a4a61183e', u'domainType': 3}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'6a36ae3a-3338-4175-af28-a07f7f371609'} at 0x7f2a7408aad0> timeout=60, duration=240 at 0x7f2a7408ae10> task#=16 at 0x7f2a7416a710>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 201, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1070, in connectStorageServer
  connectionParams)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
  result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapperFile: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2395, in connectStorageServer
  conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 487, in connect
  iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
  iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
  portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
  return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 80, in execCmd
  (out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)

  return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare
  result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
  return fn(*args, **kargs)
File: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2395, in connectStorageServer
  conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 487, in connect
  iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
  iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
  portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
  return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 80, in execCmd
  (out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)

VDSM does not reply to the engine in time. Then the host starts a Connecting->Not Responding->Attempt to Fence dance that seems quite hard to get rid of in 4.2. I had to stop the engine and switch it to maintenance in the DB.

The hosts going into this dance because VDSM is taking too long to complete a command doesn't sound right. The host is just fine; nothing is really wrong except for some unreachable storage for which a connection attempt takes more than 180s to time out. Even the DC goes up and down, with the SPM participating in this dance.

Can't the engine handle this better? Reporting to the user that the storage is unreachable is what I would expect, not the host going Not Responsive.

So even if the Export/ISO domains are ignored and not sent in the same request as the data domains, I'm afraid the same can happen for data domains, and IMO that is not desirable.

Would you like a new bug for this?

Comment 52 Fred Rolland 2018-05-21 05:13:23 UTC
Germano hi,

You are right. The fix is for ISO domains only.

Please open a different bug for data domains.

The solution there may be to change the handling of these commands to be async.
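A hypothetical sketch of what such async handling could look like (illustrative only; the function names and the job registry are assumptions, not an existing vdsm or engine API):

# Hypothetical sketch -- not an existing vdsm/engine API. The connect request
# returns a job id immediately and the engine polls the job, instead of
# blocking on the RPC until the storage connection attempt finishes.
import subprocess
import threading
import uuid

_jobs = {}
_jobs_lock = threading.Lock()

def connect_storage_server_async(cmd):
    """Start the (possibly slow) connect command and return a job id at once."""
    job_id = str(uuid.uuid4())
    with _jobs_lock:
        _jobs[job_id] = {"status": "running", "rc": None}

    def run():
        rc = subprocess.call(cmd)
        with _jobs_lock:
            _jobs[job_id] = {"status": "done" if rc == 0 else "failed", "rc": rc}

    t = threading.Thread(target=run)
    t.daemon = True
    t.start()
    return job_id

def get_job_status(job_id):
    with _jobs_lock:
        return _jobs.get(job_id, {"status": "unknown"})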

Comment 54 Kevin Alon Goldblatt 2018-08-16 16:18:02 UTC
Tested with the following code:
-------------------------------------
ovirt-engine-4.3.0-0.0.master.20180815091554.gitd5455ea.el7.noarch
vdsm-4.30.0-527.gitcec1054.el7.x86_64


Verified with the following scenario:
-------------------------------------
Scenario 1:
---------------
While the host is up, block access to the ISO domain with "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP"
time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out

>>>>> The host remains operational, the DC remains up

Scenario 2:
---------------
1. Set host to maintenance
2. block access to ISO domain with "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP"
time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out
3. Activate host >>>>> Host is activated and DC is still up


Moving to VERIFIED

Comment 57 Raz Tamir 2018-08-23 08:44:40 UTC
Moving back to MODIFIED as this is a downstream bug

Comment 60 Kevin Alon Goldblatt 2018-10-22 14:09:09 UTC
Tested with the following code:
-------------------------------------
ovirt-engine-4.2.7.3-0.1.el7ev.noarch
vdsm-4.20.43-1.el7ev.x86_64


Verified with the following scenario:
-------------------------------------
Scenario 1:
---------------
While the host is up, block access to the ISO domain with "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP"
time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out

>>>>> The host remains operational, the DC remains up

Scenario 2:
---------------
1. Set host to maintenance
2. block access to ISO domain with "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP"
time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out
3. Activate host >>>>> Host is activated and DC is still up


Moving to VERIFIED

Comment 61 RHV bug bot 2018-11-28 14:38:30 UTC
WARN: Bug status (VERIFIED) wasn't changed but the folowing should be fixed:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops: Bug status (VERIFIED) wasn't changed but the folowing should be fixed:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 62 RHV bug bot 2019-01-15 23:36:12 UTC
INFO: Bug status (VERIFIED) wasn't changed but the folowing should be fixed:

[Project 'ovirt-engine'/Component 'vdsm' mismatch]

For more info please contact: rhv-devops

Comment 64 errata-xmlrpc 2019-05-08 12:35:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 65 Daniel Gur 2019-08-28 13:14:43 UTC
sync2jira

Comment 66 Daniel Gur 2019-08-28 13:19:46 UTC
sync2jira

