1578763 – [downstream clone - 4.2.4] Unreachable ISO/Export SD prevents hosts from activating

Bug 1578763 - [downstream clone - 4.2.4] Unreachable ISO/Export SD prevents hosts from activating

Summary: [downstream clone - 4.2.4] Unreachable ISO/Export SD prevents hosts from acti...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	4.1.3
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-4.2.4
Target Release:	---
Assignee:	Fred Rolland
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:	1499056
Blocks:
TreeView+	depends on / blocked

Reported:	2018-05-16 11:29 UTC by RHV bug bot
Modified:	2021-09-09 14:05 UTC (History)
CC List:	21 users (show)
Fixed In Version:	ovirt-engine-4.2.4
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1499056
Environment:
Last Closed:	2018-06-27 10:02:46 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3208571	None	None	None	2018-05-16 11:34:58 UTC
Red Hat Product Errata	RHEA-2018:2072	None	None	None	2018-06-27 10:03:30 UTC
oVirt gerrit	89823	master	MERGED	engine: Ignore failure connecting ISO and Export SD	2020-10-21 01:49:19 UTC
oVirt gerrit	91274	ovirt-engine-4.2	MERGED	engine: Ignore failure connecting ISO and Export SD	2020-10-21 01:49:36 UTC

Description RHV bug bot 2018-05-16 11:29:23 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1499056 +++
======================================================================

Description of problem:

Not sure if this needs to be solved at vdsm or better handled by the engine, assigning to vdsm for initial evaluation.

When the engine issues a ConnectStorageServer command, VDSM uses mount.nfs to mount ISO/Export Domains. The engine expects this to complete in 180 seconds (vdsTimeout). However, mount(2) can take much longer than 180 seconds to return in some conditions.

This results in the host going unresponsive and the DC being completely down when the NFS server is unreachable and the engine is restarted (initialize the hosts), or when the environment is completely powered off and then brought up again. In the absence of the NFS domains, in a specific case, the hosts go in a loop as below:

2017-10-03 19:09:00,738-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START, ConnectStorageServerVDSCommand(.....), log id: 607cbf68

... 3 min vdsTimeout ...

2017-10-03 19:12:01,166-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [2a0ebe37] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM compute1 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues

2017-10-03 19:12:03,630-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] START
2017-10-03 19:12:03,807-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SetMOMPolicyParametersVDSCommand] (DefaultQuartzScheduler2) [2a0ebe37] FINISH

2017-10-03 19:12:03,987-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [73d5515a] EVENT_ID: VDS_DETECTED(13), Correlation ID: 31a1dfc3, Call Stack: null, Custom Event ID: -1, Message: Status of host compute1 was set to Up.

2017-10-03 19:12:04,057-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-43) [2a0ebe37] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host compute1 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

On the VDSM side we can see mount.nfs takes 4min21 seconds, which is more than vdsTimeout (3min). The worker is blocked and only after 4min21s vdsm returns the response to the engine, which already timed out.

2017-10-03 19:06:43,911-0400 INFO  (jsonrpc/7) [vdsm.api] START connectStorageServer(....)

2017-10-03 19:11:04,227-0400 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2384)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2381, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 381, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 163, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 155, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 190, in mount
    timeout=timeout, cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 53, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm.py", line 51, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (32, ';mount.nfs: Connection timed out\n')

Now on the specific case mount.nfs takes all this time to respond:
* The default mount.nfs will retry for 2 minutes. That is, if mount(2) returns with error within 2 minutes, it tries again. 
* But if the mount(2) call takes more than 2 minutes, then userspance mount doesn't return within 2 minutes. It's a retry parameter, not a hard timeout.
* mount(2) takes ~4m to return in case it's able to resolve the nexthop (arp) for the NFS server, but the NFS server does not return any ICMP or response to the TCP SYN. This is the TCP timeout. 

Can be hit in at least two cases:

Case 1: Host on network A, NFS Server on net B. Host reaches network A gateway (nexthop resolved), TCP SYN sent. No response from NFS Server.

Case 2: NFS Server on same network as Host, but NFS server simply drops the TCP SYN (silent drop firewall rule).

The point is if the Host kernel sends the TCP SYN and it does not receive any reply with error or an ICMP, mount(2) will take 4 minutes to return, by the TCP timeout.

Version-Release number of selected component (if applicable):
rhevm-4.1.3.5-0.1.el7.noarch
vdsm-4.19.20-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
Easiest way is to silently DROP outgoing IP to the NFS server:

# iptables -I OUTPUT 1 -d <NFS_IP> -j DROP
# time mount <NFS_IP>:/ /mnt
mount.nfs: Connection timed out

real    4m17.504s
user    0m0.004s
sys    0m0.001s

Note that the condition is that mount(2) doesn't return quickly. In the case below it would try for 2 minutes, probably dozens of mount(2) calls and then give up. 

# date && time mount.nfs -o timeo=600,retrans=6 -v <NFS_IP>:/tmp /mnt
Fri Oct  6 10:26:46 AEST 2017
mount.nfs: timeout set for Fri Oct  6 10:28:46 2017
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
mount.nfs: trying text-based options 'timeo=600,retrans=6,vers=4'
mount.nfs: mount(2): No route to host
^C
real    0m9.742s
user    0m0.000s
sys     0m0.004s

Usually we would expect it to do something like the above, which would be mount(2) returning quicker by receiving a proper error from the network layer instead of waiting for the tcp timeout (4m). In case mount(2) hangs on TCP timeout, it ultimately leads to vdsm exceeding vdsTimeout to return ConnectStorageServer. And this seems to break host initialization.

Actual results:
RHV doesn't go up, DC is down, hosts doing up/not-responding dance.

Expected results:
RHV Up

(Originally by Germano Veit Michel)

Comment 3 RHV bug bot 2018-05-16 11:29:40 UTC

This is the same behavior as any other domain, pushing out to 4.3 to rethink.

(Originally by amureini)

Comment 4 RHV bug bot 2018-05-16 11:29:46 UTC

(In reply to Allon Mureinik from comment #2)
> This is the same behavior as any other domain, pushing out to 4.3 to rethink.

Is it? I thought vdsTimeout exception for connectStorageServer on a Data Domain would move the Host to Non-Operational. Might be wrong though...

(Originally by Germano Veit Michel)

Comment 12 RHV bug bot 2018-05-16 11:30:34 UTC

Stuck mount can be solved by letting someone else wait for the mount, and returning
a timeout early. Engine can modify the domain or host status considering the domain
status on other hosts.

One option is to let systemd do the mount using mount units. If mount get stuck,
this is now systemd problem. Vdsm can wait for a mount events, and report to
engine when the mount was completed. See https://gerrit.ovirt.org/#/c/56516/
for example using mount events.

Using persistent mount units, a host can recover the mounts automatically after
a reboot - this will simplify hosted engine flows.

(Originally by Nir Soffer)

Comment 14 RHV bug bot 2018-05-16 11:30:46 UTC

(In reply to Nir Soffer from comment #11)
> Stuck mount can be solved by letting someone else wait for the mount, and
> returning
> a timeout early. Engine can modify the domain or host status considering the
> domain
> status on other hosts.
> 
> One option is to let systemd do the mount using mount units. If mount get
> stuck,
> this is now systemd problem. Vdsm can wait for a mount events, and report to
> engine when the mount was completed. See https://gerrit.ovirt.org/#/c/56516/
> for example using mount events.
> 
> Using persistent mount units, a host can recover the mounts automatically
> after
> a reboot - this will simplify hosted engine flows.

Allon should we move it back to 4.3?

(Originally by ylavi)

Comment 17 RHV bug bot 2018-05-16 11:31:05 UTC

Returning old needinfo

(Originally by Raz Tamir)

Comment 23 RHV bug bot 2018-05-16 11:31:41 UTC

Tested with the following code:
-----------------------------------------
rhevm-3.6.13.1-0.1.el6.noarch
vdsm-4.17.44-2.el7ev.noarch

Tested with the following scenario:
---------------------------------------

1. On a system with nfs, iscsi and ISO domain
2. Ran "iptables -I OUTPUT 1 -d <NFS_IP> -j DROP" from the host to the ISO domain >>>> The data center was set to Non responsive


Engine logs:
-------------------------------
2018-03-03 22:33:44,459 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-65) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Client close
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSGenericException: VDSNetworkException: Client close, log id: 8478bdf
2018-03-03 22:33:44,467 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-65) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Client close
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand.executeIrsBrokerCommand(GetStoragePoolInfoVDSCommand.java:32) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand.executeVDSCommand(IrsBrokerCommand.java:161) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData.proceedStoragePoolStats(IrsProxyData.java:336) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData._updatingTimer_Elapsed(IrsProxyData.java:238) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor134.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2018-03-03 22:33:44,483 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-65) [] IRS failover failed - can't allocate vds server
2018-03-03 22:33:44,498 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[dea07dac-f1a7-4a94-ac7b-ed697b635276=<POOL, ACTION_TYPE_FAILED_OBJECT_LOCKED>]'}'
2018-03-03 22:33:44,511 INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (org.ovirt.thread.pool-6-thread-6) [4b9f84fe] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: 1fed50d9-ed6c-4adf-bcbc-37487de67bb0 Type: VDS

.
.
.
2018-03-03 22:33:54,498 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-64) [113f6f10] Storage Pool 'dea07dac-f1a7-4a94-ac7b-ed697b635276' - Updating Storag
e Domain 'fb39c7f3-7bed-4946-970a-ff3c0656726d' status from 'Active' to 'Unknown', reason: null
2018-03-03 22:33:54,503 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-64) [113f6f10] Correlation ID: 113f6f10, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center golden_env_mixed. Setting status to Non Responsive.
2018-03-03 22:33:54,518 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-64) [113f6f10] START, ConnectStoragePoolVDSCommand(HostName = host_mixed_1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vdsId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276', masterVersion='1'}), log id: 671a00e8
.
.
.
2018-03-03 22:36:20,618 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-65) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,621 INFO  [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-65) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', vds='Host[host_mixed_1,5c9876d4-d68d-4eac-8f5a-10557b5f319a]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) []  Failed getting vds stats,  vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,623 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-54) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:74) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source) [:1.8.0_121]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_121]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-54) [] Failed to refresh VDS, network error, continuing, vds='host_mixed_1'(5c9876d4-d68d-4eac-8f5a-10557b5f319a): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,624 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-24) [] Host 'host_mixed_1' is not responding.
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Heartbeat exeeded
2018-03-03 22:36:20,625 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-63) [] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{runAsync='true', hostId='5c9876d4-d68d-4eac-8f5a-10557b5f319a', storagePoolId='dea07dac-f1a7-4a94-ac7b-ed697b635276'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2018-03-03 22:36:20,627 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-63) [39df4684] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: dea07dac-f1a7-4a94-ac7b-ed697b635276 Type: StoragePool

(Originally by Kevin Goldblatt)

Comment 24 RHV bug bot 2018-05-16 11:31:49 UTC

Created attachment 1403592 [details]
engine and vdsm logs

Added logs

(Originally by Kevin Goldblatt)

Comment 25 RHV bug bot 2018-05-16 11:31:55 UTC

According to reproduction from Kevin, which is not the same issue as the bug itself, this seems like a regression

(Originally by Raz Tamir)

Comment 26 RHV bug bot 2018-05-16 11:32:01 UTC

The iso domain which was used in the test reported in comment #22 was on the engine so non responsive is the desired behavior.
Upon blocking connection to the storage server where the iso domain resides, only the iso domain becomes inactive

(Originally by Elad Ben Aharon)

Comment 28 RHV bug bot 2018-05-16 11:32:14 UTC

(In reply to Elad from comment #25)
> The iso domain which was used in the test reported in comment #22 was on the
> engine so non responsive is the desired behavior.
> Upon blocking connection to the storage server where the iso domain resides,
> only the iso domain becomes inactive
Logs?

(Originally by amureini)

Comment 29 RHV bug bot 2018-05-16 11:32:20 UTC

Created attachment 1403741 [details]
logs-3.6-only_iso_domain_inactive

vdsm.log.1.xz:


Thread-52183::ERROR::2018-03-04 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 648, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 274, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 452, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 419, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-52183::INFO::2018-03-04 11:38:55,687::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 8c5e702c-486e-4bc4-9bbb-0155218abf14 became INVALID


engine.log:

ISO domain is unreachable:

2018-03-04 11:42:01,809 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetFileStatsVDSCommand] (ajp-/127.0.0.1:8702-1) [] FINISH, GetFileStatsVDSCommand, log id: d97256f
2018-03-04 11:42:01,809 WARN  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)


ISO domain is being deactivated:

2018-03-04 11:43:22,542 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-6-thread-35) [762c9a06] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 8c5e702c-486e-4bc4-9bbb-0155218abf14 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN


The data center remains up, only the ISO domain is affected and becomes inactive.


There seems to be a bug on 3.6 for GetImagesListQuery, an NPE is thrown. Not sure if it's a known issue. 


2018-03-04 11:42:01,809 WARN  [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (ajp-/127.0.0.1:8702-1) [] The refresh process for pattern * failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsOperationFailedNoFailoverException: IRSGenericException: IRSErrorException: Cannot get file stats: ('8c5e702c-486e-4bc4-9bbb-0155218abf14',) (Failed with error GET_FILE_LIST_ERROR and code 330)
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Query 'GetImagesListQuery' failed: null
2018-03-04 11:42:01,810 ERROR [org.ovirt.engine.core.bll.GetImagesListQuery] (ajp-/127.0.0.1:8702-1) [] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.removeFileStatsForComplyingFileNames(IsoDomainListSyncronizer.java:699) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateAllFileListFromVDSM(IsoDomainListSyncronizer.java:679) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.updateFileList(IsoDomainListSyncronizer.java:305) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomainFileForStoragePool(IsoDomainListSyncronizer.java:289) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshIsoDomain(IsoDomainListSyncronizer.java:346) [bll.jar:]
        at org.ovirt.engine.core.bll.IsoDomainListSyncronizer.refreshRepos(IsoDomainListSyncronizer.java:212) [bll.jar:]

(Originally by Elad Ben Aharon)

Comment 30 RHV bug bot 2018-05-16 11:32:27 UTC

I don't see the regression here. The bug is about the behavior when the the mount
command for the iso domain get stuck for more then 180 seconds.

Did we reproduce stuck mount in 3.6? do we have different behavior when mounting
the iso domain is stuck?

(Originally by Nir Soffer)

Comment 31 RHV bug bot 2018-05-16 11:32:34 UTC

(In reply to Elad from comment #28)
> Thread-52183::ERROR::2018-03-04
> 11:38:55,686::monitor::276::Storage.Monitor::(_monitorDomain) Error
> monitoring domain 8c5e702c-486e-4bc4-9bbb-0155218abf14
> Traceback (most recent call last):

This means the connectStoragesServer succeeded, and we are monitoring the domain.

This bug is about the case when connectStorageServer timed out.

(Originally by Nir Soffer)

Comment 32 RHV bug bot 2018-05-16 11:32:39 UTC

Nir, how can we simulate this? 
Isn't blocking connection to the NFS server enough?

(Originally by Elad Ben Aharon)

Comment 33 RHV bug bot 2018-05-16 11:32:47 UTC

(In reply to Elad from comment #31)
> Nir, how can we simulate this? 
> Isn't blocking connection to the NFS server enough?

It should, but if the mount commmand did not got stuck, we did not reproduce the
issue reported in this bug.

(Originally by Nir Soffer)

Comment 35 RHV bug bot 2018-05-16 11:33:01 UTC

Nir, maybe Germano can provide us a way of reproducing stuck mount.

(Originally by Elad Ben Aharon)

Comment 37 RHV bug bot 2018-05-16 11:33:16 UTC

I see the exact same behavior in 4.2 as I saw in 4.1 in comment #0. No heartbeat exceeded, just vdsTimeout and a up/down dance forever. Host cannot activate.

1. Engine sends connectStorageServer

2018-03-05 12:42:16,037+11 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]

2. vdsTimeout on connectStorageServer (3min)

2018-03-05 12:45:16,966+11 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] ... execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

3. Engine reconnects to host

2018-03-05 12:45:17,021+11 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to ...

4. Host is set to Up

2018-03-05 12:45:18,422+11 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Threa
d-62) [773ff1ff] EVENT_ID: VDS_DETECTED(13), Status of host ... was set to Up.

5. Then quickly switches to not responding 

2018-03-05 12:45:18,658+11 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-91979) [675de76a] Host '....' is not responding.

6. Engine connects again and sends another connectStorageServer. Up/Not-Responding dance continues, repeat 1-5 again, and again and again.

The only improvement is that now I can clearly see the blocked worked on vdsm side trying to do mount (thanks!), but it only helps diagnosing the issue, it doesn't solve it of course.

Versions used:
rhvm-4.2.0-0.6.el7.noarch
vdsm-4.20.17-1.el7ev.x86_64

(Originally by Germano Veit Michel)

Comment 52 RHV bug bot 2018-05-24 23:53:35 UTC

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'ovirt-engine'/Component 'vdsm' mismatch]

For more info please contact: rhv-devops

Comment 56 Kevin Alon Goldblatt 2018-05-28 13:15:45 UTC

Verified with the following code:
-----------------------------------------
ovirt-engine-4.2.4-0.1.el7.noarch
vdsm-4.20.28-1.el7ev.x86_64

Verified with the following scenario:
-----------------------------------------
Steps:
- Data Center with at least two host
- Configure an ISO storage domain
- Move one of the hosts to Maintenance
- Block with IPtables the connectivity to the ISO domain from the host in maintenance only.
- Activate the Host >>>>> The host move to Up state.


Moving to VERIFIED

Comment 58 errata-xmlrpc 2018-06-27 10:02:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2072

Comment 59 Franta Kust 2019-05-16 13:08:43 UTC

BZ<2>Jira Resync

Comment 60 Daniel Gur 2019-08-28 13:14:47 UTC

sync2jira

Comment 61 Daniel Gur 2019-08-28 13:19:49 UTC

sync2jira

Note You need to log in before you can comment on or make changes to this bug.

aperotti
danken
ebenahar
frolland
gscott
gveitmic
jbryant
kgoldbla
kshukla
ktordeur
lsurette
milee
mkalinin
mklika
nsoffer
ratamir
srevivo
tnisan
ycui
ykaul
ylavi