1264498 – GetStatsVDSCommand encountered VDSNetworkException timeouts due to delayed repoStats response on host

Bug 1264498 - GetStatsVDSCommand encountered VDSNetworkException timeouts due to delayed repoStats response on host

Summary: GetStatsVDSCommand encountered VDSNetworkException timeouts due to delayed re...

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.5.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-3.5.6
Target Release:	3.5.6
Assignee:	Maor
QA Contact:	Aharon Canan
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-09-18 15:24 UTC by Gordon Watson
Modified:	2019-10-10 10:13 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-10-26 12:57:33 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Gordon Watson 2015-09-18 15:24:51 UTC

Description of problem:

We had a customer issue where they experienced a delayed repostats response on a host that resulted in GetStatsVDSCommand encountering VDSNetworkException timeouts. From one of these the host recovered, but from the second, the subsequent 'ConnectStoragePool' also encountered a VDSNetworkException timeout, and the host went non-operational, resulting in VM migrations.


Version-Release number of selected component (if applicable):

RHEV-M 3.5.3
RHEL 6.6 host w/vdsm-4.16.20-1


How reproducible:

Not.


Steps to Reproduce:
1.
2.
3.

Actual results:


-  engine;

<<<  VDSNetworkException  >>>

2015-08-29 00:32:11,929 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-93) [3dd011c9] Command GetStatsVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, vds=Host[rvh0207,b3260d95-e524-4f52-b356-8d36a330121f]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException


<<<  "Connecting"  >>>

2015-08-29 00:32:11,939 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-93) [3dd011c9] Host rvh0207 is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

2015-08-29 00:32:12,281 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-93) [3dd011c9] vds::refreshVdsStats Failed getVdsStats,  vds = b3260d95-e524-4f52-b356-8d36a330121f : rvh0207, error = VDSNetworkException: java.util.concurrent.TimeoutException
2015-08-29 00:32:12,303 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-93) [3dd011c9] Failed to refresh VDS , vds = b3260d95-e524-4f52-b356-8d36a330121f : rvh0207, VDS Network Error, continuing.


<<<  Connect host to DC succeeded  >>>

2015-08-29 00:34:00,043 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-82) [4767992c] START, ConnectStorageServerVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, 
2015-08-29 00:34:03,904 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (DefaultQuartzScheduler_Worker-82) [4767992c] Host rvh0207 storage connection was succeeded 

2015-08-29 00:34:03,932 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-17) START, ConnectStoragePoolVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, vdsId = b3260d95-e524-4f52-b356-8d36a330121f, storagePoolId = 7256bf2a-e9ac-4712-8522-3a32a588a975, masterVersion = 1), log id: 7ee4566a


<<<  VDSNetworkException  >>>

2015-08-29 00:42:44,683 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-60) [4cb41909] Command GetStatsVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, vds=Host[rvh0207,b3260d95-e524-4f52-b356-8d36a330121f]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException


<<<  Connect host to DC failed  >>>

2015-08-29 00:44:17,573 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-16) [5f5e251c] START, ConnectStorageServerVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f,
2015-08-29 00:44:18,956 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (DefaultQuartzScheduler_Worker-16) [5f5e251c] Host rvh0207 storage connection was succeeded 

2015-08-29 00:47:18,984 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-35) Command ConnectStoragePoolVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, vdsId = b3260d95-e524-4f52-b356-8d36a330121f, storagePoolId = 7256bf2a-e9ac-4712-8522-3a32a588a975, masterVersion = 1) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2015-08-29 00:47:18,984 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-7-thread-35) Could not connect host rvh0207 to pool DC_PIS with the message: null


<<<  -> "non-operational"  >>>

2015-08-29 00:47:19,068 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (DefaultQuartzScheduler_Worker-16) [a560ba0] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: b3260d95-e524-4f52-b356-8d36a330121f Type: VDS
2015-08-29 00:47:19,083 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler_Worker-16) [a560ba0] START, SetVdsStatusVDSCommand(HostName = rvh0207, HostId = b3260d95-e524-4f52-b356-8d36a330121f, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE, stopSpmFailureLogged=false), log id: 4cfe6a75



-  host, vdsm log;

<<<  first connectStoragePool responded within 22 seconds  >>>

Thread-315668::INFO::2015-08-29 00:34:03,938::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='7256bf2a-e9ac-4712-8522-3a32a588a975', hostID=15, msdUUID='17d79935-de71-40b6-b0ae-1d8091ecfd74'
Thread-315668::INFO::2015-08-29 00:34:25,407::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
Thread-316074::INFO::2015-08-29 00:44:17,588::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='7256bf2a-e9ac-4712-8522-3a32a588a975', conList=[{'connection': 'iosvm01-port1.dosko
Thread-316074::INFO::2015-08-29 00:44:18,951::logUtils::47::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '02436346-4efc-4e05-86ac-dd778ea2f797'}, {'statu


<<<  second connectStoragePool took 5 minutes to respond  >>>

Thread-316074::INFO::2015-08-29 00:44:19,878::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='7256bf2a-e9ac-4712-8522-3a32a588a975', hostID=15, msdUUID='17d79935-de71-40b6-b0ae-1d8091ecfd74'
Thread-316074::INFO::2015-08-29 00:49:15,822::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True



-  lastCheck;

00:27:15,255::logUtils::47::dispatcher::(wrapper) '5.3', '0.7', '2.3', '3.6', '2.3', '8.3', '5.3', '2.0', '7.1', '7.0', '2.3', '0.5', '6.0', '3.6', '2.0', '7.5', '0.4',
00:27:30,552::logUtils::47::dispatcher::(wrapper) '0.5', '6.0', '7.6', '8.8', '7.5', '3.6', '0.5', '7.2', '2.4', '2.3', '7.5', '5.8', '1.2', '8.8', '7.2', '2.8', '5.7',
00:32:58,748::logUtils::47::dispatcher::(wrapper) '139.9', '63.4', '86.0', '108.6', '63.4', '18.5', '91.7', '80.6', '37.0', '59.2', '83.8', '43.3', '66.0', '129.1', '58.8', '30.9', '36.9',
00:34:25,499::logUtils::47::dispatcher::(wrapper) '0.5', '0.6', '0.7', '0.8', '0.8', '0.7', '0.7', '0.7', '0.6', '0.6', '0.8', '0.6', '0.5', '0.6', '0.9', '0.7', '0.6',

..... etc, etc., .....

00:37:58,517::logUtils::47::dispatcher::(wrapper) '1.9', '2.3', '2.0', '2.0', '3.2', '3.0', '1.7', '3.1', '1.5', '1.7', '3.4', '1.5', '1.7', '1.7', '3.3', '1.6', '2.5',
00:38:13,776::logUtils::47::dispatcher::(wrapper) '7.2', '7.6', '7.2', '7.2', '8.4', '8.3', '7.0', '8.3', '6.7', '6.9', '8.6', '6.8', '6.9', '6.9', '8.5', '6.8', '7.7',
00:41:47,174::logUtils::47::dispatcher::(wrapper) '19.4', '32.3', '177.6', '25.9', '45.2', '-0.1', '-7.2', '12.9', '45.1', '25.8', '65.9', '0.1', '177.2', '25.8', '-26.4', '0.1', '19.3',
00:47:53,581::logUtils::47::dispatcher::(wrapper) '418.9', '269.4', '385.9', '269.4', '306.0', '399.4', '218.3', '260.5', '319.3', '262.0', '339.8', '257.2', '312.9', '59.2', '215.1', '259.0', '260.5',
00:48:26,082::logUtils::47::dispatcher::(wrapper) '451.4', '301.9', '13.2', '301.9', '338.5', '431.9', '250.8', '293.0', '351.8', '294.5', '372.3', '289.7', '345.4', '91.7', '247.6', '291.5', '293.0',
00:49:52,532::logUtils::47::dispatcher::(wrapper) '80.0', '49.6', '99.6', '41.7', '76.8', '26.7', '14.3', '39.2', '46.7', '31.7', '42.1', '42.4', '27.1', '178.2', '14.3', '32.1', '31.0',
01:06:52,835::logUtils::47::dispatcher::(wrapper) '44.9', '104.8', '77.2', '3.6', '41.2', '82.4', '5.4', '62.0', '9.5', '64.6', '16.2', '49.8', '303.0', '57.5', '13.9', '91.8', '81.5',
01:13:24,623::logUtils::47::dispatcher::(wrapper) '0.1', '1.8', '1.5', '114.3', '1.8', '116.1', '113.9', '1.5', '0.0', '456.3', '114.3', '111.6', '1.4', '1.6', '125.9', '117.4', '1.4',
01:15:44,252::logUtils::47::dispatcher::(wrapper) '9.6', '1.2', '1.0', '8.4', '1.2', '8.9', '8.4', '0.8', '8.8', '8.7', '8.6', '8.9', '0.4', '1.1', '8.7', '8.9', '0.8',

..... etc, etc., .....


Expected results:

No lengthy repostat thread delays.


Additional info:

Comment 1 Gordon Watson 2015-09-18 15:32:22 UTC

I actually didn't quite express the 'Expected Results' above correctly. What I meant to say was that the repostats delay caused the initial problem of 'GetStatsVDSCommand' timing out. Az a result, 'ConnectStoragePoolVDSCommand' ran and itself timed out, resulting in the host going non-operational.

I'm not sure why the 'ConnectStoragePoolVDSCommand' timed out specifically, but it is presumably related to the repostats issue or maybe the SD monitor threads. This is the same customer for which BZ 1262675 was opened recently, in which SD monitoring just appeared to stop.

Comment 7 Allon Mureinik 2015-09-21 10:59:50 UTC

Tal, let's get someone assigned to investigate this one please?

Comment 8 Maor 2015-10-16 09:41:38 UTC

In the engine log it looks that the http request for connectStoragePool has been timed out after 3 minutes.
The VDSM thread of connectStoragePool timed out after 5 minutes (As Gordon mentioned earlier)

Nir, 
Since this bug was opened by the the same customer could it be that the domainMonitor problem mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1262675 is related to this issue.

Comment 9 Nir Soffer 2015-10-20 12:45:06 UTC

(In reply to Maor from comment #8)
> Since this bug was opened by the the same customer could it be that the
> domainMonitor problem mentioned at
> https://bugzilla.redhat.com/show_bug.cgi?id=1262675 is related to this issue.

I don't think so. repoStat response does not depend on functioning vdsm 
monitor threads; it just grab the current status and return, and should not
be effected by bug 1262675. Note that we don't have any clue what happened
in bug 1262675 - all monitoring threads are gone, and we don't have any
info explaining that.

Regarding the network timeout during repoStat, lets check the vdsm logs
and see how much time the repoStat command took - should be milliseconds.

Comment 10 Maor 2015-10-21 09:19:59 UTC

Hi Gordon,

It looks that the host rvh0207 is not using the JSON protocol.
Is there any reason why the customer don't want to change the host to use the json protocol?

In the logs it looks like the engine calls getStats and immediately get a Netwrok exception although the Host seems to communicate well.

Is there any chance to change the Host to use JSON protocol (By 'edit' the Host and check the checkbox of Json under Advanced parameters), and see if there is any difference in the behavior?

Comment 11 Maor 2015-10-21 09:21:54 UTC

(In reply to Maor from comment #10)
> Hi Gordon,
> 
> It looks that the host rvh0207 is not using the JSON protocol.
> Is there any reason why the customer don't want to change the host to use
> the json protocol?
> 
> In the logs it looks like the engine calls getStats and immediately get a
> Netwrok exception although the Host seems to communicate well.

With xml RPC the timeout for connection is 2 seconds, that is probably why we got the timeout there, in json RPC it should not happen.
besides that in 3.6 xmlrpc is deprecated and users are not able to use it anymore

> 
> Is there any chance to change the Host to use JSON protocol (By 'edit' the
> Host and check the checkbox of Json under Advanced parameters), and see if
> there is any difference in the behavior?

Comment 12 Maor 2015-10-21 09:42:02 UTC

Gordon,

Can u please also confirm the timeout of the user configured in the DB.
This is the sql command that should be executed:
SELECT * FROM vdc_options where option_name in ('vdsConnectionTimeout', 'vdsTimeout');

Comment 13 Maor 2015-10-21 14:20:02 UTC

Regarding the issue of ConnectStoragePool that took 5 minutes.
From the logs it looks like the DomainMonitor was running in the middle of the ConnectStoragePool

In the logs it can be shown that few samplingMethod were called by the DomainMonitor, trying to refresh the Storage Domains:

fc-scan:
Thread-82::DEBUG::2015-08-29 00:46:11,993::utils::739::Storage.HBA::(execCmd) /usr/bin/sudo -n /usr/libexec/vdsm/fc-scan 
...
multipath rescan:
Thread-82::DEBUG::2015-08-29 00:46:27,231::multipath::131::Storage.Misc.excCmd::(rescan) /usr/bin/sudo -n /sbin/multipath (cwd None)
...
Thread-207::DEBUG::2015-08-29 00:47:03,633::iscsi::424::Storage.ISCSI::(rescan) Performing SCSI scan, this will take up to 30 seconds

The domain monitor wakes up from time to time, and the operations of scan/rescan could be heavy and acquire a lock which connectStoragePool needs to wait for to connect to its Storage Domains (_syncroot in method sdc.py#_realProduce())

Comment 15 Maor 2015-10-22 07:24:42 UTC

Hi Gordon,
Thank you for the quick respond!
Your assumption was correct, GetStatsVDSCommand leading to ConnectStoragePool.

The timeout of the ConnectStoragePool is basically because of a coincidence which the domain monitor refreshing all the storage domains in the Host right in the middle of the process, and since the rescan of the storage domains took a while, the timeout occurred.
This is a known issue which we aware of, perhaps it is worth to open a separate bug.

The main problem of this bug is that getVdsStats gets timed out - 
  The problem with analyzing this issue is that the log level of the customer engine is info (not debug) so the call for getVdsStats could not be seen, only the exceptions which were thrown.
It is hard to determine whether the timeout occurred because of connectionTimeOut or the vdsTimeOut.
but, working with json might solve this, since with json, there is no need to establish a new connection for every request to the VDSM, so there should not be any connectionTimeOut.

I would try to ask the user to move to json for a while and see if that timeout still occurs.
If the user still suffering from getVdsStats time out, can u please ask the customer to change the debug level of the engine logs to DEBUG instead of INFO and attach the appropriate logs again.
(see http://www.ovirt.org/OVirt_Engine_Development_Environment#Enable_DEBUG_log_-_restart_required)

Comment 17 Maor 2015-10-22 16:13:25 UTC

If the issue does not appear anymore in the customer environment, can u please close the bug for now, and once it will be relevant again, reopen it with the appropriate logs (mentioned at Comment 15).

Thanks,
Maor

Comment 18 Maor 2015-10-26 12:57:33 UTC

I'm Closing the bug for now, 
Please feel free to reopen, if the problem persists.

Note You need to log in before you can comment on or make changes to this bug.