1177530 – [scale] repoStat lastcheck value too high and delayed

Summary: [scale] repoStat lastcheck value too high and delayed

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.5.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Nir Soffer
QA Contact:	Eldad Marciano
Docs Contact:
URL:
Whiteboard:	storage
Depends On:	1177634
Blocks:

Reported:	2014-12-28 16:38 UTC by Eldad Marciano
Modified:	2016-02-10 19:32 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-01-13 14:52:21 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Description Eldad Marciano 2014-12-28 16:38:01 UTC

Description of problem:
Some of the hosts become non-operational due to the following errors reported by the engine:

- Storage Domain storage_real of pool dc_real is in problem
- storage_real check timeot 54.9 is too big

This looks like a storage issue, but it is not: the storage is monitored separately and shows no latency.

There are no ERRORs at all on the vdsm side.
According to the vdsm logs, repoStat always reports valid true and a low delay (near 0),
but sometimes the lastcheck value is high, more than 30 sec (which is the threshold on the engine side).

We should identify why the lastcheck operation is delayed at scale.
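
To make the symptom concrete, here is a minimal sketch of the kind of check described above: a domain is flagged when its lastCheck value exceeds the 30-second threshold, even though valid is true and delay is low. The dictionary layout, the `stale_domains` helper, and the `storage_other` domain are assumptions made for illustration; this is not the engine's actual code.

```python
# Threshold quoted in this report for the engine side (seconds).
ENGINE_THRESHOLD_SEC = 30.0


def stale_domains(repo_stats, threshold=ENGINE_THRESHOLD_SEC):
    """Return {domain: lastCheck} for domains checked longer ago than threshold."""
    stale = {}
    for domain, stats in repo_stats.items():
        last_check = float(stats["lastCheck"])
        if last_check > threshold:
            stale[domain] = last_check
    return stale


# Hypothetical repoStat-style sample; only 54.9 comes from this report.
sample = {
    "storage_real": {"valid": True, "delay": "0.0005", "lastCheck": "54.9"},
    "storage_other": {"valid": True, "delay": "0.0007", "lastCheck": "3.2"},
}

print(stale_domains(sample))
```

Note that the flagged domain still reports valid true and a tiny delay, which matches the observation that the storage itself is healthy and only the check is late.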


Version-Release number of selected component (if applicable):
3.5 VT 13.5

How reproducible:
100%

Steps to Reproduce:
1. Scale up to 37 hosts (3K VMs).
2. The issue reproduces especially while removing VMs in bulks of 50.

Actual results:
The storage becomes unavailable.
The host becomes NonOperational.

Expected results:
The repoStat monitor should run as an independent operation, unaffected by other operations.
Storage and hosts should stay active while the delay is low.

Additional info:

Comment 1 Allon Mureinik 2014-12-29 09:15:26 UTC

Repostats monitoring is a storage flow, and the research should start from our side. In the worst case, if the analysis of this bug uncovers an infra issue, we'll move it back there.

Nir - please take lead on this.

Comment 2 Nir Soffer 2015-01-04 14:34:06 UTC

Eldad, we need logs from the host that is becoming non-operational.

Please provide these logs from the machine:

/var/log/messages
/var/log/sanlock.log
/var/log/vdsm/vdsm.log

To understand host health during that time, we also need sar logs.
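
A minimal sketch for gathering these files into one archive to attach to the bug. The `collect_logs` helper is hypothetical, and `/var/log/sa` is an assumption about where sysstat keeps the sar data on the host; adjust the list if your setup differs.

```python
import os
import tarfile

LOG_PATHS = [
    "/var/log/messages",
    "/var/log/sanlock.log",
    "/var/log/vdsm/vdsm.log",
    "/var/log/sa",  # assumption: default sysstat (sar) data directory
]


def collect_logs(archive_path, paths=LOG_PATHS):
    """Pack the paths that exist into a gzip tarball; return the missing ones."""
    missing = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            if os.path.exists(path):
                archive.add(path)  # directories are added recursively
            else:
                missing.append(path)
    return missing
```

Calling `collect_logs("/tmp/host-logs.tar.gz")` as root on the affected host would produce the archive and return any paths that were not found.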

Let's try to reproduce this with a smaller setup:
- one host running engine
- one host running x vms
- same storage used in your full scale test

Can you reproduce it on such a setup?

If you can, we will need access to that setup for investigation.

Comment 3 Tal Nisan 2015-01-13 14:52:21 UTC

Next time please open the bug correctly, with the logs attached, especially when it concerns the scale environment, which is rapidly changing.

Comment 4 Eldad Marciano 2015-01-15 11:56:31 UTC

Since we have higher-priority performance issues on that env, I have customized the lastCheck threshold on the engine side:
MaxStorageVdsTimeoutCheckSec=120

This issue may reproduce on loaded hosts.
