Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1177530

Summary: [scale] repoStat lastcheck value too high and delayed
Product: Red Hat Enterprise Virtualization Manager
Reporter: Eldad Marciano <emarcian>
Component: vdsm
Assignee: Nir Soffer <nsoffer>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Eldad Marciano <emarcian>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.5.0
CC: amureini, bazulay, ecohen, emarcian, gklein, iheim, lpeer, lsurette, nsoffer, tnisan, yeylon
Target Milestone: ---
Target Release: 3.5.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-01-13 14:52:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1177634    
Bug Blocks:    

Description Eldad Marciano 2014-12-28 16:38:01 UTC
Description of problem:
Some of the hosts become non-operational, due to the following errors reported by the engine:

- Storage Domain storage_real of pool dc_real is in problem
- storage_real check timeout 54.9 is too big

So it looks like we have a storage issue, but we do not: the storage is monitored separately and shows no latency.

There are no ERRORs at all on the vdsm side.
According to the vdsm logs, repoStats always reports valid=true and a low delay (close to 0),
but sometimes the lastCheck value is large, more than 30 seconds (which is the threshold on the engine side).

We should identify why the lastCheck operation is delayed at scale.
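
For illustration only (the domain UUID and numbers below are invented, not taken from the logs of this run), a per-domain repoStats entry matching the pattern described above would look roughly like:

    {'b4c488af-...': {'valid': True,
                      'delay': '0.0005',    # the storage itself answers quickly
                      'lastCheck': '54.9',  # seconds since the last completed check
                      'code': 0}}

That is, delay stays low, but the monitoring loop has not refreshed the domain for longer than the engine's 30-second threshold (lastCheck).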


Version-Release number of selected component (if applicable):
3.5 VT 13.5

How reproducible:
100%

Steps to Reproduce:
1. Scale up to 37 hosts (3K VMs).
3. The issue is especially reproduced while removing VMs in bulks of 50.

Actual results:
Storage becomes unavailable.
Hosts become non-operational.

Expected results:
The repoStats monitor should run as a standalone operation, unaffected by other operations.
Storage and hosts should stay active while the delay is low.
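
To make that expectation concrete, here is a minimal sketch (plain Python, not vdsm code; the function and parameter names are mine) of a per-domain monitor running in its own thread, so the reported lastCheck age depends only on the check itself and not on other operations running on the host:

    import threading
    import time

    def start_domain_monitor(check, interval=10):
        # 'check' is a callable that touches the storage domain and returns True/False.
        state = {'lastCheck': time.time(), 'delay': 0.0, 'valid': True}

        def loop():
            while True:
                started = time.time()
                state['valid'] = check()
                state['delay'] = time.time() - started     # how long the storage took to answer
                state['lastCheck'] = time.time()           # engine compares "now - lastCheck" to its threshold
                time.sleep(interval)

        threading.Thread(target=loop, daemon=True).start()
        return state

If other host activity (for example bulk VM removal) can starve such a loop, lastCheck grows even though delay stays low, which is exactly the symptom reported here.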

Additional info:

Comment 1 Allon Mureinik 2014-12-29 09:15:26 UTC
RepoStats monitoring is a storage flow, and the research should start from our side. In the worst case, if the analysis of this bug uncovers an infra issue, we'll move it back there.

Nir - please take lead on this.

Comment 2 Nir Soffer 2015-01-04 14:34:06 UTC
Eldad, we need logs from the host that is becoming non-operational.

Please provide these logs from the machine:

/var/log/messages
/var/log/sanlock.log
/var/log/vdsm/vdsm.log

To understand host health during that time, we also need sar logs.

Let's try to reproduce this with a smaller setup:
- one host running engine
- one host running x vms
- same storage used in your full scale test

Can you reproduce it on such setup?

If you can, we will need access to such setup for investigation.

Comment 3 Tal Nisan 2015-01-13 14:52:21 UTC
Next time, please open the bug correctly with the logs attached, especially when it concerns the scale environment, which changes rapidly.

Comment 4 Eldad Marciano 2015-01-15 11:56:31 UTC
Since we have higher-priority performance issues on that environment, I have customized the lastCheck threshold on the engine side:
MaxStorageVdsTimeoutCheckSec=120

This issue may reproduce on loaded hosts.
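
For reference, and assuming the standard engine-config tool was used to apply the override mentioned above (the config key itself is taken from this comment), the change would be made roughly like this, followed by an engine restart:

    engine-config -s MaxStorageVdsTimeoutCheckSec=120
    engine-config -g MaxStorageVdsTimeoutCheckSec    # verify the new value
    service ovirt-engine restart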