948232 – vdsm storage latency timings are inaccurate

Bug 948232 - vdsm storage latency timings are inaccurate

Summary: vdsm storage latency timings are inaccurate

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.1.3
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	3.3.0
Assignee:	Federico Simoncelli
QA Contact:	Elad
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:	972700
TreeView+	depends on / blocked

Reported:	2013-04-04 11:20 UTC by Julio Entrena Perez
Modified:	2021-08-30 12:29 UTC (History)
CC List:	21 users (show)
Fixed In Version:	is2
Doc Type:	Bug Fix
Doc Text:	On some overloaded systems the Storage Domains' latency timings were reported with inaccurate values. Now VDSM reports the most accurate latency values in all cases.
Clone Of:
Clones:	972700 (view as bug list)
Environment:
Last Closed:	2014-01-21 16:06:02 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-43239	None	None	None	2021-08-30 12:29:10 UTC
Red Hat Knowledge Base (Solution)	229043	None	None	None	Never
Red Hat Product Errata	RHBA-2014:0040	normal	SHIPPED_LIVE	vdsm bug fix and enhancement update	2014-01-21 20:26:21 UTC
oVirt gerrit	15049	None	None	None	2020-09-11 19:17:39 UTC

Description Julio Entrena Perez 2013-04-04 11:20:04 UTC

Description of problem:
rhevm frequently reports "Storage domain <sd> experienced a high latency of <x> seconds..." where x > 1 second.
Separate tests running "/bin/dd iflag=direct if=/dev/<sd>/metadata of=/dev/null bs=4k count=1" every second show <1 values while rhevm shows >1 values.

Version-Release number of selected component (if applicable):
vdsm 4.10.2-1.6 and 4.10.2-1.8.

How reproducible:
Always.

Steps to Reproduce:
1. Run "/bin/dd iflag=direct if=/dev/<sd>/metadata of=/dev/null bs=4k count=1" commands on all SDs every second from all hosts in rhev cluster.
2. Wait for rhevm to report storage latency warnings with high latency values (>1 second).
  
Actual results:
rhevm reports [5-11] seconds latency.

Worst time values reported by separate dd commands running every second at the same time never report >1 seconds to complete:

Thu Mar 28 16:09:52 CET 2013
4096 bytes (4.1 kB) copied, 0.0202971 s, 202 kB/s
1+0 records in
1+0 records out


Expected results:
Latency values reported by rhevm and separate timing results for the same SDs are congruent.
Customers are not presented with unnecessary worrying warnings.
What can customer do to ensure that their storage is fine and prevent those warning messages?

Comment 21 Mark Huth 2013-05-17 05:39:33 UTC

Some false positives generated from a customer system (RHEVM UTC+10).  The message in the RHEVM ... 

 2013-04-30 21:35:05.814+10 | Storage domain s2.ha_data1 experienced a high latency of 6.30241894722 seconds from host h4.rhev

... doesn't correlate to the latencies reported by dd on the hypervisor:

Tue Apr 30 11:35:04 UTC 2013
4096 bytes (4.1 kB) copied, 0.032126 seconds, 127 kB/s
...
Tue Apr 30 11:35:09 UTC 2013
4096 bytes (4.1 kB) copied, 0.007127 seconds, 575 kB/s


And another ...

 2013-04-30 20:34:00.376+10 | Storage domain s2.ha_data1 experienced a high latency of 8.63769292831 seconds from host h4.rhev

Tue Apr 30 10:33:59 UTC 2013
4096 bytes (4.1 kB) copied, 0.007689 seconds, 533 kB/s
...
Tue Apr 30 10:34:04 UTC 2013
4096 bytes (4.1 kB) copied, 0.006882 seconds, 595 kB/s

... and there were many more examples of these inconsistencies.  Customers are opening tickets concerned about these messages and the stability of their storage, yet it seems the problem lies with how vdsm is calculating the 'latency' and not with the storage.

Comment 23 Mark Huth 2013-05-26 21:39:36 UTC

Need to consider NFS / file based storage domains as well.  Similar code is used for timing their latency:

    def getReadDelay(self):
        t = time.time()
        oop.getProcessPool(self.sdUUID).directReadLines(self.metafile)
        return time.time() - t

... which can also deliver false positives if its in fact vdsm that is experiencing the latency and not the storage.  Customers are getting latency warnings for their ISO domains (and NFS data domains) yet testing outside of vdsm has shown there was no latency.

Comment 27 Elad 2013-07-16 11:49:28 UTC

I've ran the script mentioned on comment #17.

Didn't see any event related to storage latency. repoStats reports about low latency:

Domain 9a876f39-0f53-4688-a66e-42bfb48b07e9 {'delay': '0.000637052', 'lastCheck': '5.3', 'code': 0, 'valid': True, 'version': 3}
Domain 5ac9b6a8-dade-4682-9260-236acb78fb20 {'delay': '0.000666265', 'lastCheck': '3.8', 'code': 0, 'valid': True, 'version': 3}
Domain b4d9a17e-d82c-4af7-9686-1ab2e4e2ba64 {'delay': '0.000494544', 'lastCheck': '5.1', 'code': 0, 'valid': True, 'version': 3}
Domain 45052d4e-28ab-4fb1-bb87-0b7e7141aeb4 {'delay': '0.000647863', 'lastCheck': '4.1', 'code': 0, 'valid': True, 'version': 3}
Domain a0484c19-a33e-468a-a0da-ef07dfbb8317 {'delay': '0.000632065', 'lastCheck': '3.7', 'code': 0, 'valid': True, 'version': 3}
Domain 1f40f789-178d-4881-af3e-eeb5fc6504d5 {'delay': '0.000667948', 'lastCheck': '4.6', 'code': 0, 'valid': True, 'version': 3}
Domain 017af273-41d5-479b-bb90-730f41f8793d {'delay': '0.000609018', 'lastCheck': '5.0', 'code': 0, 'valid': True, 'version': 3}



Checked on RHEVM3.3- is5:
vdsm-4.11.0-121.git082925a.el6.x86_64
rhevm-3.3.0-0.6.master.el6ev.noarch

Comment 28 Charlie 2013-11-28 00:30:37 UTC

This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 29 errata-xmlrpc 2014-01-21 16:06:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

Note You need to log in before you can comment on or make changes to this bug.

abaron
acathrow
amureini
bazulay
carlos.molina.ext
cboyle
cmosher
cpelland
dornelas
fsimonce
iheim
jentrena
jkt
lpeer
lyarwood
mhuth
mkalinin
perobins
scohen
sputhenp
yeylon