Description of problem: The CPU usage reported in the notifier logs on RHEV-M sometimes shows a value beyond the upper limit of 100%.
Version-Release number of selected component (if applicable): RHEV-M 4.0.0
How reproducible: Intermittent; it shows up occasionally under heavy load.
Actual results: 2017-03-28 11:56:27.106+02 | Used CPU of host <host_name> [2147483647%] exceeded defined threshold [90%].
Expected results: The CPU load value should not exceed 100%, and such a large number should never be reported.
Additional info: The VDSM on the host is sending the correct value, but there may be a calculation issue on the ovirt-engine side.
Hello, Please let me know if more information is needed for the description. I only have the notifier logs, which show a CPU load percentage beyond the 100% limit.
Hello, I have shared required logs internally. Thanks, Nirav Dave
(In reply to Nirav Dave from comment #4)
> Hello,
>
> I have shared required logs internally.
>
> Thanks,
> Nirav Dave
I have shared with Yaniv Kaul.
Andrej, please verify we still have the code in 4.1. I suspect there is a division by zero and negative infinity somewhere.
I have not found any calculations with cpu usage in the engine. It just stores the values and displays them, no division. Probably VDSM sometimes sent large values of 'cpuUser' or 'cpuSys'.
Hi Andrej, Thanks for the update. If VDSM is sending out-of-range values, can we add a check in the oVirt code that catches and rectifies them, for example by raising an out-of-range exception, so that we know when VDSM sends such values? Thanks, Nirav Dave
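The kind of check requested above could look roughly like the following. This is only a sketch in Python, not existing oVirt or VDSM code: the function name validate_cpu_stats and the exception CpuStatOutOfRange are hypothetical, and it assumes the statistics arrive as a mapping whose 'cpuUser', 'cpuSys' and 'cpuIdle' entries are percentages, possibly encoded as strings.

```python
class CpuStatOutOfRange(ValueError):
    """Hypothetical exception for a CPU percentage outside 0..100."""


def validate_cpu_stats(stats, keys=("cpuUser", "cpuSys", "cpuIdle")):
    """Return the selected stats as floats, raising if any is out of range.

    Assumption: each value is a percentage and may be a number or a
    numeric string.
    """
    checked = {}
    for key in keys:
        value = float(stats[key])
        # The chained comparison is also False for NaN, so NaN is rejected.
        if not 0.0 <= value <= 100.0:
            raise CpuStatOutOfRange(f"{key}={value} is outside the range 0..100")
        checked[key] = value
    return checked


if __name__ == "__main__":
    print(validate_cpu_stats({"cpuUser": "12.5", "cpuSys": "3.0", "cpuIdle": "84.5"}))
    try:
        # Illustrative bad value, far above 100%.
        validate_cpu_stats({"cpuUser": "2147483647", "cpuSys": "1.0", "cpuIdle": "0.0"})
    except CpuStatOutOfRange as err:
        print("rejected:", err)
```

Raising instead of silently clamping keeps the bad sample visible in the logs, which is the point of the request: it shows when the sender produced the invalid value.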
The vdsStats file contains just the current state of host statistics. VDSM logs with DEBUG level from the time when the bug happened would be useful to see if the VDSM really sends wrong data.
The VDSM returns incorrect values for 'cpuUser', 'cpuSys' and 'cpuIdle':
- at 2017-07-06 11:09:45,127 in vdsm.log.7:
  - cpuUser = -1259180371.05
  - cpuSys = -1259180370.17
  - cpuIdle = 2518360841.23
- at 2017-07-06 16:00:09,363 in vdsm.log.3:
  - cpuUser = 2924878764.12
  - cpuSys = 2924878764.12
  - cpuIdle = 0.00
It is probably a division by zero, which is fixed by the patch.
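To illustrate how a zero or near-zero divisor can produce values of this magnitude, here is a minimal Python sketch. It is not the VDSM code and not the actual patch: the function cpu_percent, its parameters and the min_interval cutoff are hypothetical, and it assumes the percentage is derived from a CPU tick delta divided by the elapsed time between two samples.

```python
def cpu_percent(prev_ticks, curr_ticks, prev_time, curr_time,
                ticks_per_sec=100, ncpus=1, min_interval=1e-3):
    """Return CPU usage in percent over one sampling window.

    Returns None when the window is too short to give a meaningful
    rate, instead of dividing by a zero or near-zero interval.
    All names and defaults here are assumptions for illustration.
    """
    interval = curr_time - prev_time
    if interval < min_interval:
        return None  # two samples at (almost) the same timestamp
    busy_seconds = (curr_ticks - prev_ticks) / ticks_per_sec
    percent = busy_seconds / interval / ncpus * 100.0
    # Clamp so rounding or counter jitter cannot leave the 0..100 range.
    return min(max(percent, 0.0), 100.0)


if __name__ == "__main__":
    # Normal 15 second window: 7.5 busy seconds out of 15.
    print(cpu_percent(0, 750, 0.0, 15.0))
    # Degenerate window: both samples carry the same timestamp.
    print(cpu_percent(0, 750, 10.0, 10.0))
    # Without the guard, a tiny interval inflates the result enormously.
    print((750 / 100) / 1e-9 * 100.0)
```

Returning None for an unusable window leaves it to the caller to skip the sample or reuse the previous value, rather than reporting a fabricated percentage.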
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [No relevant external trackers attached] For more info please contact: rhv-devops
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Project 'vdsm'/Component 'ovirt-engine' mismatch] For more info please contact: rhv-devops
Verified on: 4.2.0-0.5.master.el7
Steps of verification:
1. Load up the host CPU.
2. Collect the VDSM log and check the cpuUser, cpuSys and cpuIdle values.
3. See that the values are between 0 and 100.
I ran the case and checked the values in the log and in the output of vdsm-client Host getStats. The check was made with three different host_sample_stats_interval settings in VDSM: the default (15), 0 and 45. The CPU values were in the range of 0 to 100.
INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [Project 'vdsm'/Component 'ovirt-engine' mismatch] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488
BZ<2>Jira Resync