Bug 1207610 - [vdsm] errors: value of 'vcpu_period' is out of range [1000, 1000000]
Summary: [vdsm] errors: value of 'vcpu_period' is out of range [1000, 1000000]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Martin Sivák
QA Contact: Shira Maximov
URL:
Whiteboard:
Duplicates: 1191119 (view as bug list)
Depends On:
Blocks: 1213438
 
Reported: 2015-03-31 10:04 UTC by Martin Tessun
Modified: 2019-07-16 11:58 UTC
CC: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The Memory Overcommitment Manager (MOM) policy formula for CPU limits previously used fixed constants and divided them by the number of CPUs. On hosts with more than 100 CPUs the result was too low, and the value was refused by libvirt, which caused performance degradation in virtual machines. The CPU limit formulas have been improved, and as a result the CPU limits can now handle any number of CPUs.
Clone Of:
Clones: 1213438 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:34:45 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1438583 0 None None None Never
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 39411 0 master MERGED Fix the CPU quota MOM policy computations Never

Description Martin Tessun 2015-03-31 10:04:08 UTC
Description of problem:
After upgrading to vdsm-4.16.8.1-8.el6ev, the customer is experiencing severe performance degradation of his VMs once started there: the VMs show high %steal time, and vdsm continuously logs errors for each VM:

~~~
PolicyEngine::ERROR::2015-03-28 20:54:56,016::vm::5104::vm.Vm::(_reportException) vmId=`f989786c-9245-40b7-ba5d-be059364216b`::Operation failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod
    self._dom.setSchedulerParameters({'vcpu_period': int(period)})
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters
    if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)
libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]
~~~

Version-Release number of selected component (if applicable):
vdsm-4.16.8.1-8.el6ev

How reproducible:
Always, at the customer site.

Steps to Reproduce:
1. Have RHEV 3.4 with vdsm-4.14 running
2. Upgrade Hypervisors to latest release (yum update -y)
3. Start a VM on that hypervisor which has some load
4. Observe top output and /var/log/vdsm/vdsm.log on the hypervisor

Actual results:
The VM shows high %steal times (although running alone on the hypervisor):

~~~
Linux 2.6.32-504.8.1.el6.x86_64 (myhost)   03/30/2015      _x86_64_        (5 CPU)

11:21:21 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:21:23 AM     all      7.21      1.33      1.23      0.19     33.87     56.17
11:21:25 AM     all     13.10      0.83      2.31      3.23     64.02     16.51
11:21:27 AM     all     14.41      1.78      1.78      0.89     67.32     13.82
11:21:29 AM     all     12.00      1.94      1.84      1.74     57.79     24.69
11:21:31 AM     all     14.66      1.40      1.89      1.79     70.39      9.87
11:21:33 AM     all     15.53      1.98      1.29      0.00     75.17      6.03
11:21:35 AM     all     15.85      1.81      1.15      0.76     78.61      1.81
11:21:37 AM     all     14.04      1.94      1.26      1.45     66.12     15.20
11:21:39 AM     all     14.95      1.36      1.55      3.69     67.86     10.58
11:21:41 AM     all     14.00      1.19      1.19      0.99     60.77     21.85
11:21:43 AM     all     14.23      2.19      1.29      1.69     68.86     11.74
11:21:45 AM     all     16.63      1.41      1.21      0.00     80.75      0.00
Average:        all     13.86      1.59      1.50      1.38     65.84     15.83
~~~

Additionally, /var/log/vdsm/vdsm.log as well as /var/log/messages contain the following errors:

vdsm.log:
PolicyEngine::ERROR::2015-03-28 20:54:56,016::vm::5104::vm.Vm::(_reportException) vmId=`f989786c-9245-40b7-ba5d-be059364216b`::Operation failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod
    self._dom.setSchedulerParameters({'vcpu_period': int(period)})
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters
    if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)
libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]

messages:
Mar 29 10:20:32 myhost vdsm vm.Vm ERROR vmId=`2bc56270-2dfe-46bf-97f1-0399cb9390b1`::Operation failed#012Traceback (most recent call last):#012  File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod#012    self._dom.setSchedulerParameters({'vcpu_period': int(period)})#012  File "/usr/share/vdsm/virt/vm.py", line 689, in f#012    ret = attr(*args, **kwargs)#012  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper#012    ret = f(*args, **kwargs)#012  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters#012    if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)#012libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]

Expected results:
* VM should show no %steal 
* Performance of the VM should not degrade dramatically after updating to vdsm-4.16
* The errors logged in vdsm.log/messages should not occur

Additional info:
After downgrading to vdsm-4.14.18-6.el6ev and restarting the VM, no further errors were observed, and %steal returned to normal.

On top of that:
Some VMs cannot be started at all (this was also true for vdsm-4.14.18-7.el6ev), as the VM BIOS never kicks in.
They could be started again after going back to version vdsm-4.14.18-6.el6ev as well
(specs: 10*10 cores (100 vCPUs) and 386 GB of memory).

Comment 2 Doron Fediuck 2015-03-31 11:30:55 UTC
Can you please provide the mom version you're using?

Comment 4 Martin Tessun 2015-03-31 12:05:36 UTC
Hi Doron,

mom version is:

mom-0.4.1-4.el6ev.noarch

Another thing we observed:
If a VM is migrated from a B420 to a B460 blade, the steal and performance issues occur immediately.
If it is migrated back to a B420 blade, everything returns to normal.

So this also seems to be somewhat hardware-related (NUMA? QPI link?).
Migration may play a part as well, as our test VM that was started on a B460 with the old vdsm did not show the issue.

Cheers,
Martin

Comment 6 Martin Sivák 2015-03-31 13:26:13 UTC
Hi,

I see the reason for the logged errors. How many CPUs does the B460 have?

mom.log:

2015-03-29 12:47:50,082 - mom.Controllers.Cputune - INFO - CpuTune guest:sapvmnlbsrt from quota:50000 period:100000 to quota:50000 period:833 (<-- 833 is less than 1000)

This would imply that there are 120 cores (100000 / 833 ≈ 120); we never anticipated such a number of CPUs. It should not be hard to fix, though.
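The arithmetic behind that mom.log line can be sketched as follows. This is a hypothetical reconstruction, assuming the old policy simply divided the 100000 microsecond default period by the host CPU count; the constant and function names are mine, not vdsm's:

```python
# Minimal sketch of the pre-fix per-CPU period computation (names are
# hypothetical; the real logic lives in the MOM cputune policy file).
LIBVIRT_PERIOD_MIN = 1000    # libvirt's lower bound for vcpu_period (us)
DEFAULT_PERIOD = 100000      # default scheduling period (us)

def old_period(cpu_count):
    # Dividing a fixed constant by the CPU count drops below libvirt's
    # minimum once the host has more than 100 CPUs.
    return DEFAULT_PERIOD // cpu_count

print(old_period(120))                        # 833, matching the mom.log line
print(old_period(120) >= LIBVIRT_PERIOD_MIN)  # False -> libvirt rejects it
```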


But I have no idea why the performance degrades. Can you give us the full process list (with CPU and memory usage) and /proc/cpuinfo from the affected machine?

Comment 7 Martin Tessun 2015-03-31 13:46:16 UTC
Hi Martin,

the Hypervisor indeed does have 120 CPUs:

[...]
processor	: 119
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping	: 7
microcode	: 1805
cpu MHz		: 2801.000
cache size	: 38400 KB
physical id	: 3
siblings	: 30
core id		: 14
cpu cores	: 15
apicid		: 125
initial apicid	: 125
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips	: 5588.28
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:


I will attach the requested information privately.

Comment 25 Shira Maximov 2015-05-25 13:55:19 UTC
In order to reproduce the bug, perform the following actions on one of your hosts:

1. Change the HostCpu collector to always return 120 CPUs, for example:
vi /usr/lib/python2.7/site-packages/mom/Collectors/HostCpu.py ->
on line 36, change the return statement to: return { 'cpu_count': 120 }

2. Revert the fixed file (/etc/vdsm/mom.d/04-cputune.policy) to the previous version.

3. Add "debug Host.cpu_count" to the bottom of the file /etc/vdsm/mom.conf.

4. systemctl restart vdsmd.service

5. See the error in the vdsm log: value of 'vcpu_period' is out of range [1000, 1000000]

In order to verify the fix:

6. Restore the fix in the /etc/vdsm/mom.d/04-cputune.policy file.

7. Check the vdsm log to confirm the error no longer appears.
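One plausible shape for the fixed policy behavior checked in steps 6-7 is a clamp to libvirt's bounds. This is a hedged sketch only; the actual change merged via oVirt gerrit 39411 may use a different formula:

```python
# Hypothetical clamp illustrating how a fixed policy can stay inside
# libvirt's [1000, 1000000] range for any CPU count; the merged fix in
# gerrit change 39411 may differ.
LIBVIRT_PERIOD_MIN = 1000
LIBVIRT_PERIOD_MAX = 1000000
DEFAULT_PERIOD = 100000

def fixed_period(cpu_count):
    period = DEFAULT_PERIOD // cpu_count
    # Clamp so hosts with more than 100 CPUs still produce a valid value.
    return min(max(period, LIBVIRT_PERIOD_MIN), LIBVIRT_PERIOD_MAX)

for cpus in (5, 100, 120, 500):
    assert LIBVIRT_PERIOD_MIN <= fixed_period(cpus) <= LIBVIRT_PERIOD_MAX
```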

Comment 27 Ilanit Stein 2015-08-20 12:54:43 UTC
QE: A test case should be added for this bug, with a flow based on comment 25.

Comment 28 Roman Mohr 2015-11-10 15:11:07 UTC
*** Bug 1191119 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2016-03-09 19:34:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

