Description of problem:

After upgrading to vdsm-4.16.8.1-8.el6ev the customer is experiencing severe performance degradation of VMs started on the upgraded hosts. The VMs show high %steal time, and vdsm continuously logs the following error for each VM:

~~~
PolicyEngine::ERROR::2015-03-28 20:54:56,016::vm::5104::vm.Vm::(_reportException) vmId=`f989786c-9245-40b7-ba5d-be059364216b`::Operation failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod
    self._dom.setSchedulerParameters({'vcpu_period': int(period)})
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters
    if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)
libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]
~~~

Version-Release number of selected component (if applicable):
vdsm-4.16.8.1-8.el6ev

How reproducible:
Always at the customer site.

Steps to Reproduce:
1. Have RHEV 3.4 with vdsm-4.14 running
2. Upgrade the hypervisors to the latest release (yum update -y)
3. Start a VM that has some load on that hypervisor
4. Examine top and /var/log/vdsm/vdsm.log on the hypervisor

Actual results:

The VM shows high %steal times, although it is running alone on the hypervisor:

~~~
Linux 2.6.32-504.8.1.el6.x86_64 (myhost)   03/30/2015   _x86_64_   (5 CPU)

11:21:21 AM  CPU   %user   %nice %system %iowait  %steal   %idle
11:21:23 AM  all    7.21    1.33    1.23    0.19   33.87   56.17
11:21:25 AM  all   13.10    0.83    2.31    3.23   64.02   16.51
11:21:27 AM  all   14.41    1.78    1.78    0.89   67.32   13.82
11:21:29 AM  all   12.00    1.94    1.84    1.74   57.79   24.69
11:21:31 AM  all   14.66    1.40    1.89    1.79   70.39    9.87
11:21:33 AM  all   15.53    1.98    1.29    0.00   75.17    6.03
11:21:35 AM  all   15.85    1.81    1.15    0.76   78.61    1.81
11:21:37 AM  all   14.04    1.94    1.26    1.45   66.12   15.20
11:21:39 AM  all   14.95    1.36    1.55    3.69   67.86   10.58
11:21:41 AM  all   14.00    1.19    1.19    0.99   60.77   21.85
11:21:43 AM  all   14.23    2.19    1.29    1.69   68.86   11.74
11:21:45 AM  all   16.63    1.41    1.21    0.00   80.75    0.00
Average:     all   13.86    1.59    1.50    1.38   65.84   15.83
~~~

Additionally, /var/log/vdsm/vdsm.log as well as /var/log/messages shows the following errors:

vdsm.log:
~~~
PolicyEngine::ERROR::2015-03-28 20:54:56,016::vm::5104::vm.Vm::(_reportException) vmId=`f989786c-9245-40b7-ba5d-be059364216b`::Operation failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod
    self._dom.setSchedulerParameters({'vcpu_period': int(period)})
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters
    if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)
libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]
~~~

messages:
~~~
Mar 29 10:20:32 myhost vdsm vm.Vm ERROR vmId=`2bc56270-2dfe-46bf-97f1-0399cb9390b1`::Operation failed#012Traceback (most recent call last):#012 File "/usr/share/vdsm/virt/vm.py", line 5080, in setCpuTunePeriod#012 self._dom.setSchedulerParameters({'vcpu_period': int(period)})#012 File "/usr/share/vdsm/virt/vm.py", line 689, in f#012 ret = attr(*args, **kwargs)#012 File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper#012 ret = f(*args, **kwargs)#012 File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2039, in setSchedulerParameters#012 if ret == -1: raise libvirtError ('virDomainSetSchedulerParameters() failed', dom=self)#012libvirtError: invalid argument: value of 'vcpu_period' is out of range [1000, 1000000]
~~~

Expected results:
* The VM should show no %steal
* VM performance should not degrade dramatically after updating to vdsm-4.16
* The errors logged in vdsm.log/messages should not occur

Additional info:

After downgrading to vdsm-4.14.18-6.el6ev and restarting the VM, no further errors were observed and %steal was back to normal.

On top of that, some VMs cannot even be started (this was also true for vdsm-4.14.18-7.el6ev), as the VM BIOS never kicks in. They could be started again after going back to vdsm-4.14.18-6.el6ev as well (specs: 10*10 cores (100 vCPUs) and 386 GB of memory).
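For reference, the failing operation reduces to a single libvirt scheduler-parameter call. Below is a minimal, hypothetical reproduction sketch (not taken from the customer environment), assuming the python libvirt bindings, a local qemu connection, and a placeholder domain name 'example-vm'; passing a period below 1000 raises the same libvirtError seen in vdsm.log:

~~~
# Hypothetical stand-alone reproduction sketch; 'example-vm' is a placeholder
# domain name and the connection URI is an assumption.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')

# libvirt only accepts vcpu_period values in the range [1000, 1000000] us;
# 833 is below the minimum, so this call raises:
#   libvirtError: invalid argument: value of 'vcpu_period' is out of range
dom.setSchedulerParameters({'vcpu_period': 833})
~~~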
Can you please provide the mom version you're using?
Hi Doron,

the mom version is: mom-0.4.1-4.el6ev.noarch

One further observation: if a VM is migrated from a B420 to a B460 blade, the steal and performance issues occur immediately. If it is migrated back to a B420 blade, everything returns to normal. So this also seems to be somewhat hardware-related (NUMA? QPI link?). Migration may also play a part, as our test VM that was started on a B460 with the old vdsm did not show the issue.

Cheers, Martin
Hi,

I see the reason for the logged errors. How many CPUs does the B460 have?

mom.log:
~~~
2015-03-29 12:47:50,082 - mom.Controllers.Cputune - INFO - CpuTune guest:sapvmnl bsrt from quota:50000 period:100000 to quota:50000 period:833
~~~

The resulting period of 833 is below the allowed minimum of 1000, and since 100000 / 120 ≈ 833, this would imply that the host has 120 cores; we never anticipated that many CPUs. It should not be hard to fix, though. But I have no idea why the performance degrades. Can you give us the full process list (with CPU and memory usage) and /proc/cpuinfo from the affected machine?
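The numbers in the mom.log line above follow from dividing the period by the host CPU count. Below is a minimal sketch of that arithmetic, together with the clamping a fix would need; the function name and the exact scaling rule are illustrative assumptions, not mom's actual cputune policy code:

~~~
# Sketch of the suspected arithmetic; scale_period() and the scaling rule
# are assumptions for illustration, not mom's actual cputune policy code.

LIBVIRT_MIN_PERIOD = 1000      # microseconds, lower bound enforced by libvirt
LIBVIRT_MAX_PERIOD = 1000000   # microseconds, upper bound enforced by libvirt

def scale_period(period, host_cpu_count):
    scaled = period // host_cpu_count
    # Without clamping, a 120-CPU host pushes the value below libvirt's minimum.
    return min(max(scaled, LIBVIRT_MIN_PERIOD), LIBVIRT_MAX_PERIOD)

print(100000 // 120)              # 833  -> rejected by libvirt
print(scale_period(100000, 120))  # 1000 -> accepted
~~~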
Hi Martin,

the Hypervisor indeed does have 120 CPUs:

~~~
[...]
processor       : 119
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping        : 7
microcode       : 1805
cpu MHz         : 2801.000
cache size      : 38400 KB
physical id     : 3
siblings        : 30
core id         : 14
cpu cores       : 15
apicid          : 125
initial apicid  : 125
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 5588.28
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
~~~

I will attach the requested information privately.
In order to reproduce the bug, perform the following actions on one of your hosts:

1. Change the HostCpu collector to always return 120 CPUs, for example: edit /usr/lib/python2.7/site-packages/mom/Collectors/HostCpu.py and in line 36 change the return statement to: return { 'cpu_count': 120 } (a sketch of this edit is shown after the list)
2. Revert the fixed file (/etc/vdsm/mom.d/04-cputune.policy) to its previous version
3. Add "debug Host.cpu_count" to the bottom of the file /etc/vdsm/mom.conf
4. systemctl restart vdsmd.service
5. Check that the vdsm log shows the error: value of 'vcpu_period' is out of range [1000, 1000000]

In order to verify the fix:

6. Restore the fix in the /etc/vdsm/mom.d/04-cputune.policy file
7. Check the vdsm log to confirm that the error no longer appears
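For step 1, the edited collector would look roughly as follows. Only the quoted return statement comes from the steps above; the class and method scaffolding around it is an assumption made for illustration:

~~~
# Sketch of the HostCpu collector edit from step 1 (scaffolding is assumed).
class HostCpu(object):
    def collect(self):
        # Fake a 120-CPU host so that mom computes a vcpu_period of
        # 100000 / 120 = 833, which is below libvirt's minimum of 1000
        # and triggers the error logged by vdsm.
        return {'cpu_count': 120}
~~~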
QE: A test case should be added for this bug, with a flow based on comment 25.
*** Bug 1191119 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html