Bug 846629

Summary: Failed to run cpu-stats when cpuacct.usage_percpu is too large
Product: Red Hat Enterprise Linux 6
Reporter: hongming <honzhang>
Component: libvirt
Assignee: Gunannan Ren <gren>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 6.4
CC: acathrow, ajia, dallan, dyasny, dyuan, gsun, mzhan, rwu, veillard
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-0.10.1-1.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 07:21:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description hongming 2012-08-08 09:55:50 UTC
Description of problem:
One guest has been used in many NUMA and CPU test cases. Running cpu-stats against this guest now fails.


Version-Release number of selected component (if applicable):
libvirt-0.10.0-0rc0.el6.x86_64
qemu-kvm-0.12.1.2-2.295.el6.x86_64
numad-0.5-4.20120522git.el6.x86_64
kernel-2.6.32-279.el6.x86_64


How reproducible:
100% 

Steps to Reproduce:
1. # virsh start rhel6q
Domain rhel6q started

2. # virsh cpu-stats rhel6q
error: Failed to virDomainGetCPUStats()
error: Failed to read file '/cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu': Value too large for defined data type 

3. Check the libvirtd log
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:536 : Make group /libvirt/qemu/rhel6q
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/cpu/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/cpuacct/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/cpuset/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/memory/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/devices/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/freezer/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupMakeGroup:560 : Make controller /cgroup/blkio/libvirt/qemu/rhel6q/
2012-08-08 09:15:56.832+0000: 24687: debug : virFileClose:72 : Closed fd 20
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupGetValueStr:362 : Get value /cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu
2012-08-08 09:15:56.832+0000: 24687: debug : virFileClose:72 : Closed fd 20
2012-08-08 09:15:56.832+0000: 24687: error : virFileReadAll:463 : Failed to read file '/cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu': Value too large for defined data type
2012-08-08 09:15:56.832+0000: 24687: debug : virCgroupGetValueStr:367 : Failed to read /cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu: Value too large for defined data type 

  
Actual results:
cpu-stats fails for the guest when the content of cpuacct.usage_percpu is too large.

Expected results:
cpu-stats runs successfully and prints the per-CPU statistics.


Additional info:

cat /cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu

2486818959 1912121459 85760552 781624 13444802 0 0 0 0 4815900 419488931 241610 0 0 29899950 0 0 0 0 686722197 57412910 307028651 420070 0 0 1306018802 0 0 0 239432501 109698572490 95352118770 51729442620 45542332309 47522451978 4941761764 5867551155 832272028 4774174844 15639757557 152710856868 86976663095 81620678693 25106119214 24268156890 20135020718 15526880286 40411931401 17364392060 14777451791 49405110 202750 891540 62849361 1943656962 1326511 0 0 24332571 820985 950086643 31248278 28766240 26961345 304343 1420865204 0 0 3809170 604926 6298724023 2219617927 47247943 2346243 0 0 0 0 1505529 4491270 1942296286 735835858 106087408 0 0 26616471 0 0 0 0 66458403 43862547 0 58040 2474334439 0 0 0 0 280484306 0 78517 0 0 0 0 620096566 0 0 1132089497 92189821771 61617872584 39870931229 58293696860 38339237738 1636019325 10369540325 1954827764 5705294056 19905372912 117990468690 60711130043 62953993043 27414059468 16630746545 16884891077 28410240087 38192261303 27412870255 37651629649 4350185 5715377 36408278 0 13774426 201851 2758717 0 0 99137534 1551736744 33309528 0 0 0 1873391949 0 0 194244567 56196854 148904490 1615419846 0 1714097 0 0 0 0 126262169 2354398

Comment 2 Gunannan Ren 2012-08-31 09:33:12 UTC
The problem is caused by a 1024-byte limit on the length of the cpuacct.usage_percpu content that libvirt reads.
A patch has been sent upstream to raise the limit from 1024 bytes to 1024*1024 bytes (1 MiB).

Two other bugs are fixed together with this one.
One is a typo in the vcpu time computation that causes wrong data to be returned to the user.

The other crashes libvirtd on machines with a large number of CPUs.

All three patches were sent together.
https://www.redhat.com/archives/libvir-list/2012-August/msg01999.html
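
The size-limit failure described in this comment can be illustrated with a minimal Python sketch. It is not the libvirt code (which is C); read_all_limited and sample_usage_percpu are hypothetical helpers, and the sample content is synthetic, assuming 160 CPUs with 12-digit nanosecond counters. The sketch only shows why a whole-file read capped at 1024 bytes fails with EOVERFLOW on such content while a 1 MiB cap succeeds.

```python
import errno
import os
import tempfile


def read_all_limited(path, max_len):
    """Read a whole file, failing with EOVERFLOW if it is longer than max_len bytes.

    Hypothetical helper that mimics a size-capped whole-file read.
    """
    with open(path, "rb") as f:
        # Read one byte more than allowed so an oversized file can be detected.
        data = f.read(max_len + 1)
    if len(data) > max_len:
        raise OSError(errno.EOVERFLOW, os.strerror(errno.EOVERFLOW), path)
    return data.decode("ascii")


def sample_usage_percpu(ncpus=160, value_ns=238544577786):
    """Build synthetic cpuacct.usage_percpu-style content (assumed layout:
    one space-separated nanosecond counter per CPU)."""
    return " ".join(str(value_ns) for _ in range(ncpus)) + " \n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cpuacct.usage_percpu")
        with open(path, "w") as f:
            f.write(sample_usage_percpu())

        try:
            read_all_limited(path, 1024)           # old limit
        except OSError as e:
            print("1024-byte limit failed:", e.strerror)

        text = read_all_limited(path, 1024 * 1024)  # raised limit
        print("1 MiB limit read", len(text), "bytes")
```

With many CPUs and long-running guests, each counter takes a dozen or so characters, so the content quickly passes 1024 bytes; raising the cap leaves ample headroom without making the read unbounded.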

Comment 3 Gunannan Ren 2012-08-31 10:15:38 UTC
commit fccab89def6dd13b895d8a6578573f8abc50401a
Author: Guannan Ren <gren>
Date:   Fri Aug 31 16:45:02 2012 +0800

    cgroup: fix libvirtd crash caused by messed memory
    
    The variable max_id is initialized again in the step of
    getting cpu mapping variable map2. But in the next for loop
    we still expect original value of max_id, the bug will
    crash libvirtd when using on NUMA machine with big number
    of cpus.

commit 657fef1401cce0227263e05488b0769765467b73
Author: Guannan Ren <gren>
Date:   Fri Aug 31 16:40:10 2012 +0800

    cgroup: fix a typo on extracting data from vcpu cgroup

commit c402eebc71770390f8ce1b400dfe19cbfca30ec7
Author: Guannan Ren <gren>
Date:   Fri Aug 31 16:31:30 2012 +0800

    cgroup: read more data from cgroup cpuacct.usage_percpu
    
    On NUMA machine, the length of string got from file
    cpuacct.usage_percpu is quite large, so expand the
    limit of 1024 bytes.
    
    errors like:
    Failed to read file \
    '/cgroup/cpuacct/libvirt/qemu/rhel6q/cpuacct.usage_percpu': \
    Value too large for defined data type

Comment 5 hongming 2012-09-03 05:26:33 UTC
Verified with libvirt-0.10.1-1.el6.x86_64. The result is as expected, so the status is moved to VERIFIED.

Steps:
1. # virsh list --all
 Id    Name                           State
----------------------------------------------------
 11    yuping-rhel6                   running


2. # cat /cgroup/cpuacct/libvirt/qemu/yuping-rhel6/cpuacct.usage_percpu
238544577786 274016617989 10127353650 5380803841 5857882905 34768367605 1985423870 8215296502 1985286406 4550374840 382835539298 494488743798 22028369208 27285303938 56094089795 6380656666 249733255 12819802429 22740655422 889219505 318782648028 430099293161 13359140860 1111078876 18111098442 19333098 2893444625 9251959110 7706878808 5325667624 190529472225 204588922977 16661409319 4599932703 271347897 4484649431 23233040076 16555409710 2513900215 16671895745 147972133550 149054657670 2932338805 5673188561 889922714 5424693354 429743551 1372265314 1378928684 368952280 328321426926 466287857800 18784468995 4239019802 5187730078 13780543384 25138747328 34579277125 40054063352 38225108268 608154488267 1124362214555 44546814885 40200961462 14826867741 53424342945 15427049642 4493215476 9714220345 11738679798 312184687315 611050564951 6658312013 30289658816 9554713793 3617576486 52435594847 2519169386 17875590464 70427782543 410341291959 163082983854 9595445315 10265401811 23088404845 44392156884 2235074881 14618042289 7415925689 10449398761 329384522262 292005096736 26686569148 13284051197 68892953591 5459003 275145246 17707013423 11720812716 1377318343 334107293282 385989917630 28740704397 758172428 12313925318 306924728 8751778569 28626181814 27941147001 8104917434 214526366820 115969110690 17182508299 301406170 8687565 1306612636 39278113578 19976071109 762721304 11274297540 214190883820 94067466593 8595804924 6616390017 3200659804 8740161704 24991658 340670902 2118074350 33876042 362377150181 192902760594 21963764118 288944511 1699093200 5186701435 7741224387 26180368204 19614309666 19062676151 557161441868 510432992887 56537360144 57883469754 40941347452 73751515382 13631777591 4067253143 10733066326 1336647363 297800480173 366933747005 7420293543 24942087941 14093437677 6296071634 41263056022 850829666 4836711838 69957681740 



3. # for i in {1..10}; do virsh cpu-stats yuping-rhel6; done

CPU0:
	cpu_time           238.544577786 seconds
	vcpu_time           58.180739601 seconds
CPU1:
	cpu_time           274.016617989 seconds
	vcpu_time            6.962152684 seconds

--- snip ---

CPU158:
	cpu_time             4.836711838 seconds
	vcpu_time            0.356833735 seconds
CPU159:
	cpu_time            69.957681740 seconds
	vcpu_time            0.619906074 seconds
Total:
	cpu_time         13096.575252224 seconds
	user_time         3427.310000000 seconds
	system_time       7829.200000000 seconds
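
The per-CPU cpu_time values above correspond to the nanosecond counters in cpuacct.usage_percpu (for example, 238544577786 is shown as 238.544577786 seconds for CPU0). A minimal Python sketch of that conversion follows; parse_usage_percpu and ns_to_seconds are hypothetical helper names, not virsh or libvirt functions, and the sample string holds only the first three counters from step 2.

```python
def parse_usage_percpu(text):
    """Split cpuacct.usage_percpu content into a list of per-CPU nanosecond counters."""
    return [int(token) for token in text.split()]


def ns_to_seconds(ns):
    """Format a nanosecond counter as seconds with nine decimal places.

    Integer arithmetic is used so that no precision is lost to floating point.
    """
    secs, frac = divmod(ns, 1_000_000_000)
    return f"{secs}.{frac:09d}"


if __name__ == "__main__":
    sample = "238544577786 274016617989 10127353650 "
    for cpu, ns in enumerate(parse_usage_percpu(sample)):
        print(f"CPU{cpu}: cpu_time {ns_to_seconds(ns)} seconds")
```

Comparing a few converted counters against the cpu-stats output is a quick way to confirm that the whole file was read and parsed, including the last CPU.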

Comment 6 errata-xmlrpc 2013-02-21 07:21:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html