Bug 1264003

Summary: CPU which goes offline while vdsm is running breaks getVdsStats
Product: [oVirt] vdsm Reporter: Roman Mohr <rmohr>
Component: CoreAssignee: Roman Mohr <rmohr>
Status: CLOSED CURRENTRELEASE QA Contact: Shira Maximov <mshira>
Severity: medium Docs Contact:
Priority: medium    
Version: ---CC: bugs, dfediuck, mgoldboi, michal.skrivanek, pkliczew, rgolan, rmohr, sbonazzo
Target Milestone: ovirt-3.6.6Keywords: EasyFix
Target Release: 4.17.27Flags: rule-engine: ovirt-3.6.z+
mgoldboi: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-30 10:53:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Log with failing getVdsStats
none
Log with failing getVdsStats
none
Effects on libvirtd and vdsm when cpus go offline and online none

Description Roman Mohr 2015-09-17 09:50:12 UTC
Description of problem:
When a cpu goes online or offline while vdsm is running 'getVdsStats' crashes with an exception. The reason is, that vdsm still tries to collect samples for this cpu.

Version-Release number of selected component (if applicable):

How reproducible:

Start vdsmd, disable a CPU and try to query for the stats.

Steps to Reproduce:
1. service vdsmd restart
2. echo 0 > /sys/devices/system/cpu/cpu2/online
3. repeatedly run vdsClient -s 0 getVdsStats until vdsm thinks that it is running long enough that it can also provide you some cpu sampling data

Actual results:
getVdsStats is broken and can't recover until the cpu comes online again or you manually restart vdsmd

Expected results:
Skip offline cpus when collecting cpu samples

Additional info:
The exception:

Thread-6189::ERROR::2015-09-16 14:48:58,863::__init__::527::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 522, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 279, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function: <bound method Global.getStats of <API.Global object at 0x7fa45b90d350>> with arguments: () error: 'NoneType' object has no attribute '__getitem__'

Comment 1 Piotr Kliczewski 2015-09-17 09:59:17 UTC
Please attach vdsm log.

Comment 2 Roman Mohr 2015-09-17 10:18:22 UTC
Created attachment 1074387 [details]
Log with failing getVdsStats

Comment 3 Piotr Kliczewski 2015-09-17 10:41:04 UTC
The attached log ends at Thread-735::DEBUG::2015-09-15 10:05:29,601 and there is no getStats issue. Please attach the log from 2015-09-16.

Comment 4 Roman Mohr 2015-09-17 10:43:32 UTC
Created attachment 1074393 [details]
Log with failing getVdsStats

Uploaded the wrong log file

Comment 5 Roman Mohr 2015-09-17 10:45:41 UTC
(In reply to Piotr Kliczewski from comment #3)
> The attached log ends at Thread-735::DEBUG::2015-09-15 10:05:29,601 and
> there is no getStats issue. Please attach the log from 2015-09-16.
Uplaoded the wrong log file before.

I don't have the original vdsm.log. This is a fresh log from today with the relevant exceptions.

Comment 6 Piotr Kliczewski 2015-09-17 11:08:38 UTC
Ok. I can see that the error is a bit different than in BZ description. Actual issue is:

Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 1260, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 535, in getStats
    return api.getStats()
  File "/usr/share/vdsm/API.py", line 1359, in getStats
    decStats = self._cif._hostStats.get()
  File "/usr/share/vdsm/virt/sampling.py", line 662, in get
    first_sample, last_sample)
  File "/usr/share/vdsm/virt/sampling.py", line 696, in _get_cpu_core_stats
    'cpuUser': compute_cpu_usage(cpu_core, 'user'),
  File "/usr/share/vdsm/virt/sampling.py", line 685, in compute_cpu_usage
    last_sample.cpuCores.getCoreSample(cpu_core)[mode] -
TypeError: 'NoneType' object has no attribute '__getitem__'

Comment 7 Roman Mohr 2015-09-17 11:15:15 UTC
(In reply to Piotr Kliczewski from comment #6)
> Ok. I can see that the error is a bit different than in BZ description.
> Actual issue is:
> 
> Traceback (most recent call last):
>   File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 1260, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 535, in getStats
>     return api.getStats()
>   File "/usr/share/vdsm/API.py", line 1359, in getStats
>     decStats = self._cif._hostStats.get()
>   File "/usr/share/vdsm/virt/sampling.py", line 662, in get
>     first_sample, last_sample)
>   File "/usr/share/vdsm/virt/sampling.py", line 696, in _get_cpu_core_stats
>     'cpuUser': compute_cpu_usage(cpu_core, 'user'),
>   File "/usr/share/vdsm/virt/sampling.py", line 685, in compute_cpu_usage
>     last_sample.cpuCores.getCoreSample(cpu_core)[mode] -
> TypeError: 'NoneType' object has no attribute '__getitem__'

Oh yes you are right. The reason for that is, that I installed a newer master build to verify if it is still an issue on the latest commit because there was some refactoring going on. So the exception in the log looks a little bit different than the original one because the code lines which cause the issue where moved somewhere else.

Comment 8 Roman Mohr 2015-09-18 11:40:15 UTC
Created attachment 1074803 [details]
Effects on libvirtd and vdsm when cpus go offline and online

Haven been digging deeper and the attached file reveals that the problem already starts in libvirt:

A) Scenario A in the file shows that libvirt detects when cores go offline while libvrirtd is running and removes siblings from the cpu, but keeps the empty cpu tag in the topology:

Both cpus online:
>  <cpus num='2'>
>    <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>    <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
>  </cpus>

One cpu disabled:
>  <cpus num='2'>
>    <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>    <cpu id='1'/> # siblings are gone but cpu entry stays
>  </cpus>

This is why the exception in getVdsStats occurs. Because it still sees the cpu index in the topology and tries to collect samples.

B) Scenario B in the file shows that libvirt never shows cpus which were already offline before libvirtd was startd:

With disabled cpu 1 when libvirtd started:
>  <cpus num='2'>
>    <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>  </cpus>

Enable cpu 1 while libvirtd is running has no effect
>  <cpus num='2'>
>    <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>  </cpus>

In this case getVdsStats does not crash and only shows samples for cpus which were online when libvird was started

Comment 9 Michal Skrivanek 2015-10-01 14:47:40 UTC
for B) you mean you've just restarted libvirt and then you see different xml?

that looks more like a libvirt issue to me
what platform is this?

Comment 10 Roman Mohr 2015-10-01 15:16:09 UTC
(In reply to Michal Skrivanek from comment #9)
> for B) you mean you've just restarted libvirt and then you see different xml?
> 
> that looks more like a libvirt issue to me
> what platform is this?

I think it is the correct behaviour from the libvirt perspective. If a cpu was never there, they have no information about the CPU and so they can't display anything about it. Then you really have numa nodes without "cpus" (note that these CPUs are not the host cpus, they are host threads). When the cpu is hot plugged later on, as far as I understand it, they can't make use of the new cpu because of some cgroup related stuff. So they continue not showing it.

It is another case when the cpu was online. then they already had some topology information and knew how much "cpus" were assigned to that numa node and they just remove the mapping to the host cpus thread (the socket_id='0' core_id='0' siblings='0' stuff). That is how they display suddenly disabled cpus for a long time. When the cpu becomes available again libvirt can start using it again.

Comment 11 Michal Skrivanek 2015-10-05 12:58:33 UTC
NUMA code is not using the getNumaTopology properly, it should skip over offline CPUs in _get_cpu_core_stats()

Comment 12 Roman Mohr 2015-12-21 09:49:06 UTC
I think this should go to 3.6.z since people are seeing this in production too.

Well at least one: https://plus.google.com/106942130198472400333/posts/FKwUf5FBnDq?_utm_source=1-2-2.

Comment 13 Red Hat Bugzilla Rules Engine 2015-12-21 09:49:10 UTC
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset.
Please set the correct milestone or drop the z stream flag.

Comment 14 Mike McCune 2016-03-28 23:29:58 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 15 Michal Skrivanek 2016-04-15 10:30:31 UTC
make the bot happy

Comment 16 Shira Maximov 2016-05-10 07:37:47 UTC
verified on: Red Hat Enterprise Virtualization Manager Version: 3.6.6-0.1.el6

verification steps: 
1. service vdsmd restart
2. echo 0 > /sys/devices/system/cpu/cpu2/online
3. repeatedly run vdsClient -s 0 getVdsStats until vdsm thinks that it is running long enough that it can also provide you some cpu sampling data - 
There is no errors, the vdsClient present an information without the cpu until there is more information.