Description of problem:
When a CPU goes online or offline while vdsm is running, 'getVdsStats' crashes with an exception. The reason is that vdsm still tries to collect samples for this CPU.

Version-Release number of selected component (if applicable):

How reproducible:
Start vdsmd, disable a CPU, and query for the stats.

Steps to Reproduce:
1. service vdsmd restart
2. echo 0 > /sys/devices/system/cpu/cpu2/online
3. Repeatedly run "vdsClient -s 0 getVdsStats" until vdsm thinks it has been running long enough to also provide some CPU sampling data.

Actual results:
getVdsStats is broken and cannot recover until the CPU comes online again or vdsmd is restarted manually.

Expected results:
Offline CPUs are skipped when collecting CPU samples.

Additional info:
The exception:

Thread-6189::ERROR::2015-09-16 14:48:58,863::__init__::527::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 522, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 279, in _dynamicMethod
    raise InvalidCall(fn, methodArgs, e)
InvalidCall: Attempt to call function: <bound method Global.getStats of <API.Global object at 0x7fa45b90d350>> with arguments: () error: 'NoneType' object has no attribute '__getitem__'
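For reference, the kernel exposes the set of online CPUs as a range string in /sys/devices/system/cpu/online (e.g. "0-2,4"). A minimal sketch of reading and parsing that mask, which a sampler could consult to skip offline CPUs — helper names here are mine, not vdsm's:

```python
def parse_cpu_set(mask):
    """Parse a Linux cpu list string like '0-2,4' into a set of ints."""
    cpus = set()
    for part in mask.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def online_cpus(path='/sys/devices/system/cpu/online'):
    """Return the set of currently online cpu indices."""
    with open(path) as f:
        return parse_cpu_set(f.read())
```

A core listed in the topology but absent from online_cpus() would then be skipped during sampling.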
Please attach vdsm log.
Created attachment 1074387 [details] Log with failing getVdsStats
The attached log ends at Thread-735::DEBUG::2015-09-15 10:05:29,601 and there is no getStats issue. Please attach the log from 2015-09-16.
Created attachment 1074393 [details] Log with failing getVdsStats Uploaded the wrong log file
(In reply to Piotr Kliczewski from comment #3)
> The attached log ends at Thread-735::DEBUG::2015-09-15 10:05:29,601 and
> there is no getStats issue. Please attach the log from 2015-09-16.

Uploaded the wrong log file before. I don't have the original vdsm.log. This is a fresh log from today with the relevant exceptions.
Ok. I can see that the error is a bit different from the one in the BZ description. The actual issue is:

Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 1260, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 535, in getStats
    return api.getStats()
  File "/usr/share/vdsm/API.py", line 1359, in getStats
    decStats = self._cif._hostStats.get()
  File "/usr/share/vdsm/virt/sampling.py", line 662, in get
    first_sample, last_sample)
  File "/usr/share/vdsm/virt/sampling.py", line 696, in _get_cpu_core_stats
    'cpuUser': compute_cpu_usage(cpu_core, 'user'),
  File "/usr/share/vdsm/virt/sampling.py", line 685, in compute_cpu_usage
    last_sample.cpuCores.getCoreSample(cpu_core)[mode] -
TypeError: 'NoneType' object has no attribute '__getitem__'
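The traceback shows getCoreSample() returning None for the offline core, so the subsequent [mode] indexing raises TypeError. A defensive sketch of how such a helper could skip missing samples instead of crashing — this is an illustration with simplified signatures, not the actual vdsm patch:

```python
def compute_cpu_usage(first_sample, last_sample, cpu_core, mode):
    """Return the usage delta for one core, or None if either sample
    is missing (e.g. the core went offline between the two samples)."""
    first = first_sample.getCoreSample(cpu_core)
    last = last_sample.getCoreSample(cpu_core)
    if first is None or last is None:
        return None  # skip this core instead of raising TypeError
    return last[mode] - first[mode]
```

A caller building the per-core stats dict would then omit cores for which this returns None.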
(In reply to Piotr Kliczewski from comment #6)
> Ok. I can see that the error is a bit different than in BZ description.
> Actual issue is:
>
> Traceback (most recent call last):
>   File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 1260, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/rpc/bindingxmlrpc.py", line 535, in getStats
>     return api.getStats()
>   File "/usr/share/vdsm/API.py", line 1359, in getStats
>     decStats = self._cif._hostStats.get()
>   File "/usr/share/vdsm/virt/sampling.py", line 662, in get
>     first_sample, last_sample)
>   File "/usr/share/vdsm/virt/sampling.py", line 696, in _get_cpu_core_stats
>     'cpuUser': compute_cpu_usage(cpu_core, 'user'),
>   File "/usr/share/vdsm/virt/sampling.py", line 685, in compute_cpu_usage
>     last_sample.cpuCores.getCoreSample(cpu_core)[mode] -
> TypeError: 'NoneType' object has no attribute '__getitem__'

Oh yes, you are right. The reason is that I installed a newer master build to verify whether this is still an issue on the latest commit, because there was some refactoring going on. So the exception in the log looks a little different from the original one, because the code lines which cause the issue were moved elsewhere.
Created attachment 1074803 [details]
Effects on libvirtd and vdsm when cpus go offline and online

Have been digging deeper, and the attached file reveals that the problem already starts in libvirt:

A) Scenario A in the file shows that libvirt detects when cores go offline while libvirtd is running and removes the siblings from the cpu, but keeps the empty cpu tag in the topology:

Both cpus online:
> <cpus num='2'>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>   <cpu id='1' socket_id='1' core_id='0' siblings='1'/>
> </cpus>

One cpu disabled:
> <cpus num='2'>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
>   <cpu id='1'/>                <!-- siblings are gone but the cpu entry stays -->
> </cpus>

This is why the exception in getVdsStats occurs: vdsm still sees the cpu index in the topology and tries to collect samples for it.

B) Scenario B in the file shows that libvirt never shows cpus which were already offline before libvirtd was started:

With cpu 1 disabled when libvirtd started:
> <cpus num='2'>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
> </cpus>

Enabling cpu 1 while libvirtd is running has no effect:
> <cpus num='2'>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
> </cpus>

In this case getVdsStats does not crash and only shows samples for cpus which were online when libvirtd was started.
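For illustration, the <cpus> element shown above can be filtered down to the cores that still carry topology attributes. This sketch assumes, per scenario A, that an offline core is exactly one whose <cpu> element has lost its socket_id/core_id/siblings attributes:

```python
import xml.etree.ElementTree as ET

# Topology fragment as libvirt reports it after one cpu went offline
# (scenario A from the attachment).
CAPS = """
<cpus num='2'>
  <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
  <cpu id='1'/>
</cpus>
"""

def online_cpu_ids(cpus_xml):
    """Return ids of <cpu> elements that still carry topology attributes;
    libvirt strips socket_id/core_id/siblings when a cpu goes offline."""
    root = ET.fromstring(cpus_xml)
    return [int(c.get('id')) for c in root.iter('cpu')
            if c.get('socket_id') is not None]
```

Filtering the topology this way would make the "suddenly disabled" cpu disappear from the sampling loop just like a cpu that was never seen.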
for B) you mean you've just restarted libvirt and then you see different xml?

that looks more like a libvirt issue to me.
what platform is this?
(In reply to Michal Skrivanek from comment #9)
> for B) you mean you've just restarted libvirt and then you see different xml?
>
> that looks more like a libvirt issue to me
> what platform is this?

I think it is the correct behaviour from the libvirt perspective. If a cpu was never there, libvirt has no information about it and so cannot display anything about it. You then really have NUMA nodes without "cpus" (note that these "cpus" are not the host cpus, they are host threads). When the cpu is hot-plugged later on, as far as I understand it, libvirt cannot make use of the new cpu because of some cgroup-related constraints, so it continues not showing it.

It is a different case when the cpu was online: libvirt already had the topology information and knew how many "cpus" were assigned to that NUMA node, so it just removes the mapping to the host cpu thread (the socket_id='0' core_id='0' siblings='0' part). That is how it can display suddenly disabled cpus for a long time. When the cpu becomes available again, libvirt can start using it again.
The NUMA code is not using getNumaTopology properly; it should skip over offline CPUs in _get_cpu_core_stats().
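A sketch of what that skipping could look like in a _get_cpu_core_stats-style loop. The function and parameter names here are hypothetical, and the per-core stat keys are just examples, not vdsm's actual schema:

```python
def get_cpu_core_stats(topology_cpus, online, sample_for):
    """Collect per-core stats only for cores that are both in the libvirt
    topology and currently online; offline cores are silently skipped."""
    stats = {}
    for core in topology_cpus:
        if core not in online:
            continue  # core is offline per the kernel's online mask
        sample = sample_for(core)
        if sample is None:
            continue  # core vanished between checks; stay defensive
        stats[core] = {'cpuUser': sample['user'], 'cpuSys': sample['sys']}
    return stats
```

With this shape, a core that goes offline mid-run simply drops out of the result instead of crashing the whole getVdsStats call.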
I think this should go to 3.6.z since people are seeing this in production too. Well at least one: https://plus.google.com/106942130198472400333/posts/FKwUf5FBnDq?_utm_source=1-2-2.
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset. Please set the correct milestone or drop the z stream flag.
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
make the bot happy
Verified on:
Red Hat Enterprise Virtualization Manager Version: 3.6.6-0.1.el6

Verification steps:
1. service vdsmd restart
2. echo 0 > /sys/devices/system/cpu/cpu2/online
3. Repeatedly run vdsClient -s 0 getVdsStats until vdsm thinks it has been running long enough to also provide some cpu sampling data.

There are no errors; vdsClient presents the information without the offline cpu until more sampling data is available.