Bug 1454633 - mom continuously crashing on getVmInfo (mom/HypervisorInterfaces/vdsmjsonrpcInterface.py) data['pid'] = vm['pid'] KeyError: 'pid'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: mom
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.0
Assignee: Andrej Krejcir
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On: 1496413
Blocks:
Reported: 2017-05-23 08:35 UTC by Shira Maximov
Modified: 2017-12-20 11:27 UTC (History)

Fixed In Version: mom-0.5.11-1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-20 11:27:33 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+
rule-engine: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
mom.log (1.61 MB, text/plain)
2017-05-23 08:35 UTC, Shira Maximov


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 81765 | 0 | master | MERGED | Handle missing PID in VM status | 2020-09-22 18:40:15 UTC
oVirt gerrit | 82036 | 0 | master | MERGED | mom: Do not use GuestQemuProc mom collector | 2020-09-22 18:40:19 UTC

Description Shira Maximov 2017-05-23 08:35:05 UTC
Created attachment 1281371
mom.log

Description of problem:
CPU QoS is not enforced as expected because the Guest Manager crashed.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 4.2.0-0.0.master.20170521155744.gitb6f1a86.el7.centos


How reproducible:
100%

Steps to Reproduce:
1. Create a CPU QoS with a 10% limit.
2. Create a CPU profile with the QoS created in step 1.
3. Attach the CPU profile to a VM, start the VM, and put it under load.
The host should allocate the following CPU percentage to the VM:
host cores / VM cores * 10 (the CPU QoS limit)

In my case, the host has 8 cores and the VM has 1 core,
so the host should allocate 80% CPU (of 1 core) to that VM. Instead, the VM gets 100%.
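
The arithmetic above can be written out as a small helper. This is only a sketch of the formula as quoted in this report, not code from mom or the engine; the function name `expected_vm_cpu_percent` and its parameters are hypothetical.

```python
def expected_vm_cpu_percent(host_cores, vm_cores, qos_limit_percent):
    """Expected CPU cap for the VM, per the formula quoted in this report:
    host cores / VM cores * QoS limit.

    Hypothetical helper for illustration; not part of mom.
    """
    if host_cores <= 0 or vm_cores <= 0:
        raise ValueError("core counts must be positive")
    # float() keeps the division exact on Python 2.7 as well as Python 3.
    return host_cores / float(vm_cores) * qos_limit_percent


# The reporter's case: 8-core host, 1-core VM, 10% QoS limit.
print(expected_vm_cpu_percent(8, 1, 10))  # 80.0
```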

Actual results:
The VM gets 100% of 1 core, but it should get only 80%.



Expected results:


Additional info:

in mom.log: 
2017-05-23 10:38:52,637 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 88, in run
    self._spawn_guest_monitors(domain_list)
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 113, in _spawn_guest_monitors
    info = self.hypervisor_iface.getVmInfo(id)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcInterface.py", line 133, in getVmInfo
    data['pid'] = vm['pid']
KeyError: 'pid'

The guest monitor is responsible for evaluating all the policies, so after this crash the VM has no QoS applied.
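
The traceback shows `getVmInfo` indexing `vm['pid']` unconditionally, which raises `KeyError` when the VM status carries no `pid` key. A minimal sketch of tolerating the missing key follows. It is not the merged patch: the function `build_vm_info` and the `name` key are hypothetical, and only the `data['pid'] = vm['pid']` line comes from the traceback.

```python
def build_vm_info(vm):
    """Hypothetical stand-in for the dict-building step in getVmInfo.

    `vm` is assumed to be the per-VM status dict returned by the
    hypervisor interface; the 'name' key is an illustrative assumption.
    """
    data = {}
    data['name'] = vm.get('name')
    # Original line from the traceback: data['pid'] = vm['pid']
    # dict.get() returns None instead of raising KeyError when the
    # status no longer includes a PID.
    data['pid'] = vm.get('pid')
    return data


# A status dict without 'pid' no longer crashes the caller.
info = build_vm_info({'name': 'vm1'})
print(info['pid'])  # None
```

Any consumer of `data['pid']` would then have to cope with `None`, which is a design choice this sketch leaves open.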

Comment 1 Martin Sivák 2017-05-23 12:25:05 UTC
Have you seen this in 4.1 as well? Or is it a 4.2-only issue?

Comment 2 Red Hat Bugzilla Rules Engine 2017-05-23 12:25:09 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Shira Maximov 2017-05-23 12:30:09 UTC
(In reply to Martin Sivák from comment #1)
> Have you seen this in 4.1 as well? Or is it a 4.2-only issue?

It's only in 4.2.

Comment 4 Michal Skrivanek 2017-05-24 05:20:37 UTC
IIRC the PID was dropped from stats

Comment 5 Yaniv Kaul 2017-09-12 10:58:48 UTC
I'm seeing it constantly crashing in ovirt-system-tests. Raising severity.
2017-09-12 06:53:47,056 - mom.GuestManager - INFO - Guest Manager starting: multi-thread
2017-09-12 06:53:47,061 - mom.Policy - INFO - Loaded policy '00-defines'
2017-09-12 06:53:47,064 - mom.Policy - INFO - Loaded policy '01-parameters'
2017-09-12 06:53:47,070 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 88, in run
    self._spawn_guest_monitors(domain_list)
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 113, in _spawn_guest_monitors
    info = self.hypervisor_iface.getVmInfo(id)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcInterface.py", line 133, in getVmInfo
    data['pid'] = vm['pid']
KeyError: 'pid'
2017-09-12 06:53:47,097 - mom.Policy - INFO - Loaded policy '02-balloon'
2017-09-12 06:53:47,122 - mom.Policy - INFO - Loaded policy '03-ksm'
2017-09-12 06:53:47,153 - mom.Policy - INFO - Loaded policy '04-cputune'
2017-09-12 06:53:47,189 - mom.Policy - INFO - Loaded policy '05-iotune'
2017-09-12 06:53:47,189 - mom.PolicyEngine - INFO - Policy Engine starting
2017-09-12 06:53:47,190 - mom.RPCServer - INFO - Using unix socket /var/run/vdsm/mom-vdsm.sock
2017-09-12 06:53:47,191 - mom.RPCServer - INFO - RPC Server starting
2017-09-12 06:53:48,924 - mom.RPCServer - INFO - ping()
2017-09-12 06:53:48,925 - mom.RPCServer - INFO - getStatistics()
2017-09-12 06:53:56,545 - mom.RPCServer - INFO - ping()
2017-09-12 06:53:56,546 - mom.RPCServer - INFO - getStatistics()
2017-09-12 06:54:02,205 - mom - ERROR - Thread 'GuestManager' has exited
2017-09-12 06:54:02,228 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:0 merge_across_nodes:1 run:0 sleep_millisecs:0
2017-09-12 06:54:02,233 - mom.PolicyEngine - INFO - Policy Engine ending
2017-09-12 06:54:02,559 - mom.RPCServer - INFO - RPC Server ending
2017-09-12 06:54:07,559 - mom - INFO - MOM ending
2017-09-12 06:54:12,724 - mom - INFO - MOM starting
2017-09-12 06:54:12,753 - mom.HostMonitor - INFO - Host Monitor starting
2017-09-12 06:54:12,753 - mom - INFO - hypervisor interface vdsmjsonrpcbulk
2017-09-12 06:54:12,929 - mom.HostMonitor - INFO - HostMonitor is ready
2017-09-12 06:54:13,074 - mom.GuestManager - INFO - Guest Manager starting: multi-thread
2017-09-12 06:54:13,082 - mom.Policy - INFO - Loaded policy '00-defines'
2017-09-12 06:54:13,088 - mom.Policy - INFO - Loaded policy '01-parameters'
2017-09-12 06:54:13,090 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 88, in run
    self._spawn_guest_monitors(domain_list)
  File "/usr/lib/python2.7/site-packages/mom/GuestManager.py", line 113, in _spawn_guest_monitors
    info = self.hypervisor_iface.getVmInfo(id)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcInterface.py", line 133, in getVmInfo
    data['pid'] = vm['pid']
KeyError: 'pid'
2017-09-12 06:54:13,110 - mom.Policy - INFO - Loaded policy '02-balloon'
2017-09-12 06:54:13,135 - mom.Policy - INFO - Loaded policy '03-ksm'
2017-09-12 06:54:13,167 - mom.Policy - INFO - Loaded policy '04-cputune'
2017-09-12 06:54:13,208 - mom.Policy - INFO - Loaded policy '05-iotune'





mom-0.5.10-0.0.master.el7.centos.noarch
vdsm-4.20.3-22.git95788e5.el7.centos.x86_64

Comment 6 Liran Rotenberg 2017-10-01 07:50:28 UTC
Verified on:
4.2.0-0.0.master.20170929123516.git007c392.el7.centos
vdsm-4.20.3-121.git77235c7.el7.centos.x86_64

Steps of verification:
1. Created a CPU QoS with a 10% limit.
2. Created a CPU profile with the QoS created in step 1.
3. Attached the CPU profile to a VM, started the VM, and put it under load.

As mentioned before:
The host should allocate the following CPU percentage to the VM:
host cores / VM cores * 10 (the CPU QoS limit)

I tried it on two hosts:
- One with 4 cores:
VM set to 1 core: I got 40% CPU.
Host is at 10%.

VM set to 2 cores: I got 20% CPU.
Host is at 10%.

- Second host with 8 cores:
VM set to 1 core: I got 80% CPU.
Host is at 10%.

VM set to 2 cores: I got 40% CPU.
Host is at 10%.

Actual results:
The VM gets the expected CPU share. The host stays within the QoS limit.
mom.log doesn't show any errors about the guest manager.

Comment 7 Sandro Bonazzola 2017-12-20 11:27:33 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

