Bug 1143992
Summary: QOS CPU profile not working when guest agent is not functioning
Product: Red Hat Enterprise Virtualization Manager
Component: mom
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 3.5.0
Target Milestone: ---
Target Release: 3.5.0
Hardware: Unspecified
OS: Linux
Whiteboard: sla
Keywords: Triaged
oVirt Team: SLA
Type: Bug
Doc Type: Bug Fix
Fixed In Version: vdsm-4.16.8.1-3.el6ev
Reporter: Nikolai Sednev <nsednev>
Assignee: Roy Golan <rgolan>
QA Contact: Nikolai Sednev <nsednev>
CC: bkorren, dfediuck, ecohen, gklein, iheim, lpeer, lsurette, mavital, mperina, nsednev, rbalakri, Rhev-m-bugs, sherold, yeylon
Clone Of:
: 1174669 (view as bug list)
Bug Blocks: 906927, 1084930, 1162774, 1164308, 1164311, 1174669
Last Closed: 2015-02-11 20:27:37 UTC
Description
Nikolai Sednev, 2014-09-18 12:10:06 UTC
I see this error in vdsm.log:

    GuestMonitor-VM10_stress::DEBUG::2014-09-18 15:08:16,500::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`7154e0ff-a1c6-4fe0-a4e8-9756d83e1529`::Domain Metadata is not set

I assume that without metadata support in libvirt this would never work.

By the way, I'm getting the same thing with libvirt version 1.1.3.6 on F20. Please post your component versions.

I used the latest components then, but it happens on these as well:

    rhevm-3.5.0-0.14.beta.el6ev.noarch
    libvirt-0.10.2-46.el6.x86_64
    vdsm-4.16.6-1.el6ev.x86_64
    qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
    sanlock-2.8-1.el6.x86_64

CentOS 6.5 does not support metadata elements at all (a libvirt limitation). You will get the metadata error on F20 for VMs that have no metadata. We handle it gracefully, but it is still logged (there is a fix for that which was not merged to the 3.5 branch, plus one related logging issue that is not ours).

Nikolai: if you test this on F19 or F20 it should work. If it doesn't, give us the following info:

    virsh dumpxml VM10_stress
    mom.log

(In reply to Martin Sivák from comment #5)
> CentOS 6.5 does not support metadata elements at all (libvirt limitation).
>
> You will get the metadata error on F20 for VMs that have no metadata. We
> handle it gracefully, but it is still logged (there is a fix for that that
> was not merged to 3.5 branch + one related logging issue that is not ours).
>
> Nikolai: if you test this on F19 or F20 it should work.
> If it doesn't, give us the following info:
>
> virsh dumpxml VM10_stress
> mom.log

I think it's RHEL 6.6, and the libvirt version is as stated in #4, 0.10.2. The vdsm log from alma shows:

    Thread-38::DEBUG::2014-09-18 14:51:45,878::BindingXMLRPC::1132::vds::(wrapper) client [10.35.163.77]::call vmUpdateVmPolicy with ({'vmId': '7154e0ff-a1c6-4fe0-a4e8-9756d83e1529', 'vcpuLimit': '80'},) {}
    Thread-38::DEBUG::2014-09-18 14:51:45,883::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 74 edom: 10 level: 2 message: argument unsupported: QEmu driver does not support modifying <metadata> element
    Thread-38::ERROR::2014-09-18 14:51:45,883::vm::3795::vm.Vm::(updateVmPolicy) vmId=`7154e0ff-a1c6-4fe0-a4e8-9756d83e1529`::updateVmPolicy failed
    Traceback (most recent call last):
      File "/usr/share/vdsm/virt/vm.py", line 3793, in updateVmPolicy
        METADATA_VM_TUNE_URI, 0)
      File "/usr/share/vdsm/virt/vm.py", line 670, in f
        ret = attr(*args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
        ret = f(*args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1597, in setMetadata
        if ret == -1: raise libvirtError ('virDomainSetMetadata() failed', dom=self)
    libvirtError: argument unsupported: QEmu driver does not support modifying <metadata> element

Seems like version problems in the test setup; the API is working with libvirt 0.10.2-46. Nikolai was able to get it running and is now checking the actual CPU limit. A separate bug should be opened for JSON-RPC not working with UpdateVmPolicy, because internally it had a missing argument. Nikolai, please fill in what's missing.

(In reply to Roy Golan from comment #7)
> Seems like version problems in the test setup and the api is working with
> libvirt 0.10.2-46.
>
> Nikolai was able to get it running and now he's checking the actual cpu
> limit.
> a separate bug should be opened for jsonrpc wasn't working with
> UpdateVmPolicy because internally it had a missing argument.
>
> Nikolai please fill-in what's missing

Even with JSON disabled, the feature is not working. Screenshots attached.

Created attachment 948840 [details]
screenshots

We need logs from that machine: libvirt, mom, vdsm. BTW, please make sure you hit "Sync MoM policies" at the Cluster->Hosts subtab, otherwise I think MoM wouldn't be in sync. msivak, correct me if I'm wrong.

I synced MoM policies at the Cluster->Hosts sub-tab and attached logs as requested. I didn't touch the JSON configs; reproduced on an HE 3.5 environment.

Created attachment 949501 [details]
mom.log
Why are the libvirt and vdsm logs missing? Please add them.

Nikolai, regardless of the mom errors, which may be unrelated, we need to know this is not a libvirt bug. Please create limitations using virsh on both RHEL 6.6 and RHEL 7, and open a libvirt bug if needed for the relevant RHEL, which would make this BZ a test-only bug.

What I see is:

Cause: GuestMonitor isn't reporting back the VM entity because it has a boolean "ready = False". This means that none of the controllers will work, so CpuTune will never set the vcpu quota.

Root cause: the GuestMemory collector is failing the field-validation phase because there is no guest agent installed, so none of the expected fields (swap_in, swap_out, etc.) is reported. When one of the collectors fails, it marks the monitor as "ready = False". The problem here is that a single validation failure of one collector fails the whole monitor cycle. Maybe in the early days, when the guest agent was a must, that was acceptable, but CpuTune works with the VM metadata.

I suggest the field check at Monitor.py:108 is wrong:

    if not set(data).issuperset(self.fields):
        self._set_not_ready("Incomplete data: missing %s" % \
                            (self.fields - set(data)))

since set(data) holds only the collected fields, it may be a subset, not a superset, of all the possible fields (self.fields).

Since this bug can easily be worked around to let you test the feature and unblock the RFE, I suggest the following:

1. Test the scenario with a guest agent installed. That is aligned with our RHEV customers, who are expected to install the guest agent.

2. Test with the guest memory collector excluded from mom.conf. Excluding GuestMemory from the collectors key in /etc/vdsm/mom.conf will prevent the "failure" inside mom and let execution continue to the CpuTune policy. That is aligned with oVirt users, who aren't expected to install the guest agent.

    # to exclude the GuestMemory collector
    sed -i '/collectors/ s/GuestMemory,// ' /etc/vdsm/mom.conf

3. Write notes on this bug and remove the blocking flags.

4.
Approve the RFE and put release-notes (if this bug isn't closed meanwhile).

(In reply to Roy Golan from comment #18)
> Since this bug could be easily worked around to let you test the feature and
> unblock the RFE I suggest to following:
> ...
> #to exclude the GuestMemory
> sed -i '/collectors/ s/GuestMemory,// ' /etc/vdsm/mom.conf

1. My setup uses a live CD, from which the VM boots each time it runs, so the guest-agent component can't be installed; this is not possible. Also, some customers might be using live CDs, and you can't force them to work with the guest agent. I tried to run on the components shown below, and the CPU SLA profile worked inaccurately (the limitation didn't lower VM CPUs to 10% as the policy enforced; the distributed load was much higher, 54%/35%/44% for 3 running VMs, and the host was at 99% CPU load, but now the limitation has started to work) after the mom config file was altered on each host as follows: "sed -i '/collectors/ s/GuestMemory,// ' /etc/vdsm/mom.conf".

    [root@blue-vdsc ~]# cat /etc/vdsm/mom.conf
    ### DO NOT REMOVE THIS COMMENT -- MOM Configuration for VDSM ###
    [main]
    # The wake up frequency of the main daemon (in seconds)
    main-loop-interval: 5

    # The data collection interval for host statistics (in seconds)
    host-monitor-interval: 5

    # The data collection interval for guest statistics (in seconds)
    guest-monitor-interval: 5

    # The wake up frequency of the guest manager (in seconds). The guest manager
    # sets up monitoring and control for newly-created guests and cleans up after
    # deleted guests.
    guest-manager-interval: 5

    # The interface MOM using to discover active guests and collect guest memory
    # statistics. There're two choices for it: libvirt or vdsm.
    hypervisor-interface: VDSM

    # The wake up frequency of the policy engine (in seconds). During each
    # interval the policy engine evaluates the policy and passes the results
    # to each enabled controller plugin.
    policy-engine-interval: 10

    # A comma-separated list of Controller plugins to enable
    controllers: Balloon, KSM, CpuTune

    # Sets the maximum number of statistic samples to keep for the purpose of
    # calculating moving averages.
    sample-history-length: 10

    # Set this to an existing, writable directory to enable plotting. For each
    # invocation of the program a subdirectory momplot-NNN will be created where NNN
    # is a sequence number. Within that directory, tab-delimited data files will be
    # created and updated with all data generated by the configured Collectors.
    plot-dir:

    # Activate the RPC server on the designated port (-1 to disable). RPC is
    # disabled by default until authentication is added to the protocol.
    rpc-port: -1

    # At startup, load a policy from the given directory. If empty, no policy is loaded
    policy-dir: /etc/vdsm/mom.d

    [logging]
    # Set the destination for program log messages. This can be either 'stdio' or
    # a filename. When the log goes to a file, log rotation will be done
    # automatically.
    log: /var/log/vdsm/mom.log

    # Set the logging verbosity level. The following levels are supported:
    # 5 or debug: Debugging messages
    # 4 or info: Detailed messages concerning normal program operation
    # 3 or warn: Warning messages (program operation may be impacted)
    # 2 or error: Errors that severely impact program operation
    # 1 or critical: Emergency conditions
    # This option can be specified by number or name.
    verbosity: info

    ## The following two variables are used only when logging is directed to a file.
    # Set the maximum size of a log file (in bytes) before it is rotated.
    max-bytes: 2097152

    # Set the maximum number of rotated logs to retain.
    backup-count: 5

    [host]
    # A comma-separated list of Collector plugins to use for Host data collection.
    collectors: HostMemory, HostKSM, HostCpu

    [guest]
    # A comma-separated list of Collector plugins to use for Guest data collection.
    collectors: GuestQemuProc, GuestBalloon, GuestCpuTune

2. I also installed a RHEL 6.5 VM with the stress tool and the guest agent on it; that did not work for me. It was running on a host together with HE and was not limited; I was even thrown out of the engine's WebUI session with error 501 or 503, not sure which of them.

Components as they appear on my setup:

    rhevm-3.5.0-0.19.beta.el6ev.noarch
    qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
    libvirt-0.10.2-46.el6_6.1.x86_64
    ovirt-hosted-engine-ha-1.2.4-1.el6ev.noarch
    vdsm-4.16.7.3-1.el6ev.x86_64
    ovirt-hosted-engine-setup-1.2.1-3.el6ev.noarch
    sanlock-2.8-1.el6.x86_64
    ovirt-host-deploy-1.3.0-1.el6ev.noarch

Regarding topics 3 & 4, please decide with PM.

5. Please add the "fixed in version" component version number as soon as you have it.

I verified using a Python script that libvirt on RHEL 6.6 does support metadata, i.e. libvirt version libvirt-0.10.2-46.el6_6.1.x86_64, as noted above.

So, 3 issues currently:

1. The JSON-RPC verb updateVmPolicy: msivak has already posted patches for it in Bug 1120246.

2. The policy variable cpuTuneEnabled is False by default, so the cputune policy never calculates. It needs to be set to true in 00-defines.policy.

3.
The policy ignores recalculation when the period didn't change: if the period didn't change, it has the value None, which means the controller will not call setVmCpuTune.

    04-cputune.policy:21
    (if (!= guest.vcpu_period calcPeriod)
        (guest.Control "vcpu_period" calcPeriod) 0)

    CpuTune.py:32
    if quota is not None and period is not None:

Patch 34528 is merged (its commit message is missing the Bug-Url, so it appears as NEW).

Added Martin's VDSM patch to the trackers.

*** Bug 1144280 has been marked as a duplicate of this bug. ***

Not working on vt13.1.

The cpuTuneEnabled patch wasn't sent to ovirt-engine-3.5, which is my mistake; it's still 0 by default. Niko, please elaborate whether you got things working by changing cpuTuneEnabled to 1 in /etc/vdsm/mom.d/00-defines and restarting vdsmd.

(In reply to Roy Golan from comment #26)
> Niko, please elaborate if you got things working by changing
> /etc/vdsm/mom.d/00-defines cpuTuneEnabled 1 and restart vdsmd

Looks better on the host where, following your tip, I modified the /etc/vdsm/mom.d/00-defines.policy config file and set cpuTuneEnabled to 1:

    # cat /etc/vdsm/mom.d/00-defines.policy
    # This file defines python constans that make it easier to convert data
    # received by setMOMPolicyParameters
    (defvar False 0)
    (defvar True 1)

    # Define variables for configurable options here
    (defvar ksmEnabled 1)
    (defvar balloonEnabled 0)
    (defvar cpuTuneEnabled 1)

Now it looks much better, and it seems the overall feature has started working on one of my hosts.
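The recalculation gap described above can be sketched in standalone Python: the policy only emits a control value when it differs from the observed one (mirroring the quoted 04-cputune.policy:21 line), while the controller applies nothing unless both quota and period are present (mirroring CpuTune.py:32). The function names here are illustrative, not the actual mom source:

```python
# Hedged sketch of the CpuTune control flow discussed above.
# policy_step/controller_step are illustrative names; only the None-check
# mirrors the quoted CpuTune.py:32 line.
def policy_step(observed_period, calc_quota, calc_period):
    controls = {}
    controls["vcpu_quota"] = calc_quota        # assume the quota changed
    if observed_period != calc_period:         # unchanged period -> never emitted
        controls["vcpu_period"] = calc_period
    return controls

def controller_step(controls):
    quota = controls.get("vcpu_quota")
    period = controls.get("vcpu_period")
    # Nothing is applied unless BOTH values are present, so a never-emitted
    # period leaves the new quota unapplied as well.
    if quota is not None and period is not None:
        return ("applied", quota, period)      # would call setVmCpuTune here
    return ("skipped",)

# Failure mode: the guest's period already equals the calculated period,
# so vcpu_period stays None and the new quota is silently dropped.
print(controller_step(policy_step(100000, 10000, 100000)))  # ('skipped',)
# When the period actually differs, both values reach the controller:
print(controller_step(policy_step(50000, 10000, 100000)))   # ('applied', 10000, 100000)
```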
    virt-top 17:41:29 - x86_64 2/2CPU 1999MHz 15948MB 93.4% 33.1% 96.6%
    7 domains, 7 active, 7 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
    CPU: 84.0%  Mem: 7168 MB (7168 MB by guests)
    ID  S  RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME NAME
     5  R     0    0  612    0 22.2  6.0  4:02.59 StressVM3
     2  R     0    1  674    0 14.1  6.0 12:41.95 RHEL6_5VM1
     7  R     0    0  612    0 13.4  6.0  4:08.65 StressVM2
     4  R     0    0  612    0  8.7  6.0  7:38.18 StressVM1
     6  R     0    0  612    0  8.7  6.0  3:52.52 StressVM4
     1  R     0    0  612    0  8.7  6.0  7:39.51 StressVM5
     3  R     0    0  612    0  8.1  6.0  7:52.37 StressVM6

CPU peaks can still be seen: although the 10% CPU limitation is enforced, some of the VMs get over it, sometimes even reaching 43%+ load, which means the policy isn't strict.

(In reply to Nikolai Sednev from comment #27)
> Still CPU peaks can be seen, although 10% CPU limitation enforced, some of
> the VMs getting over it and even sometimes getting to 43%+ load, which means
> that policy isn't strict.

CPU QoS is not host CPU load; it's a compute unit. Remember that CPU load can be more than 100%, and we ask the cgroup to limit to a partial quota. The policy is strict, but it depends on quota and period as defined in http://libvirt.org/formatdomain.html#elementsCPUTuning

Still not fixed in:

    qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
    libvirt-0.10.2-46.el6_6.2.x86_64
    vdsm-4.16.8.1-3.el6ev.x86_64
    ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
    sanlock-2.8-1.el6.x86_64
    ovirt-host-deploy-1.3.0-2.el6ev.noarch
    ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch
    rhevm-3.5.0-0.25.el6ev.noarch

I see that on both hosts CPU usage in the UI is shown as 75%-89%, while it is at 96%+ on the hosts.

    virt-top 20:07:05 - x86_64 4/4CPU 1600MHz 7872MB 85.8% 75.9%
    5 domains, 5 active, 5 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
    CPU: 68.7%  Mem: 8192 MB (8192 MB by guests)
    ID  S  RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME NAME
    28  R     0    0 1664    0 20.6 13.0  6:07.59 StressVM2
    29  R     0    0 1724    0 19.4 13.0  6:15.15 StressVM3
    30  R     0    0 1786    0 15.1 13.0  6:00.36 StressVM4
    31  R     0    0 1786    0 12.2 13.0  6:00.66 StressVM1
    24  R     0    6 6210 2332  1.4 52.0  3:56.84 HostedEngine

    virt-top 20:07:21 - x86_64 2/2CPU 1999MHz 15948MB 82.6% 75.2% 92.2% 65.6% 97.4% 99.1% 45.8% 82.2% 96.2% 86.2% 45.4%
    2 domains, 2 active, 2 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
    CPU: 98.8%  Mem: 2048 MB (2048 MB by guests)
    ID  S  RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME NAME
    14  R     0    0  84K    0 49.4  6.0  9:03.48 StressVM5
    15  R     0    0  84K    0 49.4  6.0  9:21.57 StressVM6

After the components were updated, the config file appears correct on both hosts:

    # cat /etc/vdsm/mom.d/00-defines.policy
    # This file defines python constans that make it easier to convert data
    # received by setMOMPolicyParameters
    (defvar False 0)
    (defvar True 1)

    # Define variables for configurable options here
    (defvar ksmEnabled 1)
    (defvar balloonEnabled 0)
    (defvar cpuTuneEnabled 1)

The guests are all without guest agents. Sometimes virt-top shows 42%+ for the first host, going far over the 10% limit.

Which mom version was used? Can you please provide the relevant log files?

mom-0.4.1-4.el6ev.noarch

Created attachment 968921 [details]
dump xmls from 2 hosts.tar.gz
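Roy's point above, that CPU QoS is a compute unit rather than a host-load cap, comes down to arithmetic over the libvirt cputune pair (vcpu_quota, vcpu_period). A minimal sketch; the helper name and the 100000 µs default period are illustrative assumptions, not vdsm's actual code:

```python
# Hedged sketch: convert a percent-style CPU limit into a libvirt
# <cputune> quota for a given scheduling period, per
# http://libvirt.org/formatdomain.html#elementsCPUTuning
# The helper name and the default 100 ms period are assumptions.
def vcpu_quota_for_limit(limit_pct, period_us=100000):
    """Allow each vCPU limit_pct% of one physical CPU per period."""
    return period_us * limit_pct // 100

print(vcpu_quota_for_limit(10))  # 10000: 10 ms of CPU time per 100 ms period
print(vcpu_quota_for_limit(2))   # 2000: the 2% policy exercised in this bug
```

Because the quota is enforced by the cgroup per scheduling period, tools such as virt-top that sample %CPU over short windows can show transient readings above the configured limit without the quota being violated over a full period.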
Attached dump_xmls from both hosts while JSON-RPC was active on both of them and a CPU SLA policy of 2% was running on both, while it doesn't seem to be working:

    mom-0.4.1-4.el6ev.noarch
    qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
    sanlock-2.8-1.el6.x86_64
    vdsm-4.16.8.1-3.el6ev.x86_64
    libvirt-0.10.2-46.el6_6.2.x86_64
    ovirt-host-deploy-1.3.0-2.el6ev.noarch
    ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
    ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch

brown-vdsd:

    virt-top 13:56:44 - x86_64 2/2CPU 1999MHz 15948MB
    2 domains, 2 active, 2 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
    CPU: 80.7%  Mem: 2048 MB (2048 MB by guests)
    ID  S  RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME NAME
    25  R     0    0 1754    0 41.2  6.0 16:06.18 StressVM6
    24  R     0    0 1754    0 39.5  6.0 15:41.39 StressVM5

blue-vdsc:

    virt-top 13:56:56 - x86_64 4/4CPU 1600MHz 7872MB
    6 domains, 6 active, 6 running, 0 sleeping, 0 paused, 0 inactive D:0 O:0 X:0
    CPU: 74.4%  Mem: 9216 MB (9216 MB by guests)
    ID  S  RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME NAME
    38  R     0    0 1930    0 24.1 13.0 15:20.22 StressVM2
    41  R     0    0 1930    0 23.7 13.0 16:24.16 StressVM3
    39  R     0    0 1992    0 12.5 13.0 15:34.40 StressVM4
    42  R     0    0    0    0 12.2 13.0 15:23.57 StressVM1
    32  R     0   13  23K 4771  1.7 52.0 10:48.93 HostedEngine
    40  R     0    0 1930    0  0.2 13.0  7:04.11 RHEL6_5VM1

Created attachment 968937 [details]
logs from both hosts and engine
There were two issues:

1. Nikolai's setup had a broken configuration (manually modified config files that RPM refused to update).
2. A VDSM bug that is now tracked in #1174669.

Works for me on these components:

    vdsm-4.16.8.1-4.el7ev.x86_64
    qemu-kvm-rhev-1.5.3-60.el7_0.11.x86_64
    mom-0.4.1-4.el7ev.noarch
    libvirt-client-1.2.8-10.el7.x86_64
    sanlock-3.2.2-2.el7.x86_64
    rhevm-3.5.0-0.26.el6ev.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0186.html
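For reference, the root cause analyzed earlier in this bug (a guest monitor marked not-ready because agent-only memory fields never arrive, which stalls every controller including CpuTune) can be reproduced with a small standalone sketch. Class and field names are illustrative; only the issuperset check mirrors the Monitor.py:108 line quoted in the comments:

```python
# Hedged sketch of the monitor-validation failure analyzed in this bug.
# GuestMonitor and the field names are illustrative, not the mom source.
class GuestMonitor:
    def __init__(self, expected_fields):
        self.fields = set(expected_fields)  # fields every cycle must report
        self.ready = True
        self.last_error = None

    def _set_not_ready(self, message):
        self.ready = False
        self.last_error = message

    def collect(self, data):
        # A single collector missing its fields fails the whole cycle,
        # so CpuTune never runs even though it needs no agent data.
        if not set(data).issuperset(self.fields):
            self._set_not_ready("Incomplete data: missing %s"
                                % (self.fields - set(data)))
        else:
            self.ready = True
        return self.ready

mon = GuestMonitor(["swap_in", "swap_out", "vcpu_count"])
# No guest agent: memory fields never arrive, the monitor goes not-ready.
print(mon.collect({"vcpu_count": 2}))                               # False
# With the agent installed (or GuestMemory removed from the expected set):
print(mon.collect({"swap_in": 0, "swap_out": 0, "vcpu_count": 2}))  # True
```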