Red Hat Bugzilla – Bug 1396050
fence_vmware_soap causes high CPU usage
Last modified: 2018-05-10 17:39:05 EDT
Description of problem:
Please help us analyze whether it is normal for fence_vmware_soap to use this much CPU. Why does it happen? Is it a bug?
~~~
  PID USER  PR NI   VIRT    RES  SHR S  %CPU %MEM   TIME+ COMMAND
55694 root  20  0 386244 229636 5360 R  96.4  1.4 0:15.80 fence_vmware_so
55701 root  20  0 382160 225400 5360 R  96.4  1.4 0:15.23 fence_vmware_so
~~~
The CPU usage is too high, but it only lasts about 20 s at a one-minute interval. I tested this in my lab environment and got the same result. I also updated fence-agents-vmware-soap from 4.0.11-27.el7.x86_64 to fence-agents-vmware-soap-4.0.11-47.el7.x86_64; the result was the same.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
pacemaker-1.1.13-10.el7.x86_64
corosync-2.3.4-7.el7.x86_64
pcs-0.9.143-15.el7.x86_64

How reproducible:
Always, while the agent is running.

Steps to Reproduce:
1. top

Actual results:
fence_vmware_soap causes high CPU usage.

Expected results:

Additional info:
There are not many loops in fence_vmware_soap, so this should not happen and it looks like a bug. Can you send me verbose output? (verbose=1 or -v on the command line) Does your VMware host run a lot of virtual machines?
Thank you for your attention. I had a test environment a few days ago, but I no longer have it. The issue is easy to reproduce: it lasts about 20 s at a one-minute interval, and during that time you can see the CPU usage is too high. The verbose output was a long page; I kept it in my test environment, but that environment is gone, so I can't send you the log. I will rebuild a test environment if I can find a VMware platform.

(In reply to Marek Grac from comment #2)
> There are not many loops in fence_vmware_soap, so this should not
> happen and it looks like a bug.
>
> Can you send me verbose output? (verbose=1 or -v on the command line)
>
> Does your VMware host run a lot of virtual machines?
If it happens every minute, I suspect it is the monitoring action, which lists all available VMs. That can take a while on the VMware side, but without logs I can't tell whether we are doing something wrong.
Hi Marek,

>> I will suspect that is monitoring action which lists all available VM

Exactly. In my test environment we have only a few VMs. The fence monitor configuration looks like the following; the monitor checks the fence status every 60 s by default:
~~~
...
<op id="vmware_fence1-monitor-interval-60s" interval="60s" name="monitor"/>
...
<op id="vmware_fence2-monitor-interval-60s" interval="60s" name="monitor"/>
...
~~~
We also checked the fence status manually with this command:
~~~
# fence_vmware_soap -o status
~~~
Running the command above makes the VM spend a lot of CPU (we allocated 2 cores and one socket to this VM, the same as the customer's environment). We can observe this with "top": when "fence_vmware_soap -o status" is executed, CPU utilization rises to 80-100% immediately, and each process continues for about 20 s.

Even worse, our test environment has since been broken, so we can't reproduce this issue any more. Do you have a vCenter environment to test this issue?
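For reference, a minimal sketch of how the manual status check can be timed outside of pacemaker. The host, user, password, and plug name are placeholders, and flag availability may vary with the fence-agents version; this is only an illustration of the measurement, not part of the cluster configuration:
~~~
import subprocess
import time

# Placeholder connection details; outside of pacemaker the agent needs full
# credentials on the command line.
cmd = [
    "fence_vmware_soap",
    "-a", "vcenter.example.com",   # vCenter address (assumption)
    "-l", "vcenter_user",          # login (assumption)
    "-p", "password",              # password (assumption)
    "-z",                          # use SSL
    "-n", "vm-name",               # plug / VM name (assumption)
    "-o", "status",
]

start = time.time()
rc = subprocess.call(cmd)
print("status action returned %d after %.1fs" % (rc, time.time() - start))
~~~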
Hi, I have tested this issue on our vCenter 5.5 and I can confirm it. I was able to track it down to a specific line that opens the connection and logs the user into VMware:
~~~
conn = Client(url + "/vimService.wsdl", location=url, transport=RequestsTransport(verify=verify), headers=headers)
~~~
This line takes the majority of the execution time. The bug might be in the python-suds package that we use, but I'm not sure about it. When the fence agent is executed with the verbose flag (-v / verbose=1), the communication log is more than 80 MB, which might be an issue on the VMware side.
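To show where the time goes, here is a rough standalone sketch that times only the suds Client construction. The URL is a placeholder, and the transport/headers arguments used by the agent are omitted, so this approximates the agent's code rather than copying it:
~~~
import time
from suds.client import Client   # python-suds, as used by fence_vmware_soap

url = "https://vcenter.example.com/sdk"   # placeholder vCenter SDK endpoint

start = time.time()
# Downloading and parsing vimService.wsdl is where most of the time appears to go.
conn = Client(url + "/vimService.wsdl", location=url)
print("Client() took %.1fs" % (time.time() - start))
~~~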
We're seeing this in our VMware-hosted Pacemaker clusters too, and the CPU usage is seriously high. In our case the fence agent is contacting a vCenter which hosts 892 VMs. It takes around 13 seconds to run "stonith_admin -Q vmware_fence".
I don't quite see why the fence agent would need to list all VMs, and not just the VMs in the cluster...
@John: AFAIK the problem is in the login process which takes a really long time (and 60MB of data from VMWare). If you have worked with their API (in any language) and you have some tips how to improve it, I would be glad to implement it.
@Marek - I may take you up on that, as I've used the VMware API, mostly using vmware's Perl modules, and not seen this shirt off slowdown/CPU hit.
Hmm, using BZ with phone autocorrect not advisable... "sort of" instead of "shirt off"!
An optional use of vCenter 6.5's REST API would be nice and much simpler, but I guess that would be fence-agents-vmware-vcenter-rest, and not fence-agents-vmware-soap. Just trying it here, I can:
* Get a session token in about half a second
* Find a VM by name in under a second
* Fetch full details of a VM in under a second.
As for the SOAP API, with an old Perl script of mine, I can fetch datastore utilisation for all datastores in a datacenter in under a second.
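For anyone curious, roughly what those REST calls look like from Python. The endpoint paths are from the vSphere 6.5 Automation REST API; the host, credentials, and VM name are placeholders, so treat this as a sketch rather than a finished agent:
~~~
import requests

VCENTER = "vcenter.example.com"   # placeholder
BASE = "https://" + VCENTER + "/rest"

s = requests.Session()
s.verify = False   # equivalent of --ssl-insecure; use a CA bundle in production

# 1. Get a session token (about half a second here)
r = s.post(BASE + "/com/vmware/cis/session", auth=("vcenter_user", "password"))
s.headers["vmware-api-session-id"] = r.json()["value"]

# 2. Find a VM by name (under a second)
r = s.get(BASE + "/vcenter/vm", params={"filter.names": "vm-name"})
vm_id = r.json()["value"][0]["vm"]

# 3. Fetch full details of that VM (under a second), including power state
r = s.get(BASE + "/vcenter/vm/" + vm_id)
print(r.json()["value"]["power_state"])
~~~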
For something more comparable, sample pyvmomi scripts can do useful work in under a second too. pyvmomi appears to use its own SOAP implementation and not suds though. With a bit of debug in fence_vmware_soap I determined that it is indeed fetching all VMs with the VMware API, and then attempting to match the fence "plug". In our environment we are "lucky" in that of the >800 VMs in the vCenter, the user used to query vCenter only has access to 36 VMs, but still running "status" takes 10s. If I supply a username that can read all VMs, the time only goes up to 12s, hence proving it's at least not the VM power status fetching that's slow, but something more in the connection/login phase, as you said.
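For comparison, this is the kind of pyvmomi snippet I mean: it connects, walks a container view over all VMs, and reads the power state of one named VM. Credentials and names are placeholders, and this is just a sketch, not code from any fence agent:
~~~
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

# Placeholder connection details
si = SmartConnectNoSSL(host="vcenter.example.com",
                       user="vcenter_user", pwd="password")
try:
    content = si.RetrieveContent()
    # A container view over all VMs; this is quick even on a large vCenter.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name == "vm-name":
            print(vm.runtime.powerState)   # poweredOn / poweredOff / suspended
            break
finally:
    Disconnect(si)
~~~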
https://github.com/ClusterLabs/fence-agents/pull/153
Oyvind: Oh, thank you for the new agent! We shall have to give it a go, once I figure out how to add a new fence agent.
OK, first feedback on fence_vmware_rest.py... first I tried it on a CentOS 6 machine, and it didn't work for me, at least when run the way I can run fence_vmware_soap. This is presumably because it relies upon something new in the common fence agent library:
~~~
./fence_vmware_rest.py --ssl --ssl-insecure -a vcenter.example.com -l vcenter_user -p password -o status -n vm-name
Traceback (most recent call last):
  File "./fence_vmware_rest.py", line 183, in <module>
    main()
  File "./fence_vmware_rest.py", line 175, in main
    conn = connect(options)
  File "./fence_vmware_rest.py", line 78, in connect
    logging.debug("Failed: {}".format(e))
ValueError: zero length field name in format
~~~
I then tried it on a CentOS 7 machine and it works:
~~~
time ./fence_vmware_rest.py --ssl --ssl-insecure -a vcenter.example.com -l vcenter_user -p password -o status -n vm-name
0.14s user 0.12s system 10% cpu 2.502 total
~~~
So, a decent time, and not a lot of system time or CPU... a massive improvement over fence_vmware_soap!
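For what it's worth, my reading of the CentOS 6 failure (an assumption on my part, not something confirmed in the pull request) is that the immediate ValueError is a Python version issue: Python 2.6 does not accept auto-numbered replacement fields in str.format(), which is exactly the "zero length field name in format" error above, and it masks whatever original exception was being logged. An illustration, not code from the agent:
~~~
"Failed: {}".format("e")    # ValueError on Python 2.6: zero length field name in format
"Failed: {0}".format("e")   # accepted on Python 2.6 and later
~~~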
Great. This bz is for RHEL7, so I didn't test it on RHEL6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0758