Bug 2102678 - vdsmd uses ~ 100% cpu on HostedEngine VM start and stop
Summary: vdsmd uses ~ 100% cpu on HostedEngine VM start and stop
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.50.1.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.5.2
Target Release: ---
Assignee: Marcin Sobczyk
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-30 12:35 UTC by Yedidyah Bar David
Modified: 2022-08-30 08:49 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-08 08:17:16 UTC
oVirt Team: Infra
Embargoed:
mperina: ovirt-4.5+


Attachments
vdsm-ha-logs.tar.gz (2.76 MB, application/gzip) - 2022-06-30 12:35 UTC, Yedidyah Bar David
vdsm profile file (505.06 KB, application/octet-stream) - 2022-07-01 14:42 UTC, Albert Esteve
top output while reproducing (49.06 KB, text/plain) - 2022-07-01 16:33 UTC, Nir Soffer
top output for single request (48.73 KB, text/plain) - 2022-07-01 17:02 UTC, Nir Soffer


Links
Github oVirt vdsm pull 274: "Fix a busy loop after socket is closed" (open) - last updated 2022-07-01 19:27:27 UTC
Red Hat Issue Tracker RHV-46826 - last updated 2022-06-30 12:51:26 UTC

Description Yedidyah Bar David 2022-06-30 12:35:57 UTC
Created attachment 1893642 [details]
vdsm-ha-logs.tar.gz

Description of problem:

When starting the HE VM with 'hosted-engine --vm-start', vdsmd immediately jumps from 2-3% CPU usage to 99-100% and stays there for many minutes, even after the VM is up and idle; eventually it goes back down. Similarly, when shutting down the VM, vdsmd jumps again to 99-100% CPU and remains high for many minutes, also after the VM is down. Not sure these are the only affected flows, or whether it's related to HE at all, but these are the ones I noticed.

Version-Release number of selected component (if applicable):
Current master, not sure when it started

How reproducible:
Not sure, saw it many times

Steps to Reproduce:
1. Deploy hosted-engine and monitor vdsmd's cpu usage
2. Wait until vdsmd is (almost) idle, at 2-3% cpu
3. Shut down the engine VM (e.g. by logging into it and running 'poweroff')
4. Wait until vdsmd is idle
5. hosted-engine --vm-start

Actual results:
vdsmd's cpu usage remains high for a very long time after starting/stopping the HE VM.

Expected results:
vdsmd's cpu usage remains low, or at least goes down much faster.

Additional info:
Attaching relevant logs. The vdsm and HA logs were collected with the log level set to DEBUG; top.log is the output of 'top -b > top.log'.

Comment 1 Nir Soffer 2022-06-30 18:35:31 UTC
Based on top.log, the events are:

10:14:43 high cpu
10:28:17 normal cpu
11:43:59 high cpu
12:00:16 normal cpu

Which APIs are called during this time?

First combine vdsm logs:
cat vdsm.log.1 vdsm.log > vdsm-combined.log

Then create two logs, one from 10:14:43 to 10:28:17 (vdsm-high.log)
and one from 10:28:17 to 11:43:59 (vdsm-normal.log), for example with the small helper sketched below.
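
A possible helper for the split (purely illustrative, not part of vdsm; it assumes every relevant line starts with a "YYYY-MM-DD HH:MM:SS" timestamp as in the vdsm logs, so continuation lines such as tracebacks would need extra handling):

    # Hypothetical helper, not part of vdsm.
    def split_log(src_path, start, end, dst_path):
        """Copy lines whose leading timestamp falls in [start, end)."""
        with open(src_path) as src, open(dst_path, "w") as dst:
            for line in src:
                ts = line[:19]               # e.g. "2022-06-30 10:15:57"
                if start <= ts < end:
                    dst.write(line)

    split_log("vdsm-combined.log", "2022-06-30 10:14:43",
              "2022-06-30 10:28:17", "vdsm-high.log")
    split_log("vdsm-combined.log", "2022-06-30 10:28:17",
              "2022-06-30 11:43:59", "vdsm-normal.log")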

Health logs:

$ grep '\[health\] user=' vdsm-high.log 
2022-06-30 10:15:57,398+0000 DEBUG (health) [health] user=68.56%, sys=4.76%, rss=107364 kB (+108), threads=57 (health:126)
2022-06-30 10:20:57,434+0000 DEBUG (health) [health] user=93.28%, sys=5.88%, rss=107400 kB (+36), threads=57 (health:126)
2022-06-30 10:25:57,466+0000 DEBUG (health) [health] user=93.28%, sys=5.88%, rss=107480 kB (+80), threads=57 (health:126)

$ grep '\[health\] user=' vdsm-normal.log 
2022-06-30 10:30:57,514+0000 DEBUG (health) [health] user=43.54%, sys=3.49%, rss=107500 kB (+20), threads=57 (health:126)
2022-06-30 10:35:57,569+0000 DEBUG (health) [health] user=2.97%, sys=1.47%, rss=107532 kB (+32), threads=57 (health:126)
2022-06-30 10:40:57,623+0000 DEBUG (health) [health] user=2.86%, sys=1.50%, rss=107588 kB (+56), threads=57 (health:126)
2022-06-30 10:45:57,673+0000 DEBUG (health) [health] user=2.96%, sys=1.47%, rss=107640 kB (+52), threads=57 (health:126)
2022-06-30 10:50:57,717+0000 DEBUG (health) [health] user=2.95%, sys=1.52%, rss=107644 kB (+4), threads=57 (health:126)
2022-06-30 10:55:57,764+0000 DEBUG (health) [health] user=3.09%, sys=1.55%, rss=107648 kB (+4), threads=57 (health:126)
2022-06-30 11:00:57,834+0000 DEBUG (health) [health] user=3.10%, sys=1.55%, rss=107668 kB (+20), threads=57 (health:126)
2022-06-30 11:05:57,888+0000 DEBUG (health) [health] user=3.09%, sys=1.55%, rss=107672 kB (+4), threads=57 (health:126)
2022-06-30 11:10:57,956+0000 DEBUG (health) [health] user=3.08%, sys=1.54%, rss=107680 kB (+8), threads=57 (health:126)
2022-06-30 11:15:58,012+0000 DEBUG (health) [health] user=3.00%, sys=1.51%, rss=107688 kB (+8), threads=57 (health:126)
2022-06-30 11:20:58,063+0000 DEBUG (health) [health] user=3.02%, sys=1.45%, rss=107696 kB (+8), threads=57 (health:126)
2022-06-30 11:25:58,104+0000 DEBUG (health) [health] user=3.02%, sys=1.48%, rss=107732 kB (+36), threads=57 (health:126)
2022-06-30 11:30:58,166+0000 DEBUG (health) [health] user=3.04%, sys=1.54%, rss=107744 kB (+12), threads=57 (health:126)
2022-06-30 11:35:58,206+0000 DEBUG (health) [health] user=10.07%, sys=1.73%, rss=107920 kB (+176), threads=57 (health:126)
2022-06-30 11:40:58,271+0000 DEBUG (health) [health] user=3.15%, sys=1.58%, rss=109680 kB (+1760), threads=57 (health:126)

API calls when cpu is high (10:14:43 to 10:28:17, 13 minutes):

$ awk '/\[jsonrpc.JsonRpcServer\] Calling/ {print $7}' vdsm-high.log | sort | uniq -c | sort -rn
    544 'Host.ping2'
    163 'Host.getStorageRepoStats'
    163 'Host.getStats'
     96 'VM.getStats'
     54 'Host.getAllVmStats'

API calls when cpu is normal (10:28:17 to 11:43:59, 76 minutes):

$ awk '/\[jsonrpc.JsonRpcServer\] Calling/ {print $7}' vdsm-normal.log | sort | uniq -c | sort -rn
   3025 'Host.ping2'
    939 'Host.getStats'
    906 'Host.getStorageRepoStats'
    530 'VM.getStats'
    332 'Host.getAllVmStats'
     34 'Host.getAllVmIoTunePolicies'
      1 'VM.destroy'
      1 'VM.create'
      1 'StoragePool.getSpmStatus'
      1 'StoragePool.connect'
      1 'Host.setKsmTune'
      1 'Host.hostdevListByCaps'
      1 'Host.getHardwareInfo'
      1 'Host.getCapabilities'
      1 'Host.dumpxmls'

The request distribution is similar, so this is not caused by client activity.

However, some calls are unexpected:
- Host.ping2 is called every 1.5 seconds - why? This API should be used to check connectivity
  with vdsm, and there is no reason to check connectivity every 1.5 seconds.
- Host.getStorageRepoStats is called every 4.7 seconds - why?
  Vdsm expects this call once every 15 seconds.
- Host.getStats is called every 5 seconds - why?

We will have to reproduce this and maybe profile vdsm.

Comment 2 Albert Esteve 2022-07-01 14:42:25 UTC
Created attachment 1893929 [details]
vdsm profile file

I have been able to reproduce and profile the issue. Attached the *.prof file.

Vdsm configuration for the profiling:
# cat /etc/vdsm/vdsm.conf
[vars]
ssl= true 
 
[addresses]
management_port= 54321 

[devel]
cpu_profile_enable = true
cpu_profile_filename = /run/vdsm/vdsmd.prof
cpu_profile_format = pstat
cpu_profile_clock = cpu
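
The "pstat" format configured above can also be loaded with Python's standard pstats module, in case the contrib/profile-stats helper used in the next comment is not at hand (a minimal sketch):

    import pstats

    stats = pstats.Stats("/run/vdsm/vdsmd.prof")
    stats.sort_stats("cumulative").print_stats(20)   # roughly the -scumtime view below
    stats.print_callers(20)                          # who called the hot functions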

Comment 3 Albert Esteve 2022-07-01 15:01:29 UTC
Some interesting inspections of the profile file above.

- Compact print of the most expensive functions
$ contrib/profile-stats -c ~/Work/vdsmd.prof | head -30
Fri Jul  1 16:01:15 2022    /home/aesteve/Work/vdsmd.prof

         435235626 function calls (435453317 primitive calls) in 713.979 seconds

   Ordered by: internal time
   List reduced from 3523 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 37566669  106.104    0.000  260.103    0.000 stomp.py:483(AsyncDispatcher.next_check_interval)
45078444/45078455   95.478    0.000  456.348    0.000 betterAsyncore.py:171(Dispatcher._delegate_call)
3758837/3758839   76.148    0.000  640.156    0.000 asyncore.py:164(poll2)
 37566669   45.854    0.000   98.010    0.000 stomp.py:465(AsyncDispatcher._outgoing_heartbeat_expiration_interval)
  3758839   42.142    0.000  234.540    0.000 betterAsyncore.py:254(Reactor._get_timeout)
 18784108   36.588    0.000  194.046    0.000 stomp.py:522(AsyncDispatcher.writable)
 37592173   35.063    0.000   52.220    0.000 time.py:45(monotonic_time)
 18784259   30.633    0.000  170.400    0.000 betterAsyncore.py:89(Dispatcher.next_check_interval)
  7508178   22.216    0.000   68.581    0.000 ssl.py:825(SSLSocket.read)
 18784259   19.919    0.000   86.992    0.000 betterAsyncore.py:65(Dispatcher.readable)
 18784108   19.650    0.000   27.484    0.000 stompserver.py:75(StompAdapterImpl.has_outgoing_messages)
 18784259   18.641    0.000  268.995    0.000 betterAsyncore.py:68(Dispatcher.writable)
 37566669   17.775    0.000   17.775    0.000 stomp.py:471(AsyncDispatcher._incoming_heartbeat_expiration_interval)
 37566669   17.620    0.000   17.620    0.000 stomp.py:477(AsyncDispatcher._reconnect_expiration_interval)
  7508167   17.015    0.000  113.872    0.000 stomp.py:405(AsyncDispatcher.handle_read)
  7511526   14.452    0.000  171.257    0.000 asyncore.py:105(readwrite)
  7508188   12.744    0.000   81.325    0.000 sslutils.py:42(SSLSocket.read)
  7508178   10.337    0.000   91.661    0.000 betterAsyncore.py:117(Dispatcher.recv)
  7508178    9.762    0.000   43.128    0.000 ssl.py:583(SSLObject.read)
  7509707    9.665    0.000  156.392    0.000 asyncore.py:413(AsyncoreEvent.handle_read_event)

- Compact print of the most expensive functions sorted by cumtime
$ contrib/profile-stats -c -scumtime ~/Work/vdsmd.prof | head -30
Fri Jul  1 16:01:15 2022    /home/aesteve/Work/vdsmd.prof

         435235626 function calls (435453317 primitive calls) in 713.979 seconds

   Ordered by: cumulative time
   List reduced from 3523 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3758837/3758839    8.758    0.000  650.918    0.000 asyncore.py:192(loop)
3758837/3758839   76.148    0.000  640.156    0.000 asyncore.py:164(poll2)
45078444/45078455   95.478    0.000  456.348    0.000 betterAsyncore.py:171(Dispatcher._delegate_call)
 18784259   18.641    0.000  268.995    0.000 betterAsyncore.py:68(Dispatcher.writable)
 37566669  106.104    0.000  260.103    0.000 stomp.py:483(AsyncDispatcher.next_check_interval)
  3758839   42.142    0.000  234.540    0.000 betterAsyncore.py:254(Reactor._get_timeout)
 18784108   36.588    0.000  194.046    0.000 stomp.py:522(AsyncDispatcher.writable)
  7511526   14.452    0.000  171.257    0.000 asyncore.py:105(readwrite)
 18784259   30.633    0.000  170.400    0.000 betterAsyncore.py:89(Dispatcher.next_check_interval)
  7509707    9.665    0.000  156.392    0.000 asyncore.py:413(AsyncoreEvent.handle_read_event)
7508194/7508204    8.136    0.000  146.745    0.000 betterAsyncore.py:71(Dispatcher.handle_read)
  7508167   17.015    0.000  113.872    0.000 stomp.py:405(AsyncDispatcher.handle_read)
 37566669   45.854    0.000   98.010    0.000 stomp.py:465(AsyncDispatcher._outgoing_heartbeat_expiration_interval)
  7508178   10.337    0.000   91.661    0.000 betterAsyncore.py:117(Dispatcher.recv)
 18784259   19.919    0.000   86.992    0.000 betterAsyncore.py:65(Dispatcher.readable)
  7508188   12.744    0.000   81.325    0.000 sslutils.py:42(SSLSocket.read)
  7508178   22.216    0.000   68.581    0.000 ssl.py:825(SSLSocket.read)
 37592173   35.063    0.000   52.220    0.000 time.py:45(monotonic_time)
  7508178    9.762    0.000   43.128    0.000 ssl.py:583(SSLObject.read)
 18784108   19.650    0.000   27.484    0.000 stompserver.py:75(StompAdapterImpl.has_outgoing_messages)

- Detail of the top functions ordered by cumtime:
   Ordered by: cumulative time
   List reduced from 3523 to 20 due to restriction <20>

Function                                                               was called by...
                                                                           ncalls  tottime  cumtime
asyncore.py:192(loop)                                                  <- 3758839/3758837    8.758  650.918  betterAsyncore.py:239(Reactor.process_requests)
asyncore.py:164(poll2)                                                 <- 3758839/3758837   76.148  640.156  asyncore.py:192(loop)
betterAsyncore.py:171(Dispatcher._delegate_call)                       <-     6/5    0.000    0.002  betterAsyncore.py:53(Dispatcher.handle_close)
                                                                               11    0.000    0.008  betterAsyncore.py:56(Dispatcher.handle_accept)
                                                                         18784259   39.601   67.072  betterAsyncore.py:65(Dispatcher.readable)
                                                                         18784259   38.866  250.355  betterAsyncore.py:68(Dispatcher.writable)
                                                                        7508204/7508194   17.003  138.610  betterAsyncore.py:71(Dispatcher.handle_read)
                                                                             1716    0.008    0.302  betterAsyncore.py:74(Dispatcher.handle_write)
betterAsyncore.py:68(Dispatcher.writable)                              <- 18784259   18.641  268.995  asyncore.py:164(poll2)
stomp.py:483(AsyncDispatcher.next_check_interval)                      <- 18784108   52.906  130.132  betterAsyncore.py:89(Dispatcher.next_check_interval)
                                                                         18782561   53.198  129.972  stomp.py:522(AsyncDispatcher.writable)
betterAsyncore.py:254(Reactor._get_timeout)                            <- 3758839   42.142  234.540  betterAsyncore.py:239(Reactor.process_requests)
stomp.py:522(AsyncDispatcher.writable)                                 <- 18784108   36.588  194.046  betterAsyncore.py:171(Dispatcher._delegate_call)

Comment 4 Albert Esteve 2022-07-01 15:53:31 UTC
So it looks like it gets stuck in a busy loop within the `asyncore.loop` call. I do not understand this part of the code in depth, so I am not sure exactly why this is happening.

Some notes from my side:
- This was not 100% reproducible in my case; it does not happen every time, and sometimes it took a few tries to hit it.
  Sometimes I hit it when shutting the VM down (which already takes a long time), sometimes I need to invoke `hosted-engine --vm-start` to trigger the issue.

- There are a few modules using the send() call from the Dispatcher in betterAsyncore.py:
lib/yajsonrpc/__init__.py:220:    def sendReply(self):
lib/yajsonrpc/betterAsyncore.py:145:    def send(self, data):
lib/yajsonrpc/jsonrpcclient.py:173:        def send_notification(message):
lib/yajsonrpc/stomp.py:613:    def send_raw(self, msg):
lib/yajsonrpc/stompclient.py:169:    def send(self, destination, data="", headers=None):
lib/yajsonrpc/stompclient.py:292:    def send(self, message, destination=SUBSCRIPTION_ID_RESPONSE,
lib/yajsonrpc/stompclient.py:342:    def send(self, data, destination=None, flow_id=None):
lib/yajsonrpc/stompserver.py:306:    def send(self, message, destination=stomp.SUBSCRIPTION_ID_RESPONSE):
lib/yajsonrpc/stompserver.py:443:    def send(self, data):

Adding a DEBUG log to these calls could help to find from which thread they are called, and probably reveal a pattern there; something like the sketch below.
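
A hypothetical example of that kind of instrumentation (illustrative only, not actual vdsm code) - a decorator that logs the calling thread and payload size of a send-style method:

    # Hypothetical instrumentation, not actual vdsm code.
    import functools
    import logging
    import threading

    log = logging.getLogger("vds.dispatcher")

    def log_caller_thread(send_func):
        """Wrap a send-style method so each call logs the calling thread."""
        @functools.wraps(send_func)
        def wrapper(self, data, *args, **kwargs):
            log.debug("%s: %d bytes queued from thread %r",
                      send_func.__qualname__, len(data),
                      threading.current_thread().name)
            return send_func(self, data, *args, **kwargs)
        return wrapper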

@msobczyk could you please take a look?

Comment 5 Nir Soffer 2022-07-01 16:31:31 UTC
I found an easy way to reproduce this without hosted engine:

    for i in $(seq 20); do sudo vdsm-client Host getStorageRepoStats; sleep 5; done

Vdsm is using 100% cpu constantly while this runs:

top - 19:28:23 up 16 min,  3 users,  load average: 3.77, 2.74, 1.56
Threads: 422 total,  15 running, 407 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.0 us,  3.9 sy,  0.0 ni, 46.4 id,  0.1 wa,  0.4 hi,  0.1 si,  0.0 st
MiB Mem :   3731.7 total,   1303.3 free,   1254.8 used,   1173.6 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2130.2 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4672 vdsm       0 -20 1956740 106552  30248 R  97.8   2.8   3:50.64 JsonRpc (StompR
   6400 vdsm       0 -20 1956740 106552  30248 S   0.6   2.8   0:09.90 mailbox-spm
   6995 qemu      20   0 1988076 561448  23592 S   0.6  14.7   0:25.41 CPU 0/KVM
   5984 vdsm       0 -20 1956740 106552  30248 S   0.4   2.8   0:00.68 monitor/90789a0
   5087 vdsm      20   0  561656  32576  10360 S   0.4   0.9   0:01.88 momd

Attaching top.log generated by:

    top -bHi -d5 > top.log

Comment 6 Nir Soffer 2022-07-01 16:33:04 UTC
Created attachment 1893945 [details]
top output while reproducing

Comment 7 Nir Soffer 2022-07-01 17:01:31 UTC
It is actually even simpler - we can reproduce it with a single API call:

    sudo vdsm-client Host getStorageRepoStats

This causes vdsm to consume 100% cpu for about 22 seconds:

Running:

    top -bHi -d1 > top-single.log

Vdsm consumed ~100% cpu from:

top - 19:49:14 up 37 min,  3 users,  load average: 1.79, 1.86, 1.61
Threads: 423 total,   2 running, 421 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.5 us,  4.0 sy,  0.0 ni, 49.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :   3731.7 total,   1691.1 free,    727.7 used,   1312.9 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2654.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  40009 vdsm       0 -20 1972104 101404  30312 R  97.0   2.7   1:21.82 JsonRpc (StompR
  43097 nsoffer   20   0   65704   5428   4348 R   1.0   0.1   0:00.38 top

To:

top - 19:49:36 up 37 min,  3 users,  load average: 3.97, 2.35, 1.78
Threads: 419 total,  13 running, 406 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.8 us,  4.0 sy,  0.0 ni, 48.8 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :   3731.7 total,   1692.8 free,    725.9 used,   1313.0 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2655.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  40009 vdsm       0 -20 1972104 103348  30312 R  99.0   2.7   1:43.44 JsonRpc (StompR
  43097 nsoffer   20   0   65704   5428   4348 R   1.0   0.1   0:00.51 top

See attached top-single.log.

My guess: client close is not handled correctly - when the client sends a request and
then closes the socket, vdsm does not handle the close properly and continues to watch
the socket. The socket is still reported writable, and the reactor handles the write
event in a busy loop on a socket that was closed by the remote side.
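
A minimal, self-contained sketch of that suspected failure mode (plain sockets and select.poll, not vdsm's reactor code): once the peer closes its end, the socket is reported both readable (EOF) and writable on every poll, so a loop that keeps watching it without handling the close never blocks:

    import select
    import socket

    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    client.close()                       # like vdsm-client closing after one request

    poller = select.poll()
    poller.register(conn, select.POLLIN | select.POLLOUT)

    for i in range(5):
        events = poller.poll(1000)       # returns immediately on every iteration
        print(i, events)                 # the dead socket is always "ready"

    # The fix pattern: on EOF (recv() returning b'') close/unregister the
    # dispatcher so the reactor stops polling the dead socket.
    conn.close()
    server.close()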

Comment 8 Nir Soffer 2022-07-01 17:02:22 UTC
Created attachment 1893958 [details]
top output for single request

Comment 9 Nir Soffer 2022-07-01 19:30:18 UTC
The issue is easy to reproduce and fix without hosted engine.

Didi, if you want to test the fix you can use this build:
https://github.com/oVirt/vdsm/actions/runs/2598697662

Comment 10 Yedidyah Bar David 2022-07-04 13:39:16 UTC
(In reply to Nir Soffer from comment #9)
> The issue is easy to reproduce and fix without hosted engine.
> 
> Didi, if you want to test the fix you can use this build:
> https://github.com/oVirt/vdsm/actions/runs/2598697662

I see that the patch was already merged. Thanks.

Comment 11 Nir Soffer 2022-07-04 14:29:16 UTC
The fix should be available in the next 4.5.2 build.

Comment 13 Sandro Bonazzola 2022-08-30 08:49:07 UTC
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

