Created attachment 1893642 [details]
vdsm-ha-logs.tar.gz

Description of problem:

When starting the HE VM with 'hosted-engine --vm-start', vdsmd immediately jumps from using 2-3% CPU to 99-100% for many minutes, even after the VM is up and idle. Eventually it goes down.

Similarly, when shutting down the VM, vdsmd again jumps to 99-100% CPU and remains high for many minutes, also after the VM is down.

Not sure these are the only flows, or whether it's related to HE, but these are the ones I noticed.

Version-Release number of selected component (if applicable):
Current master, not sure when it started

How reproducible:
Not sure, saw it many times

Steps to Reproduce:
1. Deploy hosted-engine and monitor vdsmd's CPU usage
2. Wait until vdsmd is (almost) idle, at 2-3% CPU
3. Shut down the engine VM (e.g. by logging into it and running 'poweroff')
4. Wait until vdsmd is idle
5. hosted-engine --vm-start

Actual results:
vdsmd's CPU usage remains high for a very long time after starting/stopping the HE VM.

Expected results:
vdsmd's CPU usage remains low, or at least goes down much faster.

Additional info:
Attaching relevant logs. The vdsm and HA logs were taken with log level set to DEBUG; top.log is 'top -b > top.log'.
Based on top.log, the events are:

10:14:43 high cpu
10:28:17 normal cpu
11:43:59 high cpu
12:00:16 normal cpu

Which APIs are called during this time?

First combine the vdsm logs:

    cat vdsm.log.1 vdsm.log > vdsm-combined.log

And create 2 logs, one from 10:14:43 to 10:28:17 (vdsm-high.log) and one from 10:28:17 to 11:43:59 (vdsm-normal.log).

Health logs:

$ grep '\[health\] user=' vdsm-high.log
2022-06-30 10:15:57,398+0000 DEBUG (health) [health] user=68.56%, sys=4.76%, rss=107364 kB (+108), threads=57 (health:126)
2022-06-30 10:20:57,434+0000 DEBUG (health) [health] user=93.28%, sys=5.88%, rss=107400 kB (+36), threads=57 (health:126)
2022-06-30 10:25:57,466+0000 DEBUG (health) [health] user=93.28%, sys=5.88%, rss=107480 kB (+80), threads=57 (health:126)

$ grep '\[health\] user=' vdsm-normal.log
2022-06-30 10:30:57,514+0000 DEBUG (health) [health] user=43.54%, sys=3.49%, rss=107500 kB (+20), threads=57 (health:126)
2022-06-30 10:35:57,569+0000 DEBUG (health) [health] user=2.97%, sys=1.47%, rss=107532 kB (+32), threads=57 (health:126)
2022-06-30 10:40:57,623+0000 DEBUG (health) [health] user=2.86%, sys=1.50%, rss=107588 kB (+56), threads=57 (health:126)
2022-06-30 10:45:57,673+0000 DEBUG (health) [health] user=2.96%, sys=1.47%, rss=107640 kB (+52), threads=57 (health:126)
2022-06-30 10:50:57,717+0000 DEBUG (health) [health] user=2.95%, sys=1.52%, rss=107644 kB (+4), threads=57 (health:126)
2022-06-30 10:55:57,764+0000 DEBUG (health) [health] user=3.09%, sys=1.55%, rss=107648 kB (+4), threads=57 (health:126)
2022-06-30 11:00:57,834+0000 DEBUG (health) [health] user=3.10%, sys=1.55%, rss=107668 kB (+20), threads=57 (health:126)
2022-06-30 11:05:57,888+0000 DEBUG (health) [health] user=3.09%, sys=1.55%, rss=107672 kB (+4), threads=57 (health:126)
2022-06-30 11:10:57,956+0000 DEBUG (health) [health] user=3.08%, sys=1.54%, rss=107680 kB (+8), threads=57 (health:126)
2022-06-30 11:15:58,012+0000 DEBUG (health) [health] user=3.00%, sys=1.51%, rss=107688 kB (+8), threads=57 (health:126)
2022-06-30 11:20:58,063+0000 DEBUG (health) [health] user=3.02%, sys=1.45%, rss=107696 kB (+8), threads=57 (health:126)
2022-06-30 11:25:58,104+0000 DEBUG (health) [health] user=3.02%, sys=1.48%, rss=107732 kB (+36), threads=57 (health:126)
2022-06-30 11:30:58,166+0000 DEBUG (health) [health] user=3.04%, sys=1.54%, rss=107744 kB (+12), threads=57 (health:126)
2022-06-30 11:35:58,206+0000 DEBUG (health) [health] user=10.07%, sys=1.73%, rss=107920 kB (+176), threads=57 (health:126)
2022-06-30 11:40:58,271+0000 DEBUG (health) [health] user=3.15%, sys=1.58%, rss=109680 kB (+1760), threads=57 (health:126)

API calls when cpu is high (10:14:43 to 10:28:17, 13 minutes):

$ awk '/\[jsonrpc.JsonRpcServer\] Calling/ {print $7}' vdsm-high.log | sort | uniq -c | sort -rn
    544 'Host.ping2'
    163 'Host.getStorageRepoStats'
    163 'Host.getStats'
     96 'VM.getStats'
     54 'Host.getAllVmStats'

API calls when cpu is normal (10:28:17 to 11:43:59, 76 minutes):

$ awk '/\[jsonrpc.JsonRpcServer\] Calling/ {print $7}' vdsm-normal.log | sort | uniq -c | sort -rn
   3025 'Host.ping2'
    939 'Host.getStats'
    906 'Host.getStorageRepoStats'
    530 'VM.getStats'
    332 'Host.getAllVmStats'
     34 'Host.getAllVmIoTunePolicies'
      1 'VM.destroy'
      1 'VM.create'
      1 'StoragePool.getSpmStatus'
      1 'StoragePool.connect'
      1 'Host.setKsmTune'
      1 'Host.hostdevListByCaps'
      1 'Host.getHardwareInfo'
      1 'Host.getCapabilities'
      1 'Host.dumpxmls'

The request distribution is similar, so this is not caused by client activity. However, some calls are unexpected:

- Host.ping2 called every 1.5 seconds - why?
  This API should be used to check connectivity with vdsm, and there is no reason to check connectivity every 1.5 seconds.
- Host.getStorageRepoStats called every 4.7 seconds - why?
  Vdsm expects this call only once every 15 seconds.
- Host.getStats called every 5 seconds - why?

We will have to reproduce this and maybe profile vdsm.
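If it helps to double-check these intervals later, something like this rough sketch should work (assuming the 'Calling' log lines look like the awk input above; the file name is just the split log created earlier):

import re
from collections import defaultdict
from datetime import datetime

# Assumed line shape, based on the grep/awk output above:
# 2022-06-30 10:15:57,398+0000 ... [jsonrpc.JsonRpcServer] Calling 'Host.ping2' ...
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\+\d+ .*"
    r"\[jsonrpc\.JsonRpcServer\] Calling '([^']+)'")

calls = defaultdict(list)
with open("vdsm-high.log") as f:
    for line in f:
        m = PATTERN.match(line)
        if m:
            calls[m.group(2)].append(
                datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))

for method, times in sorted(calls.items()):
    if len(times) > 1:
        span = (times[-1] - times[0]).total_seconds()
        print("%s: %d calls, average interval %.1fs"
              % (method, len(times), span / (len(times) - 1)))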
Created attachment 1893929 [details]
vdsm profile file

I have been able to reproduce and profile the issue. Attached the *.prof file.

Vdsm configuration for the profiling:

# cat /etc/vdsm/vdsm.conf
[vars]
ssl = true

[addresses]
management_port = 54321

[devel]
cpu_profile_enable = true
cpu_profile_filename = /run/vdsm/vdsmd.prof
cpu_profile_format = pstat
cpu_profile_clock = cpu
Some interesting inspections of the profile file above.

- Compact print of the most expensive functions

$ contrib/profile-stats -c ~/Work/vdsmd.prof | head -30
Fri Jul  1 16:01:15 2022    /home/aesteve/Work/vdsmd.prof

         435235626 function calls (435453317 primitive calls) in 713.979 seconds

   Ordered by: internal time
   List reduced from 3523 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 37566669  106.104    0.000  260.103    0.000  stomp.py:483(AsyncDispatcher.next_check_interval)
 45078444/45078455  95.478  0.000  456.348  0.000  betterAsyncore.py:171(Dispatcher._delegate_call)
 3758837/3758839  76.148  0.000  640.156  0.000  asyncore.py:164(poll2)
 37566669   45.854    0.000   98.010    0.000  stomp.py:465(AsyncDispatcher._outgoing_heartbeat_expiration_interval)
  3758839   42.142    0.000  234.540    0.000  betterAsyncore.py:254(Reactor._get_timeout)
 18784108   36.588    0.000  194.046    0.000  stomp.py:522(AsyncDispatcher.writable)
 37592173   35.063    0.000   52.220    0.000  time.py:45(monotonic_time)
 18784259   30.633    0.000  170.400    0.000  betterAsyncore.py:89(Dispatcher.next_check_interval)
  7508178   22.216    0.000   68.581    0.000  ssl.py:825(SSLSocket.read)
 18784259   19.919    0.000   86.992    0.000  betterAsyncore.py:65(Dispatcher.readable)
 18784108   19.650    0.000   27.484    0.000  stompserver.py:75(StompAdapterImpl.has_outgoing_messages)
 18784259   18.641    0.000  268.995    0.000  betterAsyncore.py:68(Dispatcher.writable)
 37566669   17.775    0.000   17.775    0.000  stomp.py:471(AsyncDispatcher._incoming_heartbeat_expiration_interval)
 37566669   17.620    0.000   17.620    0.000  stomp.py:477(AsyncDispatcher._reconnect_expiration_interval)
  7508167   17.015    0.000  113.872    0.000  stomp.py:405(AsyncDispatcher.handle_read)
  7511526   14.452    0.000  171.257    0.000  asyncore.py:105(readwrite)
  7508188   12.744    0.000   81.325    0.000  sslutils.py:42(SSLSocket.read)
  7508178   10.337    0.000   91.661    0.000  betterAsyncore.py:117(Dispatcher.recv)
  7508178    9.762    0.000   43.128    0.000  ssl.py:583(SSLObject.read)
  7509707    9.665    0.000  156.392    0.000  asyncore.py:413(AsyncoreEvent.handle_read_event)

- Compact print of the most expensive functions sorted by cumtime

$ contrib/profile-stats -c -scumtime ~/Work/vdsmd.prof | head -30
Fri Jul  1 16:01:15 2022    /home/aesteve/Work/vdsmd.prof

         435235626 function calls (435453317 primitive calls) in 713.979 seconds

   Ordered by: cumulative time
   List reduced from 3523 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 3758837/3758839   8.758  0.000  650.918  0.000  asyncore.py:192(loop)
 3758837/3758839  76.148  0.000  640.156  0.000  asyncore.py:164(poll2)
 45078444/45078455  95.478  0.000  456.348  0.000  betterAsyncore.py:171(Dispatcher._delegate_call)
 18784259   18.641    0.000  268.995    0.000  betterAsyncore.py:68(Dispatcher.writable)
 37566669  106.104    0.000  260.103    0.000  stomp.py:483(AsyncDispatcher.next_check_interval)
  3758839   42.142    0.000  234.540    0.000  betterAsyncore.py:254(Reactor._get_timeout)
 18784108   36.588    0.000  194.046    0.000  stomp.py:522(AsyncDispatcher.writable)
  7511526   14.452    0.000  171.257    0.000  asyncore.py:105(readwrite)
 18784259   30.633    0.000  170.400    0.000  betterAsyncore.py:89(Dispatcher.next_check_interval)
  7509707    9.665    0.000  156.392    0.000  asyncore.py:413(AsyncoreEvent.handle_read_event)
 7508194/7508204   8.136  0.000  146.745  0.000  betterAsyncore.py:71(Dispatcher.handle_read)
  7508167   17.015    0.000  113.872    0.000  stomp.py:405(AsyncDispatcher.handle_read)
 37566669   45.854    0.000   98.010    0.000  stomp.py:465(AsyncDispatcher._outgoing_heartbeat_expiration_interval)
  7508178   10.337    0.000   91.661    0.000  betterAsyncore.py:117(Dispatcher.recv)
 18784259   19.919    0.000   86.992    0.000  betterAsyncore.py:65(Dispatcher.readable)
  7508188   12.744    0.000   81.325    0.000  sslutils.py:42(SSLSocket.read)
  7508178   22.216    0.000   68.581    0.000  ssl.py:825(SSLSocket.read)
 37592173   35.063    0.000   52.220    0.000  time.py:45(monotonic_time)
  7508178    9.762    0.000   43.128    0.000  ssl.py:583(SSLObject.read)
 18784108   19.650    0.000   27.484    0.000  stompserver.py:75(StompAdapterImpl.has_outgoing_messages)

- Detail of the top functions ordered by cumtime:

   Ordered by: cumulative time
   List reduced from 3523 to 20 due to restriction <20>

Function                                            was called by...
                                                        ncalls  tottime  cumtime
asyncore.py:192(loop)                               <-  3758839/3758837    8.758  650.918  betterAsyncore.py:239(Reactor.process_requests)
asyncore.py:164(poll2)                              <-  3758839/3758837   76.148  640.156  asyncore.py:192(loop)
betterAsyncore.py:171(Dispatcher._delegate_call)    <-  6/5                 0.000    0.002  betterAsyncore.py:53(Dispatcher.handle_close)
                                                        11                  0.000    0.008  betterAsyncore.py:56(Dispatcher.handle_accept)
                                                        18784259           39.601   67.072  betterAsyncore.py:65(Dispatcher.readable)
                                                        18784259           38.866  250.355  betterAsyncore.py:68(Dispatcher.writable)
                                                        7508204/7508194    17.003  138.610  betterAsyncore.py:71(Dispatcher.handle_read)
                                                        1716                0.008    0.302  betterAsyncore.py:74(Dispatcher.handle_write)
betterAsyncore.py:68(Dispatcher.writable)           <-  18784259           18.641  268.995  asyncore.py:164(poll2)
stomp.py:483(AsyncDispatcher.next_check_interval)   <-  18784108           52.906  130.132  betterAsyncore.py:89(Dispatcher.next_check_interval)
                                                        18782561           53.198  129.972  stomp.py:522(AsyncDispatcher.writable)
betterAsyncore.py:254(Reactor._get_timeout)         <-  3758839            42.142  234.540  betterAsyncore.py:239(Reactor.process_requests)
stomp.py:522(AsyncDispatcher.writable)              <-  18784108           36.588  194.046  betterAsyncore.py:171(Dispatcher._delegate_call)
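For reference, roughly the same views can be produced with the standard-library pstats module, without the vdsm source tree (a minimal sketch; the profile path is just an example for wherever cpu_profile_filename points):

import pstats

# Path is an example; use the file configured via cpu_profile_filename,
# e.g. /run/vdsm/vdsmd.prof copied off the host.
stats = pstats.Stats("vdsmd.prof")

# Top 20 functions by internal time, then by cumulative time,
# similar to the two listings above.
stats.sort_stats("time").print_stats(20)
stats.sort_stats("cumulative").print_stats(20)

# Who calls the hottest stomp function - the equivalent of the
# "was called by..." detail above.
stats.print_callers("next_check_interval")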
So it looks like vdsm gets stuck in a busy loop within the `asyncore.loop` call. I do not understand this part of the code in depth, so I am not sure exactly why this is happening.

Some notes from my side:

- This was not 100% reproducible in my case. It does not happen every time; sometimes I needed a few tries to hit it. Sometimes I hit it when shutting the VM down (shutdown already takes a long time), sometimes I need to invoke `hosted-engine --vm-start` to trigger the issue.

- There are a few modules using the send() call from the Dispatcher in betterAsyncore.py:

  lib/yajsonrpc/__init__.py:220:       def sendReply(self):
  lib/yajsonrpc/betterAsyncore.py:145: def send(self, data):
  lib/yajsonrpc/jsonrpcclient.py:173:  def send_notification(message):
  lib/yajsonrpc/stomp.py:613:          def send_raw(self, msg):
  lib/yajsonrpc/stompclient.py:169:    def send(self, destination, data="", headers=None):
  lib/yajsonrpc/stompclient.py:292:    def send(self, message, destination=SUBSCRIPTION_ID_RESPONSE,
  lib/yajsonrpc/stompclient.py:342:    def send(self, data, destination=None, flow_id=None):
  lib/yajsonrpc/stompserver.py:306:    def send(self, message, destination=stomp.SUBSCRIPTION_ID_RESPONSE):
  lib/yajsonrpc/stompserver.py:443:    def send(self, data):

Adding a DEBUG log to these calls could help find from which thread they are called, and probably find a pattern there.

@msobczyk could you please take a look?
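A rough sketch of such a debug aid (not an actual vdsm patch; the logger name is arbitrary) - calling it at the top of each send() listed above should show which thread keeps feeding the reactor, and from where:

import logging
import threading
import traceback

log = logging.getLogger("vds.dispatcher.debug")


def log_send_caller():
    # Record the sending thread and a short stack, so repeated sends
    # from the same place stand out in the debug log.
    log.debug("send() called from thread %s:\n%s",
              threading.current_thread().name,
              "".join(traceback.format_stack(limit=6)))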
I found an easy way to reproduce this without hosted engine:

    for i in $(seq 20); do sudo vdsm-client Host getStorageRepoStats; sleep 5; done

Vdsm is using 100% cpu constantly while this runs:

top - 19:28:23 up 16 min,  3 users,  load average: 3.77, 2.74, 1.56
Threads: 422 total,  15 running, 407 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.0 us,  3.9 sy,  0.0 ni, 46.4 id,  0.1 wa,  0.4 hi,  0.1 si,  0.0 st
MiB Mem :   3731.7 total,   1303.3 free,   1254.8 used,   1173.6 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2130.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4672 vdsm       0 -20 1956740 106552  30248 R  97.8   2.8   3:50.64 JsonRpc (StompR
   6400 vdsm       0 -20 1956740 106552  30248 S   0.6   2.8   0:09.90 mailbox-spm
   6995 qemu      20   0 1988076 561448  23592 S   0.6  14.7   0:25.41 CPU 0/KVM
   5984 vdsm       0 -20 1956740 106552  30248 S   0.4   2.8   0:00.68 monitor/90789a0
   5087 vdsm      20   0  561656  32576  10360 S   0.4   0.9   0:01.88 momd

Attaching top.log generated by:

    top -bHi -d5 > top.log
Created attachment 1893945 [details]
top output while reproducing
It is actually even simpler - we can reproduce with a single API call:

    sudo vdsm-client Host getStorageRepoStats

This causes vdsm to consume 100% cpu for 22 seconds.

Running:

    top -bHi -d1 > top-single.log

Vdsm consumed ~100% cpu from:

top - 19:49:14 up 37 min,  3 users,  load average: 1.79, 1.86, 1.61
Threads: 423 total,   2 running, 421 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.5 us,  4.0 sy,  0.0 ni, 49.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :   3731.7 total,   1691.1 free,    727.7 used,   1312.9 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2654.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  40009 vdsm       0 -20 1972104 101404  30312 R  97.0   2.7   1:21.82 JsonRpc (StompR
  43097 nsoffer   20   0   65704   5428   4348 R   1.0   0.1   0:00.38 top

To:

top - 19:49:36 up 37 min,  3 users,  load average: 3.97, 2.35, 1.78
Threads: 419 total,  13 running, 406 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.8 us,  4.0 sy,  0.0 ni, 48.8 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :   3731.7 total,   1692.8 free,    725.9 used,   1313.0 buff/cache
MiB Swap:   2116.0 total,   2116.0 free,      0.0 used.   2655.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  40009 vdsm       0 -20 1972104 103348  30312 R  99.0   2.7   1:43.44 JsonRpc (StompR
  43097 nsoffer   20   0   65704   5428   4348 R   1.0   0.1   0:00.51 top

See attached top-single.log.

My guess: client close is not handled correctly - when a client sends a request and closes the socket, vdsm does not handle the close correctly and continues to watch the socket. The socket is writable, and the reactor tries to handle the write event in a busy loop on a socket that was closed by the remote side.
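To illustrate the guess with plain asyncore (a stripped-down sketch, not vdsm's Reactor; it assumes a Python version that still ships asyncore, which was removed in 3.12): a dispatcher that keeps reporting itself writable makes the poll loop return immediately on every iteration, so if the channel for a connection the client already closed is never removed from the socket map, the reactor spins at ~100% CPU doing nothing.

import asyncore
import socket


class LeakyDispatcher(asyncore.dispatcher):
    """Keeps claiming it wants to write, even with nothing to send."""

    def readable(self):
        return False

    def writable(self):
        # Suspected bug pattern: the socket stays registered for write
        # events after the remote side closed the connection.
        return True

    def handle_write(self):
        # Nothing to send, so every poll cycle is wasted work.
        pass


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    LeakyDispatcher(server_side)
    client_side.close()  # the client goes away after its single request
    # A connected socket is almost always ready for writing, so this
    # loop spins as fast as poll allows; count only bounds the demo.
    asyncore.loop(timeout=30, use_poll=True, count=100000)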
Created attachment 1893958 [details]
top output for single request
The issue is easy to reproduce and fix without hosted engine.

Didi, if you want to test the fix you can use this build:
https://github.com/oVirt/vdsm/actions/runs/2598697662
(In reply to Nir Soffer from comment #9)
> The issue is easy to reproduce and fix without hosted engine.
>
> Didi, if you want to test the fix you can use this build:
> https://github.com/oVirt/vdsm/actions/runs/2598697662

I see that the patch was already merged. Thanks.
The fix should be available in the next 4.5.2 build.
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.