Description of problem:

After a loop of migrations that succeeded for 380 iterations, it failed, with this in vdsm.log:

2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2446)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
    return self._mountCon.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
    cgroup=cgroup)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
OSError: [Errno 24] Too many open files

Some more details in [1]. I am currently trying to reproduce along with parallel loops that log ls -l /proc/*/fd/*. Will attach logs if it happens again.

[1] https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/U353TBTVJIC7VXKXZ4WUWXLA5HFWZLWT/
It seems like supervdsm is leaking FDs, based on checking the number of open files over time. To get an idea of what code might be leaking them, I ran strace on it [1]. I picked at random one fd that seems to have never been closed (178), opened by pid 45415 (one of the children of 28956), and attach the result of [2].

[1] strace -s 4096 -tt -f -p 28956 -o /tmp/supervdsm-strace1.log
[2] grep -C1000 '^45415\>.*\<178\>' /tmp/supervdsm-strace1.log
Created attachment 1755893 [details] Partial strace
The "problem" seems like it might have been here:

45415 09:38:44.185890 pipe2([178, 179], O_CLOEXEC) = 0
45415 09:38:44.185916 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd935ffb9d0) = 580072
580072 09:38:44.186785 set_robust_list(0x7fd935ffb9e0, 24) = 0
45415 09:38:44.186935 close(179) = 0
45415 09:38:44.187083 getpid() = 28956
45415 09:38:44.187474 poll([{fd=44, events=POLLIN}], 1, 60000 <unfinished ...>

Meaning: we called pipe2, which opened two FDs, 178 and 179; we closed 179, but never closed 178. Why? Perhaps (later):

45415 09:38:44.201062 kill(580072, SIGTERM <unfinished ...>
580072 09:38:44.201074 <... read resumed>"\200\3X\3\0\0\0Byeq\0.", 13) = 13
45415 09:38:44.201079 <... kill resumed>) = 0
580072 09:38:44.201092 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=28956, si_uid=0} ---
45415 09:38:44.201097 wait4(580072, <unfinished ...>
580072 09:38:44.201109 getpid( <unfinished ...>
45415 09:38:44.201113 <... wait4 resumed>0x7fd935ff8d44, WNOHANG, NULL) = 0
580072 09:38:44.201152 <... getpid resumed>) = 580072
45415 09:38:44.201158 wait4(580072, <unfinished ...>
580072 09:38:44.201167 write(10, "\17", 1 <unfinished ...>
45415 09:38:44.201172 <... wait4 resumed>0x7fd935ff9014, WNOHANG, NULL) = 0
580072 09:38:44.201211 <... write resumed>) = 1
45415 09:38:44.201216 close(118 <unfinished ...>
28956 09:38:44.201220 <... poll resumed>) = 1 ([{fd=9, revents=POLLIN}])
580072 09:38:44.201259 rt_sigreturn({mask=[]} <unfinished ...>
45415 09:38:44.201266 <... close resumed>) = 0
28956 09:38:44.201270 read(9, <unfinished ...>
580072 09:38:44.201301 <... rt_sigreturn resumed>) = 13
45415 09:38:44.201306 close(44 <unfinished ...>
28956 09:38:44.201310 <... read resumed>"\17", 128) = 1
580072 09:38:44.201456 exit_group(0 <unfinished ...>
45415 09:38:44.201463 <... close resumed>) = 0
28956 09:38:44.201467 poll([{fd=9, events=POLLIN}], 1, -1 <unfinished ...>
580072 09:38:44.201790 <... exit_group resumed>) = ?
45415 09:38:44.201799 getpid( <unfinished ...>
580072 09:38:44.202672 +++ exited with 0 +++
45415 09:38:44.202688 <... getpid resumed>) = 28956
45415 09:38:44.202701 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=580072, si_uid=107, si_status=0, si_utime=0, si_stime=0} ---
28956 09:38:44.202708 <... poll resumed>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
45415 09:38:44.202721 getpid( <unfinished ...>
28956 09:38:44.202726 restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
45415 09:38:44.202735 <... getpid resumed>) = 28956

I do not know vdsm's code, but searching a bit I think this is the code doing this (in lib/vdsm/supervdsm_server.py):

    pipe, hisPipe = Pipe()
    with closing(pipe), closing(hisPipe):
        proc = Process(target=child, args=(hisPipe,))

Searching the net for 'python "with statement" with multiple does not call __exit__' finds me [1].

So perhaps it's because we get the SIGCHLD and this, together with [1], causes us to not call __exit__ on hisPipe.

There are quite a few guesses here. Not sure if this is at all relevant.

[1] https://bugs.python.org/issue29988
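For illustration only, here is a minimal sketch (my own code, not vdsm's; the helper name run_child and its shape are invented) of closing both ends of a multiprocessing Pipe with an explicit try/finally rather than with closing(...), so the cleanup does not rely on the with-statement machinery that [1] describes as interruptible by signals:

    from multiprocessing import Pipe, Process

    def run_child(target):
        # Hypothetical helper, not vdsm code: run "target" in a child process
        # and return whatever it sends back over the pipe.
        ours, theirs = Pipe()
        try:
            proc = Process(target=target, args=(theirs,))
            proc.start()
            return ours.recv()
        finally:
            # Close both ends unconditionally; Connection.close() is safe to
            # call even if the connection is already closed.
            theirs.close()
            ours.close()
            # Waiting for the child is a separate concern; see the later
            # comments about removing zombiereaper.

This is only a sketch of the cleanup pattern under discussion, not a claim that it fixes the leak.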
(In reply to Yedidyah Bar David from comment #3)

Thanks for this research!

> I do not know vdsm's code, but searching a bit I think this is the code
> doing this (in lib/vdsm/supervdsm_server.py):
> 
>     pipe, hisPipe = Pipe()
>     with closing(pipe), closing(hisPipe):
>         proc = Process(target=child, args=(hisPipe,))
> 
> Searching the net for 'python "with statement" with multiple does not call
> __exit__' finds me [1].
> 
> So perhaps it's because we get the SIGCHLD and this, together with [1],
> causes us to not call __exit__ on hisPipe.
> 
> There are quite a few guesses here. Not sure if this is at all relevant.
> 
> [1] https://bugs.python.org/issue29988

The issue seems to exist even with a single with statement. This is a very bad Python bug; we assume that "with" works and rely on it everywhere, like most Python code.

The main issue in vdsm and supervdsm is that we use SIGCHLD to reap child processes asynchronously, which means any code in vdsm can be interrupted by signals during normal flows. Typically servers accept only termination signals, so these races are not a practical problem: a server must be able to cope with abnormal termination, so bad signal handling during shutdown is not a practical issue.

We use async reaping in these cases. The first is the FC rescan:

def _rescan():
    """
    Called from supervdsm to perform rescan as root.
    """
    timeout = config.getint('irs', 'scsi_rescan_maximal_timeout')

    p = commands.start(
        [constants.EXT_FC_SCAN],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        zombiereaper.autoReapPID(p.pid)
        raise Error("Timeout scanning (pid=%s)" % p.pid)

    if p.returncode != 0:
        raise Error("Scan failed with rc=%s out=%r err=%r"
                    % (p.returncode, out, err))

If we don't use that, a blocked fc-scan can block storage code for an unlimited time, since the operation takes a global storage lock. But I don't think it is a good idea to leave a blocked scan running and try to do other scans, since the other scan will block anyway in the kernel.

I think this should be solved by moving this code to a single thread that performs scans serially and waits until the current scan ends. If we have a single thread doing scans, it will be busy waiting for the child. Callers will wait on a threading.Event() with a timeout and will have the same semantics (scan timed out). We will have a new condition - a scan that fails because the scanner thread is busy. This is similar in semantics to a new scan blocked in the kernel, so it can be handled in the same way.

The second user is the iSCSI session rescan:

def session_rescan(timeout=None):
    # Note: keeping old behaviour of taking the module lock while starting the
    # command, and releasing the lock while the scan command is running. This
    # looks like a bug since the purpose of the lock is preventing concurrent
    # iscsiadm commands, but taking a lock for the entire operation may cause
    # bigger issues.

    args = [constants.EXT_ISCSIADM, "-m", "session", "-R"]

    with _iscsiadmLock:
        p = commands.start(
            args,
            sudo=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)

    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        zombiereaper.autoReapPID(p.pid)
        raise IscsiSessionRescanTimeout(p.pid, timeout)

Same issue as fc-scan; the same solution can be applied, a single thread doing these scans (see the sketch below).
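To make the single-scanner-thread idea above concrete, here is a rough sketch (my own code, with invented names such as SerialScanner and its error messages; not a vdsm patch): one worker thread runs scans serially and always reaps its child, callers wait on a threading.Event with a timeout, and a request arriving while another scan is already queued is rejected, similar in spirit to a scan blocked in the kernel.

    import queue
    import subprocess
    import threading

    class SerialScanner:
        """Run scan commands serially in one worker thread that always
        reaps its child, so no SIGCHLD-based reaping is needed."""

        def __init__(self, cmd):
            self._cmd = cmd
            self._requests = queue.Queue(maxsize=1)
            threading.Thread(target=self._worker, daemon=True).start()

        def scan(self, timeout):
            done = threading.Event()
            try:
                # At most one request may wait in the queue; further callers
                # are rejected instead of piling up.
                self._requests.put_nowait(done)
            except queue.Full:
                raise RuntimeError("scanner is busy")
            if not done.wait(timeout):
                # The caller gives up, but the worker keeps waiting for the
                # child, so no zombie and no leaked pipes are left behind.
                raise RuntimeError("scan timed out")

        def _worker(self):
            while True:
                done = self._requests.get()
                try:
                    subprocess.run(
                        self._cmd,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
                finally:
                    done.set()

A real implementation would also need to report the scan's return code and output back to the caller; this sketch only shows the threading shape (e.g. something like SerialScanner([constants.EXT_FC_SCAN]).scan(timeout) for the fc-scan flow above).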
The next user is the virt-v2v abort flow:

def _abort(self):
    self._aborted = True
    if self._proc is None:
        logging.warning(
            'Ignoring request to abort job %r; the job failed to start',
            self._id)
        return
    if self._proc.returncode is None:
        logging.debug('Job %r killing virt-v2v process', self._id)
        try:
            self._proc.kill()
        except OSError as e:
            if e.errno != errno.ESRCH:
                raise
            logging.debug('Job %r virt-v2v process not running',
                          self._id)
        else:
            logging.debug('Job %r virt-v2v process was killed',
                          self._id)
        finally:
            for pid in self._proc.pids:
                zombiereaper.autoReapPID(pid)

This may leave a zombie virt-v2v process accessing storage. Certainly not a good idea. I think this usage should be removed, and we should just wait for the child process.

The last user is supervdsm's _runAs:

def _runAs(self, user, groups, func, args=(), kwargs={}):
    def child(pipe):
        res = ex = None
        try:
            uid = resolveUid(user)
            if groups:
                gids = [resolveGid(g) for g in groups]

                os.setgid(gids[0])
                os.setgroups(gids)
            os.setuid(uid)

            res = func(*args, **kwargs)
        except BaseException as e:
            ex = e

        pipe.send((res, ex))
        pipe.recv()

    pipe, hisPipe = Pipe()
    with closing(pipe), closing(hisPipe):
        proc = Process(target=child, args=(hisPipe,))
        proc.start()

        needReaping = True
        try:
            if not safe_poll(pipe, RUN_AS_TIMEOUT):
                try:
                    os.kill(proc.pid, signal.SIGKILL)
                except OSError as e:
                    # Don't add to zombiereaper of PID no longer exists
                    if e.errno == errno.ESRCH:
                        needReaping = False
                    else:
                        raise

                raise Timeout()

            res, err = pipe.recv()
            pipe.send("Bye")
            proc.terminate()

            if err is not None:
                raise err

            return res

        finally:
            # Add to zombiereaper if process has not been waited upon
            if proc.exitcode is None and needReaping:
                zombiereaper.autoReapPID(proc.pid)

Here again there is no good reason to leave a zombie around; we should wait for the child (see the sketch after this comment).

If we remove the 4 places using zombiereaper, we can stop registering SIGCHLD, and this should avoid the fd leaks and remove some complexity from vdsm. However, this is not a trivial change for the storage code performing scans.

Since 3 uses of zombiereaper are related to storage, we can move this bug to storage. The virt-v2v case should be handled by the virt team.

Regardless of the vdsm issues, I think we need a RHEL bug - if this is solved upstream in python 3.9, we would like a backport in python 3.6.

Marcin, what do you think?
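Coming back to the _runAs flow quoted above, here is a sketch of the "just wait for the child" direction (an illustration only, not the actual patch; run_as_sketch, the timeout value, and the error handling are invented, and it assumes the child sends a (result, error) pair and then exits, instead of waiting for a "Bye"): the parent kills the child on timeout and then joins it unconditionally, so nothing is ever handed to zombiereaper.

    import os
    import signal
    from multiprocessing import Pipe, Process

    RUN_AS_TIMEOUT = 60  # placeholder value for the sketch

    def run_as_sketch(child_target, timeout=RUN_AS_TIMEOUT):
        ours, theirs = Pipe()
        proc = Process(target=child_target, args=(theirs,))
        proc.start()
        try:
            if not ours.poll(timeout):
                try:
                    os.kill(proc.pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # the child already exited on its own
                raise TimeoutError("child did not reply in time")
            res, err = ours.recv()
            if err is not None:
                raise err
            return res
        finally:
            # Always reap the child synchronously instead of relying on a
            # SIGCHLD handler; the child is assumed to exit after sending.
            proc.join()
            theirs.close()
            ours.close()

The intent is the same as the quoted _runAs, minus the zombiereaper fallback.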
I posted a quick minimal fix - can you try it with your reproducer? Build should be available here: https://jenkins.ovirt.org/job/vdsm_standard-check-patch/26535/
Summary of what we found so far.

The tested flow moves the host running the hosted engine to local maintenance mode and back to normal mode. In this flow, we move the host to maintenance, which causes all running VMs to migrate to the other host.

According to Didi:

> > I suppose normally this should not happen more than once every few
> > months, for the HE VM. I suspect the fd leak might affect also
> > non-HE-VM migrations, which likely happen more often.

This is not a common flow, so if we have a small fd leak during this flow it will take a lot of time to reach this state.

When supervdsmd exceeds the maximum number of open files, it is basically disabled, since most of what it does is running commands, and we need 4 file descriptors to run any command. I don't think the system can recover from this state. I'll open another bug to track this issue. We have a simple workaround if we reach this state: restart the vdsmd and supervdsmd services.

My theory was that the reason for the leak is a SIGCHLD signal received during cleanup, causing file descriptors to leak. The reason we get such signals is that we register a signal handler and reap terminated processes that were not waited for in some flows.

What we tested:

- https://gerrit.ovirt.org/c/vdsm/+/113395
  Use try-finally instead of with to clean up file descriptors in the validateAccess flow.
  This change does not seem to have any effect.

- https://gerrit.ovirt.org/c/vdsm/+/113401
  Disable zombiereaper so vdsm and supervdsm do not get a SIGCHLD signal for every child process.
  This seems to solve the leak, but the patch is not correct as is; more work is needed to do this properly.

How to reproduce the issue:

On a host (or VM) set up for running OST:

export OST_IMAGES_DISTRO=el8
. lagofy.sh he-basic-suite-master
while true; do
    run_tc he-basic-suite-master/test-scenarios/012_local_maintenance_sdk_pytest.py
done

To watch file descriptors we add this script on both hosts:

# cat count-supervdsm-fds.sh
while true; do
    pid=$(pidof supervdsmd)
    fds=$(ls -1 /proc/$pid/fd | wc -l)
    echo "$(date) pid=$pid fds=$fds"
    sleep 10
done

And run it:

# nohup sh count-supervdsm-fds.sh > count-supervdsm-fds.log &

Didi also collected output of ls -lh /proc/*/fd to files.

The test was started around Tue Feb 9 14:55.
On Tue Feb 9 19:54 we upgraded vdsm on host 1 to a version disabling zombiereaper (https://gerrit.ovirt.org/c/vdsm/+/113401).
On Tue Feb 9 21:56 the test started to fail. According to Didi, the way this test is written it will continue to fail on the next runs, so the tests effectively ended at that point.

On host 0 we see:

[root@lago-he-basic-suite-master-host-0 ~]# awk 'NR % 360 == 1' count-supervdsm-fds.log
Tue Feb 9 14:55:06 CET 2021 pid=754314 fds=120
Tue Feb 9 15:55:16 CET 2021 pid=754314 fds=199
Tue Feb 9 16:55:27 CET 2021 pid=754314 fds=272
Tue Feb 9 17:55:37 CET 2021 pid=754314 fds=335
Tue Feb 9 18:55:48 CET 2021 pid=754314 fds=401
(Upgraded vdsm on other host here)
Tue Feb 9 19:55:58 CET 2021 pid=754314 fds=468
Tue Feb 9 20:56:08 CET 2021 pid=754314 fds=539
Tue Feb 9 21:56:18 CET 2021 pid=754314 fds=550
(Test started to fail here)
Tue Feb 9 22:56:28 CET 2021 pid=754314 fds=550
...
Wed Feb 10 11:58:17 CET 2021 pid=754314 fds=550

On host 1 we see:

[root@lago-he-basic-suite-master-host-1 ~]# awk 'NR % 360 == 1' count-supervdsm-fds.log
Tue Feb 9 15:04:27 CET 2021 pid=741766 fds=130
Tue Feb 9 16:04:37 CET 2021 pid=741766 fds=205
Tue Feb 9 17:04:47 CET 2021 pid=741766 fds=277
Tue Feb 9 18:04:58 CET 2021 pid=741766 fds=354
Tue Feb 9 19:05:08 CET 2021 pid=741766 fds=427
(Upgraded vdsm here)
Tue Feb 9 20:05:18 CET 2021 pid=1006247 fds=30
Tue Feb 9 21:05:28 CET 2021 pid=1006247 fds=28
(Test started to fail here)
Tue Feb 9 22:05:38 CET 2021 pid=1006247 fds=29
...
Wed Feb 10 11:07:42 CET 2021 pid=1006247 fds=29

So clearly disabling zombiereaper fixes this leak.

What kind of file descriptor do we leak?

Host 0:

Tue Feb 9 14:55:06 CET 2021 pid=754314 fds=120

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-145507.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     95 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb 9 17:55:37 CET 2021 pid=754314 fds=335

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-175531.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    309 pipe
     17 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb 9 21:56:18 CET 2021 pid=754314 fds=550

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-215617.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    525 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Host 1:

Tue Feb 9 15:04:27 CET 2021 pid=741766 fds=130

# awk '/741766/ {print $11}' ls-l-proc-fd-20210209-150429.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    105 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb 9 19:05:08 CET 2021 pid=741766 fds=427

# awk '/741766/ {print $11}' ls-l-proc-fd-20210209-190501.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    401 pipe
     17 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb 9 20:05:18 CET 2021 pid=1006247 fds=30

# awk '/1006247/ {print $11}' ls-l-proc-fd-20210209-200510.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     17 socket
      4 pipe
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb 9 21:05:28 CET 2021 pid=1006247 fds=28

# awk '/1006247/ {print $11}' ls-l-proc-fd-20210209-215612.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     17 socket
      3 pipe
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

So we leak only pipes. This is likely from code running commands: we create 2 pipes for every command, one for stdout and one for stderr. My guess is that receiving a signal while running a command causes cleanup code in vdsm or in the python subprocess module to fail and leave file descriptors open.

To understand this leak better, we can connect to supervdsm with strace and check what happens when syscalls fail with EINTR.
I'm not sure it is worth the effort, since removing zombiereaper seems to eliminate the root cause of this issue, and makes vdsm more robust.

In summary, we have:

- A pipe leak, caused by using zombiereaper to wait for some commands. Should be fixed by ripping zombiereaper out of vdsm.

- supervdsm does not recover from the "Too many open files" error. It can recover from this state by terminating; systemd will restart the service. Will be tracked in another bug.
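As a small aside for anyone reproducing this: the per-type fd counts gathered above with awk can also be collected with a short Python helper (a convenience script written for this investigation, not part of vdsm), which may be handy when verifying a fix:

    import collections
    import os
    import sys

    def fd_histogram(pid):
        """Count the open fds of a process by target type (pipe, socket, path)."""
        counts = collections.Counter()
        fd_dir = "/proc/%s/fd" % pid
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # the fd was closed while we were scanning
            # "pipe:[12345]" -> "pipe", "socket:[678]" -> "socket",
            # plain paths are kept as-is, matching the awk pipeline above.
            counts[target.split(":")[0]] += 1
        return counts

    if __name__ == "__main__":
        for kind, count in fd_histogram(sys.argv[1]).most_common():
            print("%6d %s" % (count, kind))

Run it as root with the supervdsmd pid as the only argument.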
Tal, this is a bug in supervdsm, caused by using zombiereaper, which is maintained by the infra team. But the reason we have zombiereaper is these 4 flows:

- fc scan
- iscsi scan
- validating access
- running virt-v2v

To fix this issue we need to remove the usage of zombiereaper from the storage code and the virt code, and then remove zombiereaper itself from vdsm. Most of the work is storage related. We can file separate bugs for the virt-v2v and infra work.
I don't think we need separate bugs for each team; let's just cooperate and sync on this single bug. It's worth delivering the fix at once and not in parts.
(In reply to Martin Perina from comment #8)
> I don't think we need separate bugs for each team; let's just cooperate and
> sync on this single bug. It's worth delivering the fix at once and not in
> parts.

I posted patches for fixing both the storage and virt issues:
https://gerrit.ovirt.org/q/topic:kill-zombiereaper+is:open

I will need reviews from Milan and Marcin for the relevant parts.

This needs more testing, but I think we can plan it for 4.4.6.
We are past 4.4.7 dev freeze and the bug is not marked as blocker, moving out to 4.4.8
Hi,

In a thread on the oVirt mailing list, Nir requested that I share my experience with vdsm that may be related to this same bug (c/f https://lists.ovirt.org/archives/list/users@ovirt.org/thread/YVD2IDBACJX4CILGSCW77WZEWAB2TLNX/#DR7CKPUEHQZSUMFSW3HSJLHE5V5PUIBY).

My environment is a 3-node HCI cluster with Gluster storage, but right now, Gluster is only running on 1 host (so no replication). I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.

I should probably mention that I think I may have screwed up the lvm when I rebuilt the RAID array, as I didn't specify the inode size in `mkfs.xfs -i` ... so the new volumes have the default inode size (256), and not what they should be (512). To correct that, I guess I'll specify the correct inode size on the 2nd host, and then once I finish getting the storage upgraded on my 2nd host, I'll go back to the first, and re-build it again.

I'm running on RHEL 8.4, but I don't have a support subscription (even though I don't have a support subscription, I do see the ability to open a support ticket, so am happy to do so if that would be helpful at all).

As Nir suggested, I restarted the vdsmd service on all 3 hosts. That appears to have fixed the "storage" issue, and I now am able to run "hosted-engine --vm-status" and see some expected stdout. Unfortunately, my hosted-engine is still not starting (although I can now clearly connect to the hosted-engine storage), and I see this output every time I try to start the hosted-engine:

[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch

Prior to restarting vdsm on all 3 hosts, I generated an sosreport on two of my hosts, and am happy to provide that for further troubleshooting. In at least one of these cases, I noticed the following in stdout when I generated the sosreport:

[plugin:vdsm] command 'vdsm-client Host getConnectedStoragePools' timed out after 300s
[plugin:vdsm] command 'vdsm-client Host getStorageDomains' timed out after 300s

I see this in /var/log/vdsm/vdsm.log:

> 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
>   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
>   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
>   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
>   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> OSError: [Errno 24] Too many open files

The storage is mounted, but here's what I see:

[root@cha2-storage dwhite]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
[root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
 Main PID: 3591872 (ovirt-ha-agent)
    Tasks: 1 (limit: 409676)
   Memory: 21.5M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

Aug 13 11:10:51 cha2-storage.mgt.example.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.

Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to ctrl-c out of it. Maybe I have to be patient? When I ctrl-c, I get a traceback error:

[root@cha2-storage dwhite]# hosted-engine --console
^CTraceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
[root@cha2-storage dwhite]#     args.command(args)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
    cli = ohautil.connect_vdsm_json_rpc()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
    timeout=timeout)
  File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
    outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
    nr_retries, reconnect_interval)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
    client = StompClient(utils.create_connected_socket(host, port, sslctx),
  File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
    sock.connect((host, port))
  File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
    self._real_connect(addr, False)
  File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
    self.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
    self._sslobj.do_handshake()

This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:

MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
    self._backend.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
    sserver.connect_storage_server()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
    'Connection to storage server failed'
RuntimeError: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors

AFTER I generated the sosreports, I ran:

systemctl restart vdsmd

on all 3 hosts.
I found this stdout really interesting on 1 of the hosts when I then followed it up by running systemctl status vdsmd:

ERROR FINISH thread <Thread(event/1, started daemon 139673041090304)> failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/clientIF.py", line 256, in contEIOVms
    vmObj.maybe_resume()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1636, in maybe_resume
    self.cont()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1711, in cont
    self._underlyingCont()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4130, in _underlyingCont
    self._dom.resume()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2370, in resume
    raise libvirtError('virDomainResume() failed')
libvirt.libvirtError: Failed to acquire lock: No space left on device

Let me know how I can help. I'll keep these sosreports for safe keeping.
Sorry to spam the Bugzilla list a 2nd time, but it appears that I just needed to be a little bit patient after restarting vdsmd on all 3 of my hosts. I can confirm that the hosted-engine is now operational and healthy.
Didi, do you have an easy way to test this fix, maybe running your OST test with current vdsm master?
Ran this over the weekend:

while true; do ost_run_tc he-basic-suite-master/test-scenarios/test_012_local_maintenance_sdk.py; echo 'Test finished, sleeping'; sleep 10; done

And it has finished 518 loops so far.

[root@ost-he-basic-suite-master-host-0 ~]# ps auxww | grep '[s]upervdsm'
root 76939 0.2 0.7 1498284 40028 ? S<sl Nov02 40:57 /usr/bin/python3 /usr/share/vdsm/supervdsmd --sockfile /run/vdsm/svdsm.sock

[root@ost-he-basic-suite-master-host-0 ~]# ls -l /proc/76939/fd/* | wc -l
38

[root@ost-he-basic-suite-master-host-0 ~]# rpm -q vdsm
vdsm-4.50.0.1-24.git1969ab99c.el8.x86_64

This is slightly old - 1969ab99c was merged Oct 26 - because I reused an existing setup and didn't bother upgrading first - but it seems to be at least newer than all the patches linked from the current bug, so probably ok.

Looks good to me. Thanks!
(In reply to Yedidyah Bar David from comment #14)

Thanks for testing! I think we can mark this verified, based on your tests.
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.