Description of problem:

The vdsm uses the haClient library to get the stats of the hosted engine. The issue of clients getting blocked on a hung broker was addressed in https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/102520, but the timeout is set only after the "connect" call, so calls can still get blocked:

~~~
ovirt_hosted_engine_ha/lib/unixrpc.py

 72     def connect(self):
 73         self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
 74         self.sock.connect(base64.b16decode(self.host))
 75         if self.timeout is not None:
 76             self.sock.settimeout(self.timeout)    <<<<<<<<<<
~~~

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.7-1.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. If the broker is not responding, the vdsm threads get blocked indefinitely.
2.
3.

Actual results:
_getHaInfo from vdsm can still get stuck if the broker socket is not responding.
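For illustration, here is a minimal sketch of the direction of the fix (this is not the actual patch; connect_with_timeout and encoded_path are placeholder names): the timeout is applied to the socket before connect(), so the connection attempt itself can time out instead of blocking the calling vdsm thread.

~~~
import base64
import socket

def connect_with_timeout(encoded_path, timeout=5.0):
    """Connect to the broker's UNIX socket without risking an indefinite block."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    if timeout is not None:
        # Set *before* connect(): an unresponsive listener now makes
        # connect() (and any later recv/send) raise socket.timeout.
        sock.settimeout(timeout)
    sock.connect(base64.b16decode(encoded_path))
    return sock
~~~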
This looks like a duplicate of bug 1720747 and bug 1745715. For a summary, see bug 1745715 comment 34.

Sadly, Simone got the patch for that bug wrong, so it didn't fix it, and we also didn't reproduce the bug and so didn't really verify the fix. I hope patch 117800 is enough.

It seems like a reproduction/verification should involve:

1. Having many HE HA hosts (not sure what the minimum is, but more than the supported maximum of 7 according to e.g. [1][2] - filed [3] to have them in oVirt docs)
2. Perhaps loaded/slow/unresponsive storage. "Perhaps", because (1.) might be enough to make decent storage slow, depending on the number of hosts.

Also filed bug 2027087 "[RFE] Warn the user on too many hosted-engine hosts".

[1] https://access.redhat.com/articles/906543
[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/planning_and_prerequisites_guide/index#self-hosted-engine-recommendations
[3] "Planning and Prerequisites Guide" is a link to RHV documentation
Many thanks to Jake Reynolds for patch 117800.
(In reply to Yedidyah Bar David from comment #1)
> 1. Having many HE HA hosts (not sure what's the minimum, but more than the
> supported maximum of 7 according to e.g [1][2] - filed [3] to have them in
> oVirt docs)
>
> [3] "Planning and Prerequisites Guide" is a link to RHV documentation

https://github.com/oVirt/ovirt-site/issues/2641
So what should be the expected reproduction steps for this bug?
Reproduced/verified the current bug/patch by sending a SIGSTOP to ha-broker, which suspends it (but leaves it running, so nothing tries to restart it, etc.). It's slightly artificial/synthetic, but IMO good enough for the current case and easy to reproduce. Also tried with [1], which by itself didn't reproduce the bug.

Flow:
1. Set up HE on two hosts.
2. On the host not running the HE VM: find the pid of ha-broker, kill -STOP $pid
3. Wait a bit (a few seconds/minutes).

With a broken version
=====================
e.g. what's in 4.4.9, ovirt-hosted-engine-ha-2.4.9SOMETHING

1. vdsm.log has:

2021-12-16 08:31:54,563+0100 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7fcbb0373518> at 0x7fcbb03734e0> timeout=15, duration=15.01 at 0x7fcba5263dd8> task#=22 at 0x7fcba9b0d748>, traceback:
File: "/usr/lib64/python3.6/threading.py", line 905, in _bootstrap
  self._bootstrap_inner()
File: "/usr/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
  self.run()
File: "/usr/lib64/python3.6/threading.py", line 885, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 188, in __call__
  self._func()
File: "/usr/lib/python3.6/site-packages/vdsm/virt/sampling.py", line 490, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python3.6/site-packages/vdsm/host/api.py", line 75, in get_stats
  ret['haStats'] = _getHaInfo()
File: "/usr/lib/python3.6/site-packages/vdsm/host/api.py", line 156, in _getHaInfo
  stats = instance.get_all_stats(timeout=5)
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
  stats = broker.get_stats_from_storage()
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 148, in get_stats_from_storage
  result = self._proxy.get_stats()
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
  return self.__send(self.__name, args)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
  verbose=self.__verbose
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
  return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
  http_conn = self.send_request(host, handler, request_body, verbose)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
  self.send_content(connection, request_body)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
  connection.endheaders(request_body)
File: "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
  self._send_output(message_body, encode_chunked=encode_chunked)
File: "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
  self.send(msg)
File: "/usr/lib64/python3.6/http/client.py", line 982, in send
  self.connect()
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
  self.sock.connect(base64.b16decode(self.host)) (executor:363)

And similar ones, every few seconds.

2. engine.log has:

2021-12-16 08:34:55,084+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ost-he-basic-suite-master-host-0.lago.local command Get Host Statistics failed: Message timeout which can be caused by communication issues

and similar, and eventually:

2021-12-16 08:57:50,550+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-16) [] EVENT_ID: VDS_FAILURE(12), Host ost-he-basic-suite-master-host-0.lago.local is non responsive.
2021-12-16 08:57:50,590+01 INFO [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-15) [] ResourceManager::vdsNotResponding entered for Host '12577d97-753f-4bec-8229-37bff4715a9b', 'ost-he-basic-suite-master-host-0.lago.local'
...
2021-12-16 08:58:18,018+01 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-15) [381b0faf] EVENT_ID: VDS_ALERT_NO_PM_CONFIG_FENCE_OPERATION_SKIPPED(9,028), Host ost-he-basic-suite-master-host-0.lago.local became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

3. The Web UI shows the host as not responsive.

If you then kill -CONT $pid, things automatically heal.

With a fixed version
====================
Meaning, one that includes both the fix for the current bug and the fix for bug 2025381. This is what's in current master (4.5). If we want to backport to 4.4, we should backport both bugs/patches.

1. vdsm.log has:

2021-12-16 10:27:45,432+0100 WARN (jsonrpc/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:168)

2. engine.log has nothing relevant (that I could find, with default logging options).

3. The Web UI keeps showing the host as Up, but the grey crown disappears.

After sending kill -CONT $pid, things heal automatically - no such logs in vdsm, and the grey crown reappears.

[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/117833
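For anyone who wants to see the effect of the timeout placement without a full HE deployment, here is a small self-contained sketch (plain Python, not the oVirt/vdsm code; sock_path and silent_accept are made-up names): a UNIX-socket "broker" that accepts a connection but never answers, standing in for a SIGSTOP'ed ha-broker. Because the client sets its timeout before connect(), the request fails fast with socket.timeout instead of blocking the calling thread indefinitely.

~~~
import os
import socket
import tempfile
import threading

sock_path = os.path.join(tempfile.mkdtemp(), "broker.sock")

# "Broker": listens and accepts, but never reads or replies.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)

def silent_accept():
    conn, _ = server.accept()
    threading.Event().wait()  # hang forever, like a suspended process

threading.Thread(target=silent_accept, daemon=True).start()

# "Client": the timeout is set before connect(), as in the fix.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.settimeout(2)
client.connect(sock_path)
client.sendall(b"get_stats\n")
try:
    client.recv(4096)
except socket.timeout:
    print("request timed out instead of hanging")
~~~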
Moving back to POST, need to cherry-pick the patch for 4.4.
Verified in ovirt-hosted-engine-ha-2.4.10-1.el8ev.noarch

Tested with kill -STOP $pid (as per comment #5).

Compared versions: ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch (without the fix) vs. ovirt-hosted-engine-ha-2.4.10-1.el8ev.noarch (fixed)

- In both cases a non-responsive host was not seen.
- The vdsm.log traceback is fixed (it no longer occurs in the fixed version).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0462