Bug 2026625 - _getHaInfo from vdsm can still get stuck if broker socket is not repsonding [NEEDINFO]
Summary: _getHaInfo from vdsm can still get stuck if broker socket is not repsonding
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.4.6
Hardware: All
OS: Linux
Target Milestone: ovirt-4.4.10
: ---
Assignee: Yedidyah Bar David
QA Contact: Petr Kubica
Depends On:
TreeView+ depends on / blocked
Reported: 2021-11-25 11:07 UTC by nijin ashok
Modified: 2022-02-08 10:08 UTC (History)
7 users (show)

Fixed In Version: ovirt-hosted-engine-ha-2.4.10-1
Doc Type: Bug Fix
Doc Text:
Previously, the timeout setting on the broker socket in the 'ovirt-hosted-engine-ha' library was ignored because the timeout was set after the connection was opened. This caused the VDSM threads to be blocked if the broker did not respond. Now the timeout setting is established before the connection is opened, resolving this issue.
Clone Of:
Last Closed: 2022-02-08 10:08:35 UTC
oVirt Team: Integration
Target Upstream Version:
didi: needinfo? (nashok)

Attachments (Terms of Use)

System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44080 0 None None None 2021-11-25 11:07:47 UTC
Red Hat Product Errata RHBA-2022:0462 0 None None None 2022-02-08 10:08:41 UTC
oVirt gerrit 117800 0 master MERGED Fix timeout in UnixXmlRpcHttpConnection - unixrpc.py 2021-11-30 13:24:24 UTC
oVirt gerrit 118069 0 ovirt-4.4 MERGED Fix timeout in UnixXmlRpcHttpConnection - unixrpc.py 2021-12-16 13:08:51 UTC

Description nijin ashok 2021-11-25 11:07:24 UTC
Description of problem:

The vdsm uses haClient library to get the stats of hosted-engine. The clients getting blocked on a hung broker was solved as per https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/102520. 

But timeout was set after the "connect" and hence calls were still getting blocked.

 72     def connect(self):
 73         self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
 74         self.sock.connect(base64.b16decode(self.host))
 75         if self.timeout is not None:             
 76             self.sock.settimeout(self.timeout)   <<<<<<<<<<

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. If the broker is not responding, the vdsm threads get blocked indefinitely.

Actual results:

 _getHaInfo from vdsm can still get stuck if the broker socket is not responding.

Comment 1 Yedidyah Bar David 2021-11-28 07:33:16 UTC
This looks like a duplicate of bug 1720747, bug 1745715. For a summary, see bug 1745715 comment 34.

Sadly, Simone got the patch for that bug wrong, and so didn't fix it, and we also didn't reproduce the bug and so didn't really verify the fix.

I hope patch 117800 is enough.

It seems like a reproduction/verification should involve:

1. Having many HE HA hosts (not sure what's the minimum, but more than the supported maximum of 7 according to e.g [1][2] - filed [3] to have them in oVirt docs)
2. Perhaps loaded/slow/unresponsive storage. "Perhaps", because (1.) might be enough to make decent storage  slow, depending on number of hosts.

Also filed bug 2027087 "[RFE] Warn the user on too many hosted-engine hosts".

[1] https://access.redhat.com/articles/906543
[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/planning_and_prerequisites_guide/index#self-hosted-engine-recommendations
[3] "Planning and Prerequisites Guide" is a link to RHV documentation

Comment 2 Yedidyah Bar David 2021-11-28 07:35:33 UTC
Many thanks to Jake Reynolds for patch 117800.

Comment 3 Yedidyah Bar David 2021-11-28 07:50:03 UTC
(In reply to Yedidyah Bar David from comment #1)
> 1. Having many HE HA hosts (not sure what's the minimum, but more than the
> supported maximum of 7 according to e.g [1][2] - filed [3] to have them in
> oVirt docs)
> [3] "Planning and Prerequisites Guide" is a link to RHV documentation


Comment 4 Nikolai Sednev 2021-12-06 08:19:42 UTC
So what should be the expected reproduction steps for this bug?

Comment 5 Yedidyah Bar David 2021-12-16 12:49:37 UTC
Reproduced/Verified current bug/patch by sending a SIGSTOP to ha-broker, which suspends it (but lives it running, so nothing tries to restart it etc.). It's slightly artificial/synthetic, but IMO good enough for current case and easy to reproduce. Also tried with [1] which in itself didn't reproduce.

1. Setup HE on two hosts
2. On the host not running the HE VM: Find the pid of ha-broker, kill -STOP $pid
3. Wait a bit (a few seconds/minutes)

With a broken version
e.g. what's in 4.4.9, ovirt-hosted-engine-ha-2.4.9SOMETHING

1. vdsm.log has:

2021-12-16 08:31:54,563+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7fcbb0373518> at 0x7fcbb03734e0> timeout=15, duration=15.01 at 0x7fcba5263dd8> task#=22 at 0x7fcba9b0d748>, traceback:
File: "/usr/lib64/python3.6/threading.py", line 905, in _bootstrap
File: "/usr/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
File: "/usr/lib64/python3.6/threading.py", line 885, in run
  self._target(*self._args, **self._kwargs)
File: "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 301, in _run
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 315, in _execute_task
File: "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 391, in __call__
File: "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 188, in __call__
File: "/usr/lib/python3.6/site-packages/vdsm/virt/sampling.py", line 490, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python3.6/site-packages/vdsm/host/api.py", line 75, in get_stats
  ret['haStats'] = _getHaInfo()
File: "/usr/lib/python3.6/site-packages/vdsm/host/api.py", line 156, in _getHaInfo
  stats = instance.get_all_stats(timeout=5)
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
  stats = broker.get_stats_from_storage()
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 148, in get_stats_from_storage
  result = self._proxy.get_stats()
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
  return self.__send(self.__name, args)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
  return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
  http_conn = self.send_request(host, handler, request_body, verbose)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
  self.send_content(connection, request_body)
File: "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
File: "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
  self._send_output(message_body, encode_chunked=encode_chunked)
File: "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
File: "/usr/lib64/python3.6/http/client.py", line 982, in send
File: "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
  self.sock.connect(base64.b16decode(self.host)) (executor:363)

And similar ones, every few seconds.

2. engine.log has:

2021-12-16 08:34:55,084+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ost-he-basic-suite-master-host-0.lago.local command Get Host Statistics failed: Message timeout which can be caused by communication issues

and similar, and eventually:

2021-12-16 08:57:50,550+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-16) [] EVENT_ID: VDS_FAILURE(12), Host ost-he-basic-suite-master-host-0.lago.local is non responsive.
2021-12-16 08:57:50,590+01 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-15) [] ResourceManager::vdsNotResponding entered for Host '12577d97-753f-4bec-8229-37bff4715a9b', 'ost-he-basic-suite-master-host-0.lago.local'
2021-12-16 08:58:18,018+01 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-15) [381b0faf] EVENT_ID: VDS_ALERT_NO_PM_CONFIG_FENCE_OPERATION_SKIPPED(9,028), Host ost-he-basic-suite-master-host-0.lago.local became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

3. Web UI shows it as not responsive.

If you then kill -CONT $pid, things automatically heal.

With a fixed version
Meaning, that includes both the fix for current bug and for bug 2025381. This is what's in current master (4.5). If we want to backport to 4.4, should backport both bugs/patches.

1. vdsm.log has:

2021-12-16 10:27:45,432+0100 WARN  (jsonrpc/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:168)

2. engine.log has nothing relevant (that I could find, with default logging options).

3. Web UI keeps showing the host as Up, but the grey crown disappears.

After sending kill -CONT $pid, things heal automatically - no such logs in vdsm, grey crown reappears.

[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/117833

Comment 6 Yedidyah Bar David 2021-12-16 12:59:04 UTC
Moving back to POST, need to cherry-pick the patch for 4.4.

Comment 10 Petr Kubica 2022-01-26 12:11:57 UTC
Verified in ovirt-hosted-engine-ha-2.4.10-1.el8ev.noarch

Tested with kill -STOP $pid (as per comment #5)

Compared versions:
ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch (without fix)
ovirt-hosted-engine-ha-2.4.10-1.el8ev.noarch (fixed)

- In both cases non-responsive host wasn't seen. 
- Fixed traceback in vdsm.log (without occurence in fixed version)

Comment 14 errata-xmlrpc 2022-02-08 10:08:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.10]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.