Description of problem: 1. Broker starts: MainThread::INFO::2019-01-15 13:37:16,086::broker::45::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.2.16 started 2. mail server is not available: Listener::ERROR::2019-01-15 13:38:57,995::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 113] No route to host Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email timeout=float(cfg["smtp-timeout"])) File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__ (code, msg) = self.connect(host, port) File "/usr/lib64/python2.7/smtplib.py", line 315, in connect self.sock = self._get_socket(host, port, self.timeout) File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket return socket.create_connection((host, port), timeout) File "/usr/lib64/python2.7/socket.py", line 571, in create_connection raise err error: [Errno 113] No route to host 3. Cannot talk to vdsm (this VDSM is in trouble, see additional notes). Thread-3::ERROR::2019-01-15 13:44:23,776::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor mem-free, args {'use_ssl': 'true', 'address': '0'} Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker self.action(self._options) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mem_free.py", line 40, in action logger=self._log File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc __vdsm_json_rpc_connect(logger, timeout) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect timeout=VDSM_MAX_RETRY * VDSM_DELAY RuntimeError: Couldn't connect to VDSM within 60 seconds 4. Not sure what is opening so many files, but shortly after.... Thread-1::ERROR::2019-01-15 13:49:10,866::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor ping, args {'addr': '<removed>'} Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker self.action(self._options) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 51, in action if self._ping(): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 71, in _ping stdout=devnull, stderr=devnull) File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__ errread, errwrite) File "/usr/lib64/python2.7/subprocess.py", line 1216, in _execute_child errpipe_read, errpipe_write = self.pipe_cloexec() File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec r, w = os.pipe() OSError: [Errno 24] Too many open files Even reading from storage fails: StatusStorageThread::ERROR::2019-01-15 13:50:52,808::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats OSError: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75' StatusStorageThread::ERROR::2019-01-15 13:50:52,808::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state. Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats RequestError: failed to read metadata: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75' Version-Release number of selected component (if applicable): ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch vdsm-4.20.39.1-1.el7ev.x86_64 How reproducible: Have not tried yet. Additional info: The vdsm cannot connect may be related to this https://bugzilla.redhat.com/show_bug.cgi?id=1666123, its the same environment.
Hmmm, look at this... Broker is leaking or creating too many connections to VDSM? ovirt-ha- 3031 36 127.0.0.1:44464->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44492->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44496->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44500->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44504->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44508->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44512->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44516->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44520->127.0.0.1:54321 (ESTABLISHED) ovirt-ha- 3031 36 127.0.0.1:44524->127.0.0.1:54321 (ESTABLISHED) ..... $ grep ovirt-ha- lsof | grep TCP | awk -F ' ' '{print $1,$2,$3,$9,$10}' | sort | uniq | wc -l 2296 $ grep 3031 ps vdsm 3031 1.1 0.0 25329220 210672 ? Ssl 13:37 0:10 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
The broker uses the jsonrpc library provided by vdsm that should handle connection management behind the scene. We have had this kind of bugs before. See for example https://bugzilla.redhat.com/show_bug.cgi?id=1343005
This appears to be a slow leak A look at the code shows submonitor_base defines a default interval of 10 seconds submonitors.ping tries to ping 5 times with a 2 sec timeout and a .5 sec delay, which leaves us, basically, with a 2.5 second overlap in monitor threads if it's unreachable. submonitors.ping should either define a different interval (12.5 or 13, or example), or the entire submonitor logic should use a semaphore to ensure that only one is actually active at the same time.
sync2jira
deprecating SLA team usage, moving to Virt
Unable to reproduce, possibly fixed as part of bug 1666123.