1666553 – Possible fd leak in ha broker

Bug 1666553 - Possible fd leak in ha broker

Summary: Possible fd leak in ha broker

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-hosted-engine-ha
Sub Component:
Version:	4.2.7
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-4.4.6-1
Target Release:	---
Assignee:	Asaf Rachmani
QA Contact:	meital avital
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-01-16 02:16 UTC by Germano Veit Michel
Modified:	2022-02-23 14:41 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-03 07:16:57 UTC
oVirt Team:	Integration
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3813491	0	None	None	None	2019-01-17 06:48:20 UTC

Description Germano Veit Michel 2019-01-16 02:16:01 UTC

Description of problem:

1. Broker starts:

MainThread::INFO::2019-01-15 13:37:16,086::broker::45::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.2.16 started

2. mail server is not available:

Listener::ERROR::2019-01-15 13:38:57,995::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 113] No route to host
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 113] No route to host

3. Cannot talk to vdsm (this VDSM is in trouble, see additional notes).

Thread-3::ERROR::2019-01-15 13:44:23,776::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor mem-free, args {'use_ssl': 'true', 'address': '0'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mem_free.py", line 40, in action
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't  connect to VDSM within 60 seconds

4. Not sure what is opening so many files, but shortly after....

Thread-1::ERROR::2019-01-15 13:49:10,866::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor ping, args {'addr': '<removed>'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 51, in action
    if self._ping():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 71, in _ping
    stdout=devnull, stderr=devnull)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1216, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files

Even reading from storage fails:

StatusStorageThread::ERROR::2019-01-15 13:50:52,808::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats
OSError: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75'

StatusStorageThread::ERROR::2019-01-15 13:50:52,808::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
RequestError: failed to read metadata: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75'

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
vdsm-4.20.39.1-1.el7ev.x86_64

How reproducible:
Have not tried yet.

Additional info:
The vdsm cannot connect may be related to this https://bugzilla.redhat.com/show_bug.cgi?id=1666123, its the same environment.

Comment 1 Germano Veit Michel 2019-01-16 02:22:05 UTC

Hmmm, look at this...

Broker is leaking or creating too many connections to VDSM?

ovirt-ha- 3031 36 127.0.0.1:44464->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44492->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44496->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44500->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44504->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44508->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44512->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44516->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44520->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44524->127.0.0.1:54321 (ESTABLISHED)
.....

$ grep ovirt-ha- lsof  | grep TCP | awk -F ' ' '{print $1,$2,$3,$9,$10}' | sort | uniq | wc -l
2296

$ grep 3031 ps
vdsm      3031  1.1  0.0 25329220 210672 ?     Ssl  13:37   0:10 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Comment 4 Martin Sivák 2019-01-16 08:42:36 UTC

The broker uses the jsonrpc library provided by vdsm that should handle connection management behind the scene. We have had this kind of bugs before. See for example https://bugzilla.redhat.com/show_bug.cgi?id=1343005

Comment 5 Ryan Barry 2019-01-17 00:56:16 UTC

This appears to be a slow leak

A look at the code shows

submonitor_base defines a default interval of 10 seconds

submonitors.ping tries to ping 5 times with a 2 sec timeout and a .5 sec delay, which leaves us, basically, with a 2.5 second overlap in monitor threads if it's unreachable.

submonitors.ping should either define a different interval (12.5 or 13, or example), or the entire submonitor logic should use a semaphore to ensure that only one is actually active at the same time.

Comment 7 Daniel Gur 2019-08-28 13:13:33 UTC

sync2jira

Comment 8 Daniel Gur 2019-08-28 13:17:46 UTC

sync2jira

Comment 9 Michal Skrivanek 2020-03-17 12:16:50 UTC

deprecating SLA team usage, moving to Virt

Comment 10 Asaf Rachmani 2021-05-03 07:16:57 UTC

Unable to reproduce, possibly fixed as part of bug 1666123.

Note You need to log in before you can comment on or make changes to this bug.