Bug 1666553 - Possible fd leak in ha broker
Summary: Possible fd leak in ha broker
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.2.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.6-1
Assignee: Asaf Rachmani
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-16 02:16 UTC by Germano Veit Michel
Modified: 2022-02-23 14:41 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-03 07:16:57 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3813491 0 None None None 2019-01-17 06:48:20 UTC

Description Germano Veit Michel 2019-01-16 02:16:01 UTC
Description of problem:

1. Broker starts:

MainThread::INFO::2019-01-15 13:37:16,086::broker::45::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.2.16 started

2. mail server is not available:

Listener::ERROR::2019-01-15 13:38:57,995::notifications::39::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) [Errno 113] No route to host
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 26, in send_email
    timeout=float(cfg["smtp-timeout"]))
  File "/usr/lib64/python2.7/smtplib.py", line 255, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib64/python2.7/smtplib.py", line 315, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/usr/lib64/python2.7/smtplib.py", line 290, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 113] No route to host
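
A connection that fails inside smtplib's constructor should not leak by itself (socket.create_connection closes the socket before re-raising), but repeated notification attempts against an unreachable mail server are still worth ruling out. A minimal sketch, assuming a hypothetical send path that builds smtplib.SMTP directly (this is not the actual notifications.py code), of how to guarantee the socket is released on every attempt:

import smtplib

def send_notification(host, port, timeout, sender, recipients, message):
    # Hypothetical helper: the try/finally guarantees the SMTP socket is
    # closed even when sending fails mid-session, so a flaky mail server
    # cannot leak one fd per notification attempt.
    server = smtplib.SMTP(host, port, timeout=timeout)
    try:
        server.sendmail(sender, recipients, message)
    finally:
        try:
            server.quit()       # polite shutdown; also closes the socket
        except Exception:
            server.close()      # fall back to a hard close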

3. Cannot talk to vdsm (this VDSM is in trouble, see additional notes).

Thread-3::ERROR::2019-01-15 13:44:23,776::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor mem-free, args {'use_ssl': 'true', 'address': '0'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mem_free.py", line 40, in action
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't  connect to VDSM within 60 seconds
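
If each failed connect attempt creates a new client without disposing of the previous one, the retry loop itself becomes the leak, which would match the pile of ESTABLISHED sockets to 127.0.0.1:54321 shown later in comment 1. A hedged sketch of the pattern (the real connect_vdsm_json_rpc logic differs; create_client and ping here are stand-ins):

import time

def connect_with_retry(create_client, retries=6, delay=10):
    # Hypothetical illustration, not the ovirt_hosted_engine_ha code:
    # dispose of every half-built client before retrying, otherwise each
    # retry can leave an ESTABLISHED socket behind.
    client = None
    for _ in range(retries):
        try:
            client = create_client()
            client.ping()                 # stand-in liveness check
            return client
        except Exception:
            if client is not None:
                client.close()            # release the fd before retrying
                client = None
            time.sleep(delay)
    raise RuntimeError("Couldn't connect to VDSM within %d seconds"
                       % (retries * delay))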

4. Not sure what is opening so many files, but shortly after:

Thread-1::ERROR::2019-01-15 13:49:10,866::submonitor_base::119::ovirt_hosted_engine_ha.broker.submonitor_base.SubmonitorBase::(_worker) Error executing submonitor ping, args {'addr': '<removed>'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 51, in action
    if self._ping():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/ping.py", line 71, in _ping
    stdout=devnull, stderr=devnull)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1216, in _execute_child
    errpipe_read, errpipe_write = self.pipe_cloexec()
  File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files
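
For reference, the ping submonitor's subprocess call itself looks fd-clean; the traceback fails at os.pipe() only because the process has already exhausted its fd limit by that point. A rough reconstruction based on the traceback (the real ping.py differs in its details):

import os
import subprocess

def ping_once(addr, timeout=2):
    # Discard ping's output; the with-block closes devnull even on error.
    with open(os.devnull, "w") as devnull:
        rc = subprocess.call(["ping", "-c", "1", "-W", str(timeout), addr],
                             stdout=devnull, stderr=devnull)
    return rc == 0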

Even reading from storage fails:

StatusStorageThread::ERROR::2019-01-15 13:50:52,808::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats
OSError: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75'

StatusStorageThread::ERROR::2019-01-15 13:50:52,808::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
RequestError: failed to read metadata: [Errno 24] Too many open files: '/var/run/vdsm/storage/b07315d6-0f94-4aff-8f0f-d3228cba8981/3d578c08-1b15-48fd-9211-37b000c4bb2c/fee59bf9-41d8-4baa-8d4e-bf95c0607c75'

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
vdsm-4.20.39.1-1.el7ev.x86_64

How reproducible:
Have not tried yet.

Additional info:
The VDSM connection failure may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1666123; it's the same environment.

Comment 1 Germano Veit Michel 2019-01-16 02:22:05 UTC
Hmmm, look at this...

Is the broker leaking or creating too many connections to VDSM?

ovirt-ha- 3031 36 127.0.0.1:44464->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44492->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44496->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44500->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44504->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44508->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44512->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44516->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44520->127.0.0.1:54321 (ESTABLISHED)
ovirt-ha- 3031 36 127.0.0.1:44524->127.0.0.1:54321 (ESTABLISHED)
.....

$ grep ovirt-ha- lsof  | grep TCP | awk -F ' ' '{print $1,$2,$3,$9,$10}' | sort | uniq | wc -l
2296

$ grep 3031 ps
vdsm      3031  1.1  0.0 25329220 210672 ?     Ssl  13:37   0:10 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
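
A quick way to confirm from the host that the growth is sockets rather than regular files (assumption: Linux with /proc mounted; fd_summary is just a throwaway diagnostic, not part of the broker):

import collections
import os

def fd_summary(pid):
    # Classify each fd of the process by the target of its /proc symlink.
    counts = collections.Counter()
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue                       # fd closed while we were looking
        counts[target.split(":")[0]] += 1  # e.g. "socket", "pipe", a file path
    return counts

print(fd_summary(3031))                    # PID of ovirt-ha-broker above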

Comment 4 Martin Sivák 2019-01-16 08:42:36 UTC
The broker uses the jsonrpc library provided by VDSM, which should handle connection management behind the scenes. We have had this kind of bug before. See for example https://bugzilla.redhat.com/show_bug.cgi?id=1343005

Comment 5 Ryan Barry 2019-01-17 00:56:16 UTC
This appears to be a slow leak.

A look at the code shows:

submonitor_base defines a default interval of 10 seconds.

submonitors.ping tries to ping 5 times with a 2-second timeout and a 0.5-second delay, which adds up to 12.5 seconds per run and leaves us, basically, with a 2.5-second overlap in monitor threads if the target is unreachable.

submonitors.ping should either define a different interval (12.5 or 13, for example), or the entire submonitor logic should use a semaphore to ensure that only one instance is actually active at a time (see the sketch below).
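
A minimal sketch of the semaphore option (this is not the actual submonitor_base code, and the scheduler that calls tick every interval is elided): a non-blocking acquire makes an overlapping tick skip instead of stacking a second worker.

import threading

class GuardedSubmonitor(object):
    def __init__(self, action, interval=10):
        self._action = action
        self._interval = interval
        self._busy = threading.Semaphore(1)

    def tick(self):
        # Called every `interval` seconds by the (elided) scheduler.
        if not self._busy.acquire(False):   # previous run still active
            return                          # skip this tick instead of piling up
        try:
            self._action()                  # e.g. the 12.5-second worst-case ping
        finally:
            self._busy.release()

Bumping the interval past the 12.5-second worst case would also work, but the guard stays correct if the ping count or timeouts ever change.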

Comment 7 Daniel Gur 2019-08-28 13:13:33 UTC
sync2jira

Comment 8 Daniel Gur 2019-08-28 13:17:46 UTC
sync2jira

Comment 9 Michal Skrivanek 2020-03-17 12:16:50 UTC
deprecating SLA team usage, moving to Virt

Comment 10 Asaf Rachmani 2021-05-03 07:16:57 UTC
Unable to reproduce, possibly fixed as part of bug 1666123.

