Bug 1286850 - Too many open files
Summary: Too many open files
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 1.3.2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.1
Target Release: 1.3.3
Assignee: Martin Sivák
QA Contact: Shira Maximov
URL:
Whiteboard: sla
Depends On:
Blocks:
Reported: 2015-11-30 21:43 UTC by matt.deboer
Modified: 2016-02-10 19:19 UTC (History)

Fixed In Version: ovirt-hosted-engine-ha-1.3.3
Clone Of:
Environment:
Last Closed: 2016-01-13 14:37:53 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-3.6.z+
mgoldboi: planning_ack+
msivak: devel_ack+
mavital: testing_ack+


Attachments
var log messages (4.11 KB, text/plain)
2015-11-30 21:43 UTC, matt.deboer
broker.log gzipped (692.43 KB, application/x-gzip)
2015-12-03 15:15 UTC, matt.deboer
agent.log gzipped (131.23 KB, application/x-gzip)
2015-12-03 15:16 UTC, matt.deboer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 49078 0 None None None Never

Description matt.deboer 2015-11-30 21:43:32 UTC
Created attachment 1100642 [details]
var log messages

Description of problem:
Errors saying "too many open files" keep occurring. See the attached log. This repeats continuously until the engine eventually reboots the host (it did so at least once).

Version-Release number of selected component (if applicable):
1.3.2.1

How reproducible:
Happens after the host has been running for a while.

Steps to Reproduce:
1.
2.
3.

Actual results:
errors in log and reboot of host

Expected results:
no errors, no reboot

Additional info:
Let me know if you would like more information; I am not sure what else I need to provide. I know this can probably be worked around with a ulimit change, but it seems the open files build up over time to the point where the host can't handle it anymore.

Comment 1 Martin Sivák 2015-12-01 09:33:33 UTC
Does it help to restart the ovirt-ha-broker and ovirt-ha-agent services? Or, as a second step, to restart vdsmd?

Comment 2 matt.deboer 2015-12-01 14:29:40 UTC
Restarting the broker and agent services stopped the errors.  What would a long term fix for this be?

Comment 3 Martin Sivák 2015-12-01 14:47:45 UTC
Well there seems to be a descriptor leak somewhere. Can you check the open files of the agent and broker using lsof -p <pid>? Is there anything that repeats too much?
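
The check asked for above can also be scripted. This is a minimal sketch, assuming a Linux host where /proc/<pid>/fd is readable (as root or as the process owner); count_open_targets is a hypothetical helper name and is not part of ovirt-hosted-engine-ha. It groups a process's open descriptors by target so that anything that repeats too much stands out:

```python
import os
from collections import Counter


def count_open_targets(pid):
    """Count how many descriptors of process `pid` point at each target.

    Hypothetical helper: reads the symlinks in /proc/<pid>/fd (Linux only).
    """
    fd_dir = "/proc/%d/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[target] += 1
    return counts
```

Calling count_open_targets(pid).most_common(5) with the broker's pid would list the five most repeated targets, which is roughly what a manual scan of the lsof -p <pid> output looks for.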

Comment 4 matt.deboer 2015-12-01 14:59:18 UTC
right now none of the hosts are in error.  

All the hosts have /dev/null open 3 times (/dev/null was one of the files it reported failing to open because of too many open files).

/var/run/ovirt-hosted-engine-ha/broker.socket is open 2 or 3 times on each host.


all the other files are only once.


When one of the hosts goes into error again I will try to post which files it has open.

Comment 5 matt.deboer 2015-12-03 14:40:38 UTC
the hosted engine image is open 1016 times.

ovirt-ha- 1338 vdsm 1023r   REG               0,33   1028096 12961270410858168845 /rhev/data-center/mnt/glusterSD/hyp1:_hosted__engine/97c2d887-6f7d-4287-8231-8967d6f02f67/images/820f8645-c538-4613-a240-7585b140829e/6cc5b697-e6bc-4bc2-9194-9f243c6df593


all other files are ok.

I have 2 hosts that did this overnight. I had to reset one because I need the system running, but I left the other in error in case you need more information.

Comment 6 Martin Sivák 2015-12-03 14:49:04 UTC
Hi Matt, thanks for the information. Can you please attach the full agent and broker log? Also who exactly holds the descriptors? The broker?

Comment 7 matt.deboer 2015-12-03 15:15:51 UTC
Created attachment 1101829 [details]
broker.log gzipped

Comment 8 matt.deboer 2015-12-03 15:16:33 UTC
Created attachment 1101830 [details]
agent.log gzipped

Comment 9 matt.deboer 2015-12-03 15:18:36 UTC
The broker process holds the descriptors.


The agent process descriptors look normal.

Unfortunately it appears the agent stopped logging when the broker went into error, but I have uploaded everything I have.

Comment 10 Martin Sivák 2015-12-04 09:18:07 UTC
OK, I found the error. This is a pretty serious mistake. I still do not know what triggers it, but the code (1.3.2.1) in ovirt_hosted_engine_ha/broker/storage_broker.py:128 does not close the metadata whiteboard file when an error occurs. And I see the following in the log:

Thread-97729::ERROR::2015-12-02 17:24:03,100::storage_broker::132::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats_for_service_type) Failed to read metadata from /rhev/data-center/mnt/glusterSD/738234-hyp1.NTGinc.com:_hosted__engine/97c2d887-6f7d-4287-8231-8967d6f02f67/ha_agent/hosted-engine.metadata
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 128, in get_raw_stats_for_service_type
    data = os.read(f, read_size)
OSError: [Errno 22] Invalid argument


We should already have a fix for that in ovirt-hosted-engine-ha-1.3.3 though.
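
To illustrate the kind of leak described in this comment, here is a minimal sketch. It is not the actual storage_broker.py code and not the upstream patch; read_metadata and its parameters are hypothetical. If os.read raises and the descriptor is closed only on the success path, every failed read leaks one descriptor. Closing it in a finally block avoids that:

```python
import os


def read_metadata(path, read_size):
    """Read up to `read_size` bytes from `path`, always closing the descriptor.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # If this raises (for example OSError: [Errno 22] Invalid argument),
        # the finally clause below still runs.
        return os.read(fd, read_size)
    finally:
        # Runs on success and on error, so no descriptor is left open.
        os.close(fd)
```

With this shape, a read that fails repeatedly still raises each time, but the number of open descriptors in the process stays constant.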

Comment 11 Red Hat Bugzilla Rules Engine 2015-12-04 09:18:11 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 12 Red Hat Bugzilla Rules Engine 2015-12-04 09:18:11 UTC
Fixed bug tickets must have target milestone set prior to fixing them. Please set the correct milestone and move the bugs back to the previous status after this is corrected.

Comment 13 Red Hat Bugzilla Rules Engine 2015-12-04 09:18:11 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 14 Red Hat Bugzilla Rules Engine 2015-12-08 10:16:24 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 15 Red Hat Bugzilla Rules Engine 2015-12-08 10:16:24 UTC
Fixed bug tickets must have target milestone set prior to fixing them. Please set the correct milestone and move the bugs back to the previous status after this is corrected.

Comment 16 Red Hat Bugzilla Rules Engine 2015-12-08 10:16:24 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 17 Shira Maximov 2016-01-05 08:56:29 UTC
Verified on: ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch


Verification steps:
1. Check that the logs (/var/log/messages, broker log and engine log)
   contain no exception about too many open files.
2. Check with the lsof command whether the list of open descriptors for the broker process grows overnight.
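
Step 2 can be approximated with a small sampler instead of manual lsof runs. This is a rough sketch, assuming Linux and read access to /proc/<pid>/fd; sample_fd_counts and is_growing are hypothetical names, and the tolerance value is an arbitrary choice:

```python
import os
import time


def sample_fd_counts(pid, samples, interval_s):
    """Return a list with the number of open descriptors of `pid` per sample."""
    counts = []
    for i in range(samples):
        counts.append(len(os.listdir("/proc/%d/fd" % pid)))
        if i + 1 < samples:
            time.sleep(interval_s)
    return counts


def is_growing(counts, tolerance=0):
    """True if the last sample exceeds the first by more than `tolerance`."""
    return counts[-1] - counts[0] > tolerance
```

For an overnight check one might take a sample every few minutes against the broker's pid and inspect the returned list in the morning; a healthy broker should show a roughly flat series.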

Comment 18 Sandro Bonazzola 2016-01-13 14:37:53 UTC
oVirt 3.6.1 has been released, closing current release

