Bug 1286850
| Summary: | Too many open files | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | matt.deboer |
| Component: | Broker | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Shira Maximov <mshira> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 1.3.2.1 | CC: | bugs, matt.deboer, mavital, mgoldboi, sbonazzo |
| Target Milestone: | ovirt-3.6.1 | Flags: | rule-engine: ovirt-3.6.z+, mgoldboi: planning_ack+, msivak: devel_ack+, mavital: testing_ack+ |
| Target Release: | 1.3.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | sla | | |
| Fixed In Version: | ovirt-hosted-engine-ha-1.3.3 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-01-13 14:37:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Does it help to restart the ovirt-ha-broker and ovirt-ha-agent services? Or, as a second step, restart vdsmd?

Restarting the broker and agent services stopped the errors. What would a long-term fix for this be?

Well, there seems to be a descriptor leak somewhere. Can you check the open files of the agent and broker using lsof -p <pid>? Is there anything that repeats too much?

Right now none of the hosts are in error. All the hosts have /dev/null open 3 times (/dev/null was one of the files it would report trying to open and couldn't because of too many open files). /var/run/ovirt-hosted-engine-ha/broker.socket is open 2 or 3 times on each host. All the other files appear only once. When one of the hosts goes into error again I will try to post which files it has open.

The hosted engine image is open 1016 times:

ovirt-ha- 1338 vdsm 1023r REG 0,33 1028096 12961270410858168845 /rhev/data-center/mnt/glusterSD/hyp1:_hosted__engine/97c2d887-6f7d-4287-8231-8967d6f02f67/images/820f8645-c538-4613-a240-7585b140829e/6cc5b697-e6bc-4bc2-9194-9f243c6df593

All other files are OK. I have 2 hosts that did this overnight. I had to reset one because I need the system running, but I left one in error in case you need more information.
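As an aside, here is a minimal sketch of the kind of check discussed above. It tallies a process's open descriptors through /proc/<pid>/fd instead of parsing lsof output; the PID and the cut-off are illustrative, not part of this bug report:

```python
import os
from collections import Counter

def open_file_counts(pid):
    """Tally what a process holds open via /proc/<pid>/fd, roughly what
    `lsof -p <pid>` shows; a path repeated hundreds of times points at a leak."""
    fd_dir = "/proc/%d/fd" % pid
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            counts[os.readlink(os.path.join(fd_dir, fd))] += 1
        except OSError:
            pass  # the descriptor was closed between listdir() and readlink()
    return counts

# Example: show the most-repeated targets for the broker process.
# 1338 is the PID from the lsof line above; substitute your own.
for path, n in open_file_counts(1338).most_common(5):
    print("%6d  %s" % (n, path))
```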
Hi Matt, thanks for the information. Can you please attach the full agent and broker log? Also, who exactly holds the descriptors? The broker?

Created attachment 1101829 [details]
broker.log gzipped

Created attachment 1101830 [details]
agent.log gzipped
The broker process holds the descriptors. The agent process descriptors look normal. Unfortunately it appears the agent stopped logging when the broker went into error, but I have uploaded everything I have.

OK, I found the error. This is a pretty serious mistake. I still do not know what triggers it, but the code (1.3.2.1) in ovirt_hosted_engine_ha/broker/storage_broker.py:128 does not close the metadata whiteboard file when an error occurs. And I see the following in the log:
Thread-97729::ERROR::2015-12-02 17:24:03,100::storage_broker::132::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats_for_service_type) Failed to read metadata from /rhev/data-center/mnt/glusterSD/738234-hyp1.NTGinc.com:_hosted__engine/97c2d887-6f7d-4287-8231-8967d6f02f67/ha_agent/hosted-engine.metadata
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 128, in get_raw_stats_for_service_type
data = os.read(f, read_size)
OSError: [Errno 22] Invalid argument
We should already have a fix for that in ovirt-hosted-engine-ha-1.3.3 though.
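For illustration only (this is not the actual 1.3.3 patch, and the real get_raw_stats_for_service_type handles offsets and direct I/O), the shape of the fix is to release the descriptor on the error path as well, for example by pairing os.open() with os.close() in a finally block:

```python
import os

def read_raw_stats(path, read_size):
    """Minimal sketch: os.open() is paired with os.close() on every path out
    of the function, so a failing os.read() (the EINVAL above) no longer
    leaks one descriptor per poll until the broker hits 'too many open files'."""
    f = os.open(path, os.O_RDONLY)
    try:
        return os.read(f, read_size)  # may raise OSError, as in the traceback
    finally:
        os.close(f)                   # runs on success and on every error path
```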
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset. Please set the correct milestone or add the z-stream flag.

Fixed bug tickets must have the target milestone set prior to fixing them. Please set the correct milestone and move the bugs back to the previous status after this is corrected.

Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Verified on: ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch

Verification steps:
1. Check the logs (/var/log/messages, the broker log, and the engine log) that there is no exception about too many open files.
2. Check whether the list of open descriptors for the broker process grows overnight using the lsof command (a rough sketch of automating this check follows below).

oVirt 3.6.1 has been released, closing current release.
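A rough way to automate the second verification step, assuming the broker's PID is known; the PID and the hourly sampling interval are illustrative, and counting /proc/<pid>/fd entries is used here as a stand-in for lsof:

```python
import os
import time

def broker_fd_count(pid):
    """Number of descriptors the process currently holds open; counting
    entries in /proc/<pid>/fd is close enough to lsof output for watching
    whether the count keeps growing."""
    return len(os.listdir("/proc/%d/fd" % pid))

pid = 1338            # substitute the ovirt-ha-broker PID on the host under test
for _ in range(12):   # one sample per hour over a night
    print("%s  fds=%d" % (time.strftime("%H:%M:%S"), broker_fd_count(pid)))
    time.sleep(3600)
```

A steadily growing count indicates the descriptor leak is still present; a stable count indicates the ovirt-hosted-engine-ha-1.3.3 fix is effective.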
Created attachment 1100642 [details]
var log messages

Description of problem: Errors keep occurring saying too many open files. See the attached log. This repeats continuously until the engine eventually reboots the host (at least it did once).

Version-Release number of selected component (if applicable): 1.3.2.1

How reproducible: happens after the host has been running for a while

Steps to Reproduce:
1.
2.
3.

Actual results: errors in the log and a reboot of the host

Expected results: no errors, no reboot

Additional info: Let me know if you would like more information; I am not sure what else I need to provide. I know this can probably be worked around with a ulimit change, but it seems like it builds up over time to the point where it can't handle it anymore.