Bug 1372958

Summary: supervdsm running too many open files
Product: [oVirt] vdsm
Reporter: Eldad Marciano <emarcian>
Component: SuperVDSM
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: eberman
Severity: high
Docs Contact:
Priority: high
Version: 4.18.0
CC: bugs, mgoldboi, nsoffer, oourfali
Target Milestone: ovirt-4.2.0
Flags: oourfali: ovirt-4.2?, mgoldboi: planning_ack+, rule-engine: devel_ack?, pstehlik: testing_ack+
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-04 10:48:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Eldad Marciano 2016-09-04 11:14:54 UTC
Description of problem:
vdsm and supervdsm throw exceptions related to too many open files.

The engine marks the affected hosts with "activating" status.


Version-Release number of selected component (if applicable):
vdsm-4.18.11-1.el7ev.x86_64

How reproducible:
not clear

Steps to Reproduce:
1. Run a setup with 33 hosts, 12 NFS storage domains (SDs), and 77 VMs per host.
2. Sporadically, some of the hosts hit this problem.

Actual results:
OS error "too many open files"; the engine marks the host as "activating" for a long time.

Expected results:
A stable number of open files.

Additional info:
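
For anyone trying to confirm the symptom on an affected host, here is a minimal sketch (not part of the original report) that counts a process's open file descriptors through /proc and compares them with the soft RLIMIT_NOFILE. The helper names are hypothetical, it is Linux-only, and the limit returned by getrlimit() is that of the calling process, so it is only meaningful for pid "self". Reading /proc/<pid>/fd of another process such as supervdsm normally requires root.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only).

    The listing itself briefly opens one descriptor for pid "self",
    so the result is consistently one higher than the steady state.
    """
    return len(os.listdir("/proc/%s/fd" % pid))


def fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Hypothetical helper: the soft limit comes from getrlimit(), which
    describes the calling process, not an arbitrary pid.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count("self"), soft


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open fds: %d, soft limit: %d" % (used, limit))
```

Sampling this periodically would show whether the count keeps growing or stays stable, which is the expected result stated above.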

Comment 8 Yaniv Bronhaim 2016-09-20 09:07:28 UTC
Although it's indeed critical, I can't say anything about the environment. I don't know what calls were run before vdsm got into this state. The logs are not relevant and I can't reproduce it. I suspect that it happened because the storage domain was unreachable for a while and the hbaRescan() function hung forever until the open fds limit was reached: every call to supervdsm opens a new fd that stays open until the call returns. If all calls get stuck, we end up crossing the limit.

Nir, can you say whether this sounds like a reasonable scenario? Eldad, can you check whether such a flow reproduces the same behavior?
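
The suspected mechanism can be illustrated with a small simulation. This is only a sketch under assumptions, not vdsm or supervdsm code: each hypothetical "call" opens a pipe (two descriptors) and holds it until the call returns, and the calls are blocked on an event to stand in for a rescan that never completes. The function names are made up for illustration, and the descriptor count is read from /proc/self/fd, so it is Linux-only.

```python
import os
import threading


def blocked_call(release, started):
    """Stand-in for one proxy call: holds a pipe open until it 'returns'."""
    r, w = os.pipe()            # descriptors opened for this call
    started.release()           # tell the caller this call is in flight
    try:
        release.wait()          # simulates a call that is stuck
    finally:
        os.close(r)
        os.close(w)


def simulate(n_calls):
    """Return fd counts (before, during, after) for n_calls stuck calls."""
    release = threading.Event()
    started = threading.Semaphore(0)
    before = len(os.listdir("/proc/self/fd"))
    threads = [
        threading.Thread(target=blocked_call, args=(release, started))
        for _ in range(n_calls)
    ]
    for t in threads:
        t.start()
    for _ in threads:
        started.acquire()       # wait until every call holds its pipe
    during = len(os.listdir("/proc/self/fd"))
    release.set()               # let all calls return
    for t in threads:
        t.join()
    after = len(os.listdir("/proc/self/fd"))
    return before, during, after


if __name__ == "__main__":
    print(simulate(10))
```

While the calls are stuck, the descriptor count grows with the number of in-flight calls and only drops back once they return. With enough stuck callers, the process would reach its open files limit.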

Comment 9 Eldad Marciano 2016-09-20 15:04:12 UTC
(In reply to Yaniv Bronhaim from comment #8)
> Although it's indeed critical, I can't say anything about the environment. I
> don't know what calls were run before vdsm got into this state. The logs are
> not relevant and I can't reproduce it. I suspect that it happened because
> the storage domain was unreachable for a while and the hbaRescan() function
> hung forever until the open fds limit was reached: every call to supervdsm
> opens a new fd that stays open until the call returns. If all calls get
> stuck, we end up crossing the limit.
> 
> Nir, can you say whether this sounds like a reasonable scenario? Eldad, can
> you check whether such a flow reproduces the same behavior?

Yes, I'll try to get some priority for that.
But what you describe, opening that many fds in this scenario, sounds like unnecessary overhead. What about doing it over a single fd, at least for hbaRescan? Also, what about checking whether an fd already exists for that purpose before opening a new one?
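
One way to read the "single fd" suggestion is to allow at most one in-flight call of a given kind, so that a stuck call cannot pile up more descriptors behind it. The following is a hypothetical sketch of that idea, not an existing vdsm API: the class name and timeout parameter are assumptions, and it relies on Lock.acquire(timeout=...), which is available in Python 3 only.

```python
import threading


class SingleFlight(object):
    """Allow at most one in-flight call of the wrapped function.

    Hypothetical helper: other callers wait up to `timeout` seconds for
    the running call to finish, then fail instead of starting a new one.
    """

    def __init__(self, func, timeout):
        self._func = func
        self._timeout = timeout
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Python 3: acquire() returns False if the timeout expires.
        if not self._lock.acquire(timeout=self._timeout):
            raise RuntimeError("call already in flight")
        try:
            return self._func(*args, **kwargs)
        finally:
            self._lock.release()
```

Wrapping a rescan-like function this way would cap the resources held by stuck calls at those of a single call, at the cost of later callers getting an error while the first call hangs. Whether that trade-off is acceptable for storage monitoring is a separate question.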

Comment 10 Oved Ourfali 2016-09-21 11:37:07 UTC
Restoring needinfo on Nir.

Comment 11 Oved Ourfali 2016-11-17 12:37:16 UTC
Moving to 4.0.7, as we haven't received the required info.

Comment 12 Eldad Marciano 2016-12-21 10:06:15 UTC
Oved, please let's keep it open or retarget it, since we don't have much priority for it.

Comment 13 Oved Ourfali 2016-12-21 10:07:13 UTC
It is open.