Description of problem:
vdsm and supervdsm throw exceptions related to too many open files. The engine marks the affected hosts with "Activating" status.

Version-Release number of selected component (if applicable):
vdsm-4.18.11-1.el7ev.x86_64

How reproducible:
Not clear

Steps to Reproduce:
1. Run a setup with 33 hosts, 12 NFS SDs and 77 VMs per host.
2. Sporadically, some of the hosts hit this problem.
3.

Actual results:
OS error "too many open files"; the engine marks the host as activating for a long time.

Expected results:
A stable number of open files.

Additional info:
Although it is indeed critical, I can't say anything about the environment. I don't know what calls were run before vdsm got into this state. The logs are not relevant and I can't reproduce it. I suspect that it happened because the storage domain was unreachable for a while and the hbaRescan() function hung until the open fd limit was reached: every call to supervdsm opens a new fd, which stays open until the call returns. If all calls get stuck, we end up crossing the limit.

Nir, can you say whether this sounds like a reasonable scenario? Eldad, can you check whether such a flow reproduces the same behavior?
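To illustrate the suspected mechanism, here is a minimal sketch, not vdsm code. It assumes each in-flight call holds its own connection until it returns, and uses a `socket.socketpair()` as a stand-in for that connection. The names `stuck_call`, `simulate` and `count_open_fds` are made up for this example, and the fd count relies on `/proc/self/fd` (or `/dev/fd`) being available.

```python
import os
import resource
import socket
import threading

# Directory listing this process's open fds (Linux: /proc/self/fd).
FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def count_open_fds():
    """Return the number of file descriptors currently open in this process."""
    return len(os.listdir(FD_DIR))


def stuck_call(release, started):
    """Hypothetical stand-in for a proxied call: it opens its own
    connection and holds it until the call returns."""
    a, b = socket.socketpair()
    started.release()
    try:
        release.wait()  # simulates a call hanging on unreachable storage
    finally:
        a.close()
        b.close()


def simulate(n_calls):
    """Run n_calls blocked calls; return fd counts before, during and after."""
    release = threading.Event()
    started = threading.Semaphore(0)
    before = count_open_fds()
    threads = [
        threading.Thread(target=stuck_call, args=(release, started))
        for _ in range(n_calls)
    ]
    for t in threads:
        t.start()
    for _ in range(n_calls):
        started.acquire()  # wait until every call holds its connection
    during = count_open_fds()
    release.set()  # "storage comes back": all calls return
    for t in threads:
        t.join()
    after = count_open_fds()
    return before, during, after


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    before, during, after = simulate(10)
    print("soft limit:", soft, "before:", before, "during:", during, "after:", after)
```

While the calls are blocked, the fd count grows with the number of stuck calls and drops back only once they return. If they never return, enough of them will reach the `RLIMIT_NOFILE` soft limit.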
(In reply to Yaniv Bronhaim from comment #8)
> Although it is indeed critical, I can't say anything about the environment. I
> don't know what calls were run before vdsm got into this state. The logs are
> not relevant and I can't reproduce it. I suspect that it happened because
> the storage domain was unreachable for a while and the hbaRescan() function
> hung until the open fd limit was reached: every call to supervdsm opens a
> new fd, which stays open until the call returns. If all calls get stuck, we
> end up crossing the limit.
>
> Nir, can you say whether this sounds like a reasonable scenario? Eldad, can
> you check whether such a flow reproduces the same behavior?

Yes, I'll try to get some priority for that. But what you are describing, opening that many fds in this scenario, sounds like unnecessary overhead. What about doing it via a single fd, at least for the HBA rescan? Also, what about checking whether there are existing fds that can be used for that purpose?
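One way to read the "single fd" suggestion is to let only one rescan be in flight at a time and have concurrent callers wait for its result instead of each opening a new connection. The sketch below is only an illustration of that idea under this assumption, not how vdsm or supervdsm is implemented; `SingleFlight`, `_Flight` and `callers()` are hypothetical names.

```python
import threading


class _Flight:
    """State shared by every caller attached to one in-flight invocation."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.callers = 1  # the leader that actually runs the function


class SingleFlight:
    """Run at most one invocation of func at a time.

    Callers that arrive while an invocation is in flight do not start
    another one; they wait for it and share its result or its exception.
    """

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()
        self._flight = None

    def callers(self):
        """Number of callers attached to the current invocation (0 if idle)."""
        with self._lock:
            return self._flight.callers if self._flight else 0

    def __call__(self):
        with self._lock:
            flight = self._flight
            leader = flight is None
            if leader:
                flight = self._flight = _Flight()
            else:
                flight.callers += 1
        if leader:
            try:
                flight.result = self._func()
            except Exception as e:
                flight.error = e
            finally:
                with self._lock:
                    self._flight = None  # the next caller starts a new run
                flight.done.set()
        else:
            flight.done.wait()  # followers open nothing, they only wait
        if flight.error is not None:
            raise flight.error
        return flight.result
```

Wrapping the rescan call as `rescan = SingleFlight(do_rescan)` (with `do_rescan` being whatever performs the real call) would keep resource usage constant no matter how many callers pile up. It does not fix the hang itself: if the one running call never returns, the followers still wait, so a timeout on the underlying call would still be needed.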
Restoring needinfo on Nir.
Moving to 4.0.7 as we didn't get the required info.
Oved, please let's keep it open or retarget it, since we don't have much priority for it.
It is open.