Description of problem:

MainThread::WARNING::2016-05-31 18:21:21,193::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files

When I tried to collect logs using sosreport on the host, I hit this error and the run failed:

# sosreport

sosreport (version 3.2)

This command will collect diagnostic and configuration information from this CentOS Linux system and installed applications.

An archive containing the collected information will be generated in /var/tmp/sos.hYiSP7 and may be provided to a CentOS support representative.

Any information provided to CentOS will be treated in accordance with the published support policies at:

  https://www.centos.org/

The generated archive may contain data considered sensitive and its content should be reviewed by the originating organization before being passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Please enter your first initial and last name [alma03.qa.lab.tlv.redhat.com]:
Please enter the case id that you are generating this report for []:

Setting up archive ...
Setting up plugins ...
[plugin:virsh] command 'virsh list --all' timed out after 300s
 Running plugins. Please wait ...

  Running 86/86: yum...
..
Traceback (most recent call last):
  File "/usr/sbin/sosreport", line 25, in <module>
    main(sys.argv[1:])
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1593, in main
    sos.execute()
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1568, in execute
    self.plain_report()
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1316, in plain_report
    fd.write(str(PlainTextReport(report)))
  File "/usr/lib/python2.7/site-packages/sos/reporting.py", line 150, in __str__
    return "\n".join(buf)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 26: ordinal not in range(128)

Version-Release number of selected component (if applicable):

Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016

Host:
ovirt-vmconsole-host-1.0.2-0.0.master.20160517094103.git06df50a.el7.noarch
vdsm-4.17.999-1155.gitcf216a0.el7.centos.x86_64
ovirt-setup-lib-1.0.2-0.0.master.20160502125738.gitf05af9e.el7.centos.noarch
ovirt-release40-4.0.0-0.3.beta1.noarch
ovirt-vmconsole-1.0.2-0.0.master.20160517094103.git06df50a.el7.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-engine-sdk-python-3.6.5.1-0.1.20160507.git5fb7e0e.el7.centos.noarch
ovirt-host-deploy-1.5.0-0.1.alpha1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0-0.1.beta1.el7.centos.noarch
ovirt-release-host-node-4.0.0-0.3.beta1.el7.noarch
ovirt-engine-appliance-4.0-20160528.1.el7.centos.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.0-0.1.beta1.el7.centos.noarch
ovirt-node-ng-image-update-placeholder-4.0.0-0.3.beta1.el7.noarch
CentOS Linux release 7.2.1511 (Core)
Linux version 3.10.0-327.18.2.el7.x86_64 (builder.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu May 12 11:03:55 UTC 2016
Linux 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
100%

Steps to Reproduce:
1. During the HE upgrade, while the setup was in global maintenance, the agent died.
2.
3.

Actual results:
OSError: [Errno 24] Too many open files - ovirt-ha-agent is dead.

Expected results:
ovirt-ha-agent should be running.

Additional info:
Sosreport from the engine, as I failed to collect the same from the host.
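For what it's worth, the failure mode in the traceback can be reproduced in isolation. The sketch below is my own illustration, not vdsm code: it calls the same eventfd(2) syscall that vdsm.infra.eventfd wraps, never closes the descriptors (as a leaked Reactor would), and ends in the same [Errno 24]:

    # Minimal repro sketch (assumption: Linux with glibc; not vdsm code).
    import ctypes
    import errno
    import resource

    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    # Lower the fd limit so the demo fails after dozens of fds, not ~1024.
    resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))

    leaked = []
    while True:
        fd = libc.eventfd(0, 0)  # the syscall vdsm's EventFD wraps
        if fd < 0:
            err = ctypes.get_errno()
            print("failed after %d leaked eventfds: [Errno %d] %s"
                  % (len(leaked), err, errno.errorcode[err]))
            break
        leaked.append(fd)  # never closed: this is the leak

The point being: any code path that allocates a fresh eventfd on every monitoring cycle without closing the previous one will hit EMFILE sooner or later, regardless of the configured limit.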
Created attachment 1165145 [details] sosreport from the engine
This bug might have been caused by insufficient space within /var/tmp/ on the host: there were too many sosreports there, and the agent, which writes its logs into /var/log/ovirt-hosted-engine-ha/, could not write them and thus failed to start. When I freed some space and rebooted the host, the agent started OK.
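If someone wants to rule the disk-space theory in or out quickly, a check along these lines works (my own sketch; the two paths are the ones mentioned above):

    import os

    # Print free space and free inodes for the paths implicated above.
    for path in ("/var/tmp", "/var/log/ovirt-hosted-engine-ha"):
        st = os.statvfs(path)
        free_mb = st.f_bavail * st.f_frsize // (1024 * 1024)
        print("%s: %d MB free, %d free inodes" % (path, free_mb, st.f_favail))

Checking inodes as well as bytes matters here, since a directory full of old sosreports can exhaust either.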
I'm not sure this is the reason, because I also encountered this error on RHEL 7.2, and I had enough space under /var/.
Guys, please stop discussing two different issues (Too many open files and sosreport crash) in the same bug. You are only confusing the report. Always file a new bug for each separate issue.
Adding more details, including a sosreport from the host, which is now el7.2 and was cleanly reprovisioned.

[root@alma03 ~]# systemctl status ovirt-ha-agent.service -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-06-06 20:19:54 IDT; 16h ago
 Main PID: 18098 (code=exited, status=0/SUCCESS)

Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: self._verify_code(fd)
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: raise OSError(err, msg)
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: OSError: [Errno 24] Too many open files
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Shutting down the agent because of 3 failures in a row!
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Shutting down the agent because of 3 failures in a row!
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Failed to stop monitoring domain (sd_uuid=b3051ff3-9728-4ac8-a36d-4fd4c5d12869): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=b3051ff3-9728-4ac8-a36d-4fd4c5d12869'
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: Exception AttributeError: "'EventFD' object has no attribute '_fd'" in <bound method EventFD.__del__ of <vdsm.infra.eventfd.EventFD object at 0x4954610>> ignored
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 2ba059b6
Host timestamp                     : 8040
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=8040 (Mon Jun 6 20:18:34 2016)
    host-id=1
    score=0
    maintenance=False
    state=AgentStopped
    stopped=True

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 937d0433
Host timestamp                     : 78999
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=78999 (Tue Jun 7 13:06:43 2016)
    host-id=2
    score=3400
    maintenance=False
    state=EngineUp
    stopped=False
Created attachment 1165568 [details] new sosreport from the engine
Did you use RHEV-H or NGN or RHEL-H as a hypervisor? Is the Errno 24 specific to RHEV-H / NGN?
(In reply to Doron Fediuck from comment #7)
> Did you use RHEV-H or NGN or RHEL-H as a hypervisor?
> Is the Errno 24 specific to RHEV-H / NGN?

The first time, I used NGN 4.0 RHEV-H (next-generation RHEV-H) as one of my hosts (alma03); the second host was RHEL 7.2. Now both hosts are RHEL 7.2. I could not collect the sosreport from both hosts due to https://bugzilla.redhat.com/show_bug.cgi?id=1296813 and https://bugzilla.redhat.com/show_bug.cgi?id=1343437.

No, it's not specific to NGN; please see comment #5, which was posted from a RHEL 7.2 host with these components:

qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Seems related to this patch: https://gerrit.ovirt.org/#/c/57942/
Adding sosreport from host, alma04, as I see this error message: Jun 08 11:01:43 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[5481]: IOError: [Errno 24] Too many open files But this time the agent is not dead yet.
As the file size was larger than Bugzilla can support, I added an external link to the host's sosreport here:
https://drive.google.com/open?id=0B85BEaDBcF88eWNDbWg4LXNYTm8
(In reply to Fred Rolland from comment #9)
> Seems related to this patch:
> https://gerrit.ovirt.org/#/c/57942/

Yes, this was in the jsonrpc client.
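To make the suspected mechanism concrete (a stand-in sketch of mine using os.pipe(), not the actual vdsm/yajsonrpc code): every call to connect_vdsm_json_rpc builds a new StandAloneRpcClient, each client builds a Reactor, and each Reactor allocates an eventfd. If nothing ever closes the old client, descriptors accumulate on every monitoring cycle:

    import os

    class FakeReactor(object):
        """Stand-in for yajsonrpc's Reactor; owns fds like its EventFD does."""
        def __init__(self):
            self._rfd, self._wfd = os.pipe()  # placeholder for the eventfd

        def close(self):
            os.close(self._rfd)
            os.close(self._wfd)

    def connect_per_cycle():
        return FakeReactor()  # a fresh reactor (and fds) on every call

    def count_fds():
        return len(os.listdir("/proc/self/fd"))  # Linux-only

    before = count_fds()
    clients = [connect_per_cycle() for _ in range(100)]  # nobody closes these
    print("leaked %d fds in 100 'reconnects'" % (count_fds() - before))

    # The fix amounts to closing or reusing the client instead of leaking it:
    for c in clients:
        c.close()
    print("after close(): %d fds" % count_fds())

As far as I understand the patch, it is about exactly that: not constructing a new reactor per reconnect attempt.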
Still being reproduced on these components:

Host:
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
rhev-release-4.0.0-12-001.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

MainThread::ERROR::2016-06-13 17:20:14,349::config::219::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Failed scanning for OVF_STORE due to [Errno 24] Too many open files
MainThread::ERROR::2016-06-13 17:20:14,350::config::235::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
MainThread::WARNING::2016-06-13 17:20:14,351::hosted_engine::477::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Path to volume 7b7535a3-d9d4-4dae-8b72-0bd3e6154308 not found in /rhev/data-center/mnt
MainThread::WARNING::2016-06-13 17:20:14,351::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 445, in start_monitoring
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 676, in _initialize_storage_images
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 244, in refresh_local_conf_file
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 273, in get_volume_path
RuntimeError: Path to volume 7b7535a3-d9d4-4dae-8b72-0bd3e6154308 not found in /rhev/data-center/mnt
MainThread::INFO::2016-06-13 17:20:14,351::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
The patch was merged recently on master, and your version of vdsm does not contain it.
MainThread::INFO::2016-06-21 16:59:18,403::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::WARNING::2016-06-21 16:59:20,405::hosted_engine::477::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: [Errno 24] Too many open files
MainThread::WARNING::2016-06-21 16:59:20,405::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files
MainThread::ERROR::2016-06-21 16:59:20,406::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
I still see the error in the log, on the components listed below:

Host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-release-4.0.0-18-001.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-1.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

I power-cycled the host on which the HE VM was running; after the host booted up, I saw the error in the log.

BTW, the last message received over the CLI terminal, which was still open to the engine during the host's power-cycling, was:

"[root@nsednev-he-2 ~]#
Message from syslogd@nsednev-he-2 at Jun 21 10:41:11 ...
 kernel:BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u8:1:59]"

It looks pretty much the same as what appears here: http://ubuntuforums.org/showthread.php?t=2205211 .
Created attachment 1170317 [details] latest sosreport from engine
Created attachment 1170319 [details] latest sosreport from host alma04
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Please provide lsof output for the process.
Created attachment 1170334 [details] lsof from alma03
(In reply to Nikolai Sednev from comment #15)
> MainThread::INFO::2016-06-21 16:59:18,403::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
> [...]
> OSError: [Errno 24] Too many open files
> MainThread::ERROR::2016-06-21 16:59:20,406::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
>
> I still see the error in the log, on the components listed below:
> Host:
> [...]
> vdsm-4.18.3-0.el7ev.x86_64
> [...]
>
> I power-cycled the host on which the HE VM was running; after the host
> booted up, I saw the error in the log.
> [...]

The fix was only part of vdsm 4.18.4. Please re-test with this version.
FYI, if this is similar to what I have seen, then it impacts HE migration from 3.6 to 4.0, as the "recommended" flow is to end global maintenance after the migration to let the HA agents start the HE VM. But this won't happen, due to the "too many open files" issue in the HA agent.
[root@alma03 ~]# yum list | grep vdsm
vdsm.x86_64                     4.18.3-0.el7ev                  @rhev-4.0.0-17

I can't verify this bug until QA receives vdsm 4.18.4.
VDSM 4.18.3 was released yesterday
(In reply to Eyal Edri from comment #24)
> VDSM 4.18.3 was released yesterday

I guess Eyal me a text 4.18.4
(In reply to Oved Ourfali from comment #25)
> (In reply to Eyal Edri from comment #24)
> > VDSM 4.18.3 was released yesterday
>
> I guess Eyal me a text 4.18.4

I meant "meant"... Auto correction.....
So re-target to 4.0.0?
Works for me on these components:

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux nsednev-he-1.qa.lab.tlv.redhat.com 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Hosts:
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
vdsm-4.18.4-2.el7ev.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Not sure if you all are saying this is supposed to be fixed in vdsm 4.18.4... I can report that I have vdsm-4.18.4.1-0.el7.centos.x86_64 and the issue is _NOT_ fixed.

According to lsof, the number of open files named "[eventfd]" keeps growing until ovirt-ha-agent dies due to too many open files.

Here's what lsof shows for one of these open files:

[root@sexi-albert /]# lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | head -1
ovirt-ha- 56795 vdsm    5u  a_inode  0,9  0  7259  [eventfd]

As you can see, the number of these open files goes up quite quickly:

[root@sexi-albert /]# for i in {1..30}; do echo -n "$(date): "; lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | wc -l; sleep 2; done
Tue Jun 28 01:06:53 PDT 2016: 744
Tue Jun 28 01:06:55 PDT 2016: 744
Tue Jun 28 01:06:57 PDT 2016: 744
Tue Jun 28 01:06:59 PDT 2016: 744
Tue Jun 28 01:07:01 PDT 2016: 744
Tue Jun 28 01:07:04 PDT 2016: 744
Tue Jun 28 01:07:06 PDT 2016: 744
Tue Jun 28 01:07:08 PDT 2016: 746
Tue Jun 28 01:07:10 PDT 2016: 746
Tue Jun 28 01:07:12 PDT 2016: 748
Tue Jun 28 01:07:14 PDT 2016: 748
Tue Jun 28 01:07:16 PDT 2016: 748
Tue Jun 28 01:07:18 PDT 2016: 750
Tue Jun 28 01:07:20 PDT 2016: 750
Tue Jun 28 01:07:23 PDT 2016: 752
Tue Jun 28 01:07:25 PDT 2016: 752
Tue Jun 28 01:07:27 PDT 2016: 752
Tue Jun 28 01:07:29 PDT 2016: 754
Tue Jun 28 01:07:31 PDT 2016: 754
Tue Jun 28 01:07:33 PDT 2016: 756
Tue Jun 28 01:07:35 PDT 2016: 756
Tue Jun 28 01:07:37 PDT 2016: 756
Tue Jun 28 01:07:40 PDT 2016: 756
Tue Jun 28 01:07:42 PDT 2016: 756
Tue Jun 28 01:07:44 PDT 2016: 756
Tue Jun 28 01:07:46 PDT 2016: 756
Tue Jun 28 01:07:48 PDT 2016: 758
Tue Jun 28 01:07:50 PDT 2016: 758
Tue Jun 28 01:07:52 PDT 2016: 758
Tue Jun 28 01:07:54 PDT 2016: 760

This is spamming the crap out of me and my other admins with hundreds of email alerts per day.... I have 5 HA hosted engine hosts and they're all spewing ReinitializeFSM-EngineStarting, EngineStarting-EngineUnexpectedlyDown, StartState-ReinitializeFSM, etc. _ad nauseam_. Please make it stop! ;-)

This is a cluster upgraded from 3.6 -> 4.0:

[root@sexi-albert /]# rpm -qa | grep -E "(ovirt|vdsm)" | sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-appliance-4.0-20160623.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-release40-4.0.0-5.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

Thanks!
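In case it's useful for whoever debugs this, here's the same check as a small Python watcher (my own sketch, not part of oVirt; assumes Linux /proc and enough privileges to read the agent's fd directory). It logs only when the count changes, so it can run unattended:

    import os
    import sys
    import time

    def fd_count(pid):
        """Number of open fds for a pid, via /proc (Linux)."""
        return len(os.listdir("/proc/%d/fd" % pid))

    def watch(pid, interval=2.0):
        last = None
        while True:
            try:
                count = fd_count(pid)
            except OSError:
                print("process %d is gone" % pid)
                return
            if count != last:
                print("%s: %d open fds" % (time.ctime(), count))
                last = count
            time.sleep(interval)

    if __name__ == "__main__":
        watch(int(sys.argv[1]))  # e.g. the pid from: pidof -x ovirt-ha-agent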
(In reply to Carl Thompson from comment #30)
> Not sure if you all are saying this is supposed to be fixed in vdsm
> 4.18.4... I can report that I have vdsm-4.18.4.1-0.el7.centos.x86_64 and the
> issue is _NOT_ fixed.
>
> According to lsof, the number of open files named "[eventfd]" keeps growing
> until ovirt-ha-agent dies due to too many open files.
> [...]

Can you confirm that all of your hosts and the engine are running the latest components, as they appear in https://bugzilla.redhat.com/show_bug.cgi?id=1343005#c29?
Regarding your present issue, please attach sosreports from your hosts and, if possible, from the engine, so we can track down the root cause. Regarding the email spam in your inbox, I've opened a separate bug: https://bugzilla.redhat.com/show_bug.cgi?id=1350758.
The fix for this bug was reverted in 4.18.4.1 due to BZ 1349461, but hopefully it will be part of the next VDSM release.
*** Bug 1350687 has been marked as a duplicate of this bug. ***
oVirt 4.0.0 has been released, closing current release.
(In reply to Sandro Bonazzola from comment #34)
> oVirt 4.0.0 has been released, closing current release.

Hello, if I read this correctly, this bug appears to have been marked as closed because it should be fixed in the current 4.0 release. However, I don't believe it is fixed in 4.0. As I stated in my comment above, I have 4.0 and it is still broken there. Was this closed prematurely?

Thanks!
This fix is part of vdsm 4.18.4+. Please make sure you are running that version; if you still see the issue, please provide logs.
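If in doubt, a quick check along these lines tells you whether the installed vdsm meets that floor (a sketch of mine; assumes an RPM-based host, and note that, as mentioned above, 4.18.4.1 had the fix reverted, so the version being >= 4.18.4 is necessary but not sufficient):

    import subprocess
    from distutils.version import LooseVersion

    installed = subprocess.check_output(
        ["rpm", "-q", "--qf", "%{VERSION}", "vdsm"]).decode().strip()

    if LooseVersion(installed) >= LooseVersion("4.18.4"):
        print("vdsm %s: version floor met" % installed)
    else:
        print("vdsm %s: too old, please update" % installed)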
ovirt-ha-agent terminates with too many open files:

WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: [Errno 24] Too many open files
WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Shutting down the agent because of 3 failures in a row!

[root@node1 ~]# rpm -qa|grep -i -E '(vdsm|ovirt)'|sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

I can't find a newer vdsm on http://resources.ovirt.org/pub/ovirt-4.0/rpm/el7/x86_64/.
Simone, can you please check it?
I just checked vdsm.x86_64 4.18.4.1-0.el7.centos and the patch is not in.
(In reply to Piotr Kliczewski from comment #36)
> This fix is part of vdsm 4.18.4+. Please make sure you are running that
> version; if you still see the issue, please provide logs.

Read comment #32.
Please fill in the "Fixed In Version:" field before moving to ON_QA.
Works for me on these components.

On the host:
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.18.x86_64
mom-0.5.5-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
vdsm-4.18.5.1-1.el7ev.x86_64
rhevm-appliance-20160623.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
rhev-release-4.0.1-1-001.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

On the engine:
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-4.0.2-0.2.rc1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-462.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-8) (GCC) ) #1 SMP Thu Jul 7 10:15:22 EDT 2016
Linux 3.10.0-462.el7.x86_64 #1 SMP Thu Jul 7 10:15:22 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)

[root@alma04 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-07-11 14:37:35 IDT; 2 days ago
 Main PID: 60170 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─60170 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jul 13 20:30:33 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configuration from the OVF_STORE
Jul 13 20:30:38 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:2ff018b6-5061-4f43-84fa-257b4c95cf53, volUUID:8d8728af-ebab-4f25-b36c-910f296f998c
Jul 13 20:30:38 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:c486a13b-8992-4709-8c21-cbddfca0804b, volUUID:7f705583-4bfa-44d6-a86c-47c7c8a9713f
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Extracting Engine VM OVF from the OVF_STORE
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:OVF_STORE volume path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__2/8fdd4f94-d071-4369-9307-07d7395ef3d9/images/c486a13b-8992-4709-8c21-cbddfca0804b/7f705583-4bfa-44d6-a86c-47c7c8a9713f
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Found an OVF for HE VM, trying to convert
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Got vm.conf from OVF_STORE
Jul 13 20:30:44 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineUp (score: 3400)
Jul 13 20:30:54 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Engine vm running on localhost
Jul 13 20:30:54 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Since the problem described in this bug report should be resolved in oVirt 4.0.1 released on July 19th 2016, it has been closed with a resolution of CURRENT RELEASE. For information on the release, and how to update to this release, follow the link below. If the solution does not work for you, open a new bug report. http://www.ovirt.org/release/4.0.1/
In my opinion this is not solved in the released version: I see this happening on the command line of the engine, on several hosts and with various CPUs:

[root@hosted-engine-01 ~]#
Message from syslogd@hosted-engine-01 at Jul 19 14:49:02 ...
 kernel:BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u8:1:3995]
The error in comment 44 looks like a different problem. Why do you think this message is related to this bug?
It all happens at the same time, and in the end I need to start the engine manually.
That still doesn't make it related. It seems different, as Fabian mentioned. You should open a separate bug for it.