Bug 1343005 - OSError: [Errno 24] Too many open files - ovirt-ha-agent is dead
Summary: OSError: [Errno 24] Too many open files - ovirt-ha-agent is dead
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Bindings-API
Version: 4.18.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ovirt-4.0.1
Target Release: 4.18.5
Assignee: Piotr Kliczewski
QA Contact: Nikolai Sednev
URL:
Whiteboard:
: 1350687 (view as bug list)
Depends On: 1349461
Blocks: 1349829 1350758 1417708 1417709
 
Reported: 2016-06-06 10:22 UTC by Nikolai Sednev
Modified: 2017-01-30 17:44 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
: 1349829 (view as bug list)
Environment:
Last Closed: 2016-07-19 06:22:42 UTC
oVirt Team: Infra
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
rule-engine: planning_ack+
oourfali: devel_ack+
mavital: testing_ack+


Attachments
sosreport from the engine (8.46 MB, application/x-xz)
2016-06-06 10:23 UTC, Nikolai Sednev
new sosreport from the engine (7.98 MB, application/x-xz)
2016-06-07 10:15 UTC, Nikolai Sednev
latest sosreport from engine (6.19 MB, application/x-xz)
2016-06-21 15:05 UTC, Nikolai Sednev
latest sosreport from host alma04 (8.84 MB, application/x-xz)
2016-06-21 15:10 UTC, Nikolai Sednev
lsof from alma03 (115.66 KB, text/plain)
2016-06-21 16:26 UTC, Nikolai Sednev


Links
System ID           Private  Priority  Status  Summary                Last Updated
oVirt gerrit 57942  0        'None'    MERGED  jsonrpc: close client  2021-02-02 12:24:58 UTC
oVirt gerrit 59106  0        'None'    MERGED  jsonrpc: close client  2021-02-02 12:24:58 UTC

Description Nikolai Sednev 2016-06-06 10:22:09 UTC
Description of problem:
MainThread::WARNING::2016-05-31 18:21:21,193::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files
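
Errno 24 (EMFILE) means the process has reached its per-process limit on open file descriptors. The following is a minimal Python sketch, assuming a Linux host with /proc mounted, for comparing a process's own descriptor count with its soft limit. The helper name fd_usage is made up for illustration and is not part of vdsm or ovirt-hosted-engine-ha.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only).

    Hypothetical helper for illustration; not a vdsm API.
    """
    # Every entry in /proc/self/fd is one open descriptor. The listing
    # itself briefly opens one more descriptor, so the count is off by one.
    open_fds = len(os.listdir("/proc/self/fd"))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open descriptors: %d, soft limit: %d" % (used, limit))
```

A count that keeps climbing toward the soft limit while the workload stays constant points to a descriptor leak rather than to a limit that is merely set too low.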



When I tried to collect logs with sosreport on the host, it failed with this error:
# sosreport

sosreport (version 3.2)

This command will collect diagnostic and configuration information from
this CentOS Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.hYiSP7 and may be provided to a CentOS support
representative.

Any information provided to CentOS will be treated in accordance with
the published support policies at:

  https://www.centos.org/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Please enter your first initial and last name [alma03.qa.lab.tlv.redhat.com]: 
Please enter the case id that you are generating this report for []: 

 Setting up archive ...
 Setting up plugins ...
[plugin:virsh] command 'virsh list --all' timed out after 300s
 Running plugins. Please wait ...

  Running 86/86: yum...              ..        
Traceback (most recent call last):
  File "/usr/sbin/sosreport", line 25, in <module>
    main(sys.argv[1:])
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1593, in main
    sos.execute()
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1568, in execute
    self.plain_report()
  File "/usr/lib/python2.7/site-packages/sos/sosreport.py", line 1316, in plain_report
    fd.write(str(PlainTextReport(report)))
  File "/usr/lib/python2.7/site-packages/sos/reporting.py", line 150, in __str__
    return "\n".join(buf)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 26: ordinal not in range(128)



Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016

Host:
ovirt-vmconsole-host-1.0.2-0.0.master.20160517094103.git06df50a.el7.noarch
vdsm-4.17.999-1155.gitcf216a0.el7.centos.x86_64
ovirt-setup-lib-1.0.2-0.0.master.20160502125738.gitf05af9e.el7.centos.noarch
ovirt-release40-4.0.0-0.3.beta1.noarch
ovirt-vmconsole-1.0.2-0.0.master.20160517094103.git06df50a.el7.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-engine-sdk-python-3.6.5.1-0.1.20160507.git5fb7e0e.el7.centos.noarch
ovirt-host-deploy-1.5.0-0.1.alpha1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0-0.1.beta1.el7.centos.noarch
ovirt-release-host-node-4.0.0-0.3.beta1.el7.noarch
ovirt-engine-appliance-4.0-20160528.1.el7.centos.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.0-0.1.beta1.el7.centos.noarch
ovirt-node-ng-image-update-placeholder-4.0.0-0.3.beta1.el7.noarch
CentOS Linux release 7.2.1511 (Core) 
Linux version 3.10.0-327.18.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu May 12 11:03:55 UTC 2016
Linux 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux




How reproducible:
100%

Steps to Reproduce:
1. During the hosted-engine (HE) upgrade, while the environment was in global maintenance, the agent died.

Actual results:
OSError: [Errno 24] Too many open files - ovirt-ha-agent is dead 

Expected results:
ovirt-ha-agent should be running.

Additional info:
A sosreport from the engine is attached; collecting one from the host failed.

Comment 1 Nikolai Sednev 2016-06-06 10:23:40 UTC
Created attachment 1165145 [details]
sosreport from the engine

Comment 2 Nikolai Sednev 2016-06-07 08:32:19 UTC
This bug might have been caused by insufficient space in /var/tmp/ on the host: there were too many sosreports there, so the agent, which writes its logs to /var/log/ovirt-hosted-engine-ha/, could not write them and thus failed to start. After I freed some space and rebooted the host, the agent started OK.

Comment 3 Artyom 2016-06-07 08:51:11 UTC
I am not sure this is the reason, because I also encountered this error on RHEL 7.2, and I had enough space under /var/.

Comment 4 Martin Sivák 2016-06-07 08:59:56 UTC
Guys, please stop discussing two different issues (Too many open files and sosreport crash) in the same bug. You are only confusing the report. Always file a new bug for each separate issue.

Comment 5 Nikolai Sednev 2016-06-07 10:12:05 UTC
Adding more details, including a sosreport from the host, which is now el7.2 and was cleanly reprovisioned.

[root@alma03 ~]# systemctl status ovirt-ha-agent.service -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-06-06 20:19:54 IDT; 16h ago
 Main PID: 18098 (code=exited, status=0/SUCCESS)

Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: self._verify_code(fd)
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: raise OSError(err, msg)
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: OSError: [Errno 24] Too many open files
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Shutting down the agent because of 3 failures in a row!
Jun 06 20:19:51 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Shutting down the agent because of 3 failures in a row!
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Failed to stop monitoring domain (sd_uuid=b3051ff3-9728-4ac8-a36d-4fd4c5d12869): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=b3051ff3-9728-4ac8-a36d-4fd4c5d12869'
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: Exception AttributeError: "'EventFD' object has no attribute '_fd'" in <bound method EventFD.__del__ of <vdsm.infra.eventfd.EventFD object at 0x4954610>> ignored
Jun 06 20:19:54 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[18098]: INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down
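
The AttributeError from EventFD.__del__ in the journal above is most likely a side effect of the first failure: if __init__ raises before the _fd attribute is assigned, __del__ later runs on a half-built object. The sketch below shows one common way to make such a wrapper safe. SafeFd is a hypothetical class written for illustration, it opens /dev/null instead of an eventfd, and it is not the vdsm EventFD implementation.

```python
import errno
import os


class SafeFd(object):
    """Hypothetical descriptor wrapper; not the vdsm EventFD class."""

    def __init__(self, simulate_emfile=False):
        # Assign the attribute before anything can raise, so that
        # close() and __del__ always find it.
        self._fd = None
        if simulate_emfile:
            # Stand-in for the real failure seen in the traceback.
            raise OSError(errno.EMFILE, os.strerror(errno.EMFILE))
        self._fd = os.open(os.devnull, os.O_RDONLY)

    def close(self):
        # Idempotent: safe on a half-built or already closed object.
        fd = getattr(self, "_fd", None)
        self._fd = None
        if fd is not None:
            os.close(fd)

    def __del__(self):
        self.close()
```

With this shape, a failed constructor leaves nothing for __del__ to trip over, and calling close() twice is harmless.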

[root@alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 2ba059b6
Host timestamp                     : 8040
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=8040 (Mon Jun  6 20:18:34 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 937d0433
Host timestamp                     : 78999
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=78999 (Tue Jun  7 13:06:43 2016)
        host-id=2
        score=3400
        maintenance=False
        state=EngineUp
        stopped=False

Comment 6 Nikolai Sednev 2016-06-07 10:15:22 UTC
Created attachment 1165568 [details]
new sosreport from the engine

Comment 7 Doron Fediuck 2016-06-07 10:56:11 UTC
Did you use RHEV-H or NGN or RHEL-H as a hypervisor?
Is the Errno 24 specific to RHEV-H / NGN?

Comment 8 Nikolai Sednev 2016-06-07 11:31:50 UTC
(In reply to Doron Fediuck from comment #7)
> Did you use RHEV-H or NGN or RHEL-H as a hypervisor?
> Is the Errno 24 specific to RHEV-H / NGN?

Initially I used NGN 4.0 RHEV-H (next-generation RHEV-H) for one of my hosts (alma03); the second host was RHEL 7.2.
Both hosts are now RHEL 7.2.
I could not collect the sosreport from both hosts due to https://bugzilla.redhat.com/show_bug.cgi?id=1296813 and https://bugzilla.redhat.com/show_bug.cgi?id=1343437. 

No, it's not specific to NGN; please see comment #5, which was posted from a RHEL 7.2 host with these components:

 qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
sanlock-3.2.4-2.el7_2.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 9 Fred Rolland 2016-06-07 13:09:45 UTC
Seems related to this patch:
https://gerrit.ovirt.org/#/c/57942/

Comment 10 Nikolai Sednev 2016-06-08 08:14:02 UTC
Adding sosreport from host, alma04, as I see this error message:
Jun 08 11:01:43 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[5481]: IOError: [Errno 24] Too many open files

But this time the agent is not dead yet.

Comment 11 Nikolai Sednev 2016-06-08 08:22:18 UTC
The file was larger than Bugzilla allows, so here is an external link to the host's sosreport:
https://drive.google.com/open?id=0B85BEaDBcF88eWNDbWg4LXNYTm8

Comment 12 Simone Tiraboschi 2016-06-09 09:26:00 UTC
(In reply to Fred Rolland from comment #9)
> Seems related to this patch:
> https://gerrit.ovirt.org/#/c/57942/

Yes, this was on jsonrpc client
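
Judging only by the patch title ("jsonrpc: close client"), the leak pattern is a client that is recreated on every reconnect without the previous one being closed. The sketch below illustrates that pattern on Linux under stated assumptions: Client is a hypothetical stand-in that owns a pipe (two descriptors) in place of the real eventfd and socket, and reconnect is a made-up helper, not code from vdsm or the HA agent.

```python
import os


class Client(object):
    """Hypothetical stand-in for an RPC client that owns descriptors."""

    def __init__(self):
        # A pipe stands in for the eventfd and socket of a real client.
        self._r, self._w = os.pipe()

    def close(self):
        os.close(self._r)
        os.close(self._w)


def open_fds():
    # Number of descriptors held by this process (Linux, /proc mounted).
    return len(os.listdir("/proc/self/fd"))


def reconnect(times, close_old):
    """Recreate the client `times` times; return (descriptors gained, client)."""
    start = open_fds()
    client = None
    for _ in range(times):
        if close_old and client is not None:
            client.close()      # release descriptors before replacing
        client = Client()       # raw fds are not freed by garbage collection
    return open_fds() - start, client
```

Calling reconnect(10, close_old=False) leaves the descriptors of every discarded client open, whereas reconnect(10, close_old=True) keeps only those of the last live client.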

Comment 13 Nikolai Sednev 2016-06-13 14:28:39 UTC
Still reproduces with these components:
Host:
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
rhev-release-4.0.0-12-001.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)


MainThread::ERROR::2016-06-13 17:20:14,349::config::219::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Failed scanning for OVF_STORE due to [Errno 24] Too many open files
MainThread::ERROR::2016-06-13 17:20:14,350::config::235::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
MainThread::WARNING::2016-06-13 17:20:14,351::hosted_engine::477::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Path to volume 7b7535a3-d9d4-4dae-8b72-0bd3e6154308 not found in /rhev/data-center/mnt
MainThread::WARNING::2016-06-13 17:20:14,351::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 445, in start_monitoring
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 676, in _initialize_storage_images
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 244, in refresh_local_conf_file
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 273, in get_volume_path
RuntimeError: Path to volume 7b7535a3-d9d4-4dae-8b72-0bd3e6154308 not found in /rhev/data-center/mnt
MainThread::INFO::2016-06-13 17:20:14,351::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds

Comment 14 Piotr Kliczewski 2016-06-14 07:44:40 UTC
The patch was merged recently on master, and your version of vdsm does not contain it.

Comment 15 Nikolai Sednev 2016-06-21 15:04:02 UTC
MainThread::INFO::2016-06-21 16:59:18,403::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::WARNING::2016-06-21 16:59:20,405::hosted_engine::477::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: [Errno 24] Too many open files
MainThread::WARNING::2016-06-21 16:59:20,405::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files
MainThread::ERROR::2016-06-21 16:59:20,406::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!


I still see the error in the log with the components listed below:
Host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-release-4.0.0-18-001.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-1.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

I power-cycled the host on which the HE VM was running; after the host booted up, I saw the error in the log.

BTW, last message received over CLI terminal, which was still open to the engine, during host's power-cycling, was: "[root@nsednev-he-2 ~]# 
Message from syslogd@nsednev-he-2 at Jun 21 10:41:11 ...
 kernel:BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u8:1:59]" This looks much the same as what is reported here: http://ubuntuforums.org/showthread.php?t=2205211 .

Comment 16 Nikolai Sednev 2016-06-21 15:05:08 UTC
Created attachment 1170317 [details]
latest sosreport from engine

Comment 17 Nikolai Sednev 2016-06-21 15:10:53 UTC
Created attachment 1170319 [details]
latest sosreport from host alma04

Comment 18 Red Hat Bugzilla Rules Engine 2016-06-21 15:20:11 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 19 Piotr Kliczewski 2016-06-21 15:33:14 UTC
Please provide lsof output for the process.

Comment 20 Nikolai Sednev 2016-06-21 16:26:43 UTC
Created attachment 1170334 [details]
lsof from alma03

Comment 21 Oved Ourfali 2016-06-21 18:14:47 UTC
(In reply to Nikolai Sednev from comment #15)

The fix was only part of vdsm 4.18.4.
Please re-test with this version.

Comment 22 Jiri Belka 2016-06-22 12:03:19 UTC
FYI, if this is similar to what I have seen, then it impacts HE migration from 3.6 to 4.0: the "recommended" flow is to end global maintenance after the migration so that the HA agents start the HE VM, but this won't happen because the HA agent hits the "too many open files" issue.

Comment 23 Nikolai Sednev 2016-06-23 08:13:08 UTC
[root@alma03 ~]# yum list | grep vdsm
vdsm.x86_64                         4.18.3-0.el7ev          @rhev-4.0.0-17    

I can't verify this bug until QA receives vdsm 4.18.4.

Comment 24 Eyal Edri 2016-06-23 08:26:22 UTC
VDSM 4.18.3 was released yesterday

Comment 25 Oved Ourfali 2016-06-23 08:37:40 UTC
(In reply to Eyal Edri from comment #24)
> VDSM 4.18.3 was released yesterday

I guess Eyal me a text 4.18.4

Comment 26 Oved Ourfali 2016-06-23 08:38:24 UTC
(In reply to Oved Ourfali from comment #25)
> (In reply to Eyal Edri from comment #24)
> > VDSM 4.18.3 was released yesterday
> 
> I guess Eyal me a text 4.18.4

I meant "meant"... Auto correction.....

Comment 28 Sandro Bonazzola 2016-06-23 10:08:51 UTC
So re-target to 4.0.0?

Comment 29 Nikolai Sednev 2016-06-23 10:50:35 UTC
Works for me on these components:

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux nsednev-he-1.qa.lab.tlv.redhat.com 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Hosts:
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
vdsm-4.18.4-2.el7ev.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 30 Carl Thompson 2016-06-28 08:24:15 UTC
Not sure if you all are saying this is supposed to be fixed in vdsm 4.18.4... I can report that I have vdsm-4.18.4.1-0.el7.centos.x86_64 and the issue is _NOT_ fixed.

According to lsof the number of open files named "[eventfd]" keeps growing until ovirt-ha-agent dies due to too many open files.

Here's what lsof shows for one of these open files:

[root@sexi-albert /]# lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | head -1
ovirt-ha- 56795 vdsm    5u  a_inode     0,9     0    7259 [eventfd]

As you can see the number of these open files goes up quite quickly:

[root@sexi-albert /]# for i in {1..30}; do echo -n "$(date): "; lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | wc -l; sleep 2; done
Tue Jun 28 01:06:53 PDT 2016: 744
Tue Jun 28 01:06:55 PDT 2016: 744
Tue Jun 28 01:06:57 PDT 2016: 744
Tue Jun 28 01:06:59 PDT 2016: 744
Tue Jun 28 01:07:01 PDT 2016: 744
Tue Jun 28 01:07:04 PDT 2016: 744
Tue Jun 28 01:07:06 PDT 2016: 744
Tue Jun 28 01:07:08 PDT 2016: 746
Tue Jun 28 01:07:10 PDT 2016: 746
Tue Jun 28 01:07:12 PDT 2016: 748
Tue Jun 28 01:07:14 PDT 2016: 748
Tue Jun 28 01:07:16 PDT 2016: 748
Tue Jun 28 01:07:18 PDT 2016: 750
Tue Jun 28 01:07:20 PDT 2016: 750
Tue Jun 28 01:07:23 PDT 2016: 752
Tue Jun 28 01:07:25 PDT 2016: 752
Tue Jun 28 01:07:27 PDT 2016: 752
Tue Jun 28 01:07:29 PDT 2016: 754
Tue Jun 28 01:07:31 PDT 2016: 754
Tue Jun 28 01:07:33 PDT 2016: 756
Tue Jun 28 01:07:35 PDT 2016: 756
Tue Jun 28 01:07:37 PDT 2016: 756
Tue Jun 28 01:07:40 PDT 2016: 756
Tue Jun 28 01:07:42 PDT 2016: 756
Tue Jun 28 01:07:44 PDT 2016: 756
Tue Jun 28 01:07:46 PDT 2016: 756
Tue Jun 28 01:07:48 PDT 2016: 758
Tue Jun 28 01:07:50 PDT 2016: 758
Tue Jun 28 01:07:52 PDT 2016: 758
Tue Jun 28 01:07:54 PDT 2016: 760
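
The same count can be taken without lsof by reading the per-process descriptor directory under /proc. The following is only a minimal sketch, not part of vdsm or ovirt-hosted-engine-ha: the function names are made up for illustration, and it assumes a Linux host where eventfd descriptors appear in /proc/<pid>/fd as symlinks to "anon_inode:[eventfd]".

```python
import os
from collections import Counter


def count_fd_targets(fd_dir):
    """Count the entries in fd_dir, grouped by symlink target."""
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[target] += 1
    return counts


def count_eventfds(pid):
    """Number of eventfd descriptors held by pid.

    Assumption: Linux /proc layout; reading another user's process
    usually requires root.
    """
    return count_fd_targets("/proc/%d/fd" % pid)["anon_inode:[eventfd]"]
```

Calling count_eventfds() with the agent's PID every few seconds should show the same steady growth as the loop above if the leak is present.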

This is spamming the crap out of me and my other admins with hundreds of email alerts per day.... I have 5 HA hosted engine hosts and they're all spewing ReinitializeFSM-EngineStarting, EngineStarting-EngineUnexpectedlyDown, StartState-ReinitializeFSM, etc. _ad nauseam_. Please make it stop!  ;-)

This is a cluster upgraded from 3.6 -> 4.0:

[root@sexi-albert /]# rpm -qa | grep -E "(ovirt|vdsm)" | sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-appliance-4.0-20160623.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-release40-4.0.0-5.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

Thanks!

Comment 31 Nikolai Sednev 2016-06-28 10:09:57 UTC
(In reply to Carl Thompson from comment #30)

Can you confirm that all of your hosts and the engine are running the latest components, as listed in https://bugzilla.redhat.com/show_bug.cgi?id=1343005#c29?

Regarding your present issue, please attach sosreports from your hosts and the engine if possible, so we can track down the root cause.

Regarding the email alerts spamming your inbox, I've opened a separate bug: https://bugzilla.redhat.com/show_bug.cgi?id=1350758.

Comment 32 Martin Perina 2016-06-28 11:54:44 UTC
The fix for this bug was reverted in 4.18.4.1 due to BZ1349461, but hopefully it will be part of the next VDSM release.

Comment 33 Doron Fediuck 2016-07-03 08:18:27 UTC
*** Bug 1350687 has been marked as a duplicate of this bug. ***

Comment 34 Sandro Bonazzola 2016-07-05 07:43:54 UTC
oVirt 4.0.0 has been released, closing current release.

Comment 35 Carl Thompson 2016-07-05 18:18:46 UTC
(In reply to Sandro Bonazzola from comment #34)
> oVirt 4.0.0 has been released, closing current release.


Hello, if I read this correctly this bug appears to have been marked as closed because it should be fixed in the current 4.0 release. However, I don't believe it is fixed in 4.0. As I stated in my comment above I have 4.0 and it is still broken there. Was this closed prematurely? Thanks!

Comment 36 Piotr Kliczewski 2016-07-06 08:10:13 UTC
This fix is part of vdsm 4.18.4+. Please make sure that is the version you are running; if you still see the issue, please provide logs.

Comment 37 Christoph 2016-07-06 13:44:00 UTC
ovirt-ha-agent terminates with too many open files:



WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: [Errno 24] Too many open files
WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 444, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 635, in _initialize_vdsm
    timeout=envconstants.VDSCLI_SSL_TIMEOUT
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 187, in connect_vdsm_json_rpc
    requestQueue=requestQueue,
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 222, in connect
    responseQueue)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 212, in _create
    lazy_start=False)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/stompreactor.py", line 576, in StandAloneRpcClient
    reactor = Reactor()
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 200, in __init__
    self._wakeupEvent = AsyncoreEvent(self._map)
  File "/usr/lib/python2.7/site-packages/yajsonrpc/betterAsyncore.py", line 159, in __init__
    self._eventfd = EventFD()
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 61, in __init__
    self._verify_code(fd)
  File "/usr/lib/python2.7/site-packages/vdsm/infra/eventfd/__init__.py", line 111, in _verify_code
    raise OSError(err, msg)
OSError: [Errno 24] Too many open files
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Shutting down the agent because of 3 failures in a row!
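
One plausible reading of the traceback above is that every _initialize_vdsm call builds a new client, and with it a new Reactor and EventFD, without releasing the previous one. That would be consistent with the linked patches titled "jsonrpc: close client". The sketch below only illustrates that general pattern and is not the vdsm code: FakeRpcClient and monitor_once are hypothetical names, and a pipe stands in for the eventfd.

```python
import os
from contextlib import closing


class FakeRpcClient(object):
    """Hypothetical stand-in for a client that owns OS descriptors."""

    def __init__(self):
        # A pipe stands in for the reactor's wakeup eventfd.
        self._rfd, self._wfd = os.pipe()
        self.closed = False

    def fds(self):
        return (self._rfd, self._wfd)

    def ping(self):
        return "pong"

    def close(self):
        # Release the descriptors exactly once.
        if not self.closed:
            os.close(self._rfd)
            os.close(self._wfd)
            self.closed = True


def monitor_once():
    """One monitoring iteration: connect, use, and always close."""
    with closing(FakeRpcClient()) as client:
        return client.ping(), client.fds()
```

Because closing() calls close() even when the body raises, repeated iterations do not accumulate descriptors. Dropping the with block and simply creating a new FakeRpcClient on each iteration would leave the descriptors open for as long as the old object stays referenced.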


[root@node1 ~]# rpm -qa|grep -i -E '(vdsm|ovirt)'|sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

can't find a newer vdsm on http://resources.ovirt.org/pub/ovirt-4.0/rpm/el7/x86_64/

Comment 38 Piotr Kliczewski 2016-07-06 14:06:58 UTC
Simone can you please check it?

Comment 39 Simone Tiraboschi 2016-07-06 14:23:47 UTC
I just checked vdsm.x86_64 4.18.4.1-0.el7.centos and the patch is not in.

Comment 40 Carl Thompson 2016-07-06 20:07:54 UTC
(In reply to Piotr Kliczewski from comment #36)
> This fix it part of vdsm 4.18.4+. Please make sure that it is the version if
> you still see the issue please provide logs.

Read comment #32.

Comment 41 Nikolai Sednev 2016-07-11 09:14:54 UTC
Please fill in "Fixed In Version:" field before moving to ON-QA.

Comment 42 Nikolai Sednev 2016-07-13 17:31:58 UTC
Works for me on these components on host:
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.18.x86_64
mom-0.5.5-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
vdsm-4.18.5.1-1.el7ev.x86_64
rhevm-appliance-20160623.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
rhev-release-4.0.1-1-001.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
Linux version 3.10.0-327.28.2.el7.x86_64 (mockbuild@x86-017.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jun 27 14:48:28 EDT 2016
Linux 3.10.0-327.28.2.el7.x86_64 #1 SMP Mon Jun 27 14:48:28 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)


On engine:
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-4.0.2-0.2.rc1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-462.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-8) (GCC) ) #1 SMP Thu Jul 7 10:15:22 EDT 2016
Linux 3.10.0-462.el7.x86_64 #1 SMP Thu Jul 7 10:15:22 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)


[root@alma04 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-07-11 14:37:35 IDT; 2 days ago
 Main PID: 60170 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─60170 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jul 13 20:30:33 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configuration from the OVF_STORE
Jul 13 20:30:38 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:2ff018b6-5061-4f43-84fa-257b4c95cf53, volUUID:8d8728af-ebab-4f25-b36c-910f296f998c
Jul 13 20:30:38 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:c486a13b-8992-4709-8c21-cbddfca0804b, volUUID:7f705583-4bfa-44d6-a86c-47c7c8a9713f
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Extracting Engine VM OVF from the OVF_STORE
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:OVF_STORE volume path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__2/8fdd4f94-d071-4369-9307-07d7395ef3d9/images/c486a13b-8992-4709-8c21-cbddfca0804b/7f705583-4bfa-44d6-a86c-47c7c8a9713f
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Found an OVF for HE VM, trying to convert
Jul 13 20:30:39 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Got vm.conf from OVF_STORE
Jul 13 20:30:44 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineUp (score: 3400)
Jul 13 20:30:54 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Engine vm running on localhost
Jul 13 20:30:54 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[60170]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM

Comment 43 Sandro Bonazzola 2016-07-19 06:22:42 UTC
Since the problem described in this bug report should be
resolved in oVirt 4.0.1 released on July 19th 2016, it has been closed with a
resolution of CURRENT RELEASE.

For information on the release, and how to update to this release, follow the link below.

If the solution does not work for you, open a new bug report.

http://www.ovirt.org/release/4.0.1/

Comment 44 Yamakasi 2016-07-19 18:13:41 UTC
In my opinion this is not solved in the released version:

I see this happening on the command line of the engine, on several hosts and on various CPUs:

[root@hosted-engine-01 ~]#
Message from syslogd@hosted-engine-01 at Jul 19 14:49:02 ...
 kernel:BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u8:1:3995]

Comment 45 Fabian Deutsch 2016-07-19 18:37:01 UTC
The error in comment 44 looks like a different problem.

Why do you think that this message is related to this bug?

Comment 46 Yamakasi 2016-07-19 18:38:40 UTC
It all happens at the same time, and in the end I need to start the engine manually.

Comment 47 Oved Ourfali 2016-07-19 18:53:14 UTC
That still doesn't make it related. It seems to be a different problem, as Fabian mentioned.
You should open a separate bug for it.

