Description of problem:

Taking this scenario from original https://bugzilla.redhat.com/show_bug.cgi?id=1343005, comment #30.

Not sure if you all are saying this is supposed to be fixed in vdsm 4.18.4... I can report that I have vdsm-4.18.4.1-0.el7.centos.x86_64 and the issue is _NOT_ fixed. According to lsof the number of open files named "[eventfd]" keeps growing until ovirt-ha-agent dies due to too many open files.

Here's what lsof shows for one of these open files:

[root@sexi-albert /]# lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | head -1
ovirt-ha- 56795 vdsm 5u a_inode 0,9 0 7259 [eventfd]

As you can see the number of these open files goes up quite quickly:

[root@sexi-albert /]# for i in {1..30}; do echo -n "$(date): "; lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | wc -l; sleep 2; done
Tue Jun 28 01:06:53 PDT 2016: 744
Tue Jun 28 01:06:55 PDT 2016: 744
Tue Jun 28 01:06:57 PDT 2016: 744
Tue Jun 28 01:06:59 PDT 2016: 744
Tue Jun 28 01:07:01 PDT 2016: 744
Tue Jun 28 01:07:04 PDT 2016: 744
Tue Jun 28 01:07:06 PDT 2016: 744
Tue Jun 28 01:07:08 PDT 2016: 746
Tue Jun 28 01:07:10 PDT 2016: 746
Tue Jun 28 01:07:12 PDT 2016: 748
Tue Jun 28 01:07:14 PDT 2016: 748
Tue Jun 28 01:07:16 PDT 2016: 748
Tue Jun 28 01:07:18 PDT 2016: 750
Tue Jun 28 01:07:20 PDT 2016: 750
Tue Jun 28 01:07:23 PDT 2016: 752
Tue Jun 28 01:07:25 PDT 2016: 752
Tue Jun 28 01:07:27 PDT 2016: 752
Tue Jun 28 01:07:29 PDT 2016: 754
Tue Jun 28 01:07:31 PDT 2016: 754
Tue Jun 28 01:07:33 PDT 2016: 756
Tue Jun 28 01:07:35 PDT 2016: 756
Tue Jun 28 01:07:37 PDT 2016: 756
Tue Jun 28 01:07:40 PDT 2016: 756
Tue Jun 28 01:07:42 PDT 2016: 756
Tue Jun 28 01:07:44 PDT 2016: 756
Tue Jun 28 01:07:46 PDT 2016: 756
Tue Jun 28 01:07:48 PDT 2016: 758
Tue Jun 28 01:07:50 PDT 2016: 758
Tue Jun 28 01:07:52 PDT 2016: 758
Tue Jun 28 01:07:54 PDT 2016: 760

This is spamming the crap out of me and my other admins with hundreds of email alerts per day....
I have 5 HA hosted engine hosts and they're all spewing ReinitializeFSM-EngineStarting, EngineStarting-EngineUnexpectedlyDown, StartState-ReinitializeFSM, etc. _ad nauseam_. Please make it stop! ;-)

This is a cluster upgraded from 3.6 -> 4.0:

[root@sexi-albert /]# rpm -qa | grep -E "(ovirt|vdsm)" | sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-appliance-4.0-20160623.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-release40-4.0.0-5.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

Thanks!

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Too many emails are being sent, spamming the inboxes of system administrators.

Expected results:
Email messages should not spam the inboxes of recipients.

Additional info:
As mentioned in BZ1343005, we needed to revert the fix for BZ1343005 in vdsm-4.18.4.1 because it caused BZ1349461. Hopefully both fixes will be part of the next VDSM release.
(In reply to Martin Perina from comment #1)
> As mentioned in BZ1343005, we needed to revert the fix for BZ1343005 in
> vdsm-4.18.4.1 because it caused BZ1349461, hopefully both fixes will be part
> of next VDSM release

Sure, but I mean that email notifications should be minimized and we have to prevent spamming our customers with the same errors.
Once the original issue is fixed there will be no spam. Closing this one as NOTABUG.
(In reply to Oved Ourfali from comment #3)
> Once the original issue is fixed there will be no spam.
> Closing this one as NOTABUG.

Strongly disagree. Whenever the same issue is triggered repeatedly, the same email message is sent over and over again. Please consider redesigning the notifier and reopening this bug.
There are mechanisms to prevent event flood. This is a per-vertical decision with regards to their events. Nothing to fix here in general.
(In reply to Oved Ourfali from comment #5)
> There are mechanisms to prevent event flood.
> This is a per-vertical decision with regards to their events.
> Nothing to fix here in general.

Sure, if the mechanisms from https://bugzilla.redhat.com/show_bug.cgi?id=1317468 prevent this from happening, then I'm OK with it.
I would rather change this to WONTFIX, as this issue was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1317468, but the customer's report about email spam came before that fix.
I'm the reporter of this bug. It was reported _after_ bug 1317468 and as far as I can tell has nothing to do with bug 1317468. If there's a way currently to rate limit HE notifications can you please tell me where it is? I agree that bug 1343005 is the direct cause of my issue and if that bug is ever resolved it will be mitigated substantially. But there may be other reasons why an HA agent may be shut down / fail and I believe oVirt could do a smarter job managing the resulting onslaught of alerts.
Redirecting the needinfo to Martin who owns the notification code within HA daemons.
The agent won't send anything when it shuts down or dies. Rate limiting is not supported, but you can limit according to the state transition description. Check the /etc/ovirt-hosted-engine-ha/broker.conf and specifically the [notify] section in that file. The state_transition key contains a regular expression that is used as a trigger for sending an email (case insensitive).
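To try out a candidate state_transition value before putting it in broker.conf, something like the following rough sketch may help. It checks the transition names quoted in the description against a pattern, case insensitively. The pattern shown is purely hypothetical, and grep -iE only approximates the broker's own matching (the sketch assumes an unanchored, search-style match; the broker's regular expression dialect and anchoring may differ).

```shell
#!/bin/sh
# Hypothetical candidate for the state_transition key in the [notify]
# section of /etc/ovirt-hosted-engine-ha/broker.conf. This value is an
# illustration only, not a recommended or default setting.
pattern='EngineDown|EngineUnexpectedlyDown'

# Succeeds (exit status 0) when the transition description in $1 matches
# $pattern case insensitively. Assumption: an unanchored search via grep -iE
# is close enough to how the broker applies the expression.
matches_transition() {
    printf '%s\n' "$1" | grep -qiE "$pattern"
}

# Transition names taken from the bug description.
for t in ReinitializeFSM-EngineStarting \
         EngineStarting-EngineUnexpectedlyDown \
         StartState-ReinitializeFSM; do
    if matches_transition "$t"; then
        echo "$t: would trigger an email"
    else
        echo "$t: suppressed"
    fi
done
```

With this pattern only EngineStarting-EngineUnexpectedlyDown would trigger an email. Note that this narrows which transitions notify; it does not rate limit repeats of a matching transition.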