1350758 – Too many spamming emails being sent from hypervisors in case of an error.

Summary: Too many spamming emails being sent from hypervisors in case of an error.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Bindings-API
Sub Component:
Version:	4.18.4
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Piotr Kliczewski
QA Contact:	Lukas Svaty
Docs Contact:
URL:
Whiteboard:
Depends On:	1343005
Blocks:

Reported:	2016-06-28 10:08 UTC by Nikolai Sednev
Modified:	2019-04-17 12:20 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-06-29 12:13:13 UTC
oVirt Team:	UX
Embargoed:
Dependent Products:
Flags:	rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack?

Description Nikolai Sednev 2016-06-28 10:08:11 UTC

Description of problem:
Taking this scenario from the original bug, https://bugzilla.redhat.com/show_bug.cgi?id=1343005, comment #30.

Not sure if you all are saying this is supposed to be fixed in vdsm 4.18.4... I can report that I have vdsm-4.18.4.1-0.el7.centos.x86_64 and the issue is _NOT_ fixed.

According to lsof the number of open files named "[eventfd]" keeps growing until ovirt-ha-agent dies due to too many open files.

Here's what lsof shows for one of these open files:

[root@sexi-albert /]# lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | head -1
ovirt-ha- 56795 vdsm    5u  a_inode     0,9     0    7259 [eventfd]

As you can see the number of these open files goes up quite quickly:

[root@sexi-albert /]# for i in {1..30}; do echo -n "$(date): "; lsof -p $(pidof -x ovirt-ha-agent) | grep eventfd | wc -l; sleep 2; done
Tue Jun 28 01:06:53 PDT 2016: 744
Tue Jun 28 01:06:55 PDT 2016: 744
Tue Jun 28 01:06:57 PDT 2016: 744
Tue Jun 28 01:06:59 PDT 2016: 744
Tue Jun 28 01:07:01 PDT 2016: 744
Tue Jun 28 01:07:04 PDT 2016: 744
Tue Jun 28 01:07:06 PDT 2016: 744
Tue Jun 28 01:07:08 PDT 2016: 746
Tue Jun 28 01:07:10 PDT 2016: 746
Tue Jun 28 01:07:12 PDT 2016: 748
Tue Jun 28 01:07:14 PDT 2016: 748
Tue Jun 28 01:07:16 PDT 2016: 748
Tue Jun 28 01:07:18 PDT 2016: 750
Tue Jun 28 01:07:20 PDT 2016: 750
Tue Jun 28 01:07:23 PDT 2016: 752
Tue Jun 28 01:07:25 PDT 2016: 752
Tue Jun 28 01:07:27 PDT 2016: 752
Tue Jun 28 01:07:29 PDT 2016: 754
Tue Jun 28 01:07:31 PDT 2016: 754
Tue Jun 28 01:07:33 PDT 2016: 756
Tue Jun 28 01:07:35 PDT 2016: 756
Tue Jun 28 01:07:37 PDT 2016: 756
Tue Jun 28 01:07:40 PDT 2016: 756
Tue Jun 28 01:07:42 PDT 2016: 756
Tue Jun 28 01:07:44 PDT 2016: 756
Tue Jun 28 01:07:46 PDT 2016: 756
Tue Jun 28 01:07:48 PDT 2016: 758
Tue Jun 28 01:07:50 PDT 2016: 758
Tue Jun 28 01:07:52 PDT 2016: 758
Tue Jun 28 01:07:54 PDT 2016: 760
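
For anyone who wants to track the same counter without lsof, here is a rough Python sketch. It only assumes the standard Linux behaviour that each entry under /proc/<pid>/fd is a symlink and that eventfd descriptors resolve to a target containing "[eventfd]". The function name count_fd_links is made up for this illustration and is not part of vdsm or ovirt-hosted-engine-ha.

```python
import os


def count_fd_links(fd_dir, needle="[eventfd]"):
    """Count descriptors in fd_dir whose symlink target contains needle.

    fd_dir is expected to be a /proc/<pid>/fd directory (hypothetical helper,
    not an oVirt API).
    """
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if needle in target:
            count += 1
    return count
```

Run as root (or as the user owning the process), count_fd_links("/proc/%d/fd" % pid) with the agent's PID should give roughly the same number as the lsof pipeline above; sampling it every few seconds would show the same growth.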

This is spamming the crap out of me and my other admins with hundreds of email alerts per day.... I have 5 HA hosted engine hosts and they're all spewing ReinitializeFSM-EngineStarting, EngineStarting-EngineUnexpectedlyDown, StartState-ReinitializeFSM, etc. _ad nauseam_. Please make it stop!  ;-)

This is a cluster upgraded from 3.6 -> 4.0:

[root@sexi-albert /]# rpm -qa | grep -E "(ovirt|vdsm)" | sort
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-engine-appliance-4.0-20160623.1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7.centos.noarch
ovirt-host-deploy-1.5.0-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7.centos.noarch
ovirt-imageio-common-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-imageio-daemon-0.3.0-0.201606191345.git9f3d6d4.el7.centos.noarch
ovirt-release40-4.0.0-5.noarch
ovirt-setup-lib-1.0.2-1.el7.centos.noarch
ovirt-vmconsole-1.0.3-1.el7.centos.noarch
ovirt-vmconsole-host-1.0.3-1.el7.centos.noarch
vdsm-4.18.4.1-0.el7.centos.x86_64
vdsm-api-4.18.4.1-0.el7.centos.noarch
vdsm-cli-4.18.4.1-0.el7.centos.noarch
vdsm-hook-vmfex-dev-4.18.4.1-0.el7.centos.noarch
vdsm-infra-4.18.4.1-0.el7.centos.noarch
vdsm-jsonrpc-4.18.4.1-0.el7.centos.noarch
vdsm-python-4.18.4.1-0.el7.centos.noarch
vdsm-xmlrpc-4.18.4.1-0.el7.centos.noarch
vdsm-yajsonrpc-4.18.4.1-0.el7.centos.noarch

Thanks!

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Too many emails are received, which spams the inboxes of system administrators.

Expected results:
Email messages should not spam the recipients' inboxes.

Additional info:

Comment 1 Martin Perina 2016-06-28 11:59:00 UTC

As mentioned in BZ1343005, we needed to revert the fix for BZ1343005 in vdsm-4.18.4.1 because it caused BZ1349461. Hopefully both fixes will be part of the next VDSM release.

Comment 2 Nikolai Sednev 2016-06-28 12:37:33 UTC

(In reply to Martin Perina from comment #1)
> As mentioned in BZ1343005, we needed to revert the fix for BZ1343005 in
> vdsm-4.18.4.1 because it caused BZ1349461, hopefully both fixes will be part
> of next VDSM release

Sure, but I mean that email notifications should be minimized and we have to prevent spamming our customers with the same errors.

Comment 3 Oved Ourfali 2016-06-29 12:13:13 UTC

Once the original issue is fixed there will be no spam.
Closing this one as NOTABUG.

Comment 4 Nikolai Sednev 2016-06-29 12:47:02 UTC

(In reply to Oved Ourfali from comment #3)
> Once the original issue is fixed there will be no spam.
> Closing this one as NOTABUG.

Strongly disagree. If the same issue is triggered repeatedly, the same email message is sent over and over again. Please consider redesigning the notifier and reopening this bug.

Comment 5 Oved Ourfali 2016-06-29 12:49:16 UTC

There are mechanisms to prevent event flood.
This is a per-vertical decision with regards to their events.
Nothing to fix here in general.

Comment 6 Nikolai Sednev 2016-06-29 13:42:11 UTC

(In reply to Oved Ourfali from comment #5)
> There are mechanisms to prevent event flood.
> This is a per-vertical decision with regards to their events.
> Nothing to fix here in general.

Sure, if the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1317468 provides mechanisms to prevent this from happening, then I'm OK with it.

Comment 7 Nikolai Sednev 2016-06-29 13:43:53 UTC

I'd rather change this to WONTFIX, as this issue was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1317468, but the customer's report about the email spam came before that fix.

Comment 8 Carl Thompson 2016-06-29 15:58:56 UTC

I'm the reporter of this bug. It was reported _after_ bug 1317468 and as far as I can tell has nothing to do with bug 1317468. If there's a way currently to rate limit HE notifications can you please tell me where it is?

I agree that bug 1343005 is the direct cause of my issue and if that bug is ever resolved it will be mitigated substantially. But there may be other reasons why an HA agent may be shut down / fail and I believe oVirt could do a smarter job managing the resulting onslaught of alerts.
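
To make the request concrete, the kind of throttling meant here could look roughly like the sketch below. It is purely illustrative and is not how the HA notifier is implemented: the class name NotificationThrottle, the min_interval_s parameter and the injectable clock are all assumptions made for this example. The idea is simply to suppress a repeat of the same state transition if it was already mailed within a configurable interval.

```python
import time


class NotificationThrottle:
    """Suppress repeated notifications for the same state transition.

    Hypothetical helper for illustration only; not part of
    ovirt-hosted-engine-ha.
    """

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock        # injectable so the logic can be tested
        self._last_sent = {}      # transition name -> time of last email

    def should_send(self, transition):
        now = self.clock()
        last = self._last_sent.get(transition)
        if last is not None and now - last < self.min_interval_s:
            return False          # same transition was mailed too recently
        self._last_sent[transition] = now
        return True
```

With something like NotificationThrottle(3600), a host flapping through EngineStarting-EngineUnexpectedlyDown would mail that transition at most once an hour, while a different transition would still get through immediately.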

Comment 9 Sandro Bonazzola 2016-07-19 09:02:51 UTC

Redirecting the needinfo to Martin who owns the notification code within HA daemons.

Comment 10 Martin Sivák 2016-07-19 09:13:02 UTC

The agent won't send anything when it shuts down or dies. Rate limiting is not supported, but you can limit according to the state transition description.

Check the /etc/ovirt-hosted-engine-ha/broker.conf and specifically the [notify] section in that file. The state_transition key contains a regular expression that is used as a trigger for sending an email (case insensitive).
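
As a rough illustration of how such a case-insensitive regular expression narrows down the emails, the Python sketch below filters the transition names quoted in the description. The function name and the pattern are hypothetical, and using re.search (match anywhere in the string) is an assumption; this thread does not say exactly how the broker applies the expression.

```python
import re


def transition_triggers_email(pattern, transition):
    """Return True if the transition description matches the pattern.

    Case-insensitive, as described for state_transition. The use of
    re.search here is an assumption, not confirmed broker behaviour.
    """
    return re.search(pattern, transition, re.IGNORECASE) is not None


# Transition names taken from the bug description.
transitions = [
    "ReinitializeFSM-EngineStarting",
    "EngineStarting-EngineUnexpectedlyDown",
    "StartState-ReinitializeFSM",
]

# Hypothetical pattern: only mail when the engine goes down unexpectedly.
pattern = "unexpectedlydown"
selected = [t for t in transitions if transition_triggers_email(pattern, t)]
```

Under these assumptions only EngineStarting-EngineUnexpectedlyDown would be selected, so narrowing the expression cuts down the number of emails even without real rate limiting.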
