1219147 – oVirt's message mechanism should permit space allocation warnings to be thrown

Summary: oVirt's message mechanism should permit space allocation warnings to be thrown

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	General
Sub Component:
Version:	4.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	ovirt-4.0.2
Target Release:	4.0.2
Assignee:	Eli Mesika
QA Contact:	Carlos Mestre González
Docs Contact:
URL:
Whiteboard:	infra
Duplicates (1):	1218950
Depends On:	1120670
Blocks:

Reported:	2015-05-06 16:44 UTC by Ori Gofen
Modified:	2016-08-12 14:29 UTC
CC List:	15 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-08-12 14:29:41 UTC
oVirt Team:	Infra
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.0.z+ ylavi: planning_ack+ rule-engine: devel_ack+ acanan: testing_ack+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	59979	0	master	MERGED	core: evict event from cache when manually removed	2016-07-12 06:52:27 UTC
oVirt gerrit	60591	0	ovirt-engine-4.0	MERGED	core: evict event from cache when manually removed	2016-07-12 14:28:31 UTC

Description Ori Gofen 2015-05-06 16:44:18 UTC

Description of problem:
oVirt's message mechanism caches messages and cleans the cache only every once in a while. Hence, if a new warning of the same kind arrives, it is swallowed and is not displayed.

This behavior causes the threshold warning to be thrown only once. That is, if a user reaches the threshold limit, gets a warning, and extends the storage domain, then the second (or any later) time he reaches the warning threshold he won't get any warning.

A second way to reproduce this is to reach the warning threshold, then set the warning threshold to a lower bound, then reach the threshold again: no warning message is thrown.
 
How reproducible:
100%

Steps to Reproduce:
See the description above.

Actual results:
Space allocation warning message is not usable unless we have a very very responsible sysadmin.

Expected results:
A space allocation warning should be thrown once every time a storage domain reaches the configured limit.

Comment 1 Ori Gofen 2015-05-06 16:45:52 UTC

Allon Mureinik 2015-05-06 10:40:28 EDT:

Seems like the audit log suppression mechanism.
Oved, any insight?

Comment 2 Ori Gofen 2015-05-06 16:46:37 UTC

*** Bug 1218950 has been marked as a duplicate of this bug. ***

Comment 3 Oved Ourfali 2015-05-06 17:07:32 UTC

Moving the needinfo to Eli.

Comment 4 Eli Mesika 2015-05-06 21:33:21 UTC

(In reply to Oved Ourfali from comment #3)
> Moving the needinfo to Eli.

Sure, this is our anti-flooding mechanism. The message will not be thrown only once; it will be thrown according to the defined flood rate.

In this case, IIUC, those are the disk space messages, defined as:

VDS_LOW_DISK_SPACE(23, AuditLogSeverity.WARNING,
            AuditLogTimeInterval.HOUR.getValue() * 12),
VDS_LOW_DISK_SPACE_ERROR(24, AuditLogSeverity.ERROR,
            AuditLogTimeInterval.MINUTE.getValue() * 15),

So the first allows one message every 12 hours, and the second one message every 15 minutes.
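
To make the flood rate concrete, here is a minimal sketch of per-message flood suppression using the two intervals quoted above. It is an illustration only, not the ovirt-engine implementation: the class `FloodWindowSketch`, its `log` method, and the string key are all hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of flood suppression; not the ovirt-engine code. */
final class FloodWindowSketch {
    // Intervals taken from the two definitions quoted above.
    static final Duration LOW_DISK_SPACE_WARNING = Duration.ofHours(12);
    static final Duration LOW_DISK_SPACE_ERROR = Duration.ofMinutes(15);

    // Key (for example message type plus object id) -> time it was last shown.
    private final Map<String, Instant> lastShown = new HashMap<>();

    /** Returns true if the message is shown, false if it falls inside the flood window. */
    boolean log(String key, Duration window, Instant now) {
        Instant previous = lastShown.get(key);
        if (previous != null && now.isBefore(previous.plus(window))) {
            return false; // suppressed: same key was shown too recently
        }
        lastShown.put(key, now);
        return true;
    }
}
```

With a 12-hour window, a second warning for the same key 11 hours after the first is suppressed, while one arriving 12 hours or more after it is shown.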

Comment 5 Allon Mureinik 2015-05-06 21:57:13 UTC

Eli - thanks.

From this info, the scenario sounds like NOTABUG to me.
Yaniv - your take on this?

Comment 6 Vered Volansky 2015-05-07 11:00:38 UTC

The problem arose when working on the threshold warnings audit log messages, and I am aware of the flooding mechanism.
What Ori meant when he said that the messages are swallowed after the first time they arise, is that this happens WITHIN the flooding window.
This is a problem when you:
1. Do action_1 that raises a warning/error, audit log is displayed.
2. Act upon the message and rectify the situation.
3. Do action_2 that raises the same warning/error. Audit log is not displayed for this issue until flooding window is over. THIS IS THE PROBLEM.
How big the problem really is, I leave you to decide.

Comment 7 Allon Mureinik 2015-05-07 11:23:06 UTC

Now it's clear, agreed.

What we're missing here is a mechanism to evict the entry from the timeout cache once the problem is resolved.

Pseudocode:

Domain monitoring Quartz job, wakes up every x minutes:
    loggable = createLoggable()
    if domain.hasProblem():  # this exists today
        AuditLogDirector.log(loggable)
    else:  # doesn't exist today
        AuditLogDirector.clearTimeout(loggable)


I could not find such an API in AuditLogDirector.
Eli - am I missing something? And if not - does it make sense to add such an API?
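
As a rough illustration of the proposal, the pseudocode can be turned into a small self-contained Java sketch. Everything here is an assumption for illustration: `EvictableFloodCache`, `monitor`, and the string key are hypothetical, and `clearTimeout` is the API being proposed, not an existing `AuditLogDirector` method.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a flood cache with eviction; not an existing ovirt-engine class. */
final class EvictableFloodCache {
    private final Duration window;
    private final Map<String, Instant> lastShown = new HashMap<>();

    EvictableFloodCache(Duration window) {
        this.window = window;
    }

    /** Shows the event unless the same key was shown within the flood window. */
    boolean log(String key, Instant now) {
        Instant previous = lastShown.get(key);
        if (previous != null && now.isBefore(previous.plus(window))) {
            return false;
        }
        lastShown.put(key, now);
        return true;
    }

    /** Proposed API: forget the key so the next occurrence is shown immediately. */
    void clearTimeout(String key) {
        lastShown.remove(key);
    }

    /** One monitoring pass: log while the problem exists, evict once it is resolved. */
    boolean monitor(String key, boolean hasProblem, Instant now) {
        if (hasProblem) {
            return log(key, now);
        }
        clearTimeout(key);
        return false;
    }
}
```

In this sketch a warning that recurs after the problem was seen as resolved is shown again even though the original flood window has not expired, which is the behavior the pseudocode is after.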

Comment 8 Eli Mesika 2015-05-07 13:22:32 UTC

(In reply to Allon Mureinik from comment #7)

No, we don't have such a mechanism; we handle this per flow.

For example, if you have not configured PM on a host you will get an alert. If you configure it, the alert is cleared, but this is done in the specific PM handling code, not in AuditLogDirector.

The current mechanism prevents duplicate messages on the same instance within the timeout window defined for this audit log type.

So I don't understand what the problem is here.

Please elaborate on what you want to achieve.

Comment 9 Eli Mesika 2015-05-07 13:45:45 UTC

(In reply to Allon Mureinik from comment #7)

After talking with Allon :

The way to solve this issue is to add an option for the user to dismiss an event from the webadmin UI, in the same manner we have today for alerts.

This way the event will be marked as deleted, and if another event occurs it will be shown, since the flood mechanism does not take into account events that were marked as deleted.
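
A minimal sketch of this idea follows, assuming the flood check simply skips stored events whose deleted flag is set. The names `DismissibleEventLog`, `Event`, `log`, and `dismiss` are hypothetical and only illustrate the described behavior; they are not taken from ovirt-engine.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a flood check that ignores events marked as deleted. */
final class DismissibleEventLog {
    /** A stored event; 'deleted' is set when the user dismisses it. */
    static final class Event {
        final String key;
        final Instant time;
        boolean deleted;

        Event(String key, Instant time) {
            this.key = key;
            this.time = time;
        }
    }

    private final Duration window;
    private final List<Event> events = new ArrayList<>();

    DismissibleEventLog(Duration window) {
        this.window = window;
    }

    /** Stores and returns a new event, or returns null if a non-deleted one is inside the window. */
    Event log(String key, Instant now) {
        for (Event e : events) {
            boolean live = !e.deleted;
            boolean recent = now.isBefore(e.time.plus(window));
            if (live && recent && e.key.equals(key)) {
                return null; // suppressed by an earlier, still visible event
            }
        }
        Event created = new Event(key, now);
        events.add(created);
        return created;
    }

    /** Marks the event as deleted, as a dismiss action in the UI would. */
    void dismiss(Event event) {
        event.deleted = true;
    }
}
```

Note that in this sketch a dismissed event stops suppressing later ones whether or not the underlying problem was fixed, so dismissing without fixing makes the next occurrence show up right away.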

Comment 10 Oved Ourfali 2015-05-07 17:25:00 UTC

We have such an item, targeted for 3.6.0 (weekathon item). 
Adding the proper dependency.

Comment 11 Allon Mureinik 2015-05-10 07:04:55 UTC

Oved, Eli, thanks!

I'm marking this bz as [BLOCKED] until the RFE is resolved. IIUC, once the RFE is resolved there's nothing left in this BZ but testing, so I'm leaving it open for tracking reasons.

Comment 12 Red Hat Bugzilla Rules Engine 2015-10-19 10:59:52 UTC

Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 13 Sandro Bonazzola 2016-05-02 09:55:44 UTC

Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 14 Yaniv Lavi 2016-05-23 13:17:09 UTC

oVirt 4.0 beta has been released, moving to RC milestone.

Comment 15 Yaniv Lavi 2016-05-23 13:23:41 UTC

oVirt 4.0 beta has been released, moving to RC milestone.

Comment 17 Carlos Mestre González 2016-06-16 09:09:09 UTC

Tested in rhevm-4.0.0.4-0.1.el7ev.noarch.

Before I mark this as MODIFIED, here is the flow I tested:

1. Create a disk to get to the warning threshold
2. Warning for space allocation appears on event => Remove the event from the Event tab pressing the closing icon
3. Immediately remove the disk
4. Create a disk to trigger the warning again
FAILED => the new event is not shown in the event tab.

The new space threshold warning should be shown according to the bug description, right?

Comment 18 Carlos Mestre González 2016-06-16 11:29:26 UTC

Sorry, that was the wrong flow in my comment #17.

This is how I'm testing: 
1. Create a disk to get to the warning threshold
2. Warning for space allocation appears on event 
3. After the warning remove the disk, wait until space in storage is free again.
4. Create a disk to trigger the warning again
5. The message should appear again (Since we didn't delete the event and the caching mechanism should be fixed so the status of the storage domain was fixed?)

Idan, is this correct?

In this case the message is not shown, the same result as in the description of the bug report, so I'll mark this bug as FAILED QA/MODIFIED.

Comment 19 Carlos Mestre González 2016-06-17 14:22:33 UTC

Following my results in comment #18 (the message is cached and is not shown again after the storage space allocation issue is fixed and then triggered again), I'm putting this back to ASSIGNED.

Comment 20 Red Hat Bugzilla Rules Engine 2016-06-17 14:22:41 UTC

Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 21 Idan Shaby 2016-06-19 14:18:30 UTC

Actually, the flow that you described on comment #17 is the right one to test. IIUC, there's nothing wrong with the flow described on comment #18. The warning should not appear in that case.

Regarding the flow from comment #17, you are right. It doesn't work as expected.
I tried it myself, and another similar scenario where I should have gotten a warning regarding the number of LVs in a storage domain exceeding the threshold - didn't work either.

I guess that there might be a problem with the new mechanism introduced in Bug 1120670.

Jakub, any idea what can be the cause for that?

Comment 22 jniederm 2016-06-20 13:40:58 UTC

Hi Idan, I did Bug 1120670 as a weekathon project, so I don't have much insight. The point was to copy the behavior of the Alerts tab to the Events tab, that is, to add dismissing functionality. My weekathon points of contact were Eli Mesika and Ravi Nori, so maybe they could help.

Comment 23 Idan Shaby 2016-06-20 13:59:19 UTC

Does anyone know about possible issues before I dive into it?

Comment 24 Eli Mesika 2016-06-20 14:35:23 UTC

After talking with Idan, we concluded that he will check whether the audit_log deleted column is updated to true after the event removal. If so, this BZ should be moved to infra.

Comment 25 Idan Shaby 2016-06-21 05:59:13 UTC

I've just run the flow, and indeed the audit log "deleted" column was changed to "TRUE" after I removed the event from the Events pane in the webadmin.

Assigning the BZ to you, Eli.
Thanks!

Comment 26 Martin Perina 2016-07-12 06:53:19 UTC

Moving back to POST as we need to backport to ovirt-engine-4.0 branch

Comment 27 Carlos Mestre González 2016-07-29 08:11:41 UTC

Marking this as failed QA:

1. Create a disk to get to the warning threshold
2. Warning for space allocation appears on event (Critical, Low disk space. iscsi_0 domain has 2 GB of free space.) => Remove the event from the Event tab pressing the closing icon

3. After ~8 seconds of removing the event warning, a new event (?) appears again with the same message, and this repeats forever...

If you remove the event, you would expect not to see the warning again until you fix the issue, or at least not until much longer than ~8 seconds after removing it (a few hours? once a day?).

Comment 28 Red Hat Bugzilla Rules Engine 2016-07-29 08:11:49 UTC

Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 29 Eli Mesika 2016-07-31 08:46:32 UTC

(In reply to Carlos Mestre González from comment #27)
This is not the way to test that.
Removing the event means that you have handled it. If you remove it without handling it, you will get it again; this is working as designed.

Please follow these steps:

1. Create a disk to get to the warning threshold
2. Warning for space allocation appears on event (Critical, Low disk space. iscsi_0 domain has 2 GB of free space.) 
3. Resolve the space problem; in other words, solve the problem BEFORE removing the event
4. Remove the event from the Event tab pressing the closing icon

Comment 30 Carlos Mestre González 2016-08-02 08:51:03 UTC

rhevm-4.0.2-0.1.rc.el7ev.noarch
