Description of problem:
oVirt's message mechanism caches messages and cleans the cache only periodically. As a result, a new warning of the same kind that arrives while the earlier one is still cached is swallowed and is not displayed. This causes the threshold warning to be thrown only once: if a user reaches the warning threshold, gets a warning, and extends the storage domain, then when he reaches the warning threshold a second (or later) time he does not get any warning. A second way to reproduce this is to reach the warning threshold, then set the warning threshold to a lower bound, then reach the threshold again; no error message is thrown.

How reproducible:
100%

Steps to Reproduce:
As explained in the description.

Actual results:
The space allocation warning message is not usable unless we have a very, very responsible sysadmin.

Expected results:
A space allocation warning should be thrown once every time a storage domain reaches the configured limit.
Allon Mureinik 2015-05-06 10:40:28 EDT: Seems like the audit log suppression mechanism. Oved, any insight?
*** Bug 1218950 has been marked as a duplicate of this bug. ***
Moving the needinfo to Eli.
(In reply to Oved Ourfali from comment #3)
> Moving the needinfo to Eli.

Sure, this is our flooding mechanism. The message will not be thrown only once; it will be thrown according to the defined flood rate.

In that case, IIUC, those are the disk space messages, defined as:

    VDS_LOW_DISK_SPACE(23, AuditLogSeverity.WARNING, AuditLogTimeInterval.HOUR.getValue() * 12),
    VDS_LOW_DISK_SPACE_ERROR(24, AuditLogSeverity.ERROR, AuditLogTimeInterval.MINUTE.getValue() * 15),

So the first allows a message every 12 hours and the second every 15 minutes.
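To make the two flood windows concrete, here is a minimal Java sketch of a per-type suppression check. It is not the engine's code: `FloodWindows`, `isSuppressed`, and the use of `java.time.Duration` are assumptions made for illustration. Only the two event names and their 12-hour and 15-minute intervals come from the definitions quoted above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical helper for illustration; not part of ovirt-engine.
final class FloodWindows {
    // Intervals taken from the enum definitions quoted in this comment.
    static final Map<String, Duration> WINDOWS = Map.of(
            "VDS_LOW_DISK_SPACE", Duration.ofHours(12),
            "VDS_LOW_DISK_SPACE_ERROR", Duration.ofMinutes(15));

    // True if a new event of this type would be swallowed at time `now`,
    // given that the previous one was logged at `lastLogged`.
    static boolean isSuppressed(String type, Instant lastLogged, Instant now) {
        Duration window = WINDOWS.get(type);
        return window != null && now.isBefore(lastLogged.plus(window));
    }

    public static void main(String[] args) {
        Instant first = Instant.parse("2015-05-06T10:00:00Z");
        Instant oneHourLater = first.plus(Duration.ofHours(1));
        // Warning: one hour later is still inside the 12-hour window.
        System.out.println(isSuppressed("VDS_LOW_DISK_SPACE", first, oneHourLater));
        // Error: one hour later is past the 15-minute window.
        System.out.println(isSuppressed("VDS_LOW_DISK_SPACE_ERROR", first, oneHourLater));
    }
}
```

Under these assumptions, a second low-space warning one hour after the first is swallowed, while a second low-space error is logged.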
Eli - thanks. From this info, the scenario sounds like NOTABUG to me. Yaniv - your take on this?
The problem arose when working on the threshold warning audit log messages, and I am aware of the flooding mechanism. What Ori meant when he said that the messages are swallowed after the first time they arise is that this happens WITHIN the flooding window. This is a problem when you:

1. Do action_1 that raises a warning/error; the audit log is displayed.
2. Act upon the message and rectify the situation.
3. Do action_2 that raises the same warning/error. The audit log is not displayed for this issue until the flooding window is over. THIS IS THE PROBLEM.

How big the problem really is, I leave you to decide.
Now it's clear, agreed. What we're missing here is a mechanism to evict the entry from the timeout cache once the problem is resolved.

Pseudocode:

Domain monitoring quartz job, wakes up every x mins:

    loggable = createLoggable()
    if (domain.hasProblem()):
        # This exists today
        AuditLogDirector.log(loggable)
    else:
        # doesn't exist today
        AuditLogDirector.clearTimeout(loggable)

I could not find such an API in AuditLogDirector. Eli - am I missing something? And if not - does it make sense to add such an API?
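The proposal can be sketched in Java as follows. This is a toy model under stated assumptions, not the real AuditLogDirector: `FloodCache`, its `log` signature, and the string key are hypothetical, and the sketch is single-threaded. It only shows how a `clearTimeout` call in the "no problem" branch lets the next warning through before the window expires.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the flood cache; not the real AuditLogDirector.
final class FloodCache {
    // Maps an event key to the instant its flood window ends.
    private final Map<String, Instant> expiry = new HashMap<>();

    // Returns true if the event was logged, false if it was swallowed.
    boolean log(String key, Duration window, Instant now) {
        Instant until = expiry.get(key);
        if (until != null && now.isBefore(until)) {
            return false; // still inside the flood window
        }
        expiry.put(key, now.plus(window));
        return true;
    }

    // The proposed addition: forget the window once the problem is resolved.
    void clearTimeout(String key) {
        expiry.remove(key);
    }
}

final class DomainMonitorDemo {
    public static void main(String[] args) {
        FloodCache cache = new FloodCache();
        Duration window = Duration.ofHours(12);
        Instant t0 = Instant.parse("2015-05-06T10:00:00Z");
        String key = "LOW_SPACE:domain-1"; // hypothetical event type + domain key

        System.out.println(cache.log(key, window, t0));                  // first warning
        System.out.println(cache.log(key, window, t0.plusSeconds(60)));  // repeat inside the window
        cache.clearTimeout(key);                                         // monitor saw the domain healthy
        System.out.println(cache.log(key, window, t0.plusSeconds(120))); // threshold reached again
    }
}
```

In this model the third call is logged even though it falls inside the original 12-hour window, which is the behavior the bug asks for.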
(In reply to Allon Mureinik from comment #7)

No, we don't have such a mechanism; we do this per flow. For example, if you have not configured PM on a host you will get an alert. If you configure it, the alert is cleared, but this is done in the specific PM handling code, not in AuditLogDirector.

The current mechanism prevents duplicate messages on the same instance within the timeout window defined for this audit log type.

So, I don't understand what the problem is here. Please elaborate on what you want to achieve.
(In reply to Allon Mureinik from comment #7)

After talking with Allon: the way to solve this issue is to add an option for the user to dismiss an event from the webadmin UI, in the same manner we have today for alerts. This way the event will be marked as deleted, and if another event arrives it will be shown, since the flood mechanism does not take into account events that were marked as deleted.
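A rough Java sketch of that idea, assuming a simple in-memory list in place of the audit log table: the flood check skips events whose `deleted` flag is set, so dismissing an event reopens the window. `AuditEvent` and `DismissAwareLog` are hypothetical names for illustration and do not reflect the engine's schema or DAO layer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of an audit log row.
final class AuditEvent {
    final String type;
    final Instant loggedAt;
    boolean deleted; // set when the user dismisses the event in the UI

    AuditEvent(String type, Instant loggedAt) {
        this.type = type;
        this.loggedAt = loggedAt;
    }
}

final class DismissAwareLog {
    private final List<AuditEvent> events = new ArrayList<>();

    // Logs the event unless a non-deleted event of the same type is still
    // inside its flood window. Returns null when the event is swallowed.
    AuditEvent log(String type, Duration window, Instant now) {
        for (AuditEvent e : events) {
            if (!e.deleted && e.type.equals(type) && now.isBefore(e.loggedAt.plus(window))) {
                return null;
            }
        }
        AuditEvent created = new AuditEvent(type, now);
        events.add(created);
        return created;
    }

    public static void main(String[] args) {
        DismissAwareLog log = new DismissAwareLog();
        Duration window = Duration.ofHours(12);
        Instant t0 = Instant.parse("2015-05-06T10:00:00Z");

        AuditEvent first = log.log("LOW_SPACE", window, t0);
        System.out.println(log.log("LOW_SPACE", window, t0.plusSeconds(60)) == null);  // swallowed
        first.deleted = true; // user dismisses the event after fixing the problem
        System.out.println(log.log("LOW_SPACE", window, t0.plusSeconds(120)) != null); // shown again
    }
}
```

The design choice here is that the user, not the monitoring code, signals that the problem was handled, which matches how alerts are dismissed.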
We have such an item, targeted for 3.6.0 (weekathon item). Adding the proper dependency.
Oved, Eli, thanks! I'm marking this bz as [BLOCKED] until the RFE is resolved. IIUC, once the RFE is resolved there's nothing left in this BZ but testing, so I'm leaving it open for tracking reasons.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Tested in rhevm-4.0.0.4-0.1.el7ev.noarch. Before I mark this as MODIFIED, here is the flow I tested:

1. Create a disk to get to the warning threshold.
2. The warning for space allocation appears in the Events tab => remove the event from the Events tab by pressing the closing icon.
3. Immediately remove the disk.
4. Create a disk to trigger the warning again.

FAILED => the new event is not shown in the Events tab. The new space threshold warning should be shown according to the bug description, right?
Sorry, that was the wrong flow in my comment #17. This is how I'm testing:

1. Create a disk to get to the warning threshold.
2. The warning for space allocation appears in the Events tab.
3. After the warning, remove the disk and wait until the space in the storage domain is free again.
4. Create a disk to trigger the warning again.
5. The message should appear again (since we didn't delete the event, and the caching mechanism should be fixed so that it notices the storage domain's status was fixed?).

Idan, is this correct? In this case the message is not shown, the same result as in the description of the bug report, so I'll mark this bug as FAILED QA/MODIFIED.
Following my results in comment #18 (the message is cached and is not shown again after the storage space allocation issue is fixed and then triggered again), I'm putting this back to ASSIGNED.
Actually, the flow that you described on comment #17 is the right one to test. IIUC, there's nothing wrong with the flow described on comment #18. The warning should not appear in that case. Regarding the flow from comment #17, you are right. It doesn't work as expected. I tried it myself, and another similar scenario where I should have gotten a warning regarding the number of LVs in a storage domain exceeding the threshold - didn't work either. I guess that there might be a problem with the new mechanism introduced in Bug 1120670. Jakub, any idea what can be the cause for that?
Hi Idan, I did Bug 1120670 as a weekathon project, so I don't have much insight. The point was to copy the behavior of the Alerts tab to the Events tab, that is, to add the dismissing functionality. My weekathon points of contact were Eli Mesika and Ravi Nori, so maybe they could help.
Does anyone know about possible issues before I dive into it?
After talking with Idan, we concluded that he will check whether the audit_log deleted column is updated to true after the event removal. If so, this BZ should be moved to infra.
I've just run the flow, and indeed the audit log "deleted" column was changed to "TRUE" after I removed the event from the Events pane in the webadmin. Assigning the BZ to you, Eli. Thanks!
Moving back to POST as we need to backport to ovirt-engine-4.0 branch
Marking this as failed QA:

1. Create a disk to get to the warning threshold.
2. The warning for space allocation appears in the Events tab (Critical, Low disk space. iscsi_0 domain has 2 GB of free space.) => remove the event from the Events tab by pressing the closing icon.
3. About 8 seconds after removing the event, a new event (?) appears with the same message, and this repeats forever...

If you remove the event, you expect not to see the warning again until you fix the issue, or at least not until well over ~8 seconds later (a few hours? once a day?).
(In reply to Carlos Mestre González from comment #27)

This is not the way to test this. If you remove the event, that means that you have handled it. If you removed it without handling it, you will get it again; this is working as designed.

Please follow these steps:

1. Create a disk to get to the warning threshold.
2. The warning for space allocation appears in the Events tab (Critical, Low disk space. iscsi_0 domain has 2 GB of free space.).
3. Resolve the space problem; in other words, solve the problem BEFORE removing the event.
4. Remove the event from the Events tab by pressing the closing icon.
rhevm-4.0.2-0.1.rc.el7ev.noarch