Bug 1288862

Summary: Storage domain's space threshold warnings appear only the first time
Product: [oVirt] ovirt-engine Reporter: Raz Tamir <ratamir>
Component: BLL.Storage Assignee: Allon Mureinik <amureini>
Status: CLOSED WONTFIX QA Contact: Aharon Canan <acanan>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.6.0.2 CC: amureini, bugs, emesika, kgoldbla, oourfali, ratamir, ykaul, ylavi
Target Milestone: --- Keywords: Automation
Target Release: --- Flags: sbonazzo: ovirt-4.1-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-19 21:18:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
engine log     none
engine log     none

Description Raz Tamir 2015-12-06 17:41:28 UTC
Description of problem:
After configuring 'warning_low_space_indicator' or 'critical_space_action_blocker' and creating a preallocated disk that reaches the new threshold, the warning message shows up.
After removing the disk and waiting for the task to finish, I create a new preallocated disk that reaches the threshold again, and no warning appears.

Version-Release number of selected component (if applicable):
rhevm-3.6.1.1-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. On a 50 GB SD, change 'critical_space_action_blocker' to 40
2. Create a new 10 GB preallocated disk
3.

Actual results:
explained above

Expected results:


Additional info:

Comment 1 Raz Tamir 2015-12-06 17:44:37 UTC
It is impossible to add an attachment (logs). I will try again later

Comment 2 Allon Mureinik 2015-12-06 17:54:31 UTC
(In reply to ratamir from comment #1)
> It is impossible to add an attachment (logs). I will try again later
While you're at it, could you also please attach a screenshot? I'm not entirely clear where you're expecting to see this warning.
Thanks!

Comment 3 Red Hat Bugzilla Rules Engine 2015-12-06 17:54:35 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 4 Raz Tamir 2015-12-06 19:21:16 UTC
Created attachment 1102827 [details]
engine log

Hi Allon,
I'm expecting to see the warning under the events tab at the bottom.
I'm also attaching a screenshot, as requested.

Comment 5 Raz Tamir 2015-12-06 19:33:14 UTC
Created attachment 1102840 [details]
engine log

Comment 6 Allon Mureinik 2015-12-07 08:00:11 UTC
The engine suppresses duplicate audit logs within the same hour; this is by design.

Why does this affect automation?

Comment 7 Raz Tamir 2015-12-07 10:12:27 UTC
I need a way to make sure the threshold was reached.
I am reading the engine logs for these warning messages.

Comment 8 Allon Mureinik 2015-12-07 12:17:07 UTC
(In reply to ratamir from comment #7)
> I need a way to make sure the threshold reached.
> I am reading the engine logs for this warning messages.

I don't see us changing this behavior for the automation's sake.

Oved - can you, or someone from your team familiar with the flood-prevention mechanism, review this case and suggest a solution?
Thanks!

Comment 9 Oved Ourfali 2015-12-07 12:22:00 UTC
Eli - can you take a look?
I think the only option is changing it for everyone, as it is not configurable per user or anything like that, but Eli knows best.

Comment 10 Raz Tamir 2015-12-07 14:13:49 UTC
Hi Allon,
I think this audit log is unique and needs to be displayed every time we exceed the threshold.
This is not only an automation problem; a user should be notified with a warning message every time this happens.
I agree that we don't want to fill the logs and the events with warnings, but when we reach this threshold, I want to be notified.

Maybe we can show the warning every time the used space surpasses the threshold.

Comment 11 Eli Mesika 2015-12-07 14:20:45 UTC
Looking at the code I see that 

The Warning message will appear once in 12 hours 
The Error message will appear once in 15 minutes 

The origin of the problem is:

The flood mechanism compares events according to the AuditLog equals() method.
The AuditLog object currently does not contain any disk ID, so if, for example, disk A generates an event E1 and disk B (on the same storage domain) generates the same event E2 within the configured flood window, the event from disk B will be suppressed by the flood mechanism, since E1.equals(E2) is TRUE.

If disk A and disk B are not on the same storage domain, then E1.equals(E2) is FALSE, since the storage domain ID is a property of the AuditLog class.
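
The suppression described above can be sketched roughly as follows. This is only an illustration, not the engine's code: the Event and FloodFilter classes, their fields, and the "domain-1"/"domain-2" IDs are hypothetical stand-ins, and the sketch assumes equality covers just the event type and the storage domain ID.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class FloodSuppressionSketch {

    /** Hypothetical stand-in for AuditLog: equality covers only the event type and storage domain ID. */
    static final class Event {
        final String type;
        final String storageDomainId;

        Event(String type, String storageDomainId) {
            this.type = type;
            this.storageDomainId = storageDomainId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Event)) {
                return false;
            }
            Event other = (Event) o;
            return type.equals(other.type) && storageDomainId.equals(other.storageDomainId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, storageDomainId);
        }
    }

    /** Remembers when each distinct event was last logged and suppresses equal events inside the window. */
    static final class FloodFilter {
        private final Duration window;
        private final Map<Event, Instant> lastLogged = new HashMap<>();

        FloodFilter(Duration window) {
            this.window = window;
        }

        /** Returns true if the event should be logged, false if it is suppressed. */
        boolean shouldLog(Event event, Instant now) {
            Instant previous = lastLogged.get(event);
            if (previous != null && now.isBefore(previous.plus(window))) {
                return false; // an equal event was already logged inside the window
            }
            lastLogged.put(event, now);
            return true;
        }
    }

    public static void main(String[] args) {
        FloodFilter filter = new FloodFilter(Duration.ofHours(12));
        Instant t0 = Instant.parse("2015-12-06T17:00:00Z");

        // Events triggered by two different disks on the same domain are equal here.
        Event fromDiskA = new Event("IRS_DISK_SPACE_LOW", "domain-1");
        Event fromDiskB = new Event("IRS_DISK_SPACE_LOW", "domain-1");
        Event otherDomain = new Event("IRS_DISK_SPACE_LOW", "domain-2");

        System.out.println(filter.shouldLog(fromDiskA, t0));                                // true
        System.out.println(filter.shouldLog(fromDiskB, t0.plus(Duration.ofHours(1))));      // false
        System.out.println(filter.shouldLog(otherDomain, t0.plus(Duration.ofHours(1))));    // true
    }
}
```

In this sketch the second event is dropped only because nothing in the compared fields tells the two disks apart.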

A workaround is to use the CustomData field in AuditLog, which is generally used by external events, and populate it with the disk ID when calling the log method. This implies only a minor change in the code:

HostMonitoring::logLowDiskSpaceOnHostDisks :

logable.setCustomData(StringUtils.join(disksWithLowSpace, ", "));

Comment 12 Eli Mesika 2015-12-08 09:31:56 UTC
Allon, is the workaround suggested in comment 11 acceptable ?

Comment 13 Allon Mureinik 2015-12-10 11:39:12 UTC
(In reply to Eli Mesika from comment #12)
> Allon, is the workaround suggested in comment 11 acceptable ?

This isn't the right auditlog. This monitoring refers to low disk space on the host's local disks.

The BZ is about low space in storage domains (IRS_DISK_SPACE_LOW), which is generated by org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData#proceedStorageDomain.

There's no way to add the disk ID there, as the domain's space can be taken up for any number of reasons - adding new disks (but this is a periodic check - what would you show? All the disks added since the last check?), taking snapshots, natural growth of thinly provisioned disks.

I was wondering if there's anything QE can do in their automation to "flush" the event cache for their tests. A JMX perhaps?

Comment 14 Eli Mesika 2015-12-14 11:45:18 UTC
(In reply to Allon Mureinik from comment #13)

I agree that in that case we can not add the disk/s id/s as suggested.

Currently we have 2 kinds of events:

1) Events that are always recorded when they occur
2) Events subject to the flooding mechanism, which are suppressed as long as they recur within the defined flooding window

Maybe we need to add support for a 3rd kind: last
This means that only the last event of this kind is reported.
This would solve the issue reported in this bug.

Comment 15 Eli Mesika 2016-03-29 14:55:05 UTC
(In reply to ratamir from comment #10)
> Hi Allon,
> I think this audit log is unique and needs to be displayed every time we
> exceed the threshold.
> This is not only automation problem, a user should be notified with a
> warning message every time on this issue.
> I agree that we don't want to fill the logs and the events with a warnings
> but when we reach this threshold, I want to be notified.
> 
> Maybe we can show the warnings every time the used space surpasses the
> threshold

Allon,

What about removing the flood suppression from IRS_DISK_SPACE_LOW and IRS_DISK_SPACE_LOW_ERROR?

Is that acceptable?

I tend to agree with ratamir on that

Comment 16 Allon Mureinik 2016-04-19 17:49:15 UTC
(In reply to Eli Mesika from comment #15)
> (In reply to ratamir from comment #10)
> > Hi Allon,
> > I think this audit log is unique and needs to be displayed every time we
> > exceed the threshold.
> > This is not only automation problem, a user should be notified with a
> > warning message every time on this issue.
> > I agree that we don't want to fill the logs and the events with a warnings
> > but when we reach this threshold, I want to be notified.
> > 
> > Maybe we can show the warnings every time the used space surpasses the
> > threshold
> 
> Allon,
> 
> What about removing the flood surpression from IRS_DISK_SPACE_LOW and
> IRS_DISK_SPACE_LOW_ERROR 
> 
> Is that acceptable ?
> 
> I tend to agree with ratamir on that
That would give real users a worse experience just so we can automate this. I don't think that's a good approach.
I propose closing WONTFIX.

Comment 17 Allon Mureinik 2016-04-19 17:50:54 UTC
(In reply to ratamir from comment #10)
> Hi Allon,
> I think this audit log is unique and needs to be displayed every time we
> exceed the threshold.
> This is not only automation problem, a user should be notified with a
> warning message every time on this issue.
> I agree that we don't want to fill the logs and the events with a warnings
> but when we reach this threshold, I want to be notified.
> 
> Maybe we can show the warnings every time the used space surpasses the
> threshold
Passing a threshold isn't a point-in-time event - every time the domain monitor runs after you passed it, you're still beyond it.
We don't want to flood the user with these useless auditlogs.
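
To make the distinction concrete, here is a rough sketch contrasting "warn on every monitor cycle while below the threshold" with "warn only when the free space newly drops below it". It is not the engine's monitoring code: the class, the method names, the sample values, and the strict "less than" comparison are all assumptions made for illustration.

```java
final class ThresholdStateSketch {

    /** Counts monitor cycles in which free space is below the threshold (one warning per cycle if nothing is suppressed). */
    static int cyclesBelowThreshold(int[] freeGbPerCycle, int thresholdGb) {
        int count = 0;
        for (int freeGb : freeGbPerCycle) {
            if (freeGb < thresholdGb) {
                count++;
            }
        }
        return count;
    }

    /** Counts only the cycles in which free space newly drops below the threshold. */
    static int downwardCrossings(int[] freeGbPerCycle, int thresholdGb) {
        int count = 0;
        boolean wasBelow = false;
        for (int freeGb : freeGbPerCycle) {
            boolean isBelow = freeGb < thresholdGb;
            if (isBelow && !wasBelow) {
                count++;
            }
            wasBelow = isBelow;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical samples: a disk is created, removed, then created again.
        int[] freeGb = {50, 39, 39, 39, 50, 39, 39};
        System.out.println(cyclesBelowThreshold(freeGb, 40)); // 5
        System.out.println(downwardCrossings(freeGb, 40));    // 2
    }
}
```

With these samples, an unsuppressed per-cycle check would report five times, while counting crossings reports twice, once per disk creation.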

Comment 18 Oved Ourfali 2016-04-19 18:53:27 UTC
Allon - Up to you. 
Putting you as the assignee to decide what to do about it.

Comment 19 Allon Mureinik 2016-04-19 21:18:00 UTC
Comment 13 suggests a possible solution to this, which may affect the entire mechanism:

(In reply to Allon Mureinik from comment #13)
> I was wondering If there's anything QE can do in their automation to "flush"
> flush the event cache for their tests. A JMX perhaps?

Oved, if someone on your team feels inclined to add such a capability, I'd be glad. Anything else would be a specific workaround for a non-interesting usecase. Assuming the answer to this question is no based on comment 18, I'm closing this BZ.

Comment 20 Yaniv Kaul 2016-04-26 12:35:48 UTC
I think a warning every 12 hours is not good enough - we probably should have it every hour or so - this is an issue waiting to become an error and will end up causing data unavailability, as VMs will pause eventually.

Comment 21 Allon Mureinik 2016-04-27 11:09:46 UTC
(In reply to Yaniv Kaul from comment #20)
> I think a warning every 12 hours it not good enough - we probably should
> have it every hour or so - this is an issue waiting to become and error and
> will end up causing data unavailability, as VMs will pause eventually.
We warn based on a percentile of the storage domain being full, so 12 hours sounds like more than enough to me, but I honestly wouldn't mind using whatever value you think would be right.

What value do you want to use? 1 hour?

Comment 22 Yaniv Kaul 2016-04-27 17:46:35 UTC
(In reply to Allon Mureinik from comment #21)
> (In reply to Yaniv Kaul from comment #20)
> > I think a warning every 12 hours it not good enough - we probably should
> > have it every hour or so - this is an issue waiting to become and error and
> > will end up causing data unavailability, as VMs will pause eventually.
> We warn based on a percentile of the storage domain being full, so 12 hours
> sounds like more than enough to me, but I honestly wouldn't mind using
> whatever value you think would be right.
> 
> What value do you want to use? 1 hour?

Yes, makes sense to me. It's a catastrophe waiting to happen if not attended quickly - and you don't always have the space readily available to fix the issue.

Comment 23 Yaniv Lavi 2016-04-28 06:44:56 UTC
(In reply to Yaniv Kaul from comment #22)
> (In reply to Allon Mureinik from comment #21)
> > (In reply to Yaniv Kaul from comment #20)
> > > I think a warning every 12 hours it not good enough - we probably should
> > > have it every hour or so - this is an issue waiting to become and error and
> > > will end up causing data unavailability, as VMs will pause eventually.
> > We warn based on a percentile of the storage domain being full, so 12 hours
> > sounds like more than enough to me, but I honestly wouldn't mind using
> > whatever value you think would be right.
> > 
> > What value do you want to use? 1 hour?
> 
> Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> quickly - and you don't always have the space readily available to fix the
> issue.

This is a global setting. I would check what other warnings will pop up every hour.

Comment 24 Yaniv Kaul 2016-04-28 06:47:09 UTC
(In reply to Yaniv Dary from comment #23)
> (In reply to Yaniv Kaul from comment #22)
> > (In reply to Allon Mureinik from comment #21)
> > > (In reply to Yaniv Kaul from comment #20)
> > > > I think a warning every 12 hours it not good enough - we probably should
> > > > have it every hour or so - this is an issue waiting to become and error and
> > > > will end up causing data unavailability, as VMs will pause eventually.
> > > We warn based on a percentile of the storage domain being full, so 12 hours
> > > sounds like more than enough to me, but I honestly wouldn't mind using
> > > whatever value you think would be right.
> > > 
> > > What value do you want to use? 1 hour?
> > 
> > Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> > quickly - and you don't always have the space readily available to fix the
> > issue.
> 
> This is a global setting. I would check what other warnings will pop up
> every hour.

We need to do it the other way around - we need to see which messages need to pop up more frequently, and come up with a common class for them. Clearly, once every 12 hours is too infrequent for this one; I assume there are others.
Y.

Comment 25 Allon Mureinik 2016-05-22 08:26:41 UTC
(In reply to Yaniv Kaul from comment #22)
> (In reply to Allon Mureinik from comment #21)
> > (In reply to Yaniv Kaul from comment #20)
> > > I think a warning every 12 hours it not good enough - we probably should
> > > have it every hour or so - this is an issue waiting to become and error and
> > > will end up causing data unavailability, as VMs will pause eventually.
> > We warn based on a percentile of the storage domain being full, so 12 hours
> > sounds like more than enough to me, but I honestly wouldn't mind using
> > whatever value you think would be right.
> > 
> > What value do you want to use? 1 hour?
> 
> Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> quickly - and you don't always have the space readily available to fix the
> issue.

Fair enough: https://gerrit.ovirt.org/#/c/57819/

Comment 26 Tal Nisan 2017-05-03 11:07:46 UTC
*** Bug 1447099 has been marked as a duplicate of this bug. ***