Description of problem:
After configuring 'warning_low_space_indicator' or 'critical_space_action_blocker' and creating a preallocated disk that reaches the new threshold, the warning message shows up. After removing the disk and waiting for the task to finish, I create a new preallocated disk that reaches this threshold, and no warning appears.

Version-Release number of selected component (if applicable):
rhevm-3.6.1.1-0.1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. On a 50 GB SD, change 'critical_space_action_blocker' to 40
2. Create a new 10 GB preallocated disk
3.

Actual results:
Explained above

Expected results:

Additional info:
It is impossible to add an attachment (logs). I will try again later.
(In reply to ratamir from comment #1)
> It is impossible to add an attachment (logs). I will try again later

While you're at it, could you also please attach a screenshot? I'm not entirely clear on where you're expecting to see this warning. Thanks!
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Created attachment 1102827 [details]
engine log

Hi Allon,
I'm expecting to see the warning under the Events tab at the bottom. I'm also attaching a screenshot as requested.
Created attachment 1102840 [details]
engine log
The engine suppresses duplicate audit logs within the same hour; this is by design. Why does this affect automation?
I need a way to make sure the threshold was reached.
I am reading the engine logs for these warning messages.
(In reply to ratamir from comment #7)
> I need a way to make sure the threshold was reached.
> I am reading the engine logs for these warning messages.

I don't see us changing this behavior for the automation's sake.

Oved - can you, or someone from your team familiar with the flood-prevention mechanism, review this case and suggest a solution?
Thanks!
Eli - can you take a look? I think this would mean changing it for everyone, as it is not configurable per-user or anything like that, but Eli knows best.
Hi Allon,
I think this audit log is unique and needs to be displayed every time we exceed the threshold.
This is not only an automation problem; a user should be notified with a warning message every time this issue occurs.
I agree that we don't want to fill the logs and the events with warnings, but when we reach this threshold, I want to be notified.

Maybe we can show the warning every time the used space surpasses the threshold.
Looking at the code, I see that:
- The warning message will appear once in 12 hours
- The error message will appear once in 15 minutes

The origin of the problem is that the flood mechanism compares events according to the AuditLog equals() method. The AuditLog object does not currently contain any disk ID, so if, for example, disk A generates an event E1 and disk B (on the same storage domain) generates the same event E2 within the configured flood window, the event from disk B will be suppressed by the flooding mechanism, since E1.equals(E2) is TRUE.

If disk A and disk B are not on the same storage domain, then E1.equals(E2) is FALSE, since the storage domain ID is a property of the AuditLog class.

A workaround could be to use the CustomData field in AuditLog, which is generally used by external events, and populate it with the disk ID when calling the log method. This implies only a minor change in the code, in HostMonitoring::logLowDiskSpaceOnHostDisks:

logable.setCustomData(StringUtils.join(disksWithLowSpace, ", "));
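The collision described above can be sketched as follows. This is a simplified, self-contained model - AuditEvent, FloodGate, and shouldLog are illustrative names, not the actual org.ovirt.engine classes: two low-space events from different disks on the same domain compare equal, so the second one is suppressed unless the disk IDs are folded into a compared field such as customData.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified model of the flood-prevention check (illustrative names).
class AuditEvent {
    final String storageDomainId;
    final String customData; // e.g. joined disk IDs; "" if unused

    AuditEvent(String storageDomainId, String customData) {
        this.storageDomainId = storageDomainId;
        this.customData = customData;
    }

    // Disk ID is NOT a field of its own; only what equals() compares
    // participates in deduplication.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AuditEvent)) {
            return false;
        }
        AuditEvent e = (AuditEvent) o;
        return Objects.equals(storageDomainId, e.storageDomainId)
                && Objects.equals(customData, e.customData);
    }

    @Override
    public int hashCode() {
        return Objects.hash(storageDomainId, customData);
    }
}

class FloodGate {
    private final Map<AuditEvent, Long> lastLogged = new HashMap<>();
    private final long windowMillis;

    FloodGate(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the event should be logged (not suppressed). */
    boolean shouldLog(AuditEvent e, long nowMillis) {
        Long last = lastLogged.get(e);
        if (last != null && nowMillis - last < windowMillis) {
            return false; // an equal event was already logged in this window
        }
        lastLogged.put(e, nowMillis);
        return true;
    }
}
```

With an empty customData, the second event on the same domain is swallowed; joining the disk IDs into customData (as in the setCustomData workaround) makes the two events distinct keys, so both are logged.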
Allon, is the workaround suggested in comment 11 acceptable?
(In reply to Eli Mesika from comment #12)
> Allon, is the workaround suggested in comment 11 acceptable ?

This isn't the right audit log. That monitoring refers to low disk space on the host's local disks. This BZ is about low space in storage domains (IRS_DISK_SPACE_LOW), which is generated by org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData#proceedStorageDomain.

There's no way to add the disk ID there, as the domain's space can be taken up for any number of reasons: adding new disks (but this is a periodic check - what would you show? All the disks added since the last check?), taking snapshots, natural growth of thinly provisioned disks.

I was wondering if there's anything QE can do in their automation to "flush" the event cache for their tests. A JMX operation, perhaps?
(In reply to Allon Mureinik from comment #13)

I agree that in that case we cannot add the disk ID(s) as suggested.

Currently we have two kinds of events:
1) Events that are always recorded when they occur
2) Events under the flooding mechanism, which are suppressed as long as they occur within the defined flooding window

Maybe we need to add support for a third kind: "last". This means that only the last event of this kind is reported. This would solve the issue reported in this bug.
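One possible reading of the proposed "last" kind - a hypothetical sketch, not engine code, with illustrative names - is a trailing-edge debounce: within a flood window, intermediate occurrences are suppressed but the newest one is remembered and emitted when the window closes, so the user always ends up seeing the most recent event.

```java
import java.util.Optional;

// Hypothetical "last" event kind: suppress intermediates inside the
// window, but keep the newest one and emit it when the window expires.
class LastEventGate {
    private final long windowMillis;
    private boolean windowOpen = false;
    private long windowStart;
    private String pending; // newest suppressed message, if any

    LastEventGate(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Offer an event; returns a message to log now, if any. */
    Optional<String> offer(String message, long nowMillis) {
        if (!windowOpen || nowMillis - windowStart >= windowMillis) {
            // First event, or window expired: log immediately,
            // open a fresh window, and drop any stale pending event.
            windowOpen = true;
            windowStart = nowMillis;
            pending = null;
            return Optional.of(message);
        }
        pending = message; // keep only the latest occurrence
        return Optional.empty();
    }

    /** Periodic tick: when the window closes, emit the last suppressed event. */
    Optional<String> flush(long nowMillis) {
        if (pending != null && nowMillis - windowStart >= windowMillis) {
            String p = pending;
            pending = null;
            windowStart = nowMillis;
            return Optional.of(p);
        }
        return Optional.empty();
    }
}
```

Under this scheme the log is not flooded, yet the most recent occurrence is never silently lost, which is the property comment 10 is asking for.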
(In reply to ratamir from comment #10)
> Hi Allon,
> I think this audit log is unique and needs to be displayed every time we
> exceed the threshold.
> This is not only an automation problem; a user should be notified with a
> warning message every time this issue occurs.
> I agree that we don't want to fill the logs and the events with warnings,
> but when we reach this threshold, I want to be notified.
>
> Maybe we can show the warning every time the used space surpasses the
> threshold.

Allon,

What about removing the flood suppression from IRS_DISK_SPACE_LOW and IRS_DISK_SPACE_LOW_ERROR?

Is that acceptable?

I tend to agree with ratamir on that.
(In reply to Eli Mesika from comment #15)
> (In reply to ratamir from comment #10)
> > Hi Allon,
> > I think this audit log is unique and needs to be displayed every time we
> > exceed the threshold.
> > This is not only an automation problem; a user should be notified with a
> > warning message every time this issue occurs.
> > I agree that we don't want to fill the logs and the events with warnings,
> > but when we reach this threshold, I want to be notified.
> >
> > Maybe we can show the warning every time the used space surpasses the
> > threshold.
>
> Allon,
>
> What about removing the flood suppression from IRS_DISK_SPACE_LOW and
> IRS_DISK_SPACE_LOW_ERROR?
>
> Is that acceptable?
>
> I tend to agree with ratamir on that

That would give real users a worse experience just so we can automate this. I don't think that's a good approach.

I propose closing WONTFIX.
(In reply to ratamir from comment #10)
> Hi Allon,
> I think this audit log is unique and needs to be displayed every time we
> exceed the threshold.
> This is not only an automation problem; a user should be notified with a
> warning message every time this issue occurs.
> I agree that we don't want to fill the logs and the events with warnings,
> but when we reach this threshold, I want to be notified.
>
> Maybe we can show the warning every time the used space surpasses the
> threshold.

Passing a threshold isn't a point-in-time event - every time the domain monitor runs after you've passed it, you're still beyond it. We don't want to flood the user with these useless audit logs.
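For contrast, "every time we exceed the threshold" could also be read as edge-triggered rather than level-triggered: warn only when the value crosses the threshold, not on every periodic run while it stays beyond it. A hypothetical sketch (illustrative names and a free-space floor in GB as the threshold, not engine code):

```java
// Edge-triggered low-space warning: the periodic monitor sees the
// domain is still over the threshold on every run (level); warning
// only on the crossing (edge) avoids flooding without a time window.
class ThresholdMonitor {
    private final long thresholdGb; // warn when free space drops below this
    private boolean wasOver = false;

    ThresholdMonitor(long thresholdGb) {
        this.thresholdGb = thresholdGb;
    }

    /** Called on each periodic check; returns true when a warning is due. */
    boolean check(long freeSpaceGb) {
        boolean isOver = freeSpaceGb < thresholdGb;
        boolean crossed = isOver && !wasOver; // just crossed the floor
        wasOver = isOver;
        return crossed;
    }
}
```

This would warn again after every recovery-and-recrossing (e.g. disk removed, then a new preallocated disk added), which is the scenario from the original report, while staying silent on repeated checks of an unchanged condition.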
Allon - Up to you. Putting you as the assignee to decide what to do about it.
Comment 13 suggests a possible solution to this, which may affect the entire mechanism:

(In reply to Allon Mureinik from comment #13)
> I was wondering if there's anything QE can do in their automation to "flush"
> the event cache for their tests. A JMX perhaps?

Oved, if someone on your team feels inclined to add such a capability, I'd be glad. Anything else would be a specific workaround for an uninteresting use case.

Assuming the answer to this question is no, based on comment 18, I'm closing this BZ.
I think a warning every 12 hours is not good enough - we should probably have it every hour or so. This is an issue waiting to become an error, and it will end up causing data unavailability, as VMs will pause eventually.
(In reply to Yaniv Kaul from comment #20)
> I think a warning every 12 hours is not good enough - we probably should
> have it every hour or so - this is an issue waiting to become an error and
> will end up causing data unavailability, as VMs will pause eventually.

We warn based on a percentage of the storage domain being full, so 12 hours sounds like more than enough to me, but I honestly wouldn't mind using whatever value you think would be right.

What value do you want to use? 1 hour?
(In reply to Allon Mureinik from comment #21)
> (In reply to Yaniv Kaul from comment #20)
> > I think a warning every 12 hours is not good enough - we probably should
> > have it every hour or so - this is an issue waiting to become an error and
> > will end up causing data unavailability, as VMs will pause eventually.
> We warn based on a percentage of the storage domain being full, so 12 hours
> sounds like more than enough to me, but I honestly wouldn't mind using
> whatever value you think would be right.
>
> What value do you want to use? 1 hour?

Yes, makes sense to me. It's a catastrophe waiting to happen if not attended to quickly - and you don't always have the space readily available to fix the issue.
(In reply to Yaniv Kaul from comment #22)
> (In reply to Allon Mureinik from comment #21)
> > (In reply to Yaniv Kaul from comment #20)
> > > I think a warning every 12 hours is not good enough - we probably should
> > > have it every hour or so - this is an issue waiting to become an error and
> > > will end up causing data unavailability, as VMs will pause eventually.
> > We warn based on a percentage of the storage domain being full, so 12 hours
> > sounds like more than enough to me, but I honestly wouldn't mind using
> > whatever value you think would be right.
> >
> > What value do you want to use? 1 hour?
>
> Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> to quickly - and you don't always have the space readily available to fix
> the issue.

This is a global setting. I would check what other warnings will pop up every hour.
(In reply to Yaniv Dary from comment #23)
> (In reply to Yaniv Kaul from comment #22)
> > (In reply to Allon Mureinik from comment #21)
> > > (In reply to Yaniv Kaul from comment #20)
> > > > I think a warning every 12 hours is not good enough - we probably should
> > > > have it every hour or so - this is an issue waiting to become an error and
> > > > will end up causing data unavailability, as VMs will pause eventually.
> > > We warn based on a percentage of the storage domain being full, so 12 hours
> > > sounds like more than enough to me, but I honestly wouldn't mind using
> > > whatever value you think would be right.
> > >
> > > What value do you want to use? 1 hour?
> >
> > Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> > to quickly - and you don't always have the space readily available to fix
> > the issue.
>
> This is a global setting. I would check what other warnings will pop up
> every hour.

We need to do it the other way around - we need to see which messages should pop up more frequently, and come up with a common class for them. Clearly, every 12 hours is too infrequent for this one; I assume there are others.
Y.
(In reply to Yaniv Kaul from comment #22)
> (In reply to Allon Mureinik from comment #21)
> > (In reply to Yaniv Kaul from comment #20)
> > > I think a warning every 12 hours is not good enough - we probably should
> > > have it every hour or so - this is an issue waiting to become an error and
> > > will end up causing data unavailability, as VMs will pause eventually.
> > We warn based on a percentage of the storage domain being full, so 12 hours
> > sounds like more than enough to me, but I honestly wouldn't mind using
> > whatever value you think would be right.
> >
> > What value do you want to use? 1 hour?
>
> Yes, makes sense to me. It's a catastrophe waiting to happen if not attended
> to quickly - and you don't always have the space readily available to fix
> the issue.

Fair enough: https://gerrit.ovirt.org/#/c/57819/
*** Bug 1447099 has been marked as a duplicate of this bug. ***