Created attachment 1163602 [details]
Description of problem:
Events related to specific hosts are not bound to those hosts properly. They are not displayed as alerts for the host in the list of hosts.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a pool and fill it with data until a utilization threshold is crossed
2. Look at the Hosts page
All hosts have 0 alerts.
A host on which the threshold is crossed should have a related alert.
Don't think it's a UI bug.
As long as the API responds with alerts, the UI will show them. If there's a discrepancy in the API response, there's not a whole lot the UI can do about it.
Feel free to change the component, post verification & reproducibility.
(In reply to :Deb from comment #1)
> Don't think it's an UI bug.
> As long as the API responds with alerts, the UI will show it. If there's a
> discrepancy in the API response, not a whole lot the UI can do about it.
> Feel free to change the component, post verification & reproduceability.
Sorry, I forgot to say that the events related to a host are present in the events list in the UI, but they are not counted in the number of alerts for that host.
The screenshot shows a host list filtered to hosts with a critical or major alarm status. The host is correctly listed, but no alerts are displayed for it.
Created attachment 1165680 [details]
Created attachment 1165681 [details]
OSD host event
Another reproduction scenario is removing a disk from the OSD machine.
'OSD host event' shows the event created in that situation.
'OSD host' shows that the number of alerts did not change for the host.
In reading the comments, it appears that Lubos is expecting that every related event that is active would show on the host list page. It potentially may be a difference of understanding alert vs. event. If I'm not mistaken, an alert is an event that is noteworthy, i.e. having a major or critical severity level. Informational events don't qualify as alerts.
Depending on the classification of the event type itself, this might be the cause for why certain events/alerts are not being bound to the host.
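The alert vs. event distinction described above can be sketched roughly as follows. This is only an illustration: the `Event` type, its field names, and the severity labels are assumptions, not the product's actual data model.

```python
from dataclasses import dataclass

# Hypothetical severity labels; the real product may name them differently.
ALERT_SEVERITIES = {"major", "critical"}


@dataclass
class Event:
    message: str
    severity: str  # e.g. "info", "warning", "major", "critical"


def is_alert(event: Event) -> bool:
    """Treat an event as an alert only if its severity is major or critical."""
    return event.severity.lower() in ALERT_SEVERITIES


events = [
    Event("OSD utilization crossed threshold", "critical"),
    Event("Pool created", "info"),
]
# Only the noteworthy events would be counted as alerts.
alerts = [e for e in events if is_alert(e)]
```

Under this reading, an informational event still appears in the events list but never increases an alert count.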
Pool-related events are not propagated to the hosts; you can view them at the pool level as well as the cluster level, so this is not a valid use case.
However, the issue mentioned in comment 6 is valid and needs to be fixed.
(In reply to Nishanth Thomas from comment #8)
> Pool related events are not propagated to the hosts, rather you can view
> those in the pool level as well as cluster level so this is not a valid use
I don't think that's correct behaviour. The event is related to a specific OSD and therefore to a specific host. Moreover, the event is counted for the Hosts object on the dashboard page, and clicking on the alerts number filters the hosts so that only the one related to the event is displayed. Hence I think such an event should be counted as an alert for that host.
> But the issue mentioned as part of comment 6 is valid which needs to be fixed
I am not able to understand this correctly. Suppose a pool is created out of 6 OSDs residing on 6 different hosts. Do you expect the pool-related alerts to be attached to all 6 hosts? As far as I know, a pool is treated as a cluster-level entity and hence pool-related events are attached to the cluster.
Please let me know; if you agree, I will close this bug.
I missed the issue mentioned in comment 6; that needs to be addressed.
(In reply to Nishanth Thomas from comment #10)
> I am not able understand this correctly. Suppose a pool created out of 6
> OSDs residing in 6 different hosts, so what you expect the pool related
> alerts are attached to all the 6 hosts? As far I know pool is treated as a
> cluster level entity and hence pool related events are attached to the
> Please let me know, if you agree I will close this bug
Even the scenario mentioned in the first comment shows a problem related not just to the pool but to specific OSD(s), so it is possible to relate it to specific host(s). In other words, if the event mentions a specific host (or OSD), then it should be shown for that host in the hosts list.
I agree that if there is an event/alert for the pool which doesn't specify any host (or OSD), then that's a pool alert which will not be counted for any host.
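The rule proposed in this comment could look roughly like the sketch below. All names here (`OSD_TO_HOST`, `host_for_event`, the event dictionary keys) are hypothetical and only illustrate the expected counting behaviour, not the actual implementation.

```python
from collections import Counter
from typing import Optional

# Hypothetical mapping from an OSD to the host it runs on.
OSD_TO_HOST = {"osd.0": "host-a", "osd.1": "host-b"}


def host_for_event(event: dict) -> Optional[str]:
    """Return the host an event belongs to, or None for pool/cluster-only events."""
    if event.get("host"):
        return event["host"]
    osd = event.get("osd")
    return OSD_TO_HOST.get(osd) if osd else None


def count_host_alerts(events: list) -> Counter:
    """Count alerts per host; events naming no host or OSD are not counted."""
    counts = Counter()
    for event in events:
        host = host_for_event(event)
        if host is not None:
            counts[host] += 1
    return counts


sample = [
    {"pool": "pool1", "osd": "osd.0"},  # names an OSD, so it maps to host-a
    {"host": "host-b"},                 # names a host directly
    {"pool": "pool1"},                  # pool-only event, no host is credited
]
per_host = count_host_alerts(sample)
```

With this sample, `host-a` and `host-b` each get one alert and the pool-only event is left at the pool and cluster level.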
As discussed, we are binding all OSD-related events to the host. We have tested this for the following OSD-related events and it works as expected:
1. OSD utilization crossing the threshold.
2. OSD state changing.
However, removing an underlying disk from an OSD does not trigger an OSD state change in the ceph CLI, hence calamari does not send an event for it. We have raised a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1349813) against ceph to fix that.
Marking this bug as dependent on BZ1349813, as the above scenario is mentioned in comment 6.
Got an update from the ceph team regarding BZ1349813. For ceph to report an OSD as down when its disk is removed, at least 2 up OSDs have to report it (2 is the default value of the config option mon_osd_min_down_reporters, which can be changed), and it is reported only after the first write.
Considering this as the current behavior of ceph, moving this Bug to ON_QA.
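The condition described in this update can be expressed as a toy model. This is not ceph code; the function and parameter names are made up for illustration, and only the default of 2 for mon_osd_min_down_reporters comes from the comment above.

```python
def osd_reported_down(
    up_reporters: int,
    write_attempted: bool,
    min_down_reporters: int = 2,  # default of mon_osd_min_down_reporters per the update
) -> bool:
    """Toy model: an OSD is reported down only after a write has been attempted
    and at least `min_down_reporters` up OSDs have reported it."""
    return write_attempted and up_reporters >= min_down_reporters


# Disk removed, but no write yet: nothing is reported.
before_write = osd_reported_down(up_reporters=3, write_attempted=False)
# After the first write, with enough reporters: the OSD is reported down.
after_write = osd_reported_down(up_reporters=3, write_attempted=True)
# Only one other up OSD: below the default threshold, so no report.
single_reporter = osd_reported_down(up_reporters=1, write_attempted=True)
```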
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.