Description of problem:
When a volume is deleted, the deleted flag for that volume is updated to True. But sometimes the deleted flag is later updated back to False.

Version-Release number of selected component (if applicable):

How reproducible:
Delete a volume from the CLI and check that the deleted flag is True for that volume in the central store. After a few seconds the flag is updated back to False. Because of this problem, monitoring-integration again creates a panel in the dashboard for the deleted volume.

Steps to Reproduce:
1. Delete a volume from the CLI
2. Keep checking the deleted flag for that volume in the central store
3. After a few seconds the deleted flag is updated from True to False

Actual results:
The deleted flag is False for deleted volumes

Expected results:
The deleted flag should always be True for deleted volumes

Additional info:
Could you provide more details about:
* How to inspect the value of the flag in question
* What does "sometimes" mean here? Once in how many retries?
* Link to the upstream merge request
* Version where the bug is present
* Is it possible to reproduce this on a previously released RHGS WA?
It is reproducible in RHGS WA 1.6.1-1. In the central store we store the volume's deleted flag under /cluster/{cid}/Volumes/{vid}/deleted. Each volume object has a member variable called deleted with a default value of False. When the volume is deleted, the deleted flag is set to True, but after a few minutes the flag is updated back to False. monitoring-integration creates an alert dashboard panel for each volume based on this deleted flag alone: when a volume is deleted, the alert panel for that volume is removed and the flag is marked True. If, a few minutes after the volume deletion, the flag is reset by some thread, the alert panel for the deleted volume is created again in Grafana.

To see the alert dashboard in Grafana:
1. Sign in to Grafana with valid credentials
2. Switch organization to the "Alert Dashboard" organization
3. Press Home to list dashboards
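For inspecting the flag (comment 2 asked how), a quick way is to poll the key path above with an etcd client. This is only a sketch, assuming the python-etcd client and a local etcd endpoint on the default port; the cluster and volume IDs are placeholders, not values from this report:

# Sketch only: python-etcd client and a local etcd endpoint are assumptions,
# and the cluster/volume IDs are placeholders.
import time
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)
key = "/cluster/{cid}/Volumes/{vid}/deleted".format(
    cid="<cluster-id>", vid="<volume-id>")

# Poll the flag every few seconds to catch it flipping back to False.
while True:
    try:
        print(time.strftime("%H:%M:%S"), client.read(key).value)
    except etcd.EtcdKeyNotFound:
        print(time.strftime("%H:%M:%S"), "key no longer present in etcd")
        break
    time.sleep(5)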
It won't happen frequently; it is actually happening because of a race condition.
Sorry, it can be reproduced in version 3.3.1 as well.
I meant that 1.6.1-1 is actually the RHGS WA NVR 1.6.1-1.
Based on the information provided here I'm assuming that:

* it's not reproducible in RHGS WA 3.3.1, and as such it's connected to a new feature of RHGS WA 3.4
* the qe team will verify this by running the scenario (using the additional details in comment 3) multiple times, but can't approach it as a standard bug verification, as it's not reproducible in older builds
* the alert dashboard in Grafana is mentioned only as a hint for the qe team and this feature is still not supported (it is not documented, there is no feature BZ for it, and my understanding was that this is an internal implementation detail to support alerts [1])

[1] see e.g. this note from Mrugesh:

> Alerts org will contain the panels created for alert callbacks and will be
> hidden from the end users.

from https://github.com/Tendrl/specifications/issues/191#issuecomment-326197800

Is my understanding correct? If yes, I'm going to provide the qe ack.
(In reply to Martin Bukatovic from comment #8)
> Based on the information provided here I'm assuming that:
>
> * it's not reproducible in RHGS WA 3.3.1, and as such it's connected to a new
>   feature of RHGS WA 3.4
>
Out-of-band deletion of volumes is supported from 3.3.1, so you might see this issue in 3.3.1 as well, unless it was introduced during 3.4.0 development.

> * qe team will verify this by running scenario (using additional details in
>   comment 3) multiple times, but can't approach it as standard bug verification
>   as it's not reproducible in older builds
>
> * alert dashbaord in grafana is mentioned only as a hint for qe team and this
>   feature is still not supported (as this is not documented, there is no feature
>   BZ for it and my understanding was that this is internal implementation details
>   to support alerts[1])
>
> [1] see eg. this note from Mrugesh:
>
> > Alerts org will contain the panels created for alert callbacks and will be
> > hidden from the end users.
>
> from https://github.com/Tendrl/specifications/issues/191#issuecomment-326197800
>
> Is my understanding correct? If yes, I'm going to provide the qe ack.

Yes, the alert dashboard is not supposed to be used by end users and is hidden.
When a volume is deleted in the current build, the record of the volume is erased from etcd (/clusters/<cluster-id>/Volumes) and from the alert dashboard. Before it is erased, there are fields under /clusters/<cluster-id>/Volumes/<volume-id>/data, `deleted` and `deleted_at`, that are set to "" by default and are filled with data ("deleted_at": "<timestamp>", "deleted": true) when the volume is deleted. Once these records related to the volume are erased from etcd, there is no way WA can restore them for the deleted volume, so the issue cannot happen in the current build, right?

Tested with:
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch
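As a reference for how the above can be checked, a small script like the one below prints the deleted/deleted_at leaves while the volume record still exists and reports when the whole subtree has been erased. It is only a sketch assuming the python-etcd client and a local etcd endpoint; the IDs are placeholders and the key layout is taken from the paths mentioned above:

# Sketch only: python-etcd client, local etcd endpoint and placeholder IDs
# are assumptions, not details from this report.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)
vol_dir = "/clusters/<cluster-id>/Volumes/<volume-id>"

try:
    # Walk the volume subtree and print the deleted/deleted_at leaves.
    for leaf in client.read(vol_dir, recursive=True).leaves:
        if leaf.key.endswith(("deleted", "deleted_at")):
            print(leaf.key, "=", repr(leaf.value))
except etcd.EtcdKeyNotFound:
    # Once the whole subtree is erased, there is nothing left to flip back.
    print("volume record erased from etcd")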
In old builds we also have a TTL on volume records so that they are removed after the volume is deleted from the CLI. The reason we use a deleted flag is that removal by TTL takes some time, so in the gluster-integration sync and the monitoring-integration sync we need a flag to omit the deleted volume when calculating volume details and creating panels in the alert dashboard; we used this deleted flag for that purpose. The problem in the old build is that when a volume is deleted we capture the gluster event for the volume delete and remove it from the Grafana alert dashboard, but if the deleted flag is later reset to "", a panel for that volume is created again in the alert dashboard and remains there forever. In the new build we fixed this problem so that the deleted flag is set properly after deletion. And even if there is any problem with the deletion flag, monitoring-integration now has the intelligence to remove the volume panel once the volume record is removed by TTL. So it won't occur.
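To illustrate the TTL mechanism mentioned above: etcd can expire a key on its own when it is written with a time-to-live. The snippet below is only an illustration, with an invented TTL value and placeholder IDs, assuming the python-etcd client; it is not the actual gluster-integration code:

# Sketch of TTL-based expiry; key path, value and TTL are illustrative
# and do not come from the original report.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

# Write the deleted marker with a TTL so etcd removes it automatically
# after 60 seconds even if no component cleans it up explicitly.
client.write("/clusters/<cluster-id>/Volumes/<volume-id>/deleted",
             "True", ttl=60)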
The logic in monitoring-integration is: if a sync collects the list of volumes A, B, C, it creates volume panels in the alert dashboard for all of A, B, C. If the next sync collects only A and B, it compares the dashboard panels with the newly collected data, sees that C is missing, and removes the panel for C. Even if monitoring-integration is down when the volume is deleted and the volume detail is removed from etcd by TTL, it can still remove the deleted volume's panel from the Grafana dashboard when monitoring-integration comes back up.
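The sync comparison described above is essentially a set difference between the volumes collected in the current sync and the panels that already exist. A minimal sketch of that reconciliation, with hypothetical create_panel/remove_panel stand-ins for the real Grafana calls, could look like this:

# Minimal sketch of the panel reconciliation logic described above.
# create_panel/remove_panel are hypothetical stand-ins, not the real
# monitoring-integration API.

def reconcile_panels(synced_volumes, existing_panels,
                     create_panel, remove_panel):
    synced = set(synced_volumes)
    existing = set(existing_panels)

    # Volumes seen in this sync but without a panel yet -> create panels.
    for volume in synced - existing:
        create_panel(volume)

    # Panels whose volume was not seen in this sync (e.g. deleted or
    # already expired from etcd by TTL) -> remove panels.
    for volume in existing - synced:
        remove_panel(volume)


# Example: the previous sync created panels for A, B, C; the current sync
# collects only A and B, so the panel for C gets removed.
reconcile_panels(["A", "B"], ["A", "B", "C"],
                 create_panel=lambda v: print("create", v),
                 remove_panel=lambda v: print("remove", v))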
Tested 10 times and it seems to be fixed. Records from etcd and alert dashboards were deleted every time. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616