| Summary: | [Eventing]: 'gluster vol start vol1' triggers CLIENT_CONNECT/DISCONNECT events for bricks of ALL volumes of the cluster | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | glusterfs | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED UPSTREAM | QA Contact: | storage-qa-internal <storage-qa-internal> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | rhgs-3.2 | CC: | amukherj, atumball, prasanna.kalever, rhinduja, rkavunga, vbellur |
| Target Milestone: | --- | Keywords: | Reopened, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-06 08:25:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Sweta Anandpara
2016-10-04 06:29:06 UTC
Prasanna - can you check this? The mentioned set of events was introduced by http://review.gluster.org/15294

RCA: The events for already-started bricks are generated by internal services that run on a per-node basis, such as the self-heal daemon (shd) and NFS. In the case of shd, when you start a volume, shd has to be stopped and restarted with the new volfile in order to load the volfile for that volume. Bringing down the shd daemon triggers a disconnect for the volumes it was already connected to, i.e. volumes that are already started and have self-heal enabled. Those disconnects trigger the protocol events, and when shd restarts with the new volfile we get a connect event as well.

As per the RCA, the events are generated by internal Gluster daemons; from the point of view of a glusterfsd process they are clients, so the events are genuine. It is the responsibility of the consumers of the events to decide whether to discard or act on an event. Unfortunately we do not have a way to distinguish the two today; once we implement protocol/client events we may be able to do this. Closing this bug as won't fix.

The resolution, or rather the impact, of this bug does not look to me like something we can live with. In my small cluster (4 nodes), with only a few volumes created (3) and distinctly small volume sizes/configurations (2*2, 1*4+2, 2*1), I see _44_ CLIENT_CONNECT and CLIENT_DISCONNECT events with just the default services. This number will only increase if I enable other user-configurable services like bitrot/nfs/snapshot, and needless to say it will multiply n times in an actual customer environment.

It is becoming quite cumbersome for me even to test the events in my small cluster with 3 volumes when there is such heavy traffic of /unrelated/ events. The fallback I have been exercising is to delete the other volumes and test my events in a single-volume cluster. I do not see this kind of testing scaling as the release progresses, and it definitely will not suffice for testing Eventing as a feature.

I do not agree with this line of Comment 5 - "responsibility of consumers of events to decide whether to discard or act on an event". We are leaning too heavily on the consumer to make this call and washing our hands of it. I am using a webhook as a consumer, and I find it taxing to monitor only the events that matter and discard the rest; I am not very confident that I am doing everything right. What confidence will an external user have?!

I agree with Comment 3 that the internal Gluster daemons are clients of glusterfsd and that the events seen are genuine; there is no denying that. The question is: are the internal events _really_ important? If not, we can do away with them. If yes, can we classify/group such events based on priority, perhaps setting the log_level of such events to 'debug' rather than 'info'? At the very least, setting an appropriate log_level should act as a filtering mechanism.

An amendment to the last line of the above comment: 'log_level' caters only to logging of events. What I meant was setting the priority/importance of events to high/medium/low, or classifying the events as info/error/debug/warning, thereby displaying or responding only to those events which match the configured setting. Atin and team, thoughts?

In the last 2 years no progress has been made on this bug, and it was once closed as WONTFIX.
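To make the consumer-side burden discussed above concrete, here is a minimal sketch of a webhook receiver that drops the noisy CLIENT_CONNECT/CLIENT_DISCONNECT events and reports only the rest. It assumes the glustereventsd webhook payload is a JSON object with "nodeid", "ts", "event" and "message" fields, as described in the Gluster eventing documentation (verify against your installed version); the IGNORED set and the port are illustrative choices for this sketch, not an existing Gluster feature.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Gluster eventing webhook consumer that filters events.

Assumption: glustereventsd POSTs a JSON body of the form
  {"nodeid": "...", "ts": 1475562546, "event": "CLIENT_CONNECT", "message": {...}}
(field names per the Gluster eventing docs; verify against your version).
The IGNORED set below is a purely illustrative client-side policy, not an
existing Gluster feature.
"""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Events this particular consumer chooses to discard (hypothetical policy).
IGNORED = {"CLIENT_CONNECT", "CLIENT_DISCONNECT"}


class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            self.send_response(400)
            self.end_headers()
            return

        event = payload.get("event", "UNKNOWN")
        if event not in IGNORED:
            # Act only on the events that matter to this consumer.
            print("node=%s ts=%s event=%s message=%s" % (
                payload.get("nodeid"), payload.get("ts"),
                event, payload.get("message")))

        # Acknowledge receipt either way.
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # Port 9000 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 9000), EventHandler).serve_forever()
```

The receiver would be registered on a cluster node with something like `gluster-eventsapi webhook-add http://<host>:9000/`. The sketch only illustrates the point raised in the discussion: because the event stream carries no priority or classification, every consumer ends up re-implementing this kind of filtering on its own.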
I would rather close this as UPSTREAM for now, to indicate that we will try to fix it if we run into these issues with gluster-prometheus in the GCS integration.