Bug 1515855
| Field | Value |
|---|---|
| Summary | tendrl-notifier doesn't send alerts for gluster native events for UNKNOWN_PEER and PEER_REJECT |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | web-admin-tendrl-notifier |
| Version | rhgs-3.3 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Reporter | Martin Bukatovic <mbukatov> |
| Assignee | Nishanth Thomas <nthomas> |
| QA Contact | Martin Kudlej <mkudlej> |
| CC | amukherj, dnarayan, mbukatov, mkudlej, rhs-bugs, sanandpa, sankarshan |
| Keywords | ZStream |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch, tendrl-notifier-1.5.4-6.el7rhgs.noarch |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-12-18 04:37:36 UTC |
| Attachments | Unknown_peer_seen_in_UI |
Description
Martin Bukatovic
2017-11-21 13:54:08 UTC
Were these events raised by gluster? Whatever events gluster raises are logged in /var/log/glusterfs/events.log. Can you verify this and get us the required information?

The QE team doesn't have time to retry the scenario with a clearly described full reproducer at this point. There are a lot of *other issues* to report and provide reproducers for. But to answer your question at least partly: I have evidence for at least one case where glusterfs tried to send an event but tendrl failed to receive it; see general BZ 1517468. Only when BZ 1517468 is fully understood, fixed, and verified can we retry this particular scenario.

The steps to raise UNKNOWN_PEER and PEER_REJECT are incorrect. PEER_REJECT can be simulated as follows:

1. Have a 3-node cluster (all 3 nodes peer probed), say node1, node2, node3.
2. Bring down the glusterd service on one of the nodes, say node1.
3. Perform a peer detach from node2 to detach node1.
4. Bring the glusterd service back up on node1.

Now you should receive the PEER_REJECT event. I have tried this and was able to see the event in the tendrl UI. However, there is an issue handling duplicates (the events API will keep pushing PEER_REJECT until the condition is resolved). I have attached an upstream issue here to fix this. Atin will provide more details about UNKNOWN_PEER.

Simulating the UNKNOWN_PEER event is not easy unless done from gdb. The code path where this can be hit is when one of the peers sends a friend update request to its counterpart: if the un-marshalling of friend_req.uuid over the wire goes for a toss, glusterd will not be able to find a proper peerinfo object, and the peer update will be rejected. If QE is interested in simulating this by taking glusterd into a gdb session, I can help out.

I have tested PEER_REJECT according to Darshan's comment and it works.
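On the duplicate-handling issue mentioned above (the events API keeps re-pushing PEER_REJECT until the condition is resolved), one way a notifier could suppress repeats is to track which (event, peer) pairs are currently firing. A minimal sketch in Python — hypothetical, not tendrl-notifier's actual de-duplication code:

```python
# Hypothetical sketch: suppress repeated alerts for the same (event, peer)
# pair until the underlying condition is explicitly cleared.
class AlertDeduplicator:
    def __init__(self):
        self._active = set()  # currently firing (event_tag, peer) pairs

    def should_notify(self, event_tag, peer):
        """Return True only the first time an (event, peer) pair fires."""
        key = (event_tag, peer)
        if key in self._active:
            return False
        self._active.add(key)
        return True

    def clear(self, event_tag, peer):
        """Call when the underlying condition is resolved."""
        self._active.discard((event_tag, peer))
```

With this, a second PEER_REJECT for the same peer would be dropped until `clear()` is called, which matches the behaviour Darshan describes as missing.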
I've used:

```
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
```

I've talked with Atin (thanks, Atin, for the help!) and gluster has sent UNKNOWN_PEER, but I don't see it in WA.
Please look at the trace of the gluster code:

```
Breakpoint 1, __glusterd_handle_friend_update (req=req@entry=0x7ff1615d5020)
    at glusterd-handler.c:2737
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2743            gd1_mgmt_friend_update_rsp rsp = {{0},};
(gdb) next
2744            dict_t *dict = NULL;
(gdb) next
2746            char *uuid_buf = NULL;
(gdb) next
2748            int count = 0;
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2750            glusterd_peerctx_args_t args = {0};
(gdb) next
2751            int32_t op = 0;
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2755            this = THIS;
(gdb) next
2756            GF_ASSERT (this);
(gdb) next
2758            GF_ASSERT (priv);
(gdb) next
2760            ret = xdr_to_generic (req->msg[0], &friend_req,
(gdb) next
2762            if (ret < 0) {
(gdb) set friend_req.uuid="sdfsdf"
(gdb) next
2772            rcu_read_lock ();
(gdb) next
2773            if (glusterd_peerinfo_find (friend_req.uuid, NULL) == NULL) {
(gdb) next
2776                    rcu_read_unlock ();
(gdb) next
2778                    gf_msg (this->name, GF_LOG_CRITICAL, 0,
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2740            glusterd_peerinfo_t *peerinfo = NULL;
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2898            gf_uuid_copy (rsp.uuid, MY_UUID);
(gdb) next
2899            ret = glusterd_submit_reply (req, &rsp, NULL, 0, NULL,
(gdb) next
2901            if (dict) {
(gdb) continue
Continuing.
```

--> ASSIGNED

I tried with my setup and it works for me; attaching a screenshot of the events menu in the tendrl UI. Please note it won't be listed under the top-right bell icon, as it is a notify-only alert.

Created attachment 1363510 [details]
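The key step in the trace is corrupting friend_req.uuid with `set`, so that glusterd_peerinfo_find() returns NULL and gf_event(EVENT_UNKNOWN_PEER, ...) fires. An illustrative Python analogue of that code path (not gluster source; names and uuids are made up):

```python
# Illustrative analogue of __glusterd_handle_friend_update: if the
# unmarshalled uuid does not match any known peer (as happens after
# `set friend_req.uuid="sdfsdf"` in the gdb session), the update is
# rejected and an UNKNOWN_PEER event is emitted.
KNOWN_PEERS = {"b9c3a1d2": "node2", "7a1fe0c4": "node3"}  # hypothetical uuid -> host map

def handle_friend_update(friend_req_uuid, emit_event):
    """Return 'accepted' or 'rejected'; emit UNKNOWN_PEER on rejection."""
    if friend_req_uuid not in KNOWN_PEERS:  # glusterd_peerinfo_find() == NULL
        emit_event("UNKNOWN_PEER", peer=friend_req_uuid)  # gf_event(EVENT_UNKNOWN_PEER, ...)
        return "rejected"
    return "accepted"
```

Corrupting the uuid in gdb is simply a way of forcing the first branch without having to garble the request on the wire.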
Unknown_peer_seen_in_UI
What is the system configuration of the server and storage nodes?

@Darshan What was your scenario for seeing the UNKNOWN_PEER event? Have you tried to follow the steps described in comment 9? I'm not sure, but on your screenshot I see the event "unknown state of peer" and not an "unknown peer" event. Am I wrong?

@Nishanth The server has 27 GB of free memory and 16 GB of free disk space.

(In reply to Martin Kudlej from comment #17)
> @Darshan What was your scenario for seeing the UNKNOWN_PEER event? Have you
> tried to follow the steps described in comment 9?

Yes.

> I'm not sure, but on your screenshot I see the event "unknown state of peer"
> and not an "unknown peer" event. Am I wrong?

That's the message for unknown_peer. You can have a look at the handler function for unknown_peer here: https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/message/callback.py#L213

@Martin, the procedure is the same, and Atin helped us with that. Please retest. Moving the bug to ON_QA again.

I've tested this with Darshan's help (thank you for that!) and it works. --> VERIFIED

(In reply to Martin Kudlej from comment #20)
> I've tested this with Darshan's help (thank you for that!) and it works. -->
> VERIFIED

Nice, could you update the test case description so that anybody could replicate this again if needed?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478
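On the naming confusion discussed in the thread: the unknown_peer handler renders the native UNKNOWN_PEER event as an "unknown state of peer ..." message, which is why the UI text differs from the event tag. A simplified Python sketch of such a tag-to-message mapping (hypothetical; the real logic is in the linked callback.py, and the PEER_REJECT wording here is invented):

```python
# Hypothetical sketch of mapping gluster native event tags to the alert
# text shown in the tendrl UI; only the UNKNOWN_PEER wording is taken
# from the thread, the rest is illustrative.
MESSAGE_TEMPLATES = {
    "UNKNOWN_PEER": "unknown state of peer {peer}",
    "PEER_REJECT": "peer {peer} rejected",  # invented wording
}

def render_alert(event_tag, peer):
    """Return the UI message for a native event tag, or None if unmapped."""
    template = MESSAGE_TEMPLATES.get(event_tag)
    return template.format(peer=peer) if template else None
```

This explains why a tester grepping the UI for the literal string "unknown peer" would miss the alert even though the event was delivered.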