Description of problem
======================

Tendrl notifier doesn't send alerts for the following gluster native events:

* Peer has moved to unknown state (UNKNOWN_PEER)
* Peer rejected (PEER_REJECT)

Version-Release
===============

tendrl-notifier-1.5.4-2.el7rhgs.noarch

[root@usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.5.4-3.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-4.el7rhgs.noarch
tendrl-node-agent-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.5.4-2.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.4-3.el7rhgs.noarch

[root@usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
python-gluster-3.8.4-52.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-3.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible.
2. Configure alerting to send events via both SMTP and SNMP.
3. Import a gluster trusted storage pool with a volume.
4. Watch the alerts/events shown in the UI and sent via SMTP and SNMP while you perform the following steps.
5. Test action 1: on one of the storage machines of the cluster, run gluster peer probe against a machine with an invalid hostname.
6. Test action 2: on one of the storage machines of the cluster, run gluster peer probe against a machine which has been shut down.
7. Test action 3: on one of the storage machines of the cluster, run gluster peer probe against a machine that is already part of **another** trusted storage pool (e.g. a machine from your colleague's cluster).

When the QE playbooks for the alerting test setup are used:

* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.smtp.yml
* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.snmp.yml

one can check incoming SNMP trap messages via:

# journalctl -u snmptrapd -fe

and email messages via:

# mutt

Actual results
==============

Even though I performed test action 1:

```
[root@usm1-gl1 ~]# ping usm1-doesnotexist.example.com
ping: usm1-doesnotexist.example.com: Name or service not known
[root@usm1-gl1 ~]# gluster peer probe usm1-doesnotexist.example.com
peer probe: failed: Probe returned with Transport endpoint is not connected
```

test action 2:

```
# ping mbukatov.example.com
PING mbukatov.example.com (10.37.169.5) 56(84) bytes of data.
^C
--- mbukatov.example.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4001ms
# gluster peer probe mbukatov.example.com
peer probe: failed: Probe returned with Transport endpoint is not connected
```

and test action 3:

```
# gluster peer probe usm2-gl1.example.com
peer probe: failed: usm2-gl1.example.com is either already part of another cluster or having volumes configured
```

I haven't received any alert (via the Tendrl UI, email, or SNMP) for any of these 3 actions.
Expected results
================

Alerts for both "Peer has moved to unknown state" (UNKNOWN_PEER) and "Peer rejected" (PEER_REJECT) are sent for each test action described above.
Were these events raised by gluster at all? Whatever events gluster raises will be logged at /var/log/glusterfs/events.log. Can you verify this and get us the required information?
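For a quick check, something like this should work (a minimal sketch; the log path is the one mentioned above, and the grep pattern assumes the event names appear verbatim in the log):

```
# follow gluster native events as they are raised,
# filtering for the two events in question
tail -f /var/log/glusterfs/events.log | grep -E 'UNKNOWN_PEER|PEER_REJECT'
```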
The QE team doesn't have time to retry the scenario with a clearly described full reproducer at this point. There are a lot of *other issues* to report and provide reproducers for. But to answer your question at least partly: I have evidence for at least one case where glusterfs tried to send an event but tendrl failed to receive it, see general BZ 1517468. Only when BZ 1517468 is fully understood, fixed, and verified can we retry this particular scenario.
The steps to raise UNKNOWN_PEER and PEER_REJECT are incorrect. PEER_REJECT can be simulated as follows (see the sketch after this list):

1. Have a 3-node cluster (all 3 nodes peer probed), say node1, node2, node3.
2. Bring down the glusterd service on one of the nodes, say node1.
3. Perform a peer detach from node2 to detach node1.
4. Bring the glusterd service back up on node1.

Now you should receive the PEER_REJECT event. I have tried this and was able to see the event in the tendrl UI. However, there is an issue with handling duplicates (the events API will keep pushing PEER_REJECT until the condition is resolved). I have attached the upstream issue to fix this here. Atin will provide more details about UNKNOWN_PEER.
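A minimal shell sketch of the steps above (node1/node2 are the placeholder names from the list; whether the detach needs `force` may depend on the setup):

```
# on node1: stop glusterd so the peer goes down
systemctl stop glusterd

# on node2: detach the downed peer (add 'force' if the plain detach is refused)
gluster peer detach node1 force

# on node1: bring glusterd back up; it still remembers node2, whose view of
# the pool has changed in the meantime, so the handshake is rejected and
# PEER_REJECT is raised
systemctl start glusterd
```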
Simulating the UNKNOWN_PEER event is not easy unless it is done from gdb. The code path where this can be hit is when one of the peers sends a friend update request to its counterpart: if the un-marshalling of friend_req.uuid from the wire goes wrong, glusterd is not able to find a proper peerinfo object and hence the peer update is rejected. If QE is interested in simulating this by taking glusterd into a gdb session, I can help out.
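Roughly, the gdb session would look like this (a sketch; it assumes glusterd debuginfo is installed so the symbols resolve, and the placeholder uuid string is arbitrary):

```
# attach to the running glusterd on one of the peers
gdb -p $(pidof glusterd)

(gdb) break __glusterd_handle_friend_update
(gdb) continue
# trigger a friend update from another peer (e.g. a probe or detach);
# once the breakpoint hits, step past xdr_to_generic() with 'next',
# then corrupt the unmarshalled peer uuid so the peerinfo lookup fails:
(gdb) set friend_req.uuid="sdfsdf"
(gdb) continue
# glusterd then raises EVENT_UNKNOWN_PEER for the bogus uuid
```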
I have tested PEER_REJECT according to Darshan's comment and it works. I've used:

etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
I've talked with Atin (thanks, Atin, for the help!) and Gluster has sent UNKNOWN_PEER, but I don't see it in WA. Please look at this trace of the gluster code:

```
Breakpoint 1, __glusterd_handle_friend_update (req=req@entry=0x7ff1615d5020) at glusterd-handler.c:2737
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2743            gd1_mgmt_friend_update_rsp rsp = {{0},};
(gdb) next
2744            dict_t *dict = NULL;
(gdb) next
2746            char *uuid_buf = NULL;
(gdb) next
2748            int count = 0;
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2750            glusterd_peerctx_args_t args = {0};
(gdb) next
2751            int32_t op = 0;
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2755            this = THIS;
(gdb) next
2756            GF_ASSERT (this);
(gdb) next
2758            GF_ASSERT (priv);
(gdb) next
2760            ret = xdr_to_generic (req->msg[0], &friend_req,
(gdb) next
2762            if (ret < 0) {
(gdb) set friend_req.uuid="sdfsdf"
(gdb) next
2772            rcu_read_lock ();
(gdb) next
2773            if (glusterd_peerinfo_find (friend_req.uuid, NULL) == NULL) {
(gdb) next
2776                    rcu_read_unlock ();
(gdb) next
2778                    gf_msg (this->name, GF_LOG_CRITICAL, 0,
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2740            glusterd_peerinfo_t *peerinfo = NULL;
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2898            gf_uuid_copy (rsp.uuid, MY_UUID);
(gdb) next
2899            ret = glusterd_submit_reply (req, &rsp, NULL, 0, NULL,
(gdb) next
2901            if (dict) {
(gdb) continue
Continuing.
```

--> ASSIGNED
I tried with my setup and it works for me; attaching a screenshot of the events menu in the tendrl UI. Please note it won't be listed under the top-right-corner bell icon, as it is a notify-only alert.
Created attachment 1363510 [details] Unknown_peer_seen_in_UI
What is the system configuration of the server and storage nodes?
@Darshan What was your scenario for seeing the UNKNOWN_PEER event? Did you try to follow the steps described in comment 9? I'm not sure, but on your screenshot I see the event "unknown state of peer" and not an "unknown peer" event. Am I wrong?

@Nishanth The server has 27 GB of free memory and 16 GB of free disk space.
(In reply to Martin Kudlej from comment #17)
> @Darshan What was your scenario for seeing the UNKNOWN_PEER event? Did you
> try to follow the steps described in comment 9?

Yes.

> I'm not sure, but on your screenshot I see the event "unknown state of peer"
> and not an "unknown peer" event. Am I wrong?

That's the message for unknown_peer. You can have a look at the handler function for unknown_peer here:
https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/message/callback.py#L213

> @Nishanth The server has 27 GB of free memory and 16 GB of free disk space.
@martin The procedure is the same, and Atin helped us with it. Please retest. Moving the bug to ON_QA again.
I've tested this with Darshan's help (thank you for that!) and it works. --> VERIFIED
(In reply to Martin Kudlej from comment #20)
> I've tested this with Darshan's help (thank you for that!) and it works.
> --> VERIFIED

Nice, could you update the test case description so that anybody could replicate this again if needed?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478