Bug 1515855
| Field | Value |
|---|---|
| Summary | tendrl-notifier doesn't send alerts for gluster native events for UNKNOWN_PEER and PEER_REJECT |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | web-admin-tendrl-notifier |
| Version | rhgs-3.3 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Reporter | Martin Bukatovic <mbukatov> |
| Assignee | Nishanth Thomas <nthomas> |
| QA Contact | Martin Kudlej <mkudlej> |
| CC | amukherj, dnarayan, mbukatov, mkudlej, rhs-bugs, sanandpa, sankarshan |
| Keywords | ZStream |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch, tendrl-notifier-1.5.4-6.el7rhgs.noarch |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2017-12-18 04:37:36 UTC |
| Attachments | Unknown_peer_seen_in_UI |
Description
Martin Bukatovic
2017-11-21 13:54:08 UTC
Were these events raised by gluster? Whatever events gluster raises are logged in /var/log/glusterfs/events.log. Can you verify this and get us the required information?

The QE team doesn't have time to retry the scenario with a clearly described full reproducer at this point. There are a lot of *other issues* to report and provide reproducers for. But to answer your question at least partly: I have evidence for at least one case where glusterfs tried to send an event but tendrl failed to receive it; see general BZ 1517468. Only when BZ 1517468 is fully understood, fixed, and verified can we retry this particular scenario.

The steps to raise UNKNOWN_PEER and PEER_REJECT are incorrect. PEER_REJECT can be simulated as follows:

1. Have a 3-node cluster (all 3 nodes peer probed), say node1, node2, node3.
2. Bring down the glusterd service on one of the nodes, say node1.
3. Perform a peer detach from node2 to detach node1.
4. Bring the glusterd service back up on node1.

Now you should receive the PEER_REJECT event. I have tried this and was able to see the event in the tendrl UI. However, there is an issue handling duplicates (the events API will keep pushing PEER_REJECT until the condition is resolved). I have attached an upstream issue here to fix this. Atin will provide more details about UNKNOWN_PEER.

Simulating the UNKNOWN_PEER event is not easy unless done from gdb. The code path where this can be hit is when one of the peers sends a friend update request to its counterpart: if the un-marshalling of friend_req.uuid over the wire goes for a toss, glusterd will not be able to find a proper peerinfo object, and the peer update will be rejected. If QE is interested in simulating this by taking glusterd into a gdb session, I can help out.

I have tested PEER_REJECT according to Darshan's comment and it works.
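On the duplicate-handling issue mentioned above (the events API keeps re-pushing PEER_REJECT until the condition is resolved), one way a notifier could suppress repeats is to track which (event, peer) pairs are currently firing. A minimal sketch in Python — hypothetical, not tendrl-notifier's actual de-duplication code:

```python
# Hypothetical sketch: suppress repeated alerts for the same (event, peer)
# pair until the underlying condition is explicitly cleared.
class AlertDeduplicator:
    def __init__(self):
        self._active = set()  # currently firing (event_tag, peer) pairs

    def should_notify(self, event_tag, peer):
        """Return True only the first time an (event, peer) pair fires."""
        key = (event_tag, peer)
        if key in self._active:
            return False
        self._active.add(key)
        return True

    def clear(self, event_tag, peer):
        """Call when the underlying condition is resolved."""
        self._active.discard((event_tag, peer))
```

With this, a second PEER_REJECT for the same peer would be dropped until `clear()` is called, which matches the behaviour Darshan describes as missing.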
I've used:

```
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
python-gluster-3.8.4-52.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
```

I've talked with Atin (thanks, Atin, for the help!) and gluster has sent UNKNOWN_PEER, but I don't see it in WA.
Please look at the trace of the gluster code:

```
Breakpoint 1, __glusterd_handle_friend_update (req=req@entry=0x7ff1615d5020)
    at glusterd-handler.c:2737
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2737    {
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2739            gd1_mgmt_friend_update friend_req = {{0},};
(gdb) next
2743            gd1_mgmt_friend_update_rsp rsp = {{0},};
(gdb) next
2744            dict_t *dict = NULL;
(gdb) next
2746            char *uuid_buf = NULL;
(gdb) next
2748            int count = 0;
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2745            char key[100] = {0,};
(gdb) next
2749            uuid_t uuid = {0,};
(gdb) next
2750            glusterd_peerctx_args_t args = {0};
(gdb) next
2751            int32_t op = 0;
(gdb) next
2753            GF_ASSERT (req);
(gdb) next
2755            this = THIS;
(gdb) next
2756            GF_ASSERT (this);
(gdb) next
2758            GF_ASSERT (priv);
(gdb) next
2760            ret = xdr_to_generic (req->msg[0], &friend_req,
(gdb) next
2762            if (ret < 0) {
(gdb) set friend_req.uuid="sdfsdf"
(gdb) next
2772            rcu_read_lock ();
(gdb) next
2773            if (glusterd_peerinfo_find (friend_req.uuid, NULL) == NULL) {
(gdb) next
2776                    rcu_read_unlock ();
(gdb) next
2778                    gf_msg (this->name, GF_LOG_CRITICAL, 0,
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2740            glusterd_peerinfo_t *peerinfo = NULL;
(gdb) next
2782                    gf_event (EVENT_UNKNOWN_PEER, "peer=%s",
(gdb) next
2898            gf_uuid_copy (rsp.uuid, MY_UUID);
(gdb) next
2899            ret = glusterd_submit_reply (req, &rsp, NULL, 0, NULL,
(gdb) next
2901            if (dict) {
(gdb) continue
Continuing.
```

--> ASSIGNED

I tried with my setup and it works for me; attaching a screenshot of the events menu in the tendrl UI. Please note it won't be listed under the top-right bell icon, as it is a notify-only alert.

Created attachment 1363510 [details]
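The key step in the trace is corrupting friend_req.uuid with `set`, so that glusterd_peerinfo_find() returns NULL and gf_event(EVENT_UNKNOWN_PEER, ...) fires. An illustrative Python analogue of that code path (not gluster source; names and uuids are made up):

```python
# Illustrative analogue of __glusterd_handle_friend_update: if the
# unmarshalled uuid does not match any known peer (as happens after
# `set friend_req.uuid="sdfsdf"` in the gdb session), the update is
# rejected and an UNKNOWN_PEER event is emitted.
KNOWN_PEERS = {"b9c3a1d2": "node2", "7a1fe0c4": "node3"}  # hypothetical uuid -> host map

def handle_friend_update(friend_req_uuid, emit_event):
    """Return 'accepted' or 'rejected'; emit UNKNOWN_PEER on rejection."""
    if friend_req_uuid not in KNOWN_PEERS:  # glusterd_peerinfo_find() == NULL
        emit_event("UNKNOWN_PEER", peer=friend_req_uuid)  # gf_event(EVENT_UNKNOWN_PEER, ...)
        return "rejected"
    return "accepted"
```

Corrupting the uuid in gdb is simply a way of forcing the first branch without having to garble the request on the wire.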
Unknown_peer_seen_in_UI
What is the system configuration of the server and storage nodes?

@Darshan What was your scenario for seeing the UNKNOWN_PEER event? Have you tried to follow the steps described in comment 9? I'm not sure, but on your screenshot I see the event "unknown state of peer" and not an "unknown peer" event. Am I wrong?

@Nishanth The server has 27 GB of free memory and 16 GB of free disk space.

(In reply to Martin Kudlej from comment #17)
> @Darshan What was your scenario for seeing the UNKNOWN_PEER event? Have you
> tried to follow the steps described in comment 9?

Yes.

> I'm not sure, but on your screenshot I see the event "unknown state of peer"
> and not an "unknown peer" event. Am I wrong?

That's the message for unknown_peer. You can have a look at the handler function for unknown_peer here: https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/message/callback.py#L213

@Martin, the procedure is the same, and Atin helped us with that. Please retest. Moving the bug to ON_QA again.

I've tested this with Darshan's help (thank you for that!) and it works. --> VERIFIED

(In reply to Martin Kudlej from comment #20)
> I've tested this with Darshan's help (thank you for that!) and it works. -->
> VERIFIED

Nice, could you update the test case description so that anybody could replicate this again if needed?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478
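On the naming confusion discussed in the thread: the unknown_peer handler renders the native UNKNOWN_PEER event as an "unknown state of peer ..." message, which is why the UI text differs from the event tag. A simplified Python sketch of such a tag-to-message mapping (hypothetical; the real logic is in the linked callback.py, and the PEER_REJECT wording here is invented):

```python
# Hypothetical sketch of mapping gluster native event tags to the alert
# text shown in the tendrl UI; only the UNKNOWN_PEER wording is taken
# from the thread, the rest is illustrative.
MESSAGE_TEMPLATES = {
    "UNKNOWN_PEER": "unknown state of peer {peer}",
    "PEER_REJECT": "peer {peer} rejected",  # invented wording
}

def render_alert(event_tag, peer):
    """Return the UI message for a native event tag, or None if unmapped."""
    template = MESSAGE_TEMPLATES.get(event_tag)
    return template.format(peer=peer) if template else None
```

This explains why a tester grepping the UI for the literal string "unknown peer" would miss the alert even though the event was delivered.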