Bug 1507841

Summary: geo replication info is not correct
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Martin Kudlej <mkudlej>
Component: web-admin-tendrl-gluster-integration    Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED ERRATA QA Contact: Rochelle <rallan>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.3CC: avishwan, julim, nthomas, rallan, rcyriac, sankarshan, shtripat
Target Milestone: ---    Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-gluster-integration-1.5.4-4.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-18 04:39:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin Kudlej 2017-10-31 09:42:07 UTC
Description of problem:
I think the geo-replication info in Grafana is not correct. Stopping the slave volume, or shutting down the machines hosting the slave volume, has no effect on the geo-replication info.

Version-Release number of selected component (if applicable):
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-18.4.el7.x86_64
glusterfs-3.8.4-50.el7rhgs.x86_64
glusterfs-api-3.8.4-50.el7rhgs.x86_64
glusterfs-cli-3.8.4-50.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7.x86_64
glusterfs-client-xlators-3.8.4-50.el7rhgs.x86_64
glusterfs-events-3.8.4-50.el7rhgs.x86_64
glusterfs-fuse-3.8.4-18.4.el7.x86_64
glusterfs-fuse-3.8.4-50.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-50.el7rhgs.x86_64
glusterfs-libs-3.8.4-18.4.el7.x86_64
glusterfs-libs-3.8.4-50.el7rhgs.x86_64
glusterfs-server-3.8.4-50.el7rhgs.x86_64
python-etcd-0.4.5-1.noarch
rubygem-etcd-0.3.0-1.el7.noarch
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-gluster-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch


How reproducible:
100%

Expected results:
If there is any problem with the geo-replication session, it should be visible in the geo-replication info in Grafana.

Additional info:
$ gluster volume geo-replication master status # master is name of master volume
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
---------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.103    master        /rhs/brick1/b2    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.42.103    master        /rhs/brick2/b5    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.42.103    master        /rhs/brick3/b8    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick1/b3    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick2/b6    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick3/b9    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick1/b1    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick2/b4    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick3/b7    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  


and the geo-replication info in Grafana looks like this:
Geo-Replication Session
Total - 1
Up - 1
Partial - 0
Down - 0

Comment 2 Shubhendu Tripathi 2017-10-31 10:18:00 UTC
Currently the volume-level georep session status is calculated in tendrl based on the status values reported by `gluster get-state` for the individual georep pairs. The logic is as below:

--------------------
if no. of faulty pairs == 0: volume-level georep status is marked as active (UP)
else if no. of faulty pairs == total no. of pairs: volume-level georep status is marked as faulty (DOWN)
else: volume-level georep status is partial
---------------------
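The aggregation described above can be sketched as follows. This is a minimal illustrative sketch, not the actual tendrl-gluster-integration code; the function name and the "Faulty" pair-status string are assumptions.

```python
def aggregate_georep_status(pair_statuses):
    """Aggregate per-pair geo-rep statuses into a volume-level status.

    Illustrative only: assumes each entry is a status string as
    reported per brick pair, with "Faulty" marking a broken pair.
    """
    total = len(pair_statuses)
    faulty = sum(1 for s in pair_statuses if s == "Faulty")
    if faulty == 0:
        return "UP"          # no faulty pairs -> session up
    if faulty == total:
        return "DOWN"        # every pair faulty -> session down
    return "PARTIAL"         # mixed -> partial
```

Note that this logic only distinguishes faulty vs. non-faulty pairs, which is exactly why stopped or paused sessions were still being counted as "Up" in Grafana.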

I would request Aravinda, Atin, and Amar to comment and suggest.

Comment 3 Shubhendu Tripathi 2017-10-31 10:19:48 UTC
Also, I feel that in this situation all the pairs would move to the faulty state, and the next sync cycle in gluster-integration would mark the volume-level georep session as DOWN overall. Darshan??

Comment 4 Aravinda VK 2017-10-31 10:41:04 UTC
(In reply to Shubhendu Tripathi from comment #2)
> Currently the volume level georep session status is calculated in tendrl
> based on status values reported from `gluster get-state` for individual
> georep pairs. the logic is as below
> 
> --------------------
> if no of faulty pairs == 0: volume level georep status is marked as
> active(UP)

Add one more condition if num_faulty_pairs == 0 and num_stopped_or_paused_or_created == 0

> else if no of faulty pairs == total no of pairs: volume level georep status
> is marked as faulty (DOWN)

Add one more condition if num_faulty_pairs == total_pairs and num_stopped_or_paused_or_created == 0

> else volume level georep status is partial

Add one more condition elif num_stopped_or_paused_or_created == 0

> ---------------------
> 
> I would request suggestions from Aravinda, Atin, Amar to comment and suggest.

Comment 5 Aravinda VK 2017-10-31 10:42:48 UTC
For Grafana you can push the following states

- created
- up
- down
- partial
- stopped
- paused

Comment 6 Shubhendu Tripathi 2017-10-31 11:01:23 UTC
Thanks Aravinda for clearly marking the requirements. So, as per my understanding, the logic could now be as below:

-----------------
if all pairs are in created state, georep session status = CREATED

if no faulty pairs and num_stopped_or_paused_or_created == 0, georep session status = UP 

if no of faulty pairs = total no of pairs and num_stopped_or_paused_or_created == 0, georep session status = DOWN

if all pairs are in stopped state, georep session status = STOPPED

if all pairs in paused state, georep session status = PAUSED
-----------------
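Combining the conditions above with Aravinda's additions from comment #4, the revised aggregation could look like the sketch below. This is a hypothetical illustration (function name and status strings are assumptions, not the shipped fix in tendrl-gluster-integration-1.5.4-4):

```python
def aggregate_georep_status_revised(pair_statuses):
    """Volume-level geo-rep status including stopped/paused/created states.

    Illustrative sketch of the logic agreed in comments #4 and #6.
    """
    total = len(pair_statuses)
    faulty = sum(1 for s in pair_statuses if s == "Faulty")
    created = sum(1 for s in pair_statuses if s == "Created")
    stopped = sum(1 for s in pair_statuses if s == "Stopped")
    paused = sum(1 for s in pair_statuses if s == "Paused")
    # pairs not in a running state at all
    stopped_paused_created = created + stopped + paused

    if created == total:
        return "CREATED"
    if stopped == total:
        return "STOPPED"
    if paused == total:
        return "PAUSED"
    if faulty == 0 and stopped_paused_created == 0:
        return "UP"
    if faulty == total and stopped_paused_created == 0:
        return "DOWN"
    return "PARTIAL"
```

With this version, an all-stopped session reports STOPPED rather than falling through to UP, which is the behavior the reporter expected.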

@Aravinda, Ack if this looks fine.

@Ju, ack if these states could be depicted in dashboards.

Comment 7 Aravinda VK 2017-10-31 11:37:52 UTC
(In reply to Shubhendu Tripathi from comment #6)
> Thanks Aravinda for clearly marking the requirements. So as per my
> understanding the logic could be as below now
> 
> -----------------
> if all pairs are in created state, georep session status = CREATED
> 
> if no faulty pairs and num_stopped_or_paused_or_created == 0, georep session
> status = UP 
> 
> if no of faulty pairs = total no of pairs and
> num_stopped_or_paused_or_created == 0, georep session status = DOWN
> 
> if all pairs are in stopped state, georep session status = STOPPED
> 
> if all pairs in paused state, georep session status = PAUSED
> -----------------
> 
> @Aravinda, Ack if this looks fine.
> 
> @Ju, ack if these states could be depicted in dashboards.

Looks good to me.

Comment 8 Shubhendu Tripathi 2017-11-02 03:26:06 UTC
@Ju, we need comments from you regarding UX here.

Comment 9 Ju Lim 2017-11-07 15:45:42 UTC
@shtripat

Ack on the six new statuses for geo-replication sessions; these should be reflected on the applicable Grafana dashboards.

CREATED: Geo-replication session is established
STOPPED: Geo-replication session is stopped
ONLINE/UP: Geo-replication session (all bricks) is up and running
OFFLINE/DOWN: Geo-replication session (all bricks) is down
PARTIAL: Geo-replication session has some bricks online and some bricks offline
PAUSED: All the pairs are in the paused state

Comment 12 Rochelle 2017-11-19 08:56:48 UTC
After stopping the geo-rep session as well as the slave volume, the slave volume going down was reflected successfully, but the session moving to the "stopped" state was not reflected even after waiting about 10 minutes after stopping the session.

    Checked in the following version : tendrl-api-1.5.4-2.el7rhgs.noarch

Comment 14 Rochelle 2017-11-23 12:52:46 UTC
All the states with respect to geo-replication were reflected correctly on the Grafana dashboard.

Moving this bug to verified.

Comment 18 errata-xmlrpc 2017-12-18 04:39:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478