Bug 1507841 - geo replication info is not correct
Summary: geo replication info is not correct
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-gluster-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Shubhendu Tripathi
QA Contact: Rochelle
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-31 09:42 UTC by Martin Kudlej
Modified: 2017-12-18 04:39 UTC
CC: 7 users

Fixed In Version: tendrl-gluster-integration-1.5.4-4.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-18 04:39:36 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:3478 normal SHIPPED_LIVE RHGS Web Administration packages 2017-12-18 09:34:49 UTC
Github https://github.com/Tendrl/gluster-integration/issues/459 None None None 2017-11-07 06:42:28 UTC

Description Martin Kudlej 2017-10-31 09:42:07 UTC
Description of problem:
I think the geo-replication info in Grafana is not correct. Stopping the slave volume, or shutting down the machines hosting the slave volume, has no effect on the geo-replication info.

Version-Release number of selected component (if applicable):
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-18.4.el7.x86_64
glusterfs-3.8.4-50.el7rhgs.x86_64
glusterfs-api-3.8.4-50.el7rhgs.x86_64
glusterfs-cli-3.8.4-50.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7.x86_64
glusterfs-client-xlators-3.8.4-50.el7rhgs.x86_64
glusterfs-events-3.8.4-50.el7rhgs.x86_64
glusterfs-fuse-3.8.4-18.4.el7.x86_64
glusterfs-fuse-3.8.4-50.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-50.el7rhgs.x86_64
glusterfs-libs-3.8.4-18.4.el7.x86_64
glusterfs-libs-3.8.4-50.el7rhgs.x86_64
glusterfs-server-3.8.4-50.el7rhgs.x86_64
python-etcd-0.4.5-1.noarch
rubygem-etcd-0.3.0-1.el7.noarch
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-gluster-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch


How reproducible:
100%

Expected results:
If there is any problem with the geo-replication session, it should be visible in the geo-replication info in Grafana.

Additional info:
$ gluster volume geo-replication master status # master is name of master volume
MASTER NODE     MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                       SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
---------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.103    master        /rhs/brick1/b2    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.42.103    master        /rhs/brick2/b5    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.42.103    master        /rhs/brick3/b8    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick1/b3    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick2/b6    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.27     master        /rhs/brick3/b9    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick1/b1    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick2/b4    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  
10.70.43.99     master        /rhs/brick3/b7    root          ssh://10.70.42.74::slave    N/A           Stopped    N/A             N/A                  


but the geo-replication info in Grafana looks like this:
Geo-Replication Session
Total - 1
Up - 1
Partial - 0
Down - 0

Comment 2 Shubhendu Tripathi 2017-10-31 10:18:00 UTC
Currently the volume-level georep session status is calculated in Tendrl based on status values reported by `gluster get-state` for the individual georep pairs. The logic is as below:

--------------------
if no of faulty pairs == 0: volume level georep status is marked as active(UP)
else if no of faulty pairs == total no of pairs: volume level georep status is marked as faulty (DOWN)
else volume level georep status is partial
---------------------

I would request suggestions from Aravinda, Atin, Amar to comment and suggest.
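The aggregation described in this comment can be sketched in Python roughly as follows (the function and status names are illustrative, not the actual Tendrl code). Note how pairs in the Stopped state count as non-faulty, so an all-Stopped session still reports UP, matching the Grafana panel in the bug description:

```python
def georep_session_status(pair_statuses):
    # Illustrative sketch of the current aggregation, not the actual
    # Tendrl code. Each entry in pair_statuses is the status string
    # reported by `gluster get-state` for one georep pair.
    faulty = sum(1 for s in pair_statuses if s == "Faulty")
    if faulty == 0:
        return "UP"        # no faulty pairs -> marked active (UP)
    if faulty == len(pair_statuses):
        return "DOWN"      # all pairs faulty -> marked faulty (DOWN)
    return "PARTIAL"       # some faulty, some not -> partial
```

With this logic, nine Stopped pairs contain zero Faulty pairs, so the session is reported UP, which is exactly the mismatch shown above.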

Comment 3 Shubhendu Tripathi 2017-10-31 10:19:48 UTC
Also, I feel that in this situation all the pairs would move to the faulty state, and the next sync cycle in gluster-integration would mark the volume-level georep session as DOWN overall. Darshan??

Comment 4 Aravinda VK 2017-10-31 10:41:04 UTC
(In reply to Shubhendu Tripathi from comment #2)
> Currently the volume level georep session status is calculated in tendrl
> based on status values reported from `gluster get-state` for individual
> georep pairs. the logic is as below
> 
> --------------------
> if no of faulty pairs == 0: volume level georep status is marked as
> active(UP)

Add one more condition if num_faulty_pairs == 0 and num_stopped_or_paused_or_created == 0

> else if no of faulty pairs == total no of pairs: volume level georep status
> is marked as faulty (DOWN)

Add one more condition if num_faulty_pairs == total_pairs and num_stopped_or_paused_or_created == 0

> else volume level georep status is partial

Add one more condition elif num_stopped_or_paused_or_created == 0

> ---------------------
> 
> I would request suggestions from Aravinda, Atin, Amar to comment and suggest.

Comment 5 Aravinda VK 2017-10-31 10:42:48 UTC
For Grafana you can push the following states

- created
- up
- down
- partial
- stopped
- paused

Comment 6 Shubhendu Tripathi 2017-10-31 11:01:23 UTC
Thanks Aravinda for clearly marking the requirements. So as per my understanding the logic could be as below now

-----------------
if all pairs are in created state, georep session status = CREATED

if no faulty pairs and num_stopped_or_paused_or_created == 0, georep session status = UP 

if no of faulty pairs = total no of pairs and num_stopped_or_paused_or_created == 0, georep session status = DOWN

if all pairs are in stopped state, georep session status = STOPPED

if all pairs in paused state, georep session status = PAUSED
-----------------

@Aravinda, Ack if this looks fine.

@Ju, ack if these states could be depicted in dashboards.
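Putting Aravinda's additional conditions together with the states above, the revised aggregation could look roughly like this sketch (names are illustrative, not the actual Tendrl implementation):

```python
def georep_session_status(pair_statuses):
    # Sketch of the revised aggregation agreed in the comments above.
    # Pair statuses come from `gluster get-state` for each georep pair.
    total = len(pair_statuses)
    faulty = sum(1 for s in pair_statuses if s == "Faulty")
    created = sum(1 for s in pair_statuses if s == "Created")
    stopped = sum(1 for s in pair_statuses if s == "Stopped")
    paused = sum(1 for s in pair_statuses if s == "Paused")
    stopped_paused_created = stopped + paused + created

    if created == total:
        return "CREATED"   # session established but never started
    if stopped == total:
        return "STOPPED"
    if paused == total:
        return "PAUSED"
    if faulty == 0 and stopped_paused_created == 0:
        return "UP"        # every pair actively running
    if faulty == total:
        return "DOWN"      # every pair faulty
    return "PARTIAL"       # any other mix
```

Under this logic, the all-Stopped session from the bug description now reports STOPPED instead of UP.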

Comment 7 Aravinda VK 2017-10-31 11:37:52 UTC
(In reply to Shubhendu Tripathi from comment #6)
> Thanks Aravinda for clearly marking the requirements. So as per my
> understanding the logic could be as below now
> 
> -----------------
> if all pairs are in created state, georep session status = CREATED
> 
> if no faulty pairs and num_stopped_or_paused_or_created == 0, georep session
> status = UP 
> 
> if no of faulty pairs = total no of pairs and
> num_stopped_or_paused_or_created == 0, georep session status = DOWN
> 
> if all pairs are in stopped state, georep session status = STOPPED
> 
> if all pairs in paused state, georep session status = PAUSED
> -----------------
> 
> @Aravinda, Ack if this looks fine.
> 
> @Ju, ack if these states could be depicted in dashboards.

Looks good to me.

Comment 8 Shubhendu Tripathi 2017-11-02 03:26:06 UTC
@Ju, we need comments from you regarding UX here.

Comment 9 Ju Lim 2017-11-07 15:45:42 UTC
@shtripat

Ack on the six new statuses for geo-replication sessions; these should be reflected on the applicable Grafana dashboards.

CREATED: Geo-replication session is established
STOPPED: Geo-replication session is stopped
ONLINE/UP: All bricks in the geo-replication session are up and running
OFFLINE/DOWN: All bricks in the geo-replication session are down
PARTIAL: Some bricks in the geo-replication session are online and some are offline
PAUSED: All pairs in the geo-replication session are paused

Comment 12 Rochelle 2017-11-19 08:56:48 UTC
After stopping both the geo-rep session and the slave volume, the slave volume going down was reflected successfully, but the session moving to the "stopped" state was still not reflected after waiting about 10 minutes.

    Checked in the following version : tendrl-api-1.5.4-2.el7rhgs.noarch

Comment 14 Rochelle 2017-11-23 12:52:46 UTC
All the geo-replication states were reflected correctly on the Grafana dashboard.

Moving this bug to verified.

Comment 18 errata-xmlrpc 2017-12-18 04:39:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478

