Bug 1109727

Summary: [Nagios] - when one brick in a replicate volume goes faulty and the other one is active, the geo-replication volume status should be shown as 'PARTIAL_FAULTY'
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: RamaKasturi <knarra>
Component: gluster-nagios-addons    Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: high Docs Contact:
Priority: high    
Version: rhgs-3.0    CC: dpati, nsathyan, psriniva, rnachimu, sabose
Target Milestone: ---    Keywords: ZStream
Target Release: RHGS 3.0.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: gluster-nagios-addons-0.1.11-1.el6rhs Doc Type: Bug Fix
Doc Text:
Previously, when one of the bricks in a replica pair was down in a replicate volume, the status of the Geo-replication session was set to FAULTY, which caused the status of the Nagios plugin to be set to CRITICAL. With this fix, if only one of the bricks in a replica pair is down, the status of the Geo-replication session is set to PARTIAL FAULTY, because in that scenario the Geo-replication session is still active on another Red Hat Storage node.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-01-15 13:48:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description RamaKasturi 2014-06-16 09:08:42 UTC
Description of problem:
When one brick in a replicate volume goes faulty and the other brick in the replica pair is active, the service 'Geo-Replication - <volume>' displays the status as 'CRITICAL'.

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.3-3.el6rhs.x86_64
gluster-nagios-common-0.1.3-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a master vol of type replicate out of a cluster called master_cluster.
2. Create a slave vol of type replicate out of a cluster called slave_cluster.
3. Set up a passwordless SSH connection between one node of the master cluster and one node of the slave cluster.
4. Run the command "gluster system:: execute gsec_create"
5. Now run the command "gluster volume geo-replication <master_vol>::<slave_vol> create push-pem force".
6. Now start the geo-rep session by running the command "gluster volume geo-replication <master_vol>::<slave_vol> start".
7. Now make one of the passive (not active) bricks in the replica go to a faulty state: run 'ps aux | grep feedback' and kill the feedback process of that brick.

Actual results:
Geo-replication - mastervol status is shown as 'CRITICAL' with status information 'Session status- vol_slave-FAULTY'

Expected results:
Geo-replication - mastervol status should be shown as 'WARNING' with status information 'Session status- vol_slave- PARTIAL_FAULTY'
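
A minimal illustrative sketch of this expected mapping (Python; the function and names below are assumptions used only for illustration, not the actual gluster-nagios-addons code):

# Illustrative sketch only - not the actual plugin code.
# Standard Nagios exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def nagios_state_for_session(slave_vol, session_status):
    """Map a geo-replication session status string to a Nagios state."""
    message = "Session status- %s - %s" % (slave_vol, session_status)
    if session_status == "ACTIVE":
        return OK, message
    if session_status == "PARTIAL_FAULTY":
        # Only one brick of a replica pair is down; the session is still
        # served from the other node, so warn instead of going critical.
        return WARNING, message
    if session_status == "FAULTY":
        return CRITICAL, message
    return UNKNOWN, message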

Additional info:

Comment 3 Sahina Bose 2014-08-08 13:44:38 UTC
Currently, we cannot determine the status of nodes sub-volume wise. There is no way to correlate the output of geo-rep status with that of gluster volume info, as geo-rep status uses the hostname of the node. We will be able to do this when we have the xml output for geo-rep, which returns the host uuid.

The logic to determine FAULTY is: count of passive + faulty nodes > (brick count / replica count)

For instance, in a 3 x 2 volume,

B1 <-> B2, B3 <-> B4, B5 <-> B6
P - F, A - P, A - P

Count of P + F = 4 > (6/2) = 3 ==> Critical

The existing code had a >= comparison to handle both the replicate and distribute cases; to fix this, the logic for these two volume types has been separated in http://review.gluster.org/8443
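
A rough sketch of that separated logic (illustrative Python; the function and variable names are assumptions, the actual change is in the gerrit review above):

# Illustrative sketch only - the real change is in the review linked above.
def geo_rep_session_status(worker_statuses, brick_count, replica_count):
    """worker_statuses: per-brick states, e.g. ['ACTIVE', 'PASSIVE', 'FAULTY', ...]"""
    not_active = sum(1 for s in worker_statuses if s in ('PASSIVE', 'FAULTY'))
    faulty = sum(1 for s in worker_statuses if s == 'FAULTY')
    subvolume_count = brick_count // replica_count

    if replica_count > 1:
        # Replicate: each sub-volume normally has one ACTIVE worker and the
        # rest PASSIVE, so not_active == subvolume_count is still healthy.
        # A strictly greater count implies a sub-volume with no ACTIVE worker.
        if not_active > subvolume_count:
            return 'FAULTY'
    else:
        # Distribute: each brick is its own sub-volume, so the session is
        # FAULTY only when all bricks are faulty.
        if faulty >= brick_count:
            return 'FAULTY'
    return 'PARTIAL_FAULTY' if faulty > 0 else 'OK'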

Comment 6 RamaKasturi 2014-10-17 05:41:08 UTC
From Kanagaraj, I understand that these bugs have been moved to ON_QA by errata.

Since QE has not yet received the build, I am moving this bug back to the ASSIGNED state. Please move it to ON_QA once builds are attached to the errata.

Comment 7 RamaKasturi 2014-11-17 06:48:55 UTC
Verified and works fine with build nagios-server-addons-0.1.8-1.el6rhs.noarch.

In replicate and distribute-replicate volumes, when a passive node goes faulty, the geo-replication status is shown as "Warning" with status information "Session Status: <vol_name> - PARTIAL_FAULTY".

Comment 8 Pavithra 2014-12-17 06:31:04 UTC
Hi Sahina,

Can you please review the edited doc text and sign off on the technical accuracy?

Comment 9 Sahina Bose 2014-12-24 09:17:22 UTC
Looks good.

Comment 11 errata-xmlrpc 2015-01-15 13:48:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html