Bug 1109727 - [Nagios] - when one brick in replicate volume goes faulty and if the other one is active geo replication volume status should be shown as 'PARTIAL_FAULTY'
Summary: [Nagios] - when one brick in replicate volume goes faulty and if the other one is active geo replication volume status should be shown as 'PARTIAL_FAULTY'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nagios-addons
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Sahina Bose
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-16 09:08 UTC by RamaKasturi
Modified: 2015-05-13 17:41 UTC
CC List: 5 users

Fixed In Version: gluster-nagios-addons-0.1.11-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, when one of the bricks in a replica pair of a replicate volume was down, the status of the Geo-replication session was set to FAULTY, which caused the Nagios plugin status to be reported as CRITICAL. With this fix, if only one of the bricks in a replica pair is down, the status of the Geo-replication session is set to PARTIAL_FAULTY, because the Geo-replication session is still active on another Red Hat Storage node in that scenario.
Clone Of:
Environment:
Last Closed: 2015-01-15 13:48:19 UTC
Embargoed:




Links:
System: Red Hat Product Errata
ID: RHBA-2015:0039
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage Console 3.0 enhancement and bug fix update #3
Last Updated: 2015-01-15 18:46:40 UTC

Description RamaKasturi 2014-06-16 09:08:42 UTC
Description of problem:
When one brick in a replicate volume goes to a faulty state while the other brick in the replica pair is active, the service 'Geo-Replication - <volume>' displays the status as 'CRITICAL'.

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.3-3.el6rhs.x86_64
gluster-nagios-common-0.1.3-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a master volume of type replicate from a cluster called master_cluster.
2. Create a slave volume of type replicate from a cluster called slave_cluster.
3. Set up a passwordless SSH connection from one node of the master cluster to one node of the slave cluster.
4. Run the command "gluster system:: execute gsec_create"
5. Run the command "gluster volume geo-replication <master_vol> <slave_host>::<slave_vol> create push-pem force".
6. Start the geo-rep session by running the command "gluster volume geo-replication <master_vol> <slave_host>::<slave_vol> start".
7. Make one of the bricks in a replica pair that is not active go to a faulty state: run 'ps aux | grep feedback' and kill the feedback process of that brick (a rough way to check the resulting worker states is sketched below).
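
For reference, a rough way to confirm which workers went faulty after step 7 is to count the per-worker STATUS values in the geo-rep status output. The snippet below is only a sketch: the command form, the plain-text output format, and the function name are assumptions, not part of this report.

# Rough verification helper for step 7 (a sketch, not part of the plugin):
# counts the per-worker STATUS values reported by the geo-rep status command.
import subprocess


def count_geo_rep_states(master_vol, slave_host, slave_vol):
    # Command form assumed; adjust to the session actually being tested.
    cmd = ['gluster', 'volume', 'geo-replication', master_vol,
           '%s::%s' % (slave_host, slave_vol), 'status']
    output = subprocess.check_output(cmd).decode()
    counts = {'Active': 0, 'Passive': 0, 'Faulty': 0}
    for line in output.splitlines():
        for state in counts:
            # Match whole words so hostnames do not trigger false counts.
            if state in line.split():
                counts[state] += 1
    return counts


# Example (hypothetical host and volume names):
# print(count_geo_rep_states('master_vol', 'slave_node1', 'slave_vol'))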

Actual results:
The 'Geo-Replication - mastervol' service status is shown as 'CRITICAL' with status information 'Session status- vol_slave-FAULTY'.

Expected results:
The 'Geo-Replication - mastervol' service status should be shown as 'WARNING' with status information 'Session status- vol_slave-PARTIAL_FAULTY'.
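
For context, Nagios plugins report state through exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The minimal sketch below is not the shipped gluster-nagios-addons check; the function name and output text are illustrative. It only shows the mapping the expected result implies: FAULTY reported as CRITICAL and PARTIAL_FAULTY as WARNING.

#!/usr/bin/env python
# Minimal sketch of the Nagios mapping implied above (not the shipped plugin).
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
import sys

STATUS_TO_EXIT = {'OK': 0, 'PARTIAL_FAULTY': 1, 'FAULTY': 2}


def report_geo_rep(slave_vol, session_status):
    # PARTIAL_FAULTY maps to WARNING, FAULTY maps to CRITICAL.
    exit_code = STATUS_TO_EXIT.get(session_status, 3)
    state = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL'}.get(exit_code, 'UNKNOWN')
    print("%s - Session status- %s-%s" % (state, slave_vol, session_status))
    return exit_code


if __name__ == '__main__':
    # Example: one passive brick of a replica pair has gone faulty.
    sys.exit(report_geo_rep('vol_slave', 'PARTIAL_FAULTY'))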

Additional info:

Comment 3 Sahina Bose 2014-08-08 13:44:38 UTC
Currently, we cannot determine the status of nodes sub-volume-wise. There is no way to correlate the output of geo-rep status with that of gluster volume info, as geo-rep status uses the hostname of the node. We will be able to do this when we have XML output for geo-rep status, which returns the host UUID.

The logic to determine FAULTY is: count of passive + faulty nodes > (brick count / replica count).

For instance, in a 3 x 2 volume:

B1 <-> B2, B3 <-> B4, B5 <-> B6
P - F, A - P, A - P

Count of P + F = 4 > (6 / 2) ==> Critical

The existing code had a >= comparison to handle both the replicate and distribute cases; the logic for these two volume types has been separated to fix this, in http://review.gluster.org/8443.
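
To make the comparison concrete, the sketch below illustrates the threshold described in this comment. It is not the actual change in http://review.gluster.org/8443; the function, the argument names, and the PARTIAL_FAULTY fallback are assumptions.

# Illustrative sketch of the threshold in this comment (not the actual patch):
# a geo-rep session on a replicate volume is FAULTY only when the count of
# Passive + Faulty workers exceeds (brick count / replica count); the old
# ">=" comparison, kept here for the distribute case, triggered too early
# for replicate volumes.

def aggregate_session_status(worker_states, brick_count, replica_count):
    # worker_states: per-brick states, e.g. ['Passive', 'Faulty', 'Active',
    # 'Passive', 'Active', 'Passive'] for the 3 x 2 example above.
    not_active = sum(1 for s in worker_states if s in ('Passive', 'Faulty'))
    faulty = worker_states.count('Faulty')
    subvolumes = brick_count / replica_count

    if replica_count > 1:
        # Replicate: each sub-volume normally has one Active worker and the
        # rest Passive, so only "greater than" indicates a sub-volume with
        # no Active worker left.
        is_faulty = not_active > subvolumes
    else:
        # Pure distribute: there are no Passive workers, so ">=" applies.
        is_faulty = not_active >= subvolumes

    if is_faulty:
        return 'FAULTY'
    return 'PARTIAL_FAULTY' if faulty else 'OK'


# 3 x 2 example from this comment: P-F, A-P, A-P  ->  4 > 3  ->  FAULTY
# print(aggregate_session_status(
#     ['Passive', 'Faulty', 'Active', 'Passive', 'Active', 'Passive'], 6, 2))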

Comment 6 RamaKasturi 2014-10-17 05:41:08 UTC
From Kanagaraj, I understand that these bugs have been moved to ON_QA by the errata tool.

Since QE has not yet received the build, I am moving this bug back to the ASSIGNED state. Please move it to ON_QA once builds are attached to the errata.

Comment 7 RamaKasturi 2014-11-17 06:48:55 UTC
Verified and works fine with build nagios-server-addons-0.1.8-1.el6rhs.noarch.

In replicate and distribute-replicate volumes, when a passive node goes to a faulty state, the geo-replication status is shown as "Warning" with status information "Session Status: <vol_name> - PARTIAL_FAULTY".

Comment 8 Pavithra 2014-12-17 06:31:04 UTC
Hi Sahina,

Can you please review the edited doc text and sign off on the technical accuracy?

Comment 9 Sahina Bose 2014-12-24 09:17:22 UTC
Looks good.

Comment 11 errata-xmlrpc 2015-01-15 13:48:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0039.html

