Description of problem:
When one brick in a replicate volume goes faulty while the other brick of its replica pair is still active, the service 'Geo-Replication - <volume>' displays the status as 'CRITICAL'.

Version-Release number of selected component (if applicable):
nagios-server-addons-0.1.3-3.el6rhs.x86_64
gluster-nagios-common-0.1.3-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a master volume of type replicate out of a cluster called master_cluster.
2. Create a slave volume of type replicate out of a cluster called slave_cluster.
3. Set up a passwordless connection from one node of the master cluster to one node of the slave cluster.
4. Run the command "gluster system:: execute gsec_create".
5. Run the command "gluster volume geo-replication <master_vol>::<slave_vol> create push-pem force".
6. Start the geo-rep session by running the command "gluster volume geo-replication <master_vol>::<slave_vol> start".
7. Make one of the bricks in the replica pair that is not active go to the faulty state: run 'ps aux | grep feedback' and kill the feedback process of that brick.

Actual results:
Geo-Replication - mastervol status is shown as 'CRITICAL' with status information 'Session status- vol_slave-FAULTY'.

Expected results:
Geo-Replication - mastervol status should be shown as 'WARNING' with status information 'Session status- vol_slave- PARTIAL_FAULTY'.

Additional info:
Currently, we cannot determine the status of nodes sub-volume wise. There is no way to correlate the output of geo-rep status with that of 'gluster volume info', because geo-rep status reports only the hostname of the node. We will be able to do this once geo-rep provides XML output that returns the host UUID.

The logic to determine Faulty is: count of passive + faulty nodes > (brick count / replica count).

For instance, in a 3 x 2 volume with replica pairs B1 <-> B2, B3 <-> B4, B5 <-> B6 and worker states P - F, A - P, A - P:
count of P + F = 4 > (6 / 2) ==> CRITICAL

The existing code used a >= comparison to handle both the replicate and distribute cases; the fix separates the logic for these two volume types, in http://review.gluster.org/8443
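To make the comparison concrete, here is a minimal Python sketch of the corrected decision. This is not the actual plugin code merged via http://review.gluster.org/8443; the function name, state labels and example volumes are illustrative assumptions.

def georep_session_status(worker_states, brick_count, replica_count):
    """Map per-brick geo-rep worker states to a Nagios-style session status.

    worker_states: one state per brick, e.g. 'ACTIVE', 'PASSIVE' or 'FAULTY'.
    """
    faulty = sum(1 for s in worker_states if s == 'FAULTY')
    passive = sum(1 for s in worker_states if s == 'PASSIVE')
    subvol_count = brick_count // replica_count

    if faulty == 0:
        return 'OK'
    if replica_count > 1:
        # Replicate / distribute-replicate: CRITICAL only when some replica
        # set is left with no active worker, i.e. passive + faulty workers
        # exceed the number of replica sets (strict '>', not the old '>=').
        if passive + faulty > subvol_count:
            return 'CRITICAL'          # session status FAULTY
        return 'WARNING'               # session status PARTIAL_FAULTY
    # Plain distribute: a single faulty worker already means unsynced data.
    return 'CRITICAL'


# Example from the comment above (3 x 2 volume, states P-F, A-P, A-P):
# passive + faulty = 4 > 3 replica sets  ==> CRITICAL
print(georep_session_status(
    ['PASSIVE', 'FAULTY', 'ACTIVE', 'PASSIVE', 'ACTIVE', 'PASSIVE'], 6, 2))

# The scenario reported in this bug (a passive node goes faulty while its
# replica pair stays active): passive + faulty = 3, not > 3  ==> WARNING
print(georep_session_status(
    ['ACTIVE', 'FAULTY', 'ACTIVE', 'PASSIVE', 'ACTIVE', 'PASSIVE'], 6, 2))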
From Kanagaraj, I understand that these bugs were moved to ON_QA by the errata tool. Since QE has not yet received the build, I am moving this bug back to the ASSIGNED state. Please move it to ON_QA once builds are attached to the errata.
Verified and works fine with build nagios-server-addons-0.1.8-1.el6rhs.noarch. In replicate and distribute-replicate volumes, when a passive node goes faulty, the geo-replication status is shown as "Warning" with status information "Session Status: <vol_name> - PARTIAL_FAULTY".
Hi Sahina, can you please review the edited doc text and sign off on its technical accuracy?
Looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0039.html