Bug 1036039

Summary: [RHSC] Bricks status is not getting synched when gluster CLI output shows the port as N/A
Product: Red Hat Gluster Storage
Component: rhsc
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: medium
Status: CLOSED ERRATA
Reporter: Shruti Sampat <ssampat>
Assignee: Sahina Bose <sabose>
QA Contact: Shruti Sampat <ssampat>
CC: dpati, dtsang, knarra, mmahoney, pprakash, rhs-bugs, sdharane
Keywords: ZStream
Target Milestone: ---
Target Release: RHGS 2.1.2
Fixed In Version: cb11
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Last Closed: 2014-02-25 08:06:51 UTC
Attachments: engine logs

Description Shruti Sampat 2013-11-29 09:08:50 UTC
Description of problem:
------------------------

When glusterd was stopped and then started on a machine, the gluster CLI command for volume status returned the following output - 

[root@rhs glusterfs_rpms]# gluster v status                                                                                                                                                                    
Status of volume: dis_rep_vol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.84:/rhs/brick4/b1                        N/A     Y       21427
Brick 10.70.37.132:/rhs/brick4/b1                       49153   Y       15844
Brick 10.70.37.84:/rhs/brick5/b1                        N/A     Y       21438
Brick 10.70.37.132:/rhs/brick5/b1                       49154   Y       15856
Brick 10.70.37.64:/rhs/brick5/b1                        49154   Y       6428
Brick 10.70.37.176:/rhs/brick5/b1                       49154   Y       14884
NFS Server on localhost                                 2049    Y       3285
Self-heal Daemon on localhost                           N/A     Y       3293
NFS Server on 10.70.37.176                              2049    Y       5005
Self-heal Daemon on 10.70.37.176                        N/A     Y       5012
NFS Server on 10.70.37.132                              2049    Y       30595
Self-heal Daemon on 10.70.37.132                        N/A     Y       30605
NFS Server on 10.70.37.64                               2049    Y       22804
Self-heal Daemon on 10.70.37.64                         N/A     Y       22812
 
Task Status of Volume dis_rep_vol
------------------------------------------------------------------------------
There are no active volume tasks


The port number for a couple of bricks, as seen above, is reported as N/A. Because of this, the brick status that was set to DOWN when glusterd went down was not set back to UP after glusterd was started. The following is from the engine logs -

2013-11-28 20:57:38,270 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-67) Error while refreshing brick statuses for volume dis_rep_vol of cluster test: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: java.lang.NumberFormatException: For input string: "N/A" (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:122) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterJob.runVdsCommand(GlusterJob.java:64) [bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterSyncJob.getVolumeAdvancedDetails(GlusterSyncJob.java:848) [bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterSyncJob.refreshBrickStatuses(GlusterSyncJob.java:806) [bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterSyncJob.refreshClusterHeavyWeightData(GlusterSyncJob.java:791) [bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterSyncJob.refreshHeavyWeightData(GlusterSyncJob.java:766) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source) [:1.7.0_45]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
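
For reference, the failure above is the standard behaviour of java.lang.Integer.parseInt when it is handed the literal string "N/A" instead of a numeric port. A minimal, self-contained sketch (not the actual engine code; the class and variable names are made up for illustration):

public class PortParseDemo {
    public static void main(String[] args) {
        // Value reported by 'gluster volume status' for the affected bricks
        String portFromCli = "N/A";
        // A call of this kind in the brick-status sync path fails with:
        // java.lang.NumberFormatException: For input string: "N/A"
        int port = Integer.parseInt(portFromCli);
        System.out.println(port);
    }
}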


Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.2-0.25.master.el6_5 
glusterfs 3.4.0.44.1u2rhs

How reproducible:
Saw it a couple of times.

Steps to Reproduce:
1. In a cluster of 4 nodes, kill glusterd on one of the nodes and observe that the status of the bricks residing on that node is set to DOWN in the UI.
2. Start glusterd on the node and wait for about 5 minutes for the brick status to be synched back to UP.

Actual results:
The brick status is not set to UP even after more than 10 minutes.
The exception pasted above is seen in the engine logs.

Expected results:
The brick status should have been set to UP.

Additional info:

Comment 1 Shruti Sampat 2013-11-29 09:10:58 UTC
Created attachment 830549 [details]
engine logs

Comment 3 Sahina Bose 2013-12-02 09:36:46 UTC
According to the gluster team, if the port is returned as N/A for a brick, the brick should be shown as DOWN.

The engine code has been changed so that an exception is not thrown in such cases.
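
A minimal sketch of the kind of guard described here, assuming the fix treats a non-numeric port as "port unknown" and marks the brick DOWN rather than letting the parse exception abort the whole sync (the class and method names below are hypothetical, not the actual GlusterSyncJob code):

// Hypothetical helper illustrating the behaviour described in this comment;
// the names do not correspond to the real engine classes.
public final class BrickPortParser {

    // Returns the brick port, or null when gluster reports it as "N/A" (or any
    // other non-numeric value), so the caller can mark the brick DOWN instead of
    // failing the status sync with a NumberFormatException.
    public static Integer parsePortOrNull(String cliValue) {
        if (cliValue == null) {
            return null;
        }
        try {
            return Integer.valueOf(cliValue.trim());
        } catch (NumberFormatException e) {
            return null; // e.g. "N/A" -> port unknown, brick treated as DOWN
        }
    }
}

With a guard like this, an N/A port simply leaves the brick marked DOWN, which is consistent with the behaviour verified in comment 4.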

Comment 4 Shruti Sampat 2013-12-17 07:04:18 UTC
Verified as fixed in Red Hat Storage Console Version: 2.1.2-0.27.beta.el6_5. Brick status remains down when "gluster volume status" returns ports as N/A. No exception seen in engine logs.

Comment 6 errata-xmlrpc 2014-02-25 08:06:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html