Bug 1331934

Summary: glusterd restart is failing if volume brick is down due to underlying FS crash.
Product: [Community] GlusterFS
Reporter: Atin Mukherjee <amukherj>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.11
CC: bsrirama, bugs
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.12
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1330481
Environment:
Last Closed: 2016-06-28 12:16:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1330385, 1330481
Bug Blocks:

Description Atin Mukherjee 2016-04-30 06:42:43 UTC
+++ This bug was initially created as a clone of Bug #1330481 +++

+++ This bug was initially created as a clone of Bug #1330385 +++

Description of problem:
=======================
glusterd restart fails if a volume brick is down due to an underlying filesystem (XFS) crash.


Version-Release number of selected component (if applicable):
============================================================
mainline


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Set up a one- or two-node cluster.
2. Create a 1x2 volume and start it.
3. Crash the underlying filesystem of one of the volume bricks using the "godown" tool (or any other method).
4. Check that the brick is down using "gluster volume status".
5. Try to restart glusterd. // the restart will fail

Actual results:
===============
glusterd restart fails if a volume brick is down due to an underlying FS crash.


Expected results:
=================
glusterd should restart successfully even when a brick is down.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-04-26 01:57:25 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs‑3.1.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Byreddy on 2016-04-26 02:13:45 EDT ---

Additional info:
================

[root@dhcp42-82 ~]# gluster volume status
Status of volume: Dis
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.82:/bricks/brick2/br2        49155     0          Y       23291
Brick 10.70.42.82:/bricks/brick1/br1        N/A       N/A        N       N/A  
Brick 10.70.42.82:/bricks/brick2/br3        49154     0          Y       23329
NFS Server on localhost                     2049      0          Y       23354
NFS Server on dhcp43-136.lab.eng.blr.redhat
.com                                        2049      0          Y       8049 
 
Task Status of Volume Dis
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp42-82 ~]# 
[root@dhcp42-82 ~]# systemctl restart glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@dhcp42-82 ~]# 


glusterd logs:
=============


[2016-04-26 06:08:47.439960] I [MSGID: 106478] [glusterd.c:1337:init] 0-management: Maximum allowed open file descriptors set to 65536
[2016-04-26 06:08:47.440044] I [MSGID: 106479] [glusterd.c:1386:init] 0-management: Using /var/lib/glusterd as working directory
[2016-04-26 06:08:47.453605] W [MSGID: 103071] [rdma.c:4594:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2016-04-26 06:08:47.453658] W [MSGID: 103055] [rdma.c:4901:init] 0-rdma.management: Failed to initialize IB Device
[2016-04-26 06:08:47.453677] W [rpc-transport.c:359:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2016-04-26 06:08:47.453885] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2016-04-26 06:08:47.453924] E [MSGID: 106243] [glusterd.c:1610:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2016-04-26 06:08:52.512606] I [MSGID: 106513] [glusterd-store.c:2065:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30712
[2016-04-26 06:08:53.671078] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: eac322e5-ef82-47db-b88b-2449c0164482
[2016-04-26 06:08:53.671466] C [MSGID: 106425] [glusterd-store.c:2434:glusterd_store_retrieve_bricks] 0-management: realpath() failed for brick /bricks/brick1/br1. The underlying file system may be in bad state [Input/output error]
[2016-04-26 06:08:53.671847] E [MSGID: 106201] [glusterd-store.c:3092:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: Dis
[2016-04-26 06:08:53.671888] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-04-26 06:08:53.671900] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-04-26 06:08:53.671907] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-04-26 06:08:53.672475] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fe6e9e2b2ad] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fe6e9e2b150] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fe6e9e2a739] ) 0-: received signum (0), shutting down

--- Additional comment from Vijay Bellur on 2016-04-26 06:26:29 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a underlying file system crash) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-26 13:27:58 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a underlying file system crash) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-29 02:47:44 EDT ---

REVIEW: http://review.gluster.org/14075 (glusterd: persist brickinfo->real_path) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-29 12:17:26 EDT ---

COMMIT: http://review.gluster.org/14075 committed in master by Jeff Darcy (jdarcy) 
------
commit f0fb05d2cefae08c143f2bfdef151084f5ddb498
Author: Atin Mukherjee <amukherj>
Date:   Tue Apr 26 15:27:43 2016 +0530

    glusterd: persist brickinfo->real_path
    
    Since real_path was not persisted and gets constructed at every glusterd
    restart, glusterd will fail to come up if one of the brick's underlying file
    system is crashed.
    
    Solution is to construct real_path only once and get it persisted.
    
    Change-Id: I97abc30372c1ffbbb2d43b716d7af09172147b47
    BUG: 1330481
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14075
    CentOS-regression: Gluster Build System <jenkins.com>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    NetBSD-regression: NetBSD Build System <jenkins.org>

Comment 1 Vijay Bellur 2016-04-30 06:43:44 UTC
REVIEW: http://review.gluster.org/14124 (glusterd: persist brickinfo->real_path) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-05-02 11:40:57 UTC
COMMIT: http://review.gluster.org/14124 committed in release-3.7 by Jeff Darcy (jdarcy) 
------
commit 45f6b416be0a8daeca9752910a332201bc17d851
Author: Atin Mukherjee <amukherj>
Date:   Tue Apr 26 15:27:43 2016 +0530

    glusterd: persist brickinfo->real_path
    
    Backport of http://review.gluster.org/14075
    
    Since real_path was not persisted and gets constructed at every glusterd
    restart, glusterd will fail to come up if one of the brick's underlying file
    system is crashed.
    
    Solution is to construct real_path only once and get it persisted.
    
    Change-Id: I97abc30372c1ffbbb2d43b716d7af09172147b47
    BUG: 1331934
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14075
    CentOS-regression: Gluster Build System <jenkins.com>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-on: http://review.gluster.org/14124
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 3 Kaushal 2016-06-28 12:16:09 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.12, please open a new bug report.

glusterfs-3.7.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user