Bug 1330481 - glusterd restart is failing if volume brick is down due to underlying FS crash.
Summary: glusterd restart is failing if volume brick is down due to underlying FS crash.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On: 1330385
Blocks: 1331934
 
Reported: 2016-04-26 10:25 UTC by Atin Mukherjee
Modified: 2016-06-16 14:04 UTC
CC: 2 users

Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Clone Of: 1330385
Clones: 1331934
Environment:
Last Closed: 2016-06-16 14:04:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Atin Mukherjee 2016-04-26 10:25:52 UTC
+++ This bug was initially created as a clone of Bug #1330385 +++

Description of problem:
=======================
glusterd restart is failing if a volume brick is down due to an underlying filesystem crash (XFS)


Version-Release number of selected component (if applicable):
============================================================
mainline


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have a one- or two-node cluster.
2. Create a 1*2 volume and start it.
3. Crash the underlying filesystem for one of the volume bricks using the "godown" tool (sketched below) or any other method.
4. Check that the brick is down using "gluster volume status".
5. Try to restart glusterd. // the restart will fail
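
For reference, here is a minimal sketch of what the xfstests "godown" tool does, assuming the xfsprogs development headers (<xfs/xfs.h>) provide XFS_IOC_GOINGDOWN and the XFS_FSOP_GOING_FLAGS_* constants (an assumption about your distribution's packaging); run it only against a scratch filesystem:

    /* godown_sketch.c - force-shutdown an XFS filesystem so that all
       further I/O on it fails with EIO, mimicking an FS crash.
       Illustrative only; run solely on a disposable scratch FS. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <xfs/xfs.h> /* XFS_IOC_GOINGDOWN, XFS_FSOP_GOING_FLAGS_* */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <xfs-mountpoint>\n", argv[0]);
            return 2;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Shut the filesystem down immediately, without flushing the log. */
        uint32_t flags = XFS_FSOP_GOING_FLAGS_NOLOGFLUSH;
        if (ioctl(fd, XFS_IOC_GOINGDOWN, &flags) < 0) {
            perror("ioctl(XFS_IOC_GOINGDOWN)");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }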

Actual results:
===============
glusterd restart fails if a volume brick is down due to an underlying filesystem crash.


Expected results:
=================
glusterd restart should work even when a brick's underlying filesystem has crashed.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-04-26 01:57:25 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Byreddy on 2016-04-26 02:13:45 EDT ---

Additional info:
================

[root@dhcp42-82 ~]# gluster volume status
Status of volume: Dis
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.82:/bricks/brick2/br2        49155     0          Y       23291
Brick 10.70.42.82:/bricks/brick1/br1        N/A       N/A        N       N/A  
Brick 10.70.42.82:/bricks/brick2/br3        49154     0          Y       23329
NFS Server on localhost                     2049      0          Y       23354
NFS Server on dhcp43-136.lab.eng.blr.redhat
.com                                        2049      0          Y       8049 
 
Task Status of Volume Dis
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp42-82 ~]# 
[root@dhcp42-82 ~]# systemctl restart glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@dhcp42-82 ~]# 


glusterd logs:
=============


pid --log-level INFO)
[2016-04-26 06:08:47.439960] I [MSGID: 106478] [glusterd.c:1337:init] 0-management: Maximum allowed open file descriptors set to 65536
[2016-04-26 06:08:47.440044] I [MSGID: 106479] [glusterd.c:1386:init] 0-management: Using /var/lib/glusterd as working directory
[2016-04-26 06:08:47.453605] W [MSGID: 103071] [rdma.c:4594:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2016-04-26 06:08:47.453658] W [MSGID: 103055] [rdma.c:4901:init] 0-rdma.management: Failed to initialize IB Device
[2016-04-26 06:08:47.453677] W [rpc-transport.c:359:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2016-04-26 06:08:47.453885] W [rpcsvc.c:1597:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2016-04-26 06:08:47.453924] E [MSGID: 106243] [glusterd.c:1610:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2016-04-26 06:08:52.512606] I [MSGID: 106513] [glusterd-store.c:2065:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30712
[2016-04-26 06:08:53.671078] I [MSGID: 106544] [glusterd.c:159:glusterd_uuid_init] 0-management: retrieved UUID: eac322e5-ef82-47db-b88b-2449c0164482
[2016-04-26 06:08:53.671466] C [MSGID: 106425] [glusterd-store.c:2434:glusterd_store_retrieve_bricks] 0-management: realpath() failed for brick /bricks/brick1/br1. The underlying file system may be in bad state [Input/output error]
[2016-04-26 06:08:53.671847] E [MSGID: 106201] [glusterd-store.c:3092:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: Dis
[2016-04-26 06:08:53.671888] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-04-26 06:08:53.671900] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-04-26 06:08:53.671907] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-04-26 06:08:53.672475] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fe6e9e2b2ad] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fe6e9e2b150] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fe6e9e2a739] ) 0-: received signum (0), shutting down
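
The critical line above is the realpath() failure: glusterd reconstructed each brick's canonical path on every startup, and on a crashed filesystem that path lookup fails with EIO, which aborts initialization of the whole management translator. A standalone illustration of the failing call (not glusterd code; the brick path is just an example):

    /* realpath_check.c - show how realpath() behaves on a dead mount.
       On a crashed/shut-down filesystem the lookup itself fails, so
       realpath() returns NULL with errno set to EIO, the same
       "Input/output error" seen in the glusterd log above. */
    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const char *brick = (argc > 1) ? argv[1] : "/bricks/brick1/br1";
        char resolved[PATH_MAX];

        if (realpath(brick, resolved) == NULL) {
            fprintf(stderr, "realpath(%s): %s\n", brick, strerror(errno));
            return 1;
        }
        printf("%s -> %s\n", brick, resolved);
        return 0;
    }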

Comment 1 Vijay Bellur 2016-04-26 10:26:29 UTC
REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a underlying file system crash) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-04-26 17:27:58 UTC
REVIEW: http://review.gluster.org/14075 (glusterd: glusterd should restart on a underlying file system crash) posted (#2) for review on master by Atin Mukherjee (amukherj)

Comment 3 Vijay Bellur 2016-04-29 06:47:44 UTC
REVIEW: http://review.gluster.org/14075 (glusterd: persist brickinfo->real_path) posted (#3) for review on master by Atin Mukherjee (amukherj)

Comment 4 Vijay Bellur 2016-04-29 16:17:26 UTC
COMMIT: http://review.gluster.org/14075 committed in master by Jeff Darcy (jdarcy) 
------
commit f0fb05d2cefae08c143f2bfdef151084f5ddb498
Author: Atin Mukherjee <amukherj>
Date:   Tue Apr 26 15:27:43 2016 +0530

    glusterd: persist brickinfo->real_path
    
    Since real_path was not persisted and gets constructed at every glusterd
    restart, glusterd will fail to come up if one of the brick's underlying file
    system is crashed.
    
    Solution is to construct real_path only once and get it persisted.
    
    Change-Id: I97abc30372c1ffbbb2d43b716d7af09172147b47
    BUG: 1330481
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14075
    CentOS-regression: Gluster Build System <jenkins.com>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    NetBSD-regression: NetBSD Build System <jenkins.org>
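
In outline, the fix moves the realpath() call from restore time to create time: resolve the brick path once while its filesystem is known to be healthy, persist the result, and read it back on restart. A hypothetical simplification of that idea (the structure and function names below are illustrative, not the actual glusterd code):

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct brickinfo {
        char path[PATH_MAX];      /* brick path as configured by the user */
        char real_path[PATH_MAX]; /* canonical path, resolved exactly once */
    };

    /* Create time: the brick's filesystem must be healthy here, so the
       realpath() call is safe; the caller persists real_path to disk. */
    int brickinfo_new(struct brickinfo *b, const char *path)
    {
        snprintf(b->path, sizeof(b->path), "%s", path);
        if (realpath(path, b->real_path) == NULL)
            return -1; /* a bad path at create time is a genuine error */
        return 0;
    }

    /* Restore time (glusterd restart): read the persisted value back
       instead of calling realpath(), so a crashed brick filesystem can
       no longer abort the restore of the volume. */
    void brickinfo_restore(struct brickinfo *b, const char *path,
                           const char *stored_real_path)
    {
        snprintf(b->path, sizeof(b->path), "%s", path);
        snprintf(b->real_path, sizeof(b->real_path), "%s", stored_real_path);
    }

    int main(void)
    {
        struct brickinfo b;
        if (brickinfo_new(&b, "/bricks/brick1/br1") == 0)
            printf("persist real_path=%s\n", b.real_path);
        return 0;
    }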

Comment 5 Niels de Vos 2016-06-16 14:04:21 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

