Bug 1324014

Summary: glusterd: glusterd didn't come up after node reboot, error "realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]"
Product: [Community] GlusterFS Reporter: Atin Mukherjee <amukherj>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.7.10 CC: ashah, bugs, storage-qa-internal
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.11 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1322772 Environment:
Last Closed: 2016-04-19 07:13:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1322765, 1322772    
Bug Blocks:    

Description Atin Mukherjee 2016-04-05 10:47:50 UTC
+++ This bug was initially created as a clone of Bug #1322772 +++

+++ This bug was initially created as a clone of Bug #1322765 +++

Description of problem:

After a node reboot, glusterd didn't come up.
Error: "realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]"

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-1.el7rhgs.x86_64


How reproducible:

100%

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume
2. Enable USS
3. Create a snapshot and activate it
4. Reboot one of the nodes

Actual results:

glusterd is down after node reboot

Expected results:

After node reboot, glusterd should come up

Additional info:

[root@dhcp46-4 ~]# gluster v info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60769503-f742-458d-97c0-8e090147f82a
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick2/b2
Brick3: 10.70.46.213:/rhs/brick3/b3
Brick4: 10.70.46.148:/rhs/brick4/b4
Options Reconfigured:
performance.readdir-ahead: on
features.uss: enable
features.barrier: disable
snap-activate-on-create: enable


================================
glusterd logs from the node that was rebooted

[2016-03-31 12:03:00.551394] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]

[2016-03-31 12:19:44.102994] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-03-31 12:19:44.106631] W [socket.c:870:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 15, Invalid argument
[2016-03-31 12:19:44.106676] E [socket.c:2966:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106498] [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 2 times between [2016-03-31 12:19:44.085669] and [2016-03-31 12:19:44.086167]
[2016-03-31 12:19:44.114223] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]
[2016-03-31 12:19:44.114364] E [MSGID: 106201] [glusterd-store.c:3082:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: 130949baac8843cda443cf8a6441157f
[2016-03-31 12:19:44.114387] E [MSGID: 106195] [glusterd-store.c:3475:glusterd_store_retrieve_snap] 0-management: Failed to retrieve snap volumes for snap snap1
[2016-03-31 12:19:44.114399] E [MSGID: 106043] [glusterd-store.c:3629:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap1
[2016-03-31 12:19:44.114509] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-31 12:19:44.114542] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-31 12:19:44.114554] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-31 12:19:44.115626] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fc632a1b2ad] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fc632a1b150] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fc632a1a739] ) 0-: received signum (0), shutting down

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-03-31 05:46:54 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs‑3.1.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from RHEL Product and Program Management on 2016-03-31 06:02:19 EDT ---

This bug report has Keywords: Regression or TestBlocker.

Since no regressions or test blockers are allowed between releases,
it is also being identified as a blocker for this release.

Please resolve ASAP.

--- Additional comment from Vijay Bellur on 2016-03-31 06:18:46 EDT ---

REVIEW: http://review.gluster.org/13869 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-01 06:16:10 EDT ---

REVIEW: http://review.gluster.org/13869 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-04 00:53:45 EDT ---

REVIEW: http://review.gluster.org/13869 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-04 03:07:51 EDT ---

REVIEW: http://review.gluster.org/13869 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#4) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-04 11:19:52 EDT ---

REVIEW: http://review.gluster.org/13869 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#5) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-04-05 06:46:51 EDT ---

COMMIT: http://review.gluster.org/13869 committed in master by Atin Mukherjee (amukherj) 
------
commit d3c77459593255ed2c88094c8477b8a0c9ff9073
Author: Atin Mukherjee <amukherj>
Date:   Thu Mar 31 14:58:02 2016 +0530

    glusterd: build realpath post recreate of brick mount for snapshot
    
    Commit a60c39d introduced a new field called real_path in brickinfo to hold the
    realpath() conversion. However at restore path for all snapshots and snapshot
    restored volumes the brickpath gets recreated post restoration of bricks  which
    means the realpath () call will fail here for all the snapshots and cloned
    volumes.
    
    Fix is to store the realpath for snapshots and clones post recreating the brick
    mounts. For normal volume it would be done during retrieving the brick details
    from the store.
    
    Change-Id: Ia34853acddb28bcb7f0f70ca85fabcf73276ef13
    BUG: 1322772
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13869
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Avra Sengupta <asengupt>
    Reviewed-by: Rajesh Joseph <rjoseph>
    Smoke: Gluster Build System <jenkins.com>
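
The ordering problem described in the commit message can be illustrated outside glusterd. The following is a minimal sketch, not the actual patch: resolve_brick_path() and demo_restore_order() are hypothetical names, and a temporary directory created with mkdir() stands in for the recreated snapshot brick mount. It only shows that the libc realpath() call fails with ENOENT while the brick path does not exist yet, and succeeds once the path has been created.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not a glusterd function): resolve `path` into `out`,
 * which must hold at least PATH_MAX bytes. Returns 0 on success, or the
 * errno value set by realpath() on failure. */
int resolve_brick_path(const char *path, char *out)
{
    errno = 0;
    if (realpath(path, out) == NULL)
        return errno;
    return 0;
}

/* Hypothetical demo: call realpath() before and after the brick directory
 * exists. Returns 0 if the early call failed with ENOENT and the late call
 * succeeded, -1 otherwise. */
int demo_restore_order(void)
{
    char base[] = "/tmp/snap-sketch-XXXXXX";
    char brick[PATH_MAX];
    char resolved[PATH_MAX];
    int early, late, n;

    if (mkdtemp(base) == NULL)
        return -1;

    n = snprintf(brick, sizeof(brick), "%s/b3", base);
    assert(n > 0 && (size_t)n < sizeof(brick)); /* path was not truncated */

    /* Before the mount is recreated the brick path does not exist,
     * so realpath() is expected to fail with ENOENT. */
    early = resolve_brick_path(brick, resolved);

    /* Stand-in for recreating the snapshot brick mount. */
    if (mkdir(brick, 0700) != 0) {
        rmdir(base);
        return -1;
    }

    /* Once the path exists, the same call is expected to succeed. */
    late = resolve_brick_path(brick, resolved);

    rmdir(brick);
    rmdir(base);

    return (early == ENOENT && late == 0) ? 0 : -1;
}
```

Calling demo_restore_order() from a small main() should return 0 on a Linux system with a writable /tmp. The early failure corresponds to the "[No such file or directory]" suffix in the MSGID 106425 log line above.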

Comment 1 Vijay Bellur 2016-04-05 10:48:58 UTC
REVIEW: http://review.gluster.org/13905 (glusterd: build realpath post recreate of brick mount for snapshot) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-04-06 04:02:20 UTC
COMMIT: http://review.gluster.org/13905 committed in release-3.7 by Vijay Bellur (vbellur) 
------
commit 6bcae5cc8081697eca0ac72631e31327e1a786a9
Author: Atin Mukherjee <amukherj>
Date:   Thu Mar 31 14:58:02 2016 +0530

    glusterd: build realpath post recreate of brick mount for snapshot
    
    Backport of http://review.gluster.org/#/c/13869
    
    Commit a60c39d introduced a new field called real_path in brickinfo to hold the
    realpath() conversion. However at restore path for all snapshots and snapshot
    restored volumes the brickpath gets recreated post restoration of bricks  which
    means the realpath () call will fail here for all the snapshots and cloned
    volumes.
    
    Fix is to store the realpath for snapshots and clones post recreating the brick
    mounts. For normal volume it would be done during retrieving the brick details
    from the store.
    
    Change-Id: Ia34853acddb28bcb7f0f70ca85fabcf73276ef13
    BUG: 1324014
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/13869
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Avra Sengupta <asengupt>
    Reviewed-by: Rajesh Joseph <rjoseph>
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-on: http://review.gluster.org/13905
    Reviewed-by: Vijay Bellur <vbellur>

Comment 3 Kaushal 2016-04-19 07:13:10 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.11, please open a new bug report.

glusterfs-3.7.11 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-April/026321.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user