Description of problem:
After node reboot, glusterd didn't come up. Error:
"realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]"

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 2 x 2 distributed-replicate volume
2. Enable uss
3. Create a snapshot and activate it
4. Reboot one of the nodes

Actual results:
glusterd is down after the node reboot

Expected results:
glusterd should come up after the node reboot

Additional info:

[root@dhcp46-4 ~]# gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60769503-f742-458d-97c0-8e090147f82a
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick2/b2
Brick3: 10.70.46.213:/rhs/brick3/b3
Brick4: 10.70.46.148:/rhs/brick4/b4
Options Reconfigured:
performance.readdir-ahead: on
features.uss: enable
features.barrier: disable
snap-activate-on-create: enable

================================
glusterd logs from the node which was rebooted:

[2016-03-31 12:03:00.551394] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]
[2016-03-31 12:19:44.102994] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-03-31 12:19:44.106631] W [socket.c:870:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 15, Invalid argument
[2016-03-31 12:19:44.106676] E [socket.c:2966:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106498] [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 2 times between [2016-03-31 12:19:44.085669] and [2016-03-31 12:19:44.086167]
[2016-03-31 12:19:44.114223] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]
[2016-03-31 12:19:44.114364] E [MSGID: 106201] [glusterd-store.c:3082:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: 130949baac8843cda443cf8a6441157f
[2016-03-31 12:19:44.114387] E [MSGID: 106195] [glusterd-store.c:3475:glusterd_store_retrieve_snap] 0-management: Failed to retrieve snap volumes for snap snap1
[2016-03-31 12:19:44.114399] E [MSGID: 106043] [glusterd-store.c:3629:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap1
[2016-03-31 12:19:44.114509] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-31 12:19:44.114542] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-31 12:19:44.114554] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-31 12:19:44.115626] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fc632a1b2ad] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fc632a1b150] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fc632a1a739] ) 0-: received signum (0), shutting down
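For reference, the reproduction steps roughly correspond to the CLI sequence below (host names, brick paths and the snapshot name are taken from the vol info and log excerpt above; the exact create/config commands are an assumption about how the setup was done):

# create and start a 2 x 2 distributed-replicate volume (assumed layout from vol info)
gluster volume create testvol replica 2 \
    10.70.46.4:/rhs/brick1/b1 10.70.47.46:/rhs/brick2/b2 \
    10.70.46.213:/rhs/brick3/b3 10.70.46.148:/rhs/brick4/b4
gluster volume start testvol

# enable USS and activate-on-create (matches the reconfigured options above)
gluster volume set testvol features.uss enable
gluster snapshot config activate-on-create enable

# take a snapshot (activated on create) and reboot one of the nodes
gluster snapshot create snap1 testvol
reboot

What the log excerpt suggests: the snapshot bricks live under /run/gluster/snaps/<snap-volume-id>/..., and /run is a tmpfs, so those mount points do not survive a reboot. During glusterd init on the rebooted node, glusterd_store_retrieve_bricks() then calls realpath() on the stored snap brick path before the snapshot bricks have been remounted, gets "No such file or directory", and initialization of the management volume aborts. A quick illustrative check on the rebooted node:

# /run is a tmpfs, so snapshot brick mounts are gone after reboot
findmnt -n -o FSTYPE /run            # prints: tmpfs

# the stored snap brick path no longer resolves, matching the MSGID 106425 error
realpath /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3
# realpath: ...: No such file or directory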
An upstream patch, http://review.gluster.org/#/c/13869/1, which explains the RCA, has been posted for review.
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/71478/ posted for review.
Downstream patch is merged now. Moving the status to Modified.
glusterd is running after node reboot when snapshots of the volume are present. Bug verified on build glusterfs-3.7.9-2.el7rhgs.x86_64.
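For anyone re-checking this, the verification on the rebooted node amounts to something like the following (illustrative commands; snapshot and volume names taken from the description above):

# glusterd must come up on its own after the reboot
systemctl is-active glusterd         # expected: active

# the snapshot volume should be restored from the glusterd store
gluster snapshot list                # expected to list snap1
gluster volume status testvol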
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240