Bug 1322765

Summary: glusterd: glusterd didn't come up after node reboot with the error "realpath() failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]"
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Version: rhgs-3.1
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.7.9-2
Doc Type: Bug Fix
Reporter: Anil Shah <ashah>
Assignee: Atin Mukherjee <amukherj>
QA Contact: Anil Shah <ashah>
CC: asrivast, rhs-bugs, storage-qa-internal, vbellur
Keywords: Regression, ZStream
Clones: 1322772 (view as bug list)
Type: Bug
Last Closed: 2016-06-23 05:15:04 UTC
Bug Blocks: 1289439, 1311817, 1322772, 1324014

Description Anil Shah 2016-03-31 09:46:51 UTC
Description of problem:

After a node reboot, glusterd didn't come up. The glusterd log shows the error:
"realpath() failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]"

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-1.el7rhgs.x86_64


How reproducible:

100%

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume
2. Enable USS (features.uss)
3. Create a snapshot and activate it
4. Reboot one of the nodes (see the command sketch below)
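
A hedged sketch of the CLI sequence for these steps, using the volume name and brick paths from the volume info below; it assumes each brick sits on a thinly provisioned LVM volume, which gluster snapshots require:

# create and start the 2x2 distributed-replicate volume
gluster volume create testvol replica 2 \
    10.70.46.4:/rhs/brick1/b1 10.70.47.46:/rhs/brick2/b2 \
    10.70.46.213:/rhs/brick3/b3 10.70.46.148:/rhs/brick4/b4
gluster volume start testvol
# enable USS and snapshot auto-activation, then take a snapshot
gluster volume set testvol features.uss enable
gluster snapshot config activate-on-create enable
gluster snapshot create snap1 testvol
# reboot any one of the four storage nodes
reboot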

Actual results:

glusterd is down after the node reboot.

Expected results:

After the node reboot, glusterd should come up.

Additional info:

[root@dhcp46-4 ~]# gluster v info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 60769503-f742-458d-97c0-8e090147f82a
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick2/b2
Brick3: 10.70.46.213:/rhs/brick3/b3
Brick4: 10.70.46.148:/rhs/brick4/b4
Options Reconfigured:
performance.readdir-ahead: on
features.uss: enable
features.barrier: disable
snap-activate-on-create: enable


================================
glusterd logs from the node that was rebooted

[2016-03-31 12:03:00.551394] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]

[2016-03-31 12:19:44.102994] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-03-31 12:19:44.106631] W [socket.c:870:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 15, Invalid argument
[2016-03-31 12:19:44.106676] E [socket.c:2966:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106498] [glusterd-handler.c:3640:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0" repeated 2 times between [2016-03-31 12:19:44.085669] and [2016-03-31 12:19:44.086167]
[2016-03-31 12:19:44.114223] C [MSGID: 106425] [glusterd-store.c:2425:glusterd_store_retrieve_bricks] 0-management: realpath () failed for brick /run/gluster/snaps/130949baac8843cda443cf8a6441157f/brick3/b3. The underlying file system may be in bad state [No such file or directory]
[2016-03-31 12:19:44.114364] E [MSGID: 106201] [glusterd-store.c:3082:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: 130949baac8843cda443cf8a6441157f
[2016-03-31 12:19:44.114387] E [MSGID: 106195] [glusterd-store.c:3475:glusterd_store_retrieve_snap] 0-management: Failed to retrieve snap volumes for snap snap1
[2016-03-31 12:19:44.114399] E [MSGID: 106043] [glusterd-store.c:3629:glusterd_store_retrieve_snaps] 0-management: Unable to restore snapshot: snap1
[2016-03-31 12:19:44.114509] E [MSGID: 101019] [xlator.c:433:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-31 12:19:44.114542] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-31 12:19:44.114554] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-31 12:19:44.115626] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fc632a1b2ad] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x120) [0x7fc632a1b150] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fc632a1a739] ) 0-: received signum (0), shutting down
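
For reference, a hedged sketch of how the failure can be confirmed on the rebooted node (log file name as used by glusterfs 3.7; paths may differ on other builds):

# glusterd fails to start because the management translator init failed
systemctl status glusterd
# the critical realpath() message and the init failure are in the glusterd log
grep -E 'realpath|graph_init|xlator_init' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail
# no snapshot brick mounts are present under /run/gluster/snaps after the reboot
mount | grep /run/gluster/snaps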

Comment 3 Atin Mukherjee 2016-03-31 10:21:54 UTC
An upstream patch, http://review.gluster.org/#/c/13869/1, has been posted for review; it also explains the RCA.

Comment 5 Atin Mukherjee 2016-04-06 05:28:59 UTC
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/71478/ posted for review.

Comment 6 Atin Mukherjee 2016-04-06 05:57:38 UTC
Downstream patch is merged now. Moving the status to Modified.

Comment 8 Anil Shah 2016-04-25 08:19:24 UTC
glusterd is running after a node reboot when snapshots of the volume are present.

Bug verified on build glusterfs-3.7.9-2.el7rhgs.x86_64
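
A hedged sketch of the verification flow on the fixed build (command names only; exact output omitted):

# on the rebooted node, after upgrading to glusterfs-3.7.9-2
systemctl is-active glusterd
# peers, bricks, and the activated snapshot should be restored
gluster peer status
gluster snapshot list
gluster volume status testvol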

Comment 10 errata-xmlrpc 2016-06-23 05:15:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240