969150 – Files go missing on mount point

Summary: Files go missing on mount point

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterfs
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Bug Updates Notification Mailing List
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:	813137 970111
Blocks:

Reported:	2013-05-30 18:59 UTC by Sachidananda Urs
Modified:	2014-01-17 11:44 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-01-17 11:44:37 UTC
Embargoed:
Dependent Products:
Flags:	rhinduja: needinfo+

Attachments
Client logs (2.21 MB, application/x-xz) 2013-05-30 19:00 UTC, Sachidananda Urs	no flags
sosreports (4.53 MB, application/x-tar) 2013-05-30 19:40 UTC, Sachidananda Urs	no flags

Description Sachidananda Urs 2013-05-30 18:59:00 UTC

Description of problem:

When one of the nodes crashes on a distributed-replicate volume, files go missing on the mount point.

SETUP:

[root@tex ~]# gluster volume info
 
Volume Name: intu
Type: Distributed-Replicate
Volume ID: 0f281edf-05e3-455b-97d1-522d9fcda36b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: tex.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick2: mater.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick3: van.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick4: wingo.lab.eng.blr.redhat.com:/rhs/brick1/int
[root@tex ~]# 

[root@tex ~]# gluster volume status
Status of volume: intu
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick tex.lab.eng.blr.redhat.com:/rhs/brick1/int        24011   Y       2695
Brick mater.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       32026
Brick van.lab.eng.blr.redhat.com:/rhs/brick1/int        24012   Y       1555
Brick wingo.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       18468
NFS Server on localhost                                 38467   Y       2701
Self-heal Daemon on localhost                           N/A     Y       2706
NFS Server on van.lab.eng.blr.redhat.com                38467   Y       1561
Self-heal Daemon on van.lab.eng.blr.redhat.com          N/A     Y       1567
NFS Server on wingo.lab.eng.blr.redhat.com              38467   Y       18473
Self-heal Daemon on wingo.lab.eng.blr.redhat.com        N/A     Y       18478
NFS Server on mater.lab.eng.blr.redhat.com              38467   Y       32031
Self-heal Daemon on mater.lab.eng.blr.redhat.com        N/A     Y       32038


The node van crashes and comes back up a few times, and the files on the mount point go missing. The files on tex and mater are not seen on the mount point.


Version-Release number of selected component (if applicable):
glusterfs 3.3.0.10rhs built on May 29 2013 05:38:09


How reproducible:

Intermittent. It takes a long time to reproduce and may not always occur.

Steps to Reproduce:
1. Generate heavy I/O on the client
2. Crash one of the nodes and bring it back up, repeating at intervals

Comment 1 Sachidananda Urs 2013-05-30 19:00:35 UTC

Created attachment 755016
Client logs

Comment 3 Sachidananda Urs 2013-05-30 19:19:17 UTC

After a while I can see them on the mount again. Meanwhile, however, the I/O on the mount is disrupted and the applications fail.

Comment 4 Sachidananda Urs 2013-05-30 19:20:12 UTC

Some of the application errors:

tar: linux-3.9.4/arch/m32r/include/asm/spinlock_types.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/string.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mmu.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mutex.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/switch_to.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/page.h: Cannot open: No such file or directory
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Comment 5 Sachidananda Urs 2013-05-30 19:40:44 UTC

Created attachment 755033
sosreports

The sosreports are for two of the servers. On the other two servers, sosreport is taking forever to complete. I will attach those as soon as they are done.

Comment 6 Vijay Bellur 2013-05-30 19:47:16 UTC

Hi Sac,

Do you have the timestamp on the client node around which the files went missing? That would be helpful for debugging the problem.

I notice these logs: 

[2013-05-30 21:48:22.369006] I [client.c:2098:client_rpc_notify] 0-intu-client-1: disconnected
[2013-05-30 21:48:22.369034] E [afr-common.c:3650:afr_notify] 0-intu-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-05-30 21:48:22.568215] I [afr-common.c:3771:afr_local_init] 0-intu-replicate-0: no subvolumes up

This seems to indicate that both intu-client-0 (tex) and intu-client-1 (mater) were down from replicate's perspective. This might explain why files were not being seen on the mount. However, we need to confirm that this timestamp matches your observation.
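
As a rough illustration of why losing both intu-client-0 and intu-client-1 hides the files on tex and mater, the sketch below groups the bricks from the volume info above into replica sets in listing order. The helper names (replica_sets, offline_sets) are hypothetical and not part of glusterfs; the client-to-brick numbering is assumed from the mapping stated in this comment (client-0 is tex, client-1 is mater).

```python
# Brick order as listed by `gluster volume info` in the description.
BRICKS = [
    "tex.lab.eng.blr.redhat.com:/rhs/brick1/int",
    "mater.lab.eng.blr.redhat.com:/rhs/brick1/int",
    "van.lab.eng.blr.redhat.com:/rhs/brick1/int",
    "wingo.lab.eng.blr.redhat.com:/rhs/brick1/int",
]
REPLICA_COUNT = 2  # from "Number of Bricks: 2 x 2 = 4"


def replica_sets(bricks, replica_count, volume="intu"):
    """Group consecutive bricks into replica sets (hypothetical helper)."""
    sets = {}
    for index, brick in enumerate(bricks):
        name = f"{volume}-replicate-{index // replica_count}"
        sets.setdefault(name, []).append((f"{volume}-client-{index}", brick))
    return sets


def offline_sets(sets, down_clients):
    """Return the replica sets whose clients are all disconnected."""
    return [
        name for name, members in sets.items()
        if all(client in down_clients for client, _ in members)
    ]


if __name__ == "__main__":
    sets = replica_sets(BRICKS, REPLICA_COUNT)
    for name in offline_sets(sets, {"intu-client-0", "intu-client-1"}):
        print(name, "is offline:", [brick for _, brick in sets[name]])
```

With only one client of a pair down, the replica set stays reachable through its other brick, so a set is reported offline only when every member is disconnected.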

Comment 7 Sachidananda Urs 2013-05-30 19:55:11 UTC

Vijay,

I am not sure around what time the files went missing, but I started seeing them again around 2013-05-30 00:15 or so (not exact). However, the server was never rebooted, nor were the gluster daemons restarted. So I'm quite clueless as to why the subvolumes went down intermittently...

I'm looking around in the servers to see if I can find anything, will keep you posted.

Comment 8 Pranith Kumar K 2013-05-31 12:27:41 UTC

According to the information gathered from Sac, one subvolume of dht went down, which is why ls gives a partial listing of the directory entries. The untars failed because the files are located on the subvolume that went down. According to the logs, the dht subvolume went down because of ping timer expiry.

[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.
[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.

To identify the root cause, we need to figure out why the bricks were not able to respond to pings from the mount. We are going to simulate the tests with heavy I/O and see if we can re-create such a scenario. I will update you with my results if I can re-create it.
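
A minimal sketch of how the ping timer expiries could be pulled out of a client log for comparison against the time the files went missing. The regular expression is written only against the two log lines quoted in this comment, so the assumption is that other glusterfs versions use the same message format; the function name ping_timeouts is hypothetical.

```python
import re
from datetime import datetime

# Pattern for the ping-timer message quoted above; other messages are ignored.
PING_EXPIRED = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] C "
    r"\[[^\]]*rpc_client_ping_timer_expired\] "
    r"\d+-(?P<client>\S+): server (?P<server>\S+) has not responded "
    r"in the last (?P<secs>\d+) seconds"
)

SAMPLE = [
    "[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] "
    "0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.",
    "[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] "
    "0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.",
]


def ping_timeouts(lines):
    """Return (timestamp, client, server, seconds) for each ping timer expiry."""
    events = []
    for line in lines:
        match = PING_EXPIRED.match(line.strip())
        if match:
            events.append((
                datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
                match["client"],
                match["server"],
                int(match["secs"]),
            ))
    return events


if __name__ == "__main__":
    for when, client, server, secs in ping_timeouts(SAMPLE):
        print(f"{when}  {client} lost {server} after {secs}s without a reply")
```

To scan a real log, pass an open file object instead of SAMPLE, since the function only iterates over lines.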

Comment 9 Pranith Kumar K 2013-06-03 03:59:18 UTC

I tried re-creating this issue on my VMs, using a 3x2 configuration with 3 mounts, each of them doing 10 parallel untars in a while loop. This did not give any ping timeouts :-(.

Comment 10 Pranith Kumar K 2013-06-03 04:32:05 UTC

As the issue is proving tricky to re-create, I need help from QE in re-creating it. Could you provide the exact steps to re-create the issue?
I am continuing my runs, using plain replicate with 40 untars in parallel on a single mount point with one brick down, to re-create the issue.
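
For anyone trying to reproduce this, the parallel untar load described above could be sketched roughly as follows. This is an assumption about the shape of the workload, not the script actually used in these runs; the function names and the per-job directory layout are hypothetical, and Python's tarfile module stands in for the tar command.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def untar_once(tarball, dest):
    """Extract tarball into dest; return an error string, or None on success."""
    try:
        Path(dest).mkdir(parents=True, exist_ok=True)
        with tarfile.open(tarball) as archive:
            archive.extractall(dest)
        return None
    except (OSError, tarfile.TarError) as exc:
        # Errors such as "No such file or directory" are recorded, not raised,
        # so one failed job does not stop the rest of the load.
        return f"{dest}: {exc}"


def run_parallel_untars(tarball, mount_dir, workers=10, rounds=1):
    """Run `workers` extractions in parallel, `rounds` times; return all errors."""
    errors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for round_no in range(rounds):
            dests = [
                Path(mount_dir) / f"round{round_no}-job{job}"
                for job in range(workers)
            ]
            results = pool.map(lambda dest: untar_once(tarball, dest), dests)
            errors.extend(result for result in results if result is not None)
    return errors
```

Calling run_parallel_untars with a kernel tarball, the glusterfs mount point, workers=40 and a large rounds value would approximate the single-mount run; any returned error strings can then be matched against disconnect times in the client log.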

Comment 11 Sudhir D 2013-06-03 06:03:51 UTC

Pranith, can you take a look at https://bugzilla.redhat.com/show_bug.cgi?id=969020? Does it seem related? Let me know your findings.

Comment 12 Pranith Kumar K 2013-06-03 06:31:31 UTC

969020 seems to result in permanent data loss because of rebalance/renames, whereas this bug results in temporary data loss because of the non-availability of one of the dht subvolumes. The subvolume is not available because the bricks got disconnected even though no bricks were explicitly brought down. We need to figure out why there were disconnects. Any steps to re-create such a scenario would help us.

Comment 14 Pranith Kumar K 2013-06-05 05:57:30 UTC

Brian,
    Could you move this bug to MODIFIED once the xfs patch is backported? Assigning the bug to you for now.

Pranith

Comment 16 Pranith Kumar K 2014-01-16 08:58:26 UTC

The bug was in xfs. Since 813137 is fixed, marking this ON_QA.

Comment 17 Vivek Agarwal 2014-01-17 11:44:37 UTC

Based on comment 16, closing this bug as it has been fixed in the current release of xfs.
