Bug 969150

Summary: Files go missing on mount point
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterfs
Reporter: Sachidananda Urs <surs>
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
QA Contact: Sudhir D <sdharane>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: high
Version: 2.0
CC: aavati, pkarampu, rhinduja, rhs-bugs, sdharane, surs, vagarwal, vbellur
Keywords: ZStream
Flags: rhinduja: needinfo+
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-01-17 11:44:37 UTC
Bug Depends On: 813137, 970111
Attachments: Client logs, sosreports

Description Sachidananda Urs 2013-05-30 18:59:00 UTC
Description of problem:

When one of the nodes of a distributed-replicate volume crashes, files go missing on the mount point.

SETUP:

[root@tex ~]# gluster volume info
 
Volume Name: intu
Type: Distributed-Replicate
Volume ID: 0f281edf-05e3-455b-97d1-522d9fcda36b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: tex.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick2: mater.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick3: van.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick4: wingo.lab.eng.blr.redhat.com:/rhs/brick1/int
[root@tex ~]# 
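
For reference, a 2 x 2 distributed-replicate volume like the one above would typically be created along these lines (a sketch based on the brick paths in the output, not the reporter's actual commands):

# Bricks are paired in the order given: 1-2 form one replica pair, 3-4 the other.
gluster volume create intu replica 2 \
    tex.lab.eng.blr.redhat.com:/rhs/brick1/int \
    mater.lab.eng.blr.redhat.com:/rhs/brick1/int \
    van.lab.eng.blr.redhat.com:/rhs/brick1/int \
    wingo.lab.eng.blr.redhat.com:/rhs/brick1/int
gluster volume start intu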

[root@tex ~]# gluster volume status
Status of volume: intu
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick tex.lab.eng.blr.redhat.com:/rhs/brick1/int        24011   Y       2695
Brick mater.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       32026
Brick van.lab.eng.blr.redhat.com:/rhs/brick1/int        24012   Y       1555
Brick wingo.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       18468
NFS Server on localhost                                 38467   Y       2701
Self-heal Daemon on localhost                           N/A     Y       2706
NFS Server on van.lab.eng.blr.redhat.com                38467   Y       1561
Self-heal Daemon on van.lab.eng.blr.redhat.com          N/A     Y       1567
NFS Server on wingo.lab.eng.blr.redhat.com              38467   Y       18473
Self-heal Daemon on wingo.lab.eng.blr.redhat.com        N/A     Y       18478
NFS Server on mater.lab.eng.blr.redhat.com              38467   Y       32031
Self-heal Daemon on mater.lab.eng.blr.redhat.com        N/A     Y       32038


The node van crashes and comes back up a few times, and files go missing on the mount point. The files residing on tex and mater are not visible on the mount point.


Version-Release number of selected component (if applicable):
glusterfs 3.3.0.10rhs built on May 29 2013 05:38:09


How reproducible:

Takes a long time to reproduce, and may not reproduce every time.

Steps to Reproduce:
1. Generate heavy IO on the client.
2. Crash one of the nodes and bring it back up at intervals while the IO runs (a minimal sketch follows below).
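
A rough shell sketch of this reproduction flow, assuming a FUSE mount of the volume at /mnt/intu and a kernel source tarball on the client; the workload, paths, and crash mechanism here are assumptions, not the reporter's exact commands:

# On the client: sustained heavy IO via parallel kernel-source untars (assumed workload).
cd /mnt/intu                                  # assumed FUSE mount point of 'intu'
for i in $(seq 1 10); do
    ( mkdir -p "run$i" && tar -xJf /root/linux-3.9.4.tar.xz -C "run$i" ) &
done

# On one of the servers (e.g. van), repeated a few times while the IO runs:
# hard-reset the node, let it boot, and wait for glusterd to bring the brick back online.
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger                  # immediate reboot, simulating a crash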

Comment 1 Sachidananda Urs 2013-05-30 19:00:35 UTC
Created attachment 755016 [details]
Client logs

Comment 3 Sachidananda Urs 2013-05-30 19:19:17 UTC
After a while I can see the files on the mount again. But in the meantime the IO on the mount is disrupted and the applications fail.

Comment 4 Sachidananda Urs 2013-05-30 19:20:12 UTC
Some of the application errors:

tar: linux-3.9.4/arch/m32r/include/asm/spinlock_types.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/string.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mmu.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mutex.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/switch_to.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/page.h: Cannot open: No such file or directory
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Comment 5 Sachidananda Urs 2013-05-30 19:40:44 UTC
Created attachment 755033 [details]
sosreports

The sosreports are from two of the servers. On the other two servers, sosreport is taking very long to complete; I will attach those as soon as they are done.

Comment 6 Vijay Bellur 2013-05-30 19:47:16 UTC
Hi Sac,

Do you have the timestamp on the client node around which the files went missing? That would help in debugging the problem.

I notice these logs: 

[2013-05-30 21:48:22.369006] I [client.c:2098:client_rpc_notify] 0-intu-client-1: disconnected
[2013-05-30 21:48:22.369034] E [afr-common.c:3650:afr_notify] 0-intu-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-05-30 21:48:22.568215] I [afr-common.c:3771:afr_local_init] 0-intu-replicate-0: no subvolumes up

This seems to indicate that both intu-client-0 (tex) and intu-client-1 (mater) were down from replicate's perspective, which would explain why files were not seen on the mount. However, we need to confirm whether this timestamp matches your observation.
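
To narrow down that window, the FUSE client log can be grepped for the disconnect and subvolume-down messages quoted above (a sketch; the exact log file name depends on the mount point and is an assumption here):

# Assumed client log for a FUSE mount at /mnt/intu; adjust to the actual mount log.
grep -E "disconnected|All subvolumes are down|no subvolumes up" \
    /var/log/glusterfs/mnt-intu.log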

Comment 7 Sachidananda Urs 2013-05-30 19:55:11 UTC
Vijay,

I am not sure around what time the files went missing, but I started seeing them again around 2013-05-30 00:15 or so (not exact). However, the server was never rebooted and the gluster daemons were not restarted, so I am quite clueless as to why the subvolumes went down intermittently...

I'm looking around in the servers to see if I can find anything, will keep you posted.

Comment 8 Pranith Kumar K 2013-05-31 12:27:41 UTC
According to the info gathered from Sac, one subvolume of dht went down, because of which ls gives a partial listing of the directory entries. Untar of the file failed because the file is located on the subvolume that went down. According to the logs, the dht subvolume went down because of ping-timer expiry.

[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.
[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.

To identify the root cause we need to figure out why the brick was not able to respond to pings from the mount. We are going to simulate the tests with large IO and see if we can re-create such a scenario. I will update you with my results if I can re-create it.
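
For context, these disconnects come from the client-side ping timer, a per-volume option that defaults to 42 seconds; it can be raised and then inspected (a sketch; whether tuning it is appropriate here is a separate question, and the value below is only an example):

gluster volume set intu network.ping-timeout 100   # example value, not a recommendation
gluster volume info intu                           # shows it under "Options Reconfigured"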

Comment 9 Pranith Kumar K 2013-06-03 03:59:18 UTC
I tried re-creating this issue on my VMs: a 3x2 configuration with 3 mounts, all of them doing 10 parallel untars each in a while loop. This did not give any ping timeouts :-(.

Comment 10 Pranith Kumar K 2013-06-03 04:32:05 UTC
As the issue is proving tricky to re-create, I need help from QE in recreating it. Could you provide the exact steps to re-create the issue?
Meanwhile I am continuing my runs with a plain replicate volume, running 40 untars in parallel on a single mount point with one brick down, to try to recreate the issue.
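
A minimal sketch of that kind of run, assuming a plain replicate volume mounted at /mnt/repl and a kernel tarball on the client; the paths and counts are illustrative, not the exact test harness:

# Bring one brick down first, e.g. by killing its glusterfsd process on the server
# (pid visible in `gluster volume status`), then drive 40 parallel untar loops:
cd /mnt/repl                                   # assumed mount point of the test volume
for i in $(seq 1 40); do
    ( while true; do
          rm -rf "dir$i" && mkdir "dir$i"
          tar -xJf /root/linux-3.9.4.tar.xz -C "dir$i"
      done ) &
done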

Comment 11 Sudhir D 2013-06-03 06:03:51 UTC
Pranith, can you take a look at https://bugzilla.redhat.com/show_bug.cgi?id=969020? It seems related. Let me know your findings.

Comment 12 Pranith Kumar K 2013-06-03 06:31:31 UTC
Bug 969020 seems to result in permanent data loss because of rebalance/renames, whereas this bug results in temporary data loss because of the non-availability of one of the dht subvolumes. The subvolume was unavailable because its bricks got disconnected even though no bricks were explicitly brought down. We need to figure out why those disconnects happened. Any steps to re-create such a scenario would help us.

Comment 14 Pranith Kumar K 2013-06-05 05:57:30 UTC
Brian,
    Could you move this bug to MODIFIED once the xfs patch is backported? Assigning the bug to you for now.

Pranith

Comment 16 Pranith Kumar K 2014-01-16 08:58:26 UTC
The bug was in xfs. Since bug 813137 is fixed, marking this ON_QA.

Comment 17 Vivek Agarwal 2014-01-17 11:44:37 UTC
Based on comment 16, closing this bug as it has been fixed in the current release of xfs.