Description of problem:
When one of the nodes crashes on a distributed-replicate volume, files go missing on the mount point.

SETUP:

[root@tex ~]# gluster volume info

Volume Name: intu
Type: Distributed-Replicate
Volume ID: 0f281edf-05e3-455b-97d1-522d9fcda36b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: tex.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick2: mater.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick3: van.lab.eng.blr.redhat.com:/rhs/brick1/int
Brick4: wingo.lab.eng.blr.redhat.com:/rhs/brick1/int
[root@tex ~]#

[root@tex ~]# gluster volume status
Status of volume: intu
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick tex.lab.eng.blr.redhat.com:/rhs/brick1/int        24011   Y       2695
Brick mater.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       32026
Brick van.lab.eng.blr.redhat.com:/rhs/brick1/int        24012   Y       1555
Brick wingo.lab.eng.blr.redhat.com:/rhs/brick1/int      24011   Y       18468
NFS Server on localhost                                 38467   Y       2701
Self-heal Daemon on localhost                           N/A     Y       2706
NFS Server on van.lab.eng.blr.redhat.com                38467   Y       1561
Self-heal Daemon on van.lab.eng.blr.redhat.com          N/A     Y       1567
NFS Server on wingo.lab.eng.blr.redhat.com              38467   Y       18473
Self-heal Daemon on wingo.lab.eng.blr.redhat.com        N/A     Y       18478
NFS Server on mater.lab.eng.blr.redhat.com              38467   Y       32031
Self-heal Daemon on mater.lab.eng.blr.redhat.com        N/A     Y       32038

The node van crashes and comes back up a few times, and the files on the mount point go missing. The files on tex and mater are not seen on the mount point.

Version-Release number of selected component (if applicable):
glusterfs 3.3.0.10rhs built on May 29 2013 05:38:09

How reproducible:
Takes a long time to reproduce, and may not happen every time.

Steps to Reproduce:
1. Generate heavy I/O on the client.
2. Destroy one of the nodes and bring it back up at intervals.
Created attachment 755016
Client logs
After a while I can see them on the mount again. But meanwhile the IO is disrupted on the mount and the applications fail.
Some of the application errors:

tar: linux-3.9.4/arch/m32r/include/asm/spinlock_types.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/string.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mmu.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mutex.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/switch_to.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/page.h: Cannot open: No such file or directory

tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now

tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Created attachment 755033
sosreports

The sosreport covers two servers. On the other two servers, sosreport is taking forever to complete. Will attach those as soon as they are done.
Hi Sac,

Do you have the timestamp on the client node around which the files went missing? That would be helpful in debugging the problem.

I notice these logs:

[2013-05-30 21:48:22.369006] I [client.c:2098:client_rpc_notify] 0-intu-client-1: disconnected
[2013-05-30 21:48:22.369034] E [afr-common.c:3650:afr_notify] 0-intu-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-05-30 21:48:22.568215] I [afr-common.c:3771:afr_local_init] 0-intu-replicate-0: no subvolumes up

This seems to indicate that both intu-client-0 (tex) and intu-client-1 (mater) were down from replicate's perspective. That might explain why files were not being seen on the mount. However, we need to check whether this timestamp matches your observation.
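The reading above (intu-client-0 = tex, intu-client-1 = mater, both in intu-replicate-0) can be sketched in Python. This is only an illustration, assuming that consecutive bricks in the volume info form a replica set and that client translators are numbered in brick order; the helper names replica_sets and offline_sets are hypothetical and not part of any gluster tool.

```python
# Short host names in the order the bricks are listed in `gluster volume info`.
BRICKS = ["tex", "mater", "van", "wingo"]


def replica_sets(bricks, replica_count):
    """Group bricks into replica sets.

    Assumption: consecutive bricks form one replica set, so a 2 x 2 volume
    yields replicate-0 = bricks 0,1 and replicate-1 = bricks 2,3.
    """
    if replica_count <= 0 or len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]


def offline_sets(bricks, replica_count, down):
    """Return the indices of replica sets in which every brick is down."""
    return [idx
            for idx, members in enumerate(replica_sets(bricks, replica_count))
            if all(host in down for host in members)]


if __name__ == "__main__":
    # Both members of replicate-0 disconnected: that dht subvolume is offline.
    print(offline_sets(BRICKS, 2, {"tex", "mater"}))
    # Only van down: wingo still serves replicate-1, so nothing goes offline.
    print(offline_sets(BRICKS, 2, {"van"}))
```

Under these assumptions, a crash of van alone would not take any replica set offline, which is why the disconnects of tex and mater are the interesting events here.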
Vijay,

I am not sure around what time the files went missing, but I can say that I started seeing them again around 2013-05-30 00:15 or so (not exact). However, the server was never rebooted, nor were the gluster daemons restarted, so I'm quite clueless as to why the subvolumes went down intermittently. I'm looking around on the servers to see if I can find anything and will keep you posted.
According to the info gathered from Sac, one subvolume of dht went down, because of which ls gives a partial listing of the directory entries. The untars failed because the files are located on the subvolume that went down. According to the logs, the dht subvolume went down because of ping timer expiry:

[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.
[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.

To identify the root cause, we need to figure out why the bricks were not able to respond to pings from the mount. We are going to simulate the tests with large I/O and see if we can re-create this scenario. I will update you with my results if I can re-create it.
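To line these disconnects up against the time the files went missing, the ping-timer entries can be pulled out of a client log with a small script. A rough sketch in Python follows; the regular expression is written against the two log lines quoted above and may need adjusting for other glusterfs versions, and find_ping_timeouts is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern modelled on the two "ping timer expired" lines quoted in this bug.
PING_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] C "
    r"\[client-handshake\.c:\d+:rpc_client_ping_timer_expired\] "
    r"(?P<xlator>\S+): server (?P<server>\S+) has not responded "
    r"in the last (?P<secs>\d+) seconds"
)

SAMPLE = [
    "[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] "
    "0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.",
    "[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] "
    "0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.",
    "[2013-05-30 21:48:22.369006] I [client.c:2098:client_rpc_notify] 0-intu-client-1: disconnected",
]


def find_ping_timeouts(lines):
    """Return (timestamp, translator, server, timeout_seconds) per matching line."""
    events = []
    for line in lines:
        match = PING_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("xlator"),
                           match.group("server"), int(match.group("secs"))))
    return events


if __name__ == "__main__":
    for ts, xlator, server, secs in find_ping_timeouts(SAMPLE):
        print(ts.isoformat(), xlator, server, secs)
```

Passing an open log file instead of SAMPLE works the same way, since the function only iterates over lines.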
I tried re-creating this issue on my VMs, using a 3x2 configuration with 3 mounts, each of them running 10 parallel untars in a while loop. This did not give any ping timeouts :-(.
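For anyone wanting to generate a similar load, here is a minimal sketch of "N parallel untars in a loop" in Python. It is not the script used for the runs above: it uses Python's tarfile module and threads instead of the tar command, and the names untar_once, run_round and run_load, the directory layout, and the default counts are all illustrative assumptions.

```python
import tarfile
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def untar_once(tarball, dest):
    """Extract tarball into dest; return None on success, else the error text."""
    try:
        with tarfile.open(tarball) as archive:
            archive.extractall(dest)
        return None
    except (OSError, tarfile.TarError) as exc:
        return str(exc)


def run_round(tarball, mount_dir, workers=10):
    """Run `workers` extractions in parallel, each into its own subdirectory."""
    dests = [Path(mount_dir) / f"untar-{i}" for i in range(workers)]
    for dest in dests:
        dest.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda dest: untar_once(tarball, dest), dests))


def run_load(tarball, mount_dir, workers=10, rounds=5):
    """Repeat run_round `rounds` times and collect every error that occurred."""
    errors = []
    for _ in range(rounds):
        errors.extend(err for err in run_round(tarball, mount_dir, workers) if err)
    return errors


if __name__ == "__main__":
    # Self-contained demo on a temporary directory; on a real test, mount_dir
    # would be a directory on the gluster mount and tarball a kernel tarball.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("payload")
        tarball = Path(tmp) / "sample.tar"
        with tarfile.open(tarball, "w") as archive:
            archive.add(sample, arcname="sample.txt")
        print(run_load(tarball, Path(tmp) / "mnt", workers=3, rounds=2))
```

Errors are collected rather than raised so that a run keeps going through a brick disconnect and the failures can be compared against the client log timestamps afterwards.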
As the issue is proving tricky to re-create, I need help from QE in re-creating it. Could you provide the exact steps to re-create the issue? I am continuing my runs to re-create it, now using plain replicate with 40 untars in parallel on a single mount point with one brick down.
Pranith, can you take a look at https://bugzilla.redhat.com/show_bug.cgi?id=969020? Does it seem related? Let me know your findings.
969020 seems to result in permanent data loss because of rebalance/renames, whereas this bug results in temporary data loss because one of the dht subvolumes is unavailable. The subvolume is unavailable because the bricks got disconnected even though no brick was explicitly brought down. We need to figure out why there were disconnects. Any steps to re-create such a scenario would help us.
Brian,

Could you move this bug to MODIFIED once the xfs patch is backported? Assigning the bug to you for now.

Pranith
Bug was in xfs. Since 813137 is fixed, marking this ON_QA.
Based on comment 16, closing this bug as it has been fixed in the current release of xfs.