Bug 969150
| Field | Value |
|---|---|
| Summary | Files go missing on mount point |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Sachidananda Urs <surs> |
| Component | glusterfs |
| Assignee | Bug Updates Notification Mailing List <rhs-bugs> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | Sudhir D <sdharane> |
| Severity | urgent |
| Priority | high |
| Version | 2.0 |
| CC | aavati, pkarampu, rhinduja, rhs-bugs, sdharane, surs, vagarwal, vbellur |
| Target Milestone | --- |
| Keywords | ZStream |
| Target Release | --- |
| Flags | rhinduja: needinfo+ |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
| Story Points | --- |
| Last Closed | 2014-01-17 11:44:37 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Depends On | 813137, 970111 |
Description (Sachidananda Urs, 2013-05-30 18:59:00 UTC)

Created attachment 755016 [details]: Client logs
After a while I can see the missing files on the mount again, but in the meantime I/O is disrupted on the mount and the applications fail. Some of the application errors:

tar: linux-3.9.4/arch/m32r/include/asm/spinlock_types.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/string.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mmu.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/mutex.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/switch_to.h: Cannot open: No such file or directory
tar: linux-3.9.4/arch/m32r/include/asm/page.h: Cannot open: No such file or directory
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now
tar: Skipping to next header
xz: (stdin): Compressed data is corrupt
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Created attachment 755033 [details]: sosreports

The sosreport covers two of the servers. On the other two servers, sosreport is taking forever to complete; I will attach those as soon as they are done.
Hi Sac, do you have the timestamp on the client node around which the files went missing? That would help debug the problem. I notice these logs:

[2013-05-30 21:48:22.369006] I [client.c:2098:client_rpc_notify] 0-intu-client-1: disconnected
[2013-05-30 21:48:22.369034] E [afr-common.c:3650:afr_notify] 0-intu-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-05-30 21:48:22.568215] I [afr-common.c:3771:afr_local_init] 0-intu-replicate-0: no subvolumes up

This seems to indicate that both intu-client-0 (tex) and intu-client-1 (mater) were down from replicate's perspective, which might explain why files were not being seen on the mount. However, I need to understand whether this timestamp matches your observation.

Vijay, I am not sure around what time the files went missing, but I can say that I started seeing them again around 2013-05-30 00:15 or so (not exact). The servers were never rebooted and the gluster daemons were not restarted, so I'm quite clueless as to why the subvolumes went down intermittently. I'm looking around on the servers to see if I can find anything and will keep you posted.

According to the information gathered from Sac, one subvolume of dht went down, because of which ls gives a partial listing of the directory entries. The untars failed because the affected files are located on the subvolume that went down. According to the logs, the dht subvolume went down because of ping timer expiry:

[2013-05-30 21:16:55.078146] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-0: server 10.70.34.132:24011 has not responded in the last 42 seconds, disconnecting.
[2013-05-30 21:48:21.665363] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-intu-client-1: server 10.70.34.103:24011 has not responded in the last 42 seconds, disconnecting.

To identify the root cause we need to figure out why the brick was not able to respond to pings from the mount. We are going to simulate the tests with large I/O and see if we can re-create such a scenario. Will update you with my results if I can re-create it.

I tried re-creating this issue on my VMs: a 3x2 configuration with 3 mounts, each doing 10 parallel untars in a while loop. This did not produce any ping timeouts :-(. As the issue is proving tricky to re-create, I need help from QE: could you provide exact steps to reproduce it? I am continuing my runs with a plain replicate volume, 40 untars in parallel on a single mount point with one brick down, to try to recreate the issue.

Pranith, can you take a look at https://bugzilla.redhat.com/show_bug.cgi?id=969020 ? It seems related; let me know your findings.

Bug 969020 seems to result in permanent data loss because of rebalance/renames, whereas this bug results in temporary data loss because of the non-availability of one of the dht subvolumes. The subvolume was unavailable because the bricks got disconnected even though no bricks were explicitly brought down. We need to figure out why there were disconnects; any steps to re-create such a scenario would help us.

Brian, could you move this bug to MODIFIED once the xfs patch is backported? Assigning the bug to you for now. Pranith

The bug was in xfs. Since bug 813137 is fixed, marking this ON_QA.

Based on comment 16, closing this bug as it has been fixed in the current release of xfs.
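For anyone revisiting this report, below is a minimal sketch of the parallel-untar load described in the thread above (several untar/remove loops running against a FUSE mount while watching the client log for rpc_client_ping_timer_expired disconnects). It is an assumption-laden illustration, not the exact commands used here: the mount path /mnt/intu, the tarball location, and the job count are made up; only the kernel tarball name (linux-3.9.4) and the volume name (intu) come from the report.

```sh
#!/bin/sh
# Hypothetical reproduction load, assuming the "intu" volume is FUSE-mounted
# at /mnt/intu and a local copy of the kernel source tarball is available.
MNT=/mnt/intu                        # assumed mount point
TARBALL=/var/tmp/linux-3.9.4.tar.xz  # assumed local tarball path

for i in $(seq 1 10); do
    (
        while true; do
            dir="$MNT/untar-$i"
            mkdir -p "$dir"
            # Failures show up as "Cannot open" / "Compressed data is corrupt"
            # errors like the ones in the original description.
            tar -C "$dir" -xJf "$TARBALL"
            rm -rf "$dir"
        done
    ) &
done
wait
```

The 42-second window in the disconnect messages matches the default client ping timeout, which is tunable per volume with, for example, `gluster volume set intu network.ping-timeout 60`; the value 60 is only illustrative and is not a recommendation from this report.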