| Summary: | Heal info shows pending heals and a split brain in .shard, even after deleting all files from fuse mount. | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | replicate | Assignee: | Krutika Dhananjay <kdhananj> |
| Status: | CLOSED NOTABUG | QA Contact: | Sweta Anandpara <sanandpa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.2 | CC: | pkarampu, rcyriac, rhs-bugs, sanandpa, storage-qa-internal |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-15 16:23:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Sweta Anandpara
2016-12-13 08:55:01 UTC
Sosreports at: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1404160/

[qe@rhsqe-repo 1404160]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1404160]$ pwd
/home/repo/sosreports/1404160
[qe@rhsqe-repo 1404160]$ ll
total 184688
-rwxr-xr-x. 1 qe qe 52716040 Dec 13 14:21 sosreport-dhcp35-101.lab.eng.blr.redhat.com-20161213114944.tar.xz
-rwxr-xr-x. 1 qe qe 29882988 Dec 13 14:20 sosreport-dhcp35-104.lab.eng.blr.redhat.com-20161213114955.tar.xz
-rwxr-xr-x. 1 qe qe 52331340 Dec 13 14:21 sosreport-sysreg-prod-20161213115006.tar.xz
-rwxr-xr-x. 1 qe qe 54181576 Dec 13 14:21 sosreport-sysreg-prod-20161213115013.tar.xz
[qe@rhsqe-repo 1404160]$

I do not think that a replica ending up in a state where it accuses itself is due to a bug in sharding. Changing the component to AFR.

Krutika, could you take a look at this bug? Assigning it to you.

Pranith

(In reply to Sweta Anandpara from comment #0)
> Description of problem:
> =======================
> Had a 4-node cluster and a 2x2 volume 'nash' with sharding enabled. Had created
> a few files for testing interoperability with eventing, and had left the
> volume as is for more than a week. Revisited the volume this time to test
> EVENT_SPLIT_BRAIN, and created a scenario for the same by killing the brick
> processes (of the replica bricks) one after the other and appending to one of
> the files.
>
> While verifying the events, I saw the SPLIT_BRAIN event for the .shard
> directory and not for the file path in question. 'gluster volume heal info'
> showed the same output. I sensed something was not right and decided to
> delete all the old files in that volume and start afresh. After doing so,
> the volume heal info continued to show the stale split-brain entries.
>
> If there are files/dirs which require heal, or are in split-brain, and they
> are deleted from the mountpoint, I expect heal info to show a clean output.
>
> Version-Release number of selected component (if applicable):
> =========================================================
> 3.8.4-5
>
> How reproducible:
> ==================
> Have hit it once. The setup is still in the same state, in case it is to be
> looked at.
>
> Steps to Reproduce:
> ===================
> 1. Have a 2x2 volume and enable sharding.
> 2. Create a 100 MB file and a few other files.
> 3. Say brick1 and brick2 are replica pairs; kill brick1.
> 4. Append some data to the file '100mb_file'.
> 5. Kill brick2 and bring brick1 up.
> 6. Append some more data to the file '100mb_file'.
> 7. Bring brick2 up and verify 'gluster volume heal info'.
> 8. Delete all the files from the mountpoint.
>
> Actual results:
> ===============
> Step 7 shows the .shard directory in split-brain.
> Step 8 continues to show the .shard directory in split-brain.
>
> Expected results:
> ================
> Step 7 should show 100mb_file in split-brain.
> Step 8 should show 0 entries to be healed, as all files have been deleted.

Couple of observations:

1. The reason /.shard is displayed as being in split-brain in the heal-info output is that the shard 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1 has a gfid mismatch.

[root@dhcp35-115 .shard]# getfattr -d -m . -e hex *
# file: 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.nash-client-1=0x000000000000000000000000
trusted.bit-rot.signature=0x0103000000000000009d3ea6b8ae2018edea7e8adb63fcc70b22c284c03d38a0b80e42b03c5c24ce90
trusted.bit-rot.version=0x0300000000000000582565f9000de59c
trusted.gfid=0x38af4df0915d40df9e999544d9edb5e5

[root@dhcp35-100 .shard]# getfattr -d -m . -e hex 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.*
# file: 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1
security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.nash-client-0=0x000000010000000100000000
trusted.gfid=0xf67b9934a0d84452bdcd6dae1ffd181a

The gfid mismatch most likely arose because the two appending writes, each executed while the "other" brick was down, fell on the same shard and caused that shard to be created with a different gfid on each replica before it was written to.

2. I still see a couple of shards left in the first replica set, which means the rm -rf itself was a failed command. Would rm fail on an entry that has a gfid mismatch? I'm not sure; I need to check the code to find out.

As for the two replicas accusing themselves on .shard, again I need to check the code to see when this can happen. I'll update the bug once I've connected all the dots together.

-Krutika

> Additional info:
> ================
>
> [root@dhcp35-115 nash0]# gluster peer status
> Number of Peers: 3
>
> Hostname: dhcp35-101.lab.eng.blr.redhat.com
> Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
> State: Peer in Cluster (Connected)
>
> Hostname: 10.70.35.104
> Uuid: 10335359-1c70-42b2-bcce-6215a973678d
> State: Peer in Cluster (Connected)
>
> Hostname: 10.70.35.100
> Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
> State: Peer in Cluster (Connected)
> [root@dhcp35-115 nash0]# rpm -qa | grep gluster
> glusterfs-client-xlators-3.8.4-5.el6rhs.x86_64
> glusterfs-cli-3.8.4-5.el6rhs.x86_64
> glusterfs-api-devel-3.8.4-5.el6rhs.x86_64
> gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
> glusterfs-libs-3.8.4-5.el6rhs.x86_64
> glusterfs-fuse-3.8.4-5.el6rhs.x86_64
> glusterfs-devel-3.8.4-5.el6rhs.x86_64
> glusterfs-events-3.8.4-5.el6rhs.x86_64
> glusterfs-rdma-3.8.4-5.el6rhs.x86_64
> nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
> glusterfs-debuginfo-3.8.4-2.el6rhs.x86_64
> glusterfs-api-3.8.4-5.el6rhs.x86_64
> glusterfs-geo-replication-3.8.4-5.el6rhs.x86_64
> glusterfs-ganesha-3.8.4-5.el6rhs.x86_64
> gluster-nagios-common-0.2.4-1.el6rhs.noarch
> vdsm-gluster-4.16.30-1.5.el6rhs.noarch
> glusterfs-server-3.8.4-5.el6rhs.x86_64
> glusterfs-3.8.4-5.el6rhs.x86_64
> python-gluster-3.8.4-5.el6rhs.noarch
> [root@dhcp35-115 nash0]#
> [root@dhcp35-115 nash0]# gluster v list
> disp
> nash
> ozone
> [root@dhcp35-115 nash0]# gluster v info nash
>
> Volume Name: nash
> Type: Distributed-Replicate
> Volume ID: d9c962de-5e4a-4fa9-a9c4-89b6803e543f
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: 10.70.35.115:/bricks/brick1/nash0
> Brick2: 10.70.35.100:/bricks/brick1/nash1
> Brick3: 10.70.35.101:/bricks/brick1/nash2
> Brick4: 10.70.35.104:/bricks/brick1/nash3
> Options Reconfigured:
> cluster.entry-self-heal: on
> cluster.data-self-heal: on
> cluster.metadata-self-heal: on
> cluster.self-heal-daemon: enable
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> features.bitrot: off
> features.scrub: Inactive
> features.scrub-freq: hourly
> features.shard: on
> performance.stat-prefetch: off
> auto-delete: disable
> [root@dhcp35-115 nash0]#
> [root@dhcp35-115 nash0]# gluster v status nash
> Status of volume: nash
> Gluster process                                        TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.70.35.115:/bricks/brick1/nash0                49152     0          Y       29661
> Brick 10.70.35.100:/bricks/brick1/nash1                49152     0          Y       24917
> Brick 10.70.35.101:/bricks/brick1/nash2                49154     0          Y       2180
> Brick 10.70.35.104:/bricks/brick1/nash3                49153     0          Y       7680
> Self-heal Daemon on localhost                          N/A       N/A        Y       18473
> Self-heal Daemon on 10.70.35.100                       N/A       N/A        Y       4225
> Self-heal Daemon on dhcp35-101.lab.eng.blr.redhat.com  N/A       N/A        Y       16937
> Self-heal Daemon on 10.70.35.104                       N/A       N/A        Y       18825
>
> Task Status of Volume nash
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> [root@dhcp35-115 nash0]#
> [root@dhcp35-115 nash0]# gluster v heal nash info
> Brick 10.70.35.115:/bricks/brick1/nash0
> /.shard - Is in split-brain
>
> Status: Connected
> Number of entries: 1
>
> Brick 10.70.35.100:/bricks/brick1/nash1
> /.shard - Is in split-brain
>
> /.shard/991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1
> Status: Connected
> Number of entries: 2
>
> Brick 10.70.35.101:/bricks/brick1/nash2
> Status: Connected
> Number of entries: 0
>
> Brick 10.70.35.104:/bricks/brick1/nash3
> Status: Connected
> Number of entries: 0
>
> [root@dhcp35-115 nash0]#
> [root@dhcp35-115 nash0]# getfattr -d -m . -e hex /bricks/brick1/nash0/.shard/
> getfattr: Removing leading '/' from absolute path names
> # file: bricks/brick1/nash0/.shard/
> security.selinux=0x73797374656d5f753a6f626a6563745f723a64656661756c745f743a733000
> trusted.afr.nash-client-0=0x000000000000000000000001
> trusted.afr.nash-client-1=0x000000000000000000000001
> trusted.gfid=0xbe318638e8a04c6d977d7a937aa84806
> trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
>
> [root@dhcp35-115 nash0]#

Hi,

The problem is that an `rm -rf` on individual files won't remove .shard; .shard remains for the lifetime of the volume once sharding is enabled. And because .shard exists and is in split-brain, heal-info will always display it in its output. This explains why .shard continues to appear.

As far as /.shard/991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1 appearing in heal-info is concerned, the following log suggests that the rm -f on 30m_file didn't succeed entirely:

[2016-12-12 10:10:26.029055] W [MSGID: 108008] [afr-self-heal-name.c:369:afr_selfheal_name_gfid_mismatch_check] 0-nash-replicate-0: GFID mismatch for <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1 f67b9934-a0d8-4452-bdcd-6dae1ffd181a on nash-client-1 and 38af4df0-915d-40df-9e99-9544d9edb5e5 on nash-client-0
[2016-12-12 10:10:26.029938] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 0-nash-shard: Lookup on shard 1 failed. Base file gfid = 991e2e59-c8ea-4fef-9bd1-6c3a2b051156 [Input/output error]
[2016-12-12 10:10:26.029968] W [fuse-bridge.c:1355:fuse_unlink_cbk] 0-glusterfs-fuse: 124: UNLINK() /30m_file => -1 (Input/output error)

This is because shard sends a lookup on shards that are not in the inode table (in this case shard 1 of 30m_file), and because that lookup was failed by AFR with EIO (due to the gfid mismatch), the unlink of this shard failed, and in turn the unlink of 30m_file as a whole. That explains why this shard also appears in heal-info despite the rm -f.

Does this answer your questions?

-Krutika
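For reference, gfid mismatches of this kind are resolved by hand. A minimal sketch follows, assuming purely for illustration that the copy on 10.70.35.100:/bricks/brick1/nash1 (gfid f67b9934-a0d8-4452-bdcd-6dae1ffd181a) is the one to discard; deciding which copy to keep is an admin decision, and the paths below simply reuse the ones from this setup:

# On the node hosting the brick whose copy is to be discarded (10.70.35.100 here),
# confirm the gfid of that copy first:
getfattr -n trusted.gfid -e hex /bricks/brick1/nash1/.shard/991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1

# Remove the discarded copy together with its gfid hard link under .glusterfs
# (layout: .glusterfs/<first-2-hex-chars>/<next-2-hex-chars>/<full-gfid>):
rm -f /bricks/brick1/nash1/.shard/991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1
rm -f /bricks/brick1/nash1/.glusterfs/f6/7b/f67b9934-a0d8-4452-bdcd-6dae1ffd181a

# Trigger a heal so the surviving copy is recreated and the entry heal on .shard completes:
gluster volume heal nash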
The failed shard lookup also explains why there are 7 shards remaining in .shard. The way shard unlink works is to look up all shards of the file that are not in memory and then, if these parallel lookups succeed, send parallel unlinks on all of them. Since the LOOKUP stage of the UNLINK fop itself failed, shard failed the fop early and didn't proceed to send UNLINKs on the individual shards. This is why all 7 shards (30M/4M ~= 8 pieces, or 7 if you exclude the base file, i.e. the zeroth shard) are still present on the backend.

[root@dhcp35-100 .shard]# ls -l
total 10424
-rw-r--r--. 2 root root       0 Nov 11 14:57 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.1
-rw-r--r--. 2 root root 4194304 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.3
-rw-r--r--. 2 root root 4194304 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.5
-rw-r--r--. 2 root root 2097152 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.7

on the first DHT subvolume, and

[root@dhcp35-101 .shard]# ls -l
total 12300
-rw-r--r--. 2 root root 4194304 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.2
-rw-r--r--. 2 root root 4194304 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.4
-rw-r--r--. 2 root root 4194304 Nov 11 12:49 991e2e59-c8ea-4fef-9bd1-6c3a2b051156.6

on the second DHT subvolume. So there is nothing that can be fixed here.

After talking to Sweta, it seems like this is a good candidate for documentation: the case where the unlink of a sharded file fails for any reason (which will be logged by fuse) and leaves behind shards that need to be deleted from the backend. I don't know how to close this bug yet.

In a two-way replica volume, gfid mismatches will happen. When that happens, the unlink of the file will fail partially, leaving behind some blocks that need to be removed manually, just as we fix gfid mismatches of files manually. I guess there is nothing to be fixed here, and I would close this as not a bug. But what Sweta said has a point: this behavior, and how to delete the extra shards, definitely needs to be documented as part of a separate bug, which Sweta will raise.

Krutika and I had a discussion with Nag and Sweta Anandpara about this issue. This behavior is something we need to document. It would have been better if we could directly convert this bug itself into a documentation bug, but I guess the process requires that we open a separate bug, so requesting Sweta to raise the documentation bug for the same.

Closing this as NOTABUG.
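As input for that documentation bug, a rough sketch (bash) of the manual cleanup of leftover shards, assuming the base file's gfid is known from the fuse log (here 991e2e59-c8ea-4fef-9bd1-6c3a2b051156) and that any gfid-mismatched shard has already been dealt with; the brick path and gfid parsing below are illustrative, not an officially documented procedure:

# Run on every node, once per local brick of the volume (adjust BRICK per node: nash0..nash3).
BRICK=/bricks/brick1/nash0
BASE_GFID=991e2e59-c8ea-4fef-9bd1-6c3a2b051156

for shard in "$BRICK"/.shard/"$BASE_GFID".*; do
    [ -e "$shard" ] || continue
    # Read the shard's own gfid (printed as e.g. trusted.gfid=0x38af4df0915d40df9e999544d9edb5e5).
    hex=$(getfattr -n trusted.gfid -e hex "$shard" 2>/dev/null | awk -F= '/trusted.gfid/ {print $2}')
    g=${hex#0x}
    gfid="${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"
    # Remove the shard's gfid hard link under .glusterfs, then the shard file itself,
    # so no orphaned gfid links are left behind on the brick.
    rm -f "$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid" "$shard"
done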