Description of problem:
------------------------
With an 8X2 distributed replicate volume, initiated a remove-brick with data migration. After a few minutes, all the application VMs with their disk images on that gluster volume went into a paused state. Noticed split-brain error messages in the fuse mount log.

Version
--------
RHEL 6.7 as hypervisor
RHGS 3.1 based on RHEL 7.1

How reproducible:
-----------------
Tried only once

Steps to Reproduce:
-------------------
1. Create a 2X2 distributed replicate volume
2. Use this gluster volume as the 'Data Domain' for RHEV
3. Create a few App VMs and install the OS
4. Remove the bricks on which the disk images of the App VMs reside
   (a command-level sketch of these steps is given below)

Actual results:
----------------
App VMs went into **paused** state

Expected results:
-----------------
App VMs should remain healthy
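For reference, a minimal command sketch of the reproduction steps, assuming placeholder hostnames rhgs1..rhgs4, brick path /rhgs/brick1/vol1 and volume name vol1 (the actual names and paths from this setup are not recorded here):

# Create the 2x2 distributed-replicate volume and tune it for use as a RHEV data domain
gluster volume create vol1 replica 2 \
    rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 \
    rhgs3:/rhgs/brick1/vol1 rhgs4:/rhgs/brick1/vol1
gluster volume set vol1 group virt     # virt profile typically applied for VM image stores
gluster volume start vol1

# After the App VMs are created on the volume, remove the replica pair
# that holds the VM disk images, with data migration
gluster volume remove-brick vol1 rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 start
gluster volume remove-brick vol1 rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 status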
The following error messages are seen in the fuse mount logs:

[2015-07-15 17:49:42.709088] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710849] W [MSGID: 114031] [client-rpc-fops.c:1028:client3_3_fsync_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710874] W [MSGID: 108035] [afr-transaction.c:1614:afr_changelog_fsync_cbk] 6-vol1-replicate-0: fsync(b7d21675-6fd8-472a-b7d9-71d7436c614d) failed on subvolume vol1-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2015-07-15 17:49:42.710897] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 6-vol1-replicate-0: b7d21675-6fd8-472a-b7d9-71d7436c614d: Failing WRITE as quorum is not met
[2015-07-15 18:12:15.544061] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 12-vol1-replicate-5: Failing WRITE on gfid b7d21675-6fd8-472a-b7d9-71d7436c614d: split-brain observed. [Input/output error]
[2015-07-15 18:12:15.737906] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293197: WRITE => -1 (Input/output error)
[2015-07-15 18:12:17.022070] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-5: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:17.022073] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-4: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:22.952290] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293304: WRITE => -1 (Input/output error)
[2015-07-15 18:12:22.952550] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293306: WRITE => -1 (Input/output error)
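The quorum and split-brain messages above can be cross-checked from the server side. A minimal sketch, assuming the volume name vol1 taken from the log prefixes (vol1-client-*, vol1-replicate-*):

gluster volume status vol1                 # confirm which bricks are online when quorum failed
gluster volume heal vol1 info              # entries with pending self-heal per brick
gluster volume heal vol1 info split-brain  # files AFR currently reports as split-brain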
Created attachment 1052404 [details]
Fuse mount log from the hypervisor

This is the fuse mount log from the hypervisor.
Created attachment 1052406 [details]
sosreport from NODE1 - dhcp37-211

This is the sosreport from RHGS node1 - dhcp37-211.
Created attachment 1052407 [details]
sosreport from NODE2

This is the sosreport from NODE2 - dhcp37-66.
I retested this issue and am now hitting it consistently. This was the new test taken up:

1. Created a 2X2 distributed-replicate volume and used it to back the VM image store in RHEV
2. While installing the OS in an App VM, initiated 'remove-brick start' for the brick which originally contained the VM disk image (see the sketch below for how the brick was identified)
3. Observed split-brain error messages in the fuse mount log
4. The App VM went into PAUSED state

This looks like a regression and I could reproduce it consistently every time. Marking this bug as a BLOCKER for the RHGS 3.1 release.
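A minimal sketch of step 2, assuming the same placeholder names as above; the RHEV mount point and image UUIDs below are illustrative placeholders, not the actual paths from this run:

# On the hypervisor, find which bricks hold the VM disk image via the
# trusted.glusterfs.pathinfo virtual xattr exposed on the fuse mount
getfattr -n trusted.glusterfs.pathinfo \
    /rhev/data-center/mnt/<server>:_vol1/<sd-uuid>/images/<img-uuid>/<vol-uuid>

# While the OS install is still writing to that image, start removing the
# replica pair reported by pathinfo
gluster volume remove-brick vol1 rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 start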
This issue is a regression; adding the 'REGRESSION' keyword.
The doc text has been edited. Please sign off so that it can be included in Known Issues.
*** Bug 1235496 has been marked as a duplicate of this bug. ***
https://code.engineering.redhat.com/gerrit/#/c/53818/
Tested with glusterfs-3.7.1-13.el7rhgs using the following steps:

1. Configured a 2X2 gluster volume as a Data Domain for RHEV
2. Created a few App VMs and started OS installation on them
3. Initiated remove-brick on the brick which contained the VM disk images (the sequence used to finish the operation is sketched below)

No issues were found and the OS installations completed successfully. The App VMs remained healthy after removing the brick on which their disk images were stored.

Marking this bug as VERIFIED.
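A minimal sketch of how the remove-brick was driven to completion and the result checked, assuming the same placeholder volume/brick names used earlier in this report:

# Wait for data migration off the removed bricks to complete, then commit
gluster volume remove-brick vol1 rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 status
gluster volume remove-brick vol1 rhgs1:/rhgs/brick1/vol1 rhgs2:/rhgs/brick1/vol1 commit

# Confirm no files are left in split-brain after the operation
gluster volume heal vol1 info split-brain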
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html