Bug 1243542
Summary: [RHEV-RHGS] App VMs paused due to IO error caused by split-brain, after initiating remove-brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: distribute
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: urgent
Priority: unspecified
Docs Contact:
Version: rhgs-3.1
CC: asriram, asrivast, divya, nbalacha, ravishankar, rcyriac, rgowdapp, ssampat, vagarwal
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.1.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.1-12
Doc Type: Bug Fix
Doc Text: Previously, when a distribute leg of a distributed-replicate (dist-rep) Gluster volume that hosts VM images was removed using the `remove-brick start` Gluster CLI, the VMs went into a paused state. With this fix, the VMs no longer go into a paused state.
Story Points: ---
Clone Of:
Clones: 1244165 (view as bug list)
Environment: RHEL 6.7 as hypervisor; RHEVM 3.5.4; RHGS 3.1 nightly build (based on RHEL 7.1)
Last Closed: 2015-10-05 07:20:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1216951, 1244165, 1245202, 1245934, 1251815
Attachments:
Description
SATHEESARAN
2015-07-15 18:08:03 UTC
The following error messages are seen in the fuse mount logs:

[2015-07-15 17:49:42.709088] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710849] W [MSGID: 114031] [client-rpc-fops.c:1028:client3_3_fsync_cbk] 6-vol1-client-0: remote operation failed [Transport endpoint is not connected]
[2015-07-15 17:49:42.710874] W [MSGID: 108035] [afr-transaction.c:1614:afr_changelog_fsync_cbk] 6-vol1-replicate-0: fsync(b7d21675-6fd8-472a-b7d9-71d7436c614d) failed on subvolume vol1-client-0. Transaction was WRITE [Transport endpoint is not connected]
[2015-07-15 17:49:42.710897] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 6-vol1-replicate-0: b7d21675-6fd8-472a-b7d9-71d7436c614d: Failing WRITE as quorum is not met
[2015-07-15 18:12:15.544061] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 12-vol1-replicate-5: Failing WRITE on gfid b7d21675-6fd8-472a-b7d9-71d7436c614d: split-brain observed. [Input/output error]
[2015-07-15 18:12:15.737906] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293197: WRITE => -1 (Input/output error)
[2015-07-15 18:12:17.022070] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-5: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:17.022073] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 12-vol1-client-4: remote operation failed. Path: /c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b (d83a3f9a-7625-4872-b61f-0e4b63922a75) [No such file or directory]
[2015-07-15 18:12:22.952290] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293304: WRITE => -1 (Input/output error)
[2015-07-15 18:12:22.952550] W [fuse-bridge.c:2273:fuse_writev_cbk] 0-glusterfs-fuse: 293306: WRITE => -1 (Input/output error)

Created attachment 1052404 [details]
Fuse mount log from the hypervisor
This is the fuse mount log from the hypervisor
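For reference, the split-brain reported in the fuse mount log above can be inspected from the Gluster nodes with the heal CLI. The following is a minimal sketch, assuming the volume name vol1 taken from the log; the brick root in the getfattr example is hypothetical:

```
# List files currently in split-brain on the volume (run on any Gluster node).
gluster volume heal vol1 info split-brain

# Overall self-heal status, including entries still pending heal.
gluster volume heal vol1 info

# Inspect the AFR changelog extended attributes of the image file on each
# brick copy to see which replica blames the other (hypothetical brick root,
# file path taken from the log above).
getfattr -d -m . -e hex /rhgs/brick1/c29ec775-c933-4109-87bf-0b7c4373d0a0/images/9ddffb02-b804-4f28-a8fb-df609eaa884a/c7637ade-9c78-4bd7-a9e4-a14913f9060b
```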
Created attachment 1052406 [details]
sosreport from NODE1 - dhcp37-211
sosreport from RHGS node1 - dhcp37-211
Created attachment 1052407 [details]
sosreport from NODE2
This is the sosreport from NODE2 - dhcp37-66
I retested this issue and I am seeing it consistently. This was the new test taken up:

1. Created a 2X2 distributed-replicate volume and used it to back a VM image store in RHEV.
2. While installing the OS in the App VM, initiated 'remove-brick start' for the brick which originally contained the VM disk image (the corresponding CLI sequence is sketched at the end of this report).
3. Observed split-brain error messages in the fuse mount log.
4. The App VM went into a PAUSED state.

The issue looks like a regression and I could reproduce it consistently every time. Marking this bug as a BLOCKER for the RHGS 3.1 release.

This issue is a regression; adding the 'REGRESSION' keyword.

Doc text is edited. Please sign off to be included in Known Issues.

*** Bug 1235496 has been marked as a duplicate of this bug. ***

Tested with glusterfs-3.7.1-13.el7rhgs with the following steps:

1. Configured a 2X2 gluster volume as a Data Domain for RHEV.
2. Created a few App VMs and started OS installation on them.
3. Initiated remove-brick on the brick which contained the VM disk image.

There were no issues found and OS installation completed successfully. The App VMs are healthy after removing the brick on which the VM's disk image was available. Marking this bug as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html
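For reference, the remove-brick operation exercised in the reproduction and verification steps above corresponds roughly to the following Gluster CLI sequence. This is a minimal sketch with hypothetical node names and brick paths, not the exact commands used during testing:

```
# Hypothetical 2x2 distributed-replicate volume backing the RHEV data domain.
gluster volume create vol1 replica 2 \
    node1:/rhgs/brick1 node2:/rhgs/brick1 \
    node3:/rhgs/brick2 node4:/rhgs/brick2
gluster volume start vol1

# Start removing the replica pair (distribute leg) that holds the VM disk
# image while the VM is running; this triggers migration of its data to the
# remaining bricks.
gluster volume remove-brick vol1 node1:/rhgs/brick1 node2:/rhgs/brick1 start

# Monitor the migration, then commit the removal once it has completed.
gluster volume remove-brick vol1 node1:/rhgs/brick1 node2:/rhgs/brick1 status
gluster volume remove-brick vol1 node1:/rhgs/brick1 node2:/rhgs/brick1 commit
```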