Bug 1330044
| Field | Value |
|---|---|
| Summary | one of vm goes to paused state when network goes down and comes up back |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | replicate |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Version | rhgs-3.1 |
| Target Milestone | --- |
| Target Release | RHGS 3.1.3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | glusterfs-3.7.9-6 |
| Doc Type | Bug Fix |
| Doc Text | |
| Docs Contact | |
| Reporter | RamaKasturi <knarra> |
| Assignee | Ravishankar N <ravishankar> |
| QA Contact | SATHEESARAN <sasundar> |
| CC | amureini, asrivast, bugs, knarra, mchangir, pkarampu, rcyriac, rhinduja, rhs-bugs, sabose, sankarshan, storage-qa-internal |
| Keywords | ZStream |
| Story Points | --- |
| Clone Of | |
| Clones | 1336612 (view as bug list) |
| Environment | |
| Last Closed | 2016-06-23 05:19:16 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Gluster |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1258386, 1311817, 1336612, 1337822, 1337831 |
Description (RamaKasturi, 2016-04-25 10:39:06 UTC)
sos reports are present here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1330044/

Can you provide info on which node was brought down, and which node HE was running on when the nic was brought down?

There were three hosts: zod, sulphur and tettnang.
- zod: nic was brought down
- sulphur: primary server for gluster
- tettnang: HE was running

Regarding the HE engine restarts - I think we need a separate bug to track this. I see periodic umount and mount operations in the gluster logs, repeated every minute, and errors in the HE agent logs on accessing the storage domain. No related errors in the gluster logs, however.

Regarding the VM pause error on the zod server: there are a lot of EIO-related messages like this:

    var/log/vdsm/vdsm.log.1.xz:Thread-3503763::INFO::2016-04-25 14:56:36,103::clientIF::182::vds::(contEIOVms) Cont vm 74f7c2f0-150a-4076-9ade-9c467bcc922b in EIO

However, these VMs are then resumed, and only one continues to have an IO error:

    libvirtEventLoop::INFO::2016-04-25 14:56:37,532::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`1a6ad336-5317-40eb-a476-fbc70998e948`::CPU stopped: onIOError
    libvirtEventLoop::DEBUG::2016-04-25 14:56:37,533::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"1a6ad336-5317-40eb-a476-fbc70998e948": {"status": "Paused", "ioerror": {"alias": "virtio-disk0", "name": "vda", "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/928c922c-65ab-453e-bf75-472cc41a1b31/c71e0e1e-7cfb-40a6-b85c-170705aa36e7"}, "pauseCode": "EIO"}, "notify_time": 5501919980}, "jsonrpc": "2.0", "method": "|virt|VM_status|1a6ad336-5317-40eb-a476-fbc70998e948"}

and in the gluster mount log for the vmstore volume:

    [2016-04-25 09:26:37.470109] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]
    [2016-04-25 09:26:37.470240] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 105212361: FSYNC() ERR => -1 (Input/output error)
    [2016-04-25 09:26:37.470222] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]

Moving to the gluster team.

http://review.gluster.org/14368 and http://review.gluster.org/14369 have been posted for review upstream. Moving this bug to the POST state.

Tested with a RHGS 3.1.3 nightly build (glusterfs-3.7.9-6.el7rhgs) with the following tests:
1. Brought down the network on one of the nodes. The VMs running on that node got paused, but the other VMs kept running healthily.
2. Created new shards while one brick of the replica 3 volume was down, then brought that brick back up after some time. The heal completed successfully and there were no VM pauses.

Based on the above observations, marking this bug as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
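As a side note, the jsonrpc `VM_status` notification that vdsm logs (quoted above) is plain JSON, so the paused-on-EIO condition can be picked out programmatically. This is a minimal illustrative sketch, not vdsm code: the `paused_on_eio` helper is hypothetical, and the payload below is reproduced from the log line in this bug.

```python
import json

# Event payload reproduced from the vdsm log excerpt in this bug report.
event = json.loads("""
{"params": {"1a6ad336-5317-40eb-a476-fbc70998e948":
              {"status": "Paused",
               "ioerror": {"alias": "virtio-disk0",
                           "name": "vda",
                           "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/928c922c-65ab-453e-bf75-472cc41a1b31/c71e0e1e-7cfb-40a6-b85c-170705aa36e7"},
               "pauseCode": "EIO"},
            "notify_time": 5501919980},
 "jsonrpc": "2.0",
 "method": "|virt|VM_status|1a6ad336-5317-40eb-a476-fbc70998e948"}
""")

def paused_on_eio(notification):
    """Return (vmId, disk name) pairs for VMs paused with pauseCode EIO.

    Hypothetical helper for log triage; not part of vdsm itself.
    """
    hits = []
    for vm_id, status in notification.get("params", {}).items():
        if not isinstance(status, dict):
            continue  # skip scalar params such as "notify_time"
        if status.get("status") == "Paused" and status.get("pauseCode") == "EIO":
            hits.append((vm_id, status.get("ioerror", {}).get("name")))
    return hits

print(paused_on_eio(event))
# [('1a6ad336-5317-40eb-a476-fbc70998e948', 'vda')]
```

Matching the `vmId` and disk path reported here against the gfid in the `0-vmstore-replicate-0: split-brain observed` messages is what ties the single stuck-paused VM to the gluster-side split-brain.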