Description of problem:
======================
In an RHHI-1.6 environment, while testing rebalance: once the rebalance completed, the VMs running on the same brick failed to power off. The VMs remain stuck in the "powering down" state.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.12.2-45.el7rhgs.x86_64
rhv-4.3.0.5-0
ovirt-engine-4.3.1.1-0.1.el7.noarch

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Create a replica 3 (1x3) volume.
2. Host 2 or more VMs on it and pump I/O to the additional disk.
3. Add bricks to the volume mentioned above, making it 2x3.
4. Start the rebalance.
5. Once the rebalance is completed, power off the VMs.

Actual results:
==============
The VMs never powered off. Also tried a reboot, but never got an SSH connection afterwards.

Expected results:
================
The VMs should power off without any corruption.

Additional info:
===============
Looking through the logs, I don't see any clue that the FUSE mount is the culprit here. The logs look clean apart from a few layout-mismatch messages. The engine log, however, shows that the DestroyVDS command failed. Additionally, creating new VMs on those bricks works fine.
Engine log:
===========
2019-02-28 11:13:55,444+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed in 'DestroyVDS' method
2019-02-28 11:13:55,453+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM rhsqa-grafton9-nic2.lab.eng.blr.redhat.com command DestroyVDS failed: Virtual machine destroy error
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'DestroyVDSCommand(HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, DestroyVmVDSCommandParameters:{hostId='a37bef1d-9214-4476-a22b-2e556f71949f', vmId='ceefada7-bda7-4c7e-b7d8-1b72bd131e61', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVDSCommand, return: , log id: 6977bc42
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed to destroy VM 'ceefada7-bda7-4c7e-b7d8-1b72bd131e61' in VDS = 'a37bef1d-9214-4476-a22b-2e556f71949f', error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVmVDSCommand, return: , log id: 58eabca
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-02-28 11:13:55,464+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm1_Vol_expansion (Host: rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, User: admin@internal-authz)

FUSE mount log:
==============
[2019-02-27 11:28:04.401930] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388
[2019-02-27 11:28:04.402050] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c, gfid = 44063cb5-d123-423e-bef4-3b51bbe63ae6
[2019-02-27 11:28:04.402073] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388
[2019-02-27 11:28:04.403391] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images, gfid = 4986b231-207d-42a2-8eae-6d4dd09d1482
[2019-02-27 11:28:04.404863] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/58b4f4b2-f28a-4eae-90b5-7b576ad07874, gfid = 324c942a-6fcb-43e7-94f1-485ba0d9db3c
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388" repeated 3 times between [2019-02-27 11:28:04.401930] and [2019-02-27 11:28:10.331865]
[2019-02-27 11:28:10.331869] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:10.331944] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:22.392578] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-0: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392760] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-2: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392764] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-1: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388" repeated 4 times between [2019-02-27 11:28:04.402073] and [2019-02-27 11:28:33.071945]
[2019-02-27 11:28:33.071948] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/86d94115-6969-41a9-afc2-f8453820370f, gfid = 5384fdfe-38b9-4ebb-9b97-02ac02f73f81
[2019-02-27 11:28:33.072133] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388
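For context on the informational messages above: DHT assigns each directory a layout that divides the 32-bit name-hash space among the subvolumes, and after an add-brick the cached in-memory (inode) layout can differ from the on-disk layout until revalidation. The sketch below is my own illustration of that division, not GlusterFS code; real layouts also carry a commit hash (the third number in the log lines, e.g. 3820200388), and GlusterFS's actual split points may differ by one from an even split.

```python
# Illustration only: dividing a 32-bit DHT hash space among N subvolumes.
HASH_SPACE = 2**32  # DHT hashes file names into [0, 4294967295]

def even_layout(subvol_count):
    """Return (start, stop) hash ranges, one per subvolume."""
    step = HASH_SPACE // subvol_count
    ranges = []
    for i in range(subvol_count):
        start = i * step
        stop = HASH_SPACE - 1 if i == subvol_count - 1 else start + step - 1
        ranges.append((start, stop))
    return ranges

# Before expansion (1x3): one replica set owns the whole range,
# matching "inode layout - 0 - 4294967295" in the log.
print(even_layout(1))  # [(0, 4294967295)]
# After expansion (2x3): two replica sets split the range, roughly
# matching the "disk layout - 0 - 2147483646" / "2147483647 - 4294967295"
# lines (the real split point differs by one from this even division).
print(even_layout(2))
```

Until the client revalidates each directory against the new on-disk layout, every lookup logs one of these mismatch messages, which is why they appear in bulk right after a rebalance.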
Hi Bipin,

I was checking the vdsm logs and could not find any error which explains the reason:

2019-02-28 13:24:45,811+0530 WARN (jsonrpc/3) [virt.vm] (vmId='6998fc6d-79e3-46ca-a741-5d5f0322679e') Failed to destroy VM '6998fc6d-79e3-46ca-a741-5d5f0322679e' forcefully (error=38) (vm:5212)
2019-02-28 13:24:45,812+0530 INFO (jsonrpc/3) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.70.36.247,57014, flow_id=2bfe5929-1906-44a2-bc3c-4d3520fd4070, vmId=6998fc6d-79e3-46ca-a741-5d5f0322679e (api:54)

Can you please confirm that the engine log and the vdsm log were captured at the same time?
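The vdsm reply above is a plain status dict, and the engine turns any non-zero code into the VDSErrorException seen in the engine log. A minimal sketch of that contract follows; the helper names are hypothetical (the real check lives in the engine's Java VdsBroker classes), only the reply payload is taken from the log.

```python
# Sketch with hypothetical names: how a VDSM status reply like the one
# in the log above gets turned into an error on the engine side.
DESTROY_ERR = {'status': {'message': 'Virtual machine destroy error', 'code': 42}}

class VDSError(Exception):
    """Stand-in for the engine's VDSErrorException."""

def check_status(reply):
    """Return the reply if its status code is 0, else raise VDSError."""
    status = reply['status']
    if status['code'] != 0:
        raise VDSError("code=%d: %s" % (status['code'], status['message']))
    return reply

try:
    check_status(DESTROY_ERR)
except VDSError as exc:
    print(exc)  # code=42: Virtual machine destroy error
```

This is why the engine log shows "Failed with error destroyErr and code 42" even though the underlying cause (the stuck QEMU process) is only visible on the host.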
Gobinda, I have updated the bug with the proper vdsm log now. Could you please check?
Bipin, I don't find anything related in the given logs. Could you please attach the libvirt log as well?
On a side note, I have also hit a similar problem in a plain qemu/kvm environment:

1. Used a RHEL 7.6 server as the hypervisor.
2. Created a replica 3 volume with 3 VMs and FUSE-mounted the volume on this RHEL 7.6 host.
3. Created a few VMs and started installing the OS on them.
4. While the OS installation was still in progress, scaled the volume from 1x3 to 2x3.
5. Performed a rebalance.
6. Expanded the volume once again from 2x3 to 3x3, followed by a successful rebalance.
7. Removed a replica set to convert the volume back to 2x3.
8. At this point, powered down the VM, and observed that the VM never powered down and remained stuck in the 'in shutdown' state.
Gobinda, I tried reproducing twice but couldn't hit it; it worked fine. But since Sas also hit the issue, let's not close the bug. I am reducing the severity and will update the bug if we see it again later.
(In reply to SATHEESARAN from comment #7)
> On the side note, I have also hit the similar problem with qemu/kvm
> environment.
>
> 1. Used the hypervisor as RHEL 7.6 server
> 2. Created a replica 3 volume with 3 VMs and fuse mounted the volume in this
> RHEL 7.6 host
> 3. Created few VMs and started installation OS on it.

Can you try with a fully installed VM, with qemu-guest-agent, etc.?
This issue looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1702686, and there we did see the QEMU process becoming a defunct process.
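A defunct (zombie) process is one that has already exited but has not yet been reaped by its parent, so it lingers in the process table; if QEMU is stuck in that state, a destroy can fail even though the guest is effectively gone. The Linux-only snippet below is a generic illustration of what "defunct" means, not tied to QEMU or libvirt.

```python
import os
import time

# Fork a child that exits immediately; until the parent calls waitpid(),
# the child remains a zombie (state 'Z', shown as <defunct> by ps).
pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit without being reaped

time.sleep(0.2)  # parent: give the child time to exit
with open('/proc/%d/stat' % pid) as f:
    # /proc/<pid>/stat is "pid (comm) state ..."; grab the state field.
    state = f.read().split(') ')[1].split()[0]
print(state)  # 'Z' while the child is unreaped

os.waitpid(pid, 0)  # reaping removes the zombie entry
```

In the field, `ps -eo pid,stat,comm` showing a `Z`-state qemu-kvm process would confirm the same situation as in bug 1702686.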
*** This bug has been marked as a duplicate of bug 1705953 ***