Description of problem:
======================
In an RHHI-1.6 environment, while testing rebalance: once the rebalance completed, the VMs running on the same brick failed to power off. The VMs remain stuck in the "powering down" state.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.12.2-45.el7rhgs.x86_64
rhv-4.3.0.5-0
ovirt-engine-4.3.1.1-0.1.el7.noarch

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Create a replica 3 (1x3) volume.
2. Host 2 or more VMs on it and pump I/O to the additional disk.
3. Add bricks to the volume mentioned above, making it 2x3.
4. Start the rebalance.
5. Once the rebalance is completed, power off the VMs.

Actual results:
==============
The VMs never powered off. Also tried a reboot, but never got an SSH connection afterwards.

Expected results:
================
The VMs should power off without any corruption.

Additional info:
===============
Looking through the logs, I don't see any clue that the FUSE mount is the culprit here. The logs look clean apart from a few layout-mismatch messages. The engine log, however, shows that the DestroyVDS command failed. Additionally, creating new VMs on those bricks works fine.
Engine log:
===========
2019-02-28 11:13:55,444+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed in 'DestroyVDS' method
2019-02-28 11:13:55,453+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM rhsqa-grafton9-nic2.lab.eng.blr.redhat.com command DestroyVDS failed: Virtual machine destroy error
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'DestroyVDSCommand(HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, DestroyVmVDSCommandParameters:{hostId='a37bef1d-9214-4476-a22b-2e556f71949f', vmId='ceefada7-bda7-4c7e-b7d8-1b72bd131e61', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVDSCommand, return: , log id: 6977bc42
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed to destroy VM 'ceefada7-bda7-4c7e-b7d8-1b72bd131e61' in VDS = 'a37bef1d-9214-4476-a22b-2e556f71949f', error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-02-28 11:13:55,454+05 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVmVDSCommand, return: , log id: 58eabca
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-02-28 11:13:55,464+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm1_Vol_expansion (Host: rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, User: admin@internal-authz)

FUSE mount log:
==============
[2019-02-27 11:28:04.401930] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388
[2019-02-27 11:28:04.402050] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c, gfid = 44063cb5-d123-423e-bef4-3b51bbe63ae6
[2019-02-27 11:28:04.402073] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388
[2019-02-27 11:28:04.403391] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images, gfid = 4986b231-207d-42a2-8eae-6d4dd09d1482
[2019-02-27 11:28:04.404863] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/58b4f4b2-f28a-4eae-90b5-7b576ad07874, gfid = 324c942a-6fcb-43e7-94f1-485ba0d9db3c
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388" repeated 3 times between [2019-02-27 11:28:04.401930] and [2019-02-27 11:28:10.331865]
[2019-02-27 11:28:10.331869] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:10.331944] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:22.392578] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-0: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392760] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-2: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392764] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-1: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388" repeated 4 times between [2019-02-27 11:28:04.402073] and [2019-02-27 11:28:33.071945]
[2019-02-27 11:28:33.071948] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/86d94115-6969-41a9-afc2-f8453820370f, gfid = 5384fdfe-38b9-4ebb-9b97-02ac02f73f81
[2019-02-27 11:28:33.072133] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388
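For context on the informational messages above: DHT assigns each directory a layout that divides the 32-bit name-hash space among the subvolumes, and after an add-brick the cached in-memory (inode) layout can differ from the on-disk layout until revalidation. The sketch below is my own illustration of that division, not GlusterFS code; real layouts also carry a commit hash (the third number in the log lines, e.g. 3820200388), and GlusterFS's actual split points may differ by one from an even split.

```python
# Illustration only: dividing a 32-bit DHT hash space among N subvolumes.
HASH_SPACE = 2**32  # DHT hashes file names into [0, 4294967295]

def even_layout(subvol_count):
    """Return (start, stop) hash ranges, one per subvolume."""
    step = HASH_SPACE // subvol_count
    ranges = []
    for i in range(subvol_count):
        start = i * step
        stop = HASH_SPACE - 1 if i == subvol_count - 1 else start + step - 1
        ranges.append((start, stop))
    return ranges

# Before expansion (1x3): one replica set owns the whole range,
# matching "inode layout - 0 - 4294967295" in the log.
print(even_layout(1))  # [(0, 4294967295)]
# After expansion (2x3): two replica sets split the range, roughly
# matching the "disk layout - 0 - 2147483646" / "2147483647 - 4294967295"
# lines (the real split point differs by one from this even division).
print(even_layout(2))
```

Until the client revalidates each directory against the new on-disk layout, every lookup logs one of these mismatch messages, which is why they appear in bulk right after a rebalance.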
Hi Bipin,

I was checking the vdsm logs and could not find any error which explains the reason:

2019-02-28 13:24:45,811+0530 WARN (jsonrpc/3) [virt.vm] (vmId='6998fc6d-79e3-46ca-a741-5d5f0322679e') Failed to destroy VM '6998fc6d-79e3-46ca-a741-5d5f0322679e' forcefully (error=38) (vm:5212)
2019-02-28 13:24:45,812+0530 INFO (jsonrpc/3) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.70.36.247,57014, flow_id=2bfe5929-1906-44a2-bc3c-4d3520fd4070, vmId=6998fc6d-79e3-46ca-a741-5d5f0322679e (api:54)

Can you please confirm that the engine log and the vdsm log were captured at the same time?
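The vdsm reply above is a plain status dict, and the engine turns any non-zero code into the VDSErrorException seen in the engine log. A minimal sketch of that contract follows; the helper names are hypothetical (the real check lives in the engine's Java VdsBroker classes), only the reply payload is taken from the log.

```python
# Sketch with hypothetical names: how a VDSM status reply like the one
# in the log above gets turned into an error on the engine side.
DESTROY_ERR = {'status': {'message': 'Virtual machine destroy error', 'code': 42}}

class VDSError(Exception):
    """Stand-in for the engine's VDSErrorException."""

def check_status(reply):
    """Return the reply if its status code is 0, else raise VDSError."""
    status = reply['status']
    if status['code'] != 0:
        raise VDSError("code=%d: %s" % (status['code'], status['message']))
    return reply

try:
    check_status(DESTROY_ERR)
except VDSError as exc:
    print(exc)  # code=42: Virtual machine destroy error
```

This is why the engine log shows "Failed with error destroyErr and code 42" even though the underlying cause (the stuck QEMU process) is only visible on the host.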
Gobinda, I have updated the bug with the proper vdsm log now. Could you please check?
Bipin, I don't find anything related in the given logs. Could you please attach the libvirt log as well?
On a side note, I have also hit a similar problem in a plain qemu/kvm environment:

1. Used a RHEL 7.6 server as the hypervisor.
2. Created a replica 3 volume with 3 VMs and FUSE-mounted the volume on this RHEL 7.6 host.
3. Created a few VMs and started installing the OS on them.
4. While the OS installation was still in progress, scaled the volume from 1x3 to 2x3.
5. Performed a rebalance.
6. Expanded the volume once again from 2x3 to 3x3, followed by a successful rebalance.
7. Removed a replica set to convert the volume back to 2x3.
8. At this point, powered down the VM, and observed that the VM never powered down and remained stuck in the 'in shutdown' state.
Gobinda, I tried reproducing twice but couldn't hit it; it worked fine. But since Sas also hit the issue, let's not close the bug. I am reducing the severity and will update the bug if we see it again later.
(In reply to SATHEESARAN from comment #7)
> On the side note, I have also hit the similar problem with qemu/kvm
> environment.
>
> 1. Used the hypervisor as RHEL 7.6 server
> 2. Created a replica 3 volume with 3 VMs and fuse mounted the volume in this
> RHEL 7.6 host
> 3. Created few VMs and started installation OS on it.

Can you try with a fully installed VM, with qemu-guest-agent, etc.?
This issue looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1702686, and there we did see the QEMU process becoming a defunct process.
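A defunct (zombie) process is one that has already exited but has not yet been reaped by its parent, so it lingers in the process table; if QEMU is stuck in that state, a destroy can fail even though the guest is effectively gone. The Linux-only snippet below is a generic illustration of what "defunct" means, not tied to QEMU or libvirt.

```python
import os
import time

# Fork a child that exits immediately; until the parent calls waitpid(),
# the child remains a zombie (state 'Z', shown as <defunct> by ps).
pid = os.fork()
if pid == 0:
    os._exit(0)  # child: exit without being reaped

time.sleep(0.2)  # parent: give the child time to exit
with open('/proc/%d/stat' % pid) as f:
    # /proc/<pid>/stat is "pid (comm) state ..."; grab the state field.
    state = f.read().split(') ')[1].split()[0]
print(state)  # 'Z' while the child is unreaped

os.waitpid(pid, 0)  # reaping removes the zombie entry
```

In the field, `ps -eo pid,stat,comm` showing a `Z`-state qemu-kvm process would confirm the same situation as in bug 1702686.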
*** This bug has been marked as a duplicate of bug 1705953 ***