Bug 2047203

Summary: segfault at 68 on disk live migrate
Product: [oVirt] ovirt-node Reporter: Tommaso <tommaso>
Component: Included packages    Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE QA Contact: sshmulev
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.10    CC: aefrat, ahadas, bugs, cshao, eshames, mperina
Target Milestone: ovirt-4.5.0    Keywords: TestOnly
Target Release: 4.5.0    Flags: pm-rhel: ovirt-4.5?
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-6.2.0-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-20 06:33:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2002607    
Bug Blocks:    
Attachments:
core-dump (flags: none)

Description Tommaso 2022-01-27 11:54:59 UTC
Created attachment 1857079 [details]
core-dump

Description of problem:


During live disk migration between two NFS storage domains, after a few minutes the VM crashes with an error.
In /var/log/messages we see these errors:


Jan 27 11:22:28 host1.server.com kernel: qemu-kvm[154794]: segfault at 68 ip 000055f8bfe5b8e1 sp 00007f45a4229e90 error 6 in qemu-kvm[55f8bf836000+b4c000]
Jan 27 11:22:28 host1.server.com kernel: Code: 48 89 c6 48 8b 47 38 4c 01 c0 4c 01 c8 48 f7 f1 49 39 fc 74 d4 48 83 e8 01 49 39 c6 77 cb 48 39 de 77 c6 48 83 7f 68 00 75 bf <49> 89 7c 24 68 31 f6 48 83 c7 50 e8 8f 82 0d 00 49 c7 44 24 68 00
Jan 27 11:22:31 host1.server.com abrt-hook-ccpp[163412]: Process 154794 (qemu-kvm) of user 107 killed by SIGSEGV - dumping core
Jan 27 11:22:41 host1.server.com vdsm[139234]: WARN executor state: count=5 workers={<Worker name=periodic/4 waiting task#=1844 at 0x7f32787cde48>, <Worker name=periodic/1 waiting task#=2071 at 0x7f3290087eb8>, <Worker name=periodic/5 waiting task#=728 at 0x7f327863a0f0>, <Worker name=periodic/2 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f32900774a8> at 0x7f3290077630> timeout=7.5, duration=7.50 at 0x7f32900779e8> discarded task#=2109 at 0x7f3290087f60>, <Worker name=periodic/6 waiting task#=0 at 0x7f3291ad97b8>}
Jan 27 11:23:50 host1.server.com abrt-hook-ccpp[163543]: Can't generate core backtrace: dwfl_getthread_frames failed: No DWARF information found
Jan 27 11:23:50 host1.server.com abrt-hook-ccpp[163412]: Core backtrace generator exited with error 1
Jan 27 10:25:09 host1.server.com kernel: IO iothread1[142832]: segfault at 68 ip 000055f0949b98e1 sp 00007fa5b77c4e90 error 6 in qemu-kvm[55f094394000+b4c000]
Jan 27 10:25:09 host1.server.com kernel: Code: 48 89 c6 48 8b 47 38 4c 01 c0 4c 01 c8 48 f7 f1 49 39 fc 74 d4 48 83 e8 01 49 39 c6 77 cb 48 39 de 77 c6 48 83 7f 68 00 75 bf <49> 89 7c 24 68 31 f6 48 83 c7 50 e8 8f 82 0d 00 49 c7 44 24 68 00
Jan 27 10:25:09 host1.server.com abrt-hook-ccpp[154174]: Process 142827 (qemu-kvm) of user 107 killed by SIGSEGV - dumping core
Jan 27 10:25:26 host1.server.com vdsm[139234]: WARN executor state: count=5 workers={<Worker name=periodic/4 waiting task#=1049 at 0x7f32787cde48>, <Worker name=periodic/1 waiting task#=1322 at 0x7f3290087eb8>, <Worker name=periodic/5 waiting task#=0 at 0x7f327863a0f0>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f32900774a8> at 0x7f3290077630> timeout=7.5, duration=7.50 at 0x7f3278706f98> discarded task#=1322 at 0x7f3290087748>, <Worker name=periodic/2 waiting task#=1321 at 0x7f3290087f60>}
Jan 27 10:26:08 host1.server.com abrt-hook-ccpp[154293]: Can't generate core backtrace: dwfl_getthread_frames failed: No DWARF information found
Jan 27 10:26:08 host1.server.com abrt-hook-ccpp[154174]: Core backtrace generator exited with error 1

This error and the consequent VM reboot cause the disk migration to fail.

This kind of issue occurs often on large Windows VMs.
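
Since abrt could not produce a core backtrace above ("No DWARF information found"), here is a rough sketch of how a backtrace might be extracted manually from the attached core dump. It assumes the CentOS Stream 8 debuginfo repositories are reachable from the host; the abrt dump path is a placeholder based on abrt defaults and will differ on the real system.

# Install debug symbols matching the exact qemu-kvm build from the package list below
dnf debuginfo-install qemu-kvm-core-6.1.0-5.module_el8.6.0+1040+0ae94936

# Open the core written by abrt-hook-ccpp for PID 154794 (path is a placeholder)
gdb /usr/libexec/qemu-kvm /var/spool/abrt/ccpp-*-154794/coredump

# Inside gdb, log all thread backtraces to a file that can be attached to this bug
(gdb) set logging file qemu-kvm-backtrace.txt
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) set logging off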



Additional info:

[root@host1 ~]# rpm -qa | grep vdsm
vdsm-http-4.40.100.2-1.el8.noarch
vdsm-api-4.40.100.2-1.el8.noarch
vdsm-network-4.40.100.2-1.el8.x86_64
vdsm-4.40.100.2-1.el8.x86_64
vdsm-python-4.40.100.2-1.el8.noarch
vdsm-yajsonrpc-4.40.100.2-1.el8.noarch
vdsm-client-4.40.100.2-1.el8.noarch
vdsm-jsonrpc-4.40.100.2-1.el8.noarch
vdsm-common-4.40.100.2-1.el8.noarch
[root@host1 ~]# rpm -qa | grep qemu-kvm
qemu-kvm-docs-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-block-curl-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-core-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-ui-spice-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-common-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-hw-usbredir-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-ui-opengl-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-block-rbd-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-block-ssh-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-block-gluster-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
qemu-kvm-block-iscsi-6.1.0-5.module_el8.6.0+1040+0ae94936.x86_64
[root@host1 ~]# uname -a
Linux host1.server.com 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
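
For completeness, this is roughly how the failing operation can be retriggered from the command line through the oVirt REST API while the VM is running; the engine host name, credentials, disk ID and target storage domain ID below are placeholders, not values taken from this environment:

# Ask the engine to live-migrate the VM's disk to the other NFS storage domain
# (the same live storage migration path that ends in the qemu-kvm segfault above)
curl -k -s -u 'admin@internal:PASSWORD' \
  -X POST \
  -H 'Content-Type: application/xml' \
  -H 'Accept: application/xml' \
  -d '<action><storage_domain id="TARGET_SD_ID"/></action>' \
  'https://engine.example.com/ovirt-engine/api/disks/DISK_ID/move'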

Comment 1 RHEL Program Management 2022-01-27 13:25:03 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Tommaso 2022-01-27 15:41:15 UTC
The bug seems similar to this one reported against pve-qemu-kvm: https://forum.proxmox.com/threads/proxmox-7-0-14-1-crashes-vm-during-migrate-to-other-host.99678/
Is it possible to have a patch like the one mentioned in that thread, available here: https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=edbcc10a6914c115d9d148f498b3c6c7631820f6 ?

Comment 5 sshmulev 2022-04-18 11:54:17 UTC
Verified according to tier2 and tier3 automation runs of the TCs related to live merge.

Versions:
engine-4.5.0-0.237.el8ev
vdsm-4.50.0.10-1.el8ev
qemu-kvm-6.2.0-9.module+el8.6.0+14480+c0a3aa0f

Comment 6 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bug is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.