Bug 1969848
| Summary: | qemu-img convert hangs on aarch64 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Ondřej Budai <obudai> |
| Component: | qemu-kvm | Assignee: | Andrew Jones <drjones> |
| qemu-kvm sub component: | General | QA Contact: | Zhenyu Zhang <zhenyzha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | ckellner, drjones, jinzhao, juzhang, lcapitulino, qzhang, timao, virt-maint, xuwei, zhenyzha |
| Version: | 8.4 | Keywords: | OtherQA, Triaged |
| Target Milestone: | beta | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.5 | ||
| Hardware: | aarch64 | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-4.2.0-57.module+el8.5.0+12118+4998563d | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-11-09 18:01:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1885765 | ||
|
Description
Ondřej Budai
2021-06-09 10:31:07 UTC
Updated test information:
I performed the convert operation 100 times on a RHEL 8.4 host, but the issue did not reproduce.
Host Distro: RHEL-8.4.0
Host Kernel: 4.18.0-305.el8.aarch64
Qemu-kvm: qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b
qemu-img version 4.2.0 (qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b)
# cat img_convert.sh
#!/bin/bash
count=$((${1:-1}-1))
i=0;while [ $i -lt $(($count+1)) ]; do qemu-img convert -O qcow2 ./rhel840-aarch64-virtio-scsi.raw ./rhel840-aarch64-virtio-scsi.qcow2 -p ;i=$(($i+1));done
# sh img_convert.sh 100 > img_convert_log.txt
# echo $?
0
# cat img_convert_log.txt (no errors found)
(100.00/100%)
(100.00/100%)
......
(100.00/100%)
(100.00/100%)
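The loop in img_convert.sh above never returns if a conversion hangs, so a stuck run looks the same as a slow one. A minimal sketch of a variant that puts a time limit on each iteration follows. The helper name `run_with_hang_check` and its `hc_*` variables are hypothetical and not part of the original script; the sketch only assumes GNU coreutils `timeout`, which exits with status 124 when the limit is reached.

```shell
# Hypothetical helper (not from the original report): run a command up to
# COUNT times, each under a time limit, and stop at the first iteration
# that times out (a suspected hang) or fails.
# Usage: run_with_hang_check COUNT LIMIT_SECONDS COMMAND [ARGS...]
run_with_hang_check() {
    hc_count=$1
    hc_limit=$2
    shift 2
    hc_i=1
    while [ "$hc_i" -le "$hc_count" ]; do
        hc_rc=0
        # GNU timeout exits with 124 when the time limit was reached.
        timeout "$hc_limit" "$@" || hc_rc=$?
        if [ "$hc_rc" -eq 124 ]; then
            echo "iteration $hc_i: not finished after ${hc_limit}s (suspected hang)" >&2
            return 124
        elif [ "$hc_rc" -ne 0 ]; then
            echo "iteration $hc_i: exit status $hc_rc" >&2
            return "$hc_rc"
        fi
        hc_i=$((hc_i + 1))
    done
    echo "all $hc_count iterations completed"
}
```

A call such as `run_with_hang_check 100 600 qemu-img convert -O qcow2 ./rhel840-aarch64-virtio-scsi.raw ./rhel840-aarch64-virtio-scsi.qcow2 -p` would then report the iteration at which a conversion stalled. The 600-second limit is an assumed value and should be well above the normal conversion time. Note that `timeout` terminates the stuck process, so this approach is suited to counting hangs rather than inspecting a hung process.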
# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: APM
BIOS Vendor ID: Ampere(TM)
Model: 2
Model name: X-Gene
BIOS Model name: eMAG
Stepping: 0x3
CPU max MHz: 3300.0000
CPU min MHz: 375.0000
BogoMIPS: 80.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
NUMA node0 CPU(s): 0-31
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Ondrej, can you give this a try using the virt stack from the Advanced-Virt RHEL module for 8.4? If this is really https://bugs.launchpad.net/qemu/+bug/1805256, it should be fixed there.

Drew, can you take a look?

If this is really https://bugs.launchpad.net/qemu/+bug/1805256, then it seems it's fixed by:
commit 5710a3e09f9b85801e5ce70797a4a511e5fc9e2c
Author: Paolo Bonzini <pbonzini>
Date: Tue Apr 7 10:07:46 2020 -0400
async: use explicit memory barriers
This should be present in AV, so a solution is to use AV. If we think the fix is important for other archs, then we could request it to be fixed in the z-stream.

Hi, using Advanced-Virt isn't really an option for us because qemu-img is used by osbuild-composer, which is shipped in AppStream.
I took Zhenyu's test script and ran it on an AWS EC2 c6g.large machine with the RHEL-8.4.0_HVM-20210504-arm64-2-Access2-GP2 image:
$ rpm -q qemu-img
qemu-img-4.2.0-48.module+el8.4.0+10368+630e803b.aarch64
$ uname -a
Linux ip-172-31-23-50.ec2.internal 4.18.0-305.el8.aarch64 #1 SMP Thu Apr 29 08:58:53 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"
I used a raw Fedora image as the artifact to convert: https://download.fedoraproject.org/pub/fedora/linux/releases/34/Cloud/x86_64/images/Fedora-Cloud-Base-34-1.2.x86_64.raw.xz
First time: qemu-img hung on the 24th attempt at (96.95/100%)
Second time: hung on the 30th attempt at (95.78/100%)
Third time: all 100 conversions succeeded
Fourth time: hung on the 54th attempt at (94.61/100%)
-------
In this setup, the issue seems to be much rarer but it definitely still exists. Given the low probability, I think we don't need to ship a fix in z-stream, but it would be great to see this fixed in 8.5.
Thanks a lot for looking into this!

Before posting 5710a3e09f9b to non-AV 8.5 (z-stream or y-stream) I'd like to get confirmation that it fixes the issue. If I prepare a non-AV build with that patch, can somebody test it?

I'm more than happy to test it.

Hello Andrew,
I performed the convert operation 800 times with qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b on aarch64, but the issue didn't reproduce.
I ran 4 processes at the same time, each performing the operation 200 times:
# sh img_convert_1.sh 200
# echo $?
0
# sh img_convert_2.sh 200
# echo $?
0
# sh img_convert_3.sh 200
# echo $?
0
# sh img_convert_4.sh 200
# echo $?
0
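The four concurrent runs above can also be started from a single shell. The following is a minimal sketch, assuming a hypothetical helper `run_parallel` (not part of the original test scripts) that launches several copies of a command in the background, waits for each one, and reports how many failed. It appends the worker number as the last argument, so that a worker script can choose its own output file.

```shell
# Hypothetical helper (not from the original report): start WORKERS copies
# of a command in the background and wait for each of them.
# The worker number (1..WORKERS) is appended as the last argument.
# Usage: run_parallel WORKERS COMMAND [ARGS...]
run_parallel() {
    rp_workers=$1
    shift
    rp_pids=""
    rp_i=1
    while [ "$rp_i" -le "$rp_workers" ]; do
        "$@" "$rp_i" &
        rp_pids="$rp_pids $!"
        rp_i=$((rp_i + 1))
    done
    rp_failed=0
    for rp_pid in $rp_pids; do
        # wait returns the exit status of the given background job.
        wait "$rp_pid" || rp_failed=$((rp_failed + 1))
    done
    echo "$rp_failed of $rp_workers workers failed"
    [ "$rp_failed" -eq 0 ]
}
```

For example, `run_parallel 4 ./convert_worker.sh 200` would run four workers with 200 iterations each, where `convert_worker.sh` is an assumed script that takes the iteration count and the worker number and writes to a per-worker qcow2 file. A worker that hangs keeps `wait` blocked, so this sketch detects failures but not hangs on its own.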
I also performed the convert operation 800 times with qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b on x86_64.
The conversion is very slow there: it has been running for more than 12 hours, and two processes still have not completed.
So far, while the processes are running, the issue has not reproduced there either.
So does this bug have anything to do with the architecture? How can I reproduce it reliably?
===== hardware info ========
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 4
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
CPU family: 23
Model: 1
Model name: AMD EPYC 7401P 24-Core Processor
BIOS Model name: AMD EPYC 7401P 24-Core Processor
Stepping: 2
CPU MHz: 2401.838
CPU max MHz: 2000.0000
CPU min MHz: 1200.0000
BogoMIPS: 3992.39
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-5,24-29
NUMA node1 CPU(s): 6-11,30-35
NUMA node2 CPU(s): 12-17,36-41
NUMA node3 CPU(s): 18-23,42-47
# free -h
total used free shared buff/cache available
Mem: 31Gi 569Mi 1.8Gi 19Mi 28Gi 30Gi
Swap: 15Gi 77Mi 15Gi
(In reply to Zhenyu Zhang from comment #6)
> So does this bug have anything to do with the architecture?
Probably. AArch64 has a weakly-ordered memory model, so there's a good chance that the barriers added with commit 5710a3e09f9b could fix a hang that reproduces on AArch64 but not on x86.
> How can I test to reproduce it stably?
I don't know. Maybe the reporter can help with that.

Hello Ondřej,
I ran the test script in 4 processes, 200 times each, pinned to the same CPU, but still didn't reproduce the issue.
# taskset -c 4 ./img_convert_1.sh 200
--------
# echo $?
0
# taskset -c 4 ./img_convert_2.sh 200
--------
# echo $?
0
# taskset -c 4 ./img_convert_3.sh 200
--------
# echo $?
0
# taskset -c 4 ./img_convert_4.sh 200
--------
# echo $?
0
Could you share your environment? All my machines are borrowed from Beaker. Alternatively, could you share the hostname of a Beaker machine that can reproduce the issue?

Thanks to Ondřej for providing the env.
With qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b, I hit this issue on the 112th run in Ondřej's env.
With qemu-kvm-4.2.0-52.module+el8.5.0+10875+d90dbc7e.drjones202106211655, running the test script 200 times * 4 did not hit this issue.
So I think this patch is working.
Test env:
[ec2-user@ip-10-30-18-44 ~]$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Model name: Neoverse-N1
Stepping: r3p1
BogoMIPS: 243.75
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0,1
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
[ec2-user@ip-10-30-18-44 ~]$ free -h
total used free shared buff/cache available
Mem: 3.4Gi 474Mi 755Mi 30Mi 2.2Gi 2.4Gi
Swap: 0B 0B 0B
[ec2-user@ip-10-30-18-44 ~]$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 3454 MB
node 0 free: 756 MB
node distances:
node 0
0: 10
Thanks to Ondřej for providing the env.
On the same env: ip-10-30-18-124.us-east-1.aws.redhat.com
With qemu-kvm-4.2.0-57.module+el8.5.0+12118+4998563d, running the test script 200 times * 5 did not hit this issue.
So set Verified:Tested

Set bug to VERIFIED according to Comment 34

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:4191
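To check whether a host already carries the build listed under "Fixed In Version", the installed version-release string can be compared with 4.2.0-57.module+el8.5.0+12118+4998563d. The sketch below assumes a hypothetical helper `version_at_least` and relies on GNU `sort -V`, which only approximates RPM's own version ordering, so treat the result as a hint rather than an authoritative answer.

```shell
# Hypothetical helper (not from the original report): succeed if the
# version-release string $1 sorts at or after $2.
# Assumption: GNU `sort -V` ordering is close enough to RPM's comparison
# for strings of this shape.
version_at_least() {
    va_lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$va_lowest" = "$2" ]
}

# Version-release of the fixed build named in this report.
fixed="4.2.0-57.module+el8.5.0+12118+4998563d"
```

On a RHEL host, the installed value could be obtained with `rpm -q --qf '%{VERSION}-%{RELEASE}' qemu-img` and passed as the first argument, for example `version_at_least "$installed" "$fixed" && echo "fix present"`, where `installed` is an assumed variable holding the rpm output.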