Bug 1969848 - qemu-img convert hangs on aarch64
Summary: qemu-img convert hangs on aarch64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: aarch64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 8.5
Assignee: Andrew Jones
QA Contact: Zhenyu Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1885765
Reported: 2021-06-09 10:31 UTC by Ondřej Budai
Modified: 2022-05-09 08:55 UTC (History)

Fixed In Version: qemu-kvm-4.2.0-57.module+el8.5.0+12118+4998563d
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-09 18:01:39 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:4191 0 None None None 2021-11-09 18:02:15 UTC

Description Ondřej Budai 2021-06-09 10:31:07 UTC
Description of problem:
When running the following command on aarch64, it hangs indefinitely in about 30% of cases:

qemu-img convert -O qcow2 ./img.raw ./img.qcow2

Our team thinks that this is the same bug as Ubuntu had: https://bugs.launchpad.net/qemu/+bug/1805256

We applied the same workaround as in the Ubuntu bug: adding the -m 1 argument to the qemu-img convert command. In our experiments, this completely mitigated the bug.
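
For illustration, the workaround can be sketched as a small wrapper. The -m option of qemu-img convert sets the number of parallel coroutines (8 by default), so -m 1 serializes the copy and is likely slower. The function name convert_serial and the QEMU_IMG override variable are hypothetical and are not part of the report or of osbuild.

```shell
#!/bin/bash
# Hypothetical wrapper: convert a raw image to qcow2 using a single
# coroutine (-m 1), the workaround taken from the Ubuntu bug.
# QEMU_IMG is an assumed override so a different binary can be substituted.
convert_serial() {
    "${QEMU_IMG:-qemu-img}" convert -m 1 -O qcow2 "$1" "$2"
}

# Example: convert_serial ./img.raw ./img.qcow2
```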


Version-Release number of selected component (if applicable):
15:4.2.0-48.module+el8.4.0+10368+630e803b

How reproducible:
~ 30% (it depends on the number of cores of the system - we use 2 in our setup).


Steps to Reproduce:
1. Install qemu-img on RHEL 8.4 on aarch64
2. Run qemu-img convert -O qcow2 ./img.raw ./img.qcow2
3. Repeat step 2 until it hangs
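
The steps above could be automated roughly as follows. This is only a sketch: it assumes GNU timeout is available and treats any run that exceeds a time limit as a hang. The function name repro_hang, the QEMU_IMG override, and the choice of limit are assumptions, not something taken from this report.

```shell
#!/bin/bash
# Sketch: repeat the conversion and report the first attempt that exceeds
# the time limit. repro_hang and QEMU_IMG are hypothetical names.
repro_hang() {
    local runs=$1 limit=$2 src=$3 dst=$4 i=1 rc
    while [ "$i" -le "$runs" ]; do
        timeout "$limit" "${QEMU_IMG:-qemu-img}" convert -O qcow2 "$src" "$dst"
        rc=$?
        if [ "$rc" -eq 124 ]; then      # GNU timeout exits 124 when the limit is hit
            echo "hang on attempt $i"
            return 1
        elif [ "$rc" -ne 0 ]; then      # any other failure is not a hang
            echo "error $rc on attempt $i"
            return 2
        fi
        i=$((i + 1))
    done
    echo "no hang in $runs runs"
}

# Example: repro_hang 100 600 ./img.raw ./img.qcow2
```

The limit should be chosen well above the normal conversion time for the image, otherwise slow runs are misreported as hangs.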

Actual results:
The qemu-img convert process sometimes hangs indefinitely and needs to be killed.

Expected results:
The qemu-img convert process always finishes.

Additional info:
In osbuild, we merged this PR to work around this issue: https://github.com/osbuild/osbuild/pull/657

Comment 1 Zhenyu Zhang 2021-06-11 09:13:37 UTC
Updated test information:

I performed the convert operation 100 times on a RHEL 8.4 host, but the issue did not reproduce.

Host Distro: RHEL-8.4.0
Host Kernel: 4.18.0-305.el8.aarch64
Qemu-kvm: qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b
qemu-img version 4.2.0 (qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b)

# cat img_convert.sh
#!/bin/bash
count=${1:-1}; i=0    # number of conversions to run (default: 1)
while [ "$i" -lt "$count" ]; do
    qemu-img convert -O qcow2 ./rhel840-aarch64-virtio-scsi.raw ./rhel840-aarch64-virtio-scsi.qcow2 -p    # -p prints progress
    i=$((i + 1))
done

# sh img_convert.sh 100 > img_convert_log.txt
# echo $?
0

# cat img_convert_log.txt  (no errors found)
    (100.00/100%)
    (100.00/100%)
    ......
    (100.00/100%)
    (100.00/100%)

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           APM
BIOS Vendor ID:      Ampere(TM)
Model:               2
Model name:          X-Gene
BIOS Model name:     eMAG 
Stepping:            0x3
CPU max MHz:         3300.0000
CPU min MHz:         375.0000
BogoMIPS:            80.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
NUMA node0 CPU(s):   0-31
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

Comment 2 Luiz Capitulino 2021-06-11 13:59:43 UTC
Ondrej,

Can you give this a try by using the Virt stack from the Advanced-Virt RHEL module for 8.4? If this is really https://bugs.launchpad.net/qemu/+bug/1805256 it should be fixed there.

Drew, can you take a look?

If this is really https://bugs.launchpad.net/qemu/+bug/1805256, then it seems it's fixed by:

commit 5710a3e09f9b85801e5ce70797a4a511e5fc9e2c
Author: Paolo Bonzini <pbonzini>
Date:   Tue Apr 7 10:07:46 2020 -0400

    async: use explicit memory barriers

This should be present in AV, so a solution is to use AV. If we think the fix is important for other archs, then we could request it to be fixed in the z-stream.

Comment 3 Ondřej Budai 2021-06-14 07:52:31 UTC
Hi,

using Advanced-Virt isn't really an option for us because qemu-img is used by osbuild-composer that is shipped in AppStream.


I took Zhenyu's test script and ran it on an AWS EC2 c6g.large machine with the RHEL-8.4.0_HVM-20210504-arm64-2-Access2-GP2 image:

$ rpm -q qemu-img
qemu-img-4.2.0-48.module+el8.4.0+10368+630e803b.aarch64

$ uname -a
Linux ip-172-31-23-50.ec2.internal 4.18.0-305.el8.aarch64 #1 SMP Thu Apr 29 08:58:53 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

I used a raw Fedora image as the artifact to convert: https://download.fedoraproject.org/pub/fedora/linux/releases/34/Cloud/x86_64/images/Fedora-Cloud-Base-34-1.2.x86_64.raw.xz

First run: qemu-img hung on the 24th attempt at (96.95/100%)
Second run: hung on the 30th attempt at (95.78/100%)
Third run: all 100 conversions succeeded
Fourth run: hung on the 54th attempt at (94.61/100%)
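
As a rough estimate of the per-conversion hang rate in this setup, the four runs can be added up. The sketch assumes that each run that hung stopped at the hanging attempt, so the attempt counts are 24, 30, 100 and 54. With only three hangs observed, the figure is indicative at best.

```shell
#!/bin/bash
# Attempts per run, taken from the four runs above; a run that hung is
# assumed to have stopped at the hanging attempt.
attempts="24 30 100 54"
hangs=3
total=0
for n in $attempts; do
    total=$((total + n))
done
# awk is used only for the floating-point division.
rate=$(awk -v h="$hangs" -v t="$total" 'BEGIN { printf "%.1f", 100 * h / t }')
echo "$hangs hangs in $total attempts (~$rate%)"
```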

-------

In this setup, the issue seems to be much rarer, but it definitely still exists. Given the low probability, I don't think we need to ship a fix in z-stream, but it would be great to see this fixed in 8.5.

Thanks a lot for looking into this!

Comment 4 Andrew Jones 2021-06-14 09:56:32 UTC
Before posting 5710a3e09f9b to non-AV 8.5 (z-stream or y-stream) I'd like to get confirmation that it fixes the issue. If I prepare a non-AV build with that patch, can somebody test it?

Comment 5 Ondřej Budai 2021-06-14 10:15:12 UTC
I'm more than happy to test it.

Comment 6 Zhenyu Zhang 2021-06-16 07:00:42 UTC
Hello Andrew,

I performed the convert operation 800 times with qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b on aarch64, but the issue didn't reproduce.

I ran 4 processes concurrently, 200 iterations each:
# sh img_convert_1.sh 200
# echo $?
0

# sh img_convert_2.sh 200
# echo $?
0

# sh img_convert_3.sh 200
# echo $?
0

# sh img_convert_4.sh 200
# echo $?
0
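
For reference, running several conversion loops concurrently and collecting their exit statuses could look roughly like the sketch below. The helper name run_parallel and its calling convention are hypothetical; the example invocation assumes per-worker scripts named as in the transcript above.

```shell
#!/bin/bash
# Sketch: start N copies of a command in the background and wait for all.
# run_parallel is a hypothetical helper; each worker receives its index
# (1..N) as an extra trailing argument.
run_parallel() {
    local workers=$1 failed=0 pids="" pid w=1
    shift
    while [ "$w" -le "$workers" ]; do
        "$@" "$w" &
        pids="$pids $!"
        w=$((w + 1))
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of $workers workers failed"
    [ "$failed" -eq 0 ]
}

# Example:
#   run_parallel 4 sh -c 'sh "img_convert_$1.sh" 200' _
```

Note that a hung worker keeps wait blocked indefinitely, which is itself the signal that the issue reproduced.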

I also started the same 800 conversions with qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b on x86_64.
The conversion is very slow there: after more than 12 hours, two of the processes have still not completed.
The issue has not reproduced on x86_64 so far either.
So does this bug have anything to do with the architecture? How can I reproduce it reliably?



===== hardware info ========
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
BIOS Vendor ID:      Advanced Micro Devices, Inc.
CPU family:          23
Model:               1
Model name:          AMD EPYC 7401P 24-Core Processor
BIOS Model name:     AMD EPYC 7401P 24-Core Processor               
Stepping:            2
CPU MHz:             2401.838
CPU max MHz:         2000.0000
CPU min MHz:         1200.0000
BogoMIPS:            3992.39
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-5,24-29
NUMA node1 CPU(s):   6-11,30-35
NUMA node2 CPU(s):   12-17,36-41
NUMA node3 CPU(s):   18-23,42-47

# free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       569Mi       1.8Gi        19Mi        28Gi        30Gi
Swap:          15Gi        77Mi        15Gi

Comment 8 Andrew Jones 2021-06-21 16:13:01 UTC
(In reply to Zhenyu Zhang from comment #6)
> So does this bug have anything to do with the architecture?

Probably. AArch64 has a weakly ordered memory model, so there's a good chance that the barriers added by commit 5710a3e09f9b fix a hang that reproduces on AArch64 but not on x86.

> How can I test to reproduce it stably?

I don't know. Maybe the reporter can help with that.

Comment 9 Zhenyu Zhang 2021-06-22 00:37:09 UTC
Hello Ondřej,

I ran the test script as 4 processes, 200 iterations each, all pinned to the same CPU, but still couldn't reproduce the issue.
# taskset -c 4 ./img_convert_1.sh 200  -------- # echo $?  0
# taskset -c 4 ./img_convert_2.sh 200  -------- # echo $?  0
# taskset -c 4 ./img_convert_3.sh 200  -------- # echo $?  0
# taskset -c 4 ./img_convert_4.sh 200  -------- # echo $?  0


Could you share your environment?
All my machines are borrowed from Beaker. Alternatively, could you share the hostname of a Beaker machine that can reproduce the issue?

Comment 20 Zhenyu Zhang 2021-06-28 11:21:18 UTC
Thanks to Ondřej for providing the environment.
With qemu-kvm-4.2.0-48.module+el8.4.0+10368+630e803b, I hit the issue on the 112th run in Ondřej's environment.
With qemu-kvm-4.2.0-52.module+el8.5.0+10875+d90dbc7e.drjones202106211655, I ran the test script 200 times * 4 and did not hit the issue.
So I think the patch works.

Test env:
[ec2-user@ip-10-30-18-44 ~]$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               1
Model name:          Neoverse-N1
Stepping:            r3p1
BogoMIPS:            243.75
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0,1
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

[ec2-user@ip-10-30-18-44 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.4Gi       474Mi       755Mi        30Mi       2.2Gi       2.4Gi
Swap:            0B          0B          0B

[ec2-user@ip-10-30-18-44 ~]$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 3454 MB
node 0 free: 756 MB
node distances:
node   0 
  0:  10

Comment 34 Zhenyu Zhang 2021-08-05 15:41:27 UTC
Thanks to Ondřej for providing the environment.

On the same env (ip-10-30-18-124.us-east-1.aws.redhat.com), with qemu-kvm-4.2.0-57.module+el8.5.0+12118+4998563d, I ran the test script 200 times * 5 and did not hit the issue.
Setting Verified:Tested.

Comment 35 Zhenyu Zhang 2021-08-06 05:41:49 UTC
Set bug to VERIFIED according to Comment 34

Comment 38 errata-xmlrpc 2021-11-09 18:01:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4191
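
To check whether an installed qemu-img is at or above the Fixed In Version listed for this bug (qemu-kvm-4.2.0-57), something like the sketch below may help. The helper name has_fix is hypothetical, and sort -V only approximates RPM's own version comparison, so treat the result as a hint rather than a definitive answer.

```shell
#!/bin/bash
# Sketch: is a VERSION-RELEASE string at or above the Fixed In Version?
# has_fix is a hypothetical helper; sort -V approximates RPM ordering.
FIXED_IN="4.2.0-57"
has_fix() {
    # The lower of the two versions sorts first; the fix is present when
    # that lower version is the FIXED_IN threshold itself.
    [ "$(printf '%s\n' "$FIXED_IN" "$1" | sort -V | head -n 1)" = "$FIXED_IN" ]
}

# Example:
#   has_fix "$(rpm -q --qf '%{VERSION}-%{RELEASE}' qemu-img)" && echo "fix present"
```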

