Bug 2002009 - IPI Baremetal, qemu-convert takes to long to save image into drive on slow/large disks
Summary: IPI Baremetal, qemu-convert takes to long to save image into drive on slow/la...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.11.0
Assignee: Derek Higgins
QA Contact: Lubov
URL:
Whiteboard:
: 2005938 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-07 17:12 UTC by Manuel Rodriguez
Modified: 2022-08-11 02:44 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When writing the RHCOS image to some disks qemu-img was allocating space on the entire disk (including sparse areas) Consequence: On some hardware this could take hours Fix: Disable qemu-img sparse image creation Result: image writing no longer take hours on effected hardware
Clone Of:
Environment:
Last Closed: 2022-08-10 10:37:25 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
console output ironic events (1.22 MB, image/png)
2021-09-07 17:12 UTC, Manuel Rodriguez
no flags Details


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 808993 0 None NEW Allow qemu-img to write out zeros to disk 2021-09-14 16:17:48 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:37:45 UTC

Description Manuel Rodriguez 2021-09-07 17:12:10 UTC
Created attachment 1821291 [details]
console output ironic events

Description of problem:

OCP deployments take too long to complete bootstrap up to 6 hours in servers with 4TB disk size because qemu-convert takes long time to save image into drive


Version-Release number of selected component (if applicable):

OCP 4.7 and 4.8 (probably 4.6 too, but I haven't tested)



How reproducible:


Steps to Reproduce:
1. Deploy OCP 4.7 or above in baremetal nodes with large drives 4TB+

2. Monitor the progress, you'll see in .openshift-install.log that the following step runs for hours

module.masters.ironic_deployment.openshift-master-deployment

3. When the deployment is completed you'll see in the summary, the time to prepare the infrastructure runs for 5+ hours.

level=debug msg="Time elapsed per stage:"
level=debug msg="    Infrastructure: 5h56m40s"
level=debug msg="Bootstrap Complete: 22m53s"
level=debug msg=" Bootstrap Destroy: 12s"
level=debug msg=" Cluster Operators: 19m53s"
level=info msg="Time elapsed: 6h39m41s"

4. When checking on the console of one of the servers, we can see the image is getting saved in the disk and then just constant ironic heart-beats are displayed. [see image attached: master-node-writing-disk-ironic.png]

5. Finally when logging to one of the servers during the bootstrap, we can see the process running is qemu-convert[image attached: master-node-writing-disk-ironic.png]

Upstream code:
https://github.com/openstack/ironic-python-agent/blob/7.0.1/ironic_python_agent/shell/write_image.sh#L52



Actual results:

Successful, but very long deployments (6+ hours)


Expected results:

Faster deployments, (1+ hour)


Additional info: These servers are not using RAID controller, they are disks directly connected to the motherboard. It seems servers behind a RAID card in RAID-1 get better deployment times.

Comment 2 Derek Higgins 2021-09-08 08:30:02 UTC
I've hit something similar on a Dell when provisioning the RHCOS image to a 1.1T HDD, 
the provisioning is being delayed when using qemu-img, specifically this command

qemu-img convert -O host_device -t directsync -W /tmp/cached-rhcos-49.84.202107010027-0-openstack.x86_64.qcow2 /dev/sda

Using a smaller test image and a smaller 8G block device (/dev/sdb3), I can see that qemu-img is spending its time
zero'ing out data on the entire disk in 2G chunks, each chunk is taking about 10 seconds (see the calls to fallocate)

[root@r640-u08 ~]# rpm -qf `which qemu-img `
qemu-img-4.2.0-34.module+el8.3.0+7976+077be4ec.x86_64

[root@r640-u08 ~]# time strace -T -f -e fallocate,pwrite64 /bin/qemu-img convert -O host_device -t directsync -W afile.qcow2 /dev/sdb3 |& grep  "("
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 512) = 0 <0.007062>
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 2147483136) = 0 <9.989639>
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2147483136, 2147483136) = 0 <9.894393>
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 4294966272, 2147483136) = 0 <9.966464>
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 6442449408, 2147483136) = 0 <9.899406>
[pid 13019] fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 8589932544, 2048) = 0 <0.016816>
[pid 13019] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 41943040 <unfinished ...>
[pid 13022] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 25165824strace: Process 13023 attached
[pid 13023] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 16777216strace: Process 13024 attached
[pid 13024] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8650752 <unfinished ...>
[pid 13022] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 8388608 <unfinished ...>
[pid 13019] pwrite64(10, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0\24\0\n\2"..., 16384, 1921024 <unfinished ...>
[pid 13025] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 274432, 0strace: Process 13027 attached
[pid 13027] pwrite64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 278528) = 4096 <0.000364>

real    0m39.845s
user    0m0.014s
sys     0m0.020s



When I change to an older version of qemu-img(from RHEL 8.2) the fallocate system calls aren't present and the image write goes from
40 seconds to about .5 of a second.


[root@r640-u08 ~]# rpm -qf `which qemu-img `
qemu-img-2.12.0-99.module+el8.2.0+5827+8c39933c.x86_64
[root@r640-u08 ~]# time strace -T -f -e fallocate,pwrite64 /bin/qemu-img convert -O host_device -t directsync -W afile.qcow2 /dev/sdb3 |& grep  "("
[pid 12979] pwrite64(10, "\2\0\0\0\f\0\1\2.\0\0\0\2\0\0\0\f\0\2\2..\0\0\v\0\0\0\24\0\n\2"..., 15360, 1921024strace: Process 12980 attached
[pid 12975] pwrite64(10, "\0302\0\0\0\310\0\0\0\n\0\0\226\255\0\0\f2\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 272384, 1024 <unfinished ...>
[pid 12977] pwrite64(10, "\0\0\0\0\0\0\0\0\33w7a\33w7a\33w7a\0\0\0\0\0\0\0\0\0\0\0\0"..., 1536, 279552 <unfinished ...>
[pid 12979] pwrite64(10, "\0302\0\0\0\310\0\0\0\n\0\0\230\255\0\0\r2\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 1536, 8389632 <unfinished ...>
[pid 12973] pwrite64(10, "DEREKH DEADBEEF\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 8653824 <unfinished ...>
[pid 12975] pwrite64(10, "\300;9\230\0\0\0\4\0\0\0\0\0\0\4\0\0\0\20\0\0\0\0\1\0\0\0\5\0\0\0\0"..., 14848, 16778240 <unfinished ...>
[pid 12976] pwrite64(10, "\0302\0\0\0\310\0\0\0\n\0\0\230\255\0\0\r2\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 1536, 25166848 <unfinished ...>
[pid 12980] pwrite64(10, "\0302\0\0\0\310\0\0\0\n\0\0\230\255\0\0\r2\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 1536, 41944064 <unfinished ...>

real    0m0.530s
user    0m0.017s
sys     0m0.032s

Comment 4 Derek Higgins 2021-09-27 13:43:27 UTC
*** Bug 2005938 has been marked as a duplicate of this bug. ***

Comment 7 Derek Higgins 2022-06-01 10:21:03 UTC
This fix for has merged upstream in ironic and is part of the 4.11 and 4.10 releases.

Comment 15 errata-xmlrpc 2022-08-10 10:37:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.