Bug 1717728 - LiveOS installs use excessive CPU for /dev/loop1
Summary: LiveOS installs use excessive CPU for /dev/loop1
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: lorax
Version: 31
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Brian Lane
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-06 02:54 UTC by Chris Murphy
Modified: 2019-08-19 08:43 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-18 17:13:11 UTC
Type: Bug
Embargoed:



Description Chris Murphy 2019-06-06 02:54:22 UTC
Description of problem:

Any installation from Fedora LiveOS media (e.g. Fedora-Workstation-Live-x86_64-30-1.2.iso) results in the kernel worker thread for /dev/loop1 using excessively high CPU, varying between 50% and 100%.



Version-Release number of selected component (if applicable):
lorax-30.18


How reproducible:
Always

Steps to Reproduce:
1. Do an install from the LiveOS media (or otherwise stress reads of the live root image)

Actual results:

/dev/loop1 uses quite a lot of CPU


Expected results:

It shouldn't be the #1 or #2 CPU resource consumer.

Additional info:

I can't tell if this is mostly due to xz/lzma decompression, or if it's due to having an ext4 file system nested in a squashfs, both of which are being hammered during random reads.

# watch -n 2 cat /proc/759/stack
Every 2.0s: cat /proc/759/stack                                                                                            localhost: Wed Jun  5 21:03:37 2019

[<0>] squashfs_read_data+0x391/0x6c0 [squashfs]
[<0>] squashfs_cache_get+0x123/0x360 [squashfs]
[<0>] squashfs_readpage_block+0x2b/0x60 [squashfs]
[<0>] squashfs_readpage+0x559/0x8a0 [squashfs]
[<0>] read_pages+0x12a/0x1b0
[<0>] __do_page_cache_readahead+0x18d/0x210
[<0>] ondemand_readahead+0x223/0x560
[<0>] generic_file_read_iter+0x5c0/0xc50
[<0>] do_iter_readv_writev+0x1b2/0x1e0
[<0>] do_iter_read+0xe3/0x1b0
[<0>] loop_queue_work+0x38b/0xa5f [loop]
[<0>] kthread_worker_fn+0xc4/0x210
[<0>] kthread+0x108/0x140
[<0>] ret_from_fork+0x3a/0x50


# perf record -F 99 -p 759
# perf report

Samples: 1K of event 'cpu-clock:pppH', Event count (approx.): 19939393740
Overhead  Command  Shared Object      Symbol
  63.78%  loop1    [kernel.kallsyms]  [k] lzma_main
  11.20%  loop1    [kernel.kallsyms]  [k] dict_repeat
   9.93%  loop1    [kernel.kallsyms]  [k] lzma_len
   3.39%  loop1    [kernel.kallsyms]  [k] crc32_le
   2.68%  loop1    [kernel.kallsyms]  [k] bcj_apply
   0.96%  loop1    [kernel.kallsyms]  [k] squashfs_copy_data.part.0
   0.91%  loop1    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   0.91%  loop1    [kernel.kallsyms]  [k] memcpy
   0.46%  loop1    [kernel.kallsyms]  [k] __softirqentry_text_start
   0.30%  loop1    [kernel.kallsyms]  [k] xz_dec_lzma2_run
   0.25%  loop1    [kernel.kallsyms]  [k] __add_to_page_cache_locked
   0.25%  loop1    [kernel.kallsyms]  [k] get_page_from_freelist
   0.25%  loop1    [kernel.kallsyms]  [k] xz_dec_bcj_run
   0.20%  loop1    [kernel.kallsyms]  [k] __blk_complete_request
   0.20%  loop1    [kernel.kallsyms]  [k] __find_get_block
   0.20%  loop1    [kernel.kallsyms]  [k] finish_task_switch
   0.20%  loop1    [kernel.kallsyms]  [k] loop_queue_work
   0.15%  loop1    [kernel.kallsyms]  [k] generic_file_read_iter
   0.15%  loop1    [kernel.kallsyms]  [k] xas_load
   0.10%  loop1    [kernel.kallsyms]  [k] __lru_cache_add
   0.10%  loop1    [kernel.kallsyms]  [k] bio_init
   0.10%  loop1    [kernel.kallsyms]  [k] blk_add_timer
   0.10%  loop1    [kernel.kallsyms]  [k] blk_mq_get_request
   0.10%  loop1    [kernel.kallsyms]  [k] ktime_get
   0.10%  loop1    [kernel.kallsyms]  [k] mark_page_accessed
   0.10%  loop1    [kernel.kallsyms]  [k] memcpy_to_page
   0.10%  loop1    [kernel.kallsyms]  [k] percpu_counter_add_batch
   0.10%  loop1    [kernel.kallsyms]  [k] pvclock_clocksource_read
   0.10%  loop1    [kernel.kallsyms]  [k] rb_insert_color
   0.10%  loop1    [kernel.kallsyms]  [k] squashfs_xz_uncompress
   0.10%  loop1    [kernel.kallsyms]  [k] xz_dec_run
   0.05%  loop1    [kernel.kallsyms]  [k] I_BDEV
   0.05%  loop1    [kernel.kallsyms]  [k] __alloc_pages_nodemask


Possibly the easiest test is to use zstd instead of xz as a mksquashfs option. But I also wonder about the point of nesting an ext4 file system inside the squashfs image. I've looked at a handful of other distros using squashfs for their LiveOS images; their files are directly on squashfs, not inside a nested second file system image, and they don't seem to have the high CPU problem I'm reporting here. But I also don't know whether they're using xz.

See also
https://bugzilla.redhat.com/show_bug.cgi?id=1648490

Comment 1 Bruno Wolff III 2019-06-06 03:28:48 UTC
I think the ext4 image was used because the install from a live image is copied to the system and hence needs to be a file system. It's been a while, so it's possible I'm misremembering things.

Yes, my plan was to add a zstd option once support was available in squashfs-tools.
My memory is that we need to pick default compression-specific options for squashfs-tools, and recommendations there would be nice to hear. Mostly we want to save space more than compression time.
It would probably be possible to have multiple options for the same compression method but with different parameters; I think for this application we probably don't need to do that.
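For reference, compressor-specific options go to mksquashfs as -X flags after -comp; an illustrative xz invocation (the option values here are just an example, not what livecd-creator passes today) would be:

mksquashfs /path/to/rootdir squashfs.img -comp xz -Xbcj x86 -b 1M -processors 4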

Comment 2 Chris Murphy 2019-06-06 04:40:03 UTC
LiveOS installations copy files over using rsync. A long time ago, if the user kept the default root as ext4 setting, the ext4 image was block copied over to the target LV, and then resized to fit. I'd guess every time there's an ext4 metadata request, it must go through squashfs and lzma.

task createLiveMedia (f31, Fedora-Workstation-Live-Rawhide-20190603.n.0, fedora-live-workstation.ks, x86_64)
https://kojipkgs.fedoraproject.org//work/tasks/6104/35246104/program.log

That suggests mksquashfs used 48 processors, and took 2 minutes.

For zstd, the --ultra option is needed to enable compression levels 20 and higher, but memory use goes up at those levels, both when compressing and decompressing, so we'd have to see what the memory requirement ends up being. Also -T0 to detect physical CPUs and use that many threads, and possibly the --long option.
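As a standalone illustration of those flags (the level and the --long window size here are just examples):

zstd --ultra -22 -T0 --long=27 rootfs.img -o rootfs.img.zst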

Comment 3 Chris Murphy 2019-06-06 04:43:44 UTC
Does mksquashfs pass individual files for compression, or a stream? If it's individual files, then the dictionary option might apply. The question is how to leverage it, and whether squashfs has a reserved (uncompressed) area for the dictionary, since it will be needed for decompression. It is only beneficial with small files. If mksquashfs creates a data stream of files to pass to zstd, then a dictionary will have no benefit.
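With the standalone zstd tool, a dictionary would be trained and used roughly like this (paths purely illustrative):

zstd --train /path/to/small-files/* -o smalldict
zstd -D smalldict somefile          # compress using the dictionary
zstd -D smalldict -d somefile.zst   # the same dictionary is needed to decompress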

Comment 4 Bruno Wolff III 2019-06-06 05:23:23 UTC
Well if you are compressing an ext4 image it wouldn't be passing individual files.
If you are compressing a directory tree, I'm not sure. I thought it shared stuff. It will use multiple processes to do compression. It detects how many are available, though I think you can set a max. Some of these might depend on the compression method. The xz compressor allows filters that can help with compression, but that isn't turned on for livecd creator. I think there are ways to keep some meta info uncompressed, but I don't know what the trade offs look like if you do that.

If rsync is always used now, then it would make sense to get rid of the ext4 layer. That could be done independently of adding zstd support. I remember there being an option to avoid compression, and that will probably need to keep a file system, which would replace squashfs.

Comment 5 Bruno Wolff III 2019-06-06 05:42:29 UTC
Another thing that might have affected the ext4-on-squashfs decision is that back when that was done, squashfs handled less special stuff. Device nodes, for one, couldn't be properly created in squashfs. I think extended attribute support got added at some point.

Comment 6 Bruno Wolff III 2019-06-06 07:41:50 UTC
If we were to try to get rid of the intermediate ext4 file system, doing so should probably be controlled by an option initially, so that people can try it out without much chance of breaking things. It may not be obvious at first if some little-used stuff is broken.
I did my tweaks to parts of livecd-creator. I never looked at lorax. I needed help with what I did, as I am not very good with Python.

Comment 7 John Reiser 2019-06-06 15:24:49 UTC
(In reply to Chris Murphy from comment #0)

> Actual results:
> /dev/loop1 uses quite a lot of CPU

> Expected results:
> It shouldn't be a #1 or #2 CPU resource consumer

That expectation is unlikely to be met for any tight compression that uses arithmetic encoding as its major algorithm.

The LZMA* decoding that xz performs takes the same amount of time and memory as the corresponding LZMA* encoding stage. This is a property of arithmetic encoding: the position-dependent matrix of historic probabilities that is used to predict the next byte [high-likelihood prediction enables compact encoding] must be the same each time a choice is made, independent of whether encoding or decoding. Thus the decoder must re-construct the position-dependent matrix that the encoder used. xz compression as a whole still takes longer than xz decompression because compression also requires match finding [deflation] and filtering, not just encoding.

For Fedora Live media, what is encoded is an ext4 filesystem with a blocksize of 4KiB.  ext4 has high sequentiality (for any file stored in ext4 the number of block extents tends to be not much more than the minimum), so byte intervals seen by the encoder tend to correspond to concatenations of long pieces of whole files.  squashfs supposedly caches decoded chunks (expanded chunk size is at least 128KiB [?]), so misses in (or mismanagement of) the squashfs cache might be responsible for some repeat decoding when ext4 "randomly" accesses its metadata.
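(The block size actually in use can be read from the squashfs superblock, e.g. on the booted live image:

unsquashfs -s /run/initramfs/live/LiveOS/squashfs.img

which prints the block size along with the other superblock fields.)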

Comment 8 Brian Lane 2019-06-06 16:00:55 UTC
The question about nesting ext4 inside the squashfs has come up before, and IIRC one of the original reasons for it was the overlay filesystem that we set up at boot time. It's probably time to re-evaluate how we do this and do some experiments. The Anaconda team should also be involved since this will also affect how boot.iso is created. There may also be some concerns about backwards compatibility with tools mounting the images, so changes should be well documented to make it easier for people to transition to a new format.

What's actually being compressed is a simple directory tree with 1 file :) /LiveOS/rootfs.img which is the ext4 image.

The code that does this in lorax is here: https://github.com/weldr/lorax/blob/master/src/pylorax/treebuilder.py#L216

Comment 9 Chris Murphy 2019-06-06 16:56:41 UTC
A device-mapper based overlay depends on a rw snapshot (live-rw) of the read-only rootfs.img (live-base), so this method requires a filesystem with random rw, which squashfs doesn't support. Using squashfs directly could be done with overlayfs, and the rw layer could maybe be tmpfs or zram. I think dracut already has support for this, and some distros use it.
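A rough sketch of that setup (paths illustrative; dracut's actual dmsquash-live script differs in detail):

mount -t squashfs -o loop,ro squashfs.img /run/rootfsbase
mount -t tmpfs tmpfs /run/overlayfs
mkdir -p /run/overlayfs/upper /run/overlayfs/work
mount -t overlay overlay -o lowerdir=/run/rootfsbase,upperdir=/run/overlayfs/upper,workdir=/run/overlayfs/work /sysroot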

Another idea is to make rootfs.img Btrfs and not use squashfs at all; the gist is that it's overall simpler and faster for all of our use cases, but in my limited testing it results in a ~2% larger image. Details were discussed in this thread; there are a couple of downstreams interested in taking advantage of this.
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/BJF3SLPPIXJER7FHK3HXYBL7TLK5T2AF/

In both Fedora 30 and Rawhide there is a curious artifact I can't figure out:
[root@localhost-live ~]# losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                               DIO LOG-SEC
/dev/loop1         0      0         0  1 /LiveOS/rootfs.img                        0     512
/dev/loop2         0      0         0  0 /overlay (deleted)                        0     512
/dev/loop0         0      0         0  1 /run/initramfs/live/LiveOS/squashfs.img   0     512
[root@localhost-live ~]# 

Even with systemd debug, rd debug, udev debug, I can't figure out what loop2 is or was, what created it or what deleted it. But then I'm also running into bug 1715699, causing the loss of ~15-20s of early boot logging, again both on Fedora 30 (as shipped) and Rawhide live media.

Comment 10 Brian Lane 2019-06-06 17:12:51 UTC
Ah, right, overlayfs may be a good solution; it looks like the current dmsquash-live has support for it - https://github.com/dracutdevs/dracut/blob/master/modules.d/90dmsquash-live/dmsquash-live-root.sh

There's zero chance that btrfs will be used as a solution for this so let's explore using overlayfs and a plain squashfs of the filesystem.

Comment 11 Frederick Grose 2019-06-07 00:01:58 UTC
(In reply to Chris Murphy from comment #9)

> In both Fedora 30 and Rawhide there is a curious artifact I can't figure out:
> [root@localhost-live ~]# losetup
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                               DIO LOG-SEC
> /dev/loop1         0      0         0  1 /LiveOS/rootfs.img                        0     512
> /dev/loop2         0      0         0  0 /overlay (deleted)                        0     512
> /dev/loop0         0      0         0  1 /run/initramfs/live/LiveOS/squashfs.img   0     512
> [root@localhost-live ~]# 
> 
> Even with systemd debug, rd debug, udev debug, I can't figure out what loop2
> is or was, what created it or what deleted it.

This comes from the temporary overlay file created here in the initramfs for the live-rw snapshot,
https://github.com/dracutdevs/dracut/blob/8365177644bee107dcb66a71c8a7509900c9ece4/modules.d/90dmsquash-live/dmsquash-live-root.sh#L217

Once the switch_root occurs, this file path disappears and is reported as (deleted) by losetup.
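A simplified sketch of that pattern (not the literal dracut code; the size is just illustrative):

dd if=/dev/null of=/overlay bs=1024 count=1 seek=$((512*1024))   # sparse 512MiB file in the initramfs
losetup -f --show /overlay                                       # attaches it and prints e.g. /dev/loop2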

Comment 12 Brian Lane 2019-06-07 18:09:30 UTC
I had totally forgotten about the new --squashfs-only option to lorax - https://github.com/weldr/lorax/commit/27e611629f2c654939b0894086512e9d4f366113

It should be possible to use that with new boot cmdline args to experiment with this.
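For example (the repo URL is a placeholder):

lorax -p Fedora -v rawhide -r rawhide -s <repo-url> --squashfs-only ./results/

and then booting the result with the overlayfs variant of dmsquash-live (rd.live.overlay.overlayfs=1 on the kernel cmdline, if I'm reading the dracut module right).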

Comment 13 Brian Lane 2019-07-25 23:42:11 UTC
Here's a PR adding --squashfs-only to livemedia-creator

https://github.com/weldr/lorax/pull/822

A huge round of applause for fgrose for getting the overlayfs support into dracut -- it actually 'Just Works(TM)' with the current kernel cmdline setup.

To get it to install, anaconda's liveinst needs to be changed to look for /run/rootfsbase and loop0; when I add that to /usr/sbin/liveinst it installs just fine.

Comment 14 Brian Lane 2019-08-01 16:04:11 UTC
This is now available in lorax-31.9-1

A related patch for liveinst in anaconda is here - https://github.com/rhinstaller/anaconda/pull/2070

Comment 15 Chris Murphy 2019-08-01 17:13:25 UTC
I don't see a way in livemedia-creator to pass zstd to mksquashfs. The performance profiling indicates the main reason for excessive CPU usage is xz decompression, which is a known behavior of xz and one of the main things zstd solves.

Comment 16 Brian Lane 2019-08-01 17:35:20 UTC
(In reply to Chris Murphy from comment #15)
> I don't see a way in livemedia-creator to pass zstd to mksquashfs. The
> performance profiling indicates the main reason for excessive CPU usage is
> xz decompression, which is a known behavior of xz and one of the main things
> zstd solves.

The way things work right now is that you can pass --compression="zstd" and it should get passed down to mksquashfs as -comp zstd.
I do notice there is no way to pass additional args, just a -comp value. I should probably fix that.
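For example (the other flags here are just a typical no-virt invocation, not a prescription):

livemedia-creator --make-iso --no-virt --ks=fedora-live-workstation.ks --compression=zstd --resultdir=./results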

Comment 17 Chris Murphy 2019-08-01 17:51:39 UTC
Thanks. For testing purposes to unequivocally prove whether this bug is about xz, and if using zstd is a viable fix, I think it's adequate as is.

For actual production use, additional arguments will definitely be needed, because zstd's default level 3 doesn't compare well to xz's default in terms of ratio. For zstd, the highest compression levels need the --ultra flag to be enabled; asking for level 20+ without --ultra means you still only get level 19.

Comment 18 Brian Lane 2019-08-01 21:08:46 UTC
I have a PR that should be able to handle that - https://github.com/weldr/lorax/pull/829

Comment 19 Kamil Páral 2019-08-02 07:27:29 UTC
Should this bug get re-opened until we can verify whether the CPU usage problem is gone (and all relevant places are patched)? So far we've just added an option; that doesn't resolve this by itself.

Comment 20 Chris Murphy 2019-08-02 15:49:17 UTC
@bcl: I now see in `man mksquashfs` only -Xcompression-level for zstd, with levels 1..22, default 15. That's different from the zstd utility's requirements and default, so you can probably safely ignore what I said about --ultra and levels. Looks like all lmc needs to do is enable a way to pass a compression level.
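i.e. something along the lines of:

mksquashfs /path/to/rootfs squashfs.img -comp zstd -Xcompression-level 22

(level 22 here just as an example of the maximum).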

@kparal: It sounds reasonable to keep tracking the problem here, until all components are tested as fixed with the features releng will need; and I'll open an issue with releng to track compose process change and testing.

Comment 21 Chris Murphy 2019-08-02 16:43:14 UTC
In some sense this is notabug because the xz decompression behavior is expected; it's really a distribution UX concern that impacts users and QA's automated tests.

I've opened this issue with releng
https://pagure.io/releng/issue/8581

Comment 22 Ben Cotton 2019-08-13 17:08:59 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 23 Ben Cotton 2019-08-13 18:35:45 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 24 Chris Murphy 2019-08-18 05:54:19 UTC
Update.
1.
Fedora-Workstation-Live-x86_64-Rawhide-20190816.n.0.iso
2.
Extracted the rootfs.img (6981419008 bytes) embedded in the squashfs.img (1927286784 bytes), recompressed it into a new squashfs.img (2138419200 bytes), and made a new ISO with just that replacement squashfs using these commands:

mksquashfs /mnt/0/ squashfs.img -comp zstd 

xorrisofs -o ~/bootzstdtest.iso -isohybrid-mbr /usr/share/syslinux/isohdpfx.bin -b isolinux/isolinux.bin -c isolinux/boot.cat -boot-load-size 4 -boot-info-table -no-emul-boot -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -isohybrid-gpt-basdat -eltorito-alt-boot -e images/macboot.img -no-emul-boot -isohybrid-gpt-hfsplus -R -J -V Fedora-WS-Live-rawh-20190816-n-0 -graft-points isolinux=/mnt/0/isolinux images/pxeboot=/mnt/0/images/pxeboot LiveOS=/mnt/0/LiveOS EFI/BOOT=/mnt/0/EFI/BOOT images/efiboot.img=/mnt/0/images/efiboot.img images/macboot.img=/mnt/0/images/macboot.img

NOTE: mksquashfs defaults to zstd level 15, so there's probably some room to get rid of that ~200MB increase in ISO size; needs testing.

3.
Now when I do an installation, /dev/loop1 uses 10-30% CPU, anywhere from 3 to 8 times lower than with xz. The live environment seems more responsive and the installation goes a bit faster.

4.
perf top -p PID for /dev/loop1 bears this out.
I'm not certain, but it's possible some additional improvement happens when moving to plain squashfs:

Samples: 126K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 6862018274 lost: 0/0 drop: 0/0
Overhead  Shared O  Symbol
  13.74%  [kernel]  [k] ZSTD_decompressSequences
  11.34%  [kernel]  [k] lock_acquire
   8.23%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.35%  [kernel]  [k] lock_release
   6.13%  [kernel]  [k] HUF_decompress4X2_usingDTable_internal.part.0
   5.92%  [kernel]  [k] lock_is_held_type
   5.60%  [kernel]  [k] ktime_get
   3.68%  [kernel]  [k] _raw_spin_unlock_irq
   2.60%  [kernel]  [k] __softirqentry_text_start
   2.17%  [kernel]  [k] memcpy
   1.43%  [kernel]  [k] get_page_from_freelist
   1.25%  [kernel]  [k] debug_lockdep_rcu_enabled.part.0
   1.01%  [kernel]  [k] bad_range
   0.94%  [kernel]  [k] squashfs_copy_data.part.0
   0.91%  [kernel]  [k] __find_get_block
   0.86%  [kernel]  [k] rcu_lockdep_current_cpu_online
   0.85%  [kernel]  [k] rcu_is_watching
   0.80%  [kernel]  [k] xas_descend
   0.78%  [kernel]  [k] __blk_complete_request
   0.71%  [kernel]  [k] kmem_cache_free
   0.51%  [kernel]  [k] pvclock_clocksource_read
   0.50%  [kernel]  [k] mod_delayed_work_on
   0.42%  [kernel]  [k] ___might_sleep
   0.42%  [kernel]  [k] lock_acquired
   0.39%  [kernel]  [k] kmem_cache_alloc
   0.38%  [kernel]  [k] find_get_entry
   0.37%  [kernel]  [k] blk_mq_get_request
   0.36%  [kernel]  [k] debug_lockdep_rcu_enabled
   0.35%  [kernel]  [k] ib_create_ah_from_wc
   0.34%  [kernel]  [k] mem_cgroup_commit_charge
   0.32%  [kernel]  [k] generic_file_read_iter
   0.31%  [kernel]  [k] __wake_up_bit
   0.30%  [kernel]  [k] kvm_clock_get_cycles

Comment 25 Chris Murphy 2019-08-18 17:13:11 UTC
OK, I've created a plain squashfs image using level 22, created an ISO containing it, and applied the changes in anaconda/pull/2070/files. There is no loop1 anymore; it's loop0 that needs to be profiled, and it uses at most 1/3 the CPU of ext4-on-squashfs with zstd. It never goes above 10% CPU. So I think it's fair to say two things were going on: the extra seeks for ext4 may have been perturbing xz decompression by requiring more blocks to be decompressed than necessary, on top of the already known expensive CPU hit of xz/lzma decompression.

A further plus is the installation times are reduced by ~20% (8m vs 10m).

One gotcha is that the image sizes, even with zstd level 22, plain squashfs, and a 256K block size, are ~5% bigger than the present squashfs.img. Some optimization may still be possible, in particular if squashfs can unlock higher compression levels without additional options being needed on the lorax or anaconda side.

Anyway, closing out this bug, the rest of the optimizations and implementation will get covered in

https://pagure.io/releng/issue/8581
https://pagure.io/releng/issue/8646

Comment 26 Kamil Páral 2019-08-19 08:37:29 UTC
Outstanding work, Chris, thanks a lot.

Comment 27 Zbigniew Jędrzejewski-Szmek 2019-08-19 08:43:32 UTC
+1

