Description of problem:
Any Fedora LiveOS media installation (e.g. Fedora-Workstation-Live-x86_64-30-1.2.iso) results in the kernel worker for /dev/loop1 using excessively high CPU (50-100%, variable).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Do an install of LiveOS (or otherwise stress the reads)

Actual results:
/dev/loop1 uses quite a lot of CPU

Expected results:
It shouldn't be a #1 or #2 CPU resource consumer
I can't tell if this is mostly due to xz/lzma decompression, or if it's due to having an ext4 file system nested in a squashfs, both of which are being hammered during random reads.
# watch -n 2 cat /proc/759/stack
Every 2.0s: cat /proc/759/stack localhost: Wed Jun 5 21:03:37 2019
[<0>] squashfs_read_data+0x391/0x6c0 [squashfs]
[<0>] squashfs_cache_get+0x123/0x360 [squashfs]
[<0>] squashfs_readpage_block+0x2b/0x60 [squashfs]
[<0>] squashfs_readpage+0x559/0x8a0 [squashfs]
[<0>] loop_queue_work+0x38b/0xa5f [loop]
# perf record -F 99 -p 759
# perf report
Samples: 1K of event 'cpu-clock:pppH', Event count (approx.): 19939393740
Overhead Command Shared Object Symbol
63.78% loop1 [kernel.kallsyms] [k] lzma_main
11.20% loop1 [kernel.kallsyms] [k] dict_repeat
9.93% loop1 [kernel.kallsyms] [k] lzma_len
3.39% loop1 [kernel.kallsyms] [k] crc32_le
2.68% loop1 [kernel.kallsyms] [k] bcj_apply
0.96% loop1 [kernel.kallsyms] [k] squashfs_copy_data.part.0
0.91% loop1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.91% loop1 [kernel.kallsyms] [k] memcpy
0.46% loop1 [kernel.kallsyms] [k] __softirqentry_text_start
0.30% loop1 [kernel.kallsyms] [k] xz_dec_lzma2_run
0.25% loop1 [kernel.kallsyms] [k] __add_to_page_cache_locked
0.25% loop1 [kernel.kallsyms] [k] get_page_from_freelist
0.25% loop1 [kernel.kallsyms] [k] xz_dec_bcj_run
0.20% loop1 [kernel.kallsyms] [k] __blk_complete_request
0.20% loop1 [kernel.kallsyms] [k] __find_get_block
0.20% loop1 [kernel.kallsyms] [k] finish_task_switch
0.20% loop1 [kernel.kallsyms] [k] loop_queue_work
0.15% loop1 [kernel.kallsyms] [k] generic_file_read_iter
0.15% loop1 [kernel.kallsyms] [k] xas_load
0.10% loop1 [kernel.kallsyms] [k] __lru_cache_add
0.10% loop1 [kernel.kallsyms] [k] bio_init
0.10% loop1 [kernel.kallsyms] [k] blk_add_timer
0.10% loop1 [kernel.kallsyms] [k] blk_mq_get_request
0.10% loop1 [kernel.kallsyms] [k] ktime_get
0.10% loop1 [kernel.kallsyms] [k] mark_page_accessed
0.10% loop1 [kernel.kallsyms] [k] memcpy_to_page
0.10% loop1 [kernel.kallsyms] [k] percpu_counter_add_batch
0.10% loop1 [kernel.kallsyms] [k] pvclock_clocksource_read
0.10% loop1 [kernel.kallsyms] [k] rb_insert_color
0.10% loop1 [kernel.kallsyms] [k] squashfs_xz_uncompress
0.10% loop1 [kernel.kallsyms] [k] xz_dec_run
0.05% loop1 [kernel.kallsyms] [k] I_BDEV
0.05% loop1 [kernel.kallsyms] [k] __alloc_pages_nodemask
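For anyone reproducing the profiling above, the loop worker's PID has to be found first; here's a small sketch (assumptions: procps `ps` is available and the kernel thread is named "loop1" as in this report). The perf invocation itself is left commented out since sampling needs root and a live workload:

```shell
# Find the kernel worker thread backing /dev/loop1 by its comm name.
pid=$(ps -eo pid,comm | awk '$2 == "loop1" {print $1}' | head -n1)
msg="loop1 worker pid: ${pid:-not found}"
echo "$msg"
# Once found, sample it (as root) the same way as above:
# perf record -F 99 -p "$pid" -- sleep 30 && perf report
```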
Possibly the easier test is to use zstd instead of xz, as a mksquashfs option. But also I wonder about the point of nesting an ext4 file system in the squashfs image. I've looked at a handful of other distros using squashfs for their LiveOS images, and files are directly on squashfs, not within a nested 2nd file system image. And they don't seem to have the high CPU problem I'm reporting here. But I also don't know if they're using xz.
I think the ext4 system was used because the install from a live image is copied to the system and hence needs to be a file system. It's been a while, so it's possible I'm misremembering things.
Yes, my plan was to add a zstd option once support was available in squashfs-tools.
My memory is that we need to pick defaults for the compression-specific options for squashfs-tools, and recommendations there would be nice to hear. Mostly we want to save space more than compression time.
It would probably be possible to have multiple options for the same compression method, but with different parameters. I think for this application we probably don't need to do that.
LiveOS installations copy files over using rsync. A long time ago, if the user kept the default root as ext4 setting, the ext4 image was block copied over to the target LV, and then resized to fit. I'd guess every time there's an ext4 metadata request, it must go through squashfs and lzma.
task createLiveMedia (f31, Fedora-Workstation-Live-Rawhide-20190603.n.0, fedora-live-workstation.ks, x86_64)
That suggests mksquashfs used 48 processors, and took 2 minutes.
For zstd, the --ultra option needs to be passed to enable compression levels 20 and higher; but memory use goes up at these levels, both when compressing and decompressing, so we'd have to see what the memory requirement ends up being. Also -T0, to detect the physical CPUs and thread accordingly. And possibly the --long option.
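A roundtrip sketch of the flags discussed above (assumes the zstd CLI is installed, skipped gracefully if not; the /tmp paths are arbitrary). --ultra unlocks levels 20-22, -T0 threads across all detected cores, and --long widens the match window at extra memory cost:

```shell
# Skip cleanly if zstd isn't available in this environment.
command -v zstd >/dev/null 2>&1 || { echo "roundtrip ok (zstd missing, skipped)"; exit 0; }
# Make a small compressible sample.
yes "some compressible sample text" | head -n 200 > /tmp/zsample
# Compress with the flags under discussion, then decompress and verify.
zstd --ultra -22 -T0 --long -q -f /tmp/zsample -o /tmp/zsample.zst
zstd -d -q -f /tmp/zsample.zst -o /tmp/zsample.out
cmp -s /tmp/zsample /tmp/zsample.out
echo "roundtrip ok"
```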
Does mksquashfs pass individual files for compression, or a stream? If it's individual files, then the dictionary option might apply. The question is how to leverage it, and whether squashfs has a reserved (uncompressed) area for the dictionary, since it will be needed for decompression. A dictionary is only beneficial with small files. If mksquashfs creates a data stream of files to pass to zstd, then a dictionary will have no benefit.
Well if you are compressing an ext4 image it wouldn't be passing individual files.
If you are compressing a directory tree, I'm not sure. I thought it shared stuff. It will use multiple processes to do compression. It detects how many are available, though I think you can set a max. Some of these might depend on the compression method. The xz compressor allows filters that can help with compression, but that isn't turned on for livecd creator. I think there are ways to keep some meta info uncompressed, but I don't know what the trade offs look like if you do that.
If rsync is always used now, then it would make sense to get rid of the ext4 layer. That could be done independently of adding zstd support. I remember there being an option to avoid compression, and that will probably need to keep a file system image in place of squashfs.
Another thing that might have affected the ext4-on-squashfs decision is that back when that was done, squashfs handled fewer special cases. Device nodes, for sure, couldn't be properly created in squashfs. I think extended attribute support got added at some point.
If we were to try to get rid of the intermediate ext4 file system, doing so should probably be controlled by an option initially, so that people can try it out without much chance of breaking things. It may not be obvious at first if some little-used functionality is broken.
I did my tweaks to parts of livecd-creator. I never looked at lorax. I needed help with what I did, as I am not very good with Python.
(In reply to Chris Murphy from comment #0)
> Actual results:
> /dev/loop1 uses quite a lot of CPU
> Expected results:
> It shouldn't be a #1 or #2 CPU resource consumer
That expectation is unlikely to be met by any tight compression scheme that uses arithmetic coding as its major algorithm.
The LZMA* range decoding that xz performs takes roughly the same time and memory as the corresponding range encoding stage. This is a property of arithmetic (range) coding: the position-dependent matrix of historical probabilities used to predict the next symbol [high-likelihood prediction is what enables compact encoding] must be identical each time a choice is made, whether encoding or decoding, so the decoder must reconstruct the same probability model the encoder used. Overall, xz compression still takes longer than xz decompression, because compression also requires match finding [deflation] and filtering, not just encoding.
For Fedora Live media, what is encoded is an ext4 filesystem with a blocksize of 4KiB. ext4 has high sequentiality (for any file stored in ext4 the number of block extents tends to be not much more than the minimum), so byte intervals seen by the encoder tend to correspond to concatenations of long pieces of whole files. squashfs supposedly caches decoded chunks (expanded chunk size is at least 128KiB [?]), so misses in (or mismanagement of) the squashfs cache might be responsible for some repeat decoding when ext4 "randomly" accesses its metadata.
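The decode-cost difference between codecs can be eyeballed with a rough CLI timing sketch (illustrative and machine-dependent; assumes the xz and zstd CLIs are installed and is skipped gracefully if not; paths under /tmp are arbitrary):

```shell
# Skip cleanly if either tool is missing.
command -v xz >/dev/null 2>&1 && command -v zstd >/dev/null 2>&1 \
    || { echo "roundtrips ok (tools missing, skipped)"; exit 0; }
# Build a 4 MiB payload with some internal redundancy.
head -c 1048576 /dev/urandom > /tmp/chunk
cat /tmp/chunk /tmp/chunk /tmp/chunk /tmp/chunk > /tmp/payload
# Compress once with each codec at its defaults.
xz -k -f -q /tmp/payload                   # writes /tmp/payload.xz
zstd -q -f /tmp/payload -o /tmp/payload.zst
# Time only the decompression passes.
t0=$(date +%s%N); xz -dc /tmp/payload.xz > /tmp/payload.xz.out; t1=$(date +%s%N)
t2=$(date +%s%N); zstd -dcf /tmp/payload.zst > /tmp/payload.zst.out; t3=$(date +%s%N)
echo "xz decode: $(( (t1 - t0) / 1000000 )) ms, zstd decode: $(( (t3 - t2) / 1000000 )) ms"
cmp -s /tmp/payload /tmp/payload.xz.out
cmp -s /tmp/payload /tmp/payload.zst.out
echo "roundtrips ok"
```

No timing numbers are asserted, only that both roundtrips reproduce the original payload.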
The question about nesting ext4 inside the squashfs has come up before, and IIRC one of the original reasons for it was the overlay filesystem that we set up at boot time. It's probably time to re-evaluate how we do this and run some experiments. The Anaconda team should also be involved, since this will also affect how boot.iso is created. There may also be some concerns about backwards compatibility with tools mounting the images, so changes should be well documented to make it easier for people to transition to a new format.
What's actually being compressed is a simple directory tree with 1 file :) /LiveOS/rootfs.img which is the ext4 image.
The code that does this in lorax is https://github.com/weldr/lorax/blob/master/src/pylorax/treebuilder.py#L216 here.
A device mapper based overlay depends on a rw snapshot (live-rw) of the read-only rootfs.img (live-base), so this method requires an fs with random rw, which squashfs doesn't support. Directly using squashfs could be done with overlayfs, and the rw layer could maybe be tmpfs or zram. I think dracut has support for this already and some distros use it.
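The layering described above could look something like this define-only sketch (not executed here: it needs root and a real squashfs behind /dev/loop0, and all mount points are assumptions, not what dracut actually uses): plain squashfs as the read-only lower layer, tmpfs as the writable upper layer.

```shell
# Define-only illustration of overlayfs over plain squashfs.
setup_overlayfs_root() {
    mkdir -p /run/rootfsbase /run/overlayfs
    # Read-only lower layer: the plain squashfs.
    mount -t squashfs -o ro /dev/loop0 /run/rootfsbase
    # Writable upper layer backed by tmpfs (could also be zram).
    mount -t tmpfs tmpfs /run/overlayfs
    mkdir -p /run/overlayfs/upper /run/overlayfs/work /sysroot
    mount -t overlay LiveOS_rootfs \
        -o lowerdir=/run/rootfsbase,upperdir=/run/overlayfs/upper,workdir=/run/overlayfs/work \
        /sysroot
}
echo "overlayfs sketch defined"
```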
Another idea is make the rootfs.img Btrfs and not use squashfs, gist is it's overall simpler and faster for all of our use cases, but in my limited test results in a ~2% larger image. Details were discussed in this thread; there are a couple of downstreams interested in taking advantage of this.
In both Fedora 30 and Rawhide there is a curious artifact I can't figure out:
[root@localhost-live ~]# losetup
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 0 1 /LiveOS/rootfs.img 0 512
/dev/loop2 0 0 0 0 /overlay (deleted) 0 512
/dev/loop0 0 0 0 1 /run/initramfs/live/LiveOS/squashfs.img 0 512
Even with systemd debug, rd debug, udev debug, I can't figure out what loop2 is or was, what created it or what deleted it. But then I'm also running into bug 1715699, causing the loss of ~15-20s of early boot logging, again both on Fedora 30 (as shipped) and Rawhide live media.
Ah, right, overlayfs may be a good solution, looks like current dmsquash-live has support for it - https://github.com/dracutdevs/dracut/blob/master/modules.d/90dmsquash-live/dmsquash-live-root.sh
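For reference, the dmsquash-live overlayfs path is selected at boot via the kernel cmdline; a minimal sketch (option names as used by the dracut module linked above, and the volume label is a placeholder -- verify against your dracut version):

```
rd.live.image root=live:CDLABEL=Fedora-WS-Live rd.live.overlay.overlayfs=1
```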
There's zero chance that btrfs will be used as a solution for this so let's explore using overlayfs and a plain squashfs of the filesystem.
(In reply to Chris Murphy from comment #9)
> In both Fedora 30 and Rawhide there is a curious artifact I can't figure out:
> [root@localhost-live ~]# losetup
> NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
> /dev/loop1 0 0 0 1 /LiveOS/rootfs.img 0 512
> /dev/loop2 0 0 0 0 /overlay (deleted) 0 512
> /dev/loop0 0 0 0 1 /run/initramfs/live/LiveOS/squashfs.img 0 512
> [root@localhost-live ~]#
> Even with systemd debug, rd debug, udev debug, I can't figure out what loop2
> is or was, what created it or what deleted it.
This comes from the temporary overlay file created in the initramfs for the live-rw snapshot.
Once the switch_root occurs, this file path disappears and is reported as (deleted) by losetup.
I had totally forgotten about the new --squashfs-only option to lorax - https://github.com/weldr/lorax/commit/27e611629f2c654939b0894086512e9d4f366113
It should be possible to use that with new boot cmdline args to experiment with this.
Here's a PR adding --squashfs-only to livemedia-creator
A huge round of applause for fgrose for getting the overlayfs support into dracut -- it actually 'Just Works(TM)' with the current kernel cmdline setup.
To get it to install, anaconda's liveinst needs to be changed to look for /run/rootfsbase and loop0; when I add that to /usr/sbin/liveinst it installs just fine.
This is now available in lorax-31.9-1
A related patch for liveinst in anaconda is here - https://github.com/rhinstaller/anaconda/pull/2070
I don't see a way in livemedia-creator to pass zstd to mksquashfs. The performance profiling indicates the main reason for excessive CPU usage is xz decompression, which is a known behavior of xz and one of the main things zstd solves.
(In reply to Chris Murphy from comment #15)
> I don't see a way in livemedia-creator to pass zstd to mksquashfs. The
> performance profiling indicates the main reason for excessive CPU usage is
> xz decompression, which is a known behavior of xz and one of the main things
> zstd solves.
The way things work right now is you can pass --compression="zstd" and it should get passed down to mksquashfs as -comp zstd
I do notice there is no way to pass additional args, just a -comp value. I should probably fix that.
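Spelled out, the invocation described above might look like this dry-run sketch (the command is only echoed, since a real run needs a full build environment; the kickstart name is a placeholder and --compression is what gets forwarded to mksquashfs -comp):

```shell
# Dry-run sketch of passing zstd through livemedia-creator.
cmd='livemedia-creator --make-iso --ks fedora-live-workstation.ks --compression zstd'
echo "$cmd"
```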
Thanks. For testing purposes to unequivocally prove whether this bug is about xz, and if using zstd is a viable fix, I think it's adequate as is.
For actual production use, additional arguments are definitely needed, because zstd's default level 3 doesn't compare well to xz's default in terms of ratio. For zstd, the highest compression levels need the --ultra flag to be enabled; asking for levels 20+ with --ultra omitted means you still only get level 19.
I have a PR that should be able to handle that - https://github.com/weldr/lorax/pull/829
Should this bug get re-opened until we can verify whether the CPU usage problem is gone (and all relevant places are patched)? So far we've just added an option; that doesn't resolve this.
@bcl: I now see in `man mksquashfs` only -Xcompression-level for zstd, with levels 1..22, default 15. That's different from the zstd utility's requirements and default, so you can probably safely ignore what I said about --ultra and levels. It looks like all lmc needs to do is enable a way to pass a compression level.
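That boils down to something like this dry-run sketch (echoed only; raising the level above the mksquashfs zstd default of 15, within the 1..22 range the man page documents, and /mnt/0/ is the source tree from the earlier comment):

```shell
# Dry-run sketch of a higher zstd level via mksquashfs.
cmd='mksquashfs /mnt/0/ squashfs.img -comp zstd -Xcompression-level 22'
echo "$cmd"
```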
@kparal: It sounds reasonable to keep tracking the problem here until all components are tested as fixed, with the features releng will need; and I'll open an issue with releng to track the compose process change and testing.
In some sense this is notabug because the xz decompression behavior is expected; it's really a distribution UX concern that impacts users and QA's automated tests.
I've opened this issue with releng
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.
I extracted the rootfs.img (6981419008 bytes) embedded in the squashfs.img (1927286784 bytes), recompressed it into a new squashfs.img (2138419200 bytes), and made a new ISO with just that replacement squashfs, using these commands:
mksquashfs /mnt/0/ squashfs.img -comp zstd
xorrisofs -o ~/bootzstdtest.iso -isohybrid-mbr /usr/share/syslinux/isohdpfx.bin -b isolinux/isolinux.bin -c isolinux/boot.cat -boot-load-size 4 -boot-info-table -no-emul-boot -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -isohybrid-gpt-basdat -eltorito-alt-boot -e images/macboot.img -no-emul-boot -isohybrid-gpt-hfsplus -R -J -V Fedora-WS-Live-rawh-20190816-n-0 -graft-points isolinux=/mnt/0/isolinux images/pxeboot=/mnt/0/images/pxeboot LiveOS=/mnt/0/LiveOS EFI/BOOT=/mnt/0/EFI/BOOT images/efiboot.img=/mnt/0/images/efiboot.img images/macboot.img=/mnt/0/images/macboot.img
NOTE: by default mksquashfs uses zstd level 15, so there's probably some room to claw back that ~200MB increase in ISO size; needs testing.
Now when I do an installation, /dev/loop1 uses 10-30% CPU, anywhere from 3 to 8 times lower than with xz. The live environment seems more responsive and the installation goes a bit faster.
perf top -p PID for /dev/loop1 bears this out -
I'm not certain, but it's possible some additional improvement will come from moving to plain squashfs.
Samples: 126K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 6862018274 lost: 0/0 drop: 0/0
Overhead Shared O Symbol
13.74% [kernel] [k] ZSTD_decompressSequences
11.34% [kernel] [k] lock_acquire
8.23% [kernel] [k] _raw_spin_unlock_irqrestore
7.35% [kernel] [k] lock_release
6.13% [kernel] [k] HUF_decompress4X2_usingDTable_internal.part.0
5.92% [kernel] [k] lock_is_held_type
5.60% [kernel] [k] ktime_get
3.68% [kernel] [k] _raw_spin_unlock_irq
2.60% [kernel] [k] __softirqentry_text_start
2.17% [kernel] [k] memcpy
1.43% [kernel] [k] get_page_from_freelist
1.25% [kernel] [k] debug_lockdep_rcu_enabled.part.0
1.01% [kernel] [k] bad_range
0.94% [kernel] [k] squashfs_copy_data.part.0
0.91% [kernel] [k] __find_get_block
0.86% [kernel] [k] rcu_lockdep_current_cpu_online
0.85% [kernel] [k] rcu_is_watching
0.80% [kernel] [k] xas_descend
0.78% [kernel] [k] __blk_complete_request
0.71% [kernel] [k] kmem_cache_free
0.51% [kernel] [k] pvclock_clocksource_read
0.50% [kernel] [k] mod_delayed_work_on
0.42% [kernel] [k] ___might_sleep
0.42% [kernel] [k] lock_acquired
0.39% [kernel] [k] kmem_cache_alloc
0.38% [kernel] [k] find_get_entry
0.37% [kernel] [k] blk_mq_get_request
0.36% [kernel] [k] debug_lockdep_rcu_enabled
0.35% [kernel] [k] ib_create_ah_from_wc
0.34% [kernel] [k] mem_cgroup_commit_charge
0.32% [kernel] [k] generic_file_read_iter
0.31% [kernel] [k] __wake_up_bit
0.30% [kernel] [k] kvm_clock_get_cycles
OK, I've created a plain squashfs image using level 22, created an ISO containing it, and applied the changes in anaconda/pull/2070/files. There is no loop1 anymore; it's loop0 that needs to be profiled, and it uses at most 1/3 the CPU of ext4-on-squashfs with zstd, never going above 10%. So I think it's fair to say two things were going on: the extra seeks for ext4 may have been perturbing xz decompression by requiring more blocks to be decompressed than necessary, on top of the already known expensive xz/lzma decompression hit to CPU use.
A further plus is the installation times are reduced by ~20% (8m vs 10m).
One gotcha: the image sizes, even with zstd level 22, plain squashfs, and a 256K block size, are ~5% bigger than the present squashfs.img. Some optimization may still be possible, in particular if squashfs can unlock higher compression levels without additional options being needed on the lorax or anaconda side.
Anyway, closing out this bug, the rest of the optimizations and implementation will get covered in
Outstanding work, Chris, thanks a lot.