openQA testing of the new squashfs-tools build in Rawhide - squashfs-tools-4.6-0.6.20230314git36abab0.fc39 - shows that mksquashfs crashes on a signal 11 during installer image build: https://openqa.fedoraproject.org/tests/1824107#

program.log shows:

2023-03-14 20:39:04,729 INFO: Running... mksquashfs /var/tmp/lorax/lorax.3dei0r_i/installtree /var/tmp/lorax/lorax.3dei0r_i/installroot/images/install.img -comp xz -Xbcj x86
2023-03-14 20:39:17,051 INFO: Parallel mksquashfs: Using 2 processors
2023-03-14 20:39:17,053 INFO: Creating 4.0 filesystem on /var/tmp/lorax/lorax.3dei0r_i/installroot/images/install.img, block size 131072.
2023-03-14 20:39:17,053 INFO:
2023-03-14 20:39:17,053 INFO: [/ ] 0/25625 0%
2023-03-14 20:39:17,053 INFO: [| ] 100/33327 0%
2023-03-14 20:39:17,053 INFO: [=/ ] 1100/33327 3%
2023-03-14 20:39:17,053 DEBUG: Return code: -11

and the journal shows:

Mar 14 17:39:16 fedora audit[18218]: ANOM_ABEND auid=0 uid=0 gid=0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=18218 comm="mksquashfs" exe="/usr/sbin/mksquashfs" sig=11 res=1
Mar 14 17:39:16 fedora kernel: mksquashfs[18218]: segfault at 3900000039 ip 000055cf9b0d91d0 sp 00007fff1ff78250 error 4 in mksquashfs[55cf9b0d1000+2e000] likely on CPU 1 (core 1, socket 0)
Mar 14 17:39:16 fedora kernel: Code: b9 01 00 00 00 0f 1f 44 00 00 48 89 d3 45 31 e4 49 89 c6 45 31 c0 49 89 dd 85 c9 0f 8e 8c 01 00 00 66 0f 1f 84 00 00 00 00 00 <4d> 8b 6d 38 41 83 c0 01 4d 85 ed 0f 95 c0 41 39 c8 0f 9c c2 20 d0
Mar 14 17:39:16 fedora systemd[1]: Created slice system-systemd\x2dcoredump.slice - Slice /system/systemd-coredump.
Mar 14 17:39:16 fedora audit: BPF prog-id=66 op=LOAD
Mar 14 17:39:16 fedora audit: BPF prog-id=67 op=LOAD
Mar 14 17:39:16 fedora audit: BPF prog-id=68 op=LOAD
Mar 14 17:39:16 fedora systemd[1]: Started systemd-coredump - Process Core Dump (PID 18230/UID 0).
Mar 14 17:39:16 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@0-18230-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Mar 14 17:39:17 fedora systemd-coredump[18231]: Resource limits disable core dumping for process 18218 (mksquashfs).
Mar 14 17:39:17 fedora systemd-coredump[18231]: Process 18218 (mksquashfs) of user 0 dumped core.

As you can see, we didn't get a coredump because of resource limits. I'll try and work around that and get one in the morning if nobody beats me to it (it's very late at night here now). Filing this bug so others can get started on investigation, and we can untag squashfs-tools to get a compose done if necessary.

The test isn't that complicated, it's basically just installing lorax, doing `setenforce Permissive`, then running:

lorax -p Fedora -v 39 -r 39 --repo=/etc/yum.repos.d/fedora-rawhide.repo --rootfs-size 3 --squashfs-only ./results

(actually it adds a few more repos, but the above should work on Rawhide and will probably be enough to reproduce the issue).
I have been running tests of the updated packages locally, and so far they have been clean, so it isn't something that happens every time. squashfs-tools-4.6-0.6.20230314git36abab0.fc39 has the memory leak fixes, so if it first broke last night, it might be related to that. As Phillip noted on GitHub, most of the other recent changes have been related to documentation. I wasn't planning to build this for f38 until maybe after the f38 release, so that it would get more testing first. I don't think there were any security fixes between 4.5.1 and 4.6. The reason I put it in f39 was to help Phillip get the code tested more before he actually publishes a release. I figured Fedora owes him some help for the work he does on squashfs-tools on his own time.
One other change I need to check is one that allows the build to be affected by environment variables when running make. I didn't look at it in depth; on the surface it looked like what we were already doing in the spec file for CC, but it might be that something started getting set that was incompatible with libraries or something like that. I'm not sure if that commit got included in 4.6-0.5.20230312gitaaf011a.fc39 or if it didn't get in until the broken update. I'll try to figure this stuff out today.
In my comment about plans for the update I realized "release" was ambiguous. What I meant was that I wasn't going to do an f38 update until after f38 final had been out for a while. Beta has already passed, and screwing up the final release ISOs would cause a lot of pain. Squashfs-tools will also need an upstream release for me to do that, but that is very likely to happen before f38 final.
I checked, and aaf011a868c786b06e74cbdaf860d45793939f35 was the commit that added the "customize the makefile on the command line" feature, and that was in 4.6-0.5.20230312gitaaf011a.fc39, which I think was used to successfully build ISOs. So that commit probably isn't the problem.
Yes, the cause should be one of the six commits since then. The test did fail across four executions in openQA (we rerun failed tests once automatically, and we have both prod and staging instances; both tests on both instances failed the same way). It did also break last night's nightly compose, see e.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=98710743 - you can see that ended in "2023-03-15 05:48:50,608: rootfs.img creation failed. See program.log", indicating the same problem. I'll see if I can get a coredump one way or another (either by reproducing it manually or by tweaking the openQA test to disable the coredump limit).
Created attachment 1951092 [details]
backtrace of the crash

Here's a backtrace I was able to get by reproducing this locally.
I think I've got this figured out; I'm doing a scratch build to confirm.
Thanks for fixing this. I'm glad this was caught before the 4.6 squashfs-tools release. I'll do another update for f39 after Phillip merges or incorporates your fix.
https://koji.fedoraproject.org/koji/taskinfo?taskID=98730435 has my fix for this backported.
While the fix is correct, I'm not sure the stated cause was. Since j was only initialized, not declared, in the for statement, I think it is still in scope after the loop exits. However, it would be equal to i at that point, and the desired value was i-1, so the subscript after the loop exit should have been j-1, not j. I'm not 100% sure of this, and using i-1 seems safer than using j-1, so I'm not advocating changing anything.
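To illustrate what I mean, here's a made-up sketch of the pattern (hypothetical names, not the actual squashfs-tools code):

#include <stdio.h>

int main(void)
{
	int entries[] = { 10, 20, 30, 40 };
	int i = 4;	/* number of valid entries */
	int j;		/* declared before the loop, so it survives it */

	/* j is only assigned in the for statement, not declared there */
	for (j = 0; j < i; j++)
		printf("entry %d = %d\n", j, entries[j]);

	/* at this point j == i (4), so entries[j] would read one past the
	   last valid element; the last element actually processed is
	   entries[j - 1], which is the same element as entries[i - 1] */
	printf("last entry processed = %d\n", entries[i - 1]);
	return 0;
}

After the loop, j - 1 and i - 1 name the same element; i - 1 just doesn't depend on j still holding its loop-exit value, which is why it seems like the safer spelling.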
oh yeah, you might be right there. It probably doesn't matter a lot, though. :D