openQA testing of the new squashfs-tools build in Rawhide - squashfs-tools-4.6-0.6.20230314git36abab0.fc39 - shows that mksquashfs crashes on a signal 11 during installer image build: https://openqa.fedoraproject.org/tests/1824107#

program.log shows:

2023-03-14 20:39:04,729 INFO: Running... mksquashfs /var/tmp/lorax/lorax.3dei0r_i/installtree /var/tmp/lorax/lorax.3dei0r_i/installroot/images/install.img -comp xz -Xbcj x86
2023-03-14 20:39:17,051 INFO: Parallel mksquashfs: Using 2 processors
2023-03-14 20:39:17,053 INFO: Creating 4.0 filesystem on /var/tmp/lorax/lorax.3dei0r_i/installroot/images/install.img, block size 131072.
2023-03-14 20:39:17,053 INFO:
2023-03-14 20:39:17,053 INFO: [/ ] 0/25625 0%
2023-03-14 20:39:17,053 INFO: [| ] 100/33327 0%
2023-03-14 20:39:17,053 INFO: [=/ ] 1100/33327 3%
2023-03-14 20:39:17,053 DEBUG: Return code: -11

and the journal shows:

Mar 14 17:39:16 fedora audit[18218]: ANOM_ABEND auid=0 uid=0 gid=0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=18218 comm="mksquashfs" exe="/usr/sbin/mksquashfs" sig=11 res=1
Mar 14 17:39:16 fedora kernel: mksquashfs[18218]: segfault at 3900000039 ip 000055cf9b0d91d0 sp 00007fff1ff78250 error 4 in mksquashfs[55cf9b0d1000+2e000] likely on CPU 1 (core 1, socket 0)
Mar 14 17:39:16 fedora kernel: Code: b9 01 00 00 00 0f 1f 44 00 00 48 89 d3 45 31 e4 49 89 c6 45 31 c0 49 89 dd 85 c9 0f 8e 8c 01 00 00 66 0f 1f 84 00 00 00 00 00 <4d> 8b 6d 38 41 83 c0 01 4d 85 ed 0f 95 c0 41 39 c8 0f 9c c2 20 d0
Mar 14 17:39:16 fedora systemd[1]: Created slice system-systemd\x2dcoredump.slice - Slice /system/systemd-coredump.
Mar 14 17:39:16 fedora audit: BPF prog-id=66 op=LOAD
Mar 14 17:39:16 fedora audit: BPF prog-id=67 op=LOAD
Mar 14 17:39:16 fedora audit: BPF prog-id=68 op=LOAD
Mar 14 17:39:16 fedora systemd[1]: Started systemd-coredump - Process Core Dump (PID 18230/UID 0).
Mar 14 17:39:16 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@0-18230-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Mar 14 17:39:17 fedora systemd-coredump[18231]: Resource limits disable core dumping for process 18218 (mksquashfs).
Mar 14 17:39:17 fedora systemd-coredump[18231]: Process 18218 (mksquashfs) of user 0 dumped core.

As you can see, we didn't get a coredump because of resource limits. I'll try and work around that and get one in the morning if nobody beats me to it (it's very late at night here now). Filing this bug so others can get started on investigation, and we can untag squashfs-tools to get a compose done if necessary.

The test isn't that complicated, it's basically just installing lorax, doing `setenforce Permissive`, then running:

lorax -p Fedora -v 39 -r 39 --repo=/etc/yum.repos.d/fedora-rawhide.repo --rootfs-size 3 --squashfs-only ./results

(actually it adds a few more repos, but the above should work on Rawhide and will probably be enough to reproduce the issue).
I have been running tests of the updated packages locally, and so far they have been clean, so it isn't something that happens every time. squashfs-tools-4.6-0.6.20230314git36abab0.fc39 has the memory leak fixes, so if it first broke last night, it might be related to that. As Phillip noted on GitHub, most of the other recent changes have been related to documentation. I wasn't planning to build this for f38 until maybe after the f38 release, so that it would get more testing first. I don't think there were any security fixes between 4.5.1 and 4.6. The reason I put it in f39 was to help Phillip get the code tested more before he actually publishes a release. I figured Fedora owes him some help for the work he does on squashfs-tools on his own time.
One other change I need to check is one that allows the build to be affected by environment variables when running make. I didn't look at it in depth; on the surface it looked like what we were already doing in the spec file for CC, but it might be that something started getting set that was incompatible with libraries or something like that. I'm not sure if that commit got included in 4.6-0.5.20230312gitaaf011a.fc39 or if it didn't get in until the broken update. I'll try to figure this stuff out today.
In my comment about plans for the update I realized "release" was ambiguous. What I meant was that I wasn't going to do an f38 update until after f38 final had been out for a while. Beta has already passed, and screwing up the final release ISOs would cause a lot of pain. Squashfs-tools will also need an upstream release for me to do that, but that is very likely to happen before f38 final.
I checked, and aaf011a868c786b06e74cbdaf860d45793939f35 was the commit that added the "customize the makefile on the command line" feature, and that was in 4.6-0.5.20230312gitaaf011a.fc39, which I think was used to successfully build ISOs. So that commit probably isn't the problem.
Yes, the cause should be one of the six commits since then. The test did fail across four executions in openQA (we rerun failed tests once automatically, and we have both prod and staging instances; both tests on both instances failed the same way). It did also break last night's nightly compose, see e.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=98710743 - you can see that ended in "2023-03-15 05:48:50,608: rootfs.img creation failed. See program.log", indicating the same problem. I'll see if I can get a coredump one way or another (either by reproducing it manually or by tweaking the openQA test to disable the coredump limit).
Created attachment 1951092 [details]
backtrace of the crash

Here's a backtrace I was able to get by reproducing this locally.
I think I've got this figured out; I'm doing a scratch build to confirm.
Thanks for fixing this. I'm glad this was caught before the 4.6 squashfs-tools release. I'll do another update for f39 after Phillip merges or incorporates your fix.
https://koji.fedoraproject.org/koji/taskinfo?taskID=98730435 has my fix for this backported.
While the fix is correct, I'm not sure the stated cause was. Since j was only initialized, not declared, in the for statement, I think it is still in scope after the loop exits. However, it would be equal to i at that point, and the desired value was i-1, so the subscript after the loop exit should have been j-1, not j. I'm not 100% sure of this, and using i-1 seems safer than using j-1, so I'm not advocating changing anything.
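To illustrate what I mean, here's a made-up sketch of the pattern (hypothetical names, not the actual squashfs-tools code):

#include <stdio.h>

int main(void)
{
	int entries[] = { 10, 20, 30, 40 };
	int i = 4;	/* number of valid entries */
	int j;		/* declared before the loop, so it survives it */

	/* j is only assigned in the for statement, not declared there */
	for (j = 0; j < i; j++)
		printf("entry %d = %d\n", j, entries[j]);

	/* at this point j == i (4), so entries[j] would read one past the
	   last valid element; the last element actually processed is
	   entries[j - 1], which is the same element as entries[i - 1] */
	printf("last entry processed = %d\n", entries[i - 1]);
	return 0;
}

After the loop, j - 1 and i - 1 name the same element; i - 1 just doesn't depend on j still holding its loop-exit value, which is why it seems like the safer spelling.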
oh yeah, you might be right there. It probably doesn't matter a lot, though. :D