Description of problem: It seems like since introduction of systemd-245, booting Fedora Installation media fails with timing out on mounting root system. Eg. /dev/disk/by-label/... does not exist /dev/root does not exist Compose without the issue (different problem, but got well beyond root mount): https://openqa.fedoraproject.org/tests/516585#step/_boot_to_anaconda/7 Compose with the issue: https://openqa.fedoraproject.org/tests/518614#step/_boot_to_anaconda/21 Version-Release number of selected component (if applicable): systemd-245 I am not 100% sure it's systemd failure, so please reassign if you think it's bug somewhere else. However, the issue didn't happen before systemd-245 landing in Rawhide as seen in OpenQA dates.
Proposed as a Blocker for 32-beta by Fedora user frantisekz using the blocker tracking app because: System must boot.
Discussed during the 2020-02-10 blocker review meeting: [1] The decision to classify this bug as an AcceptedBlocker was made: "All release-blocking images must boot in their supported configurations" [1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2020-02-10/f32-blocker-review.2020-02-10-17.01.log.txt
VM with Rawhide, initramfs has systemd-244, but systemd-245 is installed. This boots OK. Rebuild initramfs so that systemd-245 is in there, and it still boots.
So I think this *may* be related to the problem we had a few months back with util-linux, this one: https://bugzilla.redhat.com/show_bug.cgi?id=1783066 as Stuff Has Happened upstream since then. Specifically, this hit systemd: https://github.com/systemd/systemd/pull/14485 and then it seems some corresponding change was made in util-linux (reverting stuff): https://github.com/karelzak/util-linux/pull/913#issuecomment-572581058 So, I'm wondering if somewhere in there we managed to break booting hybrid ISOs again, or if Rawhide got one side of the changes but not the other side yet, or if we now need changes to dracut's handling of our hybrid ISOs, or something like that. So far I haven't got into the detailed dracut logs, but I did at least check that /dev/mapper has only control, and /dev/disk/by-id and /dev/disk/by-uuid are very empty, with only one entry in each. CCing Karel, Zbigniew and Harald for now.
OK, so in the end the problem here seems to be more or less what it was there: we get no ID_FS_* properties for /dev/sr0, so we get no /dev/disk/by-label entry (or by-id or by-uuid etc. entries) for it, and so we can't mount it. But I'm not sure yet exactly where the problem is that *causes* us to get no ID_FS_* properties in the first place. It seems that in current Rawhide we have systemd 245 (with the #14485 PR) and util-linux 2.35.1 (which has the two reversions discussed in #913), so those seem to be 'in sync'.
Created attachment 1662251 [details] rdsosreport
Created attachment 1662259 [details] rdsosreport w debug rd.debug rd.udev.debug systemd.log_level=debug Not sure how helpful this will be, the log starts at [ 345.341550], which seems like a separate bug.
Booting from an image written to a USB stick works. I'll try a real optical disc next.
Hi, what do you get from these commands ? udevadm info /dev/sr0 | grep ID_FS_USAGE udevadm info /dev/sr0 | grep ID_FS_LABEL_ENC for i in $(blkid -p /dev/sr0) ; do echo "$i" | grep USAGE ; done Reasoning: The rule which should make /dev/disk/by-label is at line 113 of https://github.com/systemd/systemd/blob/master/rules.d/60-persistent-storage.rules ENV{ID_FS_USAGE}=="filesystem|other|crypto", ENV{ID_FS_LABEL_ENC}=="?*", SYMLINK+="disk/by-label/$env{ID_FS_LABEL_ENC}" I understand ID_FS_USAGE gets filled in line 37 and 38 of https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-blkid.c which say } else if (streq(name, "USAGE")) { udev_builtin_add_property(dev, test, "ID_FS_USAGE", value); (at least https://codesearch.debian.net/search?q=package%3Asystemd+ID_FS_USAGE finds no other plausible suspects for setting this property.) Line 46 ff. of Chris Murphy's attachment of today https://bugzilla.redhat.com/attachment.cgi?id=1662251 says + blkid -o udev ... ID_FS_UUID_ENC=2020-02-10-09-43-22-00 ... ID_FS_LABEL_ENC=Fedora-WS-Live-rawh-20200210-n-0 These values are missing in /dev/disk/by-label and /dev/disk/by-uuid. by-uuid, too, is linked only if ID_FS_USAGE is "filesystem|other|crypto". blkid property "USAGE" (or "ID_FS_USAGE" with -o udev) is not listed with a blkid run without arguments. The internet proposes to use option -p (which i would not have guessed from the man blkid description "Switch to low-level superblock probing mode"). It works for me. Have a nice day :) Thomas
Thomas: I already said this: "OK, so in the end the problem here seems to be more or less what it was there: we get no ID_FS_* properties for /dev/sr0" i.e. you don't get ID_FS_[anything] for /dev/sr0.
The most obvious suspect here is https://github.com/systemd/systemd/pull/14485 , but it doesn't seem like it can be that, because the DEVTYPE for sr0 is reported as disk, not partition.
So comparing an image from 20191218 and today's, `udevadm info /dev/sr0` shows quite a lot more entries in the older image. It's not just ID_FS_*, but several other entries are 'missing' in the output from today's image - ID_CDROM_MEDIA_CD , ID_CDROM_MEDIA_SESSION_COUNT , ID_CDROM_MEDIA_TRACK_COUNT, ID_ATA, ID_TYPE , ID_BUS , ID_MODEL ... a whole bunch of stuff. Still not sure why.
ooh, okay, getting somewhere. The output of `/usr/lib/udev/cdrom_id /dev/sr0` is rather different on the old and the new image, the differences more or less map to the missing/different bits in the udevadm output (which makes sense). If I run `cdrom_id -d /dev/sr0` (for debug), on the old image I get: probing: '/dev/sr0' INQUIRY: [QEMU ][QEMU DVD-ROM ][2.0.] current profile 0x08 profile 0x08 media_cdrom [blah blah blah, more probe-y stuff] hardware reported media status: complete ID_CDROM=1 ... on the new image I get: probing: '/dev/sr0' INQUIRY failed ID_CDROM=1 ... so, it seems whatever this INQUIRY step is, it's not working. Looking into that some more next.
Aha. Hmm. So I realized another interesting thing about the compose where this started failing: it's actually the first compose with a 5.6 kernel build in it. The previous successful compose, where this bug didn't happen, was Fedora-Rawhide-20200204.n.0. That compose had kernel-5.5.0-0.rc6.git3.1.fc32 in it. The compose where this bug first showed up, Fedora-Rawhide-20200207.n.2, has kernel-5.6.0-0.rc0.git5.1.fc32 in it. Now poking around in the kernel sources, if I've got this right, I believe there are two fairly significant commits that weren't in 5.5.0-0.rc6.git3.1.fc32 but *are* in 5.6.0-0.rc0.git5.1.fc32 - these ones: https://github.com/torvalds/linux/commit/78ed001d9e7106171e0ee761cd854137dd731302 https://github.com/torvalds/linux/commit/f3ee6e63a9df460963bc51a40891cd9b067404f3 those touch the 'cdrom_generic_command' struct that udev uses for doing SCSI commands, which is how this 'inquiry' stuff is done. See `scsi_cmd`, `scsi_cmd_init`, and `scsi_cmd_set` in systemd/src/udev/cdrom_id/cdrom_id.c . None of that code changed between systemd 244 and 245, so we likely aren't looking at a systemd change here, but the *kernel* did change. So, tentatively re-assigning to the kernel. I'll see if I can prove this a bit harder...
Adding Arnd Bergman (author of the commit that touches cdrom_generic_command) in case he has any insight here.
so yeah, I abused openQA a lil' bit to test this out. https://openqa.stg.fedoraproject.org/tests/overview?build=Kojitask-41390770-NOREPORT&version=31&groupid=2&distri=fedora is what happens when you tell openQA to test kernel-5.6.0-0.rc0.git5.1.fc32 on F31: that means it'll build an F31 installer image, but with kernel-5.6.0-0.rc0.git5.1.fc32 in it, so everything else is Very Old in Rawhide terms, but we have the new kernel. And what happens is...it fails exactly like in this bug: https://openqa.stg.fedoraproject.org/tests/719428 https://openqa.stg.fedoraproject.org/tests/719427 so, I'm relatively confident blaming the kernel at this point. Next I'll do an *actual* scratch build of the kernel but with those two patches reverted, to see if that works again.
OK, well, those two commits don't revert quite cleanly on master, and I didn't have time to rebase the reverts tonight, so if this isn't figured out by tomorrow I'll do it then.
Agreed, my SG_IO changes are likely the cause, sorry for the troubles you are having. Can you confirm that this is with native 64-bit user space, not 32-bit compat mode on a 64-bit kernel?
Hi, > you don't get ID_FS_[anything] for /dev/sr0. Sorry for my lack of understanding. Chris Murphy's attachments show that he got some ID_FS_* but no /dev/disk/by-label link for sr0. I only halfways realized that his ID_FS_ stem only from blkid but not from udev. Duh. > two fairly significant commits that weren't in 5.5.0-0.rc6.git3.1.fc32 > [...] > those touch the 'cdrom_generic_command' struct Actually they touch struct compat_cdrom_generic_command, an internal struct of the kernel. It does not look like src/udev/cdrom_id/cdrom_id.c would make use of that struct definition but only of the public copy cdrom_generic_command which is unchanged since 2017: https://github.com/torvalds/linux/commits/master/include/uapi/linux/cdrom.h And from this copy it only uses an array member of a locally allocated instance as data storage for the SCSI CDB which it sends to the drive by struct sg_io_hdr and ioctl(SG_IO). So i would consider https://github.com/torvalds/linux/commit/f3ee6e63a9df460963bc51a40891cd9b067404f3 as not suspicious. I am riddling whether the change about getting pack_id https://github.com/torvalds/linux/commit/78ed001d9e7106171e0ee761cd854137dd731302 can have introduced inappropriate error occasions. Reasoning: (linux) include/linux/cdrom.h includes include/uapi/linux/cdrom.h which has struct cdrom_generic_command with the same member sizes as struct compat_cdrom_generic_command, but with more simple data types. https://github.com/systemd/systemd/blob/master/src/udev/cdrom_id/cdrom_id.c seems to use cdrom_generic_command only for lending struct sg_io_hdr its buffer for the SCSI Command Descriptor Block. (linux) include/uapi/linux/cdrom.h: struct cdrom_generic_command { unsigned char cmd[CDROM_PACKET_SIZE]; (linux) scsi/sg.h: typedef struct sg_io_hdr { ... unsigned char * cmdp; /* [i], [*i] points to command to perform */ (systemd) src/udev/cdrom_id/cdrom_id.c struct scsi_cmd { struct cdrom_generic_command cgc; ... struct sg_io_hdr sg_io; }; ... static void scsi_cmd_init(struct scsi_cmd *cmd) { ... cmd->sg_io.cmdp = cmd->cgc.cmd; cdrom_generic_command.quiet gets set to 1 but seems never used. cdrom_generic_command.sense gets set to the sense buffer of struct scsi_cmd but seems never been read. (This sense buffer is used by sg_io_hdr, too. So i would expect to see the error indicating sense codes in the error messages if INQUIRY would return some.) The SCSI command transaction is done by giving sg_io_hdr to ioctl(SG_IO). static int scsi_cmd_run(struct scsi_cmd *cmd, int fd, unsigned char *buf, size_t bufsize) { ... if (ioctl(fd, SG_IO, &cmd->sg_io)) So cdrom_generic_command is used only indirectly by its .cmd buffer. The failed INQUIRY transaction touches the cdrom_generic_command struct by putting bytes into its .cmd array to get CDB {0x12, 0, 0, 0, 36, 0} which complies to the description of INQUIRY in SCSI volume SPC-3. static int cd_inquiry(int fd) { ... scsi_cmd_init(&sc); scsi_cmd_set(&sc, 0, 0x12); scsi_cmd_set(&sc, 4, 36); scsi_cmd_set(&sc, 5, 0); So if the changes about compat_cdrom_generic_command changed the processing of udev's cd_inquiry(), then it would have something to do with a change of suitability of cdrom_generic_command.cmd as storage of sg_io_hdr.cmdp. But that would be implausible, because udev uses an instance of struct cdrom_generic_command which is allocated as local variable. Any change of the internal compat_cdrom_generic_command can hardly influence this. Have a nice day :) Thomas
If anyone is able to run the failing command with 'strace', that would show which error code is returned by the ioctl(fd, SG_IO, ...) command.
(In reply to Arnd Bergmann from comment #18) > Can you confirm that this is with native 64-bit user space, not 32-bit > compat mode on a 64-bit kernel? I can confirm the attached logs are from a 64-bit qemu-kvm. I'm not yet sure what the failing command is, Adam might know by now. Maybe possible to modify the relevant systemd unit to add strace, and bake it into an ISO image to test.
*** Bug 1801767 has been marked as a duplicate of this bug. ***
Hi, Arnd Bergmann wrote: > If anyone is able to run the failing command with 'strace', > that would show which error code is returned by the > ioctl(fd, SG_IO, ...) command. Sorry, even a fresh upgrade of my Debian Sid VM cannot reproduce the problem with ISO 9660 /dev/disk/by-label or cdrom_id. No problem to see with SG_IO. $ uname -a Linux ts-sid 5.4.0-3-amd64 #1 SMP Debian 5.4.13-1 (2020-01-19) x86_64 GNU/Linux /dev/sr0 is an .iso submitted to qemu-system-x86_64 by -cdrom. Chris Murphy wrote: > Maybe possible to modify the relevant systemd unit to add strace, > and bake it into an ISO image to test. I assume Arnd wants the strace of a run which shows the error message "INQUIRY failed" as in comment 13. E.g. strace /usr/lib/udev/cdrom_id -d /dev/sr0 2>&1 | less My successful SG_IO transactions look like ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x24\x00", mx_sb_len=64, iovec_count=0, dxfer_len=36, timeout=0, flags=SG_FLAG_DIRECT_IO|SG_FLAG_UNUSED_LUN_INHIBIT, dxferp="\x05\x80\x05\x32\x1f\x00\x00\x00\x51\x45\x4d\x55\x20\x20\x20\x20\x51\x45\x4d\x55\x20\x44\x56\x44\x2d\x52\x4f\x4d\x20\x20\x20\x20"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=0, info=0}) = 0 This appears as first occurence of string "SG_IO" in the output from my Debian Sid VM running strace /lib/udev/cdrom_id -d /dev/sr0 cmdp="\x12\x00\x00\x00\x24\x00" is the INQUIRY command sent to the drive. dxferp="..." shows the reply data from the successful call. "\x51\x45...\x4f\x4d\" means vendor and model "QEMU QEMU DVD-ROM". Arnd is probably interested in an error code which would come after the last "=" character and number -1. Like ENOENT from stat() if i try non-existing /dev/sr7: stat("/dev/sr7", 0x7ffeeaa4a530) = -1 ENOENT (No such file or directory) Have a nice day :) Thomas
Created attachment 1662512 [details] strace of the failed `/usr/lib/udev/cdrom_id -d /dev/sr0` command OK, so I tried the same command on an installed system with latest Rawhide kernel package, and indeed it fails the same way. Here's the strace output.
Thanks for tracking that down! The failing command is clearly: ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x24\x00", mx_sb_len=64, iovec_count=0, dxfer_len=36, timeout=0, flags=SG_FLAG_DIRECT_IO|SG_FLAG_UNUSED_LUN_INHIBIT, dxferp="\x00\x00\x00\x00\x00\x00\x00\x00\xbf\xf7\x1c\x42\xff\x7f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xb5\x96\x5d\x45\x68\x7f\x00\x00"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=0, info=0}) = -1 EFAULT (Bad address) Which seems to indicate that something went wrong in 78ed001d9e71 ("compat: scsi: sg: fix v3 compat read/write interface") that leads to an incorrect pointer being passed down, or the return code getting set wrong. I spent some more time looking at each code path, but could not actually find anything in the long SG_IO call chain that is either changed between v5.5 and v5.6-rc1, or that is clearly wrong in a way that leads to EFAULT with the given input from cdrom_id. My f3ee6e63a9df ("compat_ioctl: move CDROM_SEND_PACKET handling into scsi") only changes the compat handling for an ioctl command that is not called here, and while 78ed001d9e71 ("compat: scsi: sg: fix v3 compat read/write interface") does change the implementation of the SG_IO in the native case, it only does so for /dev/sg*, not /dev/sr*, which has a different implementation. I give up here, hopefully someone else can make more sense of it or can try something that gets us further. If anyone is able to run the failing process in qemu with a gdb attached to the kernel, I'd suggest attaching a breakpoint to sr_block_ioctl() or sg_io(), and stepping through these to see where the -EFAULT is returned.
I can try and get that later, but for now I'm running a scratch build of a kernel with 78ed001d9e71 reverted, and we can see if that 'fixes' the bug...
Hmm, so with that scratch build - https://koji.fedoraproject.org/koji/taskinfo?taskID=41461671 - we still get the bug, suggesting 78ed001d9e71 isn't the problem, or at least 78ed001d9e71 *alone* isn't the problem.
Arnd, I'm having trouble trying to come up with a sane-yet-minimal set of reversions from all your associated changes that includes reverting f3ee6e63a9df460963bc51a40891cd9b067404f3 . Any suggestions? Just trying to pin down the problematic commit...
Is there an even remote possibility this is GCC 10 related? A kernel bisect can be a huge PITA if things don't cleanly revert. Maybe fire off a 5.6.0-rc1 compile with gcc-9.2.1?
No, because the scratch build from #c27 is an F31 scratch build. Which means it was done with GCC 9. And it still hits the bug.
Created attachment 1662607 [details] git bisect log git bisect says d320a9551e394cb2d842fd32d28e9805c2a18fbb is the first bad commit commit d320a9551e394cb2d842fd32d28e9805c2a18fbb Author: Arnd Bergmann <arnd> Date: Fri Mar 15 17:39:44 2019 +0100 compat_ioctl: scsi: move ioctl handling into drivers
The good bad was discrete up until the last bad, which had a different error than the other bads. $ git status HEAD detached at d320a9551e39 $ sudo /usr/lib/udev/cdrom_id -d /dev/sr0 probing: '/dev/sr0' CDROM_GET_CAPABILITY failed
The first commit that produces the error we've been seeing all along (i.e. ignoring c32) seems a more likely suspect. But when I try to revert, I get merge conflicts. Anyway, maybe it'll help figure out which ballpark we're even in. commit 64cbfa96551a1511b9babef06afa06a1c6c4e8f5 Author: Arnd Bergmann <arnd> Date: Thu Nov 28 15:55:17 2019 +0100 compat_ioctl: move cdrom commands into cdrom.c
Ok, found the typo in my 64cbfa96551a15 commit, here is the fix: diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index 0fbb8fe6e521..e4240e4ae8bb 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -688,7 +688,7 @@ static const struct block_device_operations sr_bdops = .release = sr_block_release, .ioctl = sr_block_ioctl, #ifdef CONFIG_COMPAT - .ioctl = sr_block_compat_ioctl, + .compat_ioctl = sr_block_compat_ioctl, #endif .check_events = sr_block_check_events, .revalidate_disk = sr_block_revalidate_disk, I had ignored that patch while looking through the changes because it was meant to only change the compat behavior, but it accidentally changed the wrong callback pointer. I'll send it upstream in a bit, but just to be sure, can you confirm that this fully addresses the issue?
I see the same bug was introduced in multiple files in the same patch, I'll obviously fix them all at once.
Thanks a lot for running the bisect, Chris, and for the fix, Arnd! I'll run a test with the patch.
Arnd, the specific mistake you list in #c34 seems to be from d320a9551e394 , not 64cbfa96551a1511b9babef06afa06a1c6c4e8f5 . But I guess what we need to do is s/.ioctl/.compat_ioctl/ in all appropriate places from both commits?
OK, so this scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=41469287 has this patch: https://www.happyassassin.net/temp/0001-Replace-.ioctl-with-.compat_ioctl-in-three-appropria.patch and looks like it fixes the problem: [root@localhost ~]# /usr/lib/udev/cdrom_id -d /dev/sr0 probing: '/dev/sr0' INQUIRY: [QEMU ][QEMU DVD-ROM ][2.4.] current profile 0x08 profile 0x08 media_cd_rom ... from my test VM. I'm running openQA test on it too, just to verify that the fix for cdrom_id really does solve the installer environment boot problem. No, I don't know why the scratch build fails on s390x and armv7hl, that's -ENOTMYPROBLEM :)
Yup: confirmed. https://openqa.stg.fedoraproject.org/tests/720533#step/_boot_to_anaconda/3
Pardon my intrusion into this discussion. I was attempting to find the error number for the command in c38 but alas there is no strace or gdb commands in the initramfs of the iso image containing the problem. It would be nice to have them there but I don't know how tight real-estate in in the initramfs. Again, over awed. George...
George: we've already isolated the problem at this point. There was no need to do any of the diagnosis in the initramfs once we identified cdrom_id as the culprit - you can just as easily run it in a regular install with an affected kernel, and it fails the same way. That made things easier.
This should be fixed in kernel-5.6.0-0.rc1.git2.1.fc32 and kernel-5.6.0-0.rc1.git2.1.fc33, per jcline.
Fix confirmed with Fedora-32-20200214.n.1 - openQA tests have got past this bug now.
I submitted the mainline patch now, sorry for the delay. I also found a fourth instance of the same bug in my patches, on top of the three that Adam's patch addressed. I kept authorship and merged in the additional fix: https://lore.kernel.org/lkml/20200219165139.3467320-1-arnd@arndb.de/