Bug 1801353 - Mounting root from installation media fails since kernel-5.6.0-0.rc0.git5.1.fc32
Summary: Mounting root from installation media fails since kernel-5.6.0-0.rc0.git5.1.fc32
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
: 1801767 (view as bug list)
Depends On:
Blocks: F32BetaBlocker
TreeView+ depends on / blocked
 
Reported: 2020-02-10 17:36 UTC by František Zatloukal
Modified: 2020-02-19 19:07 UTC (History)
37 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-15 00:57:41 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
rdsosreport (1.43 MB, text/plain)
2020-02-10 19:36 UTC, Chris Murphy
no flags Details
rdsosreport w debug (18.53 MB, text/plain)
2020-02-10 19:46 UTC, Chris Murphy
no flags Details
strace of the failed `/usr/lib/udev/cdrom_id -d /dev/sr0` command (41.26 KB, text/plain)
2020-02-11 18:17 UTC, Adam Williamson
no flags Details
git bisect log (2.57 KB, text/plain)
2020-02-12 07:12 UTC, Chris Murphy
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1783066 0 unspecified CLOSED Fedora-Rawhide-20191212.n.1 images do not boot (possibly only as real or virtual optical media?) 2021-05-25 17:08:52 UTC

Description František Zatloukal 2020-02-10 17:36:24 UTC
Description of problem:
It seems like since introduction of systemd-245, booting Fedora Installation media fails with timing out on mounting root system.

Eg. 
/dev/disk/by-label/... does not exist
/dev/root does not exist

Compose without the issue (different problem, but got well beyond root mount): https://openqa.fedoraproject.org/tests/516585#step/_boot_to_anaconda/7
Compose with the issue: https://openqa.fedoraproject.org/tests/518614#step/_boot_to_anaconda/21

Version-Release number of selected component (if applicable):
systemd-245

I am not 100% sure it's systemd failure, so please reassign if you think it's bug somewhere else. However, the issue didn't happen before systemd-245 landing in Rawhide as seen in OpenQA dates.

Comment 1 Fedora Blocker Bugs Application 2020-02-10 17:38:02 UTC
Proposed as a Blocker for 32-beta by Fedora user frantisekz using the blocker tracking app because:

 System must boot.

Comment 2 František Zatloukal 2020-02-10 18:07:50 UTC
Discussed during the 2020-02-10 blocker review meeting: [1]

The decision to classify this bug as an AcceptedBlocker was made:

"All release-blocking images must boot in their supported configurations"

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2020-02-10/f32-blocker-review.2020-02-10-17.01.log.txt

Comment 3 Chris Murphy 2020-02-10 18:17:33 UTC
VM with Rawhide, initramfs has systemd-244, but systemd-245 is installed. This boots OK. Rebuild initramfs so that systemd-245 is in there, and it still boots.

Comment 4 Adam Williamson 2020-02-10 19:03:13 UTC
So I think this *may* be related to the problem we had a few months back with util-linux, this one:

https://bugzilla.redhat.com/show_bug.cgi?id=1783066

as Stuff Has Happened upstream since then. Specifically, this hit systemd:

https://github.com/systemd/systemd/pull/14485

and then it seems some corresponding change was made in util-linux (reverting stuff):

https://github.com/karelzak/util-linux/pull/913#issuecomment-572581058

So, I'm wondering if somewhere in there we managed to break booting hybrid ISOs again, or if Rawhide got one side of the changes but not the other side yet, or if we now need changes to dracut's handling of our hybrid ISOs, or something like that.

So far I haven't got into the detailed dracut logs, but I did at least check that /dev/mapper has only control, and /dev/disk/by-id and /dev/disk/by-uuid are very empty, with only one entry in each. CCing Karel, Zbigniew and Harald for now.

Comment 5 Adam Williamson 2020-02-10 19:24:24 UTC
OK, so in the end the problem here seems to be more or less what it was there: we get no ID_FS_* properties for /dev/sr0, so we get no /dev/disk/by-label entry (or by-id or by-uuid etc. entries) for it, and so we can't mount it. But I'm not sure yet exactly where the problem is that *causes* us to get no ID_FS_* properties in the first place. It seems that in current Rawhide we have systemd 245 (with the #14485 PR) and util-linux 2.35.1 (which has the two reversions discussed in #913), so those seem to be 'in sync'.

Comment 6 Chris Murphy 2020-02-10 19:36:43 UTC
Created attachment 1662251 [details]
rdsosreport

Comment 7 Chris Murphy 2020-02-10 19:46:23 UTC
Created attachment 1662259 [details]
rdsosreport w debug

rd.debug rd.udev.debug systemd.log_level=debug

Not sure how helpful this will be, the log starts at [  345.341550], which seems like a separate bug.

Comment 8 Adam Williamson 2020-02-10 20:32:11 UTC
Booting from an image written to a USB stick works. I'll try a real optical disc next.

Comment 9 Thomas Schmitt 2020-02-10 22:34:11 UTC
Hi,

what do you get from these commands ?

  udevadm info /dev/sr0 | grep ID_FS_USAGE

  udevadm info /dev/sr0 | grep ID_FS_LABEL_ENC

  for i in $(blkid -p /dev/sr0) ; do echo "$i" | grep USAGE ; done

Reasoning:

The rule which should make /dev/disk/by-label is at line 113 of
  https://github.com/systemd/systemd/blob/master/rules.d/60-persistent-storage.rules

  ENV{ID_FS_USAGE}=="filesystem|other|crypto", ENV{ID_FS_LABEL_ENC}=="?*", SYMLINK+="disk/by-label/$env{ID_FS_LABEL_ENC}"

I understand ID_FS_USAGE gets filled in line 37 and 38 of
  https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-blkid.c
which say

   } else if (streq(name, "USAGE")) {
           udev_builtin_add_property(dev, test, "ID_FS_USAGE", value);

(at least
  https://codesearch.debian.net/search?q=package%3Asystemd+ID_FS_USAGE
finds no other plausible suspects for setting this property.)

Line 46 ff. of Chris Murphy's attachment of today
  https://bugzilla.redhat.com/attachment.cgi?id=1662251
says

  + blkid -o udev
  ...
  ID_FS_UUID_ENC=2020-02-10-09-43-22-00
  ...
  ID_FS_LABEL_ENC=Fedora-WS-Live-rawh-20200210-n-0

These values are missing in /dev/disk/by-label and /dev/disk/by-uuid.
by-uuid, too, is linked only if ID_FS_USAGE is "filesystem|other|crypto".

blkid property "USAGE" (or "ID_FS_USAGE" with -o udev) is not listed with
a blkid run without arguments.
The internet proposes to use option -p (which i would not have guessed
from the man blkid description "Switch to low-level superblock probing
mode"). It works for me.

Have a nice day :)

Thomas

Comment 10 Adam Williamson 2020-02-10 23:37:36 UTC
Thomas: I already said this: "OK, so in the end the problem here seems to be more or less what it was there: we get no ID_FS_* properties for /dev/sr0"

i.e. you don't get ID_FS_[anything] for /dev/sr0.

Comment 11 Adam Williamson 2020-02-11 00:06:26 UTC
The most obvious suspect here is https://github.com/systemd/systemd/pull/14485 , but it doesn't seem like it can be that, because the DEVTYPE for sr0 is reported as disk, not partition.

Comment 12 Adam Williamson 2020-02-11 00:31:42 UTC
So comparing an image from 20191218 and today's, `udevadm info /dev/sr0` shows quite a lot more entries in the older image. It's not just ID_FS_*, but several other entries are 'missing' in the output from today's image - ID_CDROM_MEDIA_CD , ID_CDROM_MEDIA_SESSION_COUNT , ID_CDROM_MEDIA_TRACK_COUNT, ID_ATA, ID_TYPE , ID_BUS , ID_MODEL ... a whole bunch of stuff. Still not sure why.

Comment 13 Adam Williamson 2020-02-11 00:50:44 UTC
ooh, okay, getting somewhere. The output of `/usr/lib/udev/cdrom_id /dev/sr0` is rather different on the old and the new image, the differences more or less map to the missing/different bits in the udevadm output (which makes sense). If I run `cdrom_id -d /dev/sr0` (for debug), on the old image I get:

probing: '/dev/sr0'
INQUIRY: [QEMU     ][QEMU DVD-ROM    ][2.0.]
current profile 0x08
profile 0x08 media_cdrom
[blah blah blah, more probe-y stuff]
hardware reported media status: complete
ID_CDROM=1
...

on the new image I get:

probing: '/dev/sr0'
INQUIRY failed
ID_CDROM=1
...

so, it seems whatever this INQUIRY step is, it's not working. Looking into that some more next.

Comment 14 Adam Williamson 2020-02-11 01:21:38 UTC
Aha. Hmm. So I realized another interesting thing about the compose where this started failing: it's actually the first compose with a 5.6 kernel build in it. The previous successful compose, where this bug didn't happen, was Fedora-Rawhide-20200204.n.0. That compose had kernel-5.5.0-0.rc6.git3.1.fc32 in it. The compose where this bug first showed up, Fedora-Rawhide-20200207.n.2, has kernel-5.6.0-0.rc0.git5.1.fc32 in it.

Now poking around in the kernel sources, if I've got this right, I believe there are two fairly significant commits that weren't in 5.5.0-0.rc6.git3.1.fc32 but *are* in 5.6.0-0.rc0.git5.1.fc32 - these ones:

https://github.com/torvalds/linux/commit/78ed001d9e7106171e0ee761cd854137dd731302
https://github.com/torvalds/linux/commit/f3ee6e63a9df460963bc51a40891cd9b067404f3

those touch the 'cdrom_generic_command' struct that udev uses for doing SCSI commands, which is how this 'inquiry' stuff is done. See `scsi_cmd`, `scsi_cmd_init`, and `scsi_cmd_set` in systemd/src/udev/cdrom_id/cdrom_id.c . None of that code changed between systemd 244 and 245, so we likely aren't looking at a systemd change here, but the *kernel* did change.

So, tentatively re-assigning to the kernel. I'll see if I can prove this a bit harder...

Comment 15 Adam Williamson 2020-02-11 01:23:17 UTC
Adding Arnd Bergman (author of the commit that touches cdrom_generic_command) in case he has any insight here.

Comment 16 Adam Williamson 2020-02-11 02:32:16 UTC
so yeah, I abused openQA a lil' bit to test this out. https://openqa.stg.fedoraproject.org/tests/overview?build=Kojitask-41390770-NOREPORT&version=31&groupid=2&distri=fedora is what happens when you tell openQA to test kernel-5.6.0-0.rc0.git5.1.fc32 on F31: that means it'll build an F31 installer image, but with kernel-5.6.0-0.rc0.git5.1.fc32 in it, so everything else is Very Old in Rawhide terms, but we have the new kernel. And what happens is...it fails exactly like in this bug:

https://openqa.stg.fedoraproject.org/tests/719428
https://openqa.stg.fedoraproject.org/tests/719427

so, I'm relatively confident blaming the kernel at this point. Next I'll do an *actual* scratch build of the kernel but with those two patches reverted, to see if that works again.

Comment 17 Adam Williamson 2020-02-11 07:25:48 UTC
OK, well, those two commits don't revert quite cleanly on master, and I didn't have time to rebase the reverts tonight, so if this isn't figured out by tomorrow I'll do it then.

Comment 18 Arnd Bergmann 2020-02-11 08:53:24 UTC
Agreed, my SG_IO changes are likely the cause, sorry for the troubles you are having.

Can you confirm that this is with native 64-bit user space, not 32-bit compat mode on a 64-bit kernel?

Comment 19 Thomas Schmitt 2020-02-11 09:55:33 UTC
Hi,

> you don't get ID_FS_[anything] for /dev/sr0.

Sorry for my lack of understanding. Chris Murphy's attachments show that
he got some ID_FS_* but no /dev/disk/by-label link for sr0.
I only halfways realized that his ID_FS_ stem only from blkid but not
from udev. Duh.

> two fairly significant commits that weren't in 5.5.0-0.rc6.git3.1.fc32
> [...]
> those touch the 'cdrom_generic_command' struct

Actually they touch struct compat_cdrom_generic_command, an internal
struct of the kernel.

It does not look like src/udev/cdrom_id/cdrom_id.c would make use of
that struct definition but only of the public copy cdrom_generic_command
which is unchanged since 2017:
  https://github.com/torvalds/linux/commits/master/include/uapi/linux/cdrom.h

And from this copy it only uses an array member of a locally allocated
instance as data storage for the SCSI CDB which it sends to the drive by
struct sg_io_hdr and ioctl(SG_IO).

So i would consider
  https://github.com/torvalds/linux/commit/f3ee6e63a9df460963bc51a40891cd9b067404f3
as not suspicious.

I am riddling whether the change about getting pack_id
  https://github.com/torvalds/linux/commit/78ed001d9e7106171e0ee761cd854137dd731302
can have introduced inappropriate error occasions.

Reasoning:

(linux) include/linux/cdrom.h includes include/uapi/linux/cdrom.h which
has struct cdrom_generic_command with the same member sizes as struct
compat_cdrom_generic_command, but with more simple data types.

  https://github.com/systemd/systemd/blob/master/src/udev/cdrom_id/cdrom_id.c
seems to use cdrom_generic_command only for lending struct sg_io_hdr its
buffer for the SCSI Command Descriptor Block.

(linux) include/uapi/linux/cdrom.h:

  struct cdrom_generic_command
  {
        unsigned char cmd[CDROM_PACKET_SIZE];

(linux) scsi/sg.h:

  typedef struct sg_io_hdr
  {
     ...
     unsigned char * cmdp;       /* [i], [*i] points to command to perform */

(systemd) src/udev/cdrom_id/cdrom_id.c

  struct scsi_cmd {
          struct cdrom_generic_command cgc;
          ...
          struct sg_io_hdr sg_io;
  };
  ...
  static void scsi_cmd_init(struct scsi_cmd *cmd) {
          ...  
          cmd->sg_io.cmdp = cmd->cgc.cmd;

cdrom_generic_command.quiet gets set to 1 but seems never used.
cdrom_generic_command.sense gets set to the sense buffer of struct scsi_cmd
but seems never been read. (This sense buffer is used by sg_io_hdr, too.
So i would expect to see the error indicating sense codes in the error
messages if INQUIRY would return some.)

The SCSI command transaction is done by giving sg_io_hdr to ioctl(SG_IO).

  static int scsi_cmd_run(struct scsi_cmd *cmd, int fd, unsigned char *buf, size_t bufsize) {
          ...
          if (ioctl(fd, SG_IO, &cmd->sg_io))

So cdrom_generic_command is used only indirectly by its .cmd buffer.

The failed INQUIRY transaction touches the cdrom_generic_command struct
by putting bytes into its .cmd array to get CDB {0x12, 0, 0, 0, 36, 0}
which complies to the description of INQUIRY in SCSI volume SPC-3.

  static int cd_inquiry(int fd) {
          ...
          scsi_cmd_init(&sc);
          scsi_cmd_set(&sc, 0, 0x12);
          scsi_cmd_set(&sc, 4, 36);
          scsi_cmd_set(&sc, 5, 0);

So if the changes about compat_cdrom_generic_command changed the processing
of udev's cd_inquiry(), then it would have something to do with a change of
suitability of cdrom_generic_command.cmd as storage of sg_io_hdr.cmdp.

But that would be implausible, because udev uses an instance of struct
cdrom_generic_command which is allocated as local variable. Any change
of the internal compat_cdrom_generic_command can hardly influence this.

Have a nice day :)

Thomas

Comment 20 Arnd Bergmann 2020-02-11 11:32:11 UTC
If anyone is able to run the failing command with 'strace', that would show which error code is returned by the ioctl(fd, SG_IO, ...) command.

Comment 21 Chris Murphy 2020-02-11 14:11:18 UTC
(In reply to Arnd Bergmann from comment #18)
> Can you confirm that this is with native 64-bit user space, not 32-bit
> compat mode on a 64-bit kernel?

I can confirm the attached logs are from a 64-bit qemu-kvm. I'm not yet sure what the failing command is, Adam might know by now. Maybe possible to modify the relevant systemd unit to add strace, and bake it into an ISO image to test.

Comment 22 Radek Vykydal 2020-02-11 15:40:44 UTC
*** Bug 1801767 has been marked as a duplicate of this bug. ***

Comment 23 Thomas Schmitt 2020-02-11 17:14:24 UTC
Hi,

Arnd Bergmann wrote:
> If anyone is able to run the failing command with 'strace',
> that would show which error code is returned by the
> ioctl(fd, SG_IO, ...) command.

Sorry, even a fresh upgrade of my Debian Sid VM cannot reproduce the
problem with ISO 9660 /dev/disk/by-label or cdrom_id.
No problem to see with SG_IO.
  $ uname -a
  Linux ts-sid 5.4.0-3-amd64 #1 SMP Debian 5.4.13-1 (2020-01-19) x86_64 GNU/Linux
/dev/sr0 is an .iso submitted to qemu-system-x86_64 by -cdrom.

Chris Murphy wrote:
> Maybe possible to modify the relevant systemd unit to add strace,
> and bake it into an ISO image to test.

I assume Arnd wants the strace of a run which shows the error message
"INQUIRY failed" as in comment 13. E.g.

   strace /usr/lib/udev/cdrom_id -d /dev/sr0 2>&1 | less

My successful SG_IO transactions look like

  ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x24\x00", mx_sb_len=64, iovec_count=0, dxfer_len=36, timeout=0, flags=SG_FLAG_DIRECT_IO|SG_FLAG_UNUSED_LUN_INHIBIT, dxferp="\x05\x80\x05\x32\x1f\x00\x00\x00\x51\x45\x4d\x55\x20\x20\x20\x20\x51\x45\x4d\x55\x20\x44\x56\x44\x2d\x52\x4f\x4d\x20\x20\x20\x20"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=0, info=0}) = 0

This appears as first occurence of string "SG_IO" in the output from my
Debian Sid VM running
  strace /lib/udev/cdrom_id -d /dev/sr0

cmdp="\x12\x00\x00\x00\x24\x00" is the INQUIRY command sent to the drive.
dxferp="..." shows the reply data from the successful call.
"\x51\x45...\x4f\x4d\" means vendor and model "QEMU    QEMU DVD-ROM".

Arnd is probably interested in an error code which would come after the
last "=" character and number -1.
Like ENOENT from stat() if i try non-existing /dev/sr7:

  stat("/dev/sr7", 0x7ffeeaa4a530) = -1 ENOENT (No such file or directory)

Have a nice day :)

Thomas

Comment 24 Adam Williamson 2020-02-11 18:17:40 UTC
Created attachment 1662512 [details]
strace of the failed `/usr/lib/udev/cdrom_id -d /dev/sr0` command

OK, so I tried the same command on an installed system with latest Rawhide kernel package, and indeed it fails the same way. Here's the strace output.

Comment 25 Arnd Bergmann 2020-02-11 21:15:29 UTC
Thanks for tracking that down! The failing command is clearly:

ioctl(3, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_FROM_DEV, cmd_len=6, cmdp="\x12\x00\x00\x00\x24\x00", mx_sb_len=64, iovec_count=0, dxfer_len=36, timeout=0, flags=SG_FLAG_DIRECT_IO|SG_FLAG_UNUSED_LUN_INHIBIT, dxferp="\x00\x00\x00\x00\x00\x00\x00\x00\xbf\xf7\x1c\x42\xff\x7f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xb5\x96\x5d\x45\x68\x7f\x00\x00"..., status=0, masked_status=0, msg_status=0, sb_len_wr=0, sbp="", host_status=0, driver_status=0, resid=0, duration=0, info=0}) = -1 EFAULT (Bad address)

Which seems to indicate that something went wrong in 78ed001d9e71 ("compat: scsi: sg: fix v3 compat read/write interface") that leads to an incorrect pointer being passed down, or the return code getting set wrong.

I spent some more time looking at each code path, but could not actually find anything in the long SG_IO call chain that is either changed between v5.5 and v5.6-rc1, or that is clearly wrong in a way that leads to EFAULT with the given input from cdrom_id. My f3ee6e63a9df ("compat_ioctl: move CDROM_SEND_PACKET handling into scsi") only changes the compat handling for an ioctl command that is not called here, and while 78ed001d9e71 ("compat: scsi: sg: fix v3 compat read/write interface") does change the implementation of the SG_IO in the native case, it only does so for /dev/sg*, not /dev/sr*, which has a different implementation.

I give up here, hopefully someone else can make more sense of it or can try something that gets us further. If anyone is able to run the failing process in qemu with a gdb attached to the kernel, I'd suggest attaching a breakpoint to sr_block_ioctl() or sg_io(), and stepping through these to see where the -EFAULT is returned.

Comment 26 Adam Williamson 2020-02-12 00:20:59 UTC
I can try and get that later, but for now I'm running a scratch build of a kernel with 78ed001d9e71 reverted, and we can see if that 'fixes' the bug...

Comment 27 Adam Williamson 2020-02-12 01:37:57 UTC
Hmm, so with that scratch build - https://koji.fedoraproject.org/koji/taskinfo?taskID=41461671 - we still get the bug, suggesting 78ed001d9e71 isn't the problem, or at least 78ed001d9e71 *alone* isn't the problem.

Comment 28 Adam Williamson 2020-02-12 01:46:53 UTC
Arnd, I'm having trouble trying to come up with a sane-yet-minimal set of reversions from all your associated changes that includes reverting f3ee6e63a9df460963bc51a40891cd9b067404f3 . Any suggestions? Just trying to pin down the problematic commit...

Comment 29 Chris Murphy 2020-02-12 01:51:58 UTC
Is there an even remote possibility this is GCC 10 related? A kernel bisect can be a huge PITA if things don't cleanly revert. Maybe fire off a 5.6.0-rc1 compile with gcc-9.2.1?

Comment 30 Adam Williamson 2020-02-12 01:54:37 UTC
No, because the scratch build from #c27 is an F31 scratch build. Which means it was done with GCC 9. And it still hits the bug.

Comment 31 Chris Murphy 2020-02-12 07:12:11 UTC
Created attachment 1662607 [details]
git bisect log

git bisect says
                                       
d320a9551e394cb2d842fd32d28e9805c2a18fbb is the first bad commit
commit d320a9551e394cb2d842fd32d28e9805c2a18fbb
Author: Arnd Bergmann <arnd>
Date:   Fri Mar 15 17:39:44 2019 +0100

    compat_ioctl: scsi: move ioctl handling into drivers

Comment 32 Chris Murphy 2020-02-12 07:14:48 UTC
The good bad was discrete up  until the last bad, which had a different error than the other bads.

$ git status
HEAD detached at d320a9551e39

$ sudo /usr/lib/udev/cdrom_id -d /dev/sr0
probing: '/dev/sr0'
CDROM_GET_CAPABILITY failed

Comment 33 Chris Murphy 2020-02-12 08:00:54 UTC
The first commit that produces the error we've been seeing all along (i.e. ignoring c32) seems a more likely suspect. But when I try to revert, I get merge conflicts. Anyway, maybe it'll help figure out which ballpark we're even in.

commit 64cbfa96551a1511b9babef06afa06a1c6c4e8f5
Author: Arnd Bergmann <arnd>
Date:   Thu Nov 28 15:55:17 2019 +0100

    compat_ioctl: move cdrom commands into cdrom.c

Comment 34 Arnd Bergmann 2020-02-12 08:34:37 UTC
Ok, found the typo in my 64cbfa96551a15 commit, here is the fix:

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 0fbb8fe6e521..e4240e4ae8bb 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -688,7 +688,7 @@ static const struct block_device_operations sr_bdops =
        .release        = sr_block_release,
        .ioctl          = sr_block_ioctl,
 #ifdef CONFIG_COMPAT
-       .ioctl          = sr_block_compat_ioctl,
+       .compat_ioctl   = sr_block_compat_ioctl,
 #endif
        .check_events   = sr_block_check_events,
        .revalidate_disk = sr_block_revalidate_disk,


I had ignored that patch while looking through the changes because it was
meant to only change the compat behavior, but it accidentally changed
the wrong callback pointer. I'll send it upstream in a bit, but just to be
sure, can you confirm that this fully addresses the issue?

Comment 35 Arnd Bergmann 2020-02-12 08:39:57 UTC
I see the same bug was introduced in multiple files in the same patch, I'll obviously fix them all at once.

Comment 36 Adam Williamson 2020-02-12 16:00:28 UTC
Thanks a lot for running the bisect, Chris, and for the fix, Arnd! I'll run a test with the patch.

Comment 37 Adam Williamson 2020-02-12 16:07:20 UTC
Arnd, the specific mistake you list in #c34 seems to be from d320a9551e394 , not 64cbfa96551a1511b9babef06afa06a1c6c4e8f5 . But I guess what we need to do is s/.ioctl/.compat_ioctl/ in all appropriate places from both commits?

Comment 38 Adam Williamson 2020-02-12 17:26:02 UTC
OK, so this scratch build:

https://koji.fedoraproject.org/koji/taskinfo?taskID=41469287

has this patch:

https://www.happyassassin.net/temp/0001-Replace-.ioctl-with-.compat_ioctl-in-three-appropria.patch

and looks like it fixes the problem:

[root@localhost ~]# /usr/lib/udev/cdrom_id -d /dev/sr0
probing: '/dev/sr0'
INQUIRY: [QEMU    ][QEMU DVD-ROM    ][2.4.]
current profile 0x08
profile 0x08 media_cd_rom
...

from my test VM. I'm running openQA test on it too, just to verify that the fix for cdrom_id really does solve the installer environment boot problem.

No, I don't know why the scratch build fails on s390x and armv7hl, that's -ENOTMYPROBLEM :)

Comment 39 Adam Williamson 2020-02-12 17:54:48 UTC
Yup: confirmed. https://openqa.stg.fedoraproject.org/tests/720533#step/_boot_to_anaconda/3

Comment 40 George R. Goffe 2020-02-13 00:23:16 UTC
Pardon my intrusion into this discussion.

I was attempting to find the error number for the command in c38 but alas there is no strace or gdb commands in the initramfs of the iso image containing the problem. It would be nice to have them there but I don't know how tight real-estate in in the initramfs.

Again, over awed.

George...

Comment 41 Adam Williamson 2020-02-13 01:06:33 UTC
George: we've already isolated the problem at this point. There was no need to do any of the diagnosis in the initramfs once we identified cdrom_id as the culprit - you can just as easily run it in a regular install with an affected kernel, and it fails the same way. That made things easier.

Comment 42 Adam Williamson 2020-02-13 18:53:19 UTC
This should be fixed in kernel-5.6.0-0.rc1.git2.1.fc32 and kernel-5.6.0-0.rc1.git2.1.fc33, per jcline.

Comment 43 Adam Williamson 2020-02-15 00:57:41 UTC
Fix confirmed with Fedora-32-20200214.n.1 - openQA tests have got past this bug now.

Comment 44 Arnd Bergmann 2020-02-19 19:07:00 UTC
I submitted the mainline patch now, sorry for the delay.

I also found a fourth instance of the same bug in my patches, on top of the three that Adam's patch addressed. I kept authorship and merged in the additional fix:

https://lore.kernel.org/lkml/20200219165139.3467320-1-arnd@arndb.de/


Note You need to log in before you can comment on or make changes to this bug.