Bug 1520982 - 4.15.0-0.rc2.git0.1.fc28.x86_64 fails to boot on some hardware
Summary: 4.15.0-0.rc2.git0.1.fc28.x86_64 fails to boot on some hardware
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-05 15:04 UTC by Bruno Wolff III
Modified: 2018-01-02 08:01 UTC
CC: 21 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-01-02 08:01:52 UTC
Type: ---
Embargoed:


Attachments
journalctl output (156.27 KB, text/plain)
2017-12-05 15:05 UTC, Bruno Wolff III
Patch to acquire more logs to help diagnose the problem (3.04 KB, patch)
2017-12-14 15:31 UTC, Bruno Wolff III
Kernel messages from boot with debugging patch applied. (96.20 KB, text/plain)
2017-12-14 15:41 UTC, Bruno Wolff III
dmesg output from successful Fedora rc4 boot (76.47 KB, text/plain)
2017-12-21 17:33 UTC, Bruno Wolff III
Log from failed boot of Fedora rpm based rc4 (93.75 KB, text/plain)
2017-12-21 18:41 UTC, Bruno Wolff III
Tested with this patch (820 bytes, patch)
2017-12-21 21:43 UTC, Bruno Wolff III
Boot log from failed rc4 boot (94.03 KB, text/x-vhdl)
2017-12-21 21:47 UTC, Bruno Wolff III
boot log output from i686 machine (79.44 KB, text/plain)
2017-12-22 05:03 UTC, Bruno Wolff III

Description Bruno Wolff III 2017-12-05 15:04:07 UTC

I am able to use 4.15.0-0.rc2.git0.1.fc28.x86_64 on an old Mac laptop without a problem, but it doesn't boot on my Dell workstation. 4.15.0-0.rc1.git2.2.fc28.x86_64 does boot.
I'm going to attach boot.log with journalctl info from a bad boot. There is an error about a null pointer that is probably related to the problem.
I'll keep trying updated kernels as things change fast early in the development cycle. If you need me to track this down further, I can probably find time to do a bisect.

Reproducible: Always

Steps to Reproduce:
1. Boot
Actual Results:  
The boot fails.

Expected Results:  
The boot succeeds.

Comment 1 Bruno Wolff III 2017-12-05 15:05:21 UTC
Created attachment 1363233 [details]
journalctl output

Comment 2 Bruno Wolff III 2017-12-11 19:46:07 UTC
rc3 is also failing, so it looks like it will be worth doing a bisect to find the problem commit.
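For reference, the per-iteration loop of such a bisect looks roughly like this (a minimal sketch, assuming a typical self-built kernel setup; the actual good/bad endpoints are the hashes recorded in the logs below):

git bisect start
git bisect bad v4.15-rc2             # first known-bad kernel
git bisect good <last-good-commit>   # e.g. the snapshot that still booted
# then, for each revision git bisect checks out:
make olddefconfig && make -j"$(nproc)"
sudo make modules_install install
# reboot, test, and mark the result:
git bisect good     # or: git bisect bad

Each iteration halves the remaining range, so a few hundred candidate commits resolve in under ten test boots.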

Comment 3 Bruno Wolff III 2017-12-12 19:29:53 UTC
The bisect against Linus' kernel seems to be working. I'll probably finish tomorrow if there aren't any complications. So far:
git bisect start
# bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
# good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
# good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
# bad: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag 'afs-fixes-20171201' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect bad ae753ee2771a1bacade56411bb98037b2545c929

Comment 4 Bruno Wolff III 2017-12-13 19:08:37 UTC
The bad commit appears to be a0747a859ef6d3cc5b6cd50eb694499b78dd0025. I'm going to make sure the latest Linus tree is still bad and see if reverting the above commit fixes it. If so, I'll report it upstream.
The full bisect log is:
git bisect start
# bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
# good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
# good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
# bad: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag 'afs-fixes-20171201' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect bad ae753ee2771a1bacade56411bb98037b2545c929
# bad: [eb1bd249ba016284ed762d87c1989dd822500773] nvme-rdma: fix memory leak during queue allocation
git bisect bad eb1bd249ba016284ed762d87c1989dd822500773
# bad: [8c97eeccf0ad8783c057830119467b877bdfced7] nvme-pci: add quirk for delay before CHK RDY for WDC SN200
git bisect bad 8c97eeccf0ad8783c057830119467b877bdfced7
# bad: [9e0ed16ab9a9aaf670b81c9cd05b5e50defed654] nvme-fc: check if queue is ready in queue_rq
git bisect bad 9e0ed16ab9a9aaf670b81c9cd05b5e50defed654
# bad: [3a92168bc8a113098b44a95d499557a88bb387da] block: add WARN_ON if bdi register fail
git bisect bad 3a92168bc8a113098b44a95d499557a88bb387da
# bad: [a0747a859ef6d3cc5b6cd50eb694499b78dd0025] bdi: add error handle for bdi_debug_register
git bisect bad a0747a859ef6d3cc5b6cd50eb694499b78dd0025
# good: [97f07697932e6faf2a708f7aef7cba8e6b0cab96] bdi: convert bdi_debug_register to int
git bisect good 97f07697932e6faf2a708f7aef7cba8e6b0cab96
# first bad commit: [a0747a859ef6d3cc5b6cd50eb694499b78dd0025] bdi: add error handle for bdi_debug_register
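As an aside, the log above is in "git bisect log" format, so it can be replayed in another checkout with standard git (nothing here is specific to this bug):

git bisect replay bisect-log.txt    # where bisect-log.txt holds the lines above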

Comment 5 Bruno Wolff III 2017-12-13 22:23:22 UTC
I confirmed the issue was still a problem as of d39a01eff9af1045f6e30ff9db40310517c4b45f. However, when I reverted a0747a859ef6d3cc5b6cd50eb694499b78dd0025 from d39a01eff9af1045f6e30ff9db40310517c4b45f, my system booted normally.
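For anyone wanting to repeat the revert test, a minimal sketch, assuming an already-configured checkout of Linus' tree:

git checkout d39a01eff9af1045f6e30ff9db40310517c4b45f
git revert --no-edit a0747a859ef6d3cc5b6cd50eb694499b78dd0025
make olddefconfig && make -j"$(nproc)"
sudo make modules_install install

If the revert doesn't apply cleanly, git revert stops and asks for manual conflict resolution before the build step.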

Comment 6 Laura Abbott 2017-12-14 00:54:34 UTC
Thanks for the bisect! I sent an e-mail off to the maintainers.

Comment 7 Bruno Wolff III 2017-12-14 15:31:56 UTC
Created attachment 1368056 [details]
Patch to acquire more logs to help diagnose the problem

Comment 8 Bruno Wolff III 2017-12-14 15:41:14 UTC
Created attachment 1368073 [details]
Kernel messages from boot with debugging patch applied.

I applied the debugging patch from Weiping Zhang to d39a01eff9af1045f6e30ff9db40310517c4b45f. There were some new debugging messages in the dmesg output. Hopefully they tell him what he needs.

Comment 9 Bruno Wolff III 2017-12-19 16:38:37 UTC
This is still happening with rc4. Reverting the commit still fixes the problem in Linus' tree.

Comment 10 Weiping 2017-12-21 15:58:38 UTC
(In reply to Bruno Wolff III from comment #9)
> This is still happening with rc4. Reverting the commit still fixes the
> problem in Linus' tree.

"still happen" can you see WARNING at device_add_disk?
can you paste the boot fail log and boot normal log(after revert).

Comment 11 Bruno Wolff III 2017-12-21 16:40:46 UTC
Note that this report is not really about the warning; it just happens to be there. The report is about commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025 breaking the boot process. When that commit is present in the build, the machine will not complete booting; when it isn't, the boot proceeds normally.
I'll get you boot logs from Fedora's rc4, Linus' rc4, and Linus' rc4 with commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025 reverted. I should be able to collect those before I leave.

Comment 12 Weiping 2017-12-21 16:48:25 UTC
(In reply to Bruno Wolff III from comment #11)
> Note that this report is not really about the warning; it just happens to
> be there. The report is about commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025
> breaking the boot process. When that commit is present in the build, the
> machine will not complete booting; when it isn't, the boot proceeds normally.
> I'll get you boot logs from Fedora's rc4, Linus' rc4, and Linus' rc4 with
> commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025 reverted. I should be able
> to collect those before I leave.

Hi, I see boot1.log is from a test of Shaohua's debug patch, which caught the warning in device_add_disk. Can you apply my debug patch and test again?

Comment 13 Bruno Wolff III 2017-12-21 17:33:57 UTC
Created attachment 1370994 [details]
dmesg output from successful Fedora rc4 boot

I just got a successful boot of the rpm version of rc4. I'm attaching dmesg output in case it is useful. I used the loglevel=7 boot parameter, but that might not have anything to do with why this boot succeeded. I had tried a boot a little bit before that failed, but the available dmesg output didn't include the failure messages from that boot.
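A note on capturing logs from failed boots: the journal only survives a reboot if persistent storage is enabled. A minimal sketch using standard systemd tools, nothing specific to this bug:

mkdir -p /var/log/journal    # with the default Storage=auto, this enables persistence
journalctl --flush           # move the current boot's journal from /run to /var
# after rebooting into a working kernel:
journalctl -b -1 -k > boot-failed.log    # -b -1 = previous boot, -k = kernel messages only

Dropping the -k gives the full journal of the previous boot instead of just the kernel messages.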

Comment 14 Bruno Wolff III 2017-12-21 18:12:47 UTC
It looks like if I boot in permissive mode I don't get the problem. I changed the boot mode before getting the successful boot because I saw some AVCs during boot. I also currently have my home machines booting in permissive mode, so I might be able to reproduce the problem at home. Running setenforce 1 after booting doesn't seem to immediately cause any problems. Still, this might provide a useful hint as to what is going on.
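For anyone trying to reproduce this, switching the SELinux mode doesn't require any rebuild; these are standard commands, not specific to this bug:

getenforce          # prints Enforcing or Permissive
sudo setenforce 0   # permissive until the next reboot
sudo setenforce 1   # back to enforcing
# to boot a single time in permissive mode, add enforcing=0
# to the kernel command line instead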

Comment 15 Bruno Wolff III 2017-12-21 18:41:39 UTC
Created attachment 1371019 [details]
Log from failed boot of Fedora rpm based rc4

I don't see the CPU hang and traceback messages in this log. The very last line looks a little suspicious, but the log might be useful to compare against the successful boot in permissive mode.
I'll see if I can still get a source build done today.

Comment 16 Bruno Wolff III 2017-12-21 21:43:32 UTC
Created attachment 1371071 [details]
Tested with this patch

I didn't mean to run this test with this patch applied, but I want to note that the log I'm attaching next wasn't from a completely vanilla rc4. I should have done a git diff as well as a git log to make sure the source tree was in the state I wanted.

Comment 17 Bruno Wolff III 2017-12-21 21:47:57 UTC
Created attachment 1371073 [details]
Boot log from failed rc4 boot

Again, the CPU hang messages didn't get copied over to permanent storage before things were too messed up to continue.
Late tonight I should get a chance to verify whether or not the problem happens on more hardware when I reboot some machines and come up in enforcing mode.

Comment 18 Bruno Wolff III 2017-12-21 23:15:20 UTC
I tested one home machine, and enforcing mode isn't enough to trigger the problem there. That machine has one SSD instead of two HDDs and isn't using md RAID, though it is using LUKS. Later tonight I'll test an i686 machine using md RAID with HDDs.

Comment 19 Bruno Wolff III 2017-12-22 05:03:04 UTC
Created attachment 1371172 [details]
boot log output from i686 machine

I got this to happen on one of my home machines when booting in enforcing mode. That machine is an i686 machine using md RAID to provide RAID 1, as well as LUKS.
Another machine did not exhibit the problem; it is x86_64 with a single SSD using LUKS. I don't know if md RAID is really needed to trigger the problem. It seems pretty likely SELinux is involved, at least as one path to triggering the problem.

Comment 20 Weiping 2017-12-22 05:34:59 UTC
(In reply to Bruno Wolff III from comment #19)
> Created attachment 1371172 [details]
> boot log output from i686 machine
> 
> I got this to happen on one of my home machines when booting in enforcing
> mode. That machine is an i686 machine using md RAID to provide RAID 1, as
> well as LUKS.
> Another machine did not exhibit the problem; it is x86_64 with a single SSD
> using LUKS. I don't know if md RAID is really needed to trigger the problem.
> It seems pretty likely SELinux is involved, at least as one path to
> triggering the problem.
Thanks for your effort. Could you please paste all of your RAID configuration:
1. cat /proc/mdstat
2. mdadm -D /dev/mdN (for all md devices)
3. /etc/mdadm.conf

Comment 21 Bruno Wolff III 2017-12-22 15:43:43 UTC
[root@wolff bruno]# cat /proc/mdstat
Personalities : [raid1] 
md14 : active raid1 sda4[1] sdb4[2]
      217148624 blocks super 1.2 [2/2] [UU]
      
md15 : active raid1 sdd1[0] sdc1[1]
      35564360 blocks super 1.2 [2/2] [UU]
      
md13 : active raid1 sdb3[2] sda3[1]
      83884984 blocks super 1.2 [2/2] [UU]
      
md11 : active raid1 sdb1[2] sda1[1]
      1048564 blocks super 1.0 [2/2] [UU]
      
md12 : active raid1 sdb2[2] sda2[1]
      10484664 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

[bruno@wolff bruno]# foreach name (/dev/md*)
foreach? mdadm -D $name
foreach? end
/dev/md11:
        Version : 1.0
  Creation Time : Sun Sep 18 10:16:27 2011
     Raid Level : raid1
     Array Size : 1048564 (1023.99 MiB 1073.73 MB)
  Used Dev Size : 1048564 (1023.99 MiB 1073.73 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 04:09:12 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : wolff.to:11  (local to host wolff.to)
           UUID : 9cfe4aa1:47d6f086:abfada32:0a3c771b
         Events : 6275

    Number   Major   Minor   RaidDevice State
       1       8        1        0      active sync   /dev/sda1
       2       8       17        1      active sync   /dev/sdb1
/dev/md12:
        Version : 1.2
  Creation Time : Sun Sep 18 10:16:47 2011
     Raid Level : raid1
     Array Size : 10484664 (10.00 GiB 10.74 GB)
  Used Dev Size : 10484664 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 02:04:58 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : wolff.to:12  (local to host wolff.to)
           UUID : 681e6d55:77f9f49e:3fb1f683:51e54c15
         Events : 6584

    Number   Major   Minor   RaidDevice State
       2       8       18        0      active sync   /dev/sdb2
       1       8        2        1      active sync   /dev/sda2
/dev/md13:
        Version : 1.2
  Creation Time : Sun Sep 18 10:17:01 2011
     Raid Level : raid1
     Array Size : 83884984 (80.00 GiB 85.90 GB)
  Used Dev Size : 83884984 (80.00 GiB 85.90 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 09:39:17 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : wolff.to:13  (local to host wolff.to)
           UUID : 99cfb35d:e485d41a:a4c3c75b:46747cac
         Events : 244134

    Number   Major   Minor   RaidDevice State
       1       8        3        0      active sync   /dev/sda3
       2       8       19        1      active sync   /dev/sdb3
/dev/md14:
        Version : 1.2
  Creation Time : Sun Sep 18 10:17:14 2011
     Raid Level : raid1
     Array Size : 217148624 (207.09 GiB 222.36 GB)
  Used Dev Size : 217148624 (207.09 GiB 222.36 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 09:28:45 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : wolff.to:14  (local to host wolff.to)
           UUID : fabfa860:723f5925:2e965e89:8f7ed574
         Events : 86516

    Number   Major   Minor   RaidDevice State
       2       8       20        0      active sync   /dev/sdb4
       1       8        4        1      active sync   /dev/sda4
/dev/md15:
        Version : 1.2
  Creation Time : Sun Sep 18 10:17:30 2011
     Raid Level : raid1
     Array Size : 35564360 (33.92 GiB 36.42 GB)
  Used Dev Size : 35564360 (33.92 GiB 36.42 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 09:39:15 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : wolff.to:15  (local to host wolff.to)
           UUID : 6a954169:eede6356:5b6ea629:45b68abd
         Events : 70667

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       33        1      active sync   /dev/sdc1

[bruno@wolff bruno]# cat /etc/mdadm.conf
DEVICE partitions
MAILADDR root
ARRAY /dev/md11 level=raid1 num-devices=2 metadata=1.0 uuid=9cfe4aa1-47d6-f086-abfa-da320a3c771b
ARRAY /dev/md12 level=raid1 num-devices=2 metadata=1.2 uuid=681e6d55-77f9-f49e-3fb1-f68351e54c15
ARRAY /dev/md13 level=raid1 num-devices=2 metadata=1.2 uuid=99cfb35d-e485-d41a-a4c3-c75b46747cac
ARRAY /dev/md14 level=raid1 num-devices=2 metadata=1.2 uuid=fabfa860-723f-5925-2e96-5e898f7ed574
ARRAY /dev/md15 level=raid1 num-devices=2 metadata=1.2 uuid=6a954169-eede-6356-5b6e-a62945b68abd

Comment 22 Bruno Wolff III 2017-12-22 15:46:10 UTC
Since you mentioned (on LKML) being able to reproduce this, I'll just include the info for the one machine. The other is very similar, and if you need it, I can get you that as well.

Comment 23 Bruno Wolff III 2017-12-24 15:48:37 UTC
The problem commit is reverted in rc5. I'm not sure whether a new version will be tried this development cycle. I'll test the Fedora rc5 kernel when it's ready, and we should be able to close this after that.
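To confirm the revert is in a given tag, standard git is enough; a small sketch, assuming an up-to-date clone of Linus' tree:

git log --oneline v4.15-rc4..v4.15-rc5 --grep="bdi_debug_register"

This should list the revert of "bdi: add error handle for bdi_debug_register".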

Comment 24 Bruno Wolff III 2018-01-02 08:01:52 UTC
This appears fixed in the rc6.git0 kernel build (rc5 never got built). Given that the commit was reverted, this is expected. The commit will probably come back later with changes, but since the author can reproduce the problem, the version that eventually gets posted is unlikely to cause this problem again.

