1614815 – Dracut hang in boot at probing the /dev/mmcblk0 partitions

Bug 1614815 - Dracut hang in boot at probing the /dev/mmcblk0 partitions

Summary: Dracut hang in boot at probing the /dev/mmcblk0 partitions

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	27
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-08-10 13:30 UTC by Steve Heynemand
Modified:	2020-03-02 14:16 UTC (History)
CC List:	20 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-11-30 23:02:34 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Journal output at time of hang of the system (633.83 KB, text/x-vhdl) 2018-08-10 13:30 UTC, Steve Heynemand	no flags	Details
Output of DMSEG at time of system hang (162.22 KB, text/plain) 2018-08-10 13:31 UTC, Steve Heynemand	no flags	Details
List of /dev at time of hang of the system (8.78 KB, text/plain) 2018-08-10 13:31 UTC, Steve Heynemand	no flags	Details
List of modules loaded at time of system hang (689 bytes, text/plain) 2018-08-10 13:32 UTC, Steve Heynemand	no flags	Details
Journal output at time of hang of the system - with MMC logs added (934.22 KB, text/x-vhdl) 2018-08-20 14:56 UTC, Steve Heynemand	no flags	Details
View All

Description Steve Heynemand 2018-08-10 13:30:05 UTC

Created attachment 1475016 [details]
Journal output at time of hang of the system

Description of the problem:
We are using Fedora 27 on a system that contains an eMMC controller and eMMC flash device (it is actually an Intel N2807 CPU). We install a version of Fedora Workstation on the system and boot from the eMMC. A boot with dracut attempt to detect the /dev/mmcblk0 device and partitions to mount root file system. During the detection of the /dev/mmcblk0 device content, the system may hangs indefinitely stating ‘A start job is running for dev-mmcblk0p2.device’. The error message is visible on the display console output of the system.

Version-Release number of selected component (if applicable): 
kernel-4.16.7-200.fc27.x86_64 
dracut-046-8.git20180105.fc27.x86_64

How reproducible:
The problem is visible in a sequence of tests over warm reboot loop. The system may hang once in a while, approximately each 30 to 80 cycles. The problem has been tested against two different hardware platforms and three different eMMC flash device.

Steps to reproduce:
1- Install Fedora Workstation on eMMC flash device.
2- Create GNOME user startup script that execute a delay of 15 seconds and then request a reboot.
3- Reboot to start cycle.
This sequence of tests has been validated to work properly on Fedora 23 Workstation on the same test systems and eMMC flash devices.

Actual results: (time of failure)

Attached snapshot show the complete picture. We were able to externally connect USB disk, mount the USB with mount /dev/sda /myusb, and redirect the journalctl output to a file.

The interesting lines of journal content in the problematic situation

  Aug 09 17:03:45 gtr321a95 kernel: mmcblk0: mmc0:0001 eMMC   28.8 GiB 
  Aug 09 17:03:45 gtr321a95 kernel: mmcblk0boot0: mmc0:0001 eMMC   partition 1 4.00 MiB
  ...
  Aug 09 17:03:45 gtr321a95 kernel: mmcblk0boot1: mmc0:0001 eMMC   partition 2 4.00 MiB
  Aug 09 17:03:45 gtr321a95 kernel: mmcblk0rpmb: mmc0:0001 eMMC   partition 3 4.00 MiB, chardev (241:0)
  ...
  ...
  Aug 09 17:03:45 gtr321a95 kernel:  mmcblk0: p1 p2
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[269]: handling device node '/dev/mmcblk0rpmb', devnum=c241:0, mode=0600, uid=0, gid=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[269]: preserve permissions /dev/mmcblk0rpmb, 020600, uid=0, gid=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[269]: creating symlink '/dev/char/241:0' to '../mmcblk0rpmb'
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[269]: created empty file '/run/udev/data/c241:0' for '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/mmcblk0rpmb'
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: probe /dev/mmcblk0 raid offset=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: handling device node '/dev/mmcblk0', devnum=b179:0, mode=0660, uid=0, gid=6
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: set permissions /dev/mmcblk0, 060660, uid=0, gid=6
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: creating symlink '/dev/block/179:0' to '../mmcblk0'
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: creating link '/dev/disk/by-id/mmc-eMMC_0x150500d2' to '/dev/mmcblk0'
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: creating symlink '/dev/disk/by-id/mmc-eMMC_0x150500d2' to '../../mmcblk0'
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: created db file '/run/udev/data/b179:0' for '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0'
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: adding watch on '/dev/mmcblk0'
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: created db file '/run/udev/data/b179:0' for '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0'
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[259]: probe /dev/mmcblk0p1 raid offset=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[269]: probe /dev/mmcblk0boot1 raid offset=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[267]: probe /dev/mmcblk0boot0 raid offset=0
  ...
  Aug 09 17:03:45 gtr321a95 systemd-udevd[265]: probe /dev/mmcblk0p2 raid offset=0
  ...
  Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1560 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2' is taking a long time
  Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1562 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1' is taking a long time
  Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1559 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time
  Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1564 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0' is taking a long time
  ...
  Aug 09 17:06:46 gtr321a95 systemd-udevd[232]: seq 1559 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' killed
  Aug 09 17:06:46 gtr321a95 systemd-udevd[232]: seq 1564 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0' killed
  Aug 09 17:06:46 gtr321a95 systemd-udevd[232]: seq 1560 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2' killed
  Aug 09 17:06:46 gtr321a95 systemd-udevd[232]: seq 1562 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1' killed
  ...

Additional information:
Attached list of all OS components and version - componentversion.txt
Attached capture of the journalctl at time when the system hangs – fullerrorjornal2.log
Attached capture of the dmesg at time when the system hangs – dmesgerror2.log
Attached capture of the dev at time when the system hangs – deverror2.log
Attached capture of the lsmod at time when the system hangs – lsmoderror2.log
We added the rd.debug rd.udev.debug rd.timeout=120 kernel parameters to assist in debug.
We added ForwardToConsole=yes in journald.conf to assist in debug.

The fullerrorjournal2.txt also contains the SysRq Backtrace and SysRq State output when hanged.

Comparing the journal content between a situation of error and one that works let use believe that the ‘probe /dev/mmcblk0p2’ does not complete.
As a reference, a bootokjournal.log file is attached to see the working condition

We are working in order to let the mmc_core loadable module to print debug statements on transaction to see if the hang is produced by the mmc driver stack.

Comment 1 Steve Heynemand 2018-08-10 13:31:03 UTC

Created attachment 1475017 [details]
Output of DMSEG at time of system hang

Comment 2 Steve Heynemand 2018-08-10 13:31:39 UTC

Created attachment 1475018 [details]
List of /dev at time of hang of the system

Comment 3 Steve Heynemand 2018-08-10 13:32:13 UTC

Created attachment 1475019 [details]
List of modules loaded at time of system hang

Comment 4 Harald Hoyer 2018-08-13 07:53:49 UTC

looks like a kernel bug to me

Comment 5 Steve Heynemand 2018-08-20 14:55:16 UTC

One common thing from fullerrorjournal2.txt

Udev seq number       1560              1562            1559            1564
Udev process number   265               269             259             267
Process name          system-udevd      system-udevd    system-udevd    system-udevd
Priority              120               120             120             120
Stack hang point      blkdev_close      blkdev_close    blkdev_close    blkdev_close
Log file statement    A                 B               C               D

A-> match to - Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1560 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2' is taking a long time
B-> match to - Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1562 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot1' is taking a long time
C-> match to - Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1559 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time
D-> match to - Aug 09 17:04:46 gtr321a95 systemd-udevd[232]: seq 1564 '/devices/pci0000:00/0000:00:17.0/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0boot0' is taking a long time


I am attaching a new capture of the issue after having changed some dynamic debug statement in static printk in the mmc_core.c file. I have annotated the mmc_journalerror2.log with the MMC commands and all of the commands appear to be ok up to the moment that the system-udevd declares 4 operations that are taking a long time. Some 45 seconds after error declaration, we can see the probe that are progressing but it appears to be too late for the boot process to happen. The addition of the MMC printk statements change a little bit the behavior as originally, I could not see the probe happenning at all.

Comment 6 Steve Heynemand 2018-08-20 14:56:55 UTC

Created attachment 1477253 [details]
Journal output at time of hang of the system - with MMC logs added

Comment 7 Steve Heynemand 2018-08-20 14:59:05 UTC

I have been able to validate a work around for this error situation. I have forced udevd to operate with only one child process (rd.udev.children_max=1). This has been tested to work correctly for a complete 4 days long run on 3 same systems that showed the issue. Before work around, each system showed the issue more or less 5 times a day.

Comment 8 Laura Abbott 2018-10-01 21:22:57 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
 
Fedora 27 has now been rebased to 4.18.10-100.fc27.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 28 or Fedora 29, and are still experiencing this issue, please change the version to Fedora 28 or 29.
 
If you experience different issues, please open a new bug report for those.

Comment 9 Ben Cotton 2018-11-27 14:12:39 UTC

This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 27 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Ben Cotton 2018-11-30 23:02:34 UTC

Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 11 Steve Heynemand 2020-03-02 14:16:47 UTC

Got it. In attempt to shutdown the regular request for info by Bugzilla. I added this comment.

Note You need to log in before you can comment on or make changes to this bug.

airlied
bskeggs
dracut-maint-list
ewk
harald
hdegoede
ichavero
itamar
jarodwilson
jglisse
john.j5live
jonathan
josef
kernel-maint
linville
mchehab
mjg59
steved
steve.heynemand
zbyszek