Bug 1413291 - the systemd post-boot fix-up service prevents all kernel images from booting if it fails
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 25
Hardware: i686
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-01-14 15:34 UTC by Martin Gregorie
Modified: 2017-12-12 10:04 UTC

Last Closed: 2017-12-12 10:04:28 UTC
Type: Bug


Attachments
Image of the final lines of boot log (rhgb and quiet removed from command line) (4.79 MB, image/jpeg)
2017-01-15 13:43 UTC, Martin Gregorie

Description Martin Gregorie 2017-01-14 15:34:47 UTC
Description of problem:
During the reboot following a dnf update that installed Fedora 25 4.8.16-300 PAE
on a Compaq Vision (AMD E1 based), the icon on the initial splash screen filled, then the screen cleared and showed the messages
- CPU0 not found
- Job scheduler started
- Job spooling tools started
- Starting Network Manager script dispatcher

and then hung with no further messages shown.

The update was run last night, 13 Jan 17, and installed the third kernel image. The first was installed from the F25 32 bit XFCE live spin, which replaced (successively) Win 10 and the Gnome-based F25 workstation version. The second kernel image was the result of running a full dnf update immediately after the install.

Having tried a second boot, which was unsuccessful, I next tried to boot earlier kernel images: both failed at the same place as the 4.8.16-300 and so did the recovery image that the XFCE Live spin had installed.

There seem to be three problems here:
1) I have been unable to determine exactly where the problem lies
   because nothing is shown on the screen and I can't see the logs (see
   point 3) 

2) I have been unable to find anything in systemd documentation that
   explains how to force a boot when the post-boot fix-up task fails.

3) This failure prevents older kernel images, including the recovery image, 
   from booting. THIS SHOULD NEVER HAPPEN. The fact that it does happen 
   shows that some systemd flag or status indicator is global to all kernel
   images when it should be local to each kernel image. Making it local would
   allow previously installed images to boot. 
 
Version-Release number of selected component (if applicable):
   The systemd version on the 32 bit (PAE) XFCE workstation live DVD iso.

How reproducible:
   Happens every time I try to boot any currently installed boot image.

Steps to Reproduce:
1. Boot the computer using any boot image
2. Wait for the system to show the splash screen, clear it and hang after
   displaying "Starting Network Manager script dispatcher".

Actual results:
  System fails to boot and hangs up

Expected results:
  System should boot normally.

Additional info:
  I had the exact same problem on Friday, 30th December 2016 but with 
  Fedora 24, which had been running satisfactorily on a Lenovo R61i for 
  over a month. I was unable to reinstall on it from a live F25 DVD because
  its DVD drive is failing and refused to boot off any DVD except for old
  Debian Parted and Fedora 20 DVDs.   

  I'm about to reinstall from the 32 bit F25 XFCE live DVD and will add 
  the outcome to this bug report.

Comment 1 Martin Gregorie 2017-01-14 21:46:47 UTC
I reinstalled from the 32 bit F25 XFCE live DVD as I said and went through the following stages:

- The base install was 4.8.6-300.fc25.i686+PAE, which booted immediately when
  the live image was terminated.

- Added the extra 586 packages I needed from Fedora and RPMFusion repos.
  Rebooted OK.

- ran 'dnf update', which installed the 4.8.16-300.fc25.i686+PAE kernel image
  and upgraded 418 packages including the kernel. Rebooted OK.

- At this point the system was accessing the outside world but not finding
  systems on my LAN, so I dropped my various custom config files back into
  place, including resolv.conf, and configured the ethernet connection to 
  use my local DNS (bind in non-recursive mode on another Fedora system) 
  and gave it a static IP address that matches the bind configuration.
  The laptop was successfully now pinging a selection of both external 
  systems and those on my LAN.

- rebooted once again to check that this setup was stable. At this point the
  boot process failed to complete and I haven't been able to make it boot
  successfully since then. Here's what was displayed during the boot:

  4.8.16-300.fc25.i686+PAE
    Displays: failed to find cpu0 device node
    Shows the splash screen and prompts for the password of an encrypted
       partition.
    Displays: Started job spooler
              Starting Network Manager Script Dispatcher
    Stalls at this point

I removed 'rhgb' and 'quiet' from the boot command line and rebooted again. This detail level shows that it's hanging because IPV6 is not configured. This isn't surprising because my LAN and its connection to my ISP both use IPV4, so of course, when I set them up I simply ignored the IPV6 parameters. This has worked correctly on all previous Fedora versions [I first used RedHat 6.x and 7.2 before moving to Fedora 1 when it appeared].
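
For anyone repeating this step, the edit amounts to dropping two bare tokens from the kernel command line. A minimal Python sketch of that filtering, assuming a made-up command line: only 'rhgb' and 'quiet' come from this report, and the root= value is a placeholder.

```python
def strip_boot_args(cmdline, drop=("rhgb", "quiet")):
    """Return cmdline with the given bare arguments removed."""
    return " ".join(arg for arg in cmdline.split() if arg not in drop)


# Hypothetical command line; the root= value is a placeholder,
# not something taken from the report.
example = "root=/dev/mapper/example-root ro rhgb quiet"
print(strip_boot_args(example))
```

In practice the same change is made by hand in the boot loader's edit screen; the sketch only shows which tokens go and which stay.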
  
Questions:
==========
Is it possible to unsnarl this without yet another install from the live DVD?

What do I need to do to stop the Network Manager from hanging due to the lack of IPV6 parameters?

Comment 2 Martin Gregorie 2017-01-15 13:43:34 UTC
Created attachment 1240940
Image of the final lines of boot log (rhgb and quiet removed from command line)

Comment 3 Martin Gregorie 2017-01-15 13:54:50 UTC
I left the system sitting and stewing after making the screen image. It repeats the message:

IPV6: ADDRCONF(NETDEV_UP): wlo1: link is not ready

every 5 minutes and, when I last looked, had repeated it 32 times. 
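
Counting these repeats by hand is error-prone, so here is a rough Python sketch that tallies "link is not ready" messages per interface. It assumes the log lines have already been captured as text (the sample below is synthetic, modeled on the message quoted above), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the message quoted above; case-insensitive because the kernel
# may spell the prefix "IPv6" while this report writes "IPV6".
LINK_NOT_READY = re.compile(
    r"IPv6: ADDRCONF\(NETDEV_UP\): (?P<iface>[^\s:]+): link is not ready",
    re.IGNORECASE,
)


def count_link_not_ready(lines):
    """Return a Counter mapping interface name to number of matching lines."""
    counts = Counter()
    for line in lines:
        match = LINK_NOT_READY.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


# Synthetic sample: one message for eno1, 32 repeats for wlo1.
sample = ["IPV6: ADDRCONF(NETDEV_UP): eno1: link is not ready"]
sample += ["IPV6: ADDRCONF(NETDEV_UP): wlo1: link is not ready"] * 32
print(count_link_not_ready(sample))
```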

This must be due to a change made since the 4.8.6-300 image was released because I have never configured the wlo1 link at any time and that 4.8.6-300 booted successfully several times. Surely this sort of error should be only retried once at boot time and thereafter ignored after marking the device as failing.

How can I force this issue to be ignored without repeating the F25 XFCE install for a third time?

Comment 4 Martin Gregorie 2017-01-17 23:06:31 UTC
The endless wait is preceded by an oddity. The last point during a boot where I can break into the boot in emergency mode is immediately before a prompt is output for the password of a luks5 encrypted partition. At that point the file
/etc/resolv.conf has been replaced by an invalid link (i.e. the link is shown in red) to a nonexistent file, /var/run/NetworkManager/resolv.conf.

However, it turns out that the missing item is the directory 
/var/run/NetworkManager rather than the resolv.conf copy.
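
The check described above (which part of the link target is actually missing) can be sketched in Python as follows. This is an illustration only: the function name is made up, and it does not explain why the directory was absent at that point in the boot.

```python
import os


def first_missing_component(link_path):
    """Return the first missing path component of a symlink's target.

    Returns None if link_path is not a symlink or if its target exists.
    """
    if not os.path.islink(link_path):
        return None
    target = os.readlink(link_path)
    # A relative target is resolved against the directory holding the link.
    if not os.path.isabs(target):
        target = os.path.join(os.path.dirname(link_path), target)
    target = os.path.normpath(target)
    if os.path.exists(target):
        return None
    # Walk upwards until an existing ancestor is found; the child of that
    # ancestor is the first thing that is really missing.
    missing = target
    parent = os.path.dirname(missing)
    while parent != missing and not os.path.exists(parent):
        missing = parent
        parent = os.path.dirname(missing)
    return missing


if __name__ == "__main__":
    print(first_missing_component("/etc/resolv.conf"))
```

For the situation in this comment the result would be the NetworkManager directory rather than the resolv.conf file inside it.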

If the boot is allowed to continue I get an immediate prompt for the encrypted partition's password. The next two debug lines:
- report that the encrypted partition was mounted
- a line saying 
    IPV6: ADDRCONF(NETDEV_UP): eno1: link is not ready
  followed by lines reporting that eno1 has come up and that 
  wlo1 (which is disabled in the network configuration) won't come up and
  that leads to the boot hanging forever while it waits for wlo1 to be ready.
  See the attached JPG image for the detail of this part of the trace.

At this point the hang becomes total and I can't get back into emergency mode to
see whether the resolv.conf copy is still missing.

Comment 5 Martin Gregorie 2017-02-04 20:11:20 UTC
I have now managed to overcome the two problems involved in this bug and have the system up and running correctly under Fedora 25, which showed exactly the same problem as I originally reported under Fedora 24.

The base problem was that a configuration file: 

/etc/X11/Xwrapper.config

contained the parameter line 

allowed-users=anybody

that the old Xorg Xserver wanted but that Wayland evidently doesn't understand or like. The result was that Wayland crashed. So:

BUG 1: Wayland doesn't validate parameters correctly, so failed to reject the
       'allowed-users' parameter and crashed instead.

The immediate effect of this crash was that systemd waited forever for Wayland to come up without providing any way that I could find to break in, take control and edit the /etc content without rebooting and running the live DVD in order to login as root to delete the file.

Eventually I worked out that starting Fedora 25 XFCE Live and logging in as root instead of 'liveuser' would allow me to remove /etc/X11/Xwrapper.config.
Once that was done F25 booted successfully.
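
Before deleting the file from a live session it can be worth confirming that the setting is really there. A small Python sketch, assuming the file is plain 'key=value' lines with '#' comments (an assumption about the format, not something checked against the X server sources); the function name is hypothetical.

```python
def find_setting(config_text, key):
    """Return (line_number, value) pairs for each 'key=value' line matching key."""
    hits = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            hits.append((number, value.strip()))
    return hits


# Sample content modeled on the line quoted in this comment.
sample = "# example Xwrapper.config content\nallowed-users=anybody\n"
print(find_setting(sample, "allowed-users"))
```

To use it on the installed system, read the file from wherever the live session has mounted the installed root and pass its text to find_setting.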

F24 and F25 are my first exposures to using a live DVD to run Anaconda (all my upgrades since F21 have been direct in-situ version upgrades). Since I didn't see anything on the website to indicate that the Live DVD could be used this way, I wasted quite some time working this out. So:

RQST1: Alter the Recovery Image so that it runs a shell immediately after all
       disks have been mounted, i.e. this is equivalent to SysV init
       runlevel 1. This would allow the installer to fix service configuration
       problems before exiting the shell to let the boot process continue. 
       This would save a lot of hassle for those who, like me, have been doing
       in-place version upgrades and so who do not have a bootable DVD for the
       new version.

RQST2: Amend the Fedora installation pages to make it more obvious that the
       various DVD spins are live images that, as well as running the
       installer, can also be used (via a root login or sudo) to inspect
       and modify disk content before exiting to re-try the failed boot.

Comment 6 Fedora End Of Life 2017-11-16 19:13:04 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 7 Fedora End Of Life 2017-12-12 10:04:28 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

