786567 – systemd update to systemd-37-11.fc16 makes unbootable previously bootable system

Bug 786567 - systemd update to systemd-37-11.fc16 makes unbootable previously bootable system

Summary: systemd update to systemd-37-11.fc16 makes unbootable previously bootable system

Keywords:
Status:	CLOSED DUPLICATE of bug 786182
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	16
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-02-01 19:25 UTC by Michal Jaegermann
Modified:	2012-02-06 11:27 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-02-02 12:35:10 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Michal Jaegermann 2012-02-01 19:25:14 UTC

Description of problem:

The summary is not precise. The system in question still boots to level 1 but this does not look like much of a help as nothing seems to be amiss. That happened after a bunch of updates which included replacing of systemd-37-3.fc16 to systemd-37-11.fc16. Now when trying to boot to level 3 after dropping
'rhgb quiet' from boot assorted things are printed on a screen ending with
lines:

....
Started Avahi mDNS/DNS-SD Stack [OK]
Started Login Service [OK]

with a cursor going to the next line and sits there and that is it.
No login on a console, no login over a network.

Logs from such attempt end up with
....
... systemd-logind[960]: New seat seat0.
... avahi-daemon[952]: Successfully called chroot().
......
... avahi-daemon[952]: Service "linux" (/services/ssh.service) successfully established.

and nothing more until one will hit Ctr-Alt-Del which shuts everything down.

Stuff printed on a screen does not seem to be logged anywhere.

To make this more frustrating I have other F16 systems which so far are booting
just fine after recent updates. OTOH jumpy framebuffer text console is spectacularly useless as it is not readable (at least to me) while scrolling, does not seem to have any means to stop it, and it is gone the moment when it stops scrolling and a login prompt appears so attempts to compare with something still working go nowhere. Systemd is totally opaque about what may be expected next and why it is possibly unhappy.

Version-Release number of selected component (if applicable):
systemd-37-11.fc16

How reproducible:
No way so far to get a shell prompt other that booting to level 1

Expected results:
A system does not fall apart due to updates

Additional info:
On the same hardware I have another F16 test installation, also updated, and that one so far boots. Nothing was suggesting any possible troubles before updates were applied.

Comment 1 Michal Jaegermann 2012-02-01 19:35:37 UTC

Oh, booting with 'enforcing=0' does not change anything.

Comment 2 Michal Schmidt 2012-02-02 01:27:43 UTC

Have you tried waiting a few minutes when it looks stuck? Just in case a timeout kicks in and lets it finish.

Comment 3 Michal Jaegermann 2012-02-02 03:25:43 UTC

(In reply to comment #2)
> Have you tried waiting a few minutes when it looks stuck?

Some four minutes so far without any visible effects.  Maybe not long enough?

> Just in case a timeout kicks in and lets it finish.

If you think that this may help I may leave it for half an hour, or so, and see if something will time out.

This is not my "work" system, so I am just wasting time, but that the other one on the same hardware (which seem to boot so far) has some "unusual" layout.
At least with an old init, or upstart, I was able to start things one by one and investigate when something got stuck.  This does not look like feasible in systemd maze and apparently I am SOL.

Comment 4 Michal Jaegermann 2012-02-02 05:32:31 UTC

(In reply to comment #2)
> Just in case a timeout kicks in and lets it finish.

Indeed four minutes was not enough.  Eventually I found that one has to wait about 5 minutes after getting stuck before boot continues.  I _think_ that this was  mcstrans.service because I eventually found in log files (with timestamps
consistent with that wait):

.... systemd[1]: mcstrans.service operation timed out. Terminating.
.... systemd[1]: Unit mcstrans.service entered failed state.

and turning that service off restores sanity.  I notice once something printing "Failed" on my screen after that timeout but it was beyond my capabilities to read what that could possbily be in that jumpy junk before the message was erased.

Still it was not me who turned that service on (and also iscsi.service and  iscsid.service which were failing too but quickly). Also 'mcstrans' package was installed on "Fri 25 Nov 2011" and did not change from that time.  I was not paying much attention to that thing as before recent systemd changes I was getting in logs:

 mcstransd: mcstransd starting
 mcstransd: Failed to initialize color translations
 mcstransd: No color information will be available
 mcstrans[1087]: Starting mcstransd: [  OK  ]
 systemd[1]: Failed to read PID file /var/run/mcstransd.pid after start. The service might be broken. 
 mcstransd: mcstransd initialized

and the whole boot continued without long timeouts.

The other installation with boot not affected does not have that package installed.  Why 'mcstrans' was installed at all now I could not tell.  It appears that this was anaconda with some default sets.

Attempts to

/bin/systemctl start mcstrans.service

clearly go into a funk.  Why, I have no idea as executing /sbin/mcstransd starts it right away even if complaints about "color translation" are still recorded.

Comment 5 Michal Schmidt 2012-02-02 12:35:10 UTC

(In reply to comment #0)
> assorted things are printed on a screen ending with lines:
> 
> ....
> Started Avahi mDNS/DNS-SD Stack [OK]
> Started Login Service           [OK]

When looking for a stuck service, it's more interesting to look for an earlier "Starting" message without a corresponding "Started" message later.

> Stuff printed on a screen does not seem to be logged anywhere.

The status messages printed to the console are captured by plymouth and written out to /var/log/boot.log. I usually find the stuff that goes to /var/log/messages more interesting though, especially when I use systemd.log_level=debug.
And "systemctl --failed" after boot will tell you what services failed.

> OTOH jumpy framebuffer text console is spectacularly useless as it is not
> readable (at least to me) while scrolling, does not seem to have any means
> to stop it, and it is gone the moment when it stops scrolling and a login
> prompt appears so attempts to compare with something still working go
> nowhere.

I don't know what you mean by "jumpy".
The console is cleared by the getty unit. If you don't want the scrollback buffer cleared, override getty@.service and set TTYVTDisallocate=no.

> Systemd is totally opaque about what may
> be expected next and why it is possibly unhappy.

systemd accepts several command line options to increase the verbosity etc.
Another option is to use Alt+UpArrow which start kbrequest.target. The target unit is not defined by default, but one useful thing for debugging it to make it print what jobs are pending.

/etc/systemd/system/kbrequest.target:
[Unit]
Description=kbrequest

/etc/systemd/system/list-jobs.service:
[Unit]
Description=list pending systemd jobs
DefaultDependencies=no
[Service]
Type=oneshot
StandardOutput=tty
ExecStart=/bin/systemctl list-jobs
[Install]
WantedBy=kbrequest.target

(and "systemctl enable list-jobs.service")

In your case it would list several "waiting" jobs and a job "running" for mcstrans.service.


(In reply to comment #4)
> Indeed four minutes was not enough.  Eventually I found that one has to wait
> about 5 minutes after getting stuck before boot continues.

Oh, yes, 5 minutes is the timeout for starting of a SysV service.

> .... systemd[1]: mcstrans.service operation timed out. Terminating.
> .... systemd[1]: Unit mcstrans.service entered failed state.

mcstrans failed to write the PID file it declared in its initscript. There's an update in Bodhi.

*** This bug has been marked as a duplicate of bug 786182 ***

Comment 6 Michal Jaegermann 2012-02-02 16:29:07 UTC

(In reply to comment #5)
> (In reply to comment #0)

> > Stuff printed on a screen does not seem to be logged anywhere.
> 
> The status messages printed to the console are captured by plymouth and written
> out to /var/log/boot.log.

I just looked at three different F16 installations and for whatever reasons on each of these /var/log/boot.log is of a size 0.
 
> I don't know what you mean by "jumpy".

I mean that a console scrolls in such jerks and jumps that it is, at least to me, mostly unreadable.  That is even worse when after a long lull a bunch of lines is dumped in a quick succession to be immediately erased.

> The console is cleared by the getty unit. If you don't want the scrollback
> buffer cleared, override getty@.service and set TTYVTDisallocate=no.

I wonder where this is documented.  Somewhere on LP blog or maybe not?  Thanks for an information though.  Of course this assumes that some timeout will eventually happen and we are not stuck for good.

> 
> > Systemd is totally opaque about what may
> > be expected next and why it is possibly unhappy.
> 
> systemd accepts several command line options to increase the verbosity etc.

So how you are using these in a boot seqence when you are stuck and you cannot get to a shell prompt?  After booting "rescue" you are basically seeing "this cannot be done" responses for mostly anything.

> Another option is to use Alt+UpArrow which start kbrequest.target

You mean that I should write a new file with such name as nothing of that
sort exists so far?  Neither 'man systemd' nor 'man systemctl' mention anything about Alt+UpArrow.

> In your case it would list several "waiting" jobs and a job "running" for
> mcstrans.service.

Yes, I eventually found that after scratching my head for a very long time.

Comment 7 Michal Schmidt 2012-02-02 19:15:29 UTC

(In reply to comment #6)
> I just looked at three different F16 installations and for whatever reasons on
> each of these /var/log/boot.log is of a size 0.

Hm, I don't know what this behaviour depends on and how exactly plymouth works. On my box boot.log is not empty. Maybe plymouth only captures the messages if 'rhgb' is used. As I said, I don't pay much attention to boot.log.

> > The console is cleared by the getty unit. If you don't want the scrollback
> > buffer cleared, override getty@.service and set TTYVTDisallocate=no.
> 
> I wonder where this is documented.  Somewhere on LP blog or maybe not?

TTYVTDisallocate= describes a part of the execution environment of the unit and is documented in 'man systemd.exec'.

> > systemd accepts several command line options to increase the verbosity etc.
> 
> So how you are using these in a boot seqence when you are stuck and you cannot
> get to a shell prompt?

I meant the kernel command line. Most useful are systemd.log_level and systemd.log_target, especially if serial console is available.

If the problem appears during a later phase of the boot process when rsyslog is already running, then serial console is not needed, because all systemd debug messages can be saved to disk (assuming rsyslog is configured to do so).

> After booting "rescue" you are basically seeing "this
> cannot be done" responses for mostly anything.

Like what for example?

> > Another option is to use Alt+UpArrow which start kbrequest.target
> 
> You mean that I should write a new file with such name as nothing of that
> sort exists so far?

Well, only if you want to try this sort of thing.

> Neither 'man systemd' nor 'man systemctl' mention
> anything about Alt+UpArrow.

The kernel sends a signal to PID 1 when Alt+Up is pressed. The classical SysVinit could catch the signal and run a configured 'kbrequest' action.
systemd has a similar feature. When it gets the signal, it runs kbrequest.target.
The target is documented in 'man systemd.special'. It is not hooked to do anything by default. Whatever you configure it to do, is up to you. You can symlink it to another target, or do what I did, or run anything else from it.

Comment 8 Michal Jaegermann 2012-02-04 18:54:18 UTC

(In reply to comment #7)

> 
> TTYVTDisallocate= describes a part of the execution environment of the unit and
> is documented in 'man systemd.exec'.

You are right that 'rpm -qd systemd | grep -c /man/' responds with '74'. Only if you run into troubles while booting the only installation you have around all this information is rather hard to access.  Even when you booted figuring out what you may need to read is not that obvious.  "Everything and memorize that too" does not work that well in practice.

> > After booting "rescue" you are basically seeing "this
> > cannot be done" responses for mostly anything.
> 
> Like what for example?

Like when I was attempting to start another tty and getting nowhere.
Of course 'systemctl start gettty.target' would work but that was really the same I was trying to achieve and so this bumps into the same problem I was trying to avoid.  Some other attempts to use systemctl (and I am afraid that I already forgot precisely what) were coming back that this or that does not respond so nothing doing.  If I have everything up then there is no issue in the first place.

Oh, well, with some experience points scored I will likely have an easier time in this new variant of nethack.

Comment 9 Michal Schmidt 2012-02-06 11:27:36 UTC

(In reply to comment #8)
> Like when I was attempting to start another tty and getting nowhere.

When starting services in emergency mode, you probably want to avoid having their requirement dependencies pulled in. This should be safe:
systemctl start --ignore-dependencies getty

In fact, I don't know why we add Requires=basic.target to all the services. It would make sense to add only the After dependency and depend on the fact that basic.target is pulled in by multi-user.target.

> Oh, well, with some experience points scored I will likely have an easier time
> in this new variant of nethack.

Or, as in the case of this specific bug, with a bit more patience before declaring a system unbootable.

Note You need to log in before you can comment on or make changes to this bug.