Bug 495326

Summary:

Exiting runlevel 5 does not exit X

Product:

[Fedora] Fedora

Reporter:

Allen Kistler <ackistler>

Component:

initscripts

Assignee:

Bill Nottingham <notting>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

low

Version:

rawhide

CC:

benjavalero, bgamari, bnocera, cdahlin, chris, dakingun, davidsen, farrellj, hez, jfrieben, jlayton, jmccann, josep.puigdemont, kyleki, mclasen, me, michael+konten, michal, mtasaka, notting, rpjday, russell, rvokal, wtogami

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-05-05 21:44:31 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

446452

Attachments:

Description	Flags
rc6 kills itself (twice) - short form	none
rc6 kills itself (twice) - long form	none
rc0 kills itself (twice)	none
yum.log from 10 Apr 14:09 to 14:28	none
gdm dump, changing from runlevel 5 to runlevel 3	none
/var/log/messages on failed restart with 04-17-09 rawhide	none
/var/log/messages from bootup until sysrq at failed shutdown	none
/var/log/messages with Magic SysRq info (from possible VT-switch bug?)	none
Screenshot of X with runlevel reporting 3 in gnome-terminal	none
/var/log/messages from failed X exit	none
"ps -ef" output after failed X exit	none

Description Allen Kistler 2009-04-12 00:24:47 UTC

Description of problem:
When shutting down or rebooting from runlevel 5 via Main-menu/Shut-Down..., some processes are killed, but the desktop remains up (though blank except for the background), including the mouse (which moves, though there's nothing left to click on).

Version-Release number of selected component (if applicable):
initscripts-8.94-1.i586

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot to runlevel 5
2. Shutdown or reboot from the main menu
  
Actual results:
Shutdown/reboot stops on the way down, still displaying a blank desktop

Expected results:
Shutdown/reboot should go all the way down

Additional info:
I'll attach excerpts from /var/log/messages that I think show why.  Basically it looks like the rc0 and rc6 scripts are *sometimes* killing themselves off before they can completely take the machine down.  (Race?)

The machine still remains responsive to the network and sshd, so it is possible to login over ssh and issue a command line "telinit 0" or "telinit 6."

The machine still remains responsive to the keyboard and power button, so if one of them is assigned the Shutdown function in gnome-power-manager, it is possible to shut down that way, too.

I installed initscripts-8.94-1.i586 on 8 Apr.  I didn't see this behavior before 10 Apr.  Previously I had initscripts-8.93-1.i586, installed on 3 Apr.  I'll attach my yum.log from 10 Apr, too, in case it's relevant.

Comment 1 Allen Kistler 2009-04-12 00:39:14 UTC

Created attachment 339200 [details]
rc6 kills itself (twice) - short form

Apr 10 14:43:49 localhost init: rc6 main process (7176) killed by TERM signal
Apr 10 14:43:49 localhost init: rc6 main process (7190) killed by TERM signal

Comment 2 Allen Kistler 2009-04-12 00:40:57 UTC

Created attachment 339201 [details]
rc6 kills itself (twice) - long form

Apr 11 15:57:26 localhost init: rc6 main process (3648) killed by TERM signal
Apr 11 15:57:26 localhost init: rc6 main process (3656) killed by TERM signal

Comment 3 Allen Kistler 2009-04-12 00:42:12 UTC

Created attachment 339202 [details]
rc0 kills itself (twice)

Apr 10 14:52:03 localhost init: rc0 main process (3287) killed by TERM signal
Apr 10 14:52:03 localhost init: rc0 main process (3298) killed by TERM signal

Comment 4 Allen Kistler 2009-04-12 00:45:15 UTC

Created attachment 339203 [details]
yum.log from 10 Apr 14:09 to 14:28

Comment 5 Robert P. J. Day 2009-04-12 10:19:43 UTC

I've noticed something that might be related.  Running "init 3" from a GNOME desktop "xterm" might leave the entire desktop running happily, even though the system actually thinks it's in run level 3:

  # runlevel
  5 3
  #

Even going to run level one might leave GNOME running as well.  On other occasions, doing "init 3" locks up the system, requiring a power down and reboot.  If I can track this down more explicitly, I'll add that here.

Comment 6 Allen Kistler 2009-04-13 06:03:57 UTC

The bug just happened to me again shutting down. So I logged in over ssh and ...

$ runlevel
5 0

... which I kind of expected. It *is* supposed to be shutting down, after all.

Then I tried seeing if the bug happened going from runlevel 5 to runlevel 3.

[First attempt]
Without nomodeset, the screen filled up with graphics garbage (not text garbage) and any time I touched the keyboard, I got ...

Apr 13 00:29:59 localhost /sbin/mingetty[3411]: tty1: invalid character 0x1c in login name
Apr 13 00:30:04 localhost init: tty1 main process (3411) terminated with status 1
Apr 13 00:30:04 localhost init: tty1 main process ended, respawning

... in /var/log/messages.

[Second attempt]
With nomodeset (I'm not sure it's related though ... read on), nothing appeared to happen. /var/log/messages had a dump from gdm in it, though. (I'll attach it.) "ps -ef | grep gdm" reveals that gdm is *not* running. The main menu still works, and I could launch gedit. So I tried logging out. Most of the desktop went away, and I was left with just the background and a mouse, but nothing to click on. /var/log/messages had a few lines:

Apr 13 00:42:36 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie
Apr 13 00:43:28 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie
Apr 13 00:43:28 localhost /sbin/mingetty[3275]: tty1: invalid character 0xb6 in login name
Apr 13 00:43:33 localhost init: tty1 main process (3275) terminated with status 1
Apr 13 00:43:33 localhost init: tty1 main process ended, respawning
Apr 13 00:44:00 localhost /sbin/mingetty[3361]: tty1: invalid character 0x12 in login name
Apr 13 00:44:05 localhost init: tty1 main process (3361) terminated with status 1
Apr 13 00:44:05 localhost init: tty1 main process ended, respawning
Apr 13 00:44:08 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie
Apr 13 00:44:08 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie
Apr 13 00:44:09 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie

Last ditch, at this point I tried alt-ctrl-bksp to kill X. I got the same screen full of garbage as in {First attempt} above. /var/log/messages had:

Apr 13 00:52:25 localhost kernel: [drm] Resetting GPU
Apr 13 00:52:29 localhost init: tty1 main process (3365) terminated with status 1
Apr 13 00:52:29 localhost init: tty1 main process ended, respawning

Any time I touch the keyboard, I get the same 3 lines I have above for "any time I touched the keyboard" in [First attempt]. So at this point, it's the same place.

Repeating "telinit 3" from within the ssh session has no effect, either on what's visible or in /var/log/messages.

The bottom line is that I'm now wondering if this is a Gnome or x11-server problem.

Comment 7 Allen Kistler 2009-04-13 06:05:10 UTC

Created attachment 339280 [details]
gdm dump, changing from runlevel 5 to runlevel 3

Comment 8 Allen Kistler 2009-04-13 07:53:33 UTC

I backed out xorg-x11-server and plymouth to their previous versions.  The problem went away AFAICT.  (It's an intermittent problem, so it could have just hidden.)

Update xorg-x11-server.  Still no problem.  Back it out again.
Update plymouth. ==> PROBLEM <==
Update xorg-x11-server, back-out plymouth.  Problem goes away again.

I'm not certain what role plymouth plays in shutting down or otherwise *exiting* runlevel 5, but it looks like the culprit to me.

Changing this report's component to plymouth.

Comment 9 Allen Kistler 2009-04-13 08:04:02 UTC

The apparently broken version of plymouth is plymouth-0.7.0-0.2009.03.10.3.

The previous version was plymouth-0.7.0-0.2009.03.10.2, which seems okay.

Comment 10 Allen Kistler 2009-04-13 10:26:12 UTC

I just had the bug again, despite plymouth being backed out, so it's probably not plymouth.

Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL
Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL
Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL
Apr 13 05:11:07 localhost init: tty4 main process (2524) killed by TERM signal
Apr 13 05:11:07 localhost init: tty5 main process (2525) killed by TERM signal
Apr 13 05:11:07 localhost init: tty6 main process (2530) killed by TERM signal
Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-CRITICAL: g_object_ref: assertion `G_IS_OBJECT (object)' failed
Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-WARNING: invalid unclassed pointer in cast to `GdmSession'
Apr 13 05:11:07 localhost gdm-simple-slave[2583]: CRITICAL: _gdm_session_conversation_stopped: assertion `GDM_IS_SESSION (session)' failed
Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-CRITICAL: g_object_unref: assertion `G_IS_OBJECT (object)' failed
Apr 13 05:11:07 localhost restorecond: Unable to watch (/root/.ssh/*) No such file or directory
Apr 13 05:11:07 localhost init: tty2 main process (2527) killed by TERM signal
Apr 13 05:11:07 localhost init: tty3 main process (2529) killed by TERM signal
Apr 13 05:11:07 localhost init: rc0 main process (3375) killed by TERM signal
Apr 13 05:11:07 localhost kernel: [drm] Resetting GPU

I'm officially too tired to try another candidate right now.  Maybe later.

Comment 11 Allen Kistler 2009-04-14 09:24:24 UTC

Still trying to figure out what component is causing the freaky behavior, I ...

1. wiped and reinstalled from the Beta DVD
2. updated everything to rawhide except
   ConsoleKit*
   libdrm (because I saw some other complaints about it elsewhere)
   mesa-dri-drivers (because it needs the newer libdrm)
   mesa-LibGL (because it needs the newer libdrm)

The bug still occurs in this setup.  So basically the problem is also *not* in ConsoleKit or libdrm.

If anybody has a good idea where it really is, I'll happily assign it to the right component.  In the meantime, I'll try a few components per day.

Comment 12 Bill Nottingham 2009-04-14 16:20:42 UTC

*** Bug 495355 has been marked as a duplicate of this bug. ***

Comment 13 Allen Kistler 2009-04-15 22:48:36 UTC

After much beating on my test install, here's my latest finding.

If I back out gnome-session and gnome-session-xsession from 2.26.0.90-1.fc11 to 2.26.0-2.fc11, the bug disappears.  At least, I can't make it happen.

Mysteriously, if I re-update to 2.26.0.90-1.fc11, the bug stays gone.

So:

1. I will reinstall again from the beta DVD and see if I can recreate the bug
   using a local copy I've got of 2.26.0.90-1.fc11.  (I reinstalled twice before.
   Why not a third time?)
2. I noticed there's a newer version of gnome-session in rawhide today.  I'll see
   if that version fixes anything, too.

In the meantime, I'm changing the reported component to gnome-session.

Comment 14 Allen Kistler 2009-04-16 10:06:09 UTC

1. Clean install from the Beta DVD
2. Update to all the current rawhide, except for gnome-session*
3. Update gnome-session* to 2.26.0-2.fc11

-> No bug

4. Update gnome-session* to 2.26.0.90-1.fc11

-> Bug

5. Update gnome-session* to 2.26.1-1.fc11 (latest in rawhide right now)

-> Bug is still there

6. Back-out gnome-session* to 2.26.0-2.fc11 (rpm --oldpackage)

-> Bug goes away

7. Update gnome-session* to 2.26.0.90-1.fc11 again

-> The bug comes back again
   (seems harder to exercise, but that just might be me losing patience)

Exercising the bug seems to be a matter of just doing enough stuff, e.g., login, open windows, close windows, browse the net, run a few other apps (gedit, gcalc, whatever).  *Then* shut down or reboot to see the bug.

Comment 15 Chris Schumann 2009-04-16 22:13:23 UTC

This also happens to me. How can I be helpful?

Comment 16 Allen Kistler 2009-04-16 23:23:03 UTC

Chris:

If you downgrade to the versions of gnome-session and gnome-session-xsession from http://koji.fedoraproject.org/koji/buildinfo?buildID=96666, does the problem go away?

Comment 17 Matthias Clasen 2009-04-17 02:14:43 UTC

I have failed to reproduce this so far.

_gdm_session_conversation_stopped has been introduced with the pam-multistack work, so maybe something is going wrong there.

Comment 18 Allen Kistler 2009-04-17 03:51:01 UTC

The thing I find most challenging with this bug is that it doesn't *always* happen.  Sometimes I can exit runlevel 5 and have everything work smoothly.  Of course, many times not as smoothly.

I've been concentrating on the components listed on the attached yum log, since after I updated April 10 is when I noticed the bug.  Would the pam-multistack work affect some of those components more after April 10 than before?

Comment 19 Allen Kistler 2009-04-17 07:54:40 UTC

It turns out my strategy for finding the responsible component is invalid.  The bug wasn't introduced in the 10 Apr updates to rawhide.  I just made the bug happen with a fresh install from the Beta DVD.

Comment 20 Allen Kistler 2009-04-18 07:02:04 UTC

Created attachment 340128 [details]
/var/log/messages on failed restart with 04-17-09 rawhide

Updated everything to rawhide current.  I noticed gdm has a debug switch, so I turned it on.  Attaching the latest /var/log/messages for when "Restart" didn't even exit the GUI.  FWIW, there's no "gdm_session_conversation_stopped" in this log.  There is rc6 getting killed, though, *before* the runlevel can be set.

# runlevel
N 5  <-- not 5 6 !!

Comment 21 Allen Kistler 2009-04-28 18:13:15 UTC

Changing component to upstart.  When the rcX init scripts are getting killed before they can complete when dropping levels (or going to 6) and when it doesn't seem like it's a crash in any component, upstart is probably a better place to start looking.

Comment 22 Jason Farrell 2009-04-28 19:27:42 UTC

I've been seeing this same incomplete-shutdown behavior in about 1 in 5 reboots. I notice 'upstart' was updated today, but nothing in the changelog is related.

Comment 23 Adam Pribyl 2009-04-29 10:42:10 UTC

*** Bug 495444 has been marked as a duplicate of this bug. ***

Comment 24 Bill Davidsen 2009-04-30 13:36:47 UTC

I saw similar, as noted in the "Can't stop log" thread on the testers list.
Behaved thus:
- GNOME system->shutdown->shutdown_button fails
- "shutdown -h now" using sudo or similar works
- logout and using [SHUTDOWN] button works

Messages in /var/log/messages indicate permissions error:
  gnome-session[1652]: WARNING: Unable to stop system: Not privileged for action:   org.freedesktop.consolekit.system.stop no
which looks like SElinux.

the audit log:
  type=SYSCALL msg=audit(1241012799.315:30): arch=40000003 syscall=102 success=no exit=-13 a0=1 a1=bf9fb880 a2=2faff4 a3=3 items=0 ppid=5313 pid=5336 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="shutdown" exe="/sbin/shutdown"

This may or may not be the same issue

Comment 25 Allen Kistler 2009-04-30 18:13:00 UTC

I put SELinux in permissive mode long ago for the F11 pre-release versions so that any denials wouldn't cause problems.

The bug still exists in permissive mode.  I haven't tried to track anything down for several days, including anything with the Preview DVD.  If there's now an SELinux denial, I'd say it's probably a separate issue that occurs concurrently, but I'll see what I see later today or maybe tomorrow.

I still find it suggestive that rc0/rc3/rc6 is getting killed by a TERM signal before it can complete changing levels.  I've even seen it get killed before it can complete setting the runlevel, which in upstart is now little more than some bookkeeping variable.

Comment 26 Bill Nottingham 2009-04-30 23:27:45 UTC

So, if people are seeing this, can they 1) enable sysrq 2) do sysrq-t and sysrq-p 3) post their resulting /var/log/messages?

In the two cases I've seen so far, there's been a hang in the kernel in the VT switch code.

Comment 27 Allen Kistler 2009-05-01 00:17:44 UTC

Re: Comment #26
For those who've never used SysRq before, see
http://fedoraproject.org/wiki/QA/Sysrq

Comment 28 Jason D. Clinton 2009-05-01 01:19:49 UTC

Re: Comment #26

AFAICT from the replies and original bug description, this bug is about the shutdown process aborting, not the machine locking up. If you press the power button--for example--ACPI initiates the shutdown sequence in spite of the first try starting and then failing.

Comment 29 Jason Farrell 2009-05-01 03:20:35 UTC

OK - so I just spent the last hour rebooting my netbook at taking notes. Hopefully it helps.

Summary: combinations of nomodeset/rhgb don't seem to matter. shutdown only randomly fails to complete if initiated after logging in, and never from gdm itself even after a login/logout.

And here's my notes dump with boot failure tally and other observations:

*** tests with KMS; "rhgb quiet" default:
10/10 successful reboots from initial gdm
5/5 successful reboots from gdm after one login/logout cycle
* for the 15 successful reboots above: plymouth shutdown animation always stops animating after a second or two (progress bar stuck at ~3%), but still shuts down fine.

2/14 successful reboots after gnome login
* for the 12 failed shutdowns the behavior differed slightly:
6/14 kicked to text console still on VT1 with shutdown broadcast message visible. keyboard input sometimes still echoed, and if so ctrl-alt-delete continues reboot; if not, then sysrq-b or coldboot required.
4/14 plymouth shutdown progress animates forever, but never shuts down.
if you manually switch from VT1 to VT2, plymouth animation still overlayed over console.
1/14 never see text VT or plymouth; stuck looking at leonidas backround;
can still manually VT switch and c-a-d.
1/14 black background; can VT switch and c-a-d.

*** tests without KMS; "rhgb quiet nomodeset"
6/8 successful reboots after gnome login
1/8 stuck at pretty text VT until c-a-d
1/8 stuck at leonidas background until c-a-d
*** tests without KMS or rhgb/plymouth; just "nomodeset"
4/5 successful reboots after gnome login
1/5 stuck at standard 80x25 text VT until c-a-d

Comment 30 Jason Farrell 2009-05-01 03:33:04 UTC

Created attachment 342035 [details]
/var/log/messages from bootup until sysrq at failed shutdown

this is for the last boot, where the "rhgb quiet" default was replaced with just "nomodeset"

Comment 31 Bill Nottingham 2009-05-01 20:29:25 UTC

*** Bug 498514 has been marked as a duplicate of this bug. ***

Comment 32 Bill Nottingham 2009-05-01 20:32:20 UTC

http://git.fedorahosted.org/git/?p=initscripts.git;a=commitdiff;h=64e4a7e85366875a1f985f011d58273bc8aba875

Wow, that was irritating. initscripts-8.95-1 coming tomorrow.

Comment 33 Michal Jaegermann 2009-05-01 22:19:31 UTC

After I applied a patch from comment #32 my shutdown problems detailed in bug 498514 so far indeed look like gone; and I see only one "The system is going down for reboot NOW!" broadcast message instead of three or four.  Thanks!

Comment 34 Michael von der Heide 2009-05-02 07:40:53 UTC

Patching /etc/event.d/rc? as shown in comment #32 solves it for me, too. Thanks.

Comment 35 Allen Kistler 2009-05-04 20:56:17 UTC

I didn't see the rev show up in rawhide, so I snagged it out of Koji.  As before, the challenging thing is that it only happens sometimes.  So if the bug seems to go away, you can't prove it's gone.  But if it happens, you can prove it's not gone.

It's not gone.  It seemed to work for a while, then ...

First bug occurrence:
I tried "telinit 3" from a gnome-terminal window.
The screen blanked.  The machine was unresponsive.
This may be the VT-switch bug that Bill mentioned in Comment #26.

Second bug occurrence:
I tried "telinit 3" from a gnome-terminal window.  X did not exit.
"Shut down" was no longer an option on the main menu.
I tried "Log out" instead.  I was left with a screen filled with my background and a working mouse, but nothing to click on.  I could not switch away with Alt-Ctrl-Fn.

I'll attach some files I collected and change Status to ASSIGNED again.
(MODIFIED -> CLOSED -> ASSIGNED, 'cuz that's the only way, argh)

Comment 36 Allen Kistler 2009-05-04 21:00:12 UTC

Created attachment 342385 [details]
/var/log/messages with Magic SysRq info (from possible VT-switch bug?)

Comment 37 Allen Kistler 2009-05-04 21:02:11 UTC

Created attachment 342386 [details]
Screenshot of X with runlevel reporting 3 in gnome-terminal

Comment 38 Allen Kistler 2009-05-04 21:24:08 UTC

Created attachment 342389 [details]
/var/log/messages from failed X exit

Significant times in the attached messages file

15:18 I try to go from runlevel 5 to runlevel 3
      gdm dumps; tty1 starts dying/respawning continuously
      note: the rcX script and other tty's are *not* TERMed (improvement!)
      Argh.  After taking a screenshot, Magic SysRq just keeps wanting to take
         more screenshots instead of doing magic.  I snag "ps -ef" instead.

15:33 I insert a flash drive to start getting some files to upload here

15:35 I unplug the flash drive and log out of X to a featureless background

15:36 Magic SysRq still refuses to work, although I don't find that out until
         I read the log (It *is* enabled in sysctl.conf.)
      I hit the power button and let ACPI trigger a clean shutdown

Comment 39 Allen Kistler 2009-05-04 21:27:42 UTC

Created attachment 342390 [details]
"ps -ef" output after failed X exit

Comment 40 Bill Nottingham 2009-05-05 21:29:59 UTC

The reason that tty1 is dying/respawning is that it's ending up running on the same VT as the still-running X. That's not going to work at all.

Comment 41 Bill Nottingham 2009-05-05 21:44:31 UTC

I'm closing this bug; while I have reproduced what you're seeing, it's a completely different issue unrelated to the first - gdm crashes before it can kill X and the session. Please file a new bug against gdm; if I get a stack trace, I'll file and mention it here.

Comment 42 Bill Nottingham 2009-05-05 21:58:44 UTC

Bug 499272 filed for the gdm issue.

Comment 43 Bill Nottingham 2009-05-05 22:03:52 UTC

*** Bug 499224 has been marked as a duplicate of this bug. ***