Bug 495326
Summary: | Exiting runlevel 5 does not exit X | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Allen Kistler <ackistler> |
Component: | initscripts | Assignee: | Bill Nottingham <notting> |
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | rawhide | CC: | benjavalero, bgamari, bnocera, cdahlin, chris, dakingun, davidsen, farrellj, hez, jfrieben, jlayton, jmccann, josep.puigdemont, kyleki, mclasen, me, michael+konten, michal, mtasaka, notting, rpjday, russell, rvokal, wtogami |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-05-05 21:44:31 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 446452 | ||
Attachments: |
Description
Allen Kistler
2009-04-12 00:24:47 UTC
Created attachment 339200 [details]
rc6 kills itself (twice) - short form
Apr 10 14:43:49 localhost init: rc6 main process (7176) killed by TERM signal
Apr 10 14:43:49 localhost init: rc6 main process (7190) killed by TERM signal
Created attachment 339201 [details]
rc6 kills itself (twice) - long form
Apr 11 15:57:26 localhost init: rc6 main process (3648) killed by TERM signal
Apr 11 15:57:26 localhost init: rc6 main process (3656) killed by TERM signal
Created attachment 339202 [details]
rc0 kills itself (twice)
Apr 10 14:52:03 localhost init: rc0 main process (3287) killed by TERM signal
Apr 10 14:52:03 localhost init: rc0 main process (3298) killed by TERM signal
Created attachment 339203 [details]
yum.log from 10 Apr 14:09 to 14:28
I've noticed something that might be related. Running "init 3" from a GNOME desktop "xterm" might leave the entire desktop running happily, even though the system actually thinks it's in run level 3: # runlevel 5 3 # Even going to run level one might leave GNOME running as well. On other occasions, doing "init 3" locks up the system, requiring a power down and reboot. If I can track this down more explicitly, I'll add that here. The bug just happened to me again shutting down. So I logged in over ssh and ... $ runlevel 5 0 ... which I kind of expected. It *is* supposed to be shutting down, after all. Then I tried seeing if the bug happened going from runlevel 5 to runlevel 3. [First attempt] Without nomodeset, the screen filled up with graphics garbage (not text garbage) and any time I touched the keyboard, I got ... Apr 13 00:29:59 localhost /sbin/mingetty[3411]: tty1: invalid character 0x1c in login name Apr 13 00:30:04 localhost init: tty1 main process (3411) terminated with status 1 Apr 13 00:30:04 localhost init: tty1 main process ended, respawning ... in /var/log/messages. [Second attempt] With nomodeset (I'm not sure it's related though ... read on), nothing appeared to happen. /var/log/messages had a dump from gdm in it, though. (I'll attach it.) "ps -ef | grep gdm" reveals that gdm is *not* running. The main menu still works, and I could launch gedit. So I tried logging out. Most of the desktop went away, and I was left with just the background and a mouse, but nothing to click on. /var/log/messages had a few lines: Apr 13 00:42:36 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie Apr 13 00:43:28 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie Apr 13 00:43:28 localhost /sbin/mingetty[3275]: tty1: invalid character 0xb6 in login name Apr 13 00:43:33 localhost init: tty1 main process (3275) terminated with status 1 Apr 13 00:43:33 localhost init: tty1 main process ended, respawning Apr 13 00:44:00 localhost /sbin/mingetty[3361]: tty1: invalid character 0x12 in login name Apr 13 00:44:05 localhost init: tty1 main process (3361) terminated with status 1 Apr 13 00:44:05 localhost init: tty1 main process ended, respawning Apr 13 00:44:08 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie Apr 13 00:44:08 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie Apr 13 00:44:09 localhost gnome-session[2694]: WARNING: Unable to determine session: Unable to find session for cookie Last ditch, at this point I tried alt-ctrl-bksp to kill X. I got the same screen full of garbage as in {First attempt} above. /var/log/messages had: Apr 13 00:52:25 localhost kernel: [drm] Resetting GPU Apr 13 00:52:29 localhost init: tty1 main process (3365) terminated with status 1 Apr 13 00:52:29 localhost init: tty1 main process ended, respawning Any time I touch the keyboard, I get the same 3 lines I have above for "any time I touched the keyboard" in [First attempt]. So at this point, it's the same place. Repeating "telinit 3" from within the ssh session has no effect, either on what's visible or in /var/log/messages. The bottom line is that I'm now wondering if this is a Gnome or x11-server problem. Created attachment 339280 [details]
gdm dump, changing from runlevel 5 to runlevel 3
I backed out xorg-x11-server and plymouth to their previous versions. The problem went away AFAICT. (It's an intermittent problem, so it could have just hidden.) Update xorg-x11-server. Still no problem. Back it out again. Update plymouth. ==> PROBLEM <== Update xorg-x11-server, back-out plymouth. Problem goes away again. I'm not certain what role plymouth plays in shutting down or otherwise *exiting* runlevel 5, but it looks like the culprit to me. Changing this report's component to plymouth. The apparently broken version of plymouth is plymouth-0.7.0-0.2009.03.10.3. The previous version was plymouth-0.7.0-0.2009.03.10.2, which seems okay. I just had the bug again, despite plymouth being backed out, so it's probably not plymouth. Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL Apr 13 05:11:07 localhost console-kit-daemon[2295]: GLib-GObject-WARNING: IA__g_object_get_valist: value location for `gchararray' passed as NULL Apr 13 05:11:07 localhost init: tty4 main process (2524) killed by TERM signal Apr 13 05:11:07 localhost init: tty5 main process (2525) killed by TERM signal Apr 13 05:11:07 localhost init: tty6 main process (2530) killed by TERM signal Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-CRITICAL: g_object_ref: assertion `G_IS_OBJECT (object)' failed Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-WARNING: invalid unclassed pointer in cast to `GdmSession' Apr 13 05:11:07 localhost gdm-simple-slave[2583]: CRITICAL: _gdm_session_conversation_stopped: assertion `GDM_IS_SESSION (session)' failed Apr 13 05:11:07 localhost gdm-simple-slave[2583]: GLib-GObject-CRITICAL: g_object_unref: assertion `G_IS_OBJECT (object)' failed Apr 13 05:11:07 localhost restorecond: Unable to watch (/root/.ssh/*) No such file or directory Apr 13 05:11:07 localhost init: tty2 main process (2527) killed by TERM signal Apr 13 05:11:07 localhost init: tty3 main process (2529) killed by TERM signal Apr 13 05:11:07 localhost init: rc0 main process (3375) killed by TERM signal Apr 13 05:11:07 localhost kernel: [drm] Resetting GPU I'm officially too tired to try another candidate right now. Maybe later. Still trying to figure out what component is causing the freaky behavior, I ... 1. wiped and reinstalled from the Beta DVD 2. updated everything to rawhide except ConsoleKit* libdrm (because I saw some other complaints about it elsewhere) mesa-dri-drivers (because it needs the newer libdrm) mesa-LibGL (because it needs the newer libdrm) The bug still occurs in this setup. So basically the problem is also *not* in ConsoleKit or libdrm. If anybody has a good idea where it really is, I'll happily assign it to the right component. In the meantime, I'll try a few components per day. *** Bug 495355 has been marked as a duplicate of this bug. *** After much beating on my test install, here's my latest finding. If I back out gnome-session and gnome-session-xsession from 2.26.0.90-1.fc11 to 2.26.0-2.fc11, the bug disappears. At least, I can't make it happen. Mysteriously, if I re-update to 2.26.0.90-1.fc11, the bug stays gone. So: 1. I will reinstall again from the beta DVD and see if I can recreate the bug using a local copy I've got of 2.26.0.90-1.fc11. (I reinstalled twice before. Why not a third time?) 2. I noticed there's a newer version of gnome-session in rawhide today. I'll see if that version fixes anything, too. In the meantime, I'm changing the reported component to gnome-session. 1. Clean install from the Beta DVD 2. Update to all the current rawhide, except for gnome-session* 3. Update gnome-session* to 2.26.0-2.fc11 -> No bug 4. Update gnome-session* to 2.26.0.90-1.fc11 -> Bug 5. Update gnome-session* to 2.26.1-1.fc11 (latest in rawhide right now) -> Bug is still there 6. Back-out gnome-session* to 2.26.0-2.fc11 (rpm --oldpackage) -> Bug goes away 7. Update gnome-session* to 2.26.0.90-1.fc11 again -> The bug comes back again (seems harder to exercise, but that just might be me losing patience) Exercising the bug seems to be a matter of just doing enough stuff, e.g., login, open windows, close windows, browse the net, run a few other apps (gedit, gcalc, whatever). *Then* shut down or reboot to see the bug. This also happens to me. How can I be helpful? Chris: If you downgrade to the versions of gnome-session and gnome-session-xsession from http://koji.fedoraproject.org/koji/buildinfo?buildID=96666, does the problem go away? I have failed to reproduce this so far. _gdm_session_conversation_stopped has been introduced with the pam-multistack work, so maybe something is going wrong there. The thing I find most challenging with this bug is that it doesn't *always* happen. Sometimes I can exit runlevel 5 and have everything work smoothly. Of course, many times not as smoothly. I've been concentrating on the components listed on the attached yum log, since after I updated April 10 is when I noticed the bug. Would the pam-multistack work affect some of those components more after April 10 than before? It turns out my strategy for finding the responsible component is invalid. The bug wasn't introduced in the 10 Apr updates to rawhide. I just made the bug happen with a fresh install from the Beta DVD. Created attachment 340128 [details]
/var/log/messages on failed restart with 04-17-09 rawhide
Updated everything to rawhide current. I noticed gdm has a debug switch, so I turned it on. Attaching the latest /var/log/messages for when "Restart" didn't even exit the GUI. FWIW, there's no "gdm_session_conversation_stopped" in this log. There is rc6 getting killed, though, *before* the runlevel can be set.
# runlevel
N 5 <-- not 5 6 !!
Changing component to upstart. When the rcX init scripts are getting killed before they can complete when dropping levels (or going to 6) and when it doesn't seem like it's a crash in any component, upstart is probably a better place to start looking. I've been seeing this same incomplete-shutdown behavior in about 1 in 5 reboots. I notice 'upstart' was updated today, but nothing in the changelog is related. *** Bug 495444 has been marked as a duplicate of this bug. *** I saw similar, as noted in the "Can't stop log" thread on the testers list. Behaved thus: - GNOME system->shutdown->shutdown_button fails - "shutdown -h now" using sudo or similar works - logout and using [SHUTDOWN] button works Messages in /var/log/messages indicate permissions error: gnome-session[1652]: WARNING: Unable to stop system: Not privileged for action: org.freedesktop.consolekit.system.stop no which looks like SElinux. the audit log: type=SYSCALL msg=audit(1241012799.315:30): arch=40000003 syscall=102 success=no exit=-13 a0=1 a1=bf9fb880 a2=2faff4 a3=3 items=0 ppid=5313 pid=5336 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="shutdown" exe="/sbin/shutdown" This may or may not be the same issue I put SELinux in permissive mode long ago for the F11 pre-release versions so that any denials wouldn't cause problems. The bug still exists in permissive mode. I haven't tried to track anything down for several days, including anything with the Preview DVD. If there's now an SELinux denial, I'd say it's probably a separate issue that occurs concurrently, but I'll see what I see later today or maybe tomorrow. I still find it suggestive that rc0/rc3/rc6 is getting killed by a TERM signal before it can complete changing levels. I've even seen it get killed before it can complete setting the runlevel, which in upstart is now little more than some bookkeeping variable. So, if people are seeing this, can they 1) enable sysrq 2) do sysrq-t and sysrq-p 3) post their resulting /var/log/messages? In the two cases I've seen so far, there's been a hang in the kernel in the VT switch code. Re: Comment #26 For those who've never used SysRq before, see http://fedoraproject.org/wiki/QA/Sysrq Re: Comment #26 AFAICT from the replies and original bug description, this bug is about the shutdown process aborting, not the machine locking up. If you press the power button--for example--ACPI initiates the shutdown sequence in spite of the first try starting and then failing. OK - so I just spent the last hour rebooting my netbook at taking notes. Hopefully it helps. Summary: combinations of nomodeset/rhgb don't seem to matter. shutdown only randomly fails to complete if initiated after logging in, and never from gdm itself even after a login/logout. And here's my notes dump with boot failure tally and other observations: *** tests with KMS; "rhgb quiet" default: 10/10 successful reboots from initial gdm 5/5 successful reboots from gdm after one login/logout cycle * for the 15 successful reboots above: plymouth shutdown animation always stops animating after a second or two (progress bar stuck at ~3%), but still shuts down fine. 2/14 successful reboots after gnome login * for the 12 failed shutdowns the behavior differed slightly: 6/14 kicked to text console still on VT1 with shutdown broadcast message visible. keyboard input sometimes still echoed, and if so ctrl-alt-delete continues reboot; if not, then sysrq-b or coldboot required. 4/14 plymouth shutdown progress animates forever, but never shuts down. if you manually switch from VT1 to VT2, plymouth animation still overlayed over console. 1/14 never see text VT or plymouth; stuck looking at leonidas backround; can still manually VT switch and c-a-d. 1/14 black background; can VT switch and c-a-d. *** tests without KMS; "rhgb quiet nomodeset" 6/8 successful reboots after gnome login 1/8 stuck at pretty text VT until c-a-d 1/8 stuck at leonidas background until c-a-d *** tests without KMS or rhgb/plymouth; just "nomodeset" 4/5 successful reboots after gnome login 1/5 stuck at standard 80x25 text VT until c-a-d Created attachment 342035 [details]
/var/log/messages from bootup until sysrq at failed shutdown
this is for the last boot, where the "rhgb quiet" default was replaced with just "nomodeset"
*** Bug 498514 has been marked as a duplicate of this bug. *** http://git.fedorahosted.org/git/?p=initscripts.git;a=commitdiff;h=64e4a7e85366875a1f985f011d58273bc8aba875 Wow, that was irritating. initscripts-8.95-1 coming tomorrow. After I applied a patch from comment #32 my shutdown problems detailed in bug 498514 so far indeed look like gone; and I see only one "The system is going down for reboot NOW!" broadcast message instead of three or four. Thanks! Patching /etc/event.d/rc? as shown in comment #32 solves it for me, too. Thanks. I didn't see the rev show up in rawhide, so I snagged it out of Koji. As before, the challenging thing is that it only happens sometimes. So if the bug seems to go away, you can't prove it's gone. But if it happens, you can prove it's not gone. It's not gone. It seemed to work for a while, then ... First bug occurrence: I tried "telinit 3" from a gnome-terminal window. The screen blanked. The machine was unresponsive. This may be the VT-switch bug that Bill mentioned in Comment #26. Second bug occurrence: I tried "telinit 3" from a gnome-terminal window. X did not exit. "Shut down" was no longer an option on the main menu. I tried "Log out" instead. I was left with a screen filled with my background and a working mouse, but nothing to click on. I could not switch away with Alt-Ctrl-Fn. I'll attach some files I collected and change Status to ASSIGNED again. (MODIFIED -> CLOSED -> ASSIGNED, 'cuz that's the only way, argh) Created attachment 342385 [details]
/var/log/messages with Magic SysRq info (from possible VT-switch bug?)
Created attachment 342386 [details]
Screenshot of X with runlevel reporting 3 in gnome-terminal
Created attachment 342389 [details]
/var/log/messages from failed X exit
Significant times in the attached messages file
15:18 I try to go from runlevel 5 to runlevel 3
gdm dumps; tty1 starts dying/respawning continuously
note: the rcX script and other tty's are *not* TERMed (improvement!)
Argh. After taking a screenshot, Magic SysRq just keeps wanting to take
more screenshots instead of doing magic. I snag "ps -ef" instead.
15:33 I insert a flash drive to start getting some files to upload here
15:35 I unplug the flash drive and log out of X to a featureless background
15:36 Magic SysRq still refuses to work, although I don't find that out until
I read the log (It *is* enabled in sysctl.conf.)
I hit the power button and let ACPI trigger a clean shutdown
Created attachment 342390 [details]
"ps -ef" output after failed X exit
The reason that tty1 is dying/respawning is that it's ending up running on the same VT as the still-running X. That's not going to work at all. I'm closing this bug; while I have reproduced what you're seeing, it's a completely different issue unrelated to the first - gdm crashes before it can kill X and the session. Please file a new bug against gdm; if I get a stack trace, I'll file and mention it here. Bug 499272 filed for the gdm issue. *** Bug 499224 has been marked as a duplicate of this bug. *** |