Description of problem: One one of my machines that I'm testing Fedora 8 on, the startup of KDM sometimes fails spectacularly, with X startup getting stuck, and eventually exiting with Fatal server error: xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call and this seems to happen randomly, but very consistently with just the right kernels (ie it seems to be very timing-dependent). I have chased it down, and what happens is that the new X server is trying to switch to its newly allocated virtual console, but the old X server that was started by rhgb is exiting at the same time, and apparently depending on pure luck (or lack of it), the old X server will switch back to the original VT (VT#0) JUST AFTER the new X server has tried to switch to its new VT! The end result is that when the new X server waits for that switch to take effect, it will never happen (since we did actually switch to it, but then the old X server switched back!), and it will wait forever - or rather, it will wait until a timeout signal comes in, and at that point it will exit with the above error! Version-Release number of selected component (if applicable): This is the current testin gFedora 8 as of yesterday or something like that: rhgb-0.17.7-1.fc8 xorg-x11-server-Xorg-1.3.0.0-29.fc8 How reproducible: Not very reproducible: I can apparently make it happen about 50% of the time on one particular machine (UP 1.5GHz Pentium-M laptop) with one particular kernel configuration. I think it has to hit just the right timing window. Steps to Reproduce: 1. Get right timing 2. ? 3. Profit! Actual results: KDM startup fails, we never get a graphical environment, I need to manually re-do the "init 5" to start up KDM. Expected results: KDM and X should just start up. Additional info: I literally have a kernel trace (added printk's to my kernel) showing very clearly that the old X server switches back to VT 0 *after* the new X server has requested a console switch to its new VT. A working trace looks like this (pid 1124 is the old X server, while pid 6 is the kernel eventd that does the switch, and pid 1994 is the new one): set_console (1124): want_console=0 console_callback 1 (6): want_console=0 6: change_console 6 complete_change_console console_callback 2 (6): want_console=-1 1994 reset_vc set_console (1994): want_console=6 console_callback 1 (6): want_console=6 6: change_console 6 complete_change_console console_callback 2 (6): want_console=-1 while a broken trace looks like this (now 1993 is the new X server, and 1126 is the old one): 1993 reset_vc set_console (1993): want_console=6 console_callback 1 (6): want_console=6 6: change_console console_callback 2 (6): want_console=-1 set_console (1126): want_console=0 console_callback 1 (6): want_console=0 6: change_console 6 complete_change_console console_callback 2 (6): want_console=-1 ie it clearly shows how the old X server (1126) switched back to console 0 *after* the new X server had done its "want_console=6". This may be an unusual error, but when it happens, it effectively makes the machine unusable for anybody who doesn't know how to restart KDM. Which is why I'm marking it high-severity.
We've hit this bug before and I thought we fixed it. RHGB is supposed to wait until the X server completely exits before replying to the rhgb-client --quit request. KDM start won't start until after the rhgb-client --quit call finishes. The X server is supposed to be blocking until the VT_WAITACTIVE completes before it exits. So it *should* be all synchronized, not sure why it's falling over. I wonder if the VT_WAITACTIVE call (the one in the rhgb shutdown path) is failing in the X server (or returned early), and so the X server is exiting too soon.
Orthogonal to this, we could really use a better VT switch API. Please, please, pretty please. But. The X server will do a full VT_ACTIVATE/VT_WAITACTIVE pair at startup, to request a new tty and then wait until it's been switched to. It will also do the same thing on server exit, to switch back to the VT it started from. rhgb will wait until the X server it launched has completely exited before itself returning. So this should never happen unless kdm is trying to start an X server before rhgb has actually exited. The race your seeing shows the X server for kdm trying to switch to tty7, and then the X server for rhgb trying to release back to tty1. This implies something funky in kdm startup, since we definitely fixed this for the gdm case, and it should be doing the same thing for kdm; are you starting your display manager in some weird way? Alternatively, rhgb-client --quit might be returning before rhgb has actually exited; I'll check. The rhgb X server would have voluntarily switched away though, so in principle it could just exit without trying to VT switch if it's already switched away. I suppose by that point in the exit path the driver has already done all the cleanup it's going to do, so it should be safe, and it shouldn't have any cleanup to do anyway since it's already off the hardware. This would be a bug for fast-user-switching too: A logs in, B logs in, A's server exits and kicks B off the display.
Ajax: I can definitely not argue with you about the horridness of the VT_ACTIVATE/WAITACTIVE interfaces. They are inherited from SVR4, basically from back in the days when X originally got ported to Linux, and being compatible with other systems was a primary goal, rather than trying to do anything sane. But we're kind of stuck with it. And no, I'm not doing anything strange at all with kdm. This is a bog-standard Fedora 8 box, with pretty much everything updated with "yum -y upgrade" every once in a while. The exception, of course, is the kernel (and git), which I always end up recompiling from sources for obvious reasons. The original install was from the Fedora 8 test2 KDE LiveCD, but while I'm not at all familiar with the init scripts, it looks like kdm gets started directly by init calling out to /etc/X11/prefdm, which then exec's /usr/bin/kdm. That same script seems to do the "/usr/bin/rhgb-client --quit" too, and I'm now rebooting the machine to see if adding a "sleep 1" after that thing makes any difference. .. time passes .. Hmm. I rebooted twice, and it worked both times. My kernel debug messages were still active, so I could see the timestamps in /var/log/messages, and now there did seem to be a seconds pause in between the old X exit and the new X startup. But since this was never 100% reliable to start with, I cannot guarantee that it makes a difference, but it sure looks suspicious. That added "sleep 1" is obviously a total hack, and I'm in no way advocating it as a solution. It was just meant to be a "does this change behavior" test.
Side note: are the rhgb sources available in some sane format (and no, I don't think a source RPM is a sane format)? I'll happily try things out myself, but google didn't show anything really obvious.
export CVSROOT=:pserver:anonymous.redhat.com:/cvs/pkgs cvs login [press enter] cvs co rhgb cd rhgb/devel make local That should set up a source tree and build it
Ok, I just strace'd rhgb, and the thing is totally buggy. Here's the snipped from the strace when it gets the "quit" command from rhgb-client: read(13, "quit\0", 1024) = 5 read(13, 0x864401d, 1024) = -1 EAGAIN (Resource temporarily unavailable) write(13, "done\0", 5) = 5 close(13) = 0 close(13) = -1 EBADF (Bad file descriptor) in other words, it does NOT wait at all for the X server to shut down. Not in any way, shape, or form. What happens afterwards is: select(9, [8], [8], NULL, NULL) = 1 (out [8]) writev(8, [{"+\7\1\0", 4}], 1) = 4 select(9, [8], [], NULL, NULL) = 1 (in [8]) read(8, "\1\0...", 4096) = 32 read(8, 0x8073990, 4096) = -1 EAGAIN kill(10649, SIGTERM) = 0 waitpid(10649, NULL, 0) = 10649 --- SIGCHLD (Child exited) @ 0 (0) --- sigreturn() = ? (mask now []) ie it does end up sending the SIGTERM, but only long after it has already told the client that it is "done", and apparently after it has also done some final X operations in the main loop. No wonder this thing doesn't work. I'm attaching a totally untested patch that is probably not the right thing to do either, but it is an example of how this thing might be fixable. Depending on that "g_at_exit()" thing to reap things is useless, that is way way too late, and I think the thing needs to wait for the child synchronously and explicitly, instead of playing any clever games.
Created attachment 221661 [details] Proposed example fix NOTE! Untested! I didn't even bother compiling it! Caveat emptor! This is meant as a "this approach may work" kind of thing, not necessarily as a working fix. The code and the RH build system is so strange that I'm not going to really try to fight it any further than importing it into git and generating a patch that looks like it might just work.
Yeah, that looks plausible. And I could easily believe that gdm is just slow enough to launch that we're just getting lucky. I'll build a test RPM with a variation of that patch.
So with the RPMs here: http://koji.fedoraproject.org/scratch/ajax/task_188803/ and a bootchart, I can very clearly see rhgb exiting after X exits, and the next X server starting after rhgb exits. The fix is just your change but with more places converted to use got_exit_command() since we're repeating that stanza verbatim in several places. Please test, but I think your analysis is right.
and it gets even more funny. spot the other error (this time rhgb-client.c): if (strcmp (buffer, "done")) g_main_loop_quit (loop); ...
Ajax said: "I can very clearly see rhgb exiting after X exits, and the next X server starting after rhgb exits." I hope/assume you mean that you can see rhgb-*client* exiting after X exits: the problem was never rhgb itself, it was the client that exited too early and allowed subsequent commands to start executing before the old X had actually exited. So the exit time of rhgb itself is totally immaterial (and yes, that one has always exited after X exits - since it had the "wait for child" code in its atexit() routine). However, I can confirm that your RPM seems to have fixed rhgb so that it returns the "done" string only after the child (X) has exited, so that rhgb-client is correctly serialized. And just to double-check, I also booted the machine a few times without my hacky "sleep 1" thing to see that yes, it does fix the symptoms of the problem that caused me to chase this down in the first place. So it does seem fixed in your RPM's, please push it out to mainstream... I can well imagine that gdm is so bloated that you'd never see it in practice, but I cannot be the only one who wants to use KDE and thus kdm.
Building 0.17.7-2 now. Thanks!