323501 – rhgb causes "xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call" on real X startup

Bug 323501 - rhgb causes "xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call" on real X startup

Summary: rhgb causes "xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call" ...

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rhgb
Sub Component:
Version:	8
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Adam Jackson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	F8Blocker
TreeView+	depends on / blocked

Reported:	2007-10-08 18:14 UTC by Linus Torvalds
Modified:	2007-11-30 22:12 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-10-09 21:06:15 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Proposed example fix (738 bytes, patch) 2007-10-09 19:02 UTC, Linus Torvalds	no flags	Details \| Diff
View All

Description Linus Torvalds 2007-10-08 18:14:37 UTC

Description of problem:

One one of my machines that I'm testing Fedora 8 on, the startup of KDM
sometimes fails spectacularly, with X startup getting stuck, and eventually
exiting with 

    Fatal server error:
    xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call

and this seems to happen randomly, but very consistently with just the
right kernels (ie it seems to be very timing-dependent).

I have chased it down, and what happens is that the new X server is trying
to switch to its newly allocated virtual console, but the old X server that
was started by rhgb is exiting at the same time, and apparently depending
on pure luck (or lack of it), the old X server will switch back to the
original VT (VT#0) JUST AFTER the new X server has tried to switch to its
new VT!

The end result is that when the new X server waits for that switch to take
effect, it will never happen (since we did actually switch to it, but then
the old X server switched back!), and it will wait forever - or rather, it
will wait until a timeout signal comes in, and at that point it will exit
with the above error!

Version-Release number of selected component (if applicable):

This is the current testin gFedora 8 as of yesterday or something like that:

    rhgb-0.17.7-1.fc8
    xorg-x11-server-Xorg-1.3.0.0-29.fc8

How reproducible:

Not very reproducible: I can apparently make it happen about 50% of the time
on one particular machine (UP 1.5GHz Pentium-M laptop) with one particular
kernel configuration. I think it has to hit just the right timing window.

Steps to Reproduce:
1. Get right timing
2. ?
3. Profit!
  
Actual results:

KDM startup fails, we never get a graphical environment, I need to
manually re-do the "init 5" to start up KDM.

Expected results:

KDM and X should just start up.

Additional info:

I literally have a kernel trace (added printk's to my kernel) showing
very clearly that the old X server switches back to VT 0 *after* the new
X server has requested a console switch to its new VT.

A working trace looks like this (pid 1124 is the old X server, while
pid 6 is the kernel eventd that does the switch, and pid 1994 is the
new one):

        set_console (1124): want_console=0
        console_callback 1 (6): want_console=0
        6: change_console
        6 complete_change_console
        console_callback 2 (6): want_console=-1
        1994 reset_vc
        set_console (1994): want_console=6
        console_callback 1 (6): want_console=6
        6: change_console
        6 complete_change_console
        console_callback 2 (6): want_console=-1

while a broken trace looks like this (now 1993 is the new X server, and
1126 is the old one):

        1993 reset_vc
        set_console (1993): want_console=6
        console_callback 1 (6): want_console=6
        6: change_console
        console_callback 2 (6): want_console=-1
        set_console (1126): want_console=0
        console_callback 1 (6): want_console=0
        6: change_console
        6 complete_change_console
        console_callback 2 (6): want_console=-1

ie it clearly shows how the old X server (1126) switched back to console 0
*after* the new X server had done its "want_console=6".

This may be an unusual error, but when it happens, it effectively makes the
machine unusable for anybody who doesn't know how to restart KDM. Which is
why I'm marking it high-severity.

Comment 1 Ray Strode [halfline] 2007-10-09 15:16:44 UTC

We've hit this bug before and I thought we fixed it.

RHGB is supposed to wait until the X server completely exits before replying to
the rhgb-client --quit request.  KDM start won't start until after the
rhgb-client --quit call finishes.  The X server is supposed to be blocking until
the VT_WAITACTIVE completes before it exits.  So it *should* be all
synchronized, not sure why it's falling over.

I wonder if the VT_WAITACTIVE call (the one in the rhgb shutdown path) is
failing in the X server (or returned early), and so the X server is exiting too
soon.

Comment 2 Adam Jackson 2007-10-09 15:22:33 UTC

Orthogonal to this, we could really use a better VT switch API.  Please, please,
pretty please.  But.

The X server will do a full VT_ACTIVATE/VT_WAITACTIVE pair at startup, to
request a new tty and then wait until it's been switched to.  It will also do
the same thing on server exit, to switch back to the VT it started from.  rhgb
will wait until the X server it launched has completely exited before itself
returning.

So this should never happen unless kdm is trying to start an X server before
rhgb has actually exited.  The race your seeing shows the X server for kdm
trying to switch to tty7, and then the X server for rhgb trying to release back
to tty1.  This implies something funky in kdm startup, since we definitely fixed
this for the gdm case, and it should be doing the same thing for kdm; are you
starting your display manager in some weird way?

Alternatively, rhgb-client --quit might be returning before rhgb has actually
exited; I'll check.

The rhgb X server would have voluntarily switched away though, so in principle
it could just exit without trying to VT switch if it's already switched away.  I
suppose by that point in the exit path the driver has already done all the
cleanup it's going to do, so it should be safe, and it shouldn't have any
cleanup to do anyway since it's already off the hardware.  This would be a bug
for fast-user-switching too: A logs in, B logs in, A's server exits and kicks B
off the display.

Comment 3 Linus Torvalds 2007-10-09 15:53:23 UTC

Ajax: I can definitely not argue with you about the horridness of the
VT_ACTIVATE/WAITACTIVE interfaces. They are inherited from SVR4, basically from
back in the days when X originally got ported to Linux, and being compatible
with other systems was a primary goal, rather than trying to do anything sane.

But we're kind of stuck with it. 

And no, I'm not doing anything strange at all with kdm. This is a bog-standard
Fedora 8 box, with pretty much everything updated with "yum -y upgrade" every
once in a while. The exception, of course, is the kernel (and git), which I
always end up recompiling from sources for obvious reasons.

The original install was from the Fedora 8 test2 KDE LiveCD, but while I'm not
at all familiar with the init scripts, it looks like kdm gets started directly
by init calling out to /etc/X11/prefdm, which then exec's /usr/bin/kdm.

That same script seems to do the "/usr/bin/rhgb-client --quit" too, and I'm
now rebooting the machine to see if adding a "sleep 1" after that thing makes
any difference.

.. time passes ..

Hmm. I rebooted twice, and it worked both times. My kernel debug messages were
still active, so I could see the timestamps in /var/log/messages, and now there
did seem to be a seconds pause in between the old X exit and the new X startup.
But since this was never 100% reliable to start with, I cannot guarantee that it
makes a difference, but it sure looks suspicious.

That added "sleep 1" is obviously a total hack, and I'm in no way advocating it
as a solution. It was just meant to be a "does this change behavior" test.

Comment 4 Linus Torvalds 2007-10-09 16:05:19 UTC

Side note: are the rhgb sources available in some sane format (and no, I don't
think a source RPM is a sane format)? I'll happily try things out myself, but
google didn't show anything really obvious.

Comment 5 Ray Strode [halfline] 2007-10-09 17:28:58 UTC

export CVSROOT=:pserver:anonymous.redhat.com:/cvs/pkgs
cvs login
[press enter]
cvs co rhgb
cd rhgb/devel
make local

That should set up a source tree and build it

Comment 6 Linus Torvalds 2007-10-09 19:00:13 UTC

Ok, I just strace'd rhgb, and the thing is totally buggy.

Here's the snipped from the strace when it gets the "quit" command from rhgb-client:

   read(13, "quit\0", 1024)  = 5
   read(13, 0x864401d, 1024) = -1 EAGAIN (Resource temporarily unavailable)
   write(13, "done\0", 5)    = 5
   close(13)                 = 0
   close(13)                 = -1 EBADF (Bad file descriptor)

in other words, it does NOT wait at all for the X server to shut down. Not in
any way, shape, or form.

What happens afterwards is:

   select(9, [8], [8], NULL, NULL) = 1 (out [8])
   writev(8, [{"+\7\1\0", 4}], 1)  = 4
   select(9, [8], [], NULL, NULL)  = 1 (in [8])
   read(8, "\1\0...", 4096)        = 32
   read(8, 0x8073990, 4096)        = -1 EAGAIN
   kill(10649, SIGTERM)            = 0
   waitpid(10649, NULL, 0)         = 10649
   --- SIGCHLD (Child exited) @ 0 (0) ---
   sigreturn()                     = ? (mask now [])

ie it does end up sending the SIGTERM, but only long after it has already told
the client that it is "done", and apparently after it has also done some final X
operations in the main loop. No wonder this thing doesn't work.

I'm attaching a totally untested patch that is probably not the right thing to
do either, but it is an example of how this thing might be fixable. Depending on
that "g_at_exit()" thing to reap things is useless, that is way way too late,
and I think the thing needs to wait for the child synchronously and explicitly,
instead of playing any clever games.

Comment 7 Linus Torvalds 2007-10-09 19:02:36 UTC

Created attachment 221661 [details]
Proposed example fix

NOTE! Untested! I didn't even bother compiling it! Caveat emptor!

This is meant as a "this approach may work" kind of thing, not necessarily as a
working fix. The code and the RH build system is so strange that I'm not going
to really try to fight it any further than importing it into git and generating
a patch that looks like it might just work.

Comment 8 Adam Jackson 2007-10-09 19:29:38 UTC

Yeah, that looks plausible.  And I could easily believe that gdm is just slow
enough to launch that we're just getting lucky.

I'll build a test RPM with a variation of that patch.

Comment 9 Adam Jackson 2007-10-09 19:53:01 UTC

So with the RPMs here:

http://koji.fedoraproject.org/scratch/ajax/task_188803/

and a bootchart, I can very clearly see rhgb exiting after X exits, and the next
X server starting after rhgb exits.  The fix is just your change but with more
places converted to use got_exit_command() since we're repeating that stanza
verbatim in several places.

Please test, but I think your analysis is right.

Comment 10 Ray Strode [halfline] 2007-10-09 20:26:56 UTC

and it gets even more funny.  spot the other error (this time rhgb-client.c):

      if (strcmp (buffer, "done"))
        g_main_loop_quit (loop);

...

Comment 11 Linus Torvalds 2007-10-09 20:31:14 UTC

Ajax said: "I can very clearly see rhgb exiting after X exits, and the next
X server starting after rhgb exits." 

I hope/assume you mean that you can see rhgb-*client* exiting after X exits: the
problem was never rhgb itself, it was the client that exited too early and
allowed subsequent commands to start executing before the old X had actually exited.

So the exit time of rhgb itself is totally immaterial (and yes, that one has
always exited after X exits - since it had the "wait for child" code in its
atexit() routine).

However, I can confirm that your RPM seems to have fixed rhgb so that it returns
the "done" string only after the child (X) has exited, so that rhgb-client is
correctly serialized. And just to double-check, I also booted the machine a few
times without my hacky "sleep 1" thing to see that yes, it does fix the symptoms
of the problem that caused me to chase this down in the first place.

So it does seem fixed in your RPM's, please push it out to mainstream... I can
well imagine that gdm is so bloated that you'd never see it in practice, but I
cannot be the only one who wants to use KDE and thus kdm.

Comment 12 Adam Jackson 2007-10-09 21:06:15 UTC

Building 0.17.7-2 now.  Thanks!

Note You need to log in before you can comment on or make changes to this bug.