Bug 323501
Summary: | rhgb causes "xf86OpenConsole: VT_WAITACTIVE failed: Interrupted system call" on real X startup
---|---
Product: | [Fedora] Fedora
Component: | rhgb
Version: | 8
Hardware: | i386
OS: | Linux
Status: | CLOSED RAWHIDE
Severity: | high
Priority: | low
Reporter: | Linus Torvalds <torvalds>
Assignee: | Adam Jackson <ajax>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | rstrode
Doc Type: | Bug Fix
Last Closed: | 2007-10-09 21:06:15 UTC
Bug Blocks: | 235703

Description
Linus Torvalds
2007-10-08 18:14:37 UTC
We've hit this bug before and I thought we fixed it. RHGB is supposed to wait until the X server completely exits before replying to the rhgb-client --quit request. KDM won't start until after the rhgb-client --quit call finishes. The X server is supposed to be blocking until the VT_WAITACTIVE completes before it exits. So it *should* be all synchronized, not sure why it's falling over. I wonder if the VT_WAITACTIVE call (the one in the rhgb shutdown path) is failing in the X server (or returned early), and so the X server is exiting too soon.

Orthogonal to this, we could really use a better VT switch API. Please, please, pretty please.

But. The X server will do a full VT_ACTIVATE/VT_WAITACTIVE pair at startup, to request a new tty and then wait until it's been switched to. It will also do the same thing on server exit, to switch back to the VT it started from. rhgb will wait until the X server it launched has completely exited before itself returning. So this should never happen unless kdm is trying to start an X server before rhgb has actually exited.

The race you're seeing shows the X server for kdm trying to switch to tty7, and then the X server for rhgb trying to release back to tty1. This implies something funky in kdm startup, since we definitely fixed this for the gdm case, and it should be doing the same thing for kdm; are you starting your display manager in some weird way? Alternatively, rhgb-client --quit might be returning before rhgb has actually exited; I'll check.

The rhgb X server would have voluntarily switched away though, so in principle it could just exit without trying to VT switch if it's already switched away. I suppose by that point in the exit path the driver has already done all the cleanup it's going to do, so it should be safe, and it shouldn't have any cleanup to do anyway since it's already off the hardware.

This would be a bug for fast-user-switching too: A logs in, B logs in, A's server exits and kicks B off the display.

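For readers unfamiliar with the interface, the VT_ACTIVATE/VT_WAITACTIVE pair described above can be sketched roughly as follows. This is a minimal illustration on Linux, not the X server's actual code and not the fix discussed in this bug: `retry_on_eintr`, `struct vt_request` and `switch_vt_and_wait` are hypothetical names, and retrying on EINTR is only one plausible way to handle the "Interrupted system call" failure from the summary.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vt.h>

/* Hypothetical helper: repeat op(arg) for as long as it fails with EINTR. */
int retry_on_eintr(int (*op)(void *), void *arg)
{
    int rc;
    do {
        rc = op(arg);
    } while (rc == -1 && errno == EINTR);
    return rc;
}

/* Hypothetical request type for this sketch. */
struct vt_request {
    int fd;  /* open console descriptor, e.g. for /dev/tty0 */
    int vt;  /* number of the VT to switch to */
};

/* Ask the kernel to switch to the requested VT (returns immediately). */
int do_activate(void *p)
{
    struct vt_request *r = p;
    return ioctl(r->fd, VT_ACTIVATE, r->vt);
}

/* Block until the requested VT is the active one. */
int do_waitactive(void *p)
{
    struct vt_request *r = p;
    return ioctl(r->fd, VT_WAITACTIVE, r->vt);
}

/* Request the switch, then wait for it. Returns 0, or -1 with errno set. */
int switch_vt_and_wait(struct vt_request *r)
{
    if (retry_on_eintr(do_activate, r) == -1)
        return -1;
    return retry_on_eintr(do_waitactive, r);
}
```

Note that a retry loop like this would only paper over the symptom if two servers are really fighting over the active VT; the ordering between the old server's exit and the new server's start still has to be fixed.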
Ajax: I can definitely not argue with you about the horridness of the VT_ACTIVATE/WAITACTIVE interfaces. They are inherited from SVR4, basically from back in the days when X originally got ported to Linux, and being compatible with other systems was a primary goal, rather than trying to do anything sane. But we're kind of stuck with it.

And no, I'm not doing anything strange at all with kdm. This is a bog-standard Fedora 8 box, with pretty much everything updated with "yum -y upgrade" every once in a while. The exception, of course, is the kernel (and git), which I always end up recompiling from sources for obvious reasons.

The original install was from the Fedora 8 test2 KDE LiveCD, but while I'm not at all familiar with the init scripts, it looks like kdm gets started directly by init calling out to /etc/X11/prefdm, which then exec's /usr/bin/kdm. That same script seems to do the "/usr/bin/rhgb-client --quit" too, and I'm now rebooting the machine to see if adding a "sleep 1" after that thing makes any difference.

.. time passes ..

Hmm. I rebooted twice, and it worked both times. My kernel debug messages were still active, so I could see the timestamps in /var/log/messages, and now there did seem to be a second's pause in between the old X exit and the new X startup. But since this was never 100% reliable to start with, I cannot guarantee that it makes a difference, but it sure looks suspicious.

That added "sleep 1" is obviously a total hack, and I'm in no way advocating it as a solution. It was just meant to be a "does this change behavior" test.

Side note: are the rhgb sources available in some sane format (and no, I don't think a source RPM is a sane format)? I'll happily try things out myself, but google didn't show anything really obvious.

export CVSROOT=:pserver:anonymous.redhat.com:/cvs/pkgs
cvs login [press enter]
cvs co rhgb
cd rhgb/devel
make local

That should set up a source tree and build it.

Ok, I just strace'd rhgb, and the thing is totally buggy.

Here's the snippet from the strace when it gets the "quit" command from rhgb-client:

read(13, "quit\0", 1024) = 5
read(13, 0x864401d, 1024) = -1 EAGAIN (Resource temporarily unavailable)
write(13, "done\0", 5) = 5
close(13) = 0
close(13) = -1 EBADF (Bad file descriptor)

In other words, it does NOT wait at all for the X server to shut down. Not in any way, shape, or form. What happens afterwards is:

select(9, [8], [8], NULL, NULL) = 1 (out [8])
writev(8, [{"+\7\1\0", 4}], 1) = 4
select(9, [8], [], NULL, NULL) = 1 (in [8])
read(8, "\1\0...", 4096) = 32
read(8, 0x8073990, 4096) = -1 EAGAIN
kill(10649, SIGTERM) = 0
waitpid(10649, NULL, 0) = 10649
--- SIGCHLD (Child exited) @ 0 (0) ---
sigreturn() = ? (mask now [])

i.e. it does end up sending the SIGTERM, but only long after it has already told the client that it is "done", and apparently after it has also done some final X operations in the main loop.

No wonder this thing doesn't work. I'm attaching a totally untested patch that is probably not the right thing to do either, but it is an example of how this thing might be fixable. Depending on that "g_at_exit()" thing to reap things is useless, that is way way too late, and I think the thing needs to wait for the child synchronously and explicitly, instead of playing any clever games.

Created attachment 221661
Proposed example fix
NOTE! Untested! I didn't even bother compiling it! Caveat emptor!
This is meant as a "this approach may work" kind of thing, not necessarily as a
working fix. The code and the RH build system is so strange that I'm not going
to really try to fight it any further than importing it into git and generating
a patch that looks like it might just work.
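
The attached patch itself is not reproduced in this report. As a rough sketch of the ordering argued for above (terminate the X server, reap it synchronously, and only then answer the client), something like the following would do. The names `terminate_and_reap` and `handle_quit` are hypothetical and are not taken from the rhgb sources or from the attachment; the 5-byte "done\0" reply mirrors the write seen in the strace.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send SIGTERM to the child and block until it has
 * been reaped. Returns 0 on success (wait status stored in *status),
 * or -1 with errno set. */
int terminate_and_reap(pid_t child, int *status)
{
    if (kill(child, SIGTERM) == -1)
        return -1;
    while (waitpid(child, status, 0) == -1) {
        if (errno != EINTR)   /* retry only if a signal interrupted the wait */
            return -1;
    }
    return 0;
}

/* Hypothetical quit handler: reply_fd stands for the rhgb-client
 * connection, x_server for the pid of the X server that was launched. */
int handle_quit(int reply_fd, pid_t x_server)
{
    static const char done[] = "done";  /* 5 bytes including the NUL */
    int status;

    /* First make sure the X server is really gone... */
    if (terminate_and_reap(x_server, &status) == -1)
        return -1;

    /* ...and only then tell the client it may continue. */
    if (write(reply_fd, done, sizeof done) != (ssize_t)sizeof done)
        return -1;
    return 0;
}
```

The point of the design is that the reply is written strictly after `waitpid()` returns, so whatever runs after `rhgb-client --quit` cannot start a new X server while the old one is still switching VTs.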
Yeah, that looks plausible. And I could easily believe that gdm is just slow enough to launch that we're just getting lucky. I'll build a test RPM with a variation of that patch.

So with the RPMs here: http://koji.fedoraproject.org/scratch/ajax/task_188803/ and a bootchart, I can very clearly see rhgb exiting after X exits, and the next X server starting after rhgb exits. The fix is just your change but with more places converted to use got_exit_command() since we're repeating that stanza verbatim in several places. Please test, but I think your analysis is right.

And it gets even more funny. Spot the other error (this time rhgb-client.c):

if (strcmp (buffer, "done"))
  g_main_loop_quit (loop);
...

Ajax said: "I can very clearly see rhgb exiting after X exits, and the next X server starting after rhgb exits."

I hope/assume you mean that you can see rhgb-*client* exiting after X exits: the problem was never rhgb itself, it was the client that exited too early and allowed subsequent commands to start executing before the old X had actually exited. So the exit time of rhgb itself is totally immaterial (and yes, that one has always exited after X exits, since it had the "wait for child" code in its atexit() routine).

However, I can confirm that your RPM seems to have fixed rhgb so that it returns the "done" string only after the child (X) has exited, so that rhgb-client is correctly serialized.

And just to double-check, I also booted the machine a few times without my hacky "sleep 1" thing to see that yes, it does fix the symptoms of the problem that caused me to chase this down in the first place.

So it does seem fixed in your RPMs, please push it out to mainstream... I can well imagine that gdm is so bloated that you'd never see it in practice, but I cannot be the only one who wants to use KDE and thus kdm.

Building 0.17.7-2 now. Thanks!
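
For reference, the "other error" in the rhgb-client.c snippet quoted in this thread is that strcmp() returns 0 when the strings match, so `if (strcmp (buffer, "done"))` fires on every reply except "done". A minimal sketch of the intended comparison, assuming the buffer holds a NUL-terminated reply (`reply_is_done` is a hypothetical helper name, not one from the rhgb sources):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true only when the server's reply is exactly "done". */
bool reply_is_done(const char *buffer)
{
    /* strcmp() returns 0 on a match, so compare its result against 0. */
    return strcmp(buffer, "done") == 0;
}
```

With a helper like this, the client would quit its main loop only when `reply_is_done(buffer)` is true.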