Bug 72295
Summary: gdm kills network connectivity on X restart
Product: [Retired] Red Hat Linux
Component: gdm
Version: 8.0
Hardware: i386
OS: Linux
Status: CLOSED RAWHIDE
Severity: high
Priority: high
Reporter: Mark Cooke <segfault>
Assignee: Owen Taylor <otaylor>
QA Contact: Mike McLean <mikem>
CC: jirka, notting
Keywords: Triaged
Doc Type: Bug Fix
Last Closed: 2003-02-04 21:29:14 UTC
Bug Blocks: 67218, 79579
Description
Mark Cooke 2002-08-22 19:47:03 UTC
Damnit. This looks like gdm is killing something quite random or some such. I need to check what's happening there. There could be a race in the xioerror handler stuff.

As a datapoint, this does not happen for me on my test machine.

Does not happen here either, though I may have fixed two minor problems that could perhaps avoid some races (I couldn't find races, but the fact that gdm_slave_xioerror_handler got called twice, and once within a signal since the message has been proxied through the protocol, tells me that something went horribly wrong somewhere). Can you (the reporter) turn on debugging (debug/Enable=true in the config file) and then give me what it dumps into the syslog?

Damnit, I keep finding flaws and subtle races, though none that could explain the above, I don't think. A very fun race can happen at times, though, forcing gdm to do kill(0, whatever), which would explain the above if gdm is not run with setsid, and this currently happens when run with -nodaemon. I can avoid these races, and I've now fixed all such kills in the code by pushing a block on SIGCHLD. I'm testing currently and will commit the cleanup to CVS soon. If Red Hat wants a quick fix, which doesn't fix this but should minimize its effects, they should add setsid in gdm.c, in the branch of the if statement before we call gdm_daemonize. Then gdm will run in its own session and a kill(0, ...) will just kill gdm itself rather than random other stuff that may be in the process group.
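[Editorial note] For reference, the debug/Enable switch mentioned above is a key in gdm's configuration file; on Red Hat's packaging that file is usually /etc/X11/gdm/gdm.conf (the exact path is an assumption here, check your install). A minimal snippet:

    [debug]
    # verbose gdm logging to syslog, as requested above
    Enable=true

To make the kill(0, ...) / setsid discussion concrete, here is a small self-contained C sketch of the mechanism; it is not gdm's actual code. kill() with pid 0 signals every process in the caller's process group, and setsid() moves the caller into a fresh session and group, so such a kill can no longer reach unrelated processes (for example the init/network scripts gdm was started alongside).

    #include <stdio.h>
    #include <unistd.h>

    int main (void)
    {
            printf ("before setsid: pid=%d pgrp=%d sid=%d\n",
                    (int) getpid (), (int) getpgrp (), (int) getsid (0));

            /* setsid() fails if we are already a process group leader
             * (e.g. when launched as a shell job); daemons normally
             * fork first so the child is guaranteed not to be one. */
            if (setsid () < 0)
                    perror ("setsid");

            printf ("after  setsid: pid=%d pgrp=%d sid=%d\n",
                    (int) getpid (), (int) getpgrp (), (int) getsid (0));

            /* A kill (0, SIGTERM) issued here would now signal only this
             * process and its own children, not everything in the old group. */
            return 0;
    }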
No problem, will enable debugging and report back the output.

Debugging information for gdm, taken from /var/log/messages. This was captured straight after pressing Ctrl-Alt-Backspace:
-----------------------------------------------------------
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_slave_child_handler
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_slave_child_handler: 1170 died
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_slave_child_handler: 1170 returned 1
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_slave_xioerror_handler: I/O error for display :0
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_server_stop: Server for :0 going down!
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_server_stop: Killing server pid 977
Aug 23 19:31:35 stimpy gdm[871]: mainloop_sig_callback: Got signal 17
Aug 23 19:31:35 stimpy gdm[871]: gdm_cleanup_children: child 976 returned 2
Aug 23 19:31:35 stimpy gdm[871]: gdm_child_action: Slave process returned 2
Aug 23 19:31:35 stimpy gdm[871]: gdm_display_manage: Managing :0
Aug 23 19:31:35 stimpy gdm[871]: Resetting counts for loop of death detection, 90 seconds elapsed.
Aug 23 19:31:35 stimpy gdm[871]: gdm_display_manage: Forked slave: 1866
Aug 23 19:31:35 stimpy gdm[871]: (child 976) gdm_server_stop: Server pid 977 dead
Aug 23 19:31:35 stimpy gdm[871]: main: Exited main loop
Aug 23 19:31:35 stimpy gdm[1866]: gdm_slave_start: Starting slave process for :0
Aug 23 19:31:35 stimpy gdm[1866]: gdm_slave_start: Loop Thingie
Aug 23 19:31:35 stimpy gdm[1866]: Sending VT_NUM == -1 for slave 1866
Aug 23 19:31:35 stimpy gdm[1866]: Sending VT_NUM 1866 -1
Aug 23 19:31:35 stimpy gdm[871]: Handling message: 'VT_NUM 1866 -1'
Aug 23 19:31:35 stimpy gdm[871]: Got VT_NUM == -1
Aug 23 19:31:35 stimpy gdm[871]: (child 1866) gdm_slave_usr2_handler: :0 got USR2 signal
Aug 23 19:31:35 stimpy gdm[1866]: gdm_server_start: :0
Aug 23 19:31:35 stimpy gdm[1866]: gdm_auth_secure_display: Setting up access for :0
Aug 23 19:31:35 stimpy gdm[1866]: gdm_auth_secure_display: Setting up socket access
Aug 23 19:31:35 stimpy gdm[1866]: gdm_auth_secure_display: Setting up network access
Aug 23 19:31:35 stimpy gdm[1866]: gdm_auth_secure_display: Setting up access for :0 - 5 entries
Aug 23 19:31:35 stimpy gdm[1866]: Sending COOKIE == <secret> for slave 1866
Aug 23 19:31:35 stimpy gdm[1866]: Sending COOKIE 1866 0c6841d7f4e1baf5b7d164dc234d6e8d
Aug 23 19:31:35 stimpy gdm[871]: Handling message: 'COOKIE 1866 0c...'
Aug 23 19:31:35 stimpy gdm[871]: Got COOKIE == <secret>
Aug 23 19:31:35 stimpy gdm[871]: (child 1866) gdm_slave_usr2_handler: :0 got USR2 signal
Aug 23 19:31:35 stimpy gdm[1866]: gdm_server_spawn: Forked server on pid 1867
Aug 23 19:31:35 stimpy gdm[1867]: gdm_server_spawn: '/usr/X11R6/bin/X :0 -auth /var/gdm/:0.Xauth'
Aug 23 19:31:35 stimpy gdm[871]: (child 1866) gdm_server_usr1_handler: Got SIGUSR1, server running
Aug 23 19:31:35 stimpy gdm[1866]: gdm_server_start: Before mainloop waiting for server
Aug 23 19:31:35 stimpy gdm[1866]: gdm_server_start: After mainloop waiting for server
Aug 23 19:31:35 stimpy gdm[1866]: gdm_server_start: Completed :0!
Aug 23 19:31:35 stimpy gdm[1866]: Sending XPID == 1867 for slave 1866
Aug 23 19:31:35 stimpy gdm[1866]: Sending XPID 1866 1867
Aug 23 19:31:35 stimpy gdm[871]: Handling message: 'XPID 1866 1867'
Aug 23 19:31:35 stimpy gdm[871]: Got XPID == 1867
Aug 23 19:31:35 stimpy gdm[1866]: gdm_slave_run: Opening display :0
Aug 23 19:31:35 stimpy gdm[871]: (child 1866) gdm_slave_usr2_handler: :0 got USR2 signal
Aug 23 19:31:36 stimpy netfs: Unmounting NFS filesystems: succeeded
Aug 23 19:31:38 stimpy network: Shutting down interface eth0: succeeded
Aug 23 19:31:38 stimpy network: Shutting down loopback interface: succeeded
Aug 23 19:31:38 stimpy /etc/hotplug/net.agent: NET unregister event not supported
Aug 23 19:31:39 stimpy apmd[599]: User Suspend
Aug 23 19:31:39 stimpy kernel: apm: suspend was vetoed.
Aug 23 19:31:40 stimpy gdm[1866]: Sending START_NEXT_LOCAL
Aug 23 19:31:40 stimpy gdm[871]: Handling message: 'START_NEXT_LOCAL'
Aug 23 19:31:40 stimpy gdm[1866]: gdm_slave_greeter: Running greeter on :0
Aug 23 19:31:40 stimpy gdm[1866]: gdm_slave_greeter: Greeter on pid 2060
Aug 23 19:31:40 stimpy gdm[1866]: Sending GREETPID == 2060 for slave 1866
Aug 23 19:31:40 stimpy gdm[1866]: Sending GREETPID 1866 2060
Aug 23 19:31:40 stimpy gdm[871]: (child 1866) gdm_slave_child_handler
Aug 23 19:31:40 stimpy gdm[871]: Handling message: 'GREETPID 1866 2060'
Aug 23 19:31:40 stimpy gdm[871]: Got GREETPID == 2060
Aug 23 19:31:40 stimpy gdm[871]: (child 1866) gdm_slave_usr2_handler: :0 got USR2 signal
Aug 23 19:31:44 stimpy gdm[1866]: gdm_slave_wait_for_login: In loop

Wow, I've been staring at the code and fixing problems but I can't see the precise problem. I think it may just be one of the races I've recently fixed in CVS. Can you try out of CVS? If not, can you try a tarball? I'll make a .9 version soon (maybe today or tomorrow). For Red Hat: the setsid is wrong where I put it before. If in -nodaemon it should not do setsid (I'm a dumbass). But the slave should do it when it forks. Note that the slave fork is in display.c and not slave.c. This should also remove some minor races when you change init levels in case you start gdm from init. The main daemon doesn't really try to kill so many things, so I doubt the error was there. If you do a setsid in the slave, at worst the slave kills itself, so I'd recommend this one-liner fix in case Red Hat doesn't want to update to .9 now.
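[Editorial note] A rough sketch, under assumptions, of the one-liner recommended above (none of these names are gdm's; in gdm the slave fork lives in display.c): the freshly forked child calls setsid(), so a later kill(0, ...) issued from the slave's cleanup paths can at worst take down the slave's own session, never the process group the daemon was started from.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main (void)
    {
            pid_t slave = fork ();          /* hypothetical "slave" fork */

            if (slave == 0) {
                    /* Child: a forked child is never a process group leader,
                     * so setsid() succeeds and detaches it into its own
                     * session and process group. */
                    if (setsid () < 0)
                            perror ("setsid");

                    /* A racy cleanup doing kill(0, ...) now reaches only this
                     * new group, i.e. the slave itself and its children. */
                    kill (0, SIGTERM);
                    _exit (1);              /* not reached: SIGTERM terminates us */
            }

            if (slave > 0) {
                    int status;
                    waitpid (slave, &status, 0);
                    printf ("parent %d survived; slave died on signal %d\n",
                            (int) getpid (),
                            WIFSIGNALED (status) ? WTERMSIG (status) : 0);
            }
            return 0;
    }

Run from a shell, the parent reports that it survived while the child terminated on SIGTERM, which is exactly the containment the setsid-in-the-slave change is after.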
I put in george's fix in 2.4.0.7-5.

segfault: as we haven't reproduced the problem, we would appreciate you checking whether this fixes it when the new gdm shows up in rawhide. Please close/reopen the bug accordingly.

Will check out rawhide, but I can reproduce it on at least 6 test machines now. In case this affects anything, they are all using NFS mounts (/pub/....) and NIS for passwords (for users). Will report back when 2.4.0.7-5 becomes available in rawhide.

Hmm, let's speed this up in fact. ;-) http://people.redhat.com/~hp/gdm-2.4.0.7-5.i386.rpm (assuming x86)

Yup (got the rpm from Havoc's directory) - that fixed the network problem, except now, for some reason, it switches off the monitor at times (not every time); then you just have to press Return to get gdm back. This is livable with, but it would be nice not to have to do this. Well done everyone :-)

Closing out as resolved.

This is still happening on the official 8.0 release; I can reproduce it on 4 machines, and other members have reported the same still happening. The rpm that I downloaded from the link Havoc mentioned above fixed it in Null, but it's come back in Psyche.

Maybe it was never really fixed - the final version still seems to have the patch that we thought fixed this.

gdm 2.4.0.12, to appear in rawhide shortly, has george's full upstream fix. Maybe check in the next beta to come out what the status is.

Taking gdm target/blockers.

Could people test with 2.4.1.0/2.4.1.1 from Raw Hide and see if it still occurs?

After a recompile of several rawhide packages, this now seems to be fixed on my system anyway. Mark

Thanks for the testing.