NOTE: Since I'm not sure whether this will interest Red Hat or XFree86 more, I've sent it to both, and have therefore included references to the Red Hat packages involved.

VERSION:

Red Hat package: XFree86-SVGA-3.3.6-33
R6.3, public-patch-3

The problem was initially discovered in the XF86_SVGA server contained in the above Red Hat package. It is also present in XF86_SVGA compiled from sources X336-src-x.tgz, both with and without the following fixes: fix-01-r128, fix-02-svr4, fix-03-mmap, fix-04-s3trio3d2x, fix-05-s3trio3d, fix-06-s3trio3d2x, fix-07-s3trio64v2gx+netfinity, fix-08-s3savage_ix+mx.

CLIENT MACHINE and OPERATING SYSTEM:

i386/Red Hat Linux 7.0
Dell Cpt laptop, Intel Celeron 333MHz processor, kernel 2.2.17. The kernel is unpatched and compiled from source, not a Red Hat rpm. PCMCIA version 3.1.21 is also compiled from source.

DISPLAY TYPE:

Neomagic NM2360 driving internal LCD panel (chipset forced to NM2200 in XF86Config)

WINDOW MANAGER:

None -- see later.

COMPILER:

gcc version 2.96 20000731 (Red Hat Linux 7.0)

AREA:

xc/lib/font/fc

SYNOPSIS:

fs_handle_unexpected() [in xc/lib/font/fc/fserve.c] can call _fs_eat_rest_of_error() [in .../fsio.c] with an FSFpeRec "conn" structure in which the field trans_conn is NULL. This ultimately leads to TRANS(Read)() [_FontTransRead()] [in xc/lib/xtrans/Xtrans.c] attempting to dereference a NULL pointer, followed by "Caught signal 11" and a server crash.

DESCRIPTION:

A brief description of my setup...

I'm running XFree86 4.0.1 as shipped in Red Hat 7.0. Red Hat supplies both the new 4.0.1 XFree86 server and the older individual servers from XFree86 3.3.6. Due to some (very minor) glitches with the newer server on my system, I'm still running the XF86_SVGA server from 3.3.6. I'm also running vnc (version 3.3.3r2, with some modifications, installed from source) and kdm (from kde 1.1.2, Red Hat package kdebase-1.1.2-48).
I log in using kdm, and my .xsession script runs vncviewer in fullscreen mode in order to simulate an X session which I can share to other displays as I move around. Fonts are provided by xfs (from 4.0.1, Red Hat package XFree86-xfs-4.0.1-1).

This setup worked correctly for several weeks until I installed the Microsoft web TrueType fonts (http://www.microsoft.com/typography/fontpack/) to be served by xfs. Now, if the laptop is suspended for more than about 5-10 minutes, the X server will probably crash shortly after the system is resumed.

The specifics of the problem...

I've traced the execution path using a combination of the call trace and judicious insertion of printf(). I can get as far as fs_wakeup() [in xc/lib/font/fc/fserve.c], although I'm unsure where this is called from (as a callback function?).

At some point after the resume, fs_wakeup() is called and, not finding a matching block record for the data it reads from the connection, it calls fs_handle_unexpected() [in the same file].

----->8 Snip 8<-----
static void
fs_handle_unexpected(conn, rep)
    FSFpePtr    conn;
    fsGenericReply *rep;
{
    if (rep->type == FS_Event && rep->data1 == KeepAlive) {
        fsNoopReq   req;

        /* ping it back */
        req.reqType = FS_Noop;
        req.length = SIZEOF(fsNoopReq) >> 2;
        _fs_add_req_log(conn, FS_Noop);
        _fs_write(conn, (char *) &req, SIZEOF(fsNoopReq));   <-- NO ERROR CHECK HERE.
    }
    /* this should suck up unexpected replies and events */
    _fs_eat_rest_of_error(conn, (fsError *) rep);
}
----->8 Snip 8<-----

fs_handle_unexpected() finds that this is a "KeepAlive" and attempts to send a Noop back to the font server. This involves calling _fs_write() [in .../fsio.c]. When _fs_write() calls _FontTransWrite() [in xc/lib/xtrans/Xtrans.c], the write fails, setting errno to EPIPE. _fs_write() then calls _fs_connection_died() [in .../fserve.c] which (amongst other things) sets conn->trans_conn to NULL. _fs_write() then sets errno to EPIPE and returns -1 to signal the error.
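As an illustration only, the failure sequence just described can be modelled in a few lines of stand-alone C. Nothing here is the real X11 code: the types TransConn and Conn and the functions model_connection_died(), model_write() and model_read() are hypothetical stand-ins for the transport handle, FSFpeRec, _fs_connection_died(), _fs_write() and _fs_read(). The NULL guard in model_read() is an assumption added so that the sketch returns an error where, per the synopsis, the real read path dereferences the pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport handle held in trans_conn. */
typedef struct TransConn {
    int broken;                 /* non-zero: the font server end has gone away */
} TransConn;

/* Hypothetical stand-in for FSFpeRec; only the two fields discussed here. */
typedef struct Conn {
    int        fs_fd;
    TransConn *trans_conn;
} Conn;

/* Models _fs_connection_died(): mark the connection as dead. */
static void model_connection_died(Conn *conn)
{
    conn->fs_fd = -1;
    conn->trans_conn = NULL;
}

/* Models _fs_write(): if the transport is unusable, mark the connection
 * dead, set errno to EPIPE and return -1; otherwise report size bytes. */
static int model_write(Conn *conn, const char *data, size_t size)
{
    assert(conn != NULL);
    (void) data;                /* payload is irrelevant to the model */
    if (conn->trans_conn == NULL || conn->trans_conn->broken) {
        model_connection_died(conn);
        errno = EPIPE;
        return -1;
    }
    return (int) size;
}

/* Models _fs_read(). ASSUMPTION: the NULL check is not in the code quoted
 * in this report; it stands in for the point where the crash occurs. */
static int model_read(Conn *conn, char *buf, size_t size)
{
    assert(conn != NULL);
    (void) buf;
    if (conn->trans_conn == NULL) {
        errno = EPIPE;
        return -1;
    }
    return (int) size;
}
```

Calling model_write() on a Conn whose transport is marked broken returns -1 and leaves trans_conn set to NULL, so a caller that ignores the return value and goes on to call the read path is handing it a dead connection.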
Crucially, fs_handle_unexpected() doesn't check the return value of _fs_write() and goes on to call _fs_eat_rest_of_error() [in .../fsio.c] with "conn" containing the NULL pointer in the field trans_conn.

----->8 Snip 8<-----
void
_fs_connection_died(conn)
    FSFpePtr    conn;
{
    if (!conn->attemptReconnect)
        return;
    conn->attemptReconnect = FALSE;
    fs_close_conn(conn);
    conn->time_to_try = time((Time_t *) 0) + FS_RECONNECT_WAIT;
    conn->reconnect_delay = FS_RECONNECT_WAIT;
    conn->fs_fd = -1;
    conn->trans_conn = NULL;   <--- HERE.
    conn->next_reconnect = awaiting_reconnect;
    awaiting_reconnect = conn;
}
----->8 Snip 8<-----

_fs_eat_rest_of_error() just calls _fs_drain_bytes() [in the same file], passing on "conn" containing the NULL pointer. _fs_drain_bytes() calls _fs_read() [in the same file] to read the data from the connection, once again passing on "conn". _fs_read() calls TRANS(Read) [_FontTransRead()] in xc/lib/xtrans/Xtrans.c, passing conn->trans_conn as the first parameter (i.e. NULL).

----->8 Snip 8<----- [From _fs_read()]
    while ((bytes_read = _FontTransRead(conn->trans_conn,
                                        data, (int) size)) != size) {
----->8 Snip 8<-----

_FontTransRead() tries to dereference this NULL pointer and we catch signal 11.

----->8 Snip 8<-----
int
TRANS(Read) (ciptr, buf, size)   <-- ciptr is NULL.
...
{
    return ciptr->transptr->Read (ciptr, buf, size);
}
----->8 Snip 8<-----

REPEAT BY:

This sequence seems to cause the crash every time on my system (but the crash can also happen even if the exact sequence is not followed).

Initially: we've logged in using kdm and are running a normal X session under Xvnc, with vncviewer running full screen so we can see it. [Window Maker 0.62.1, Red Hat package WindowMaker-0.62.1-14, is running under Xvnc; no window manager is running under XF86_SVGA (just vncviewer), nothing else.]

1. Run xscreensaver to lock the X display with the "xscreensaver-command -lock" command (XScreenSaver 3.25, Red Hat package xscreensaver-3.25-4).
   Note that xscreensaver is locking Xvnc, NOT XF86_SVGA.

2. Press a key so that xscreensaver prompts for a password.

3. While the password dialog is displayed, hit Fn+Suspend to place the laptop into suspend mode.

4. Go and have a cup of coffee. Note this step is important! You must wait at least 10 minutes (say 15 for good measure). If you try to resume immediately it won't crash.

5. After a 15 minute delay, hit the power button. The password dialog will pop back up (unless the panel is set to blank after 10 minutes, in which case press Ctrl or something to wake it up).

6. Type your password and hit Enter, and XF86_SVGA will crash with the "Caught signal 11." message. Xvnc will be fine; you can restart the X server and reconnect to it with vncviewer, and your Xvnc session will be unaffected.

SAMPLE FIX:

This fix adds an error check so that fs_handle_unexpected() checks the return value of _fs_write() and returns immediately if it is -1 (thus skipping the call to _fs_eat_rest_of_error()). It also adds a few warning messages so you know that the problem is still there, although now better handled.

----->8 Snip 8<-- (diff -c xc/lib/font/fc/fserve.c xc.new/lib/font/fc/fserve.c)
*** xc/lib/font/fc/fserve.c Wed Jun 11 13:08:41 1997
--- xc.new/lib/font/fc/fserve.c Sun Nov 26 14:49:27 2000
***************
*** 1,4 ****
--- 1,5 ----
  /* $TOG: fserve.c /main/49 1997/06/10 11:23:56 barstow $ */
+ /* Modified to prevent a seg.fault crash -- R.Kay (26-11-00) */
  /*
  Copyright (c) 1990 X Consortium
***************
*** 92,97 ****
--- 93,99 ----
                          (pci)->descent || \
                          (pci)->characterWidth)
+ #include <stdio.h> /* So we can print some warnings -- RKAY */
  extern FontPtr find_old_font();
***************
*** 1214,1220 ****
          req.reqType = FS_Noop;
          req.length = SIZEOF(fsNoopReq) >> 2;
          _fs_add_req_log(conn, FS_Noop);
!         _fs_write(conn, (char *) &req, SIZEOF(fsNoopReq));
      }
      /* this should suck up unexpected replies and events */
      _fs_eat_rest_of_error(conn, (fsError *) rep);
--- 1216,1229 ----
          req.reqType = FS_Noop;
          req.length = SIZEOF(fsNoopReq) >> 2;
          _fs_add_req_log(conn, FS_Noop);
!         /* If _fs_write fails, conn->trans_conn will be NULL and calling
!          * _fs_eat_rest_of_error will eventually cause a segfault in
!          * _FontTransRead() -- RKAY */
!         if (_fs_write(conn, (char *) &req, SIZEOF(fsNoopReq)) == -1) {
!             fprintf(stderr, "Warning: _fs_write failed in "
!                             "fs_handle_unexpected.\n");
!             return;
!         }
      }
      /* this should suck up unexpected replies and events */
      _fs_eat_rest_of_error(conn, (fsError *) rep);
----->8 Snip 8<------------------------

----->8 Snip 8<-- (diff -c xc/lib/font/fc/fsio.c xc.new/lib/font/fc/fsio.c)
*** xc/lib/font/fc/fsio.c Fri Jul 23 14:42:00 1999
--- xc.new/lib/font/fc/fsio.c Sun Nov 26 14:49:38 2000
***************
*** 1,5 ****
--- 1,6 ----
  /* $XConsortium: fsio.c,v 1.37 95/04/05 19:58:13 kaleb Exp $ */
  /* $XFree86: xc/lib/font/fc/fsio.c,v 3.5.2.2 1999/07/23 13:22:20 hohndel Exp $ */
+ /* Modified to prevent a seg.fault crash -- R.Kay (26-11-00) */
  /*
   * Copyright 1990 Network Computing Devices
   *
***************
*** 457,462 ****
--- 458,467 ----
          } else if (ECHECK(EINTR)) {
              continue;
          } else { /* something bad happened */
+             /* RKAY */
+             if (ECHECK(EPIPE))
+                 fprintf(stderr, "Warning: EPIPE while writing to font "
+                                 "server.\n");
              _fs_connection_died(conn);
              ESET(EPIPE);
              return -1;
----->8 Snip 8<------------------

Of course, there remains the question of why writing to the font server fails after a resume. I've not had a chance to investigate that aspect of the problem. However, whatever the answer may be, a problem with the font server shouldn't result in the X server seg faulting.

One side effect of the above patch is that, after the averted crash, XF86_SVGA starts consuming ~90% of the CPU.
This appears to be because the mechanism for calculating the timeouts for the Select() calls in WaitForSomething() [xc/programs/Xserver/os/WaitFor.c] now decides on a timeout of 0, and so XF86_SVGA runs in a busy loop. Aside from that it works fine (if it weren't for the fan switching on, I wouldn't have noticed).

There appears to be a function _fs_try_reconnect() [xc/lib/font/fc/fserve.c] that looks like it should re-establish the font server connection. The only place I can find where this is called is fs_wakeup(). However, if the connection has died and _fs_connection_died() has been called, as in this case, then conn->fs_fd == -1, fs_wakeup() exits immediately, and the call to _fs_try_reconnect() is never reached.

----->8 Snip 8<-----
    if (conn->fs_fd == -1)
        return FALSE;
----->8 Snip 8<-----

I tried sleeping for 10 seconds and then calling _fs_try_reconnect() in fs_handle_unexpected(), but conn->trans_conn was still NULL. I guess an examination of xfs would be a good idea.

Hopefully this is of some use,

R.Kay
Thanks very much for doing the debugging session and analysis, and for the patch as well. I believe we have a fix for 4.x based servers now and will be testing it out soon. I will also try out your patch soon. If by chance you've come across any more info or patches, please feel free to submit them, and I will try to get a fix out ASAP.

Sorry for the delayed response. I'm playing catch-up with a bug report pile inherited from XFree86, and hope to get caught up sometime this year.. ;o)

This bug is a duplicate of Bug #17991 and countless others I won't mention; however, I'm not marking it as a duplicate, as it is the most detailed report of the bunch.

Thanks again.
Fixed in latest errata