Bug 705652

Summary: Heap corruption causing X deadlock with xorg-x11-drv-intel-2.15.0-3.fc15
Product: [Fedora] Fedora
Component: xorg-x11-drv-intel
Version: 15
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: Ian Pilcher <ipilcher>
Assignee: Adam Jackson <ajax>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: ajax, xgl-maint
Keywords: Patch, Triaged
Whiteboard: [cat:crash]
Doc Type: Bug Fix
Last Closed: 2011-06-07 04:43:29 UTC
Attachments:
Full backtrace of X server deadlock (flags: none)
Backtrace of deadlock with glibc-2.13.90-13.x86_64 (flags: none)

Description Ian Pilcher 2011-05-18 00:36:44 UTC
Description of problem:
After updating to xorg-x11-drv-intel-2.15.0-3.fc15, I get a hard X hang
every time I log out.  The screen is black, with the mouse pointer visible
but immobile, and the system doesn't respond to any keyboard combination;
the only way to recover the system is to ssh in, kill -9 the X server, and
reboot.

Attaching to the X server with gdb shows that it has deadlocked trying to
report a "corrupted double-linked list":

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:100
#1  0x00007fb07f8c7c11 in _L_lock_10461 () at malloc.c:6486
#2  0x00007fb07f8c59d7 in __libc_malloc (bytes=140396034138592) at malloc.c:3657
#3  0x00007fb07f8bb35d in __libc_message (do_abort=2, 
    fmt=0x7fb07f9a6fb8 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:137
#4  0x00007fb07f8c196a in malloc_printerr (action=3, 
    str=0x7fb07f9a3f92 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:6283
#5  0x00007fb07f8c1d48 in malloc_consolidate (av=0x7fb07fbe21e0) at malloc.c:5161
#6  0x00007fb07f8c2669 in malloc_consolidate (av=0x7fb07fbe21e0) at malloc.c:5115
#7  _int_free (av=0x7fb07fbe21e0, p=<optimized out>, have_lock=0) at malloc.c:5034
#8  0x000000351360c01f in FontFileFreeDir (dir=0x1fef3d0) at fontdir.c:166
#9  0x000000351360ce18 in FontFileFreeFPE (fpe=0x1fef360) at fontfile.c:139
#10 0x000000351360f89e in CatalogueUnrefFPEs (fpe=<optimized out>) at catalogue.c:116
#11 0x000000351360fe41 in CatalogueFreeFPE (fpe=0x1fb8f00) at catalogue.c:272
#12 0x000000000042f09d in FreeFPE (fpe=0x1fb8f00) at dixfonts.c:218
#13 FreeFPE (fpe=0x1fb8f00) at dixfonts.c:214
#14 0x000000000042f107 in FreeFontPath (list=0x1fb54b0, n=2, force=1) at dixfonts.c:1628
#15 0x0000000000432257 in FreeFonts () at dixfonts.c:1998
#16 0x0000000000422f1e in main (argc=<optimized out>, argv=0x7fff89fd3fb8, envp=<optimized out>)
    at main.c:329
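
A backtrace like this can also be captured non-interactively from the ssh session. The following is only a sketch, not the exact commands used for this report: it assumes gdb and the relevant debuginfo packages are installed and that the server process is named Xorg or X. The function name and the GDB override variable are hypothetical.

```shell
#!/bin/sh
# Sketch: dump backtraces of all threads of a running process to a file.
# Assumption: gdb is installed. GDB is a hypothetical override for the
# debugger binary, not something taken from the report.
dump_backtrace() {
    pid=$1
    out=$2
    "${GDB:-gdb}" -p "$pid" -batch -ex 'thread apply all bt full' > "$out" 2>&1
}

# Example use (commented out so that sourcing this file attaches to nothing):
# pid=$(pgrep -o -x Xorg || pgrep -o -x X)
# dump_backtrace "$pid" /tmp/xorg-backtrace.txt
```

Running gdb in batch mode detaches when it finishes, so the server is left in the state it was found in.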

Downgrading back to xorg-x11-drv-intel-2.14.0-6.fc15.x86_64 makes the
problem go away.

Version-Release number of selected component (if applicable):
xorg-x11-drv-intel-2.15.0-3.fc15.x86_64

How reproducible:
100%

Comment 1 Ian Pilcher 2011-05-18 00:37:54 UTC
Created attachment 499495
Full backtrace of X server deadlock

Comment 2 Ian Pilcher 2011-05-18 18:46:22 UTC
It looks like the problematic commit is one of:

e1ff5182304e00c0d392092069422cae7626cf8d  Handle drawable/client
    destruction in pending swaps/flips

86f23f21ab57fcbc031bcd2b8f432a08ff4cc320  Skip client and drawable
   resource delete calls when deleting frame event

I wasn't able to test with only the first commit, because KDE gets stuck
on its "splash screen".

Comment 3 Ian Pilcher 2011-05-21 15:55:23 UTC
Copied from https://bugs.freedesktop.org/show_bug.cgi?id=37420:

Ian Pilcher 2011-05-20 14:41:59 PDT

One other data point.  I'm using the following script to reproduce the
problem:

#!/bin/bash

export DISPLAY=:0

firefox http://www.cnn.com &>/dev/null &
kwrite &>/dev/null &
glxgears &>/dev/null &
sleep 15
qdbus org.kde.screensaver /ScreenSaver org.freedesktop.ScreenSaver.SetActive true
sleep 30
qdbus org.kde.screensaver /ScreenSaver org.freedesktop.ScreenSaver.SetActive false
sleep 15
killall kwrite
sleep 2
killall firefox
sleep 2
killall glxgears
sleep 2
qdbus org.kde.ksmserver /KSMServer logout 0 0 0

The interesting thing is that the problem does not occur without the
"killall ..." commands.  There's something about closing the windows (or
the way that KWin does it) that triggers the issue.

Comment 3 Ian Pilcher 2011-05-20 17:39:24 PDT

I am unable to reproduce this problem when booting with maxcpus=1 or
setting MALLOC_CHECK_ to any value.
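
For reference, MALLOC_CHECK_ is an environment variable read by glibc's malloc when a process starts. Below is a minimal sketch of setting it for a single command. The helper name is hypothetical, and the report does not say how the variable was passed to the X server itself.

```shell
#!/bin/sh
# Sketch: run one command with glibc heap checking enabled.
# Values as described in the glibc documentation:
#   0 = ignore errors, 1 = print a diagnostic, 2 = abort, 3 = print and abort.
run_with_malloc_check() {
    level=$1
    shift
    env MALLOC_CHECK_="$level" "$@"
}

# Example: run_with_malloc_check 3 glxgears
```

Enabling the checks changes the allocator's code paths and timing, which may be why a timing-dependent failure stops reproducing.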

Comment 4 Ian Pilcher 2011-05-22 20:29:50 UTC
With the latest glibc update, I'm not getting an abort, rather than a dead-
lock:

#0  0x0000003e06a36275 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e06a37b8b in abort () at abort.c:93
#2  0x0000003e06a7232e in __libc_message (do_abort=2, 
    fmt=0x3e06b5e060 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003e06a7896a in malloc_printerr (action=3, 
    str=0x3e06b5b012 "corrupted double-linked list", ptr=<optimized out>) at malloc.c:6283
#4  0x0000003e06a78d80 in malloc_consolidate (av=0x3e06d991e0) at malloc.c:5169
#5  0x0000003e06a79669 in malloc_consolidate (av=0x3e06d991e0) at malloc.c:5115
#6  _int_free (av=0x3e06d991e0, p=<optimized out>, have_lock=0) at malloc.c:5034
#7  0x0000000000461094 in FreeOsBuffers (oc=0x21459e0) at io.c:1101
#8  0x000000000045f283 in CloseDownConnection (client=0x2145a20) at connection.c:1068
#9  0x000000000042e1c6 in CloseDownClient (client=0x2145a20) at dispatch.c:3432
#10 0x000000000042ec3a in Dispatch () at dispatch.c:441
#11 0x0000000000422e1a in main (argc=<optimized out>, argv=0x7fffdb2af6c8, envp=<optimized out>)
    at main.c:287

Comment 5 Ian Pilcher 2011-05-22 20:32:52 UTC
(In reply to comment #4)
> With the latest glibc update, I'm not getting an abort, rather than a dead-
> lock:

s/not/now/

<sigh/>

Also, kdm is now able to restart X post-abort, so the problem is less severe
from a system usability point of view.

Comment 6 Ian Pilcher 2011-05-22 22:18:38 UTC
Created attachment 500311
Backtrace of deadlock with glibc-2.13.90-13.x86_64

... or not.  :-(

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:100
#1  0x0000003e06a7ec11 in _L_lock_10461 () at malloc.c:6486
#2  0x0000003e06a7c9d7 in __libc_malloc (bytes=266402894304) at malloc.c:3657
#3  0x00000000004ea3ce in XIChangeDeviceProperty (dev=0x17db4c0, property=<optimized out>, 
    type=19, format=8, mode=<optimized out>, len=<optimized out>, value=0x7fff435c4bcf, 
    sendevent=1) at xiproperty.c:749
#4  0x0000000000426fa4 in DisableDevice (dev=0x17db4c0, sendevent=1 '\001') at devices.c:499
#5  0x0000000000427298 in RemoveDevice (dev=0x17db4c0, sendevent=1 '\001') at devices.c:1059
#6  0x000000000047da32 in DeleteInputDeviceRequest (pDev=0x17db4c0) at xf86Xinput.c:957
#7  0x0000000000424560 in CloseDeviceList (listHead=0x7e4b08) at devices.c:968
#8  0x0000000000424ac4 in CloseDownDevices () at devices.c:996
#9  0x00000000004612f8 in AbortServer () at log.c:409
#10 0x00000000004614e7 in FatalError (f=0x578e50 "Caught signal %d (%s). Server aborting\n")
    at log.c:536
#11 0x000000000046231e in OsSigHandler (sip=<optimized out>, signo=11, unused=<optimized out>)
    at osinit.c:153
#12 OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:115
#13 <signal handler called>
#14 0x0000003e06a78bf5 in malloc_consolidate (av=0x3e06d991e0) at malloc.c:5169
#15 0x0000003e06a79669 in malloc_consolidate (av=0x3e06d991e0) at malloc.c:5115
#16 _int_free (av=0x3e06d991e0, p=<optimized out>, have_lock=0) at malloc.c:5034
#17 0x000000000044c64f in FreeClientResources (client=0x17d8140) at resource.c:858
#18 0x000000000042e0ce in CloseDownClient (client=0x17d8140) at dispatch.c:3461
#19 0x000000000042ec3a in Dispatch () at dispatch.c:441
#20 0x0000000000422e1a in main (argc=<optimized out>, argv=0x7fff435c5828, envp=<optimized out>)
    at main.c:287

Comment 7 Ian Pilcher 2011-06-07 04:43:29 UTC
I haven't seen this for a couple of weeks now.  Given the "raciness" of
the symptoms, it's hard to say whether the problem is really fixed or
still lurking (or even where the problem is/was).

Closing for now.