Bug 754674 - F16 X hang due to frequent evdev mouse scrollwheel events (F14 ok)
Summary: F16 X hang due to frequent evdev mouse scrollwheel events (F14 ok)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-evdev
Version: 16
Hardware: i686
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: Peter Hutterer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 807493
TreeView+ depends on / blocked
 
Reported: 2011-11-17 10:10 UTC by Trevor Cordes
Modified: 2012-05-03 07:29 UTC (History)
3 users (show)

Fixed In Version: xorg-x11-server-1.11.4-3.fc16
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 807493 (view as bug list)
Environment:
Last Closed: 2012-05-03 07:29:37 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
cheesy patch that seems to fix the problems (1.09 KB, patch)
2011-12-16 13:03 UTC, Trevor Cordes
no flags Details | Diff

Description Trevor Cordes 2011-11-17 10:10:35 UTC
Description of problem:
Just upgraded to Fedora 16 (from 14 to 15 to 16).  My X session now hangs 2-3 times a day.  The mouse freezes.  Keyboard unresponsive (including numlock).  Can't switch to virtual terminal.

I *can* ssh into the box from another.  The core system is running fine, it's just X or the video or something that's freezing.

I also *can* do a ALT-SYSRQ-<foo>.  For instance, doing an emergency sync works.  Killing the X ps's causes the screen to go black and the kb/mouse are still frozen.  Only fix is to issue reboot, which works fine.

dmesg and /var/log/messages show NOTHING about the hang.  No backtraces, no oopses, no debug info, nothing.  Xorg.0.log shows nothing unusual either.  I know that really sucks to not get any debug output.

I haven't changed video drivers: I was using nouveau in 14 without any problems.  I'm still using nouveau.

I only used F15 for a day and it didn't crash, but I may not have used it long enough or intensely enough, so all I can say is the bug started between latest F14 packages and F16.

Video card is:
01:00.0 VGA compatible controller: nVidia Corporation G73 [GeForce 7600 GS] (rev a1)
in dual-head (2 DVI) mode.


Version-Release number of selected component (if applicable):
xorg-x11-drv-nouveau-0.0.16-27.20110720gitb806e3f.fc16.i686
xorg-x11-server-Xorg-1.11.2-3.fc16.i686
kernel-PAE-3.1.1-1.fc16.i686

How reproducible:
Happens randomly 2-3 times a day so far.


Steps to Reproduce:
1. start up X
2. use system, opening more windows and doing more things
  
Actual results:
Occasional freeze

Expected results:
No freeze, just like Fedora 14

Additional info:

So far the freezes always seem to happen while manipulating firefox.  The pages aren't overly complex, and I don't think there was any flash in them at the time.  It may be a red herring as I spend at least 20% working time in firefox.

Also, I use synergy to sync the linux mouse/kb to my windows box and the most recent freeze happened shortly after I had started synergy up (it had run for at least 6 hours without synergy without freezing).  May be a coincidence but I want to document it here as a possibility.  I will try to isolate synergy from the equation and report back.

Lastly, my Xorg does have one weird error:
[   298.093] evdev: HID 04b3:3103: dropping event due to full queue!

But that has always been there (in F14, etc).  It's caused by my IBM ScrollPoint mouse and it's odd pencil-eraser type scroll wheel.  I don't think it's related to the scroll wheel, as I use it all the time, but I will try to discern any correlation between use and freezing.

I will do anything to solve this issue as it's my main workstation and I can't bear freezes for long.

Comment 1 Trevor Cordes 2011-11-17 10:17:47 UTC
Note to self: try a strace on the X ps's next time freeze happens.

Comment 2 Trevor Cordes 2011-11-18 05:32:35 UTC
Just froze again.  Learned more.

It is *not* caused by synergy, as synergy wasn't running at all since I last rebooted.

I got an interesting strace on the X ps:

An endless stream of about 10 of these a second:

--- {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=1158153608, ptr=0x45080588}} (Alarm clock) ---
sigreturn()                             = ? (mask now [IO])
futex(0x450c73e0, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
--- {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=1158153608, ptr=0x45080588}} (Alarm clock) ---
sigreturn()                             = ? (mask now [IO])
futex(0x450c73e0, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
--- {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=1158153608, ptr=0x45080588}} (Alarm clock) ---

X appears to be stuck in some futex deadlock?

I was using the mouse scrollwheel heavily at the time (so far I was scrolling during most crashes I think).

I just tried opening a few big scrollbar-able pages and scrolled with the wheel for several minutes like mad, but I couldn't get it to freeze on demand.

Comment 3 Trevor Cordes 2011-11-18 10:34:10 UTC
Very good news.  I have a stack trace during the freeze.  Using gdb and -debuginfo rpms I get:

gdb /usr/bin/X 2107
#0  0x0038d424 in __kernel_vsyscall ()
#1  0x450209f2 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x44f9bcff in _L_lock_11099 () from /lib/libc.so.6
#3  0x44f99e6c in malloc () from /lib/libc.so.6
#4  0x4502a631 in __vasprintf_chk () from /lib/libc.so.6
#5  0x4502a607 in __asprintf_chk () from /lib/libc.so.6
#6  0x080d83f9 in asprintf (__fmt=0x81d5481 "%s: %s: %s", __ptr=0xbffdb2bc) at /usr/include/bits/stdio2.h:158
#7  xf86VIDrvMsgVerb (dev=0x88b47e0, type=X_NONE, verb=1,
    format=0xfa5d50 "dropping event due to full queue!\n", args=0xbffdb31c "") at xf86Helper.c:1092
#8  0x080d84eb in xf86IDrvMsg (dev=0x88b47e0, type=X_NONE,
    format=0xfa5d50 "dropping event due to full queue!\n") at xf86Helper.c:1120
#9  0x00f9fcb2 in EvdevNextInQueue (pInfo=<optimized out>) at evdev.c:282
#10 EvdevNextInQueue (pInfo=<optimized out>) at evdev.c:276
#11 0x00f9ff5d in EvdevQueueButtonEvent (pInfo=0x88b47e0, button=4, value=1) at evdev.c:314
#12 0x00fa0037 in EvdevQueueButtonClicks (pInfo=0x88b47e0, button=4, count=21) at evdev.c:350
#13 0x00fa16fb in EvdevProcessRelativeMotionEvent (ev=0xbffdb3c0, pInfo=0x88b47e0) at evdev.c:595
#14 EvdevProcessEvent (ev=0xbffdb3c0, pInfo=0x88b47e0) at evdev.c:834
#15 EvdevReadInput (pInfo=0x88b47e0) at evdev.c:891
#16 0x080c6510 in xf86SigioReadInput (fd=17, closure=0x88b47e0) at xf86Events.c:298
#17 0x080ecf8b in xf86SIGIO (sig=29) at ../shared/sigio.c:109
#18 <signal handler called>
#19 0x44f96bc4 in _int_free () from /lib/libc.so.6
#20 0x45e88139 in pixman_region_fini () from /usr/lib/libpixman-1.so.0
#21 0x08093eb3 in RegionDestroy (pReg=0x8d2a238) at region.c:254
#22 0x081b03e3 in miDestroyClip (pGC=0x89215e0) at migc.c:71
#23 0x081b0440 in miChangeClip (pGC=0x89215e0, type=1, pvalue=0x894d640, nrects=0) at migc.c:80
#24 0x008c5508 in exaChangeClip (pGC=0x89215e0, type=1, pvalue=0x894d640, nrects=0) at exa.c:601
#25 0x08154565 in damageChangeClip (pGC=0x89215e0, type=1, pvalue=0x894d640, nrects=0) at damage.c:480
#26 0x08089f19 in ChangeGC (client=0x89087f8, pGC=0x89215e0, mask=0, pUnion=<optimized out>) at gc.c:341
#27 0x0808a156 in ChangeGCXIDs (client=0x89087f8, pGC=0x89215e0, mask=524288, pC32=0x8963d64) at gc.c:464
#28 0x080716af in ProcChangeGC (client=0x89087f8) at dispatch.c:1507
#29 0x080760f5 in Dispatch () at dispatch.c:447
#30 0x0806433a in main (argc=4, argv=0xbffdbfe4, envp=0xbffdbff8) at main.c:287

I'm pretty sure now the bug is due to using the scrollwheel on my IBM ScrollPoint mouse.  Possibly also related to the copious (20-100 per sec) debug messages I see in Xorg.log when scrolling, as outlined earlier:

[ 18591.986] evdev: HID 04b3:3103: dropping event due to full queue!

As you can see above, that message is included in the stack trace.  Some sort of race or thread interrupt problem?

My mouse is a bit rare, I'm sure, but this X code should probably handle rapid-fire events in any case.

I am going to try as hard as possible to not scroll-wheel for the next few days to see if the freezes stop.

I am also going to hack around in the X and evdev code when I get a chance to see if I can make a patch.  However, I know nothing of the inner workings of X so someone familiar with it can probably fix it in a fraction of the time.

Comment 4 Trevor Cordes 2011-11-18 10:40:07 UTC
Further rpm version info:

xorg-x11-drv-evdev-2.6.99-3.20110601giteaf202531.fc16.i686
xorg-x11-server-common-1.11.2-3.fc16.i686
xorg-x11-server-Xorg-1.11.2-3.fc16.i686
xorg-x11-drv-nouveau-0.0.16-27.20110720gitb806e3f.fc16.i686

Using nouveau driver.  No xorg.conf file.


Also discovered a method to recover:

on other machine ssh'd in to frozen box:
kill -9 the /usr/bin/X ps
/usr/bin/X
^C

on frozen box:
startx

X then starts up ok and works fine.  So bug appears to be mid-level, not in the low-level nouveau driver, so that is good news.

Comment 5 Trevor Cordes 2011-11-18 10:48:29 UTC
See related ubuntu bug regarding the odd linux handling of this mouse's events.  However, this ubuntu bug isn't about X hanging/freezing, it's just about making the mouse more usable.  The linux support for this mouse seems to get worse with each new release.  That's too bad because this mouse series is universally loved by anyone who ever uses it (many blogs about it).

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-input-mouse/+bug/126897

Comment 6 Trevor Cordes 2011-11-24 02:51:30 UTC
Bug is definitely in the scrollwheel code path.  I have tried not to scrollwheel for nearly 1 week and I have zero new system hangs.  I have accidentally hit it from time to time but I've probably cut the number of events down by 99%.

Comment 7 Trevor Cordes 2011-12-03 00:55:49 UTC
I take back the last comment.  This may not be specific to the scrollwheel.  I've been very good at not hitting the scrollwheel for over a week and I just had a crash where I'm sure I wasn't using the scrollwheel (but I was mousing around).  However, it is so sensitive that sometimes it triggers on its own (rarely), often by a nearby right-click.

I can't tell by the times in the Xorg.log whether the scrollwheel triggered or not.  I'm guessing it didn't.

Anyhow, same deadlock just happened now, with pretty much the same stack trace.  If this bug isn't unique to my mouse, it may hit other people!

#0  0x00a9c424 in __kernel_vsyscall ()
#1  0x00210a32 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0018bd2f in _L_lock_11172 () from /lib/libc.so.6
#3  0x00189e8c in malloc () from /lib/libc.so.6
#4  0x0021a671 in __vasprintf_chk () from /lib/libc.so.6
#5  0x0021a647 in __asprintf_chk () from /lib/libc.so.6
#6  0x080d83f9 in asprintf (__fmt=0x81d5481 "%s: %s: %s", __ptr=0xbf91a2ec)
    at /usr/include/bits/stdio2.h:158
#7  xf86VIDrvMsgVerb (dev=0xb403378, type=X_NONE, verb=1,
    format=0x743d50 "dropping event due to full queue!\n", args=0xbf91a34c "") at xf86Helper.c:1092
#8  0x080d84eb in xf86IDrvMsg (dev=0xb403378, type=X_NONE,
    format=0x743d50 "dropping event due to full queue!\n") at xf86Helper.c:1120
#9  0x0073dcb2 in EvdevNextInQueue (pInfo=<optimized out>) at evdev.c:282
#10 EvdevNextInQueue (pInfo=<optimized out>) at evdev.c:276
#11 0x0073df5d in EvdevQueueButtonEvent (pInfo=0xb403378, button=5, value=1) at evdev.c:314
#12 0x0073e037 in EvdevQueueButtonClicks (pInfo=0xb403378, button=5, count=63) at evdev.c:350
#13 0x0073f604 in EvdevProcessRelativeMotionEvent (ev=0xbf91a3f0, pInfo=0xb403378) at evdev.c:597
#14 EvdevProcessEvent (ev=0xbf91a3f0, pInfo=0xb403378) at evdev.c:834
#15 EvdevReadInput (pInfo=0xb403378) at evdev.c:891
#16 0x080c6510 in xf86SigioReadInput (fd=17, closure=0xb403378) at xf86Events.c:298
#17 0x080ecf8b in xf86SIGIO (sig=29) at ../shared/sigio.c:109
#18 <signal handler called>
#19 0x00187468 in _int_malloc () from /lib/libc.so.6
#20 0x00189e95 in malloc () from /lib/libc.so.6
#21 0x0808fee8 in AllocatePixmap (pScreen=0x9cccff0, pixDataSize=0) at pixmap.c:123
#22 0x0068f4bc in fbCreatePixmapBpp (pScreen=0x9cccff0, width=0, height=0, depth=8, bpp=8,
    usage_hint=1) at fbpixmap.c:53
#23 0x0068f5fe in fbCreatePixmap (pScreen=0x9cccff0, width=0, height=0, depth=8, usage_hint=1)
    at fbpixmap.c:94
#24 0x00e0e6f4 in exaCreatePixmap_mixed (pScreen=0x9cccff0, w=9, h=2, depth=8, usage_hint=1)
    at exa_mixed.c:62
#25 0x00e13486 in exaGlyphs (op=3 '\003', pSrc=0x9f26190, pDst=0xb5589a0, maskFormat=0x9cd1d18,
    xSrc=879, ySrc=20, nlist=1, list=0xbf91c8ac, glyphs=0xbf91c4ac) at exa_glyphs.c:737
#26 0x08156c5f in damageGlyphs (op=3 '\003', pSrc=0x9f26190, pDst=0xb5589a0, maskFormat=0x9cd1d18,
    xSrc=879, ySrc=20, nlist=1, list=0xbf91c8ac, glyphs=0xbf91c4ac) at damage.c:647
#27 0x08143711 in CompositeGlyphs (op=3 '\003', pSrc=0x9f26190, pDst=0xb5589a0, maskFormat=0x9cd1d18,
    xSrc=879, ySrc=20, nlist=1, lists=0xbf91c8ac, glyphs=0xbf91c4ac) at glyph.c:604
#28 0x081505c8 in ProcRenderCompositeGlyphs (client=0x9f04280) at render.c:1440
#29 0x08149ed4 in ProcRenderDispatch (client=0x9f04280) at render.c:2063
#30 0x080760f5 in Dispatch () at dispatch.c:447
#31 0x0806433a in main (argc=4, argv=0xbf91cd44, envp=0xbf91cd58) at main.c:287

Comment 8 Trevor Cordes 2011-12-03 01:00:42 UTC
Comparing the 2 stacks, there is a difference starting at #19.  The old one was calling _int_free, whereas this new one was calling _int_malloc.  The Pixmap/etc ones below it were also different.  But I guess this makes sense as it's a sig alarm that's interrupting whatever was happening at the time, and the hang is in the signal handler.

Comment 9 Trevor Cordes 2011-12-03 01:24:37 UTC
OK, I take back the tack back.  This seems to still be the scrollwheel specific bug.  The stack tells us (upon closer inspection): the new hang is on button 5, which is wheel_down_button.  The old hang was on button 4, which is wheel_up_button.  So even though I wasn't scrollwheeling, my mouse triggered it anyhow.  Like I said, this happens sometimes as the right mouse button sometimes causes slight flex in the mouse which triggers a tiny bit of scrollwheel action.  It's the same as if I just nanosecond-tapped the scrollwheel.

These IBM scrollwheels are crazy sensitive.  I think I could use it as a seismograph.

Comment 10 Trevor Cordes 2011-12-16 13:02:11 UTC
I hacked around in evdev.c and seem to have fixed the problem.  No crashes for 2 weeks while using the scrollpoint "wheel" normally (i.e. heavily).  I follow previous posts (elsewhere) regarding clamping the magnitude of events the scrollpoint generates at 1/-1.  That makes the scrollpoint usable (ie: slows it down to a reasonable level).  Also, I commented out the debug error message that I'm pretty sure is causing the race and hang.

The patch is probably not good for inclusion but should allow others with the problem some sort of workaround.

If anyone wants to help with tips for how to make a patch like this acceptable for inclusion, I'd love to hear it.  It would probably have to detect the precise mouse models and only apply the fix to them.  I have no idea what the "proper" way to do that is.

The commenting of the debug print may be a good compromise.  Letting the user know this is probably useless, and when it does happen it's almost guaranteed to spit out a zillion lines each time.

Of course, the real solution to the hang is to solve the race condition, but that's beyond my capabilities.

Comment 11 Trevor Cordes 2011-12-16 13:03:18 UTC
Created attachment 547789 [details]
cheesy patch that seems to fix the problems

not ready for prime time, but a useful workaround

Comment 12 Trevor Cordes 2012-01-01 20:33:27 UTC
My patch does mitigate the bug for sure.  It's been 3 weeks without a single crash.

Comment 13 Adam Jackson 2012-02-09 18:22:01 UTC
That looks like it might be plausible.  Reassigning to the evdev maintainer.

Comment 14 Trevor Cordes 2012-03-27 05:22:56 UTC
Thanks for that.  Nearly 4 months with zero crashes due to this bug using my patch.  My patch definitely mitigates it, if in an ugly way.  Obviously someone who knows the inner workings could craft the idea into something more acceptable.

Comment 15 Fedora Update System 2012-03-27 05:24:15 UTC
xorg-x11-server-1.11.4-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/xorg-x11-server-1.11.4-3.fc16

Comment 16 Trevor Cordes 2012-03-27 06:00:53 UTC
Forgive my ignorance on procedure, but it looks like the 1.11.4-3.fc16 update is claiming to fix this bug yet I don't see the package evdev being touched at all.  The source file I patched is in evdev.  Or was there some better fix higher up (or lower down?) that someone did to mitigate this bug?  I'd love to see the patch if I knew where to find it.  Thanks!

Comment 17 Peter Hutterer 2012-03-27 06:25:18 UTC
It's mostly a server bug. The hang was caused by a malloc during the signal handler, triggered by xf86IDrvMsg(). Your stacktrace in both cases is (read bottom-up for better understanding)

#0  0x00a9c424 in __kernel_vsyscall ()
#1  0x00210a32 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x0018bd2f in _L_lock_11172 () from /lib/libc.so.6

oops, malloc during signal handler, I'm going to hang myself now.

#3  0x00189e8c in malloc () from /lib/libc.so.6
#4  0x0021a671 in __vasprintf_chk () from /lib/libc.so.6
#5  0x0021a647 in __asprintf_chk () from /lib/libc.so.6
#6  0x080d83f9 in asprintf (__fmt=0x81d5481 "%s: %s: %s", __ptr=0xbf91a2ec)
    at /usr/include/bits/stdio2.h:158

scrolling triggered this warning

#7  xf86VIDrvMsgVerb (dev=0xb403378, type=X_NONE, verb=1,
    format=0x743d50 "dropping event due to full queue!\n", args=0xbf91a34c "")
at xf86Helper.c:1092

[...]

#16 0x080c6510 in xf86SigioReadInput (fd=17, closure=0xb403378) at
xf86Events.c:298
#17 0x080ecf8b in xf86SIGIO (sig=29) at ../shared/sigio.c:109
#18 <signal handler called>

malloc interrupted by signal

#19 0x00187468 in _int_malloc () from /lib/libc.so.6



The patches removed the asprintf part from the logging, so this should restore the previous behaviour (F14 didn't use xf86IDrvMsg, it's a new API).

Comment 18 Trevor Cordes 2012-03-27 06:34:40 UTC
Great, I thought the ultimate problems was the message printing routine.

Now, is there any way to get the other part of my bug fixed in evdev?  This IBM scrollpoint mouse generates thousands of events on just a tiny little press causing the scrollwheel to be too touchy and jump 20 pages on a slight tap.  The ignoring of value (instead use "1") fixes it and makes it work perfectly.

Should I start a new bug or find & revive an old one (I think it's in there somewhere from forever ago but no one seems to care).  There's quite a lot of discussion elsewhere on the net (other distros/bugzillas) that discuss this problem at length.  It would be great to get a fix in the stock evdev.

Thanks!

Comment 19 Peter Hutterer 2012-03-27 23:29:23 UTC
Bug 807493 has been filed for the high scrolling values, let's move the discussion over there.

This bug is now only for the hang caused by malloc during signal handlers.

Comment 20 Fedora Update System 2012-03-31 03:12:41 UTC
Package xorg-x11-server-1.11.4-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing xorg-x11-server-1.11.4-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-5022/xorg-x11-server-1.11.4-3.fc16
then log in and leave karma (feedback).

Comment 21 Fedora Update System 2012-05-03 07:29:37 UTC
xorg-x11-server-1.11.4-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.