494245 – X server deadlock

Bug 494245 - X server deadlock

Summary: X server deadlock

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-evdev
Sub Component:
Version:	11
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Peter Hutterer
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-04-05 23:38 UTC by David Woodhouse
Modified:	2018-04-11 08:19 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-11-05 22:45:20 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
X stuck in an infinite loop (186.97 KB, text/plain) 2009-04-15 23:13 UTC, Bojan Smojver	no flags	Details
View All

Description David Woodhouse 2009-04-05 23:38:00 UTC

When I turn off a Bluetooth mouse, X starts to get -ENODEV from the open filedescriptor before the HAL DeviceRemoved notification arrives.

Sometimes, X prints an error message:
         (EE) Bluetooth Mouse: Read error: No such device

Other times, X just locks up -- this is being done from a signal handler, and seems to need to allocate memory. But if we were already holding locks in glibc's allocator code when the signal arrived, we deadlock.

The race window is actually quite small and would almost never trigger if it happened only once per remove -- but instead of closing the non-functional device, X just keeps trying to read from the dead file descriptor and printing the same error message.... over and over and over again until it hits the race and locks up. (That should perhaps be considered a second bug).

(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x0000003b2067c875 in _L_lock_9304 () at hooks.c:126
#2  0x0000003b2067a477 in *__GI___libc_malloc (bytes=25) at malloc.c:3618
#3  0x00000000004f382a in LogVMessageVerb ()
#4  0x0000000000487327 in xf86Msg ()
#5  0x00007f62adb6a867 in ?? () from /usr/lib64/xorg/modules/input//evdev_drv.so
#6  0x000000000047de94 in ?? ()
#7  0x000000000046f3b4 in ?? ()
#8  <signal handler called>
#9  _int_free (av=0x3b20968e40, p=0x2f13510, have_lock=<value optimized out>) at malloc.c:4887
#10 0x000000000042d878 in dixFreePrivates ()
#11 0x00007f62afb89faa in fbDestroyPixmap () from /usr/lib64/xorg/modules//libfb.so
#12 0x000000000052f87c in ?? ()
#13 0x00000000005083ea in ?? ()
#14 0x00000000005236d8 in FreePicture ()
#15 0x00000000005222e8 in miTrapezoids ()
#16 0x000000000052b5bb in ?? ()


10:51:41.554369 read(11, 0x7fffb4dbaf40, 384) = -1 ENODEV (No such device)
10:51:41.554484 write(2, "(EE) Bluetooth Mouse: Read error: No such device\n"..., 49) = 49
10:51:41.554671 write(0, "(EE) Bluetooth Mouse: Read error: No such device\n"..., 49) = 49
10:51:41.554806 rt_sigreturn(0x1)       = 0
10:51:41.559194 read(31, "OUZ[\20\20\20\20\20\20\21\3\24\3URRW\3URRW\3\31\27\3\25\23\23\34\20\23\27\20\23\30\3\24\23\35\30\24\3\223\27\n\0\3\20\20\20\342\4@\1L\5@\1*\0\0\0\326\4@\1\t\1\276\0\2\23\34\20\t\1\276\0\27\24\24\23\223\27\n\0\3\27\n\0\345\4@\1L\5@\1*\0\0\0\202\0@\1\33\1\276\0\3\1\312\0\33\1\276\0\20!\3\0\223\27\r\0\3\27\n\0\331\4@\1L\5@\1*\0\0\0\326\4@\1-\1\276\0\17\1\312\0-\1\276\0VRFNHW\35>\24\25\23\25\31\31@\0\223\27\24\0\3\5@\1\345\4@\1L\5@\1*\0\0\0\202\0@\1\1\0\312\0,RFN\1\0\312\0OUZ[\20\20\20\20\20\20\21\3\24\3URRW\3URRW\3\31\27\3\25\23\23\34\20\23\27\20\23\30\3\24\23\35\30\24\3\223"..., 4468) = 1948
10:51:41.561956 --- SIGIO (I/O possible) @ 0 (0) ---
10:51:41.562016 select(19, [10 11 12 13 14 15 16 17 18], NULL, NULL, {0, 0}) = 2 (in [10 11], left {0, 0})
10:51:41.562167 rt_sigprocmask(SIG_BLOCK, [IO], [IO], 8) = 0
10:51:41.562290 select(1024, [10], NULL, NULL, {0, 0}) = 1 (in [10], left {0, 0})
10:51:41.562526 read(10, "\255\357\330I\0\0\0\0\371\222\10\0\0\0\0\0\3\0\30\0|\0\0\0\255\357\330I\0\0\0\0\3\223\10\0\0\0\0\0\3\0\0\0\355\2\0\0"..., 64) = 48
10:51:41.562679 select(1024, [10], NULL, NULL, {0, 0}) = 1 (in [10], left {0, 0})
10:51:41.562931 read(10, "\255\357\330I\0\0\0\0\6\223\10\0\0\0\0\0\3\0\1\0\365\1\0\0\255\357\330I\0\0\0\0\n\223\10\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 64) = 48
10:51:41.563090 select(1024, [10], NULL, NULL, {0, 0}) = 0 (Timeout)
10:51:41.563264 rt_sigprocmask(SIG_BLOCK, [IO], [IO], 8) = 0
10:51:41.563383 read(11, 0x7fffb4dbb040, 384) = -1 ENODEV (No such device)
10:51:41.563504 futex(0x3b20968e40, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
10:51:41.571523 --- SIGALRM (Alarm clock) @ 0 (0) ---
10:51:41.571736 rt_sigreturn(0xe)       = -1 EINTR (Interrupted system call)
10:51:41.571952 futex(0x3b20968e40, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
10:51:41.591526 --- SIGALRM (Alarm clock) @ 0 (0) ---
10:51:41.591736 rt_sigreturn(0xe)       = -1 EINTR (Interrupted system call)
10:51:41.591876 futex(0x3b20968e40, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)
10:51:41.611518 --- SIGALRM (Alarm clock) @ 0 (0) ---
10:51:41.611724 rt_sigreturn(0xe)       = -1 EINTR (Interrupted system call)
10:51:41.611862 futex(0x3b20968e40, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted)

Comment 1 David Woodhouse 2009-04-06 00:12:05 UTC

Turns out the evdev driver _does_ check for errno == ENODEV.

But only if the read() call returns zero. Not if it returns -1.

Doh.

There are still plenty of opportunities for deadlock even if we fix that, though.

Comment 2 Peter Hutterer 2009-04-06 00:58:02 UTC

comment #2 fixed in xorg-x11-drv-evdev-2.2.1-2.

http://koji.fedoraproject.org/koji/taskinfo?taskID=1278824

For the rest - if I read this right this would mean we must not print anything to the logs during the signal handler since xf86Msg will always allocate for the matching verbosity string. I need to check this further.

Comment 3 David Woodhouse 2009-04-09 21:09:29 UTC

(In reply to comment #2)
> For the rest - if I read this right this would mean we must not print anything
> to the logs during the signal handler since xf86Msg will always allocate for
> the matching verbosity string.

That seems to be the case.

Comment 4 Carlos Romero 2009-04-10 22:17:10 UTC

I had to downgrade evdev to 2.2.1-1 because X was locking up for 3-5 minutes at 100% load every time my bluetooth kb or mouse woke up from sleep.

Comment 5 Peter Hutterer 2009-04-14 03:49:28 UTC

wait, you're saying -2 locks up and -1 doesn't?

Comment 6 Carlos Romero 2009-04-14 06:11:41 UTC

-1 still locks up, just not for as long. not sure what the change was that started it. changed kernels, dongles and evdev so far but have not had time to look further.

Comment 7 David Woodhouse 2009-04-14 10:53:18 UTC

I think Carlos' version of "locks up" is different to mine. Mine is a deadlock in glibc which I don't think it's ever going to recover from. It sounds like it might just be that in Carlos' case it takes a long time to re-open the Bluetooth input devices when they reconnect. I've seen similar behaviour.

Carlos, is the X server _completely_ dead while this is happening? Do you have any non-Bluetooth input devices? Does the on-screen clock or anything else update while it's "locked up"? 

There are various other bugs with removal and addition of Bluetooth devices, including the one described in bug #494246, which means that removal events from the kernel have the wrong path and thus HAL never realises the device went away. So you get new devices being added which share /dev/inputX device nodes with other devices that HAL thinks exist.

Carlos, can you show /var/log/Xorg.0.log from after this has happened?

Comment 8 Bojan Smojver 2009-04-15 23:12:53 UTC

I'm seeing this on F-10 folks. Same SIGALRM/Interrupted system call loop in strace. It happens completely intermittently, X locks up and the screen is not released even if I kill it hard. Switching to RL 3 also doesn't help. Please note: I do not remove _any_ devices - it just decides to do this on its own.

I'll attach the Xorg.0.log.old file.

Comment 9 Bojan Smojver 2009-04-15 23:13:58 UTC

Created attachment 339762 [details]
X stuck in an infinite loop

Comment 10 Bojan Smojver 2009-04-15 23:18:14 UTC

One other detail, if it matters. I hibernate/thaw at least once a day and as you can see from the log, the X server was started 2 days ago, when this last happened.

Comment 11 Peter Hutterer 2009-04-16 01:40:35 UTC

Bojan: please open a separate bug for this issue. It's not related to evdev, the server is stuck in drmCommandWrite.

Comment 12 Bojan Smojver 2009-04-16 02:12:33 UTC

Thanks. Bug #496019 submitted.

Comment 13 Peter Hutterer 2009-04-16 04:10:54 UTC

Should be fixed with evdev 2.2.1-3.

http://koji.fedoraproject.org/koji/taskinfo?taskID=1301448

Comment 14 Bug Zapper 2009-06-09 13:18:56 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Matěj Cepl 2009-11-05 18:25:22 UTC

Since this bugzilla report was filed, there have been several major updates in various components of the Xorg system, which may have resolved this issue. Users who have experienced this problem are encouraged to upgrade their system to the latest version of their packages. For packages from updates-testing repository you can use command

yum upgrade --enablerepo='*-updates-testing'

Alternatively, you can also try to test whether this bug is reproducible with the upcoming Fedora 12 distribution by downloading LiveMedia of F12 Beta available at http://alt.fedoraproject.org/pub/alt/nightly-composes/ . By using that you get all the latest packages without need to install anything on your computer. For more information on using LiveMedia take a look at https://fedoraproject.org/wiki/FedoraLiveCD .

Please, if you experience this problem on the up-to-date system, let us now in the comment for this bug, or whether the upgraded system works for you.

If you won't be able to reply in one month, I will have to close this bug as INSUFFICIENT_DATA. Thank you.

[This is a bulk message for all open Fedora Rawhide Xorg-related bugs. I'm adding myself to the CC list for each bug, so I'll see any comments you make after this and do my best to make sure every issue gets proper attention.]

Comment 16 Peter Hutterer 2009-11-05 22:45:20 UTC

Closing per comment 13. Please reopen if the issue still persists.

Note You need to log in before you can comment on or make changes to this bug.