Bug 436659

Summary: oops in evdev device close
Product: [Fedora] Fedora
Reporter: Adam Jackson <ajax>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: rawhide
CC: dwmw2, wtogami, wwoods
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-04-08 15:28:33 UTC
Bug Blocks: 235705    
Attachments:
 - Hacked-up evtest with a grab and an exit (flags: none)
 - Full dmesg with crash (flags: none)

Description Adam Jackson 2008-03-08 20:54:41 UTC
Description of problem:

general protection fault: 0000 [1] SMP DEBUG_PAGEALLOC
CPU 0 
Modules linked in: rfcomm l2cap autofs4 fuse sunrpc nf_conntrack_ipv4 ipt_REJECT
iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp
ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6
cpufreq_ondemand acpi_cpufreq freq_table loop dm_multipath snd_hda_intel
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device arc4
snd_pcm_oss i2c_i801 snd_mixer_oss dcdbas firewire_ohci pcspkr joydev ecb
i2c_core snd_pcm firewire_core crypto_blkcipher sg crc_itu_t iTCO_wdt snd_timer
iTCO_vendor_support hci_usb iwl3945 snd_page_alloc mac80211 option tg3 bluetooth
wmi snd_hwdep cfg80211 sr_mod ac usbserial button battery cdrom snd soundcore
dm_snapshot dm_zero dm_mirror dm_mod ata_generic ata_piix pata_acpi libata
sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded:
microcode]
Pid: 2824, comm: X Not tainted 2.6.25-0.95.rc4.fc9 #1
RIP: 0010:[<ffffffff8113f68c>]  [<ffffffff8113f68c>] __list_add+0x2e/0x5b
RSP: 0018:ffff81007319bd08  EFLAGS: 00010046
RAX: 6b6b6b6b6b6b6b6b RBX: ffffffffffffffff RCX: ffff81007319bd08
RDX: ffff81007f8568e8 RSI: ffff81007f8568e8 RDI: ffff81007319bd38
RBP: ffff81007319bd08 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000001
R10: ffffffff811dfdc2 R11: 0000000000000046 R12: ffff81007f8568a8
R13: 0000000000000246 R14: ffff81007f8568b0 R15: ffff81007f856910
FS:  00007f8cb690b780(0000) GS:ffffffff81415000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000038fd203088 CR3: 0000000074d8c000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 2824, threadinfo ffff81007319a000, task ffff810072954000)
Stack:  ffff81007319bd98 ffffffff8129f8cb ffffffff811dfdc2 000000007d8ad3c0
 ffff810072954000 ffff81007f8568e8 ffff81007319bd38 ffff81007319bd38
 1111111111111111 ffff81007f8568a8 ffff81007319bd38 dead4ead00000303
Call Trace:
 [<ffffffff8129f8cb>] mutex_lock_nested+0x127/0x295
 [<ffffffff811dfdc2>] ? input_release_device+0x1f/0x34
 [<ffffffff811dfdc2>] input_release_device+0x1f/0x34
 [<ffffffff811e3d1b>] evdev_release+0x50/0xc9
 [<ffffffff810abc57>] __fput+0xca/0x18a
 [<ffffffff810abd2b>] fput+0x14/0x16
 [<ffffffff810a8f42>] filp_close+0x66/0x71
 [<ffffffff810364b6>] put_files_struct+0x74/0xc8
 [<ffffffff81036551>] __exit_files+0x47/0x50
 [<ffffffff81037dae>] do_exit+0x295/0x774
 [<ffffffff8103830f>] do_group_exit+0x82/0xa0
 [<ffffffff8103833f>] sys_exit_group+0x12/0x14
 [<ffffffff8100c1c7>] tracesys+0xdc/0xe1


Code: 42 08 49 89 f0 48 89 d6 48 89 e5 4c 39 c0 74 1b 48 89 d1 4c 89 c6 48 89 c2
48 c7 c7 11 99 38 81 31 c0 e8 b4 86 16 00 0f 0b eb fe <48> 8b 10 48 39 f2 74 15
48 89 c1 48 c7 c7 61 99 38 81 31 c0 e8 
RIP  [<ffffffff8113f68c>] __list_add+0x2e/0x5b
 RSP <ffff81007319bd08>
---[ end trace 42def19e8d3bbb29 ]---
Fixing recursive fault but reboot is needed!

---

Triggered this by playing with the evdev driver in X.  Devices attached: ALPS
touchpad, PS/2 pointer, USB mouse, keyboard.  Unfortunately I don't know which
of the device closures triggered it; it's not clear from the X log.

Comment 1 Chuck Ebbert 2008-03-10 22:37:21 UTC
Did it say "kernel bug at list_debug.h line NNN" above that?

Comment 2 Pete Zaitcev 2008-03-12 01:38:49 UTC
I poked around with evtest a little, no luck. I modified Vojtech's
evtest to perform grabbing (the backtrace could only happen if the
device was grabbed). Tried to open the same device twice, close
while getting events, disconnect. This does not seem trivially
reproducible... Sorry.

Comment 3 Adam Jackson 2008-03-12 20:33:09 UTC
Chuck: no, that was the entirety of the traceback.

This is trivial to reproduce with X.  Remove a plugged device while the
server is running, then exit X.

Comment 4 Will Woods 2008-03-12 20:48:11 UTC
This is making a lot of my F9Beta tests "crash" when I switch machines on the
KVM. Makes it reeeally hard to test things. Strangely it only seems to happen
while in anaconda.

Comment 5 Pete Zaitcev 2008-03-12 22:47:05 UTC
Created attachment 297861 [details]
Hacked-up evtest with a grab and an exit

Comment 6 Pete Zaitcev 2008-03-12 22:47:56 UTC
Created attachment 297862 [details]
Full dmesg with crash

Comment 7 Pete Zaitcev 2008-03-12 22:51:31 UTC
Looks like I'm reproducing something that is similar to Ajax' crash.
The recipe is:
 - run 2 evtests (I used the same device, one grabs one fails)
   in a while true loop.
 - Disconnect several times, as seen in the dmesg.
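For readers without the attachment, the grab-and-exit step of the recipe above
might look roughly like the sketch below. This is NOT the attached evtest: the
function name grab_and_read and its parameters are made up for illustration,
and it assumes a Linux system with <linux/input.h>. It opens an event node,
takes an exclusive grab with EVIOCGRAB, reads a few events, and closes the fd
with the grab still held, so the kernel has to drop the grab on the
evdev_release path seen in the backtrace.

```c
#include <fcntl.h>
#include <linux/input.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the attached evtest): grab an event device,
 * read up to max_events events, then close the fd while the grab is
 * still held. Returns the number of events read, or -1 if the device
 * cannot be opened or grabbed.
 */
int grab_and_read(const char *path, int max_events)
{
	struct input_event ev;
	int n = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	/* Exclusive grab; this fails if another client already holds it. */
	if (ioctl(fd, EVIOCGRAB, 1) < 0) {
		close(fd);
		return -1;
	}

	/* Stops early on error, e.g. when the device is unplugged. */
	while (n < max_events &&
	       read(fd, &ev, sizeof(ev)) == (ssize_t)sizeof(ev))
		n++;

	/* Deliberately no EVIOCGRAB(0) here: the grab is dropped on close. */
	close(fd);
	return n;
}
```

Running two copies of something like this in a shell loop against the same
event node, and unplugging the device a few times, would approximate the
recipe; which event node to use depends on the machine.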

Comment 8 Chuck Ebbert 2008-03-13 22:56:12 UTC
Oopses here in lib/list_debug.c::__list_add():

        if (unlikely(prev->next != next)) {

Corrupted linked list in the mutex code...
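A userspace model of that check may help when reading the trace. The sketch
below is not the kernel's lib/list_debug.c; the name checked_list_add and the
bool return value are assumptions made for illustration (the kernel version
reports the corruption instead of returning). It shows the two consistency
tests made before a node is linked in between prev and next. In the oops
above, RAX holds 6b6b6b6b6b6b6b6b at the faulting instruction, which suggests
that prev itself was garbage, so reading prev->next faulted before the
comparison could complete.

```c
#include <stdbool.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

/*
 * Userspace model of a checked insert of 'new' between 'prev' and 'next'.
 * Returns false, leaving the list untouched, if the neighbours do not
 * point at each other.
 */
bool checked_list_add(struct list_head *new,
		      struct list_head *prev,
		      struct list_head *next)
{
	if (next->prev != prev) {
		fprintf(stderr, "list_add corruption: next->prev != prev\n");
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr, "list_add corruption: prev->next != next\n");
		return false;
	}

	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
	return true;
}
```

Note that the model can only report corruption when prev and next are still
valid pointers; a pointer into freed, poisoned memory crashes on the first
dereference, as it did here.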

Comment 9 Chuck Ebbert 2008-03-14 03:58:18 UTC
Just before it oopsed it printed:
WARNING: at kernel/mutex.c:134 mutex_lock_nested+0xca/0x295() (Not tainted)

which is:
        spin_lock_mutex(&lock->wait_lock, flags);

#define spin_lock_mutex(lock, flags)                    \
        do {                                            \
                struct mutex *l = container_of(lock, struct mutex, wait_lock); \
                                                        \
                DEBUG_LOCKS_WARN_ON(in_interrupt());    \
                local_irq_save(flags);                  \
                __raw_spin_lock(&(lock)->raw_lock);     \
                DEBUG_LOCKS_WARN_ON(l->magic != l);     \
        } while (0)

We aren't in an interrupt, so it must be the lock magic that is bad.
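To illustrate why a bad magic points at freed memory, here is a small
userspace model. All names in it (dbg_mutex, dbg_mutex_init,
dbg_mutex_magic_ok, poison_object) are made up for this sketch; only the idea
is taken from the macro above: with mutex debugging enabled, the magic field
points back at the mutex itself, so the check fails as soon as the memory is
overwritten. The sketch assumes the object was freed and filled with the slab
poison byte 0x6b.

```c
#include <stdbool.h>
#include <string.h>

#define POISON_FREE 0x6b /* byte written over freed slab objects */

/* Simplified stand-in for a debug mutex; not the kernel's struct mutex. */
struct dbg_mutex {
	int count;
	void *magic; /* points back at the mutex while it is valid */
};

void dbg_mutex_init(struct dbg_mutex *m)
{
	m->count = 1;
	m->magic = m;
}

/* Models the condition tested by DEBUG_LOCKS_WARN_ON(l->magic != l). */
bool dbg_mutex_magic_ok(const struct dbg_mutex *m)
{
	return m->magic == (const void *)m;
}

/* Models what slab poisoning does to an object when it is freed. */
void poison_object(void *obj, size_t size)
{
	memset(obj, POISON_FREE, size);
}
```

After poison_object() runs over a dbg_mutex, its magic reads as
0x6b6b6b6b6b6b6b6b on a 64-bit machine and dbg_mutex_magic_ok() returns
false, which matches a warning followed by an oops on the wait list.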


Comment 10 Pete Zaitcev 2008-03-14 04:50:48 UTC
Right, I too think it's some kind of use-after-free.
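The register dump in the description supports this reading: RAX and R08 both
hold 6b6b6b6b6b6b6b6b, and 0x6b is the byte the slab allocator writes over
freed objects when poisoning is enabled (assumed to be the case for this
debug kernel). A trivial helper for spotting that pattern in a dump could
look like the sketch below; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if every byte of a 64-bit register value is the 0x6b free poison. */
bool is_slab_free_poison(uint64_t value)
{
	return value == UINT64_C(0x6b6b6b6b6b6b6b6b);
}
```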

Comment 11 Pete Zaitcev 2008-03-18 06:42:48 UTC
Oh god, drivers/input/input.c, I cannot unsee it. This should do it:

diff --git a/drivers/input/evdev.c b/drivers/input/evdev.c
index 0727b0a..c0874a3 100644
--- a/drivers/input/evdev.c
+++ b/drivers/input/evdev.c
@@ -155,7 +155,8 @@ static int evdev_ungrab(struct evdev *evdev, struct evdev_client *client)
 
 	rcu_assign_pointer(evdev->grab, NULL);
 	synchronize_rcu();
-	input_release_device(&evdev->handle);
+	if (evdev->exist)
+		input_release_device(&evdev->handle);
 
 	return 0;
 }


Comment 12 Jesse Keating 2008-03-19 18:12:16 UTC
We no longer use evdev for keyboards, are you still seeing this crash?  Should
we move this off the beta blocker list?

Comment 13 Will Woods 2008-03-19 18:15:18 UTC
xserver-1.5.0-no-evdev-keyboards-kthnx.patch (in xorg-x11-server
1.4.99.901-10.20080314.fc9) works around the crash.

Fix would still be nice to have, so I'm moving this to F9Target.

Comment 14 Pete Zaitcev 2008-03-19 19:03:52 UTC
I'm communicating with upstream about this. Dmitry prefers evdev to
pin down input_dev with its refcount instead of checking for existence.
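As a rough illustration of the pinning idea, here is a userspace sketch. It
is not Dmitry's patch and not kernel code: fake_dev, fake_client and all the
function names are invented, and a plain int stands in for the atomic
reference count the kernel would use. The point is that each open client
holds its own reference, so the device structure outlives a disconnect until
the last client has closed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for a device object with a reference count. */
struct fake_dev {
	int refs;   /* plain int: this sketch is not thread-safe */
	bool dead;  /* set where a real implementation would free */
};

struct fake_dev *fake_dev_get(struct fake_dev *dev)
{
	dev->refs++;
	return dev;
}

void fake_dev_put(struct fake_dev *dev)
{
	if (--dev->refs == 0)
		dev->dead = true; /* last reference gone: free would go here */
}

/* Invented stand-in for an open file on the device. */
struct fake_client {
	struct fake_dev *dev;
};

void fake_client_open(struct fake_client *c, struct fake_dev *dev)
{
	c->dev = fake_dev_get(dev); /* pin the device for this client */
}

void fake_client_close(struct fake_client *c)
{
	/* c->dev is still valid here, even after a disconnect. */
	fake_dev_put(c->dev);
	c->dev = NULL;
}
```

Compared with checking for existence, the close path needs no special case
here: it can always touch the device because its own reference keeps the
memory alive.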

Comment 15 Pete Zaitcev 2008-03-21 17:42:47 UTC
Regarding Jesse's comment #12, perhaps it would be better to throw my
patch into the kernel instead of saddling X11 with workarounds.
It's clear that the bug is in kernel. Dmitry said that he "does not
oppose" my patch, but wants "to understand it better". If X11 people
think that keyboard is ok with old API, then sure, no problem. I still
have my reproducer.

Comment 16 Will Woods 2008-03-21 21:08:08 UTC
Jesse's comment #12 was only to justify moving this bug off the Beta blocker
list. I won't consider the bug closed until we have this fix (or a similar
one) in the kernel.

IIRC we're not going to be using evdev keyboards for F9 final, but we still
hope to move to evdev-managed keyboards in F10 rawhide - that is to say, a few
weeks from now. There are also a few circumstances where evdev will still
claim keyboard devices in F9b. Avoiding oopses in those situations would be
A Very Good Thing. But I don't believe it's *urgent* anymore; if you need more
time to write a better patch, that's fine.


Comment 17 Adam Jackson 2008-03-24 13:49:34 UTC
We'll also still be using evdev for (most) mice.  According to comment #2, this
only happens when the device is grabbed, but the evdev driver will _always_
grab.  Given that my reproducer in comment #3 was tested by removing a mouse,
this still definitely needs a fix for F9.

I'd be happy with carrying Pete's fix for F9 until an upstream fix happens.

Comment 18 Adam Jackson 2008-04-08 15:28:33 UTC
I am unable to reproduce this with 2.6.25-0.204.rc8.git4.fc9.  I don't see any
evdev-related patches in our patchset, so I'll assume this was resolved upstream.

Thanks all!

Comment 19 Pete Zaitcev 2008-04-08 17:44:55 UTC
Dmitry used a different approach: he prefers pinning the parent device.
But whatever works. Commit a7097ff89c3204737a07eecbc83f9ae6002cc534.