From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7 Description of problem: When calling the function clock_gettime() with parameter CLOCK_MONOTONIC, the kernel sometimes crashes, producing the following error in /var/log/messages. Nov 10 14:32:50 obsidian kernel: divide error: 0000 [#1] Nov 10 14:32:50 obsidian kernel: Modules linked in: tun ipv6 parport_pc lp parport autofs4 rfcomm l2cap bluetooth sunrpc pcmcia video ibm_acpi button battery ac yenta_socket rsrc_nonstatic pcmcia_core uhci_hcd ehci_hcd shpchp hw_random i2c_i801 i2c_core snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ipw2200 ieee80211 ieee80211_crypt e1000 floppy joydev dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod Nov 10 14:32:50 obsidian kernel: CPU: 0 Nov 10 14:32:50 obsidian kernel: EIP: 0060:[<c012df2c>] Not tainted VLI Nov 10 14:32:50 obsidian kernel: EFLAGS: 00210246 (2.6.14-1.1637_FC4) Nov 10 14:32:50 obsidian kernel: EIP is at sample_to_timespec+0x31/0x3d Nov 10 14:32:50 obsidian kernel: eax: f80dac27 ebx: 3b9aca00 ecx: ffffffec edx: ffffffec Nov 10 14:32:50 obsidian kernel: esi: e6610fa8 edi: e6610fa8 ebp: e6610000 esp: e6610f78 Nov 10 14:32:50 obsidian kernel: ds: 007b es: 007b ss: 0068 Nov 10 14:32:50 obsidian kernel: Process python (pid: 6738, threadinfo=e6610000 task=dff5c570) Nov 10 14:32:50 obsidian kernel: Stack: 00000000 fffffffe e6610fa8 e6610000 c012e313 e6610fa8 f80dac27 ffffffec Nov 10 14:32:50 obsidian kernel: fffffffe 08fdc818 00000003 c012cfa6 fffffffe 08fdc818 fffffffe 08fdc818 Nov 10 14:32:50 obsidian kernel: c0102edd fffffffe bf9a0ec4 00a61ff4 08fdc818 00000003 bf9a0e98 ffffffda Nov 10 14:32:50 obsidian kernel: Call Trace: Nov 10 14:32:50 obsidian kernel: [<c012e313>] posix_cpu_clock_get+0x48/0xf9 Nov 10 14:32:50 obsidian kernel: [<c012cfa6>] sys_clock_gettime+0x16/0x82 Nov 10 14:32:50 obsidian kernel: [<c0102edd>] syscall_call+0x7/0xb Nov 10 14:32:50 obsidian kernel: Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 55 57 56 53 83 ec 1c 89 c5 Nov 10 14:32:50 obsidian kernel: <0>Fatal exception: panic in 5 seconds The same problem occurs with kernel version 2.6.13-1.1532_FC4. Version-Release number of selected component (if applicable): kernel-2.6.14-1.1637_FC4 How reproducible: Sometimes Steps to Reproduce: Program illustrating this problem to follow shortly... Additional info: Disassembling the code from the error message in gdb gives the following. (gdb) print /x str $1 = {0x24, 0x14, 0x83, 0xe0, 0x3, 0x83, 0xf8, 0x2, 0x74, 0x18, 0xbb, 0x0, 0xca, 0x9a, 0x3b, 0xb8, 0xfa, 0x9, 0x3d, 0x0, 0xf7, 0xe2, 0xf7, 0xf3, 0x89, 0x56, 0x4, 0x89, 0x6, 0x5b, 0x5e, 0x5f, 0x5d, 0xc3, 0xbb, 0x0, 0xca, 0x9a, 0x3b, 0x89, 0xd0, 0x89, 0xca, 0xf7, 0xf3, 0x89, 0x56, 0x4, 0x89, 0x6, 0x5b, 0x5e, 0x5f, 0x5d, 0xc3, 0x55, 0x57, 0x56, 0x53, 0x83, 0xec, 0x1c, 0x89, 0xc5, 0x0} (gdb) disassemble str Dump of assembler code for function str: 0x08049540 <str+0>: and $0x14,%al 0x08049542 <str+2>: and $0x3,%eax 0x08049545 <str+5>: cmp $0x2,%eax 0x08049548 <str+8>: je 0x8049562 <str+34> 0x0804954a <str+10>: mov $0x3b9aca00,%ebx 0x0804954f <str+15>: mov $0x3d09fa,%eax 0x08049554 <str+20>: mul %edx 0x08049556 <str+22>: div %ebx 0x08049558 <str+24>: mov %edx,0x4(%esi) 0x0804955b <str+27>: mov %eax,(%esi) 0x0804955d <str+29>: pop %ebx 0x0804955e <str+30>: pop %esi 0x0804955f <str+31>: pop %edi 0x08049560 <str+32>: pop %ebp 0x08049561 <str+33>: ret 0x08049562 <str+34>: mov $0x3b9aca00,%ebx 0x08049567 <str+39>: mov %edx,%eax 0x08049569 <str+41>: mov %ecx,%edx [Crash is on this instruction:] 0x0804956b <str+43>: div %ebx 0x0804956d <str+45>: mov %edx,0x4(%esi) 0x08049570 <str+48>: mov %eax,(%esi) 0x08049572 <str+50>: pop %ebx 0x08049573 <str+51>: pop %esi 0x08049574 <str+52>: pop %edi 0x08049575 <str+53>: pop %ebp 0x08049576 <str+54>: ret 0x08049577 <str+55>: push %ebp 0x08049578 <str+56>: push %edi 0x08049579 <str+57>: push %esi 0x0804957a <str+58>: push %ebx 0x0804957b <str+59>: sub $0x1c,%esp 0x0804957e <str+62>: mov %eax,%ebp 0x08049580 <str+64>: add %al,(%eax) 0x08049582 <str+66>: add %al,(%eax) End of assembler dump. This corresponds to the first branch of the 'if' test in sample_to_timespec (kernel/posix-cpu-timers.c, line 51). I believe the divide error is due to an integer overflow. The value of EDX, which I believe is the high word of cpu.sched, looks very suspicious.
I have used the attached program to reproduce this bug. Compile as: gcc -lrt bug.c -o bug and then run as: ./bug <id-list> for example: ./bug 0 2 ./bug 0 0 ./bug 2 ./bug 0 1 2 3 -2 -3 -4 -6 -7 -8 I can reproduce this bug fairly reliably with many different combinations of command-line arguments. In most cases, the crash occurs after a few (2-10) seconds of running the program. It seems to be more prone to occur when there are multiple arguments (and hence multiple threads calling clock_gettime), but I have also reproduced it with only a single thread.
Created attachment 120919 [details] Test program
Similiar thing happens sometimes on my two machines: Jan 11 18:12:27 lompa kernel: divide error: 0000 [#1] Jan 11 18:12:27 lompa kernel: SMP Jan 11 18:12:27 lompa kernel: Modules linked in: loop i915 drm parport_pc lp parport autofs4 rfcomm l2cap bluetooth sunrpc dm_mod video button battery ac ipv6 uhci_hcd ehci_hcd i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tg3 floppy ext3 jbd ata_piix libata sd_mod scsi_mod Jan 11 18:12:27 lompa kernel: CPU: 0 Jan 11 18:12:27 lompa kernel: EIP: 0060:[<c013420c>] Not tainted VLI Jan 11 18:12:27 lompa kernel: EFLAGS: 00010246 (2.6.14-1.1653_FC4smp) Jan 11 18:12:27 lompa kernel: EIP is at sample_to_timespec+0x31/0x3d Jan 11 18:12:27 lompa kernel: eax: ffc2f700 ebx: 3b9aca00 ecx: ffffffff edx: ffffffff Jan 11 18:12:27 lompa kernel: esi: dba19fa8 edi: dba19fa8 ebp: dba19000 esp: dba19f78 Jan 11 18:12:27 lompa kernel: ds: 007b es: 007b ss: 0068 Jan 11 18:12:27 lompa kernel: Process memcheck (pid: 27631, threadinfo=dba19000 task=db707030) Jan 11 18:12:27 lompa kernel: Stack: 00000000 fffca086 dba19fa8 dba19000 c01345f3 dba19fa8 ffc2f700 ffffffff Jan 11 18:12:27 lompa kernel: fffca086 00000000 fffca086 c013326a fffca086 00000000 fffca086 00000000 Jan 11 18:12:27 lompa kernel: c0103995 fffca086 09e9b36c 00de0ff4 00000000 fffca086 0eecdf18 ffffffda Jan 11 18:12:27 lompa kernel: Call Trace: Jan 11 18:12:27 lompa kernel: [<c01345f3>] posix_cpu_clock_get+0x48/0xf9 Jan 11 18:12:27 lompa kernel: [<c013326a>] sys_clock_gettime+0x16/0x82 Jan 11 18:12:27 lompa kernel: [<c0103995>] syscall_call+0x7/0xb Jan 11 18:12:27 lompa kernel: Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 55 57 56 53 83 ec 1c 89 c5 System is still up and running - no kernel panic after it. It happens in a context of big multithreaded app that calls pthread_getcpuclockid/clock_gettime to get per-thread CPU usage statistics and clock_gettime(CLOCK_MONOTONIC, ...) to get 'uptime'-like value. In the log above you can see 'Process memcheck' because I tried to run my app under Valgrind. It happens in both cases - under Valgrind and natively. But I CAN'T reproduce this bug using test program attached by James. Kernel version: 2.6.14-1.1653_FC4smp CPU: Pentium 4, hyper-threading
This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.
Problem occured on FC4 with kernel 2.6.16.rc3. Pentium 4 HT, context of big multithreaded application. Happens quite often (more than 100 times in 24h) <0>divide error: 0000 [#254] PREEMPT SMP Modules linked in: iptable_nat parport_pc lp parport autofs4 nfsd exportfs lockd nfs_acl sunrpc xt_limit xt_tcpudp iptable_mangle ipt_MASQUERADE ip_nat ipt_TOS ipt_REJECT ip_conntrack_irc ip_conntrack_ftp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables vfat fat dm_mod video button battery ac ipv6 uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc b44 mii floppy CPU: 1 EIP: 0060:[<c012f20f>] Not tainted VLI EFLAGS: 00010246 (2.6.16-rc3 #2) EIP is at sample_to_timespec+0x31/0x3d eax: ffc2f700 ebx: 3b9aca00 ecx: ffffffff edx: ffffffff esi: c8bbffa8 edi: c8bbffa8 ebp: c8bbe000 esp: c8bbff78 ds: 007b es: 007b ss: 0068 Process my_app (pid: 5034, threadinfo=c8bbe000 task=f12db030) Stack: <0>00000000 ffff62ae c8bbffa8 c8bbe000 c0130fa1 c8bbffa8 ffc2f700 ffffffff ffff62ae 00000000 ffff62ae c012e470 ffff62ae 00000000 ffff62ae 00000000 c0102b7d ffff62ae 09391e4c 0015bff4 00000000 ffff62ae 09941128 ffffffda Call Trace: [<c0130fa1>] posix_cpu_clock_get+0x48/0xf9 [<c012e470>] sys_clock_gettime+0x16/0x84 [<c0102b7d>] syscall_call+0x7/0xb Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 57 56 53 89 c6 8b 40 10 89
As you're obviously not adverse to building your own kernels, could you post this upstream to linux-kernel.org please ?
I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. I'm going to assume this got fixed. If it didn't, and you can still reproduce it, see comment #6 Thanks.
(In reply to comment #7) > I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. > I'm going to assume this got fixed. If it didn't, and you can still reproduce > it, see comment #6 This looks very similar to a problem discussed upstream in 2.6.21-rc2: https://lkml.org/lkml/2007/3/2/20 I've just had a customer report similar symptoms with 2.6.18-164.el5 (your employer might be interested in that one). This looks like the fix: http://marc.info/?l=linux-kernel&m=117210430112543&w=2
(In reply to comment #8) > This looks like the fix: > > http://marc.info/?l=linux-kernel&m=117210430112543&w=2 I've verified that the RHEL5 kernel does not contain this (simple) fix.
(In reply to comment #9) > I've verified that the RHEL5 kernel does not contain this (simple) fix. And I've opened bug 729133 against RHEL5 kernel.