Bug 172907
Summary: | Kernel panic (divide error) when calling clock_gettime | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | James Lingard <redhat.com> | ||||
Component: | kernel | Assignee: | Dave Jones <davej> | ||||
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 4 | CC: | a.salwa, charlieb-fedora-bugzilla, dziekon, pfrields, wtogami | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | i686 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 729133 (view as bug list) | Environment: | |||||
Last Closed: | 2006-07-29 05:44:35 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
James Lingard
2005-11-11 01:28:49 UTC
I have used the attached program to reproduce this bug. Compile as: gcc -lrt bug.c -o bug and then run as: ./bug <id-list> for example: ./bug 0 2 ./bug 0 0 ./bug 2 ./bug 0 1 2 3 -2 -3 -4 -6 -7 -8 I can reproduce this bug fairly reliably with many different combinations of command-line arguments. In most cases, the crash occurs after a few (2-10) seconds of running the program. It seems to be more prone to occur when there are multiple arguments (and hence multiple threads calling clock_gettime), but I have also reproduced it with only a single thread. Created attachment 120919 [details]
Test program
Similiar thing happens sometimes on my two machines: Jan 11 18:12:27 lompa kernel: divide error: 0000 [#1] Jan 11 18:12:27 lompa kernel: SMP Jan 11 18:12:27 lompa kernel: Modules linked in: loop i915 drm parport_pc lp parport autofs4 rfcomm l2cap bluetooth sunrpc dm_mod video button battery ac ipv6 uhci_hcd ehci_hcd i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tg3 floppy ext3 jbd ata_piix libata sd_mod scsi_mod Jan 11 18:12:27 lompa kernel: CPU: 0 Jan 11 18:12:27 lompa kernel: EIP: 0060:[<c013420c>] Not tainted VLI Jan 11 18:12:27 lompa kernel: EFLAGS: 00010246 (2.6.14-1.1653_FC4smp) Jan 11 18:12:27 lompa kernel: EIP is at sample_to_timespec+0x31/0x3d Jan 11 18:12:27 lompa kernel: eax: ffc2f700 ebx: 3b9aca00 ecx: ffffffff edx: ffffffff Jan 11 18:12:27 lompa kernel: esi: dba19fa8 edi: dba19fa8 ebp: dba19000 esp: dba19f78 Jan 11 18:12:27 lompa kernel: ds: 007b es: 007b ss: 0068 Jan 11 18:12:27 lompa kernel: Process memcheck (pid: 27631, threadinfo=dba19000 task=db707030) Jan 11 18:12:27 lompa kernel: Stack: 00000000 fffca086 dba19fa8 dba19000 c01345f3 dba19fa8 ffc2f700 ffffffff Jan 11 18:12:27 lompa kernel: fffca086 00000000 fffca086 c013326a fffca086 00000000 fffca086 00000000 Jan 11 18:12:27 lompa kernel: c0103995 fffca086 09e9b36c 00de0ff4 00000000 fffca086 0eecdf18 ffffffda Jan 11 18:12:27 lompa kernel: Call Trace: Jan 11 18:12:27 lompa kernel: [<c01345f3>] posix_cpu_clock_get+0x48/0xf9 Jan 11 18:12:27 lompa kernel: [<c013326a>] sys_clock_gettime+0x16/0x82 Jan 11 18:12:27 lompa kernel: [<c0103995>] syscall_call+0x7/0xb Jan 11 18:12:27 lompa kernel: Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 55 57 56 53 83 ec 1c 89 c5 System is still up and running - no kernel panic after it. It happens in a context of big multithreaded app that calls pthread_getcpuclockid/clock_gettime to get per-thread CPU usage statistics and clock_gettime(CLOCK_MONOTONIC, ...) to get 'uptime'-like value. In the log above you can see 'Process memcheck' because I tried to run my app under Valgrind. It happens in both cases - under Valgrind and natively. But I CAN'T reproduce this bug using test program attached by James. Kernel version: 2.6.14-1.1653_FC4smp CPU: Pentium 4, hyper-threading This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you. Problem occured on FC4 with kernel 2.6.16.rc3. Pentium 4 HT, context of big multithreaded application. Happens quite often (more than 100 times in 24h) <0>divide error: 0000 [#254] PREEMPT SMP Modules linked in: iptable_nat parport_pc lp parport autofs4 nfsd exportfs lockd nfs_acl sunrpc xt_limit xt_tcpudp iptable_mangle ipt_MASQUERADE ip_nat ipt_TOS ipt_REJECT ip_conntrack_irc ip_conntrack_ftp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables vfat fat dm_mod video button battery ac ipv6 uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc b44 mii floppy CPU: 1 EIP: 0060:[<c012f20f>] Not tainted VLI EFLAGS: 00010246 (2.6.16-rc3 #2) EIP is at sample_to_timespec+0x31/0x3d eax: ffc2f700 ebx: 3b9aca00 ecx: ffffffff edx: ffffffff esi: c8bbffa8 edi: c8bbffa8 ebp: c8bbe000 esp: c8bbff78 ds: 007b es: 007b ss: 0068 Process my_app (pid: 5034, threadinfo=c8bbe000 task=f12db030) Stack: <0>00000000 ffff62ae c8bbffa8 c8bbe000 c0130fa1 c8bbffa8 ffc2f700 ffffffff ffff62ae 00000000 ffff62ae c012e470 ffff62ae 00000000 ffff62ae 00000000 c0102b7d ffff62ae 09391e4c 0015bff4 00000000 ffff62ae 09941128 ffffffda Call Trace: [<c0130fa1>] posix_cpu_clock_get+0x48/0xf9 [<c012e470>] sys_clock_gettime+0x16/0x84 [<c0102b7d>] syscall_call+0x7/0xb Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 57 56 53 89 c6 8b 40 10 89 As you're obviously not adverse to building your own kernels, could you post this upstream to linux-kernel.org please ? I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. I'm going to assume this got fixed. If it didn't, and you can still reproduce it, see comment #6 Thanks. (In reply to comment #7) > I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. > I'm going to assume this got fixed. If it didn't, and you can still reproduce > it, see comment #6 This looks very similar to a problem discussed upstream in 2.6.21-rc2: https://lkml.org/lkml/2007/3/2/20 I've just had a customer report similar symptoms with 2.6.18-164.el5 (your employer might be interested in that one). This looks like the fix: http://marc.info/?l=linux-kernel&m=117210430112543&w=2 (In reply to comment #8) > This looks like the fix: > > http://marc.info/?l=linux-kernel&m=117210430112543&w=2 I've verified that the RHEL5 kernel does not contain this (simple) fix. (In reply to comment #9) > I've verified that the RHEL5 kernel does not contain this (simple) fix. And I've opened bug 729133 against RHEL5 kernel. |