Bug 172907

Summary:

Kernel panic (divide error) when calling clock_gettime

Product:

[Fedora] Fedora

Reporter:

James Lingard <redhat.com>

Component:

kernel

Assignee:

Dave Jones <davej>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Docs Contact:

Priority:

medium

Version:

CC:

a.salwa, charlieb-fedora-bugzilla, dziekon, pfrields, wtogami

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

729133 (view as bug list)

Environment:

Last Closed:

2006-07-29 05:44:35 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Test program	none

Description James Lingard 2005-11-11 01:28:49 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
When calling the function clock_gettime() with parameter CLOCK_MONOTONIC, the kernel sometimes crashes, producing the following error in /var/log/messages.

Nov 10 14:32:50 obsidian kernel: divide error: 0000 [#1]
Nov 10 14:32:50 obsidian kernel: Modules linked in: tun ipv6 parport_pc lp parport autofs4 rfcomm l2cap bluetooth sunrpc pcmcia video ibm_acpi button battery ac yenta_socket rsrc_nonstatic pcmcia_core uhci_hcd ehci_hcd shpchp hw_random i2c_i801 i2c_core snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ipw2200 ieee80211 ieee80211_crypt e1000 floppy joydev dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
Nov 10 14:32:50 obsidian kernel: CPU:    0
Nov 10 14:32:50 obsidian kernel: EIP:    0060:[<c012df2c>]    Not tainted VLI
Nov 10 14:32:50 obsidian kernel: EFLAGS: 00210246   (2.6.14-1.1637_FC4)
Nov 10 14:32:50 obsidian kernel: EIP is at sample_to_timespec+0x31/0x3d
Nov 10 14:32:50 obsidian kernel: eax: f80dac27   ebx: 3b9aca00   ecx: ffffffec   edx: ffffffec
Nov 10 14:32:50 obsidian kernel: esi: e6610fa8   edi: e6610fa8   ebp: e6610000   esp: e6610f78
Nov 10 14:32:50 obsidian kernel: ds: 007b   es: 007b   ss: 0068
Nov 10 14:32:50 obsidian kernel: Process python (pid: 6738, threadinfo=e6610000 task=dff5c570)
Nov 10 14:32:50 obsidian kernel: Stack: 00000000 fffffffe e6610fa8 e6610000 c012e313 e6610fa8 f80dac27 ffffffec
Nov 10 14:32:50 obsidian kernel:        fffffffe 08fdc818 00000003 c012cfa6 fffffffe 08fdc818 fffffffe 08fdc818
Nov 10 14:32:50 obsidian kernel:        c0102edd fffffffe bf9a0ec4 00a61ff4 08fdc818 00000003 bf9a0e98 ffffffda
Nov 10 14:32:50 obsidian kernel: Call Trace:
Nov 10 14:32:50 obsidian kernel:  [<c012e313>] posix_cpu_clock_get+0x48/0xf9
Nov 10 14:32:50 obsidian kernel:  [<c012cfa6>] sys_clock_gettime+0x16/0x82
Nov 10 14:32:50 obsidian kernel:  [<c0102edd>] syscall_call+0x7/0xb
Nov 10 14:32:50 obsidian kernel: Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 55 57 56 53 83 ec 1c 89 c5
Nov 10 14:32:50 obsidian kernel:  <0>Fatal exception: panic in 5 seconds


The same problem occurs with kernel version 2.6.13-1.1532_FC4.

Version-Release number of selected component (if applicable):
kernel-2.6.14-1.1637_FC4

How reproducible:
Sometimes

Steps to Reproduce:
Program illustrating this problem to follow shortly...

Additional info:

Disassembling the code from the error message in gdb gives the following.

(gdb) print /x str
$1 = {0x24, 0x14, 0x83, 0xe0, 0x3, 0x83, 0xf8, 0x2,
  0x74, 0x18, 0xbb, 0x0, 0xca, 0x9a, 0x3b, 0xb8,
  0xfa, 0x9, 0x3d, 0x0, 0xf7, 0xe2, 0xf7, 0xf3,
  0x89, 0x56, 0x4, 0x89, 0x6, 0x5b, 0x5e, 0x5f,
  0x5d, 0xc3, 0xbb, 0x0, 0xca, 0x9a, 0x3b, 0x89,
  0xd0, 0x89, 0xca, 0xf7, 0xf3, 0x89, 0x56, 0x4,
  0x89, 0x6, 0x5b, 0x5e, 0x5f, 0x5d, 0xc3, 0x55,
  0x57, 0x56, 0x53, 0x83, 0xec, 0x1c, 0x89, 0xc5,
  0x0}
(gdb) disassemble str
Dump of assembler code for function str:
0x08049540 <str+0>:     and    $0x14,%al
0x08049542 <str+2>:     and    $0x3,%eax
0x08049545 <str+5>:     cmp    $0x2,%eax
0x08049548 <str+8>:     je     0x8049562 <str+34>
0x0804954a <str+10>:    mov    $0x3b9aca00,%ebx
0x0804954f <str+15>:    mov    $0x3d09fa,%eax
0x08049554 <str+20>:    mul    %edx
0x08049556 <str+22>:    div    %ebx
0x08049558 <str+24>:    mov    %edx,0x4(%esi)
0x0804955b <str+27>:    mov    %eax,(%esi)
0x0804955d <str+29>:    pop    %ebx
0x0804955e <str+30>:    pop    %esi
0x0804955f <str+31>:    pop    %edi
0x08049560 <str+32>:    pop    %ebp
0x08049561 <str+33>:    ret
0x08049562 <str+34>:    mov    $0x3b9aca00,%ebx
0x08049567 <str+39>:    mov    %edx,%eax
0x08049569 <str+41>:    mov    %ecx,%edx

[Crash is on this instruction:]
0x0804956b <str+43>:    div    %ebx

0x0804956d <str+45>:    mov    %edx,0x4(%esi)
0x08049570 <str+48>:    mov    %eax,(%esi)
0x08049572 <str+50>:    pop    %ebx
0x08049573 <str+51>:    pop    %esi
0x08049574 <str+52>:    pop    %edi
0x08049575 <str+53>:    pop    %ebp
0x08049576 <str+54>:    ret
0x08049577 <str+55>:    push   %ebp
0x08049578 <str+56>:    push   %edi
0x08049579 <str+57>:    push   %esi
0x0804957a <str+58>:    push   %ebx
0x0804957b <str+59>:    sub    $0x1c,%esp
0x0804957e <str+62>:    mov    %eax,%ebp
0x08049580 <str+64>:    add    %al,(%eax)
0x08049582 <str+66>:    add    %al,(%eax)
End of assembler dump.

This corresponds to the first branch of the 'if' test in sample_to_timespec (kernel/posix-cpu-timers.c, line 51).  I believe the divide error is due to an integer overflow.  The value of EDX, which I believe is the high word of cpu.sched, looks very suspicious.

Comment 1 James Lingard 2005-11-11 03:34:43 UTC

I have used the attached program to reproduce this bug.  Compile as:
    gcc -lrt bug.c -o bug
and then run as:
    ./bug <id-list>
for example:
    ./bug 0 2
    ./bug 0 0
    ./bug 2
    ./bug 0 1 2 3 -2 -3 -4 -6 -7 -8

I can reproduce this bug fairly reliably with many different combinations of
command-line arguments.  In most cases, the crash occurs after a few (2-10)
seconds of running the program.  It seems to be more prone to occur when there
are multiple arguments (and hence multiple threads calling clock_gettime), but I
have also reproduced it with only a single thread.

Comment 2 James Lingard 2005-11-11 03:36:36 UTC

Created attachment 120919 [details]
Test program

Comment 3 Aleksander Salwa 2006-01-19 10:18:18 UTC

Similiar thing happens sometimes on my two machines:

Jan 11 18:12:27 lompa kernel: divide error: 0000 [#1]
Jan 11 18:12:27 lompa kernel: SMP
Jan 11 18:12:27 lompa kernel: Modules linked in: loop i915 drm parport_pc lp
parport autofs4 rfcomm l2cap bluetooth sunrpc dm_mod video button battery ac
ipv6 uhci_hcd ehci_hcd i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec
snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tg3
floppy ext3 jbd ata_piix libata sd_mod scsi_mod
Jan 11 18:12:27 lompa kernel: CPU:    0
Jan 11 18:12:27 lompa kernel: EIP:    0060:[<c013420c>]    Not tainted VLI
Jan 11 18:12:27 lompa kernel: EFLAGS: 00010246   (2.6.14-1.1653_FC4smp)
Jan 11 18:12:27 lompa kernel: EIP is at sample_to_timespec+0x31/0x3d
Jan 11 18:12:27 lompa kernel: eax: ffc2f700   ebx: 3b9aca00   ecx: ffffffff  
edx: ffffffff
Jan 11 18:12:27 lompa kernel: esi: dba19fa8   edi: dba19fa8   ebp: dba19000  
esp: dba19f78
Jan 11 18:12:27 lompa kernel: ds: 007b   es: 007b   ss: 0068
Jan 11 18:12:27 lompa kernel: Process memcheck (pid: 27631, threadinfo=dba19000
task=db707030)
Jan 11 18:12:27 lompa kernel: Stack: 00000000 fffca086 dba19fa8 dba19000
c01345f3 dba19fa8 ffc2f700 ffffffff
Jan 11 18:12:27 lompa kernel:        fffca086 00000000 fffca086 c013326a
fffca086 00000000 fffca086 00000000
Jan 11 18:12:27 lompa kernel:        c0103995 fffca086 09e9b36c 00de0ff4
00000000 fffca086 0eecdf18 ffffffda
Jan 11 18:12:27 lompa kernel: Call Trace:
Jan 11 18:12:27 lompa kernel:  [<c01345f3>] posix_cpu_clock_get+0x48/0xf9
Jan 11 18:12:27 lompa kernel:  [<c013326a>] sys_clock_gettime+0x16/0x82
Jan 11 18:12:27 lompa kernel:  [<c0103995>] syscall_call+0x7/0xb
Jan 11 18:12:27 lompa kernel: Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b
b8 fa 09 3d 00 f7 e2 f7 f3 89 56 04 89 06 5b 5e
5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b 5e 5f 5d c3 55 57
56 53 83 ec 1c 89 c5

System is still up and running - no kernel panic after it. It happens in a
context of big multithreaded app that calls pthread_getcpuclockid/clock_gettime
to get per-thread CPU usage statistics and clock_gettime(CLOCK_MONOTONIC, ...)
to get 'uptime'-like value. In the log above you can see 'Process memcheck'
because I tried to run my app under Valgrind. It happens in both cases - under
Valgrind and natively.

But I CAN'T reproduce this bug using test program attached by James.

Kernel version: 2.6.14-1.1653_FC4smp
CPU: Pentium 4, hyper-threading

Comment 4 Dave Jones 2006-02-03 07:01:10 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 5 Tomasz Dziekonski 2006-02-22 09:40:16 UTC

Problem occured on FC4 with kernel 2.6.16.rc3. Pentium 4 HT, context of big
multithreaded application. Happens quite often (more than 100 times in 24h)


 <0>divide error: 0000 [#254]
PREEMPT SMP 
Modules linked in: iptable_nat parport_pc lp parport autofs4 nfsd exportfs lockd
nfs_acl sunrpc xt_limit xt_tcpudp iptable_mangle ipt_MASQUERADE ip_nat ipt_TOS
ipt_REJECT ip_conntrack_irc ip_conntrack_ftp xt_state ip_conntrack nfnetlink
iptable_filter ip_tables x_tables vfat fat dm_mod video button battery ac ipv6
uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core snd_intel8x0 snd_ac97_codec
snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc b44 mii
floppy
CPU:    1
EIP:    0060:[<c012f20f>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-rc3 #2) 
EIP is at sample_to_timespec+0x31/0x3d
eax: ffc2f700   ebx: 3b9aca00   ecx: ffffffff   edx: ffffffff
esi: c8bbffa8   edi: c8bbffa8   ebp: c8bbe000   esp: c8bbff78
ds: 007b   es: 007b   ss: 0068
Process my_app (pid: 5034, threadinfo=c8bbe000 task=f12db030)
Stack: <0>00000000 ffff62ae c8bbffa8 c8bbe000 c0130fa1 c8bbffa8 ffc2f700 ffffffff 
       ffff62ae 00000000 ffff62ae c012e470 ffff62ae 00000000 ffff62ae 00000000 
       c0102b7d ffff62ae 09391e4c 0015bff4 00000000 ffff62ae 09941128 ffffffda 
Call Trace:
 [<c0130fa1>] posix_cpu_clock_get+0x48/0xf9
 [<c012e470>] sys_clock_gettime+0x16/0x84
 [<c0102b7d>] syscall_call+0x7/0xb
Code: 24 14 83 e0 03 83 f8 02 74 18 bb 00 ca 9a 3b b8 fa 09 3d 00 f7 e2 f7 f3 89
56 04 89 06 5b 5e 5f 5d c3 bb 00 ca 9a 3b 89 d0 89 ca <f7> f3 89 56 04 89 06 5b
5e 5f 5d c3 57 56 53 89 c6 8b 40 10 89

Comment 6 Dave Jones 2006-02-27 05:08:05 UTC

As you're obviously not adverse to building your own kernels, could you post
this upstream to linux-kernel.org please ?

Comment 7 Dave Jones 2006-07-29 05:44:35 UTC

I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. 
I'm going to assume this got fixed.  If it didn't, and you can still reproduce
it, see comment #6

Thanks.

Comment 8 Charlie Brady 2011-08-08 18:06:54 UTC

(In reply to comment #7)
> I can't reproduce this on the 2.6.17 based kernels on either i386 or x86-64. 
> I'm going to assume this got fixed.  If it didn't, and you can still reproduce
> it, see comment #6

This looks very similar to a problem discussed upstream in 2.6.21-rc2:

https://lkml.org/lkml/2007/3/2/20

I've just had a customer report similar symptoms with 2.6.18-164.el5 (your employer might be interested in that one).

This looks like the fix:

http://marc.info/?l=linux-kernel&m=117210430112543&w=2

Comment 9 Charlie Brady 2011-08-08 18:31:24 UTC

(In reply to comment #8)

> This looks like the fix:
> 
> http://marc.info/?l=linux-kernel&m=117210430112543&w=2

I've verified that the RHEL5 kernel does not contain this (simple) fix.

Comment 10 Charlie Brady 2011-08-15 21:05:19 UTC

(In reply to comment #9)

> I've verified that the RHEL5 kernel does not contain this (simple) fix.

And I've opened bug 729133 against RHEL5 kernel.