Bug 206343 - [RHEL5 Beta1] kernel: ps command is killed by NaT consumption fault on IA64.
Summary: [RHEL5 Beta1] kernel: ps command is killed by NaT consumption fault on IA64.
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: ia64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-09-13 19:36 UTC by Kiyoshi Ueda
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-10-03 18:07:57 UTC
Target Upstream Version:
Embargoed:


Description Kiyoshi Ueda 2006-09-13 19:36:36 UTC
Description of problem:
The ps command is occasionally killed by a NaT consumption fault.
Under the stress of the NEC LISA program, which runs ps repeatedly,
the problem occurs often on a large box (e.g. 16 CPUs) and sometimes
(or rarely) on a small box (e.g. 2 or 4 CPUs).

The fault point and the call trace at the time of the fault are
almost always the same.
I have seen a different fault point only once.
(See the 'Additional info' section for the detailed call traces.)


Version-Release number of selected component:
kernel-2.6.17-1.2519.4.21.el5
kernel-2.6.17-1.2519.4.26.el5


How reproducible:
Often on a 16-CPU box; sometimes (or rarely) on a 2- or 4-CPU box.


Steps to Reproduce:
 1. Run the NEC LISA program on a large IA64 box (16 CPUs or more).
      # ./lisa.sh 1h

      (The first argument is the duration of the test.  On a small box,
       '24h' or longer is probably needed.)

 The LISA program does the following:
    o I/O load on all mount points using cp, mv and diff
    o Kernel stress using mmap(), fork() and write()
    o Memory load using the STREAM benchmark
    o Runs the lsof and ps commands
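
For anyone without access to the LISA program, the "run ps repeatedly"
part of the load could be approximated with something like the sketch
below. This is NOT the LISA program and it does not generate the I/O,
memory or fork() load listed above; the function name hammer and its
parameters are made up for illustration. It only reruns a command for a
given time and records runs that were killed by a signal, which is how a
ps killed by the kernel would likely show up to the caller.

```python
import subprocess
import time


def hammer(cmd, duration_s, pause_s=0.0):
    """Run cmd repeatedly for duration_s seconds (hypothetical helper).

    Returns (runs, signals): the number of runs, and the signal numbers
    of the runs that were terminated by a signal.
    """
    runs = 0
    signals = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        runs += 1
        # On POSIX, a negative return code -N means "killed by signal N".
        if proc.returncode < 0:
            signals.append(-proc.returncode)
        if pause_s:
            time.sleep(pause_s)
    return runs, signals
```

For example, hammer(["ps", "-ef"], duration_s=3600.0) would roughly
correspond to the ps part of a '1h' run. Without the other load running
in parallel it may well not trigger the fault at all.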


Actual results:
A NaT consumption fault occurs in the ps command, and ps is killed.
(It may take half an hour or more to hit the problem.)
The call trace is attached in the 'Additional info' section.


Expected results:
A NaT consumption fault should not occur.


Additional info:
This problem does not occur on the rawhide kernel (2.6.17-1.2630.fc6)
or on the upstream kernel (2.6.18-rc6).

A sample fault message from 2.6.17-1.2519.4.26.el5 is below.
The fault message is almost always this one.
--------------------------------------------------------------------
[root@nec-tx7-2 ~]# uname -r
2.6.17-1.2519.4.26.el5
[root@nec-tx7-2 ~]# ls /sys/devices/system/cpu
cpu0  cpu10  cpu12  cpu14  cpu2  cpu4  cpu6  cpu8
cpu1  cpu11  cpu13  cpu15  cpu3  cpu5  cpu7  cpu9
[root@nec-tx7-2 ~]# free
             total       used       free     shared    buffers     cached
Mem:      66581888    2183872   64398016          0     159200     963344
-/+ buffers/cache:    1061328   65520560
Swap:      2040208          0    2040208
[root@nec-tx7-2 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             14878952   9137256   4973696  65% /
/dev/sda1               511728     13352    498376   3% /boot/efi
tmpfs                 33290944         0  33290944   0% /dev/shm
[root@nec-tx7-2 ~]# cd lisa
[root@nec-tx7-2 lisa]# ./lisa.sh 1h
ps[4668]: NaT consumption 2216203124768 [1]
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm
l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc iscsi_tcp libiscsi
scsi_transport_iscsi ipv6 vfat fat dm_mirror dm_multipath dm_mod button
parport_pc lp parport sg uhci_hcd tg3 ide_cd cdrom serio_raw ext3 jbd aic7xxx
scsi_transport_spi qla1280 qla2xxx scsi_transport_fc sd_mod scsi_mod

Pid: 4668, CPU 5, comm:                   ps
psr : 0000121008526030 ifs : 8000000000000286 ip  : [<a0000001000ea6e1>]    Not
tainted
ip is at __delayacct_blkio_ticks+0x41/0xc0
unat: 0000000000000000 pfs : 0000000000000286 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000565559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001000ea6c0 b6  : a0000001001e0440 b7  : a0000001001da780
f6  : 1003e000000000000001f f7  : 1003e431bde82d7b634db
f8  : 1003effffffffffffffbf f9  : 1003efffffffffffffff5
f10 : 1003e0000000000000016 f11 : 1003e8208208208208209
r1  : a000000100ba13c0 r2  : 0000000000000001 r3  : 0000000000000005
r8  : e00000101e41a810 r9  : ffffffffffffffff r10 : 0000000000000048
r11 : ffffffffdead4ead r12 : e0000007f9c3fce0 r13 : e0000007f9c38000
r14 : 431bde82d7b634db r15 : 0000000000000040 r16 : 00000000ffffffff
r17 : 00000000dead4ead r18 : e00000101e41a80c r19 : e0000004159a1028
r20 : e0000004159a0018 r21 : e0000004159a0184 r22 : e0000004159a0260
r23 : e0000004159a1008 r24 : 0000000000000016 r25 : e0000004159a1090
r26 : e0000007f9c3fb48 r27 : e0000004159a1008 r28 : e00000101e41a810
r29 : 0000000000000005 r30 : e00000101e41a818 r31 : e0000007f9c39044

Call Trace:
 [<a000000100013e60>] show_stack+0x40/0xa0
                                sp=e0000007f9c3f700 bsp=e0000007f9c39510
 [<a000000100014760>] show_regs+0x840/0x880
                                sp=e0000007f9c3f8d0 bsp=e0000007f9c394b8
 [<a000000100037b60>] die+0x1c0/0x2a0
                                sp=e0000007f9c3f8d0 bsp=e0000007f9c39470
 [<a000000100037c90>] die_if_kernel+0x50/0x80
                                sp=e0000007f9c3f8f0 bsp=e0000007f9c39440
 [<a00000010061dd90>] ia64_fault+0x10f0/0x1200
                                sp=e0000007f9c3f8f0 bsp=e0000007f9c393e0
 [<a00000010000c700>] __ia64_leave_kernel+0x0/0x280
                                sp=e0000007f9c3fb10 bsp=e0000007f9c393e0
 [<a0000001000ea6e0>] __delayacct_blkio_ticks+0x40/0xc0
                                sp=e0000007f9c3fce0 bsp=e0000007f9c393b0
 [<a0000001001e0160>] do_task_stat+0x740/0xa20
                                sp=e0000007f9c3fce0 bsp=e0000007f9c39208
 [<a0000001001e0470>] proc_tgid_stat+0x30/0x60
                                sp=e0000007f9c3fe20 bsp=e0000007f9c391d8
 [<a0000001001da840>] proc_info_read+0xc0/0x1a0
                                sp=e0000007f9c3fe20 bsp=e0000007f9c39190
 [<a000000100156d60>] vfs_read+0x200/0x3a0
                                sp=e0000007f9c3fe20 bsp=e0000007f9c39140
 [<a000000100157430>] sys_read+0x70/0xe0
                                sp=e0000007f9c3fe20 bsp=e0000007f9c390c8
 [<a00000010000c490>] __ia64_trace_syscall+0xd0/0x110
                                sp=e0000007f9c3fe30 bsp=e0000007f9c390c8
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e0000007f9c40000 bsp=e0000007f9c390c8
--------------------------------------------------------------------
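
When many such messages have to be compared, it may help to reduce each
one to a few fields. The sketch below does that for an excerpt of the
message above. The regular expressions are only a guess at the line
layout based on this one sample, and parse_fault is a hypothetical name.
The decimal number printed after "NaT consumption" is converted to hex
purely to make it easier to compare between reports; its meaning is not
decoded here.

```python
import re

# Excerpt copied from the fault message above.
SAMPLE = """\
ps[4668]: NaT consumption 2216203124768 [1]
Pid: 4668, CPU 5, comm:                   ps
ip is at __delayacct_blkio_ticks+0x41/0xc0
 [<a0000001000ea6e0>] __delayacct_blkio_ticks+0x40/0xc0
                                sp=e0000007f9c3fce0 bsp=e0000007f9c393b0
 [<a0000001001e0160>] do_task_stat+0x740/0xa20
                                sp=e0000007f9c3fce0 bsp=e0000007f9c39208
 [<a0000001001e0470>] proc_tgid_stat+0x30/0x60
                                sp=e0000007f9c3fe20 bsp=e0000007f9c391d8
"""

# "comm[pid]: reason code [counter]"
HEADER_RE = re.compile(r"^(\S+)\[(\d+)\]: (.+?) (\d+) \[(\d+)\]$", re.M)
# "ip is at symbol+0xoff/0xsize"
IP_RE = re.compile(r"^ip is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", re.M)
# " [<address>] symbol+0xoff/0xsize"
FRAME_RE = re.compile(
    r"^[ \t]*\[<([0-9a-f]+)>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", re.M)


def parse_fault(text):
    """Extract the header, faulting symbol and call chain from one message."""
    header = HEADER_RE.search(text)
    ip = IP_RE.search(text)
    frames = [(m.group(2), int(m.group(3), 16))
              for m in FRAME_RE.finditer(text)]
    return {
        "comm": header.group(1),
        "pid": int(header.group(2)),
        "reason": header.group(3),
        "code_hex": hex(int(header.group(4))),
        "ip_symbol": ip.group(1),
        "ip_offset": int(ip.group(2), 16),
        "frames": frames,  # innermost first, as printed
    }


info = parse_fault(SAMPLE)
print(info["reason"], info["code_hex"], "at", info["ip_symbol"])
```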

Another sample fault message is below.
This one has happened only once.
--------------------------------------------------------------------
ps[21139]: NaT consumption 17179869216 [560]
Modules linked in: qla2xxx nfs fscache nfsd exportfs lockd nfs_acl autofs4 hidp
rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc iscsi_tcp libiscsi
scsi_transport_iscsi ipv6 vfat fat dm_mirror dm_round_robin dm_multipath dm_mod
button parport_pc lp parport sg tg3 e1000 ide_cd cdrom uhci_hcd serio_raw ext3
jbd mptspi scsi_transport_spi mptscsih mptbase scsi_transport_fc qla1280 sd_mod
scsi_mod

Pid: 21139, CPU 7, comm:                   ps
psr : 0000101008526030 ifs : 800000000000038d ip  : [<a000000100297470>]    Not
tainted
ip is at _raw_spin_lock+0x10/0x260
unat: 0000000000000000 pfs : 0000000000000205 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000565559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001006029a0 b6  : a0000001001cb8e0 b7  : a0000001001c5c20
f6  : 1003ee401aca1dc557292 f7  : 1003e9e3779b97f4a7c16
f8  : 1003e0a0000001005fe7b f9  : 1003effffffffffffffdb
f10 : 1003e000000000000004a f11 : 1003e8208208208208209
r1  : a000000100b7ff30 r2  : 9fffffffffffffff r3  : e000000a3d630080
r8  : e000000a3d630230 r9  : e000000a3d631024 r10 : e000000a3d63016c
r11 : e00000065d557e18 r12 : e00000065d557ce0 r13 : e00000065d550000
r14 : 000000000000000c r15 : 0000000000000038 r16 : e000000a3d630e50
r17 : e000000a3d630e28 r18 : 0000000000000004 r19 : e000000a3d631008
r20 : e000000a3d630018 r21 : e000000a3d630184 r22 : e000000a3d630260
r23 : e000000a3d630fa8 r24 : 0000000000000015 r25 : 000000000000000a
r26 : 0000000000000005 r27 : a000000100952498 r28 : 0000000000000000
r29 : a00000010003c5a0 r30 : ffffffffff9b7d20 r31 : a000000100684880

Call Trace:
 [<a000000100013de0>] show_stack+0x40/0xa0
                                sp=e00000065d557700 bsp=e00000065d551570
 [<a0000001000146e0>] show_regs+0x840/0x880
                                sp=e00000065d5578d0 bsp=e00000065d551518
 [<a000000100033760>] die+0x1c0/0x2a0
                                sp=e00000065d5578d0 bsp=e00000065d5514d0
 [<a000000100033890>] die_if_kernel+0x50/0x80
                                sp=e00000065d5578f0 bsp=e00000065d5514a0
 [<a0000001006040b0>] ia64_fault+0x10f0/0x1200
                                sp=e00000065d5578f0 bsp=e00000065d551448
 [<a00000010000c700>] __ia64_leave_kernel+0x0/0x280
                                sp=e00000065d557b10 bsp=e00000065d551448
 [<a000000100297470>] _raw_spin_lock+0x10/0x260
                                sp=e00000065d557ce0 bsp=e00000065d5513d8
 [<a0000001006029a0>] _spin_lock+0x20/0x40
                                sp=e00000065d557ce0 bsp=e00000065d5513b8
 [<a0000001000d9640>] __delayacct_blkio_ticks+0x20/0xc0
                                sp=e00000065d557ce0 bsp=e00000065d551390
 [<a0000001001cb600>] do_task_stat+0x740/0xa20
                                sp=e00000065d557ce0 bsp=e00000065d5511e0
 [<a0000001001cb910>] proc_tgid_stat+0x30/0x60
                                sp=e00000065d557e20 bsp=e00000065d5511b8
 [<a0000001001c5ce0>] proc_info_read+0xc0/0x1a0
                                sp=e00000065d557e20 bsp=e00000065d551170
 [<a000000100142000>] vfs_read+0x200/0x3a0
                                sp=e00000065d557e20 bsp=e00000065d551120
 [<a0000001001426d0>] sys_read+0x70/0xe0
                                sp=e00000065d557e20 bsp=e00000065d5510a8
 [<a00000010000c490>] __ia64_trace_syscall+0xd0/0x110
                                sp=e00000065d557e30 bsp=e00000065d5510a8
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e00000065d558000 bsp=e00000065d5510a8
--------------------------------------------------------------------
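
The two call traces differ only in their innermost frames. The sketch
below makes that explicit by comparing the function names from both
traces, copied by hand from the messages above with the fault-handling
frames (show_stack up to __ia64_leave_kernel) and the syscall entry
frames left out. The helper name common_callers is hypothetical.

```python
# Function names from the usual fault, innermost first.
TRACE_USUAL = [
    "__delayacct_blkio_ticks",
    "do_task_stat",
    "proc_tgid_stat",
    "proc_info_read",
    "vfs_read",
    "sys_read",
]

# Function names from the fault that was seen only once, innermost first.
TRACE_RARE = [
    "_raw_spin_lock",
    "_spin_lock",
    "__delayacct_blkio_ticks",
    "do_task_stat",
    "proc_tgid_stat",
    "proc_info_read",
    "vfs_read",
    "sys_read",
]


def common_callers(a, b):
    """Return the longest shared tail of two innermost-first call chains."""
    shared = []
    for fa, fb in zip(reversed(a), reversed(b)):
        if fa != fb:
            break
        shared.append(fa)
    shared.reverse()
    return shared


shared = common_callers(TRACE_USUAL, TRACE_RARE)
print("shared callers:", " <- ".join(shared))
print("extra frames in rare trace:", TRACE_RARE[:len(TRACE_RARE) - len(shared)])
```

In both cases ps is in sys_read -> ... -> do_task_stat ->
__delayacct_blkio_ticks; in the rare case the fault is hit slightly
earlier, in the spin lock taken inside __delayacct_blkio_ticks.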

Comment 2 Kiyoshi Ueda 2006-10-03 18:07:57 UTC
I confirmed that this bug is fixed in kernel-2.6.18-1.2702.el5,
so I am closing this bugzilla.

