Description of problem:
The ps command is occasionally killed by a NaT consumption fault. Under stress from the NEC LISA program, which runs ps repeatedly, the problem occurs often on a large box (e.g. 16 CPUs) and sometimes, or rarely, on a small box (2 or 4 CPUs). The fault point and the call trace are almost always the same; I have seen a different fault point only once. (See the 'Additional info' section for the detailed call traces.)

Version-Release number of selected component:
kernel-2.6.17-1.2519.4.21.el5
kernel-2.6.17-1.2519.4.26.el5

How reproducible:
Often on a 16 CPU box; sometimes (or rarely) on a 2 or 4 CPU box.

Steps to Reproduce:
1. Run the NEC LISA program on a large IA64 box (16 CPUs or more).
   # ./lisa.sh 1h
   (The first argument is the duration of the test. On a small box, '24h' or longer is likely needed.)

   The LISA program does the following:
   o I/O load on all mount points using cp, mv and diff
   o Kernel stress using mmap(), fork() and write()
   o Memory load using the stream benchmark
   o lsof and ps commands

Actual results:
A NaT consumption fault occurs in the ps command, and ps is killed. (It may take half an hour or more to hit the problem.) The call trace is attached in the 'Additional info' section.

Expected results:
The NaT consumption fault should not occur.

Additional info:
This problem does not occur on the rawhide kernel (2.6.17-1.2630.fc6) or the upstream kernel (2.6.18-rc6).

A sample fault message from 2.6.17-1.2519.4.26.el5 is below. The fault message is almost always this one.
--------------------------------------------------------------------
[root@nec-tx7-2 ~]# uname -r
2.6.17-1.2519.4.26.el5
[root@nec-tx7-2 ~]# ls /sys/devices/system/cpu
cpu0  cpu10  cpu12  cpu14  cpu2  cpu4  cpu6  cpu8
cpu1  cpu11  cpu13  cpu15  cpu3  cpu5  cpu7  cpu9
[root@nec-tx7-2 ~]# free
             total       used       free     shared    buffers     cached
Mem:      66581888    2183872   64398016          0     159200     963344
-/+ buffers/cache:    1061328   65520560
Swap:      2040208          0    2040208
[root@nec-tx7-2 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             14878952   9137256   4973696  65% /
/dev/sda1               511728     13352    498376   3% /boot/efi
tmpfs                 33290944         0  33290944   0% /dev/shm
[root@nec-tx7-2 ~]# cd lisa
[root@nec-tx7-2 lisa]# ./lisa.sh 1h
ps[4668]: NaT consumption 2216203124768 [1]
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc iscsi_tcp libiscsi scsi_transport_iscsi ipv6 vfat fat dm_mirror dm_multipath dm_mod button parport_pc lp parport sg uhci_hcd tg3 ide_cd cdrom serio_raw ext3 jbd aic7xxx scsi_transport_spi qla1280 qla2xxx scsi_transport_fc sd_mod scsi_mod

Pid: 4668, CPU 5, comm: ps
psr : 0000121008526030 ifs : 8000000000000286 ip : [<a0000001000ea6e1>] Not tainted
ip is at __delayacct_blkio_ticks+0x41/0xc0
unat: 0000000000000000 pfs : 0000000000000286 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000565559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001000ea6c0 b6 : a0000001001e0440 b7 : a0000001001da780
f6 : 1003e000000000000001f f7 : 1003e431bde82d7b634db
f8 : 1003effffffffffffffbf f9 : 1003efffffffffffffff5
f10 : 1003e0000000000000016 f11 : 1003e8208208208208209
r1 : a000000100ba13c0 r2 : 0000000000000001 r3 : 0000000000000005
r8 : e00000101e41a810 r9 : ffffffffffffffff r10 : 0000000000000048
r11 : ffffffffdead4ead r12 : e0000007f9c3fce0 r13 : e0000007f9c38000
r14 : 431bde82d7b634db r15 : 0000000000000040 r16 : 00000000ffffffff
r17 : 00000000dead4ead r18 : e00000101e41a80c r19 : e0000004159a1028
r20 : e0000004159a0018 r21 : e0000004159a0184 r22 : e0000004159a0260
r23 : e0000004159a1008 r24 : 0000000000000016 r25 : e0000004159a1090
r26 : e0000007f9c3fb48 r27 : e0000004159a1008 r28 : e00000101e41a810
r29 : 0000000000000005 r30 : e00000101e41a818 r31 : e0000007f9c39044

Call Trace:
 [<a000000100013e60>] show_stack+0x40/0xa0 sp=e0000007f9c3f700 bsp=e0000007f9c39510
 [<a000000100014760>] show_regs+0x840/0x880 sp=e0000007f9c3f8d0 bsp=e0000007f9c394b8
 [<a000000100037b60>] die+0x1c0/0x2a0 sp=e0000007f9c3f8d0 bsp=e0000007f9c39470
 [<a000000100037c90>] die_if_kernel+0x50/0x80 sp=e0000007f9c3f8f0 bsp=e0000007f9c39440
 [<a00000010061dd90>] ia64_fault+0x10f0/0x1200 sp=e0000007f9c3f8f0 bsp=e0000007f9c393e0
 [<a00000010000c700>] __ia64_leave_kernel+0x0/0x280 sp=e0000007f9c3fb10 bsp=e0000007f9c393e0
 [<a0000001000ea6e0>] __delayacct_blkio_ticks+0x40/0xc0 sp=e0000007f9c3fce0 bsp=e0000007f9c393b0
 [<a0000001001e0160>] do_task_stat+0x740/0xa20 sp=e0000007f9c3fce0 bsp=e0000007f9c39208
 [<a0000001001e0470>] proc_tgid_stat+0x30/0x60 sp=e0000007f9c3fe20 bsp=e0000007f9c391d8
 [<a0000001001da840>] proc_info_read+0xc0/0x1a0 sp=e0000007f9c3fe20 bsp=e0000007f9c39190
 [<a000000100156d60>] vfs_read+0x200/0x3a0 sp=e0000007f9c3fe20 bsp=e0000007f9c39140
 [<a000000100157430>] sys_read+0x70/0xe0 sp=e0000007f9c3fe20 bsp=e0000007f9c390c8
 [<a00000010000c490>] __ia64_trace_syscall+0xd0/0x110 sp=e0000007f9c3fe30 bsp=e0000007f9c390c8
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e0000007f9c40000 bsp=e0000007f9c390c8
--------------------------------------------------------------------

Another sample fault message is below. This one happened only once.

--------------------------------------------------------------------
ps[21139]: NaT consumption 17179869216 [560]
Modules linked in: qla2xxx nfs fscache nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc iscsi_tcp libiscsi scsi_transport_iscsi ipv6 vfat fat dm_mirror dm_round_robin dm_multipath dm_mod button parport_pc lp parport sg tg3 e1000 ide_cd cdrom uhci_hcd serio_raw ext3 jbd mptspi scsi_transport_spi mptscsih mptbase scsi_transport_fc qla1280 sd_mod scsi_mod

Pid: 21139, CPU 7, comm: ps
psr : 0000101008526030 ifs : 800000000000038d ip : [<a000000100297470>] Not tainted
ip is at _raw_spin_lock+0x10/0x260
unat: 0000000000000000 pfs : 0000000000000205 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000565559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001006029a0 b6 : a0000001001cb8e0 b7 : a0000001001c5c20
f6 : 1003ee401aca1dc557292 f7 : 1003e9e3779b97f4a7c16
f8 : 1003e0a0000001005fe7b f9 : 1003effffffffffffffdb
f10 : 1003e000000000000004a f11 : 1003e8208208208208209
r1 : a000000100b7ff30 r2 : 9fffffffffffffff r3 : e000000a3d630080
r8 : e000000a3d630230 r9 : e000000a3d631024 r10 : e000000a3d63016c
r11 : e00000065d557e18 r12 : e00000065d557ce0 r13 : e00000065d550000
r14 : 000000000000000c r15 : 0000000000000038 r16 : e000000a3d630e50
r17 : e000000a3d630e28 r18 : 0000000000000004 r19 : e000000a3d631008
r20 : e000000a3d630018 r21 : e000000a3d630184 r22 : e000000a3d630260
r23 : e000000a3d630fa8 r24 : 0000000000000015 r25 : 000000000000000a
r26 : 0000000000000005 r27 : a000000100952498 r28 : 0000000000000000
r29 : a00000010003c5a0 r30 : ffffffffff9b7d20 r31 : a000000100684880

Call Trace:
 [<a000000100013de0>] show_stack+0x40/0xa0 sp=e00000065d557700 bsp=e00000065d551570
 [<a0000001000146e0>] show_regs+0x840/0x880 sp=e00000065d5578d0 bsp=e00000065d551518
 [<a000000100033760>] die+0x1c0/0x2a0 sp=e00000065d5578d0 bsp=e00000065d5514d0
 [<a000000100033890>] die_if_kernel+0x50/0x80 sp=e00000065d5578f0 bsp=e00000065d5514a0
 [<a0000001006040b0>] ia64_fault+0x10f0/0x1200 sp=e00000065d5578f0 bsp=e00000065d551448
 [<a00000010000c700>] __ia64_leave_kernel+0x0/0x280 sp=e00000065d557b10 bsp=e00000065d551448
 [<a000000100297470>] _raw_spin_lock+0x10/0x260 sp=e00000065d557ce0 bsp=e00000065d5513d8
 [<a0000001006029a0>] _spin_lock+0x20/0x40 sp=e00000065d557ce0 bsp=e00000065d5513b8
 [<a0000001000d9640>] __delayacct_blkio_ticks+0x20/0xc0 sp=e00000065d557ce0 bsp=e00000065d551390
 [<a0000001001cb600>] do_task_stat+0x740/0xa20 sp=e00000065d557ce0 bsp=e00000065d5511e0
 [<a0000001001cb910>] proc_tgid_stat+0x30/0x60 sp=e00000065d557e20 bsp=e00000065d5511b8
 [<a0000001001c5ce0>] proc_info_read+0xc0/0x1a0 sp=e00000065d557e20 bsp=e00000065d551170
 [<a000000100142000>] vfs_read+0x200/0x3a0 sp=e00000065d557e20 bsp=e00000065d551120
 [<a0000001001426d0>] sys_read+0x70/0xe0 sp=e00000065d557e20 bsp=e00000065d5510a8
 [<a00000010000c490>] __ia64_trace_syscall+0xd0/0x110 sp=e00000065d557e30 bsp=e00000065d5510a8
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e00000065d558000 bsp=e00000065d5510a8
--------------------------------------------------------------------
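
Both traces reach __delayacct_blkio_ticks through sys_read -> vfs_read -> proc_info_read -> proc_tgid_stat -> do_task_stat, which is the path taken when a process reads /proc/<pid>/stat. The following is a minimal sketch of a stand-in for the ps part of the load that exercises only that read path. It is not part of the LISA program; the names parse_blkio_ticks, scan_once and stress are hypothetical. It assumes the stat layout documented in proc(5), where field 42 is delayacct_blkio_ticks; whether this el5 kernel exposes the field at the same position is an assumption. Whether this loop alone, without the I/O and memory load, is enough to hit the fault is not established here.

```python
import os
import time


def parse_blkio_ticks(stat_line):
    """Return field 42 (delayacct_blkio_ticks) of a /proc/<pid>/stat line.

    Returns None if the line has fewer than 42 fields. The comm field
    (field 2) is parenthesised and may contain spaces, so the line is
    split after the last ')'.
    """
    rest = stat_line.rsplit(")", 1)[-1].split()
    # rest[0] is field 3 (state), so field 42 is rest[39].
    return int(rest[39]) if len(rest) > 39 else None


def scan_once(proc_root="/proc"):
    """Read every task's stat file once and return {pid: blkio_ticks}."""
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        path = os.path.join(proc_root, name, "stat")
        try:
            with open(path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the task exited between listdir() and open()/read()
        result[int(name)] = parse_blkio_ticks(line)
    return result


def stress(seconds, proc_root="/proc"):
    """Call scan_once() repeatedly for the given duration; return pass count."""
    deadline = time.time() + seconds
    passes = 0
    while time.time() < deadline:
        scan_once(proc_root)
        passes += 1
    return passes
```

For example, stress(3600) could be run from several processes in parallel next to the cp/mv/diff and stream load, so that many CPUs read stat files of tasks that are being created and destroyed at the same time.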
I confirmed that this bug is fixed in kernel-2.6.18-1.2702.el5, so I am closing this bugzilla.