LTC Owner is: ankigarg.com
LTC Originator is: ankigarg.com

Problem description:

Found the following in dmesg when running the breaknfs testcase:

NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
BUG: using smp_processor_id() in preemptible [00000000] code: nfsd/2852
caller is drain_array+0x25/0x132

Call Trace:
 [<ffffffff8026d828>] dump_trace+0xbd/0x3d8
 [<ffffffff8026db87>] show_trace+0x44/0x6d
 [<ffffffff8026ddc8>] dump_stack+0x13/0x15
 [<ffffffff80355e2e>] debug_smp_processor_id+0xe3/0xf1
 [<ffffffff802e01b5>] drain_array+0x25/0x132
 [<ffffffff802e07ac>] __cache_shrink+0xd5/0x1a6
 [<ffffffff802e0a47>] kmem_cache_destroy+0x6c/0xe3
 [<ffffffff8846c809>] :nfsd:nfsd4_free_slab+0x16/0x21
 [<ffffffff8846c824>] :nfsd:nfsd4_free_slabs+0x10/0x36
 [<ffffffff8846daf9>] :nfsd:nfs4_state_shutdown+0x1a2/0x1ae
 [<ffffffff884570f8>] :nfsd:nfsd_last_thread+0x47/0x76
 [<ffffffff8839463a>] :sunrpc:svc_destroy+0x8d/0xd1
 [<ffffffff88394738>] :sunrpc:svc_exit_thread+0xba/0xc6
 [<ffffffff8845795f>] :nfsd:nfsd+0x2a3/0x2b8
 [<ffffffff802600f8>] child_rip+0xa/0x12
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff80355ddb>] .... debug_smp_processor_id+0x90/0xf1
.....[<ffffffff802e01b5>] ..   ( <= drain_array+0x25/0x132)

BUG: nfsd:2852 task might have lost a preemption check!
Call Trace:
 [<ffffffff8026d828>] dump_trace+0xbd/0x3d8
 [<ffffffff8026db87>] show_trace+0x44/0x6d
 [<ffffffff8026ddc8>] dump_stack+0x13/0x15
 [<ffffffff8028be9b>] preempt_enable_no_resched+0x5c/0x5e
 [<ffffffff80355e33>] debug_smp_processor_id+0xe8/0xf1
 [<ffffffff802e01b5>] drain_array+0x25/0x132
 [<ffffffff802e07ac>] __cache_shrink+0xd5/0x1a6
 [<ffffffff802e0a47>] kmem_cache_destroy+0x6c/0xe3
 [<ffffffff8846c809>] :nfsd:nfsd4_free_slab+0x16/0x21
 [<ffffffff8846c824>] :nfsd:nfsd4_free_slabs+0x10/0x36
 [<ffffffff8846daf9>] :nfsd:nfs4_state_shutdown+0x1a2/0x1ae
 [<ffffffff884570f8>] :nfsd:nfsd_last_thread+0x47/0x76
 [<ffffffff8839463a>] :sunrpc:svc_destroy+0x8d/0xd1
 [<ffffffff88394738>] :sunrpc:svc_exit_thread+0xba/0xc6
 [<ffffffff8845795f>] :nfsd:nfsd+0x2a3/0x2b8
 [<ffffffff802600f8>] child_rip+0xa/0x12
---------------------------
| preempt count: 00000000 ]
| 0-level deep critical section nesting:
----------------------------------------
nfsd: last server has exited
nfsd: unexporting all filesystems

Provide output from "uname -a", if possible:
Linux llm38.in.ibm.com 2.6.20-0119.rt8 #1 SMP PREEMPT Thu Feb 15 15:53:15 CET 2007 x86_64 x86_64 x86_64 GNU/Linux

Is this reproducible?
Yes, reproduced by running the breaknfs test. breaknfs was pulled from lkml:
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.0/0378.html

Is the system (not just the application) hung? No.

A similar error was reported earlier on the linux-rt-users mailing list:
http://www.mail-archive.com/linux-rt-users@vger.kernel.org/msg00240.html

As comment #5 makes clear, the problem is not with nfsd. kmem_cache_destroy() invokes drain_array(), which calls smp_processor_id(). With CONFIG_DEBUG_PREEMPT on, the call to smp_processor_id() becomes a call to debug_smp_processor_id(), which prints the BUG report because the caller is running with preempt_count == 0, as indicated by 'preemptible [00000000]'. Note that the RHEL5RT kernel config has the DEBUG_PREEMPT option on by default.
Will need to look at the areas of code affected by this option to see how it might affect timings/latencies. Trying a kernel with the call to smp_processor_id() in drain_array() changed to raw_smp_processor_id(), which bypasses the debug check. This would not be a proper solution, though.
Created attachment 152589 [details] breaknfs testcase
----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-04-14 11:11 EDT -------

I recreated this problem in another, easier way. I just started oprofile and stopped it. I then saw the following message in dmesg. Though nfsd is not involved here, it seems to be the same problem.

BUG: using smp_processor_id() in preemptible [00000000] code: oprofiled/29202
caller is drain_array+0x25/0x132

Call Trace:
 [<ffffffff8026d828>] dump_trace+0xbd/0x3d8
 [<ffffffff8026db87>] show_trace+0x44/0x6d
 [<ffffffff8026ddc8>] dump_stack+0x13/0x15
 [<ffffffff80355e2e>] debug_smp_processor_id+0xe3/0xf1
 [<ffffffff802e01b5>] drain_array+0x25/0x132
 [<ffffffff802e07ac>] __cache_shrink+0xd5/0x1a6
 [<ffffffff802e0a47>] kmem_cache_destroy+0x6c/0xe3
 [<ffffffff80314be9>] dcookie_unregister+0x109/0x121
 [<ffffffff8843af63>] :oprofile:event_buffer_release+0x1f/0x43
 [<ffffffff80212992>] __fput+0xc3/0x185
 [<ffffffff8022e537>] fput+0x14/0x16
 [<ffffffff802252cd>] filp_close+0x63/0x6e
 [<ffffffff80239cac>] put_files_struct+0x69/0xb8
 [<ffffffff802158a8>] do_exit+0x27d/0x8ad
 [<ffffffff8024af12>] cpuset_exit+0x0/0x70
 [<ffffffff8024ea66>] sys_exit_group+0x12/0x14
 [<ffffffff8025f42c>] tracesys+0x151/0x1be
 [<00000034ece948cf>]
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff80355ddb>] .... debug_smp_processor_id+0x90/0xf1
.....[<ffffffff802e01b5>] ..   ( <= drain_array+0x25/0x132)
changed:
           What     |Removed                     |Added
----------------------------------------------------------------------------
          Summary   |RH236430 - nfsd: BUG: using |RH236430 - BUG: using
                    |smp_processor_id() in       |smp_processor_id() in
                    |preemptible code            |preemptible code

------- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-15 04:52 EDT -------

Changing the summary to make it more general.
Created attachment 152770 [details] fix_drain_array.patch
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-17 01:50 EDT -------

fix_drain_array.patch: fixes drain_array() in mm/slab.c.

With this patch, I have not seen the BUG report in 24 hours of the test run. Investigating whether this change would affect the kmem_cache_t control path.
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-17 07:38 EDT ------- This patch is just a workaround and does not fix the underlying problem. So, investigating further.
Created attachment 152870 [details] fix_drain_array.patch
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-18 00:52 EDT -------

drain_array_fix.patch

This patch fixes the BUG in drain_array(). Found that there is no need to call smp_processor_id() from drain_array() at all, since the call to slab_spin_lock_irq() already fills in the value of this_cpu using raw_smp_processor_id().
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-18 09:30 EDT ------- Posted patch to lkml and Ingo
Thanks for investigating and proposing a fix! Assigning to Ingo. We'll discuss it with him on this Friday's internal call.
----- Additional Comments From ankigarg.com (prefers email at ankita.com) 2007-04-19 01:02 EDT ------- Tim, Ingo has accepted the fix!
OK, reassigning this to Clark to throw it into the patch heap.
The patch should be in any kernel-rt above 2.6.20-25. I believe it's also in all of the 2.6.21-rtX series. We'll let this ride for a while, and if no one sees it in a week or so, we'll declare victory and close it.