Bug 541080 - MRG kernel crashes intermittently when we run the "cset set" command.
Summary: MRG kernel crashes intermittently when we run the "cset set" command.
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: 1.2.4
Target Release: ---
Assignee: Steven Rostedt
QA Contact: David Sommerseth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-11-24 22:37 UTC by tushar
Modified: 2018-10-27 12:53 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-09 22:05:55 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreport for RT system with cgroup oops (1.80 MB, application/octet-stream)
2009-12-03 21:46 UTC, Monit Kapoor


Links
Red Hat Product Errata RHSA-2010:0041 (normal, SHIPPED_LIVE): Important: kernel-rt security and bug fix update (last updated 2010-01-21 14:10:26 UTC)

Description tushar 2009-11-24 22:37:43 UTC
Description of problem:
We are running the MRG kernel on Dell PowerEdge R610 (Nehalem) servers.


When we run "cset set -l", the server panics and hangs. We cannot do anything except power-cycle the server.
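
Under the hood, "cset set -l" enumerates the configured cpusets and their task counts, which apparently involves opening the per-cpuset "tasks" control files; that matches the cgroup_file_open/cgroup_tasks_open frames in the oops below. A rough manual equivalent (the /dev/cpuset mount point is an assumption; cset may use a different one) would be:

  mount -t cgroup -o cpuset none /dev/cpuset   # only if not already mounted
  cat /dev/cpuset/tasks                        # opening "tasks" is the path through cgroup_tasks_open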

Version-Release number of selected component (if applicable):

Original MRG kernel is:
kernel-rt-2.6.24.7-108.el5rt

We even upgraded to:
kernel-rt-2.6.24.7-137.el5rt
cpuset-1.5.1-1.1

How reproducible:
Out of 4 reboots, it crashes at least once.

Steps to Reproduce:
1. Boot the server with MRG Kernel
2. cset set -l OR cset set
  
Actual results:
[<ffffffff8106d725>] cgroup_iter_next+0x11/0x39 
PGD 63c55a067 PUD 62f8a7067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 0 
Modules linked in: ipv6 nfs lockd nfs_acl sunrpc dm_mirror dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport joydd
Pid: 7223, comm: cset.base Not tainted 2.6.24.7-137.el5rt #1 
RIP: 0010:[<ffffffff8106d725>]  [<ffffffff8106d725>] cgroup_iter_next+0x11/0x39 
RSP: 0018:ffff8106395a7d70  EFLAGS: 00010286 
RAX: 0000000000100100 RBX: 0000000000000000 RCX: ffff81033c1dff00 
RDX: 0000000000100100 RSI: ffff8106395a7da8 RDI: ffff81033a890028 
RBP: ffff8106395a7d78 R08: 0000000000000000 R09: 0000000000000000 
R10: 0000000000000003 R11: ffff8106395a7d58 R12: ffff8106395a7da8 
R13: ffff81063a112720 R14: 0000000000000000 R15: 0000000000000153 
FS:  00007f538f4676e0(0000) GS:ffffffff813f5100(0000) knlGS:0000000000000000 
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
CR2: 0000000000100100 CR3: 000000063cc37000 CR4: 00000000000006e0 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process cset.base (pid: 7223, threadinfo ffff8106395a6000, task ffff81062cc64280) 
Stack:  ffff81062f83c800 ffff8106395a7df8 ffffffff8106fc3b ffff8106395a7d98 
 ffff81062f9337c0 00000001395a7dd8 ffff81033a890028 ffff81033c1dff00 
 0000000000100100 0000000000000010 ffff81033a48f160 0000000000000000 
Call Trace: 
 [<ffffffff8106fc3b>] cgroup_tasks_open+0xe9/0x1a8 
 [<ffffffff8106f8e1>] ? cgroup_file_open+0x0/0x49 
 [<ffffffff8106f921>] cgroup_file_open+0x40/0x49 
 [<ffffffff810af1c5>] __dentry_open+0x139/0x212 
 [<ffffffff810af336>] nameidata_to_filp+0x2d/0x3f 
 [<ffffffff810af37e>] do_filp_open+0x36/0x46 
 [<ffffffff810abc28>] ? kmem_cache_alloc+0xbb/0xe9 
 [<ffffffff810af071>] ? get_unused_fd_flags+0x113/0x121 
 [<ffffffff810af3df>] do_sys_open+0x51/0xd2 
 [<ffffffff810af489>] sys_open+0x1b/0x1d 
 [<ffffffff8100c23e>] system_call_ret+0x0/0x5 
 
 
Code: 8b 51 20 48 8d 42 18 48 39 42 18 74 dd 48 89 0e 48 8b 42 18 48 89 46 08 c9 c3 55 48 8b 46 08 48 89 e5 53 31 db 48 83 3e 00 74 22 <48> 8b 10 4 
RIP  [<ffffffff8106d725>] cgroup_iter_next+0x11/0x39 
 RSP <ffff8106395a7d70> 
CR2: 0000000000100100 
Kernel panic - not syncing: Fatal exception 
Pid: 7223, comm: cset.base Tainted: G      D  2.6.24.7-137.el5rt #1 
 
Call Trace: 
 [<ffffffff8103dcec>] panic+0xaf/0x160 
 [<ffffffff8100c886>] ? retint_kernel+0x26/0x30 
 [<ffffffff8128aa24>] ? oops_end+0x3d/0x5d 
 [<ffffffff8128aa3b>] oops_end+0x54/0x5d 
 [<ffffffff8128c574>] do_page_fault+0x67e/0x76d 
 [<ffffffff8105f3dd>] ? try_to_take_rw_read+0x4ae/0x5a8 
 [<ffffffff81060486>] ? rt_read_slowlock+0x7c/0x302 
 [<ffffffff81060486>] ? rt_read_slowlock+0x7c/0x302 
 [<ffffffff8128a6c9>] error_exit+0x0/0x51 
 [<ffffffff8106d725>] ? cgroup_iter_next+0x11/0x39 
 [<ffffffff8106fc3b>] ? cgroup_tasks_open+0xe9/0x1a8 
 [<ffffffff8106f8e1>] ? cgroup_file_open+0x0/0x49 
 [<ffffffff8106f921>] ? cgroup_file_open+0x40/0x49 
 [<ffffffff810af1c5>] ? __dentry_open+0x139/0x212 
 [<ffffffff810af336>] ? nameidata_to_filp+0x2d/0x3f 
 [<ffffffff810af37e>] ? do_filp_open+0x36/0x46 
 [<ffffffff810abc28>] ? kmem_cache_alloc+0xbb/0xe9 
 [<ffffffff810af071>] ? get_unused_fd_flags+0x113/0x121 
 [<ffffffff810af3df>] ? do_sys_open+0x51/0xd2 
 [<ffffffff810af489>] ? sys_open+0x1b/0x1d 
 [<ffffffff8100c23e>] ? system_call_ret+0x0/0x5

Expected results:


Additional info:

Comment 2 Steven Rostedt 2009-12-03 17:37:13 UTC
I tried this on a Dell R610 with 16 CPUs (2 x 4-core, hyperthreaded).

I downloaded libbitmask and libcpuset from:

ftp://oss.sgi.com/projects/cpusets/download/libbitmask-2.0.tar.bz2
ftp://oss.sgi.com/projects/cpusets/download/libcpuset-1.0.tar.bz2

I built and installed them, then installed cpuset 1.5.2.
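
For reference, a minimal sketch of that build and install sequence (the configure options and the cpuset tarball name are assumptions, not details from this comment):

  tar xjf libbitmask-2.0.tar.bz2 && (cd libbitmask-2.0 && ./configure && make && make install)
  tar xjf libcpuset-1.0.tar.bz2 && (cd libcpuset-1.0 && ./configure && make && make install)
  # cset is a Python tool and can be run in place from the unpacked source
  # directory, as in the session below
  tar xjf cpuset-1.5.2.tar.bz2 && cd cpuset-1.5.2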

I then ran:

[root@dell-r610-1 cpuset-1.5.2]# ./cset set
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-15 y     0-1 y   458    0 /

[root@dell-r610-1 cpuset-1.5.2]# ./cset set -l
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-15 y     0-1 y   458    0 /


I did both of the above several times.

Then just to play, I did:

[root@dell-r610-1 cpuset-1.5.2]# ./cset set -c 8-15 test
cset: --> created cpuset "test"
[root@dell-r610-1 cpuset-1.5.2]# ./cset set -l
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-15 y     0-1 y   458    1 /
         test       8-15 n       0 n     0    0 /test
[root@dell-r610-1 cpuset-1.5.2]# ./cset set -m 1 test
cset: --> modified cpuset "test"
[root@dell-r610-1 cpuset-1.5.2]# ./cset set -l
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-15 y     0-1 y   458    1 /
         test       8-15 n       1 n     0    0 /test

And everything worked fine.  I could not reproduce the bug.

Perhaps there is some other configuration I need to perform. Can you please
run sosreport and attach the resulting file?
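
(A sosreport run is simply, as root:

  sosreport    # the generated tarball typically lands under /tmp; attach it to this bug
)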

Comment 3 Guy Streeter 2009-12-03 18:44:21 UTC
I have provided the sosreport that was attached to the IT.

Comment 4 Monit Kapoor 2009-12-03 21:46:25 UTC
Created attachment 375927 [details]
sosreport for RT system with cgroup oops

Comment 12 errata-xmlrpc 2010-01-21 14:11:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0041.html

Comment 16 Issue Tracker 2010-03-15 15:13:51 UTC
Event posted on 03-12-2010 11:04am CST by jbrier

Customer doesn't have a vmcore. Has Engineering made any progress on
this?


===
We are using the "cset set -l" command - this crashes/panics the system. I
don't have a core at this time. Meanwhile, can you please investigate why the
patch that you recommended is not working?
===

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter 
 issue 371974
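
For reference, getting a vmcore from a crash like this on a RHEL 5 based MRG host generally means enabling kdump ahead of time. A rough sketch (the crashkernel size and dump path are assumptions, not values from this report):

  # reserve memory for the crash kernel: append to the kernel line in /boot/grub/grub.conf
  #   crashkernel=128M@16M
  # review /etc/kdump.conf (dumps land under /var/crash by default), then:
  chkconfig kdump on
  service kdump start
  # after the next panic, collect the vmcore from /var/crash/ and attach it here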

Comment 20 Clark Williams 2012-01-09 22:05:55 UTC
I haven't been able to reproduce this on our 2.6.33-based kernel.

