Bug 818154 - Kernel oops related to cgroup freezer / condor
Summary: Kernel oops related to cgroup freezer / condor
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-05-02 10:51 UTC by Bert DeKnuydt
Modified: 2012-11-13 15:06 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-13 15:06:19 UTC
Type: Bug
Embargoed:



Description Bert DeKnuydt 2012-05-02 10:51:40 UTC
Description of problem:

The kernel hits a BUG at kernel/cgroup_freezer.c:243; the full trace is under "Actual results" below.


Version-Release number of selected component (if applicable):

kernel-3.3.2-6.fc16.x86_64
condor-7.7.5-0.2.fc16.x86_64

How reproducible:

Not always, but often.

Steps to Reproduce: (this is guesswork, I cannot reproduce it reliably)

1. Run a condor compute node and enable job suspension on
   user activity (i.e. a compute job gets signal TSTP
   when other system activity happens, temporarily suspending
   its process(es)).
2. Configure condor so that a 'job' (= collection of processes)
   is killed after it has been suspended for a certain time.  Condor (in fact condor_procd)
   does this using the cgroup freezer mechanism, when available, afaik.
3. Now when the 'job' gets killed, the whole machine sometimes
   hangs or crashes and needs a reboot.
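
For reference, a minimal sketch of what driving the cgroup (v1) freezer from user space looks like. This is not condor_procd's actual code, which I have not checked; the helper names are made up for illustration, and the cgroup directory is an assumption (something like /sys/fs/cgroup/freezer/<group>, depending on where the freezer hierarchy is mounted). The write to freezer.state is the kind of write that shows up as freezer_write in the call trace.

```python
import os

# States accepted on write by the cgroup v1 freezer.
# Reading freezer.state can also return the transitional value "FREEZING".
VALID_STATES = ("FROZEN", "THAWED")


def set_freezer_state(cgroup_dir, state):
    """Write `state` to the freezer.state file of the given cgroup directory.

    `cgroup_dir` is assumed to be a directory in a mounted freezer hierarchy.
    """
    if state not in VALID_STATES:
        raise ValueError("state must be one of %r" % (VALID_STATES,))
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def get_freezer_state(cgroup_dir):
    """Return the current contents of freezer.state, stripped of whitespace."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def list_tasks(cgroup_dir):
    """Return the PIDs listed in the cgroup's tasks file."""
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        return [int(line) for line in f if line.strip()]
```

A process manager would typically freeze the group, signal every PID returned by list_tasks(), and then thaw the group again so the signals get delivered. Whether condor_procd follows exactly this sequence is a guess on my part.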
  
Actual results:

Apr 30 15:51:51 izar kernel: [960860.513694] ------------[ cut here ]------------
Apr 30 15:51:51 izar kernel: [960860.513779] kernel BUG at kernel/cgroup_freezer.c:243!
Apr 30 15:51:51 izar kernel: [960860.513861] invalid opcode: 0000 [#2] SMP 
Apr 30 15:51:51 izar kernel: [960860.513948] CPU 4 
Apr 30 15:51:51 izar kernel: [960860.513955] Modules linked in: ppdev parport_pc lp parport nfs fscache nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack sha256_generic dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel nvidia(PO) snd_hda_codec snd_hwdep snd_seq snd_seq_device nfsd snd_pcm lockd nfs_acl snd_timer auth_rpcgss snd e1000e soundcore iTCO_wdt sunrpc i7core_edac snd_page_alloc i2c_i801 iTCO_vendor_support edac_core microcode i2c_core uinput firewire_ohci firewire_core crc_itu_t [last unloaded: scsi_wait_scan]
Apr 30 15:51:51 izar kernel: [960860.514744] 
Apr 30 15:51:51 izar kernel: [960860.514820] Pid: 12105, comm: condor_procd Tainted: P      D    O 3.3.2-1.fc16.x86_64 #1 transtec AG /DP55WB
Apr 30 15:51:51 izar kernel: [960860.514989] RIP: 0010:[<ffffffff810c481b>]  [<ffffffff810c481b>] update_if_frozen+0x8b/0xd0
Apr 30 15:51:51 izar kernel: [960860.515152] RSP: 0018:ffff8801f478bdb8  EFLAGS: 00010097
Apr 30 15:51:51 izar kernel: [960860.515234] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff880138f4e418
Apr 30 15:51:51 izar kernel: [960860.515387] RDX: 0000000000000000 RSI: ffff8801f478bdc8 RDI: ffff8801b6376800
Apr 30 15:51:51 izar kernel: [960860.515541] RBP: ffff8801f478be08 R08: ffff8801b6376a50 R09: ffffc90008164000
Apr 30 15:51:51 izar kernel: [960860.517261] R10: 00000000ffffffff R11: 0000000000000246 R12: ffff8801b6376800
Apr 30 15:51:51 izar kernel: [960860.517414] R13: 0000000000000000 R14: 0000000000000002 R15: ffff880102387b00
Apr 30 15:51:51 izar kernel: [960860.517568] FS:  00007fa6296e1b40(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000
Apr 30 15:51:51 izar kernel: [960860.517722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 15:51:51 izar kernel: [960860.517804] CR2: 00007fa6281de000 CR3: 00000001fd860000 CR4: 00000000000006e0
Apr 30 15:51:51 izar kernel: [960860.517958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 15:51:51 izar kernel: [960860.518111] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 30 15:51:51 izar kernel: [960860.518265] Process condor_procd (pid: 12105, threadinfo ffff8801f478a000, task ffff8801da334590)
Apr 30 15:51:51 izar kernel: [960860.518419] Stack:
Apr 30 15:51:51 izar kernel: [960860.518495]  0000000000001000 ffff8801dbce2e60 0000000000000000 ffff8801dbce3560
Apr 30 15:51:51 izar kernel: [960860.518665]  ffff8801b6376800 ffff8801b6376800 0000000000000000 ffff880102387b00
Apr 30 15:51:51 izar kernel: [960860.518834]  ffffffff810c4860 ffff8801f478be80 ffff8801f478be58 ffffffff810c48d5
Apr 30 15:51:51 izar kernel: [960860.519003] Call Trace:
Apr 30 15:51:51 izar kernel: [960860.519082]  [<ffffffff810c4860>] ? update_if_frozen+0xd0/0xd0
Apr 30 15:51:51 izar kernel: [960860.519164]  [<ffffffff810c48d5>] freezer_write+0x75/0x200
Apr 30 15:51:51 izar kernel: [960860.519246]  [<ffffffff810c4860>] ? update_if_frozen+0xd0/0xd0
Apr 30 15:51:51 izar kernel: [960860.519329]  [<ffffffff810c11be>] cgroup_file_write+0x1fe/0x2e0
Apr 30 15:51:51 izar kernel: [960860.519413]  [<ffffffff8114d4a8>] ? do_mmap_pgoff+0x348/0x360
Apr 30 15:51:51 izar kernel: [960860.519502]  [<ffffffff81269a0c>] ? security_file_permission+0x2c/0xb0
Apr 30 15:51:51 izar kernel: [960860.519595]  [<ffffffff81181653>] vfs_write+0xb3/0x180
Apr 30 15:51:51 izar kernel: [960860.519688]  [<ffffffff8118197a>] sys_write+0x4a/0x90
Apr 30 15:51:51 izar kernel: [960860.519770]  [<ffffffff815fc5a9>] system_call_fastpath+0x16/0x1b
Apr 30 15:51:51 izar kernel: [960860.519852] Code: 89 e7 e8 09 f7 ff ff 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 0f 1f 84 00 00 00 00 00 41 83 fe 01 74 0a 41 39 dd 74 d0 <0f> 0b 0f 1f 00 41 39 dd 75 c6 41 c7 47 20 02 00 00 00 eb bc 48 
Apr 30 15:51:51 izar kernel: [960860.520337] RIP  [<ffffffff810c481b>] update_if_frozen+0x8b/0xd0
Apr 30 15:51:51 izar kernel: [960860.520424]  RSP <ffff8801f478bdb8>
Apr 30 15:51:51 izar kernel: [960860.521362] ---[ end trace cea69b21df553451 ]---

Expected results:

No kernel BUG.

Additional info:

Kernel is tainted (the proprietary nvidia module is loaded), sorry for that.

Comment 1 Dave Jones 2012-10-23 15:25:19 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug; a fresh bug referencing this bug in the comment is sufficient.)

Comment 2 Justin M. Forbes 2012-11-13 15:06:19 UTC
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.

