Bug 510746

Summary:

BUG: warning at kernel/softirq.c:138/local_bh_enable() (Tainted: G )

Product:

Red Hat Enterprise Linux 5

Reporter:

Jan Tluka <jtluka>

Component:

kernel

Assignee:

Paolo Bonzini <pbonzini>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

medium

Docs Contact:

Priority:

low

Version:

5.4

CC:

clalance, davem, dzickus, emcnabb, pbonzini, peterm, prarit, qcai, xen-maint, yzheng

Target Milestone:

Target Release:

---

Hardware:

ia64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Clone Of:

Environment:

Last Closed:

2010-03-30 07:45:26 UTC

Bug Depends On:

470201, 470436

Bug Blocks:

533192

Attachments:

Description	Flags
reproducer xml for RHTS	none
patch that could fix the bug	none

Description Jan Tluka 2009-07-10 15:01:40 UTC

Description of problem:
While running regression tests in RHTS on a RHEL5.4 snapshot, I found the following warning after the /kernel/errata/5.3.z/470436 test finished and passed:

Checking dmesg for specific failures!
BUG: warning at kernel/softirq.c:138/local_bh_enable() (Not tainted)
End of log.

Links to RHTS logs in Additional info.

Version-Release number of selected component (if applicable):
RHEL5.4-Server-20090708.0
kernel-xen-2.6.18-157.el5

How reproducible:
Run /kernel/errata/5.3.z/470436 test in RHTS on ia64 machine using xen kernel.

Steps to Reproduce:
1. kernel_workflow.py -x -u rhuser -t /kernel/errata/5.3.z/470436 -a ia64 -S rhts.redhat.com -d RHEL5.4-Server-20090708.0
  
Actual results:
BUG warning in dmesg and test Fails

Expected results:
No BUG warning in dmesg and test Passes

Additional info:
Job where I saw the warning: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=71162
Testruns that ever failed: http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?result=Fail&test_filter=/kernel/errata/5.3.z/470436

Comment 1 Jan Tluka 2009-07-10 15:18:25 UTC

Created attachment 351275 [details]
reproducer xml for RHTS

The kernel_workflow.py reproducer does not run as I expected, so use the attached XML file to reproduce on the correct system configuration.

/usr/bin/submit_job.py -S rhts.redhat.com -j bug510746.xml

Scheduled job is here: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=71644

Comment 2 Jan Tluka 2009-07-10 16:19:32 UTC

Scheduled job in comment 1 was aborted so I scheduled another one with specific host: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=71657

Comment 4 Paolo Bonzini 2009-07-13 13:33:25 UTC

Unfortunately, the error does not help much.  The log at http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8983261 does not include the full dmesg output.

Would it be possible (by modifying the workflow?) to include /tmp/dmesg.log at the end of the output if the test fails?

Thanks!

Comment 5 Jan Tluka 2009-07-15 13:17:44 UTC

Hi Paolo, I scheduled yet another job because the provided XML file had an error. The job in comment 2 failed to install the Xen kernel and tested the non-Xen one instead.

Here's the job that will additionally include the dmesg log once it's finished.
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=72787

Comment 6 Jan Tluka 2009-07-15 17:08:03 UTC

(In reply to comment #5)
> Hi Paolo, I scheduled yet another job, because provided xml file had an error.
> The job in comment 2 failed to install xen kernel and was testing non-xen one.
> 
> Here's the job that will additionally include dmesg log once it's finished.
> http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=72787

Damn, my XML file got corrupted somehow so another job is queued ATM.
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=72837

Comment 7 Jan Tluka 2009-07-17 11:50:24 UTC

Great, now I think I have the stuff you requested.
This happened on the hp-bl870c-02.rhts.bos.redhat.com system.

BUG: warning at kernel/softirq.c:138/local_bh_enable() (Not tainted)

Call Trace:
 [<a00000010001d240>] show_stack+0x40/0xa0
                                sp=e00000018ac97bc0 bsp=e00000018ac915a8
 [<a00000010001d2d0>] dump_stack+0x30/0x60
                                sp=e00000018ac97d90 bsp=e00000018ac91590
 [<a00000010009b520>] local_bh_enable+0x120/0x1c0
                                sp=e00000018ac97d90 bsp=e00000018ac91578
 [<a000000100549f90>] lock_sock+0x190/0x1c0
                                sp=e00000018ac97d90 bsp=e00000018ac91548
 [<a000000100542f20>] sock_fasync+0xe0/0x320
                                sp=e00000018ac97dc0 bsp=e00000018ac914e8
 [<a000000100544970>] sock_close+0x70/0xa0
                                sp=e00000018ac97dc0 bsp=e00000018ac914c0
 [<a0000001001841e0>] __fput+0x1a0/0x420
                                sp=e00000018ac97dc0 bsp=e00000018ac91480
 [<a0000001001844a0>] fput+0x40/0x60
                                sp=e00000018ac97dc0 bsp=e00000018ac91460
 [<a00000010055a6d0>] __scm_destroy+0x130/0x1e0
                                sp=e00000018ac97dc0 bsp=e00000018ac91438
 [<a000000100662e50>] unix_destruct_fds+0x70/0xa0
                                sp=e00000018ac97dd0 bsp=e00000018ac91418
 [<a000000100550b70>] skb_release_head_state+0x1f0/0x300
                                sp=e00000018ac97e00 bsp=e00000018ac913e8
 [<a000000100552640>] __kfree_skb+0x20/0x60
                                sp=e00000018ac97e00 bsp=e00000018ac913c8
 [<a000000100552880>] kfree_skb+0x140/0x160
                                sp=e00000018ac97e00 bsp=e00000018ac91398
 [<a000000100660f00>] unix_release_sock+0x360/0x460
                                sp=e00000018ac97e00 bsp=e00000018ac91340
 [<a000000100661040>] unix_release+0x40/0x60
                                sp=e00000018ac97e00 bsp=e00000018ac91320
 [<a0000001005447c0>] sock_release+0x80/0x1c0
                                sp=e00000018ac97e00 bsp=e00000018ac912f8
 [<a000000100544980>] sock_close+0x80/0xa0
                                sp=e00000018ac97e10 bsp=e00000018ac912d0
 [<a0000001001841e0>] __fput+0x1a0/0x420
                                sp=e00000018ac97e10 bsp=e00000018ac91290
 [<a0000001001844a0>] fput+0x40/0x60
                                sp=e00000018ac97e10 bsp=e00000018ac91270
 [<a00000010017da90>] filp_close+0x110/0x140
                                sp=e00000018ac97e10 bsp=e00000018ac91240
 [<a00000010008fac0>] put_files_struct+0x120/0x1e0
                                sp=e00000018ac97e10 bsp=e00000018ac91200
 [<a000000100093be0>] do_exit+0x7a0/0x1800
                                sp=e00000018ac97e10 bsp=e00000018ac911a8
 [<a000000100094e50>] do_group_exit+0x210/0x220
                                sp=e00000018ac97e30 bsp=e00000018ac91170
 [<a000000100094e80>] sys_exit_group+0x20/0x40
                                sp=e00000018ac97e30 bsp=e00000018ac91118
 [<a00000010006ae00>] xen_trace_syscall+0x100/0x140
                                sp=e00000018ac97e30 bsp=e00000018ac91118
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e00000018ac98000 bsp=e00000018ac91118
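
For readers unfamiliar with the check at the top of this trace, here is a minimal user-space sketch of how a warn-once assertion behaves. It assumes the check in local_bh_enable() has the form WARN_ON_ONCE(irqs_disabled()); the report itself does not quote that source line. Every model_* name is hypothetical and is not a kernel API.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the CPU interrupt state; not a kernel API. */
static bool model_irqs_disabled = false;

/* Number of warnings the model has actually printed. */
static int model_warnings_printed = 0;

/* Warn the first time cond is true at this call site, then stay silent. */
#define MODEL_WARN_ON_ONCE(cond)                                  \
    do {                                                          \
        static bool warned_once = false;                          \
        if ((cond) && !warned_once) {                             \
            warned_once = true;                                   \
            model_warnings_printed++;                             \
            fprintf(stderr, "BUG: warning at %s:%d/%s()\n",       \
                    __FILE__, __LINE__, __func__);                \
        }                                                         \
    } while (0)

/* Model of local_bh_enable(): complain if entered with IRQs disabled. */
static void model_local_bh_enable(void)
{
    MODEL_WARN_ON_ONCE(model_irqs_disabled);
    /* The real function would go on to re-enable bottom halves here. */
}
```

In this model, repeated calls to model_local_bh_enable() with model_irqs_disabled set print the message only once and execution continues, which is consistent with a single warning line in dmesg rather than a crash.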

Comment 8 Chris Lalancette 2009-07-17 13:05:46 UTC

Hm, it looks kind of similar to some other local_bh_enable badness we've had elsewhere (bz 508648, bz 498394, bz 470919).  Paolo, care to take a look?

Chris Lalancette

Comment 9 Paolo Bonzini 2009-07-21 11:16:26 UTC

It looks very different from other local_bh_enable problems. :-(

Comment 10 Paolo Bonzini 2009-07-21 12:07:43 UTC

For the record, here is the job that had the failure:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=72909

Comment 11 Paolo Bonzini 2009-07-21 14:55:29 UTC

Created attachment 354496 [details]
patch that could fix the bug

This is a backport of upstream 233e70f4228e78eb2f80dc6650f65d3ae3dbf17c.  It will remove the call to sock_fasync in the case of unix.c and thus it should fix the badness.  However, the bug could still be latent for a more complicated testcase.
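
A rough user-space model of the behavioural change may help here. It is only a sketch under the assumption that the backport makes the close path call the fasync hook solely for files that have FASYNC set, instead of unconditionally from the socket release path. The struct, the flag value, and all model_* names are hypothetical and do not come from the patch.

```c
/* Hypothetical flag bit for the model; not the kernel's FASYNC value. */
#define MODEL_FASYNC 0x2000u

/* Minimal stand-in for an open file. */
struct model_file {
    unsigned int f_flags;  /* open-file status flags */
    int fasync_calls;      /* how many times the fasync hook ran */
};

/* Stand-in for sock_fasync(): just count the invocation. */
static void model_fasync(struct model_file *f)
{
    f->fasync_calls++;
}

/* Old behaviour in the model: the close path always runs the fasync hook. */
static void model_close_before(struct model_file *f)
{
    model_fasync(f);
}

/* New behaviour in the model: run the hook only if FASYNC was requested. */
static void model_close_after(struct model_file *f)
{
    if (f->f_flags & MODEL_FASYNC)
        model_fasync(f);
}
```

For a file that never enabled FASYNC, model_close_after() never reaches model_fasync(), which mirrors how the sock_fasync frame would drop out of the trace in comment 7. A file with the flag set still reaches it, which is why the bug could remain latent.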

Comment 13 Paolo Bonzini 2009-07-24 08:49:58 UTC

Dave,

this bug has been assigned to kernel-xen, but it seems like a non-virtualization-related problem.

It is related to the SCM_RIGHTS DoS of bug 470201, in that it is triggered by the same testcase.  It is not as serious, however, because this is just a WARN_ON_ONCE rather than a kernel panic.

My backport of an upstream patch should fix this bug by removing the execution path that triggered it.  However, I didn't really understand the root cause of the problem (i.e., where the IRQs are enabled in the call trace of comment #7), and I'm pretty sure that the bug would resurface if FASYNC usage were somehow added to the unix.c testcase.  Can you take a look?
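
To make "FASYNC usage" concrete, the following user-space sketch enables O_ASYNC on a Unix socket, sends that socket's descriptor over SCM_RIGHTS, and closes everything while the message is still unread. This is not the RHTS unix.c testcase, whose source is not shown in this report; the function names are hypothetical, and whether this sequence would actually hit the warning on an affected kernel is an untested assumption.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd_to_pass over the Unix socket `via` as SCM_RIGHTS ancillary data. */
static int send_fd(int via, int fd_to_pass)
{
    char payload = 'x';
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;              /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&control, 0, sizeof(control));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(via, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Queue an O_ASYNC socket inside an unread SCM_RIGHTS message, then close
 * every descriptor so the kernel has to drop the in-flight reference.
 * Returns 0 on success, -1 on any failure.
 */
static int close_with_async_fd_in_flight(void)
{
    int carrier[2], passed[2], flags, rc = -1;

    /* SIGIO terminates the process by default, so ignore it first. */
    signal(SIGIO, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, carrier) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, passed) < 0) {
        close(carrier[0]);
        close(carrier[1]);
        return -1;
    }

    flags = fcntl(passed[0], F_GETFL);
    if (flags >= 0 &&
        fcntl(passed[0], F_SETOWN, getpid()) == 0 &&
        fcntl(passed[0], F_SETFL, flags | O_ASYNC) == 0 &&
        send_fd(carrier[0], passed[0]) == 0)
        rc = 0;

    close(passed[0]);
    close(passed[1]);
    close(carrier[0]);
    close(carrier[1]);
    return rc;
}
```

Because the passed socket has O_ASYNC set, a close path that checks the flag would still call the fasync hook for it when the queued message is freed, which is the kind of case the paragraph above is concerned about.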

Comment 17 Paolo Bonzini 2009-07-27 10:12:07 UTC

Reassigned from kernel-xen to kernel as the patch does not affect Xen at all.

Comment 19 RHEL Program Management 2009-09-25 17:36:09 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 20 Don Zickus 2009-11-17 21:55:57 UTC

in kernel-2.6.18-174.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 24 errata-xmlrpc 2010-03-30 07:45:26 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html