Bug 210499 - panic - "NetLabel: failed to initialize properly" on DL585
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Vivek Goyal
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 210191
Depends On:
Blocks:
Reported: 2006-10-12 15:42 UTC by Matt Brodeur
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-10-26 11:41:28 UTC
Target Upstream Version:
Embargoed:


Attachments
fix for bss data corruption (2.37 KB, patch)
2006-10-16 18:26 UTC, Vivek Goyal

Description Matt Brodeur 2006-10-12 15:42:29 UTC
Description of problem:
On one of our HP ProLiant DL585s I'm seeing this output when booting the
installer kernel:
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
Kernel panic - not syncing: NetLabel: failed to initialize properly (-17)

This is happening with every x86_64 RHEL5 tree I've tested so far (0927-1012),
but only on this system.  All other trees, including i386 RHEL5, install
normally, and I'm not seeing this behavior on another DL585 in the lab.


Version-Release number of selected component (if applicable):
RHEL5-Server-20060927.0 (kernel 2.6.18-1.2702.el5)
RHEL5-Server-20061006.2 (kernel 2.6.18-1.2714.el5)
RHEL5-Server-20061010.nightly (kernel 2.6.18-1.2717.el5)
RHEL5-Server-20061012.nightly (kernel 2.6.18-1.2717.el5)

How reproducible:
100%

Steps to Reproduce:
1. Boot an x86_64 RHEL5 install on dl585-02.rhts.boston.redhat.com

Actual results:
Kernel panic

Expected results:
Booting into anaconda

Additional info:
Console output from test runs available at:
RHEL5-Server-20060927.0
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=870766
RHEL5-Server-20061006.2
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=870556
RHEL5-Server-20061010.nightly
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=870019

Comment 1 Tom Kincaid 2006-10-12 20:03:57 UTC
Marking as a Beta blocker. Without a fix, we won't get testing on a significant platform.

Section 6c of the release criteria.

http://intranet.corp.redhat.com/ic/intranet/RHEL500Beta2ReleaseCriteria#HardwareCert



Comment 2 RHEL Program Management 2006-10-12 20:19:20 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 3 James Morris 2006-10-13 19:16:46 UTC
NetLabel should not be active, I think.

Comment 4 Chip Coldwell 2006-10-13 20:12:04 UTC
*** Bug 210191 has been marked as a duplicate of this bug. ***

Comment 5 Eric Paris 2006-10-13 20:26:17 UTC
Chip, was this actually on dl585-02.rhts.boston.redhat.com or do you have a
second physical machine that had the same problem?

Comment 6 Chip Coldwell 2006-10-13 20:35:50 UTC
See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=208445 for a possibly
related issue.



Comment 7 Chip Coldwell 2006-10-13 20:36:46 UTC
(In reply to comment #5)
> Chip, was this actually on dl585-02.rhts.boston.redhat.com or do you have a
> second physical machine that had the same problem?

This happened to me on dl585-03.lab.boston.redhat.com, so it is a second
physical machine with the identical problem.

Chip


Comment 8 Chip Coldwell 2006-10-13 20:44:29 UTC
I think this is likely a duplicate of bug 208445; maybe someone in RHTS can test a  
2.6.18-1.2725.el5 or later kernel and see if the bug is fixed?

Chip

Comment 9 Eric Paris 2006-10-14 17:15:30 UTC
Just tried 2.6.18-1.2727.el5 and ended up with:

NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
Kernel panic - not syncing: NetLabel: failed to initialize properly (-17)

So it appears to not be fixed.

Comment 10 Eric Paris 2006-10-14 23:33:58 UTC
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel: About to call netlbl_domhsh_init
NetLabel: returned from netlbl_domhsh_init with ret_val:0
NetLabel: About to call netlbl_netlink_init
NetLabel: returned from netlbl_netlink_init with ret_val:0
NetLabel: About to call netlbl_unlabel_defconf
NetLabel: returned from netlbl_netlink_init with ret_val:-17
Kernel panic - not syncing: NetLabel: failed to initialize properly (-17)

I added a few printk's (and apparently messed up the last one), and it
looks like netlbl_unlabel_defconf is where the EEXIST comes from.  I'll start
putting them deeper down tomorrow, and look at the thread

http://marc.theaimsgroup.com/?l=linux-netdev&m=116016517332266&w=2

which might have some remote chance of applicability.
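
For readers decoding the panic message: the (-17) follows the kernel convention of returning a negative errno value, and on Linux 17 is EEXIST. A minimal user-space sketch of that decoding is below; kernel_ret_text is a hypothetical helper written for illustration, not a kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): turn a kernel-style return
 * value (0 on success, negative errno on failure) into readable text.
 */
static const char *kernel_ret_text(int ret_val)
{
    assert(ret_val <= 0);        /* kernel convention: errors are negative */
    if (ret_val == 0)
        return "success";
    return strerror(-ret_val);   /* -17 maps to EEXIST on Linux */
}
```

Calling kernel_ret_text(-17) on a Linux system yields the C library's text for EEXIST.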

Comment 11 Eric Paris 2006-10-15 16:35:40 UTC
NetLabel: Initializing
NetLabel:  domain hash size = 128
NetLabel:  protocols = UNLABELED CIPSOv4
NetLabel: About to call netlbl_domhsh_init
NetLabel: returned from netlbl_domhsh_init with ret_val:0
NetLabel: About to call netlbl_netlink_init
NetLabel: returned from netlbl_netlink_init with ret_val:0
NetLabel: About to call netlbl_unlabel_defconf
NetLabel: netlbl_unlabel_defconf: about to call netlbl_domhsh_add_default
NetLabel: netlbl_domhsh_add: entry->type:5 NETLBL_NLTYPE_UNLABELED:5
NetLabel: netlbl_domhsh_add: entry->domain IS null
NetLabel: netlbl_domhsh_add: rcu_dereference(netlbl_domhsh_def) != NULL
NetLabel: netlbl_domhsh_def: ffffffffffffffff and &netlbl_domhsh_def: ffffffff808d6580
NetLabel: netlbl_unlabel_defconf: return from netlbl_domhsh_add_default with ret_val:-17
NetLabel: returned from netlbl_unlabel_defconf with ret_val:-17
Kernel panic - not syncing: NetLabel: failed to initialize properly (-17)
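
The trace above suggests the default-entry pointer was already non-NULL (all ones) before anything had been added, which is what one would expect if the bss had been overwritten. The following is a simplified user-space sketch of that failure mode, not the actual NetLabel code; struct dom_entry, default_entry, add_default and add_default_with_bss_value are hypothetical names, and only the "non-NULL default means -EEXIST" check is taken from the printk output.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a domain hash entry. */
struct dom_entry {
    const char *domain;   /* NULL means "this is the default entry" */
    int type;             /* 5 matches the UNLABELED type in the log */
};

/* Stand-in for netlbl_domhsh_def: uninitialized, so it lives in .bss
 * and is expected to be NULL at boot. */
static struct dom_entry *default_entry;

/* Install the default entry; refuse if one already appears to exist. */
static int add_default(struct dom_entry *entry)
{
    assert(entry != NULL);
    if (default_entry != NULL)
        return -EEXIST;   /* the -17 seen in the panic */
    default_entry = entry;
    return 0;
}

/* Simulate the pointer's bss slot holding `raw` before the first add. */
static int add_default_with_bss_value(uintptr_t raw)
{
    static struct dom_entry unlabeled = { NULL, 5 };

    default_entry = (struct dom_entry *)raw;   /* never dereferenced */
    return add_default(&unlabeled);
}
```

With a clean slot, add_default_with_bss_value(0) succeeds; with UINTPTR_MAX (the ffffffffffffffff pattern from the log) the very first add is rejected with -EEXIST.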


Comment 12 Eric Paris 2006-10-15 17:18:22 UTC
I'm not going to claim to understand this at all, but the PAGE_ALIGN changes to
bad_addr discussed in the netdev thread seem to allow the machine to
boot.  This is going to take some research, as all of the addresses that
bad_addr appears to be fixing up are so far away from 808d6580...

Comment 13 Eric Paris 2006-10-16 15:15:41 UTC
Just for reference, here is the printk output I got from bad_addr, which
prints the 'fixed-up' *addrp:

bad_addr: fixing up addr < 0x8000   *addrp: 0000000000008000
bad_addr: fixing up last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT   *addrp: 0000000000029000
bad_addr: fixing up last >= 640*1024 && addr < 1024*1024    *addrp: 0000000000100000
bad_addr: fixing up last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)   *addrp: 0000000000ad7000
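
Every fixed-up address above sits on a page boundary, which is consistent with bad_addr rounding up to the next page. A minimal sketch of that rounding follows, assuming 4 KiB pages; page_align, SKETCH_PAGE_SIZE and check_logged_fixups are illustrative names, and this is not the code from the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on a typical x86_64 configuration. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round addr up to the next page boundary (no change if already aligned). */
static uint64_t page_align(uint64_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Check that the four *addrp values from the log are page-aligned. */
static void check_logged_fixups(void)
{
    static const uint64_t logged[] = {
        0x8000, 0x29000, 0x100000, 0xad7000
    };

    for (size_t i = 0; i < sizeof(logged) / sizeof(logged[0]); i++)
        assert(page_align(logged[i]) == logged[i]);
}
```

An address just past a boundary, such as 0x8001, rounds up to 0x9000, while the logged values are left unchanged.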

Comment 14 Eric Paris 2006-10-16 15:31:34 UTC
Intelligent conversation about the problem happens here:

http://marc.theaimsgroup.com/?l=linux-netdev&m=116014524814284&w=2

Comment 15 Linda Wang 2006-10-16 18:01:26 UTC
It seems that the patch Vivek proposed on lkml on 10/6 fixes the problem.


Comment 16 Vivek Goyal 2006-10-16 18:26:44 UTC
Created attachment 138605
fix for bss data corruption

Comment 17 Linda Wang 2006-10-16 19:40:00 UTC
Posted for review on 10/16.

Comment 18 Don Zickus 2006-10-17 02:04:36 UTC
in kernel-2.6.18-1.2728.el5

Comment 20 Matt Brodeur 2006-10-25 16:06:22 UTC
We've now had multiple successful RHTS installs of the 1020.1 tree on dl585-02.
Marking this as VERIFIED/CURRENTRELEASE.


Comment 21 Jay Turner 2006-10-26 11:39:15 UTC
Actually running through the Verified state.

