Description of problem:
Getting the following after upgrading to 2.6.18-153.el5.x86_64:

Jun 12 06:10:11 dyno kernel: BUG: soft lockup - CPU#0 stuck for 10s! [NetworkManager:5182]
Jun 12 06:10:11 dyno kernel: CPU 0:
Jun 12 06:10:11 dyno kernel: Modules linked in: hidp rfcomm l2cap lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh sha256 aes_generic aes_x86_64 dm_crypt video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev snd_hda_intel testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi crypto_api arc4 snd_seq_dummy snd_seq_oss snd_seq_midi_event iwlagn snd_seq snd_seq_device iwlcore uvcvideo compat_ioctl32 i2c_i801 videodev snd_pcm_oss sr_mod snd_mixer_oss cdrom sky2 v4l1_compat snd_pcm v4l2_common hci_usb i2c_core sg mac80211 shpchp bluetooth snd_timer snd_page_alloc snd_hwdep sdhci mmc_core snd pcspkr serio_raw cfg80211 soundcore dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata sd_mod scsi_mod raid1 ext3 jb
Jun 12 06:10:11 dyno kernel:  uhci_hcd ohci_hcd ehci_hcd
Jun 12 06:10:11 dyno kernel: Pid: 5182, comm: NetworkManager Not tainted 2.6.18-153.el5 #1
Jun 12 06:10:11 dyno kernel: RIP: 0010:[<ffffffff8016583a>]  [<ffffffff8016583a>] pci_restore_msi_state+0x12/0x13a
Jun 12 06:10:11 dyno kernel: RSP: 0018:ffff81011f897898  EFLAGS: 00000212
Jun 12 06:10:11 dyno kernel: RAX: 0000000042328086 RBX: ffff81013fcd5800 RCX: 0000000042328086
Jun 12 06:10:11 dyno kernel: RDX: 0000000000000cfc RSI: 0000000000000246 RDI: ffff81013d866fc0
Jun 12 06:10:11 dyno kernel: RBP: 0000000000000001 R08: 0000000000000004 R09: ffff81011f89788c
Jun 12 06:10:11 dyno kernel: R10: 0000000000000000 R11: ffffffff8021faab R12: ffff81013fcd59c8
Jun 12 06:10:11 dyno kernel: R13: ffffffff8021faab R14: 0000000000000014 R15: ffff81011f8977f8
Jun 12 06:10:11 dyno kernel: FS:  00002aefd51c9380(0000) GS:ffffffff803c0000(0000) knlGS:0000000000000000
Jun 12 06:10:11 dyno kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 06:10:11 dyno kernel: CR2: 000000000044ea60 CR3: 000000011ffe7000 CR4: 00000000000006e0
Jun 12 06:10:11 dyno kernel:
Jun 12 06:10:11 dyno kernel: Call Trace:
Jun 12 06:10:11 dyno kernel:  [<ffffffff801595a0>] pci_restore_state+0x16d/0x18a
Jun 12 06:10:11 dyno kernel:  [<ffffffff883ebb94>] :iwlagn:iwl4965_mac_start+0x47/0x62f
Jun 12 06:10:11 dyno kernel:  [<ffffffff8026471f>] fn_hash_dump+0xfe/0x1ce
Jun 12 06:10:11 dyno kernel:  [<ffffffff882762a6>] :mac80211:ieee80211_open+0x21f/0x50a
Jun 12 06:10:11 dyno kernel:  [<ffffffff801288d8>] avc_has_perm+0x43/0x55
Jun 12 06:10:11 dyno kernel:  [<ffffffff8022a583>] dev_open+0x2f/0x6e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80228751>] dev_change_flags+0x5a/0x119
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230e81>] do_setlink+0x72/0x36e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230900>] rtnetlink_rcv_msg+0x1b7/0x1dd
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023eae2>] netlink_run_queue+0x84/0x147
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230749>] rtnetlink_rcv_msg+0x0/0x1dd
Jun 12 06:10:11 dyno kernel:  [<ffffffff802306f4>] rtnetlink_rcv+0x30/0x4e
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023ef18>] netlink_data_ready+0x12/0x50
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023e0cf>] netlink_sendskb+0x26/0x40
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023eef3>] netlink_sendmsg+0x2b4/0x2c7
Jun 12 06:10:11 dyno kernel:  [<ffffffff80055607>] sock_sendmsg+0xf8/0x14a
Jun 12 06:10:11 dyno kernel:  [<ffffffff80030ca6>] sock_recvmsg+0x107/0x15f
Jun 12 06:10:11 dyno kernel:  [<ffffffff800a0512>] autoremove_wake_function+0x0/0x2e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80129195>] inode_has_perm+0x56/0x63
Jun 12 06:10:11 dyno kernel:  [<ffffffff80220588>] sys_sendmsg+0x217/0x28a
Jun 12 06:10:11 dyno kernel:  [<ffffffff8002d121>] mntput_no_expire+0x19/0x89
Jun 12 06:10:11 dyno kernel:  [<ffffffff80220a96>] sys_sendto+0x11c/0x14f
Jun 12 06:10:11 dyno kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0

Version-Release number of selected component (if applicable):
2.6.18-153.el5

How reproducible:
Always

Steps to Reproduce:
1. Boot -153.el5 on my Sony VAIO laptop

Actual results:
soft lockup and the machine is completely unusable (I lose the keyboard and network, possibly other things)

Expected results:
Love, luck and lollipops :-)

Additional info:
Booting -152.el5 works.
Created attachment 347569 [details] 'lspci -nv' from the system
What went into -153 that touches PCI?
linux-2.6-ppc64-resolves-issues-with-pcie-save-restore-state.patch Suspiciously, that patch changed the line just before the call to pci_restore_msi_state...
2.6.18-153.el5.bz505541.1 appears to be a happy camper. I'm running with it right now to make sure all is well, but I'm no longer seeing the soft lockups on boot.
Cool! I think my work here is done... :-)
I think I found the problem with the patch listed in #4. The resulting code does a premature PCI state save, thus storing two save states for PCI_CAP_ID_EXP. This is what the code looks like after the patch is applied:

    save_state = pci_find_saved_cap(dev, PCI_CAP_ID_EXP);
    if (!save_state) {
            save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 7, GFP_KERNEL);
            if (!save_state) {
                    printk(KERN_ERR "Out of memory in pci_save_pcie_state\n");
                    return -ENOMEM;
            }
            save_state->cap_nr = PCI_CAP_ID_EXP;
            pci_add_saved_cap(dev, save_state);    <=== first save
    }
    cap = (u16 *)&save_state->data[0];
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_RTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL2, &cap[i++]);
    pci_add_saved_cap(dev, save_state);    <== second, correct save...
    return 0;

I'll bet if you remove the first pci_add_saved_cap(), all will work. Do you want me to do a brew build with such a change? - Don (Dutile)
Don, if you wouldn't mind, please do a quick scratch build with that change. Jay has shown a good deal of patience dealing with us kernel folk, so he might not mind testing it again. ;-) I would rather fix the patch than revert and re-apply. Thanks, Don
Brew build kicked off with a patch that removes the first pci_add_saved_cap() from the code snippet shown in c#10 (x86_64 build only): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1842628 I'll ping jturner when the build is done (for x86_64 bare-metal, at least). - Don (Dutile)
I think that the first pci_add_saved_cap() should stay, and the second one should be removed. The logic is: if a saved_state entry for PCI_CAP_ID_EXP is not found on the list, allocate and initialize a new structure and add it to the list (the first pci_add_saved_cap()). If an existing PCI_CAP_ID_EXP entry is found, use that structure. Then save the PCIe register contents to the PCI_CAP_ID_EXP entry. The second call to pci_add_saved_cap() should be removed.
Created attachment 347660 [details] Proposed patch to remove second call to pci_add_saved_cap I will test this patch on Power PC if someone else can test on x86_64.
Yes, the patch in #15 is the proper one. Keeping the second call would add an already-listed saved_cap to the list again..... a very screwy doubly-linked list! The add should happen when the structure is kzalloc'd. I've kicked off a new brew build with the same patch as in #15 (x86_64 only): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1843241 Again, I'll ask Jay to test once it's done.
------- Comment From rlary.com 2009-06-12 16:47 EDT-------
Built the -153 kernel with patch #15 on Power PC and confirmed that EEH recovery worked with the lpfc driver, which is the primary case where Power PC relies on pci_save/restore_state to preserve PCI state across PCI slot resets during recovery from PCI bus errors. One note: IBM had tested with the original patch 'Resolves issues with pcie-save-restore-state' submitted by IBM and did not see the same issue as seen on x86_64.
Jay, can you grab the x86_64 rpm from #16 and give it a quick spin? If it works, let me know so I can post a patch and get it into 5.4. - Don (Dutile)
Sorry for the delay . . . I had a busy weekend which, sadly, didn't involve verifying bugs! Anyway, 2.6.18-153.el5bz505541ddd.x86_64 works beautifully on my VAIO. I did a cursory review of syslog and don't see any nasties there either, so it appears we have something that sticks. Also, from comment 17, it appears that IBM is happy with the patch as well.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
------- Comment From rlary.com 2009-06-16 14:20 EDT-------
Verified patch in -153 kernel with native EEH handler in lpfc driver.

------- Comment From rlary.com 2009-06-16 14:24 EDT-------
Patch submitted to Red Hat

------- Comment From rlary.com 2009-06-16 14:26 EDT-------
Patch in POST state on Red Hat side
-154.el5.x86_64 is working well on my laptop. Will move to Verified once -154.el5 is available in a rel-eng tree.
in kernel-2.6.18-154.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
*** Bug 505736 has been marked as a duplicate of this bug. ***
Moving to Verified. 2.6.18-155.el5 is included in the latest 5.4 snap (20090622.0)
------- Comment From rlary.com 2009-07-01 19:15 EDT-------
Verified by code inspection and correct function during EEH recovery with PCIe devices on Power PC. Closing on IBM side of mirror.
Verified as fixed on Power PC in RHEL5.4 Snapshot 1
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html