Description of problem:
Getting the following after upgrading to 2.6.18-153.el5.x86_64:

Jun 12 06:10:11 dyno kernel: BUG: soft lockup - CPU#0 stuck for 10s! [NetworkManager:5182]
Jun 12 06:10:11 dyno kernel: CPU 0:
Jun 12 06:10:11 dyno kernel: Modules linked in: hidp rfcomm l2cap lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh sha256 aes_generic aes_x86_64 dm_crypt video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev snd_hda_intel testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi crypto_api arc4 snd_seq_dummy snd_seq_oss snd_seq_midi_event iwlagn snd_seq snd_seq_device iwlcore uvcvideo compat_ioctl32 i2c_i801 videodev snd_pcm_oss sr_mod snd_mixer_oss cdrom sky2 v4l1_compat snd_pcm v4l2_common hci_usb i2c_core sg mac80211 shpchp bluetooth snd_timer snd_page_alloc snd_hwdep sdhci mmc_core snd pcspkr serio_raw cfg80211 soundcore dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata sd_mod scsi_mod raid1 ext3 jb
Jun 12 06:10:11 dyno kernel:  uhci_hcd ohci_hcd ehci_hcd
Jun 12 06:10:11 dyno kernel: Pid: 5182, comm: NetworkManager Not tainted 2.6.18-153.el5 #1
Jun 12 06:10:11 dyno kernel: RIP: 0010:[<ffffffff8016583a>]  [<ffffffff8016583a>] pci_restore_msi_state+0x12/0x13a
Jun 12 06:10:11 dyno kernel: RSP: 0018:ffff81011f897898  EFLAGS: 00000212
Jun 12 06:10:11 dyno kernel: RAX: 0000000042328086 RBX: ffff81013fcd5800 RCX: 0000000042328086
Jun 12 06:10:11 dyno kernel: RDX: 0000000000000cfc RSI: 0000000000000246 RDI: ffff81013d866fc0
Jun 12 06:10:11 dyno kernel: RBP: 0000000000000001 R08: 0000000000000004 R09: ffff81011f89788c
Jun 12 06:10:11 dyno kernel: R10: 0000000000000000 R11: ffffffff8021faab R12: ffff81013fcd59c8
Jun 12 06:10:11 dyno kernel: R13: ffffffff8021faab R14: 0000000000000014 R15: ffff81011f8977f8
Jun 12 06:10:11 dyno kernel: FS:  00002aefd51c9380(0000) GS:ffffffff803c0000(0000) knlGS:0000000000000000
Jun 12 06:10:11 dyno kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 06:10:11 dyno kernel: CR2: 000000000044ea60 CR3: 000000011ffe7000 CR4: 00000000000006e0
Jun 12 06:10:11 dyno kernel:
Jun 12 06:10:11 dyno kernel: Call Trace:
Jun 12 06:10:11 dyno kernel:  [<ffffffff801595a0>] pci_restore_state+0x16d/0x18a
Jun 12 06:10:11 dyno kernel:  [<ffffffff883ebb94>] :iwlagn:iwl4965_mac_start+0x47/0x62f
Jun 12 06:10:11 dyno kernel:  [<ffffffff8026471f>] fn_hash_dump+0xfe/0x1ce
Jun 12 06:10:11 dyno kernel:  [<ffffffff882762a6>] :mac80211:ieee80211_open+0x21f/0x50a
Jun 12 06:10:11 dyno kernel:  [<ffffffff801288d8>] avc_has_perm+0x43/0x55
Jun 12 06:10:11 dyno kernel:  [<ffffffff8022a583>] dev_open+0x2f/0x6e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80228751>] dev_change_flags+0x5a/0x119
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230e81>] do_setlink+0x72/0x36e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230900>] rtnetlink_rcv_msg+0x1b7/0x1dd
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023eae2>] netlink_run_queue+0x84/0x147
Jun 12 06:10:11 dyno kernel:  [<ffffffff80230749>] rtnetlink_rcv_msg+0x0/0x1dd
Jun 12 06:10:11 dyno kernel:  [<ffffffff802306f4>] rtnetlink_rcv+0x30/0x4e
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023ef18>] netlink_data_ready+0x12/0x50
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023e0cf>] netlink_sendskb+0x26/0x40
Jun 12 06:10:11 dyno kernel:  [<ffffffff8023eef3>] netlink_sendmsg+0x2b4/0x2c7
Jun 12 06:10:11 dyno kernel:  [<ffffffff80055607>] sock_sendmsg+0xf8/0x14a
Jun 12 06:10:11 dyno kernel:  [<ffffffff80030ca6>] sock_recvmsg+0x107/0x15f
Jun 12 06:10:11 dyno kernel:  [<ffffffff800a0512>] autoremove_wake_function+0x0/0x2e
Jun 12 06:10:11 dyno kernel:  [<ffffffff80129195>] inode_has_perm+0x56/0x63
Jun 12 06:10:11 dyno kernel:  [<ffffffff80220588>] sys_sendmsg+0x217/0x28a
Jun 12 06:10:11 dyno kernel:  [<ffffffff8002d121>] mntput_no_expire+0x19/0x89
Jun 12 06:10:11 dyno kernel:  [<ffffffff80220a96>] sys_sendto+0x11c/0x14f
Jun 12 06:10:11 dyno kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0

Version-Release number of selected component (if applicable):
2.6.18-153.el5

How reproducible:
Always

Steps to Reproduce:
1. Boot -153.el5 on my Sony VAIO laptop

Actual results:
soft lockup and the machine is completely unusable (I lose the keyboard and network, possibly other things)

Expected results:
Love, luck and lollipops :-)

Additional info:
Booting -152.el5 works.
Created attachment 347569 [details] 'lspci -nv' from the system
What went into -153 that touches PCI?
linux-2.6-ppc64-resolves-issues-with-pcie-save-restore-state.patch Suspiciously, that patch changed the line just before the call to pci_restore_msi_state...
2.6.18-153.el5.bz505541.1 appears to be a happy camper. I'm running with it right now to make sure all is well, but I'm no longer seeing the soft lockups on boot.
Cool! I think my work here is done... :-)
I think I found the problem with the patch listed in #4. The resulting code does a premature PCI state save, thus storing two save states for PCI_CAP_ID_EXP. This is what the code looks like after the patch is applied:

    save_state = pci_find_saved_cap(dev, PCI_CAP_ID_EXP);
    if (!save_state) {
            save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 7, GFP_KERNEL);
            if (!save_state) {
                    printk(KERN_ERR "Out of memory in pci_save_pcie_state\n");
                    return -ENOMEM;
            }
            save_state->cap_nr = PCI_CAP_ID_EXP;
            pci_add_saved_cap(dev, save_state);    <=== first save
    }
    cap = (u16 *)&save_state->data[0];
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_RTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL2, &cap[i++]);
    pci_add_saved_cap(dev, save_state);    <== second, correct save...
    return 0;

I'll bet if you remove the first pci_add_saved_cap(), all will work. Do you want me to do a brew build with such a change? - Don (Dutile)
Don, if you wouldn't mind, please do a quick scratch build with that change. Jay has shown a good deal of patience dealing with us kernel folk, so he might not mind testing it again. ;-) I would rather fix the patch than revert and re-apply. Thanks, Don
Brew build kicked off with a patch that removes the first pci_add_saved_cap() from the code snippet shown in c#10 (x86_64 build only): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1842628 I'll ping jturner when the build is done (for x86_64 bare-metal, at least). - Don (Dutile)
I think that the first pci_add_saved_cap() should stay, and the second one should be removed. The logic is: if a saved_state entry for PCI_CAP_ID_EXP is not found on the list, allocate and initialize a new structure and add it to the list (the first pci_add_saved_cap()). If an existing PCI_CAP_ID_EXP entry is found, use that structure. Then save the PCIe register contents to the PCI_CAP_ID_EXP entry. The second call to pci_add_saved_cap() should be removed.
Created attachment 347660 [details] Proposed patch to remove second call to pci_add_saved_cap I will test this patch on Power PC if someone else can test on x86_64.
Yes, the patch in #15 is the proper one. Keeping the second call would add an already-listed saved_cap to the list again..... a very screwy doubly-linked list! The add should happen when the structure is kzalloc'd. I've kicked off a new brew build with the same patch as in #15 (x86_64 only): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1843241 Again, I'll ask Jay to test once it's done.
------- Comment From rlary.com 2009-06-12 16:47 EDT-------
Built the -153 kernel with patch #15 on Power PC and confirmed that EEH recovery worked with the lpfc driver, which is the primary case where Power PC relies on pci_save/restore_state to preserve PCI state across PCI slot resets during recovery from PCI bus errors. One note: IBM had tested with the original patch 'Resolves issues with pcie-save-restore-state' submitted by IBM and did not see the same issue as seen on x86_64.
Jay, can you grab the x86_64 rpm from #16 and give it a quick spin? If it works, let me know so I can post a patch and get it into 5.4. - Don (Dutile)
Sorry for the delay . . . I had a busy weekend which, sadly, didn't involve verifying bugs! Anyway, 2.6.18-153.el5bz505541ddd.x86_64 works beautifully on my VAIO. I did a cursory review of syslog and don't see any nasties there either, so it appears we have something that sticks. Also, from comment 17, it appears that IBM is happy with the patch as well.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
------- Comment From rlary.com 2009-06-16 14:20 EDT-------
Verified patch in -153 kernel with native EEH handler in lpfc driver.

------- Comment From rlary.com 2009-06-16 14:24 EDT-------
Patch submitted to Red Hat

------- Comment From rlary.com 2009-06-16 14:26 EDT-------
Patch in POST state on Red Hat side
-154.el5.x86_64 is working well on my laptop. Will move to Verified once -154.el5 is available in a rel-eng tree.
in kernel-2.6.18-154.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
*** Bug 505736 has been marked as a duplicate of this bug. ***
Moving to Verified. 2.6.18-155.el5 is included in the latest 5.4 snap (20090622.0)
------- Comment From rlary.com 2009-07-01 19:15 EDT-------
Verified by code inspection and correct function during EEH recovery with PCIe devices on Power PC. Closing on IBM side of mirror.
Verified as fixed on Power PC in RHEL5.4 Snapshot 1
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html