Bug 505541
| Summary: | BUG: soft lockup - CPU#0 stuck for 10s! [NetworkManager:5182] | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jay Turner <jturner> |
| Component: | kernel | Assignee: | Don Dutile (Red Hat) <ddutile> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.4 | CC: | aparanja, benl, bugproxy, dcbw, ddutile, dzickus, linville, lwang, rlary, srevivo, toshaan, zcerza |
| Target Milestone: | beta | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-09-02 08:19:53 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Jay Turner
2009-06-12 10:56:20 UTC
Created attachment 347569 [details]
'lspci -nv' from the system
What went into -153 that touches PCI?

linux-2.6-ppc64-resolves-issues-with-pcie-save-restore-state.patch

Suspiciously, that patch changed the line just before the call to pci_restore_msi_state...

2.6.18-153.el5.bz505541.1 appears to be a happy camper. I'm running with it right now to ensure all is well but no longer seeing the soft lockups on boot.

Cool! I think my work here is done... :-)

I think I found the problem with the patch listed in #4.
The resulting code does a premature PCI state save, thus storing
two saved states for PCI_CAP_ID_EXP.
What the code looks like after the patch is applied:
    save_state = pci_find_saved_cap(dev, PCI_CAP_ID_EXP);
    if (!save_state) {
        save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 7, GFP_KERNEL);
        if (!save_state) {
            printk(KERN_ERR "Out of memory in pci_save_pcie_state\n");
            return -ENOMEM;
        }
        save_state->cap_nr = PCI_CAP_ID_EXP;
        pci_add_saved_cap(dev, save_state);    /* <=== first save */
    }
    cap = (u16 *)&save_state->data[0];
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_RTCTL, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_DEVCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_LNKCTL2, &cap[i++]);
    pci_read_config_word(dev, pos + PCI_EXP_SLTCTL2, &cap[i++]);
    pci_add_saved_cap(dev, save_state);        /* <== second, correct save... */
    return 0;
I'll bet if you remove the first pci_add_saved_cap(), all will work.
Do you want me to do a brew build with such a change?
- Don (Dutile)
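
Editorial illustration: the snippet above calls pci_add_saved_cap() twice on the same save_state whenever the entry is freshly allocated. The following is a minimal userspace sketch of why that is dangerous, assuming a simplified singly linked list with head insertion. It is not the kernel's list implementation; `cap_node`, `add_cap`, and `count_bounded` are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a saved-capability entry (not the kernel struct). */
struct cap_node {
    int cap_nr;
    struct cap_node *next;
};

/* Head insertion: the new node points at the old head, then becomes the head. */
static void add_cap(struct cap_node **head, struct cap_node *n)
{
    assert(head != NULL && n != NULL);
    n->next = *head;
    *head = n;
}

/*
 * Walk the list, visiting at most `limit` nodes.
 * Returns the number of nodes, or -1 if the limit is exceeded,
 * which in this sketch means the list loops back on itself.
 */
static int count_bounded(const struct cap_node *head, int limit)
{
    int n = 0;

    for (; head != NULL; head = head->next) {
        if (++n > limit)
            return -1;
    }
    return n;
}
```

With a single node added once, `count_bounded(head, 100)` returns 1. If the same node is passed to `add_cap` a second time while it is already the head, its `next` pointer ends up pointing at itself, and `count_bounded` returns -1 because the walk never reaches NULL. An unbounded traversal of such a list would spin forever, which is consistent with a soft lockup, although this sketch does not prove that is the exact path taken in the kernel.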
Don, if you wouldn't mind doing a quick scratch build with that change. Jay seems to have a good sense of patience when dealing with us kernel folk that he might not mind testing it again. ;-) I would rather fix the patch than revert and re-apply. Thanks, Don

brew build kicked w/patch that removes the first pci_add_saved_cap() from the code snippet shown in c#10 (x86_64 build only):
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1842628
I'll ping jturner when the build is done (for x86_64 bare-metal, at least).
- Don (Dutile)

I think that the first pci_add_saved_cap() should stay, and the second one should be removed. The logic being: if a saved_state entry for PCI_CAP_ID_EXP is not found on the list, allocate and initialize a new structure and add it to the list, i.e. the first pci_add_saved_cap(). If an existing PCI_CAP_ID_EXP entry is found, use the found structure. Then save the PCIe register contents to the PCI_CAP_ID_EXP entry. The second call to pci_add_saved_cap() should be removed.

Created attachment 347660 [details]
Proposed patch to remove second call to pci_add_saved_cap
I will test this patch on Power PC if someone else can test on x86_64.
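
Editorial illustration: the logic described for the proposed patch (look up the entry, allocate and link it only if missing, then overwrite its data) can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel code or the attached patch: `saved_cap`, `find_saved_cap`, `list_len`, and `save_pcie_state` are hypothetical names, the list is a plain singly linked list, and the register values are passed in as an array instead of being read from PCI config space.

```c
#include <assert.h>
#include <stdlib.h>

#define CAP_ID_EXP 0x10   /* PCI Express capability ID */
#define NREGS      7      /* seven 16-bit registers, matching the snippet */

/* Hypothetical stand-in for the kernel's saved-capability entry. */
struct saved_cap {
    int cap_nr;
    unsigned short data[NREGS];
    struct saved_cap *next;
};

/* Return the entry for cap_nr, or NULL if none is on the list. */
static struct saved_cap *find_saved_cap(struct saved_cap *head, int cap_nr)
{
    for (; head != NULL; head = head->next) {
        if (head->cap_nr == cap_nr)
            return head;
    }
    return NULL;
}

/* Count entries, giving up with -1 after `limit` nodes (loop suspected). */
static int list_len(const struct saved_cap *head, int limit)
{
    int n = 0;

    for (; head != NULL; head = head->next) {
        if (++n > limit)
            return -1;
    }
    return n;
}

/* Save register values: reuse an existing entry, or allocate and link one once. */
static int save_pcie_state(struct saved_cap **head, const unsigned short regs[NREGS])
{
    struct saved_cap *s;
    int i;

    assert(head != NULL && regs != NULL);

    s = find_saved_cap(*head, CAP_ID_EXP);
    if (s == NULL) {
        s = calloc(1, sizeof(*s));
        if (s == NULL)
            return -1;
        s->cap_nr = CAP_ID_EXP;
        s->next = *head;      /* the only place the entry is linked in */
        *head = s;
    }
    for (i = 0; i < NREGS; i++)
        s->data[i] = regs[i]; /* overwrite in place on every save */
    return 0;
}
```

Calling `save_pcie_state` repeatedly on the same list leaves exactly one CAP_ID_EXP entry, holding the most recently saved values. The entry is linked only in the branch that allocates it, so an existing entry is never added to the list a second time.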

Yes, the patch in #15 is the proper one. Keeping the second one will cause one to add an existing saved_cap to the list again..... very screwy doubly-linked list! It should only be added when kzalloc'd. I've kicked a new brew build with the same patch in #15 (x86_64 only):
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1843241
Again, I'll ask Jay to test once it's done.

------- Comment From rlary.com 2009-06-12 16:47 EDT-------
Built -153 kernel with patch #15 on Power PC, confirmed that EEH recovery worked with the lpfc driver, which is the primary case where Power PC relies upon pci_save/restore_state to preserve PCI state across PCI slot resets during recovery from PCI bus errors. One note: IBM had tested with the original patch 'Resolves issues with pcie-save-restore-state' submitted by IBM and did not see the same issue as seen on x86_64.

Jay, can you grab the x86_64 rpm from #16 & give it a quick spin? If it works, let me know so I can post a patch & get it into 5.4.
- Don (Dutile)

Sorry for the delay . . . had a busy weekend which didn't involve verifying bugs, sadly! Anyway, 2.6.18-153.el5bz505541ddd.x86_64 works beautifully on my VAIO. I did a cursory review of syslog and don't see any nasties there either, so it appears we have something that sticks. Also, from comment 17, it appears that IBM is happy with the patch as well.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

------- Comment From rlary.com 2009-06-16 14:20 EDT-------
Verified patch in -153 kernel with native EEH handler in lpfc driver.

------- Comment From rlary.com 2009-06-16 14:24 EDT-------
Patch submitted to Red Hat

------- Comment From rlary.com 2009-06-16 14:26 EDT-------
Patch in POST state on Red Hat side

-154.el5.x86_64 is working well on my laptop. Will move to Verified once -154.el5 is available in a rel-eng tree.

in kernel-2.6.18-154.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

*** Bug 505736 has been marked as a duplicate of this bug. ***

Moving to Verified. 2.6.18-155.el5 is included in the latest 5.4 snap (20090622.0)

------- Comment From rlary.com 2009-07-01 19:15 EDT-------
Verified by code inspection and correct function during EEH recovery with PCIe devices on Power PC. Closing on IBM side of mirror.

Verified as fixed on Power PC in RHEL5.4 Snapshot 1

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-1243.html