Created attachment 379032
Patch to add lock protection to lpfc_find_target

lpfc_find_target needs to acquire the host lock before it begins iterating the lists, to avoid a potential hang. Patch enclosed. A z-stream request should be made shortly.
Casey - will you be able to test this on behalf of Emulex, or will you require Emulex to test this as well?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Laurie, Can someone at Emulex review the attached patch? Thanks, Rob
Description of Problem:
If multiple hosts connected to an FC switch are rebooted synchronously, the boot sequence stops while loading the lpfc driver.

-------------------------------
ELILO boot: Uncompressing Linux... done
Loading initrd initrd-2.6.9-67.EL.img...done
i8042.c: No controller found.
Red Hat nash version 4.2.1.13 starting
lpfc 0005:0b:01.0: 0:1303 Link Up Event x1 received Data: x1 x0 x10 x0
lpfc 0005:0b:01.1: 1:1303 Link Up Event x1 received Data: x1 x0 x10 x0
(*** stops here ***)
-------------------------------

It seems the lpfc driver is looping forever in the code that scans the list of nodes, because that code is not protected by a spinlock. This problem has only been seen on RHEL 4.6, but the same bug may potentially exist in other versions of the lpfc driver.

Version-Release number of selected component:
Red Hat Enterprise Linux Version Number: RHEL4
Release Number: 4.6
Architecture: ia64
Kernel Version: 2.6.9-67.EL
Related Package Version: lpfc driver v8.0.16.40
Related Middleware / Application: None

Drivers or hardware or architecture dependency:
lpfc driver for RHEL4

How reproducible:
Unclear. Our customer reports about 1 in 10 reboots, although that environment is somewhat special since it is SAN boot and all hosts are rebooted synchronously. In our test environment it was about 1 in 5000 tries.

Steps to Reproduce:
Prepare multiple nodes in a SAN boot environment, and keep rebooting the nodes synchronously until the problem occurs.

Actual Results:
Boot sequence stops while loading the lpfc driver.

Expected Results:
Boot sequence completes normally.

Summary of actions taken to resolve issue:
Reset the system.

Location of diagnostic data:
While scanning the list, listp held a NULL value and the loop never terminated, even though the preceding list_empty() test had passed because listp existed. I guess the contents of listp changed after the list_empty() test passed, since the scan is not protected by a spinlock.
Handling the node list should be protected by spinlocks.

============================
struct lpfc_target *
lpfc_find_target(struct lpfc_hba * phba, uint32_t tgt,
                 struct lpfc_nodelist *nlp)
{
    struct lpfc_target *targetp = NULL;
    int found = 0, i;
    struct list_head *listp;
    struct list_head *node_list[6];
    ...
    if(!nlp) {
        // spin_lock_irqsave(phba->host->host_lock, iflag);  Need to get spinlock
        /* Search over all lists other than fc_nlpunmap_list */
        node_list[0] = &phba->fc_npr_list;
        node_list[1] = &phba->fc_nlpmap_list; /* Skip fc_nlpunmap */
        node_list[2] = &phba->fc_prli_list;
        node_list[3] = &phba->fc_reglogin_list;
        node_list[4] = &phba->fc_adisc_list;
        node_list[5] = &phba->fc_plogi_list;
        for (i=0; i < 6 && !found; i++) {
            listp = node_list[i];
            if (list_empty(listp))
                continue;
            list_for_each_entry(nlp, listp, nlp_listp) {  // loop here
                if (tgt == nlp->nlp_sid) {
                    found = 1;
                    break;
                }
            }
        }
        // spin_unlock_irqrestore(phba->host->host_lock, iflag);  Need to unlock spinlock
============================
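For illustration only, the locking pattern suggested above can be sketched in user space. This is not the attached patch and not kernel code. A pthread mutex stands in for host_lock, a plain singly linked list stands in for the kernel list_head lists, and the names struct node, struct hba, hba_init, hba_add and hba_find are all hypothetical. The point is that the whole multi-list scan runs under one lock, and that writers take the same lock, so the lists cannot change between the emptiness check and the traversal.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LISTS 6

/* Hypothetical stand-in for struct lpfc_nodelist. */
struct node {
    uint32_t sid;          /* plays the role of nlp_sid */
    struct node *next;
};

/* Hypothetical stand-in for struct lpfc_hba. */
struct hba {
    pthread_mutex_t lock;  /* plays the role of host_lock */
    struct node *lists[NUM_LISTS];
};

void hba_init(struct hba *h)
{
    pthread_mutex_init(&h->lock, NULL);
    for (int i = 0; i < NUM_LISTS; i++)
        h->lists[i] = NULL;
}

/* Writers take the same lock as the reader below. */
void hba_add(struct hba *h, int list, struct node *n)
{
    pthread_mutex_lock(&h->lock);
    n->next = h->lists[list];
    h->lists[list] = n;
    pthread_mutex_unlock(&h->lock);
}

/* Scan every list for tgt while holding the lock for the whole search. */
struct node *hba_find(struct hba *h, uint32_t tgt)
{
    struct node *found = NULL;

    pthread_mutex_lock(&h->lock);
    for (int i = 0; i < NUM_LISTS && !found; i++) {
        for (struct node *n = h->lists[i]; n != NULL; n = n->next) {
            if (n->sid == tgt) {
                found = n;
                break;
            }
        }
    }
    pthread_mutex_unlock(&h->lock);
    return found;
}
```

With a node whose sid is 9 added to any of the lists, hba_find(&h, 9) returns a pointer to that node, and a lookup for an sid that was never added returns NULL. The sketch only protects the traversal itself. Whether the returned node stays valid after the lock is dropped is a separate question that it does not address.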
We reviewed the patch, and it looks good. Thank you.
(In reply to comment #4)
> Casey - will you be able to test this on behalf of Emulex, or will you require
> Emulex to test this as well?

Casey,

I need to know the status of testing in order to post this.

Rob
Confirmed patch fixed problem from issue tracker.
@CAI. See Comment #13.
Committed in 89.20.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html