Bug 510665
| Summary: | megaraid sas driver in rhel5.4-beta fails to scan for SAS tape drive (HP Ultrium 4-SCSI) | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Mark Goodwin <mgoodwin> |
| Component: | kernel | Assignee: | Tomas Henzl <thenzl> |
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 5.4 | CC: | andriusb, bo.yang, coughlan, cward, dzickus, hjia, ishida-sxc, jtluka, ltroan, mgoodwin, revers, tao, thenzl |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | 5.4 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2009-09-02 08:37:30 UTC | Type: | --- |
| Bug Blocks: | 475518 | ||
| Attachments: | |||
|
Description
Mark Goodwin
2009-07-10 06:06:27 UTC
The HBA is:

Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI Model: MegaRAID 8708EM2 Rev: 1.40
  Type: Direct-Access ANSI SCSI revision: 05

With the RHEL5.3 -128 kernel, the HBA and tape device show up as:

Host: scsi0 Channel: 00 Id: 07 Lun: 00
  Vendor: HP Model: Ultrium 4-SCSI Rev: U2AN
  Type: Sequential-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: LSI Model: MegaRAID 8708EM2 Rev: 1.40
  Type: Direct-Access ANSI SCSI revision: 05

Apparently LSI driver versions 00.00.04.07 and 00.00.04.09 from the LSI website do not have the issue. It is unclear at present what the difference is between the LSI drivers and what's in the 5.4-beta driver.

Mark,
Is there anything in /var/log/messages indicating a failure when the driver loads and scans the tape drive? Please post the boot log.

Bo,
Do you have any reports of trouble with this driver version configuring tapes? Have you tested this configuration?

Tom

Tom/Mark,
I am downloading the source and testing. I should have the result by tomorrow.
Thanks,
Bo Yang

(In reply to comment #5)
> Mark,
>
> Is there anything in /var/log/messages indicating a failure when the driver
> loads, and scans the tape drive? Please post the boot log.
>
> Tom

Tom & Bo,
I will attach messages snippets from RHEL5.3 (where the tape was correctly discovered and initialized), and from RHEL5.4-beta (where the tape was not found). A few things stand out:

For RHEL5.4, the only reference to the tape device is:

Jul 7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP Ultrium 4-SCSI U2AN

which appears in the log *before* the SCSI subsystem has been initialized. In the RHEL5.4-beta messages, there are no scsi tape (st driver) messages at all. So, perhaps raidsrv in RHEL5.4-beta has "claimed" the device somehow, and the scsi_tape driver doesn't even get to look for it?

It would probably be worth enabling scsi_scan debugging on rhel5.4-beta. That can be done by:

# echo 448 > /sys/module/scsi_mod/parameters/scsi_logging_level

or, perhaps better, by setting this in /etc/modprobe.conf and then doing a full reboot.

Cheers
-- Mark Goodwin

Created attachment 354798 [details]
syslog snippets from RHEL5.3, where tape device is discovered correctly
Created attachment 354799 [details]
syslog snippets from RHEL5.4-beta, where tape device is NOT discovered
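For anyone wondering where the value 448 suggested for scsi_logging_level comes from, here is a small user-space sketch. It assumes the usual layout of the SCSI logging word (3 bits per facility, with the shift values as I recall them from the kernel's scsi_logging.h). The helper names are hypothetical and are not kernel API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Assumption: shift values as in the kernel's drivers/scsi/scsi_logging.h.
 * Each logging facility owns a 3-bit field (level 0..7).
 */
#define SCSI_LOG_ERROR_SHIFT    0
#define SCSI_LOG_TIMEOUT_SHIFT  3
#define SCSI_LOG_SCAN_SHIFT     6
#define SCSI_LOG_FIELD_MASK     0x7u

/* Hypothetical helper: extract the level of one facility. */
static unsigned int scsi_log_field(unsigned int level, unsigned int shift)
{
	return (level >> shift) & SCSI_LOG_FIELD_MASK;
}

/* Hypothetical helper: return 'level' with one facility set to 'value'. */
static unsigned int scsi_log_set(unsigned int level, unsigned int shift,
				 unsigned int value)
{
	assert(value <= SCSI_LOG_FIELD_MASK);
	level &= ~(SCSI_LOG_FIELD_MASK << shift);
	return level | (value << shift);
}

/* Hypothetical helper: print the fields relevant to this bug. */
static void scsi_log_print(unsigned int level)
{
	printf("error=%u timeout=%u scan=%u\n",
	       scsi_log_field(level, SCSI_LOG_ERROR_SHIFT),
	       scsi_log_field(level, SCSI_LOG_TIMEOUT_SHIFT),
	       scsi_log_field(level, SCSI_LOG_SCAN_SHIFT));
}
```

Under that assumption, scsi_log_set(0, SCSI_LOG_SCAN_SHIFT, 7) gives 448, i.e. maximum verbosity for the scan facility only and nothing else enabled.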
Please note we are waiting for a response from NEC engineering - the issue in this BZ *might* be due to the NEC raidsrv daemon, where we see:

Jul 7 20:03:58 localhost raidsrv[4068]: [PD:3(ID=7 SLT=8)] HP Ultrium 4-SCSI U2AN

in syslog on the RHEL5.4-beta system, *before* scsi_mod and the megaraid driver have even loaded. We need NEC to disable raidsrv and then reboot RHEL5.4-beta to see if this resolves the issue. Note that raidsrv is not part of RHEL5.4.

Given the above, it's plausible that the LSI megaraid driver in RHEL5.4-beta has not actually regressed. We need to confirm this fairly urgently.

Thanks
-- Mark Goodwin

NEC disabled the raidsrv daemon on their RHEL5.4-beta test box and this made no difference. So much for that theory ... I have asked for scsi_scan debugging to be enabled and for the logs to be sent.
-- Mark

I am testing the tape drive. I will give an update soon.

Created attachment 355325 [details]
This patch is supposed to fix the tape drive issue reported by NEC.

This is a copy of comment: https://bugzilla.redhat.com/show_bug.cgi?id=475574#c52

That BZ is for the general driver update. That code is checked in. The BZ should stay in its current state. This BZ is for the problem scanning tape drives seen with the driver update.

A test kernel with Bo's patch was brewed yesterday and made available to the customer: http://people.redhat.com/~mgoodwin/BZ510665
This is going on over in the IT ticket. I'll update here as soon as we have test results, hopefully today sometime.

Thanks
-- Mark Goodwin

Tom, the customer has verified that the test kernel with Bo Yang's patch fixes the issue reported in this BZ, i.e. the tape is now correctly discovered. They are now running additional tests to verify nothing else has regressed.

So, as far as managing this BZ goes, it's over to engineering, PM and QA to decide the fate of the fix for RHEL5.4. I have set CustomerVerified. I'm going to set the IT to WoENG now.

Thanks
-- Mark Goodwin

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Bo, thanks for the patch. We are now very late in the 5.4 process and this patch is rather big, and comes with no explanation. It would be good to understand why each part is needed and where the problem was. Could you provide some comments for us? I know it is painful work for you, but we do not have the time to test this patch for further regressions, so at least a good understanding is necessary.
Thanks, Tomas

Tomas,
When I went through the source, I saw some fixes for issues that we had made but that had not been included in the source. It is important to add them. Here are the details; please let me know if you need more info.
1. Need to add the spin lock when firing the cmd to FW, to fix the potential system hang issue.
- writel((frame_phys_addr | (frame_count<<1))|1,
+ unsigned long flags;
+ spin_lock_irqsave(&instance->fire_lock, flags);
+ writel((frame_phys_addr | (frame_count<<1))|1,
&(regs)->inbound_queue_port);
+ spin_unlock_irqrestore(&instance->fire_lock, flags);
2. System PDs are of the DISK type.
- if (sdev->channel < MEGASAS_MAX_PD_CHANNELS) {
+ if (sdev->channel < MEGASAS_MAX_PD_CHANNELS && sdev->type == TYPE_DISK) {
3. If the command is an abort of the event (AEN) registration, the driver must not register for events again.
-
- if (instance->unload == 0) {
+ if ((!cmd->abort_aen) && (instance->unload == 0 )) {
4. For SAS2 controllers, unmap_sgbuf is different from other controllers.
+ if ((instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0073SKINNY) ||
+ (instance->pdev->device == PCI_DEVICE_ID_LSI_SAS0071SKINNY)) {
+ buf_h = cmd->frame->io.sgl.sge_skinny[0].phys_addr;
+
+ } else if (IS_DMA64)
5. Take off the extra printout.
- printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
- instance->host->host_no, instance->evt_detail->code);
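To make hunks 1 and 2 easier to review, here is a user-space sketch of what they appear to do. Everything in it is a model, not driver code. The value of MEGASAS_MAX_PD_CHANNELS is assumed, and the "not a system PD returns -ENXIO" behaviour is my reading of the surrounding driver code rather than something shown in the hunk. A C11 atomic_flag spin loop stands in for spin_lock_irqsave(), which cannot be reproduced outside the kernel, and the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI peripheral device type codes. */
#define TYPE_DISK 0x00
#define TYPE_TAPE 0x01

/* Assumed value; the real definition lives in the driver header. */
#define MEGASAS_MAX_PD_CHANNELS 2

/*
 * Model of hunk 2. Only disks on a physical-device channel are checked
 * against the system-PD list; a tape on such a channel is passed through.
 * Returns 0 if the device is exposed, -ENXIO if it is hidden.
 */
static int model_slave_configure(unsigned int channel, unsigned int type,
				 bool is_system_pd)
{
	if (channel < MEGASAS_MAX_PD_CHANNELS && type == TYPE_DISK)
		return is_system_pd ? 0 : -ENXIO;
	return 0;
}

/* Stand-ins for instance->fire_lock and the inbound queue port register. */
static atomic_flag fire_lock = ATOMIC_FLAG_INIT;
static volatile uint32_t inbound_queue_port;

/* Doorbell word as composed in hunk 1. */
static uint32_t doorbell_value(uint32_t frame_phys_addr, uint32_t frame_count)
{
	return (frame_phys_addr | (frame_count << 1)) | 1;
}

/* Model of hunk 1: the register write happens with the lock held. */
static void model_fire_cmd(uint32_t frame_phys_addr, uint32_t frame_count)
{
	while (atomic_flag_test_and_set_explicit(&fire_lock,
						 memory_order_acquire))
		; /* spin until the lock is free */
	inbound_queue_port = doorbell_value(frame_phys_addr, frame_count);
	atomic_flag_clear_explicit(&fire_lock, memory_order_release);
}
```

In this model a tape on channel 0 (as in the RHEL5.3 listing, Channel: 00 Id: 07) is exposed regardless of the system-PD state. Without the TYPE_DISK test it would take the -ENXIO path, which would match the symptom reported here.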
(In reply to comment #18)
> Tomas,
>
> When I go through the src, I saw some of the src which we fixed the issues, but
> didn't included to the src. It is important to add them in:

Bo,
Please be aware that, if we take a patch for megaraid sas, it will be going in the 5.4 release candidate. This is the last build. There will be no chance to fix any mistakes. No field test. Normally, we would only take patches for critical show stoppers (like the regression reported here, where tapes are not configured). The other bugs addressed by the patch have not been reported to us during beta test. I expect that you have a better view of the risk/benefit for these, but we have not seen them. We just need to be sure you understand the situation we are in with the 5.4 end-game. Are the other changes included in the patch really worth the risk at this point?

Tom

(In reply to comment #15)
> They are now running additional tests to verify nothing else has regressed.

Mark, are these tests finished now without problems?
Thanks, Tomas

(In reply to comment #18)
> Tomas,
>
> When I go through the src, I saw some of the src which we fixed the issues, but
> didn't included to the src. It is important to add them in: Here are the
> details: Please let me know, or if you need more info?
>

Thanks Bo, does your comment mean that hunks 1-5 are unrelated to the tape driver issue? We have to take the whole patch or nothing because of the testing done by the customer. I don't see problems in the patch itself; it is only that we can't test it enough (comment #19).

> 1. Need add the spin lock when fire the cmd to FW to fix the potenial system
> hang issue.

Interesting that this hasn't caused problems without the lock.

Bo, I forgot to ask you before - have you already posted this upstream? If not, please do this soon. This and some signs of upstream acceptance are also of interest.
Thanks, Tomas

Tomas,
I will post them, possibly early next week, after I clean up all my current work. You should be copied if I post them.
Thanks,
Bo Yang

Bo, please also explain this part of your patch. I was asked about it on our internal list. You did not explain it in comment #18. Thanks.

@@ -3604,6 +3614,7 @@ megasas_mgmt_fw_ioctl(struct megasas_ins
  */
 memcpy(cmd->frame, ioc->frame.raw, 2 * MEGAMFI_FRAME_SIZE);
 cmd->frame->hdr.context = cmd->index;
+ cmd->frame->hdr.pad_0 = 0;
 /*
  * The management interface between applications and the fw uses
@@ -4034,19 +4045,11 @@ megasas_aen_polling(void *arg)
 }
 if (instance->evt_detail) {
- printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
- instance->host->host_no, instance->evt_detail->code);
 switch (instance->evt_detail->code) {
- case MR_EVT_LD_CREATED:
 case MR_EVT_PD_INSERTED:
- case MR_EVT_LD_DELETED:
- case MR_EVT_LD_OFFLINE:

Tomas,
1. The driver needs to clear the pad_0 field, because some of the application cmds don't clear it (FW will take a long time to process the frame if it is not set to 0).
+ cmd->frame->hdr.pad_0 = 0;
2. For the LD case, the driver doesn't need to scan the devices because our megaraid sas application already does.
- case MR_EVT_LD_CREATED:
 case MR_EVT_PD_INSERTED:
- case MR_EVT_LD_DELETED:
- case MR_EVT_LD_OFFLINE:
3. Take off the printout; the application will have all of that information.
- printk(KERN_INFO "%s[%d]: event code 0x%04x\n", __FUNCTION__,
- instance->host->host_no, instance->evt_detail->code);

in kernel-2.6.18-162.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.

*** Bug 506510 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-1243.html |