Bug 488471
Description
Tuomo Soini
2009-03-04 14:52:16 UTC
http://lkml.org/lkml/2008/12/22/12 Upstream patch is broken. So this is regression in 5.3 caused by broken upstream patch which got included. Upstream bug: http://bugzilla.kernel.org/show_bug.cgi?id=12266 Hi, I am building a test kernel to look more closely at ahci_sw_activity_blink which is the function that puts together the led messages to transmit to the controller. Looking at the ahci spec, it would seem the ahci driver is setting Bits 8:6 when setting the activity LED. The LED bit locations are: Bits 2:0 – Activity LED (may be driven by hardware) Bits 5:3 – Vendor Specific LED (e.g. locate) Bits 8:6 – Vendor Specific LED (e.g. fault) Bits 15:9 – Reserved I will post a link to download the test kernel soon. I think I didn't say it clearly enough but fault led doesn't burn for sda. It only burns for sdb, sdc, sdd. That might be wy Intel verified patch and told leds work ok. If they only had one disk on their test system it did look like working ok. Would you please boot kernel-2.6.18-131.el5.bz488471.1 and attach the output of "dmesg -c"? I have added quite a bit of debug to ahci_sw_activity_blink, so you may want to boot with the "log_buf_len=1000000" kernel parameter or hook up a serial console, so we can capture all the messages from boot especially when the ahci driver initializes and see how it it getting into this state. Please let me know if the debug kernel generates too much output and I will tone it down. Thank you. http://people.redhat.com/dmilburn/ I'm running xen on that machine. Can you just provide patch to try? Created attachment 334318 [details]
Patch to debug ahci LED problem
Sure, here you go, this patch adds debug to the blinking function, if you can capture the output after the driver loads we should be able to see the led message that the driver is sending down to the controller. Created attachment 334351 [details]
ata/AHCI related lines from dmesg
em_ctl 117506304 = 7010100, which means the Transmit bit is set, so the driver thinks the controller is still busy transmitting a previous message, if you look a little farther where there any messages from "AHCI_SW_ACTIVITY_BLINK"? Actually, it looks like bit 26 is set which means the driver doesn't use ahci_sw_activity_blink, let me look over the driver some more. No other AHCI related lines on dmesg. Symptoms are still same. sda is ok, sdb to sdd have error led blinking. Created attachment 334355 [details]
Patch #2 to debug ahci LED problem
I have added a couple of printks to ahci_start_port(), please re-test and see if they show up in dmesg. Created attachment 334415 [details]
ata/AHCI related lines from dmesg using patch #2
Created attachment 334694 [details]
Patch #3 to debug ahci LED problem
According to the dmesg output in Comment#14, after the first message is transmitted it appears that Bit 8 (Transmit Message) of the Enclosure Management Control register never clears. Once a message has been sent, the HBA should be clearing this bit, but we are seeing subsequent transmits finding this bit set so ahci_transmit_led_message is returning -EINVAL. I have added a few more printks to determine if the driver may not be putting the LED message in the real transmit buffer. Please retest and attach your dmesg output, thanks. Created attachment 334785 [details]
ata/AHCI related lines from dmesg using patch #3
I thought the driver maybe writing the enclosure messages outside the enclosure management buffer, but that doesn't seem to be the case ABAR[EM_LOC.OFST*4] buffer starts here ABAR[(EM_LOC.OFST*4) + EM_LOC.SIZE] buffer ends here Driver starts up AHCI_INIT_ONE: em_ctl 0x7010000 em_loc 0x1000002 em_loc was read from global control register offset 0x1c, bits 15:0 specifies the size of the transmit message buffer area in dwords, in this case 2 dwords. Bits 31:16 specifies the offset of the message buffer in dwords from the beginning of ABAR, in this case 0x100. So, a message really consists of 2 dwords, the first one is the message header, the driver sets bits 15:8 of the message header to indicate the size of the following message which is always 4 bytes. The driver seems to be writing to the correct location of the enclosure buffer /* write message to EM_LOC */ writel(message[0], mmio + hpriv->em_loc); writel(message[1], mmio + hpriv->em_loc+4); Maybe the writes of the tranmit are getting out-of-order, I haved added a readl after the above code to flush those writes, I will attach another patch. Created attachment 334854 [details]
Patch #4 to debug ahci LED problem
Same as patch #3 except added a readl in ahci_transmit_led_message.
Would you please test the patch in Comment#19, and attach the output of "lspci -xxvvv"? Created attachment 334856 [details]
lspci -xxvvv output from ASUS RS160-E5 server
I'll test new patch asap. Btw. At beginning of boot lights are ok. But when text Red Hat nash is shown on console, sdb to sdd lights change to drive error state. I did testing with patch4. There were only one line change to previous ahci messages. -ata3: SATA max UDMA/133 irq_stat 0x00400040, connection status changed +ata3: SATA max UDMA/133 irq 20, connection status changed lspci -xxvvv output didn't change. Did the drive error LEDs still blink for sdb to sdd with patch4? Yes, drive error LEDs are still blinking with patch4. Created attachment 334924 [details]
Patch to disable Enclosure Management messaging for ICH9r
I think all the upstream reports of this problem are using ICH9r, please test this patch and make sure your LEDs function normally, and please post the dmesg output showing the message stating that messaging has been disabled. I will ask Intel if they are aware of any problems with this specfic controller. Thanks for all your testing. Created attachment 335014 [details]
ata/AHCI related lines from dmesg using ich9r work-around patch
And yes, drive leds seem to work normally with patch. Created attachment 335554 [details]
Patch #5 to debug ahci LED problem
Created attachment 335555 [details]
Patch to disable Enclosure Management messaging for ICH9r - AHCI mode only
Would you please try patch #5 and supply the dmesg output? Also if after applying patch #5 (Comment #30) you end up in the same state with drive error LEDs blinking, would you see if you can manually turn them off? # echo 0 > /sys/class/scsi_host/host1/em_message # echo 0 > /sys/class/scsi_host/host2/em_message # echo 0 > /sys/class/scsi_host/host3/em_message # echo 0 > /sys/class/scsi_host/host4/em_message # echo 0 > /sys/class/scsi_host/host5/em_message Finally, please test with the patch in Comment #31, it should only disable enclosure management if ICH9r is in AHCI mode. We will not be able to disable it in all cases, it has been tested in RAID mode without problems with enclosure management LEDs. Also, please supply dmesg output when testing with patch in Comment #31, I expect that your LEDs will behave appropriately, but I would also like to see the message where enclosure management is disabled while configured in AHCI mode. Will test. But previously you couldn't turn off because /sys/class/scsi_host/*/em_message were read only. Did you fix them to have correct mode so setting can be changed? No, actually I didn't make any change for managing LEDs through /sys, I am sure that we have been able use this method in previous releases. Though, the error that you see when using the /sys method maybe a side-effect of the problem, when changing values the driver should be using ahci_led_store which calls ahci_transmit_led_message which I expect is failing in this case too, the dmesg output will tell us for sure. Actually, why don't you try the /sys method with both patches (Comment #30 and Comment #31). On second thought you can skip the /sys test after applying the patch in Comment #31 since we are disabling EMS for the ICH9r, I will see if I can try it on a configuration known to work. Created attachment 335685 [details]
ata/AHCI related lines from dmesg using patch #5
I couldn't disable em_messages when drives were blinking error, because they all were already disabled. But when I enabled em_messages, drive status led went to correct state.
You can see relevant debug messages when I enabled em messages.
Only strange thing is that even when sda had em_messages set to 0 it's error state led didn't show error while all other devices were in error state.
I guess work-around patch is wrong approach.
When you say you enabled em_messages, are you saying that you echo'd a value into /sys/class/scsi_host/hostX/em_message em_message is a 32-bit value and the LED state is defined by bits 16-24 bits 24-22 Fault LED bits 21-19 Locate LED bits 18-16 Activity LED And withing those 3-bit ranges, 000b turns off and 001b turns on For instance, # echo 4194304 > /sys/class/scsi_host/hostX/em_message That should turn on the Fault LED and echo'ing 0 should turn all LEDs off, so enclosure management is enabled in patch #5, but the em_message /sys attribute is a message value not enable/disable. So could you please reboot with patch #5, at this point drive #2 should be blinking the Fault LED, and then just echo 0 to the corresponding em_message and lets see if the error LED goes off and see what ahci_transmit_message says in dmesg. Also see if you can echo 524288 to the same scsi_host, this should turn on the Locate LED, and see what ahci_transmit_message. Thanks for testing. echo 0 > /sys/class/scsi_host/host1/em_message Turns error led off and gives normal state. AHCI_TRANSMIT_LED_MESSAGE: state 0x0 AHCI_TRANSMIT_LED_MESSAGE: pmp 0x0 AHCI_TRANSMIT_LED_MESSAGE: em_ctl 0x7010000 AHCI_TRANSMIT_LED_MESSAGE: mmio 0x2e800 AHCI_TRANSMIT_LED_MESSAGE: hpriv->em_loc 0x400 AHCI_TRANSMIT_LED_MESSAGE: mmio + hpriv->em_loc+4 0x2ec04 AHCI_TRANSMIT_LED_MESSAGE: mmio + (hpriv->em_loc+4) 0x2ec04 echo 524288 > /sys/class/scsi_host/host1/em_message Makes error led blink twice the error speed so I think it works. AHCI_TRANSMIT_LED_MESSAGE: state 0x80000 AHCI_TRANSMIT_LED_MESSAGE: pmp 0x0 AHCI_TRANSMIT_LED_MESSAGE: em_ctl 0x7010000 AHCI_TRANSMIT_LED_MESSAGE: mmio 0x2e800 AHCI_TRANSMIT_LED_MESSAGE: hpriv->em_loc 0x400 AHCI_TRANSMIT_LED_MESSAGE: mmio + hpriv->em_loc+4 0x2ec04 AHCI_TRANSMIT_LED_MESSAGE: mmio + (hpriv->em_loc+4) 0x2ec04 echo 4194304 > /sys/class/scsi_host/host1/em_message Leaves red led burning and other leds off. AHCI_TRANSMIT_LED_MESSAGE: state 0x400000 AHCI_TRANSMIT_LED_MESSAGE: pmp 0x0 AHCI_TRANSMIT_LED_MESSAGE: em_ctl 0x7010000 AHCI_TRANSMIT_LED_MESSAGE: mmio 0x2e800 AHCI_TRANSMIT_LED_MESSAGE: hpriv->em_loc 0x400 AHCI_TRANSMIT_LED_MESSAGE: mmio + hpriv->em_loc+4 0x2ec04 AHCI_TRANSMIT_LED_MESSAGE: mmio + (hpriv->em_loc+4) 0x2ec04 But I think you find this interesting: cd /sys/class/scsi_host/host0/ cat em_message 0 echo 0 >em_message cat em_message 0 cd ../host1 echo 0 >em_message cat em_message 1 cd ../host2 cat em_message 0 echo 0 >em_message 2 cd ../host3 cat em_message 0 (Error led is blinking) echo 0 >em_message (Error led turns off) cat em_message 3 Initially all em_message's are 0 Created attachment 335775 [details]
Patch #6 to debug ahci LED problem
I have added an ahci_led_init function to turn off LEDs near the end of the driver loading, this should be the same as "echo'ing 0". Please apply this patch, boot and see if the error LED is off for drives 2 - 4. Let's come back to the potential "cat em_message" problem, it maybe that is common to all the controllers not just ich9r, we can investigate after we solve this issue. Created attachment 335831 [details]
ata/AHCI related lines from dmesg using patch #6
No change in behaviour. sdb-sdd were still in error state.
Would you please attach the dmesg for the last boot up with patch #6? Created attachment 335850 [details]
dmesg from boot with patch #6
Created attachment 335883 [details]
Patch #7 to debug ahci LED problem
This patch adds some delays to ahci_transmit_led_message if it finds the controller has not cleared the transmit bit, I expect this to work since we know that it eventually clears since you can change the led values thru /sys. This patch will also tells us in the dmesg output how long it took to clear. Also, I think I see why "cat ...em_message" is giving those values, ahci_transmit_led_message sets emp->led_state = message[1]; So, now emp->led_state includes the port no, then ahci_led_show reads emp->led_state, so instead of cat'ing and seeing zero, you see (zero | port_no). I have changed the above to emp->led_state = state; Please boot up, hopefully drive error LEDs will be off for drives 2 - 4, but, I also need to see the dmesg output so I can see what kind of delay is needed for the transmit bit to clear. Thanks again for all your testing. Created attachment 335931 [details]
dmesg from boot with patch #7
And with patch#7 leds finally are ok. But it felt like boot took longer.
And verified, echo 0 > em_message doesn't change it's value any more. Now it's always 0 after this.
Ok, thanks, I am sure the delay and printks are slowing boot up. Let me see if I can ask Intel about the delay allowing the transmit bit to clear. Created attachment 336405 [details]
Patch to delay to wait for transmit bit to clear for ICH9r - AHCI mode only
It turns out that the BIOS is resetting the enclosure management message logic, this is the reason the transmit bit is busy preventing the driver from turning the LEDs off. The patch in Comment #49 is a work-around for the ahci.c driver, would please verify that your system boots with the LEDs in the proper state, please note this doesn't have the fix for cat'ing the /sys/....em_message values, I have to treat them as seperate patches since they are different issues. For this test I have added a printk to show how many delays were needed, would you please include that from the dmesg output? I intended for the udelay(10) to actually be udelay(100) in ahci_start_port(), please let me know if you need an updated patch; otherwise, please make the change after applying the patch in Comment #49. Thanks. Created attachment 336535 [details]
Dmesg using ich9r workaround patch
Tested patch was modified with s/10/100/ change.
Leds worked correctly with this patch. Hi i grabbed patch 7 from the Bugzilla and this seems to fix the issue. The addition of the new function ahci_led_init and the call to it from the port init fuction seem to clear up the issue for us. We run on ICH10 and had the same issue with port 0 LEds OK and port 1-3 FAULT and LOCATE LEDS on. We were installing sgpio package and using it to turn the lights off. [root@ban25c13bl8 ~]# yum install sgpio [root@ban25c13bl8 ~]# sgpio -p 1 -s off [root@ban25c13bl8 ~]# sgpio -p 2 -s off [root@ban25c13bl8 ~]# sgpio -p 3 -s off I have added myself to CC list. Please let me know if I can be of help with debug on ICH10 controller. Gregory, Would please attach your "lspci -xxvvv"? I need to add your device ID to the official patch which will be more like Comment #49? Thanks. Created attachment 336790 [details]
ICH10 SATA lspci -xxvvv
Hi Unreleased system and this is an open bugzilla. For more info on my particular config please see private issue tracker https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=273228 There you can get complete lspci output and anything else you require. I tried patch from comment 49 and it did not work for me. Let me re-patch and reboot and report back just to be sure, since you say the official patch will look like that. R u dropping the ahci_led_init function and it's call? Created attachment 336802 [details]
Patch to delay to wait for transmit bit to clear for ICH9r and ICH10 - AHCI mode only
Hi, the patch in Comment #49 was a work-around for the ICH9r in ahci mode, I have added your device ID to ahci_ems_bios_workaround(), the patch in Comment #64 should work on your platform. I still have 1 printk to see in ahci_start_port to see how many retries it takes for the ems transmit bit to clear, the driver is already trying to turn off the LEDs so we don't need ahci_led_init. Please attach your dmesg output after applying the patch and rebooting and hopefully your LEDs will be in a good state. Created attachment 336882 [details]
dmesg from patch 8
Hi, I think you must still be using older code, I don't have any of the printks in the patch in Comment #64, would you please double-check? Created attachment 336921 [details]
Real dmesg for patch 8
ooops> :-(
Here is the real deal.
Thanks, so the LEDs were ok when you booted, right? Yes the LEDs were OK when booted. Thanks Tested on RHEL 5.3 x86_64 on a ICH10R platform. Tested Patch in Comment #64. Patch applied and complied no problems. Upon reboot the LED's are off. SGPIO application can still controll all LED's on all ports. Can we make this patch not Specific to ICH9R and ICH10R? Our next platform will have the same issues, I believe.? (testing now) Thanks for the fix! Tested on RHEL 5.3 x86_64 on a ICH10R platform. Applied RPM from the David Milburn's site. This RPM applied correctly and on next reboot LED are in the off state. I am able to controll the LED's with SGPIO application in the OS. This is a very nice RPM that creates another boot menu option! Thanks again! *EDC* Created attachment 337538 [details]
Patch to delay to wait for transmit bit to clear when in AHCI mode.
Ed, thanks for testing.
I have modified the patch to look for device IDs greater than 0x2821 since
ICH8 supports enclosure management, this should should handle all cases
when the controller is configured in AHCI mode.
I am currently building test rpms, I will make them available as soon as they
are ready, we will need to test one more time.
Ed, Gregory, If someone could please test kernel-2.6.18-137.el5.bz488471.2 http://people.redhat.com/dmilburn/ Once we have confirmed test results I will send the patch upstream. Thanks. Hi David I have just booted your kernel and the Drive LED status is good. No Fault or Locate LEDs were lit. Looks good. Thanks David, I was just perusing the patch you proposed and, as Greg mentioned, it works for us. I just wanted to make a couple of comments 1. In the following code section + pci_read_config_byte(pdev, SCC_REG, &tmp); + + if (tmp & AHCI_MODE) + return 1; + else + return 0; + } + Should the test be more strict? According to the Serial ATA AHCI 1.3 spec (http://download.intel.com/technology/serialata/pdf/rev1_3.pdf), page 8 (section 2.1.5), it says that you should also check the PI byte (offset 0xb) and check if that is 1 to be sure you are in AHCI mode. i.e pci_read_config_byte(pdev, SCC_REG, &tmp); pci_read_config_byte(pdev, PI_REG, &tmp2); if ((tmp & AHCI_MODE) && (tmp2 & 0x1)) return 1; else return 0; 2. Since you have defined SCC_REG, should you use that in ahci_print_info for the following line in that function? pci_read_config_word(pdev, 0x0a, &cc); Use SCC_REG instead of 0x0a. thanks James James, I think we may change the patch to not use EM when the controller is configured in AHCI mode, I will look over the document again and see if we need the stricter check. Thanks, David Created attachment 337951 [details]
Patch to delay to wait for transmit bit to clear regardless of device id.
Ed, I discussed furthur with Kristen, this patch will be much simplier not keying off any particular Device ID, and in most cases the transmit will succeed first time. I am currently building test rpms with the patch in Comment #80, feel free to test the patch in the mean time. James, I looked over the doc and the ICH9 docs, actually the PI (Programming Interface Register) would be at offset = 0x9. And you are right it is programmed differently based upon what value is in the Sub Class Code Reg (0xa). And if the Sub Class Code Reg is 0x6 then the PI register would be 1 to indicate the AHCI revision. I think checking the SCC_REG would have been sufficient, but, hopefully the simplified patch in Comment #80 will be accepted upstream. Thanks, David Ed, Gregory, Please test kernel-2.6.18-137.el5.bz488471.3, once I get positive confirmation from both of you I will submit the patch in Comment #80 upstream. http://people.redhat.com/dmilburn/ Thanks, David Tested on RHEL 5.3 x86_64 on a ICH10R platform. Applied RPM kernel-2.6.18-137.el5.bz488471.3 from the David Milburn's site. This RPM applied correctly and on next reboot LED are in the off state. I am able to controll the LED's with SGPIO application in the OS. I believe this patch/RPM is Good to go upstream! Thank you for all the work on this! I think not keying off any particular Device ID is the perfect solution. Thanks again! *EDC* I will continue to regression test platforms with different Chipsets (ICH8R,9R,10R, ibexpeak) and in RAID mode too. This will take a week or 2 as resources are slim right now. Also, Verified ICH10R platform in RAID mode works correctly. All LED are in the off state on OS boot and can be controller from with in the OS via FS and SGPIO application. *EDC* Hello I have verified your latest test kernel. Everything looks good on our system. Thanks This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. in kernel-2.6.18-140.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified. Intel, Sun, Can you please test this on one of our 5.4 builds? http://people.redhat.com/dzickus/el5/ ~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative. (In reply to comment #93) > Intel, Sun, > > Can you please test this on one of our 5.4 builds? > > http://people.redhat.com/dzickus/el5/ I have tested 5.4 beta . It's all good. Verified fixed. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html |