Bug 521931 - Kernel panic - 2.6.30.5 - iwlagn - PRO/Wireless 5300 AGN
Summary: Kernel panic - 2.6.30.5 - iwlagn - PRO/Wireless 5300 AGN
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: x86_64
OS: Linux
low
medium
Target Milestone: ---
Assignee: Stanislaw Gruszka
QA Contact: Fedora Extras Quality Assurance
URL: http://bugzilla.intellinuxwireless.or...
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-09-08 18:53 UTC by Mats Ljunggren
Modified: 2010-03-09 07:14 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-09 07:14:10 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Kernel oops log (5.13 KB, text/plain)
2009-09-08 18:54 UTC, Mats Ljunggren
no flags Details
dmesg output with debug/debug50=0x40000000 (123.02 KB, text/plain)
2009-09-17 19:26 UTC, Mats Ljunggren
no flags Details
dmesg output with debug/debug50=0x41800fff (123.15 KB, text/plain)
2009-09-17 19:27 UTC, Mats Ljunggren
no flags Details
Test patch - bug workaround (558 bytes, application/octet-stream)
2009-09-21 12:55 UTC, Stanislaw Gruszka
no flags Details
use RTS/CTS for HT (2.15 KB, patch)
2009-10-01 18:31 UTC, reinette chatre
no flags Details | Diff
dmesg output with debug/debug50=0x45900FFF (122.93 KB, text/plain)
2009-10-02 18:40 UTC, Mats Ljunggren
no flags Details

Description Mats Ljunggren 2009-09-08 18:53:04 UTC
Description of problem:
System crashes after just a minute.

Version-Release number of selected component (if applicable):
kernel-2.6.30.5-43.fc11.x86_64

How reproducible:
Reboot the system and it will happen again.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
System crashes

Expected results:
No crash

Additional info:
I basically got the same result when I tested compat-wireless-2.6.30/2.6.31 on 2.6.29 kernel.

Comment 1 Mats Ljunggren 2009-09-08 18:54:27 UTC
Created attachment 360118 [details]
Kernel oops log

Comment 2 Chuck Ebbert 2009-09-09 05:29:58 UTC
BUG: unable to handle kernel NULL pointer dereference at 0000000000000016
IP: [<ffffffffa0251a38>] iwl5000_rx_reply_tx+0x50d/0xb62 [iwlagn]

Comment 3 Stanislaw Gruszka 2009-09-12 09:31:44 UTC
(gdb) l *(iwl5000_rx_reply_tx+0x50d)
0x14a5c is in iwl5000_tx_status_reply_tx (drivers/net/wireless/iwlwifi/iwl-5000.c:1200).
1195				IWL_DEBUG_TX_REPLY(priv, "FrameCnt = %d, txq_id=%d idx=%d\n",
1196						   agg->frame_count, txq_id, idx);
1197	
1198				hdr = iwl_tx_queue_get_hdr(priv, txq_id, idx);
1199	
1200				sc = le16_to_cpu(hdr->seq_ctrl);
1201				if (idx != (SEQ_TO_SN(sc) & 0xff)) {
1202					IWL_ERR(priv,
1203						"BUG_ON idx doesn't match seq control"
1204						" idx=%d, seq_idx=%d, seq=%d\n",
(gdb)

Comment 4 Stanislaw Gruszka 2009-09-14 10:37:12 UTC
Mats, you catch kernel panic dmesg using kdump, right? Could you provide compressed memory dump somewhere to download or send it to me by email (if file is not so big).

Could you also provide two dmesg outputs when kernel panic with these iwlagn module parameters:

$ modprobe iwlagn debug=0x40000000 

$ modprobe iwlagn debug=0x41800fff

Comment 5 Stanislaw Gruszka 2009-09-14 10:55:03 UTC
(In reply to comment #4)
> Could you also provide two dmesg outputs when kernel panic with these iwlagn
> module parameters:
> 
> $ modprobe iwlagn debug=0x40000000 
> 
> $ modprobe iwlagn debug=0x41800fff  

For 50xx adapters "debug50" option should be used instead of "debug". It should be safe to add both like in example:

$ modprobe iwlagn debug=0x40000000 debug50=0x40000000

Comment 6 Mats Ljunggren 2009-09-15 19:12:57 UTC
Actually, I didn't use kdump. I just copy and pasted the output I got on the serial console. I don't know if the console output with the debug/debug50 set to the suggested values will be of any use to you? The generated core dumps are quite big, 5.7G uncompressed, 0.36G gzip compressed and with makedumpfile -c 1.23G. I saw that makedumpfile has other options to make dump files even smaller. I'm not that familiar with the options so any suggestion is appreciated. My understanding is that I'm supposed to use the crash command to get some information from the core dump files, e.g. kernel panic dmesg. And in order to do that I need kernel-debuginfo installed. The only kernel-debuginfo I could install with yum was for kernel 2.6.29.4-167. I guess I need a matching kernel-debuginfo to the kernel? I don't know the best way to get the core dump files to you. They are slightly too big to be emailed.

Comment 7 Stanislaw Gruszka 2009-09-16 14:40:37 UTC
(In reply to comment #6)
> Actually, I didn't use kdump. I just copy and pasted the output I got on the
> serial console. I don't know if the console output with the debug/debug50 set
> to the suggested values will be of any use to you? 

It depends, but definitely these verbose debug messages can be useful.

> The generated core dumps are
> quite big, 5.7G uncompressed, 0.36G gzip compressed and with makedumpfile -c
> 1.23G. I saw that makedumpfile has other options to make dump files even
> smaller. I'm not that familiar with the options so any suggestion is
> appreciated. 

I think using makedumpfile -c -d 31 is ok, as all pages (zero, cache, cache private, user data and free) seems to be unneeded for debugging the issue.

From other site you can boot kernel with mem= parameter to limit memory, for example: mem=2G , hence dump file will be smaller. However you can not be able to reproduce the bug with limited memory. If so, this give us some additional info about bug (suspect to some upper memory dma mappings).

> My understanding is that I'm supposed to use the crash command to
> get some information from the core dump files, e.g. kernel panic dmesg. And in
> order to do that I need kernel-debuginfo installed. The only kernel-debuginfo I
> could install with yum was for kernel 2.6.29.4-167. I guess I need a matching
> kernel-debuginfo to the kernel?

Below command should enable proper repository and install uptodate kernel: 

$ debuginfo-install kernel

If not you can download rpms here:
http://mirror.hmc.edu/fedora/linux/updates/11/x86_64/debug/kernel-debuginfo-2.6.30.5-43.fc11.x86_64.rpm
http://mirror.hmc.edu/fedora/linux/updates/11/x86_64/debug/kernel-debuginfo-common-x86_64-2.6.30.5-43.fc11.x86_64.rpm

> I don't know the best way to get the core dump
> files to you. They are slightly too big to be emailed.  

I hope with makedumpfile -d 31 or with mem= parameter you will be able to generate small enough image for email.

Comment 8 Mats Ljunggren 2009-09-17 19:26:24 UTC
Created attachment 361551 [details]
dmesg output with debug/debug50=0x40000000

Comment 9 Mats Ljunggren 2009-09-17 19:27:33 UTC
Created attachment 361552 [details]
dmesg output with debug/debug50=0x41800fff

Comment 10 Mats Ljunggren 2009-09-17 19:32:53 UTC
I succeeded to extract dmesg from core dump files at last. Latest version of crash for fc11 didn't support 2.6.30 x86_64 (crash: invalid structure size: x8664_pda), but when I rebuild the latest fc12 version for fc11 instead I got it to work. I'm attaching the two dmesg outputs to this bug report. I got very weird results with -d 31 option of the makedumpfile command, so I only used -c. I added mem=2G as kernel boot parameter. The system still crashed. The dump files are still quite big, about 350MB. You don't have any upload area? Otherwise I have to see if I can find some free storage somewhere.

Comment 11 Mats Ljunggren 2009-09-18 05:26:06 UTC
Just discovered that my ISP gives me 10GB free upload area. Stanislaw, I will send you an email with the details where you can find the dump files.

Comment 12 Stanislaw Gruszka 2009-09-21 12:55:08 UTC
Created attachment 361921 [details]
Test patch - bug workaround

Comment 13 Stanislaw Gruszka 2009-09-21 13:00:41 UTC
I think this is firmware bug, I'll  provide more info soon. We can do workaround in driver. 

Mats, could you test attached patch, does it help and cause any other troubles ?

Comment 14 Stanislaw Gruszka 2009-09-21 13:55:15 UTC
What happens in dmesg attached in comment #9:

- ieee80211 core try to send 5 frames with sequence numbers: 0Xa64, 0Xa65, 0Xa66, 0Xa67, 0Xa65;  sequence numbers correspond to transmit buffer indexes 100, 101,..., 104
- firmware send to driver some number (about 40) of TX responses without block-ack
- on the last response before oops, last index become erroneously 105 instead of 104, and corresponding transmit skb (priv->txq[txq_id].txb[idx].skb[0]) is NULL

Interesting are values of struct agg_tx_status on the last response (taken from memory dump):

crash> struct iwl5000_tx_resp 0xffff88006038c008
struct iwl5000_tx_resp {
  frame_count = 5 '\005',
  bt_kill_count = 0 '\0',
  failure_rts = 0 '\0',
  failure_frame = 0 '\0',
  rate_n_flags = 2147535118,
  wireless_media_time = 350,
  pa_status = 0 '\0',
  pa_integ_res_a = "\000\000",
  pa_integ_res_b = "\000\000",
  pa_integ_res_C = "\000\000",
  tfd_info = 3222315108,
  seq_ctl = 34368,
  byte_cnt = 1542,
  tlc_info = 0 '\0',
  ra_tid = 0 '\0',
  frame_ctrl = 0,
  status = {
    status = 4096,
    sequence = 2660
  }
}
crash> struct agg_tx_status 0xffff88006038c028
struct agg_tx_status {
  status = 4096,
  sequence = 2660
}
crash> struct agg_tx_status 0xffff88006038c02c
struct agg_tx_status {
  status = 4096,
  sequence = 2661
}
crash> struct agg_tx_status 0xffff88006038c030
struct agg_tx_status {
  status = 4096,
  sequence = 2662
}
crash> struct agg_tx_status 0xffff88006038c034
struct agg_tx_status {
  status = 4096,
  sequence = 2663
}
crash> struct agg_tx_status 0xffff88006038c038
struct agg_tx_status {
  status = 0,
  sequence = 2665
}
crash>

In last entry, looks like status bit (4096) is shifted by 4 places and go to sequence field.

Reinette, could you comment this? Do you think we need to fill firmware bug report on intellinuxwireless.org?

Comment 15 Mats Ljunggren 2009-09-21 18:34:31 UTC
I just built the kernel with the patch. However, I got the same result. It crashed within a minute.

Comment 16 Mats Ljunggren 2009-09-21 20:32:52 UTC
I must take back last comment. I applied the patch incorrectly. I have now had the kernel with the patch running for half an hour. Sorry.

Comment 17 Stanislaw Gruszka 2009-09-23 08:20:36 UTC
Ok, I'm going to send patch upstream together with similar fix for 4965.

Comment 18 Dan Williams 2009-09-23 09:15:38 UTC
I'm sure the Intel guys would love to see firmware debug logs for this too, which I think can be enabled by using debug50=0x43ff as a module parameter for 'iwlagn'.  Worth a shot if you can reliably reproduce (the debugspew is pretty heavy, and may even prevent the condition from happening because it slows stuff down).

Comment 19 Mats Ljunggren 2009-09-23 19:03:03 UTC
It crashed with debug/debug50 set to 0x43ff. I have just uploaded the crash dump. Dan, I'm sending you an email with info where you can find the dump.

Comment 20 reinette chatre 2009-09-24 21:33:48 UTC
I'd like to comment on three items:

Thank you very much for the great analysis in comment #14. I would like to see if we can maybe look if we can catch the problem earlier. From what I can tell the driver transmits those five frames and then receives more than forty TX responses for those frames. It was the last of these that caused the panic. I would like to explore if there is anything the driver can do to treat the TX responses correctly in the first place.

The second thing I'd like to comment on is the request in #18 for firmware debug logs. When you use debug flag 0x40000 you will get a firmware error and event log when there is a firmware sysassert. This is the only time this information is printed so when we get a kernel problem like this there will not be a firmware error. If you do want the firmware logs to be printed at this point it can be requested via debugfs or you can place the dump_nic_event_log/dump_nic_error_log calls in the code at which point the problem occurs.

The third thing that you may like to here is that debug50 is now deprecated. This may be in 2.6.31 ... whatever has the patch "iwlwifi: make debug level more user friendly".

Stanislaw, I did see that you submitted the patch upstream. I would like to hold off merging it for now until we know if we can fix the problem of the many TX responses. We can use this bugzilla to discuss the patch and once we have the patch sorted out I would like to merge it into our local repo first for some testing and will then submit it upstream on your behalf after it passes testing. Is that ok with you?

Comment 21 reinette chatre 2009-09-24 23:33:02 UTC
Mats,

Which firmware are you using? Please ensure it is the latest (8.24.2.12).

Also, I do understand from your initial post that you tried with stable compat-wireless ... could you please try with the bleeding edge compat-wireless?

Comment 22 Mats Ljunggren 2009-09-25 17:11:08 UTC
Yes, I have 8.24.2.12. dmesg output:
iwlagn 0000:04:00.0: firmware: requesting iwlwifi-5000-2.ucode
iwlagn 0000:04:00.0: loaded firmware version 8.24.2.12

I just built and tested compat-wireless-2.6.31-rc7, but it crashed the same way as before.

Comment 23 reinette chatre 2009-09-25 22:10:16 UTC
(In reply to comment #22)

> 
> I just built and tested compat-wireless-2.6.31-rc7, but it crashed the same way
> as before.  

You tested the latest stable version of compat-wireless (see http://wireless.kernel.org/en/users/Download ). Could you please test with bleeding edge that can be downloaded from http://wireless.kernel.org/download/compat-wireless-2.6/compat-wireless-2.6.tar.bz2 ?

Comment 24 Mats Ljunggren 2009-09-27 18:41:18 UTC
OK, learning every day :-). I built compat-wireless-2009-09-26. Even though the system is running stably, dmesg is full of:
iwlagn 0000:04:00.0: BUG_ON idx doesn't point to valid skb idx=215, txq_id=10
This is exactly what Stanislaw's patch prints. Checking iwl-5000.c in the source tree showed that the patch has already been included, so I guess the problem remains.

Comment 25 reinette chatre 2009-10-01 18:31:45 UTC
Created attachment 363382 [details]
use RTS/CTS for HT

Mats,

Could you please run a test with this patch applied on top of the latest compat-wireless code? The goal is to have those new error messages disappear because if I understand correctly you have no HT capability at the moment.

Thank you

Comment 26 reinette chatre 2009-10-01 20:20:25 UTC
An additional note ... 

Mats, 

When you rerun your test, could you please compile with CONFIG_MAC80211_VERBOSE_DEBUG ?

Thanks

Comment 27 Mats Ljunggren 2009-10-02 16:59:21 UTC
Reinette, I have built compat-wireless-2009-09-30 with your patch and CONFIG_MAC80211_VERBOSE_DEBUG. I still get dmesg full of "BUG_ON idx doesn't...". What is HT capability, by the way?

Comment 28 reinette chatre 2009-10-02 17:29:31 UTC
HT (High Throughput) is usually used to refer to 802.11n

Could you please run your test with debug flags 0x45900FFF and CONFIG_MAC80211_VERBOSE_DEBUG enabled? Please post the logs.

Thank you

Comment 29 Mats Ljunggren 2009-10-02 18:40:37 UTC
Created attachment 363512 [details]
dmesg output with debug/debug50=0x45900FFF

Is there any other log than dmesg you need? I saw that now all "BUG_ON idx
doesn't..." are gone. Perhaps it is supposed to be like that?

Comment 30 reinette chatre 2009-10-12 20:46:27 UTC
(In reply to comment #14)

> Reinette, could you comment this? Do you think we need to fill firmware bug
> report on intellinuxwireless.org?  

Please do. Thank you very much.

Comment 31 Stanislaw Gruszka 2009-10-13 15:43:42 UTC
(In reply to comment #29)
> Is there any other log than dmesg you need? I saw that now all "BUG_ON idx
> doesn't..." are gone. Perhaps it is supposed to be like that?  

Yes, that was exacly reasoning of this patch. But this is still
workaround against bad firmware function.

Comment 32 reinette chatre 2009-10-13 15:55:06 UTC
(In reply to comment #31)
> (In reply to comment #29)
> > Is there any other log than dmesg you need? I saw that now all "BUG_ON idx
> > doesn't..." are gone. Perhaps it is supposed to be like that?  
> 
> Yes, that was exacly reasoning of this patch. But this is still
> workaround against bad firmware function.  

I don't think so. Essentially the patch replaced the kernel oops with an error message "BUG_ON idx ...". If we do not see that error message anymore I assume that the oops will not be hit either. Unfortunately it thus sounds like we are dealing with a timing issue / race condition here since the problem disappears when you add debugging.

Comment 33 Stanislaw Gruszka 2009-10-13 16:11:29 UTC
(In reply to comment #32)

Ahh, I missed that debug=0x45900FFF make BUG_ON disappear not "use RTS/CTS for HT" patch.

Comment 34 Stanislaw Gruszka 2010-03-09 07:14:10 UTC
For firmware fix for that bug look at:
http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2098

F12 update kernel to 2.6.32 which include workaround for that bug (kernel stops panic), so I'm closing this bug with NEXTRELEASE resolution.


Note You need to log in before you can comment on or make changes to this bug.