Description of problem: System crashes after just a minute. Version-Release number of selected component (if applicable): kernel-2.6.30.5-43.fc11.x86_64 How reproducible: Reboot the system and it will happen again. Steps to Reproduce: 1. 2. 3. Actual results: System crashes Expected results: No crash Additional info: I basically got the same result when I tested compat-wireless-2.6.30/2.6.31 on 2.6.29 kernel.
Created attachment 360118 [details] Kernel oops log
BUG: unable to handle kernel NULL pointer dereference at 0000000000000016 IP: [<ffffffffa0251a38>] iwl5000_rx_reply_tx+0x50d/0xb62 [iwlagn]
(gdb) l *(iwl5000_rx_reply_tx+0x50d) 0x14a5c is in iwl5000_tx_status_reply_tx (drivers/net/wireless/iwlwifi/iwl-5000.c:1200). 1195 IWL_DEBUG_TX_REPLY(priv, "FrameCnt = %d, txq_id=%d idx=%d\n", 1196 agg->frame_count, txq_id, idx); 1197 1198 hdr = iwl_tx_queue_get_hdr(priv, txq_id, idx); 1199 1200 sc = le16_to_cpu(hdr->seq_ctrl); 1201 if (idx != (SEQ_TO_SN(sc) & 0xff)) { 1202 IWL_ERR(priv, 1203 "BUG_ON idx doesn't match seq control" 1204 " idx=%d, seq_idx=%d, seq=%d\n", (gdb)
Mats, you catch kernel panic dmesg using kdump, right? Could you provide compressed memory dump somewhere to download or send it to me by email (if file is not so big). Could you also provide two dmesg outputs when kernel panic with these iwlagn module parameters: $ modprobe iwlagn debug=0x40000000 $ modprobe iwlagn debug=0x41800fff
(In reply to comment #4) > Could you also provide two dmesg outputs when kernel panic with these iwlagn > module parameters: > > $ modprobe iwlagn debug=0x40000000 > > $ modprobe iwlagn debug=0x41800fff For 50xx adapters "debug50" option should be used instead of "debug". It should be safe to add both like in example: $ modprobe iwlagn debug=0x40000000 debug50=0x40000000
Actually, I didn't use kdump. I just copy and pasted the output I got on the serial console. I don't know if the console output with the debug/debug50 set to the suggested values will be of any use to you? The generated core dumps are quite big, 5.7G uncompressed, 0.36G gzip compressed and with makedumpfile -c 1.23G. I saw that makedumpfile has other options to make dump files even smaller. I'm not that familiar with the options so any suggestion is appreciated. My understanding is that I'm supposed to use the crash command to get some information from the core dump files, e.g. kernel panic dmesg. And in order to do that I need kernel-debuginfo installed. The only kernel-debuginfo I could install with yum was for kernel 2.6.29.4-167. I guess I need a matching kernel-debuginfo to the kernel? I don't know the best way to get the core dump files to you. They are slightly too big to be emailed.
(In reply to comment #6) > Actually, I didn't use kdump. I just copy and pasted the output I got on the > serial console. I don't know if the console output with the debug/debug50 set > to the suggested values will be of any use to you? It depends, but definitely these verbose debug messages can be useful. > The generated core dumps are > quite big, 5.7G uncompressed, 0.36G gzip compressed and with makedumpfile -c > 1.23G. I saw that makedumpfile has other options to make dump files even > smaller. I'm not that familiar with the options so any suggestion is > appreciated. I think using makedumpfile -c -d 31 is ok, as all pages (zero, cache, cache private, user data and free) seems to be unneeded for debugging the issue. From other site you can boot kernel with mem= parameter to limit memory, for example: mem=2G , hence dump file will be smaller. However you can not be able to reproduce the bug with limited memory. If so, this give us some additional info about bug (suspect to some upper memory dma mappings). > My understanding is that I'm supposed to use the crash command to > get some information from the core dump files, e.g. kernel panic dmesg. And in > order to do that I need kernel-debuginfo installed. The only kernel-debuginfo I > could install with yum was for kernel 2.6.29.4-167. I guess I need a matching > kernel-debuginfo to the kernel? Below command should enable proper repository and install uptodate kernel: $ debuginfo-install kernel If not you can download rpms here: http://mirror.hmc.edu/fedora/linux/updates/11/x86_64/debug/kernel-debuginfo-2.6.30.5-43.fc11.x86_64.rpm http://mirror.hmc.edu/fedora/linux/updates/11/x86_64/debug/kernel-debuginfo-common-x86_64-2.6.30.5-43.fc11.x86_64.rpm > I don't know the best way to get the core dump > files to you. They are slightly too big to be emailed. I hope with makedumpfile -d 31 or with mem= parameter you will be able to generate small enough image for email.
Created attachment 361551 [details] dmesg output with debug/debug50=0x40000000
Created attachment 361552 [details] dmesg output with debug/debug50=0x41800fff
I succeeded to extract dmesg from core dump files at last. Latest version of crash for fc11 didn't support 2.6.30 x86_64 (crash: invalid structure size: x8664_pda), but when I rebuild the latest fc12 version for fc11 instead I got it to work. I'm attaching the two dmesg outputs to this bug report. I got very weird results with -d 31 option of the makedumpfile command, so I only used -c. I added mem=2G as kernel boot parameter. The system still crashed. The dump files are still quite big, about 350MB. You don't have any upload area? Otherwise I have to see if I can find some free storage somewhere.
Just discovered that my ISP gives me 10GB free upload area. Stanislaw, I will send you an email with the details where you can find the dump files.
Created attachment 361921 [details] Test patch - bug workaround
I think this is firmware bug, I'll provide more info soon. We can do workaround in driver. Mats, could you test attached patch, does it help and cause any other troubles ?
What happens in dmesg attached in comment #9: - ieee80211 core try to send 5 frames with sequence numbers: 0Xa64, 0Xa65, 0Xa66, 0Xa67, 0Xa65; sequence numbers correspond to transmit buffer indexes 100, 101,..., 104 - firmware send to driver some number (about 40) of TX responses without block-ack - on the last response before oops, last index become erroneously 105 instead of 104, and corresponding transmit skb (priv->txq[txq_id].txb[idx].skb[0]) is NULL Interesting are values of struct agg_tx_status on the last response (taken from memory dump): crash> struct iwl5000_tx_resp 0xffff88006038c008 struct iwl5000_tx_resp { frame_count = 5 '\005', bt_kill_count = 0 '\0', failure_rts = 0 '\0', failure_frame = 0 '\0', rate_n_flags = 2147535118, wireless_media_time = 350, pa_status = 0 '\0', pa_integ_res_a = "\000\000", pa_integ_res_b = "\000\000", pa_integ_res_C = "\000\000", tfd_info = 3222315108, seq_ctl = 34368, byte_cnt = 1542, tlc_info = 0 '\0', ra_tid = 0 '\0', frame_ctrl = 0, status = { status = 4096, sequence = 2660 } } crash> struct agg_tx_status 0xffff88006038c028 struct agg_tx_status { status = 4096, sequence = 2660 } crash> struct agg_tx_status 0xffff88006038c02c struct agg_tx_status { status = 4096, sequence = 2661 } crash> struct agg_tx_status 0xffff88006038c030 struct agg_tx_status { status = 4096, sequence = 2662 } crash> struct agg_tx_status 0xffff88006038c034 struct agg_tx_status { status = 4096, sequence = 2663 } crash> struct agg_tx_status 0xffff88006038c038 struct agg_tx_status { status = 0, sequence = 2665 } crash> In last entry, looks like status bit (4096) is shifted by 4 places and go to sequence field. Reinette, could you comment this? Do you think we need to fill firmware bug report on intellinuxwireless.org?
I just built the kernel with the patch. However, I got the same result. It crashed within a minute.
I must take back last comment. I applied the patch incorrectly. I have now had the kernel with the patch running for half an hour. Sorry.
Ok, I'm going to send patch upstream together with similar fix for 4965.
I'm sure the Intel guys would love to see firmware debug logs for this too, which I think can be enabled by using debug50=0x43ff as a module parameter for 'iwlagn'. Worth a shot if you can reliably reproduce (the debugspew is pretty heavy, and may even prevent the condition from happening because it slows stuff down).
It crashed with debug/debug50 set to 0x43ff. I have just uploaded the crash dump. Dan, I'm sending you an email with info where you can find the dump.
I'd like to comment on three items: Thank you very much for the great analysis in comment #14. I would like to see if we can maybe look if we can catch the problem earlier. From what I can tell the driver transmits those five frames and then receives more than forty TX responses for those frames. It was the last of these that caused the panic. I would like to explore if there is anything the driver can do to treat the TX responses correctly in the first place. The second thing I'd like to comment on is the request in #18 for firmware debug logs. When you use debug flag 0x40000 you will get a firmware error and event log when there is a firmware sysassert. This is the only time this information is printed so when we get a kernel problem like this there will not be a firmware error. If you do want the firmware logs to be printed at this point it can be requested via debugfs or you can place the dump_nic_event_log/dump_nic_error_log calls in the code at which point the problem occurs. The third thing that you may like to here is that debug50 is now deprecated. This may be in 2.6.31 ... whatever has the patch "iwlwifi: make debug level more user friendly". Stanislaw, I did see that you submitted the patch upstream. I would like to hold off merging it for now until we know if we can fix the problem of the many TX responses. We can use this bugzilla to discuss the patch and once we have the patch sorted out I would like to merge it into our local repo first for some testing and will then submit it upstream on your behalf after it passes testing. Is that ok with you?
Mats, Which firmware are you using? Please ensure it is the latest (8.24.2.12). Also, I do understand from your initial post that you tried with stable compat-wireless ... could you please try with the bleeding edge compat-wireless?
Yes, I have 8.24.2.12. dmesg output: iwlagn 0000:04:00.0: firmware: requesting iwlwifi-5000-2.ucode iwlagn 0000:04:00.0: loaded firmware version 8.24.2.12 I just built and tested compat-wireless-2.6.31-rc7, but it crashed the same way as before.
(In reply to comment #22) > > I just built and tested compat-wireless-2.6.31-rc7, but it crashed the same way > as before. You tested the latest stable version of compat-wireless (see http://wireless.kernel.org/en/users/Download ). Could you please test with bleeding edge that can be downloaded from http://wireless.kernel.org/download/compat-wireless-2.6/compat-wireless-2.6.tar.bz2 ?
OK, learning every day :-). I built compat-wireless-2009-09-26. Even though the system is running stably, dmesg is full of: iwlagn 0000:04:00.0: BUG_ON idx doesn't point to valid skb idx=215, txq_id=10 This is exactly what Stanislaw's patch prints. Checking iwl-5000.c in the source tree showed that the patch has already been included, so I guess the problem remains.
Created attachment 363382 [details] use RTS/CTS for HT Mats, Could you please run a test with this patch applied on top of the latest compat-wireless code? The goal is to have those new error messages disappear because if I understand correctly you have no HT capability at the moment. Thank you
An additional note ... Mats, When you rerun your test, could you please compile with CONFIG_MAC80211_VERBOSE_DEBUG ? Thanks
Reinette, I have built compat-wireless-2009-09-30 with your patch and CONFIG_MAC80211_VERBOSE_DEBUG. I still get dmesg full of "BUG_ON idx doesn't...". What is HT capability, by the way?
HT (High Throughput) is usually used to refer to 802.11n Could you please run your test with debug flags 0x45900FFF and CONFIG_MAC80211_VERBOSE_DEBUG enabled? Please post the logs. Thank you
Created attachment 363512 [details] dmesg output with debug/debug50=0x45900FFF Is there any other log than dmesg you need? I saw that now all "BUG_ON idx doesn't..." are gone. Perhaps it is supposed to be like that?
(In reply to comment #14) > Reinette, could you comment this? Do you think we need to fill firmware bug > report on intellinuxwireless.org? Please do. Thank you very much.
(In reply to comment #29) > Is there any other log than dmesg you need? I saw that now all "BUG_ON idx > doesn't..." are gone. Perhaps it is supposed to be like that? Yes, that was exacly reasoning of this patch. But this is still workaround against bad firmware function.
(In reply to comment #31) > (In reply to comment #29) > > Is there any other log than dmesg you need? I saw that now all "BUG_ON idx > > doesn't..." are gone. Perhaps it is supposed to be like that? > > Yes, that was exacly reasoning of this patch. But this is still > workaround against bad firmware function. I don't think so. Essentially the patch replaced the kernel oops with an error message "BUG_ON idx ...". If we do not see that error message anymore I assume that the oops will not be hit either. Unfortunately it thus sounds like we are dealing with a timing issue / race condition here since the problem disappears when you add debugging.
(In reply to comment #32) Ahh, I missed that debug=0x45900FFF make BUG_ON disappear not "use RTS/CTS for HT" patch.
For firmware fix for that bug look at: http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2098 F12 update kernel to 2.6.32 which include workaround for that bug (kernel stops panic), so I'm closing this bug with NEXTRELEASE resolution.