Bug 1199727

Summary: (iwlwifi) fail to flush all tx fifo queues
Product: [Fedora] Fedora
Reporter: Chris van de Sande <cvandesande>
Component: kernel
Assignee: fedora-kernel-wireless-iwl
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 21
CC: anelson, cvandesande, extras-orphan, gansalmon, hamzy, itamar, jonathan, kernel-maint, linville, madhu.chinakonda, mchehab
Flags: kernel-team: needinfo?
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-12-02 09:52:25 UTC
Attachments:
Description | Flags
dmesg output | none
Output for comment 13 | none
dmesg output kernel 3.19.1 | none
fix | none
dmesg with patched kernel | none
dmesg 4.0-rc3 with patch | none
dmesg 4.0-rc3 patched | none

Description Chris van de Sande 2015-03-07 13:55:30 UTC
Description of problem:

Wireless connection drops after a few minutes of use.

Version-Release number of selected component (if applicable):
Kernel 3.18.7
iwlwifi firmware 18.168.6.1

How reproducible:
Every 5-20 minutes, depending on usage. The heavier the usage, the sooner the problem seems to happen. Other times it will happen 1 minute after connecting for no apparent reason.

Steps to Reproduce:
1. Boot or Resume laptop
2. Start browsing on wifi


Actual results:
Works for a few minutes, then unable to access Internet.

Expected results:
A stable connection to the Internet.

Additional info:

Moments before the error occurs ping 8.8.8.8 will often show: ping: sendmsg: No buffer space available

Disabling/re-enabling wireless on my Thinkpad with Fn-F5 works around the problem. Must repeat every time it happens.

lspci:
03:00.0 Network controller: Intel Corporation Centrino Wireless-N 2200 (rev c4)

Laptop dual boots Windows 7. Problem doesn't seem to happen in Windows.

iwlwifi 0000:03:00.0: loaded firmware version 18.168.6.1 op_mode iwldvm

journalctl snippet:
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 0
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Current SW read_ptr 39 write_ptr 43
Mar 07 14:25:01 t430 kernel: iwl data: 00000000: 00 00 00 00 00 00 00 00 80 07 00 00 00 00 00 00  ................
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(1) = 0x80102007
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(3) = 0x8030002a
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709098
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [39,43]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [8,8]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [153,153]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 11 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 12 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Mar 07 14:25:01 t430 kernel: iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
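
Editorial note: the queue dump above can be read mechanically. The bracketed pair at the end of each "Q n" line appears to be the software read and write pointers (Q 0 shows [39,43], matching the "Current SW read_ptr 39 write_ptr 43" line). The following is a minimal sketch, not part of the driver or the original report, that lists active queues whose pointers differ. The function name `stuck_queues` and the ring size of 256 are assumptions made for illustration.

```python
import re

# Matches lines such as:
#   "... Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [39,43]"
QUEUE_RE = re.compile(
    r"Q (\d+) is (active|inactive) and mapped to fifo (\d+) "
    r"ra_tid 0x[0-9a-fA-F]+ \[(\d+),(\d+)\]"
)


def stuck_queues(log_text, ring_size=256):
    """Return (queue, fifo, pending) for active queues with unflushed entries.

    ring_size is an assumed value, used only to handle pointer wrap-around.
    """
    stuck = []
    for match in QUEUE_RE.finditer(log_text):
        queue, state, fifo, read_ptr, write_ptr = match.groups()
        if state != "active":
            continue
        pending = (int(write_ptr) - int(read_ptr)) % ring_size
        if pending:
            stuck.append((int(queue), int(fifo), pending))
    return stuck


sample = """\
iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [39,43]
iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [153,153]
iwlwifi 0000:03:00.0: Q 11 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
"""
print(stuck_queues(sample))  # [(0, 3, 4)]
```

Applied to the snippet in this report, only Q 0 (mapped to fifo 3) has entries outstanding, which is consistent with the "fail to flush all tx fifo queues Q 0" message.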

Comment 1 Mark Hamzy 2015-03-08 14:26:44 UTC
Created attachment 999343 [details]
dmesg output

Comment 2 Mark Hamzy 2015-03-08 14:27:54 UTC
I see this as well on a Lenovo Thinkpad W540

[root@hamzy-tp-w540 ~]# uname -a
Linux hamzy-tp-w540 3.18.8-201.fc21.x86_64 #1 SMP Fri Feb 27 18:18:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Comment 3 John W. Linville 2015-03-09 14:28:15 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=56581#c154

Comment 5 John W. Linville 2015-03-09 14:32:04 UTC
This is actually upstream commit a0855054e59b0c5b2b00237fdb5147f7bcc18efb.

We probably also want to take 5a12a07e4495d1e4d79382e05c9d6e8b4d9fa4ec (which is a bugfix for the patch above), and 4e6c48e0984e28d064ee8fbc292aee7b7920c507 (which is the same patch/fix for a related set of hardware).

Comment 6 Josh Boyer 2015-03-10 12:53:10 UTC
(In reply to John W. Linville from comment #5)
> This is actually upstream commit a0855054e59b0c5b2b00237fdb5147f7bcc18efb.

This is already in 3.18 upstream and therefore in Fedora.

> We probably also want to take 5a12a07e4495d1e4d79382e05c9d6e8b4d9fa4ec
> (which is a bugfix for the patch above), and

That one is in 3.18.y stable as commit e7fd25db8348873b40dfa8ef882758e731de0ae1 which is in 3.18.3 so already in Fedora.

> 4e6c48e0984e28d064ee8fbc292aee7b7920c507 (which is the same patch/fix for a
> related set of hardware).

OK, this one went into 3.19 and hasn't been pulled into a stable tree yet.  So it's possible that would fix this bug but it would be nice to confirm that the reporters have the hardware related to that bugfix.

Comment 7 Chris van de Sande 2015-03-10 13:01:08 UTC
Going to test 3.19 today

Comment 8 John W. Linville 2015-03-10 13:47:27 UTC
The output from 'lspci -n' is probably sufficient to see if you have the hardware, but testing 3.19 seems like a good idea too... :-)

Comment 9 Mark Hamzy 2015-03-10 14:21:01 UTC
[root@hamzy-tp-w540 ~]# lspci -n
00:00.0 0600: 8086:0c04 (rev 06)
00:01.0 0604: 8086:0c01 (rev 06)
00:02.0 0300: 8086:0416 (rev 06)
00:03.0 0403: 8086:0c0c (rev 06)
00:14.0 0c03: 8086:8c31 (rev 04)
00:16.0 0780: 8086:8c3a (rev 04)
00:16.3 0700: 8086:8c3d (rev 04)
00:19.0 0200: 8086:153a (rev 04)
00:1a.0 0c03: 8086:8c2d (rev 04)
00:1b.0 0403: 8086:8c20 (rev 04)
00:1c.0 0604: 8086:8c10 (rev d4)
00:1c.1 0604: 8086:8c12 (rev d4)
00:1c.2 0604: 8086:8c14 (rev d4)
00:1c.4 0604: 8086:8c18 (rev d4)
00:1d.0 0c03: 8086:8c26 (rev 04)
00:1f.0 0601: 8086:8c4f (rev 04)
00:1f.2 0106: 8086:8c03 (rev 04)
00:1f.3 0c05: 8086:8c22 (rev 04)
01:00.0 0300: 10de:0ff6 (rev a1)
02:00.0 0805: 1217:8520 (rev 01)
03:00.0 0280: 8086:08b2 (rev 83)

Comment 10 Chris van de Sande 2015-03-10 14:27:55 UTC
I built and installed 3.19.1 anyway, so far it's working but it's a bit too soon to tell.

[root@t430 cvandesande]# lspci -n
00:00.0 0600: 8086:0154 (rev 09)
00:02.0 0300: 8086:0166 (rev 09)
00:14.0 0c03: 8086:1e31 (rev 04)
00:16.0 0780: 8086:1e3a (rev 04)
00:19.0 0200: 8086:1502 (rev 04)
00:1a.0 0c03: 8086:1e2d (rev 04)
00:1b.0 0403: 8086:1e20 (rev 04)
00:1c.0 0604: 8086:1e10 (rev c4)
00:1c.1 0604: 8086:1e12 (rev c4)
00:1c.2 0604: 8086:1e14 (rev c4)
00:1d.0 0c03: 8086:1e26 (rev 04)
00:1f.0 0601: 8086:1e55 (rev 04)
00:1f.2 0106: 8086:1e03 (rev 04)
00:1f.3 0c05: 8086:1e22 (rev 04)
02:00.0 0880: 1180:e822 (rev 07)
03:00.0 0280: 8086:0891 (rev c4)
[root@t430 cvandesande]#

Comment 11 John W. Linville 2015-03-10 14:45:29 UTC
Well, it looks like Chris van de Sande's 8086:0891 device is a dvm device.  This should have been covered by the two patches already in the 3.18.y stream.

I don't find any listing for Mark Hamzy's 8086:08b2 device.  Mark, is that device even working for you at all?

Comment 12 Mark Hamzy 2015-03-10 14:49:29 UTC
Yes! After I rebooted, I had a lot of problems connecting to the network. And I would see a lot of messages from iwlwifi, cfg80211, and wlp3s0 in dmesg. But now the network seems to have stabilized.

This machine is a Lenovo Thinkpad W540.

Comment 13 John W. Linville 2015-03-10 14:52:21 UTC
Interesting...could you attach the output of "modinfo iwlwifi" and the output of "ethtool -i wlp3s0"?

Comment 14 Mark Hamzy 2015-03-10 14:59:01 UTC
Created attachment 999980 [details]
Output for comment 13

Comment 15 John W. Linville 2015-03-10 15:13:40 UTC
OK, I see it now -- need to renew my source code search training... ;-)

Mark, your device is in the MVM category and hopefully it will benefit from the 4e6c48e098 patch.
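
Editorial note: the device check discussed in comments 8 through 15 can be sketched as a small parser over `lspci -n` output. This is only an illustration under stated assumptions: it treats PCI class 0280 as the wireless controller (as it is in both listings in this report), and its op-mode table contains only the two device IDs classified in this thread (0891 as dvm in comment 11, 08b2 as MVM in comment 15). It is not the driver's actual device table, and the names `OP_MODE_BY_DEVICE` and `wireless_devices` are hypothetical.

```python
# Op-mode assignments taken from comments 11 and 15 of this report only.
# This is NOT the iwlwifi driver's device table.
OP_MODE_BY_DEVICE = {"0891": "dvm", "08b2": "mvm"}


def wireless_devices(lspci_n_output):
    """Return (slot, device_id, op_mode) for Intel class-0280 devices."""
    found = []
    for line in lspci_n_output.splitlines():
        parts = line.split()
        # Expected shape: "03:00.0 0280: 8086:0891 (rev c4)"
        if len(parts) < 3 or parts[1] != "0280:":
            continue
        vendor, _, device = parts[2].partition(":")
        if vendor == "8086":
            op_mode = OP_MODE_BY_DEVICE.get(device, "unknown")
            found.append((parts[0], device, op_mode))
    return found


sample = """\
00:19.0 0200: 8086:1502 (rev 04)
02:00.0 0880: 1180:e822 (rev 07)
03:00.0 0280: 8086:0891 (rev c4)
"""
print(wireless_devices(sample))  # [('03:00.0', '0891', 'dvm')]
```

Any ID not mentioned in this thread is reported as "unknown" rather than guessed.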

Comment 16 Mark Hamzy 2015-03-10 15:34:16 UTC
Josh, I could test a scratch koji build, if you would be willing to provide one. It is really easy to download it and install/remove it as an rpm...

Comment 17 Chris van de Sande 2015-03-10 18:20:45 UTC
Just confirmed the problem still happens in 3.19.1, as expected. Interestingly, though, it took a few hours for the error to manifest instead of a few minutes. That could be due to today's wifi climate in my area.

Comment 18 Chris van de Sande 2015-03-10 18:21:31 UTC
Created attachment 1000098 [details]
dmesg output kernel 3.19.1

Comment 19 Emmanuel Grumbach 2015-03-10 20:05:37 UTC
I'll send a tentative fix tomorrow.

Comment 20 Emmanuel Grumbach 2015-03-10 20:47:44 UTC
Created attachment 1000162 [details]
fix

please test this patch.

Comment 21 Chris van de Sande 2015-03-10 23:13:27 UTC
Thanks Emmanuel! Patch applied, running now but it's getting a little late for me. Will report back tomorrow with results.

Comment 22 Chris van de Sande 2015-03-10 23:38:32 UTC
Never mind, just happened with the patch :(

Comment 23 Chris van de Sande 2015-03-10 23:39:15 UTC
Created attachment 1000199 [details]
dmesg with patched kernel

Comment 24 Emmanuel Grumbach 2015-03-11 05:55:08 UTC
yes - this is because I forgot that this patch relies on 3b24f4c65386dc0f2efb41027bc6e410ea2c0049.

Can you please take 3b24f4c65386dc0f2efb41027bc6e410ea2c0049 as well?

Comment 25 Chris van de Sande 2015-03-11 07:58:31 UTC
I'm unable to apply 3b24f4c65386dc0f2efb41027bc6e410ea2c0049. I get
1 out of 1 hunk FAILED -- saving rejects to file net/mac80211/cfg.c.rej
1 out of 1 hunk FAILED -- saving rejects to file net/mac80211/ieee80211_i.h.rej
1 out of 1 hunk FAILED -- saving rejects to file net/mac80211/tx.c.rej
2 out of 3 hunks FAILED -- saving rejects to file net/mac80211/util.c.rej

I'm using the Fedora kernel, 3.18.8.

Comment 26 Emmanuel Grumbach 2015-03-11 10:28:31 UTC
Sorry, but I won't port this patch. It is not stable material anyway.
Can you please test on 4.0 with the patch attached?

Comment 27 Chris van de Sande 2015-03-11 10:32:59 UTC
No problem, will report back.

Comment 28 Chris van de Sande 2015-03-11 12:34:34 UTC
Ok so I built 4.0-rc3. The patch 3b24f4c65386dc0f2efb41027bc6e410ea2c0049 already seemed to be in, so I only applied the patch from comment 20.

I then booted the new kernel, connected and started a background "ping 8.8.8.8". I've been doing the ping ever since I've been having this problem, just to see if I've really lost my connection. I then watched some YouTube for a short while until I lost connectivity.

While I didn't get the "failed to flush" error, I did get the other symptom:
64 bytes from 8.8.8.8: icmp_seq=1477 ttl=59 time=52.8 ms
64 bytes from 8.8.8.8: icmp_seq=1478 ttl=59 time=318 ms
64 bytes from 8.8.8.8: icmp_seq=1479 ttl=59 time=6.41 ms
64 bytes from 8.8.8.8: icmp_seq=1480 ttl=59 time=21.3 ms
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available
ping: sendmsg: No buffer space available

I waited a few minutes to see if it would recover on its own, but after about 2mins of no connectivity, I forced NetworkManager to reconnect. It succeeded and connectivity was restored.
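
Editorial note: the background-ping check described in this comment can be automated. The sketch below is an illustration rather than anything used in the report. It scans captured `ping` output for the "No buffer space available" symptom and reports the last sequence number that got a reply before the first error. The function name `summarize_ping` is hypothetical.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")
ENOBUFS_TEXT = "No buffer space available"


def summarize_ping(lines):
    """Return (last_seq_before_first_error, error_count).

    The first element is None when no error line was seen, or when the
    first error occurred before any reply was received.
    """
    last_ok = None
    stalled_after = None
    errors = 0
    for line in lines:
        match = SEQ_RE.search(line)
        if match and errors == 0:
            last_ok = int(match.group(1))
        elif ENOBUFS_TEXT in line:
            if errors == 0:
                stalled_after = last_ok
            errors += 1
    return stalled_after, errors


sample = [
    "64 bytes from 8.8.8.8: icmp_seq=1479 ttl=59 time=6.41 ms",
    "64 bytes from 8.8.8.8: icmp_seq=1480 ttl=59 time=21.3 ms",
    "ping: sendmsg: No buffer space available",
    "ping: sendmsg: No buffer space available",
]
print(summarize_ping(sample))  # (1480, 2)
```

On the output quoted above, this would report that replies stopped after icmp_seq=1480.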

Comment 29 Chris van de Sande 2015-03-11 12:35:44 UTC
Created attachment 1000395 [details]
dmesg 4.0-rc3 with patch

Comment 30 Emmanuel Grumbach 2015-03-11 12:51:33 UTC
You have disconnections here... not much I can do about that.

Is the system behaving better now?

Comment 31 Chris van de Sande 2015-03-11 22:58:10 UTC
Seems much less frequent, but still happened twice a few minutes apart.

Comment 32 Chris van de Sande 2015-03-11 22:59:27 UTC
Created attachment 1000711 [details]
dmesg 4.0-rc3 patched

Comment 33 Emmanuel Grumbach 2015-03-12 06:29:30 UTC
OK, in the case it does happen, I can't do much. This is a firmware / environment problem.

The patch is on its way upstream.

Comment 34 Chris van de Sande 2015-03-12 12:03:45 UTC
Well you're certainly right about the environment. I never had this problem until I moved into my current dwelling. 

So while it's true the problem still occurred, last night was by far the best night I had since I've moved here in terms of connectivity. Instead of getting drops every few minutes, I only had 2 since installing 4.0 with your patch.

Where that leaves us in terms of this bug I'm not sure. Do we consider this an upstream fix?

Comment 36 Emmanuel Grumbach 2015-03-12 12:36:46 UTC
FWIW - I am removing Intel from here since we've done what we could. Other issues would be a firmware problem and we don't have firmware support for these devices.

Comment 37 Chris van de Sande 2015-03-13 11:55:11 UTC
Just to post an update, the bug really does seem to be related to the environment. Last night I tried to watch a streaming movie and got the "failed to flush" several times. That's with the patch running the 4.0-rc3 kernel.

So I have to conclude that it's an Intel firmware bug, on hardware that they no longer support. I do appreciate Emmanuel's effort though, it seems his hands are tied.

Normally, I would change wireless cards at this point, but my Lenovo T430 whitelists only a select few wireless cards, none of which are supported by Intel anymore. Not Red Hat's problem, I know.

The best way to fix this problem is to get a stronger signal. Move closer or get a new AP. This is harder to do in a corporate environment, but referencing this bug might suffice in requesting a different machine.

So unless anyone else has anything to add, this bug can probably be closed as WONTFIX, though the upstream patches do improve the situation.

Comment 38 Mark Hamzy 2015-03-13 13:10:04 UTC
John,

Is the Fedora 21 kernel going to get all of the patches mentioned in this bugzilla?

Does anyone know what exactly the firmware problem is?

Comment 39 John W. Linville 2015-03-13 15:00:35 UTC
I'll have to defer to Jarod on that one.  The patch is not marked for -stable, so it probably won't get into Fedora 21 by default (unless/until Fedora 21 gets a 4.0 kernel)...

Comment 40 Josh Boyer 2015-03-13 15:35:33 UTC
Jarod probably doesn't want to be messing with Fedora kernels.

F21 will get a 4.0 rebase around the time 4.0.1 comes out.

Comment 41 John W. Linville 2015-03-13 15:47:57 UTC
Josh -- brain fart!  At least you knew who I meant... ;-)

Comment 42 Fedora Kernel Team 2015-04-28 18:33:57 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 21 kernel bugs.

Fedora 21 has now been rebased to 3.19.5-200.fc21.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 22, and are still experiencing this issue, please change the version to Fedora 22.

If you experience different issues, please open a new bug report for those.

Comment 43 Fedora End Of Life 2015-11-04 12:20:36 UTC
This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 21 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 44 cvandesande 2015-11-04 12:32:52 UTC
Original reporter here. Later kernel versions did improve the situation, but it still occurred regularly.

I've since moved to a new country and am no longer in the environment that caused the issue in the first place. So I'm no longer able to reproduce.

I'll leave it to you Fedora guys to decide what you want to do about it. I do appreciate the help and support I got from everyone.

Thank you guys!

Comment 45 Fedora End Of Life 2015-12-02 09:52:31 UTC
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.