Bug 830151

Summary: NM wifi scans makes VOIP uncomfortable
Product: [Fedora] Fedora Reporter: Zdenek Kabelac <zkabelac>
Component: kernelAssignee: Stanislaw Gruszka <sgruszka>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: danw, dcbw, gansalmon, itamar, jonathan, jwboyer, kernel-maint, madhu.chinakonda, pedrogfrancisco, sgruszka
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-04 09:40:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Log of ping to wifi gateway
none
scan_latency_optimalizations_draft.patch none

Description Zdenek Kabelac 2012-06-08 11:52:53 UTC
Created attachment 590412 [details]
Log of ping to wifi gateway

Description of problem:

So I've spent some time figuring out what is wrong with my VOIP usage, when I'm connected via Wifi.   Once per ~minute (not sure about exact timing) I've been getting big drops in VOIP quality.

After some more experimenting - plain ping to my router has shown there is something wrong going on (ping attached).

Then I've tried other connected machine using different OS (Ubuntu) and having exact same physical position and signal quality and no problems - I'd to drop the idea of having Wifi signal problems.

Next experiment was to use  Wifi WITHOUT NetworkManager - after some time spent with configuring all those WPA passwords - I've finally got undisturbed pings.

So this leaved NM as primary suspect.
My next step was to run 'iwlist wlan0 scan' -  and results on ping times looked very similar to ping delays as to what default NM is doing on my desktop.

Then I've tried to figure out if there is some NM option to disable auto scanning - and haven't found out one??

And so I've opened BZ to have the option to disable Wifi scanning when connected over wifi - since it makes VOIP basically unusable.

If there is option to disable autoscanning - it's very well hidden.
If there is not - it should be added (seems like Debian is using it??)

My hw -  T61 C2D, iwl3945, kernel 3.4  x86_64


Version-Release number of selected component (if applicable):
NetworkManager-0.9.4-5.git20120521.fc18.x86_64


How reproducible:


Steps to Reproduce:
1. ping to gateway router
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Stanislaw Gruszka 2012-06-11 14:51:44 UTC
This problem seems to be 3945 specific.

Comment 2 Zdenek Kabelac 2012-06-11 15:00:08 UTC
Debugging of wpa_supplicant:
https://live.gnome.org/NetworkManager/Debugging#wpa_supplicant07

revealed that if no BSSID is given the scan is being made in upto 2 minutes interval.

As a workaround it's currently possible to setup BSSID to the AP address which disables periodic wlan scans.

Comment 3 Zdenek Kabelac 2012-06-11 20:13:40 UTC
Ok - so I've test  BSSID solution - and it works for VOIP quite well.

However this should be somehow intelligently managed by NM itself.

It should possibly detect the hw scan takes a lot of time and delays ping and user should have some option according to his preference - whether scan for surrouding AP or  VOIP latency is for him more important - otherwise he will get lags and it's not so trivial to detect real reason for packet delays.

I've also tried an iwl3945 option    disable_hw_scan=0  - this option does some help - it seems to make the number of delayed packet smaller  - instead of 5 it's usually just 2 of them with latency upto 100ms, but it also makes NM unusable after suspend/resume cycle since no new HW scan happens and new AP are not discovered after resume.

Comment 4 Zdenek Kabelac 2012-06-12 08:12:19 UTC
It's also probably worth to note here, that even machines like T520 with i7 CPUs and iwlan WiFi drivers are influenced by those scans in default f17 - though the pings are not reaching 200ms - they are in 70ms range - but it still quite noticeable ping delay, thought it's probably not such a big issue for VOIP.

So NM should really give user an option to enable/disable wifi scanning, and also and option for  scan on request.   NM cannot pretend it's smarter than user, because it simple cannot know all the conditions.

So I guess this BZ may turn into  RFE for adding 2 new NM applet options -

Disable autoscan
WiFi scan now

And as mentioned in comment 3 - it would be nice, if the tool would be able to detect latencies like 200ms upon periodic scan and give user a warning explaining him these latencies may have serious impact on low latency apps like VOIP, so user may choose what does he prefer.

Comment 5 Stanislaw Gruszka 2012-06-12 09:44:43 UTC
I discussed disabling autoscan from NM with Dan some time ago. The point was that we can not do this if we want to have wireless roaming working smoothly. We have NM BSSID: option, which disable autoscan and connect only with one AP.

"Trigger scan now" button is not that Dan would like to see from user experience perspective, he prefer to have things just works.

To make scan faster NM use now wpa_supplicant "background scan" feature, but I'm not quite sure if it is really faster option, or if it is, maybe not for all drivers.

Anyway the thing to do here is make scan work fast and do not influence (much) latency and throughput. I.e for 3945 the goal would be make sw_scan as fast as hw_scan.

Comment 6 Zdenek Kabelac 2012-06-12 09:58:30 UTC
(In reply to comment #5)
> I discussed disabling autoscan from NM with Dan some time ago. The point was
> that we can not do this if we want to have wireless roaming working
> smoothly. We have NM BSSID: option, which disable autoscan and connect only
> with one AP.

Well I'm just wondering how the average user 'Joe' will be able to setup BSSID properly - if there would be some proper support from nm-applet and also user would KNOW what he should do - then I do not see a big problem.

But in all other case simple button 'Disable scan'  is fairly simple to understand and easy to implement - I do not see any way how to make it better
with current state of drivers - it's pretty 'long run task' for kernel drivers to make the scan 'unnoticeable'  - meanwhile all users will suffer ???
That's IMHO not fair for user who are not able to setup things - and they are even not aware where is the problem...



> 
> Anyway the thing to do here is make scan work fast and do not influence
> (much) latency and throughput. I.e for 3945 the goal would be make sw_scan
> as fast as hw_scan.


Well even hw_scan  makes  100ms delays with 3945 - so not an option at all...
(And doesn't work after resume)

Comment 7 Pedro Francisco 2012-11-26 14:22:47 UTC
As a workaround try:
$ sudo wpa_cli -iwlan0 set_network <ESSID HERE> bgscan \"simple:4000:-200:8000\"

You will probably find out NetworkManager resets those values on multi-AP networks, but maybe on single AP ones it doesn't.

Comment 8 Zdenek Kabelac 2012-12-09 18:47:21 UTC
Unsure what has changed were

but now with kernel 3.7-rc8 and
NetworkManager-0.9.7.0-9.git20121004.fc19.x86_64
it's actually impossible to disable scan at all.

It doesn't matter if I set  BSSID or not I'm getting now this - and it's completely unsuable for VOIP over wifi with  iwl3945!


168.192.1) 56(84) bytes of data.
64 bytes from 192.168.192.1: icmp_seq=1 ttl=64 time=0.941 ms
64 bytes from 192.168.192.1: icmp_seq=2 ttl=64 time=1.39 ms
64 bytes from 192.168.192.1: icmp_seq=3 ttl=64 time=0.963 ms
64 bytes from 192.168.192.1: icmp_seq=4 ttl=64 time=0.973 ms
64 bytes from 192.168.192.1: icmp_seq=5 ttl=64 time=20.5 ms
64 bytes from 192.168.192.1: icmp_seq=6 ttl=64 time=1.98 ms
64 bytes from 192.168.192.1: icmp_seq=7 ttl=64 time=0.927 ms
64 bytes from 192.168.192.1: icmp_seq=8 ttl=64 time=0.955 ms
64 bytes from 192.168.192.1: icmp_seq=9 ttl=64 time=0.981 ms
64 bytes from 192.168.192.1: icmp_seq=10 ttl=64 time=0.858 ms
64 bytes from 192.168.192.1: icmp_seq=11 ttl=64 time=1.01 ms
64 bytes from 192.168.192.1: icmp_seq=12 ttl=64 time=0.927 ms
64 bytes from 192.168.192.1: icmp_seq=13 ttl=64 time=0.851 ms
64 bytes from 192.168.192.1: icmp_seq=14 ttl=64 time=1.45 ms
64 bytes from 192.168.192.1: icmp_seq=15 ttl=64 time=1.51 ms
64 bytes from 192.168.192.1: icmp_seq=16 ttl=64 time=2.20 ms
64 bytes from 192.168.192.1: icmp_seq=17 ttl=64 time=0.846 ms
64 bytes from 192.168.192.1: icmp_seq=18 ttl=64 time=1.79 ms
64 bytes from 192.168.192.1: icmp_seq=19 ttl=64 time=0.997 ms
64 bytes from 192.168.192.1: icmp_seq=20 ttl=64 time=0.988 ms
64 bytes from 192.168.192.1: icmp_seq=21 ttl=64 time=1.82 ms
64 bytes from 192.168.192.1: icmp_seq=22 ttl=64 time=206 ms
64 bytes from 192.168.192.1: icmp_seq=23 ttl=64 time=754 ms
64 bytes from 192.168.192.1: icmp_seq=24 ttl=64 time=1295 ms
64 bytes from 192.168.192.1: icmp_seq=25 ttl=64 time=1416 ms
64 bytes from 192.168.192.1: icmp_seq=26 ttl=64 time=437 ms
64 bytes from 192.168.192.1: icmp_seq=27 ttl=64 time=710 ms
64 bytes from 192.168.192.1: icmp_seq=28 ttl=64 time=586 ms
64 bytes from 192.168.192.1: icmp_seq=29 ttl=64 time=0.976 ms
64 bytes from 192.168.192.1: icmp_seq=30 ttl=64 time=0.935 ms
64 bytes from 192.168.192.1: icmp_seq=31 ttl=64 time=0.982 ms
64 bytes from 192.168.192.1: icmp_seq=32 ttl=64 time=0.888 ms
64 bytes from 192.168.192.1: icmp_seq=33 ttl=64 time=1.06 ms
64 bytes from 192.168.192.1: icmp_seq=34 ttl=64 time=0.848 ms
64 bytes from 192.168.192.1: icmp_seq=35 ttl=64 time=0.749 ms
64 bytes from 192.168.192.1: icmp_seq=36 ttl=64 time=0.844 ms
64 bytes from 192.168.192.1: icmp_seq=37 ttl=64 time=3.39 ms
64 bytes from 192.168.192.1: icmp_seq=38 ttl=64 time=0.961 ms
64 bytes from 192.168.192.1: icmp_seq=39 ttl=64 time=0.960 ms
64 bytes from 192.168.192.1: icmp_seq=40 ttl=64 time=3.57 ms
64 bytes from 192.168.192.1: icmp_seq=41 ttl=64 time=0.949 ms
64 bytes from 192.168.192.1: icmp_seq=42 ttl=64 time=0.976 ms
64 bytes from 192.168.192.1: icmp_seq=43 ttl=64 time=0.843 ms
64 bytes from 192.168.192.1: icmp_seq=44 ttl=64 time=0.969 ms
64 bytes from 192.168.192.1: icmp_seq=45 ttl=64 time=1.23 ms
64 bytes from 192.168.192.1: icmp_seq=46 ttl=64 time=0.873 ms
64 bytes from 192.168.192.1: icmp_seq=47 ttl=64 time=0.964 ms
64 bytes from 192.168.192.1: icmp_seq=48 ttl=64 time=140 ms
64 bytes from 192.168.192.1: icmp_seq=49 ttl=64 time=499 ms
64 bytes from 192.168.192.1: icmp_seq=50 ttl=64 time=349 ms
64 bytes from 192.168.192.1: icmp_seq=51 ttl=64 time=438 ms
64 bytes from 192.168.192.1: icmp_seq=52 ttl=64 time=388 ms
64 bytes from 192.168.192.1: icmp_seq=53 ttl=64 time=408 ms
64 bytes from 192.168.192.1: icmp_seq=54 ttl=64 time=384 ms
64 bytes from 192.168.192.1: icmp_seq=55 ttl=64 time=738 ms
64 bytes from 192.168.192.1: icmp_seq=56 ttl=64 time=503 ms
64 bytes from 192.168.192.1: icmp_seq=57 ttl=64 time=174 ms

Comment 9 Stanislaw Gruszka 2012-12-11 13:35:40 UTC
Looks like BSSID does not any longer prevent NM from requesting periodical scans - IOW seems that is NM regression.

Zdenek, try to downgrade NetworkManager. Check if scanning request are triggered  can be tracked by "iwevent" tool . When scan finish there is "Scan request completed" message reported by iwevent.

If this is NetworkManager regression, please open a new bug for that. In this bugzilla, the goal is to improve mac80211 scans to do cause network traffic latency, or more realistically to make latency as small as possible.

Comment 10 Stanislaw Gruszka 2012-12-11 13:42:05 UTC
> 64 bytes from 192.168.192.1: icmp_seq=47 ttl=64 time=0.964 ms
> 64 bytes from 192.168.192.1: icmp_seq=48 ttl=64 time=140 ms
> 64 bytes from 192.168.192.1: icmp_seq=49 ttl=64 time=499 ms
> 64 bytes from 192.168.192.1: icmp_seq=50 ttl=64 time=349 ms
> 64 bytes from 192.168.192.1: icmp_seq=51 ttl=64 time=438 ms
> 64 bytes from 192.168.192.1: icmp_seq=52 ttl=64 time=388 ms
> 64 bytes from 192.168.192.1: icmp_seq=53 ttl=64 time=408 ms
> 64 bytes from 192.168.192.1: icmp_seq=54 ttl=64 time=384 ms
> 64 bytes from 192.168.192.1: icmp_seq=55 ttl=64 time=738 ms
> 64 bytes from 192.168.192.1: icmp_seq=56 ttl=64 time=503 ms
> 64 bytes from 192.168.192.1: icmp_seq=57 ttl=64 time=174 ms

This actually looks more like iwl3945 or mac80211 bug ...

Comment 11 Stanislaw Gruszka 2012-12-17 13:42:33 UTC
Created attachment 664880 [details]
scan_latency_optimalizations_draft.patch

This draft patch include various iwl3945 and mac80211 changes, which optimize network traffic latency while scanning.

Tested with:
while true; do iw dev wlan0 scan; sleep 3; done > /dev/null &
ping -i0.2 -c 1000 192.168.1.1

Unpatched kernel:
--- 192.168.1.1 ping statistics ---
1000 packets transmitted, 991 received, 0% packet loss, time 200988ms
rtt min/avg/max/mdev = 0.618/42.983/241.669/62.745 ms, pipe 2

Patched kernel:
--- 192.168.1.1 ping statistics ---
1000 packets transmitted, 998 received, 0% packet loss, time 200462ms
rtt min/avg/max/mdev = 0.637/3.764/69.229/8.517 ms

Patch has to be reworked (spit into individual changes) and posted upstream. Some of those changes could not be accepted, because they could make scan itself too slow or not detecting all APs. We will see ...

Comment 12 Stanislaw Gruszka 2013-02-01 06:17:21 UTC
(In reply to comment #11)
> Patched kernel:
> --- 192.168.1.1 ping statistics ---
> 1000 packets transmitted, 998 received, 0% packet loss, time 200462ms
> rtt min/avg/max/mdev = 0.637/3.764/69.229/8.517 ms

The final result will be worse than that, on my tests I have:

rtt min/avg/max/mdev = 0.928/19.678/150.845/33.130 ms

This is because on draft patch I removed scanning of 5GHz band, which consist mostly of PASSIVE channels (scan time 125 ms, btw: ACTIVE channel scan time is 60 ms).

On a bit different test, ping from remote machine to periodically scanning machine, final patches show nice improvement:

Unpatched: rtt min/avg/max/mdev = 1.637/120.683/709.139/164.337 ms
Patched:   rtt min/avg/max/mdev = 1.807/26.893/201.435/40.284 ms

Comment 14 Josh Boyer 2013-02-01 14:39:57 UTC
I've applied them to rawhide.

Did you want them in F17/F18?  If so, aacde9ee45225f7e0b90960f479aef83c66bfdc0 is already queued for 3.7.6, so that is easy but 3f892b61a8cf6a85bc389c8d13209d686a4f6403 gets a reject in hunk #4 and probably needs a rework.

Comment 15 Stanislaw Gruszka 2013-02-04 09:40:23 UTC
This is improvement, but not a critical bug fix, so let do not apply them to F17/F18. Closing with RAWHIDE resolution.