On an ARM tablet with a wifi chip driven by the r8188eu kernel module, NetworkManager causes dmesg to fill with: [ 795.636135] R8188EU: ERROR indicate disassoc [ 795.779237] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 796.504303] R8188EU: ERROR indicate disassoc [ 796.645616] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 797.035199] R8188EU: ERROR indicate disassoc [ 797.167413] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 798.129007] R8188EU: ERROR indicate disassoc [ 798.282473] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 798.543461] R8188EU: ERROR indicate disassoc [ 798.681296] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 799.741503] R8188EU: ERROR indicate disassoc [ 799.871259] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 800.129377] R8188EU: ERROR indicate disassoc [ 800.242726] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready Note how NM is retrying the bringup of wlan0 every second. This constant hitting of the wifi chip with a request it cannot fulfil also causes scanning for networks to get interrupted, making: "nmcli d wifi" show either an empty list (sometimes), or most of the time simply: [root@localhost ~]# nmcli d wifi (process:3953): nmcli-CRITICAL **: Error: Could not create NMClient object: Timeout was reached. Additonal info: [root@localhost ~]# rpm -q NetworkManager NetworkManager-1.4.0-4.fc25.armv7hl Doing "service NetworkManager stop" stops the dmesg flood. This same tablet works fine with Fedora 24, using the exact same kernel and wpa_supplicant as in my F25 testing.
Can you grab some NetworkManager logs? nmcli g log level debug <reproduce the issue> journalctl -b -u NetworkManager does the device have firmware? If it doesn't, the driver is supposed to return ENOENT when NM sets IFF_UP and NM will detect that. Since IFF_UP is a blocking kernel operation, and since the device must be up before it can be used, NM tries to bring the device up a couple times with a short wait in between. Looks like NM isn't able to detect that the IFF_UP has failed hard and to break out of that loop. But lets find out what the actual error is and we'll see if we can get NM to handle it better.
Created attachment 1208181 [details] Journal output with the log level set to debug Journal output without setting the log level to debug is: Jul 25 21:52:29 localhost.localdomain NetworkManager[250]: <warn> [1469476349.9540] device (wlan0): set-hw-addr: new MAC address BE:CA:1D:8D:8C:80 not successfully set (scanning) Jul 25 21:52:30 localhost.localdomain NetworkManager[250]: <info> [1469476350.0798] device (wlan0): supplicant interface state: inactive -> disabled Jul 25 21:52:30 localhost.localdomain NetworkManager[250]: <info> [1469476350.0912] device (wlan0): supplicant interface state: disabled -> inactive (repeating itself)
Might be caused by the driver reporting a mtu size of 0 when not associated ?
The "98.129007] R8188EU: ERROR indicate disassoc" message seems to be a red herring, that's always printed when the driver disassociates. Obviously it shouldn't be sending that message since it's not associated to anything, but it's a staging driver and looks generally awful. Looks like changing the MAC address only works if padapter->bup == false: static int rtw_net_set_mac_address(struct net_device *pnetdev, void *p) { struct adapter *padapter = (struct adapter *)rtw_netdev_priv(pnetdev); struct sockaddr *addr = p; if (!padapter->bup) memcpy(padapter->eeprompriv.mac_addr, addr->sa_data, ETH_ALEN); return 0; } padapter->bup is a boolean that is only set to 'false' when the USB device is suspended or when the USB device is removed or the driver is unloaded. So basically, the MAC address of this device cannot ever change while the device is plugged in after bringing it up the first time. This driver sucks. But NM should handle this better though.
The driver should really be returning an error from rtw_net_set_mac_address if it can't set the address, instead of success when it's not actually going to do anything.
Hi, Yeah the driver sucks, but this is a quite popular driver (it is even enabled by default in Fedora even though it is in staging); and this used to work in F24. Now with that said, I've worked a bit on this driver before (only a bit), so if you tell me what would be correct behavior then I can fix it. If I understand you correctly there are 2 problems: 1) rtw_net_set_mac_address should return an error when not actually changing the address, what error should that be, -EBUSY perhaps ? -ENOTSUPP ? 2) At every ifup the driver logs a "R8188EU: ERROR indicate disassoc" error, if I understand you right you believe that it also sends a disassoc event to userspace, right ? And the correct behavior would be to return an error from the ifup ioctl, but to not also send the disassoc event, right ? I can take a shot at fixing those, it might be an ugly driver, but it is opensource :) Regards, Hans
(In reply to Hans de Goede from comment #6) > Hi, > > Yeah the driver sucks, but this is a quite popular driver (it is even > enabled by default in Fedora even though it is in staging); and this used to > work in F24. Yeah, which is the curse of staging drivers. If nobody uses them not many are interested in cleaning them up, but they are often utterly awful. > Now with that said, I've worked a bit on this driver before (only a bit), so > if you tell me what would be correct behavior then I can fix it. > > If I understand you correctly there are 2 problems: > > 1) rtw_net_set_mac_address should return an error when not actually changing > the address, what error should that be, -EBUSY perhaps ? -ENOTSUPP ? EOPNOTSUPP I would say. This won't fix NM's bad behavior in this case though. See below... > 2) At every ifup the driver logs a "R8188EU: ERROR indicate disassoc" error, > if I understand you right you believe that it also sends a disassoc event to > userspace, right ? And the correct behavior would be to return an error from > the ifup ioctl, but to not also send the disassoc event, right ? This seems to just be informational, the "ERROR" part is because the function just happens to use some logging code that always prints ERROR. That's annoying but just cosmetic. What I think is a bug is that this function even gets called at all in this path, because the device isn't associated to anything yet, so why is it calling a disconnect notification? Anyway, cosmetic. ----- What appears to be happening is an NM bug and a race between NM and the supplicant: 1) NM takes device down 2) NM sets new MAC but of course it's ignored 3) NM brings the device up again 4) supplicant also watches netdev flags, notices the interface went !IFF_UP from step #1, and sets the interface state to DISABLED 5) NM notices the interface state is disable, logs it, but doesn't do anything 6) supplicant sees the IFF_UP from step #3, and sets interface state to DISCONNECTED 7) supplicant calls wpa_supplicant_scan() with scan_req=NORMAL_SCAN_REQ, but since there are no enabled networks in the supplicant config (NM only sends them to the supplicant after a scan) and since wpa_s->scan_req == NORMAL_SCAN_REQ, it rejects the scan and sets the interface to INACTIVE state 8) NM notices INACTIVE state (the supplicant doesn't emit DISCONNECTED because it coalesces dbus property changes), asks for a wifi scan since the interface is no longer disabled 9) go to step #1 Steps 7 & 8 were likely what eed8fd2e43d244caa856d9993e750ff19ba62fd7 was originally trying to solve, but this commit triggers step #9 and causes the problem. We can't just remove the _requested_scan_set(FALSE) (which would fix the issue) because then we would re-introduce the issue eed8fd2e43d244caa856d9993e750ff19ba62fd7 fixed.
And to be clear, commit eed8fd2e43d244caa856d9993e750ff19ba62fd7 triggering step #9 is itself newly caused by moving the wifi randomized scan MAC code into NM instead of the previous 1.2 behavior of letting the supplicant do it.
(In reply to Dan Williams from comment #7) > (In reply to Hans de Goede from comment #6) > > Hi, > > > > Yeah the driver sucks, but this is a quite popular driver (it is even > > enabled by default in Fedora even though it is in staging); and this used to > > work in F24. > > Yeah, which is the curse of staging drivers. If nobody uses them not many > are interested in cleaning them up, but they are often utterly awful. Ack. Note that this particular driver is being replaced by a new upstream driver called rtl8xxxu, which will grow support for the chips supported only by this staging driver for 4.9 I think. But there still are: https://github.com/hadess/rtl8723bs https://github.com/jwrdegoede/rtl8189ES_linux (2 branches for 2 different chips) Which are pretty much the same mess and are necessary support for various realtek sdio wifi chips found in ARM, Bay-trail and Cherry-trail devices. These are likely hit by the same issue (I have not tested this yet) Jes Sorensen is looking into adding support for these to rtl8xxxu in the future too, but that will likely take a while. <snip> > What appears to be happening is an NM bug and a race between NM and the > supplicant: > > 1) NM takes device down > 2) NM sets new MAC but of course it's ignored > 3) NM brings the device up again > 4) supplicant also watches netdev flags, notices the interface went !IFF_UP > from step #1, and sets the interface state to DISABLED > 5) NM notices the interface state is disable, logs it, but doesn't do > anything > 6) supplicant sees the IFF_UP from step #3, and sets interface state to > DISCONNECTED > 7) supplicant calls wpa_supplicant_scan() with scan_req=NORMAL_SCAN_REQ, but > since there are no enabled networks in the supplicant config (NM only sends > them to the supplicant after a scan) and since wpa_s->scan_req == > NORMAL_SCAN_REQ, it rejects the scan and sets the interface to INACTIVE state > 8) NM notices INACTIVE state (the supplicant doesn't emit DISCONNECTED > because it coalesces dbus property changes), asks for a wifi scan since the > interface is no longer disabled > 9) go to step #1 > > Steps 7 & 8 were likely what eed8fd2e43d244caa856d9993e750ff19ba62fd7 was > originally trying to solve, but this commit triggers step #9 and causes the > problem. We can't just remove the _requested_scan_set(FALSE) (which would > fix the issue) because then we would re-introduce the issue > eed8fd2e43d244caa856d9993e750ff19ba62fd7 fixed. Ok, thanks for the detailed analysis let me know when you've a fix ready, if you can do a scratch-build of NM for F25 with the fix (please include armv7 in the archs for the scratchbuild) I will happily test any fix you have. Regards, Hans
This is still happening and is still breaking wifi on various devices. Actually the rtl8723bs will be in drivers/staging for 4.12 and I just lost 2 hours figuring out why things would not work until I remembered this bug. I've since learned that adding: [device] wifi.scan-rand-mac-address=no to /etc/NetworkManager/NetworkManager.conf also works around this, if this is hard to fix, can we at least make NM automatically disable scan-rand-mac-address when it sees the rtl8723bs driver ?
(In reply to Hans de Goede from comment #10) [...] > can we at least make NM automatically disable > scan-rand-mac-address when it sees the rtl8723bs driver ? See: https://bugzilla.redhat.com/show_bug.cgi?id=1421383#c3
Also: "wifi: only attempt to set the scan MAC address when it actually changes" https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=023417292339e34dd07e5b902576c1472ff0c3dd
NetworkManager-1.8.0-0.2.rc3.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-555e0285ce
NetworkManager-1.8.0-0.2.rc3.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-555e0285ce
*** Bug 1445496 has been marked as a duplicate of this bug. ***
It could be that bug 143579 is the same as this one. What I see with the recent NetworkManager is that things seem to be worse (subjective). The network seems to be disconnecting and reconnecting much more frequently and, after a reconnect, it is already so messed up that I am often not even able to load a single page (before, I had a window of grace for anywhere from a minute to a few minutes, it seemed). I have also reported this this morning on: https://github.com/lwfinger/rtlwifi_new/issues/235 but it has only been about 9 hours and there has been no response. I had stumbled actross this project through my searches for a solution and noticed that there was much discussion here that appeared to be an exact match to what is happeneing with my system.
That was supposed to be bug 1435793.
(In reply to Peter Gückel from comment #16) > It could be that bug 143579 is the same as this one. What I see with the > recent NetworkManager is that things seem to be worse (subjective). The > network seems to be disconnecting and reconnecting much more frequently and, > after a reconnect, it is already so messed up that I am often not even able > to load a single page (before, I had a window of grace for anywhere from a > minute to a few minutes, it seemed). you don't mention which exact version of NM you used. A logfile is necessary to understand what really happens (contrary to what seems to happen). See the comment in https://cgit.freedesktop.org/NetworkManager/NetworkManager/plain/contrib/fedora/rpm/NetworkManager.conf?id=3876b10a4749638c3dcfa7e65b12bfee8030334c . Thanks.
NetworkManager-1.8.0-0.2.rc3.fc26.x86_64 I didn't restart NM with the increased debugging level, but here is the logfile I generated from the journalctl command.
Created attachment 1274695 [details] output of journalctl
I forgot to mention (the events of just now reminded me): when the wifi drops out and comes back, due to NM restarting it, it is generally (always?) impossible to load a web page. Although the network is back up, I can click on links and absolutely nothing happens. While the page is being sought, in the lower left corner, it says looking up <some web site> and it just hangs until Firefox displays a blank page with the notification that the server was not found. To get around this, I have learned, in order to be able to use my computer wirelessly, to run: sudo modprobe -r rtl8821ae;sudo modprobe rtl8821ae As the log file might illustrate, I run this command almost every minute or so.
*** Bug 1441255 has been marked as a duplicate of this bug. ***
does disabling mac-address-randomization while scanning avoid the issue? As mentioned, do that via [device] wifi.scan-rand-mac-address=no in /etc/NetworkManager/NetworkManager.conf Followed by `systemctl restart NetworkManager` While at it, enable level=TRACE logging before submitting a logfile. Thanks.
The device section did not exist in the file, so I added it, removed the pound sign from the TRACE line and restarted NM. I loaded about 10 web pages from 2 -3 different sites. It looked really good—no problems! Then, about 5 minutes in, the wifi dropped out and it took quite a long time (15-30 seconds? guess) for NM to notice and restart it. It did restart the connection and I did not need to issue the release and load kernel module commands. I just clicked on Firefox's Server not found, retry links in the otherwise empty tabs and the pages loaded just fine. A small victory. As I type this, without even loading a page, as I am typing into the comment filed of a page long loaded and idly opened, the network dropped out and reconnected again. NM sure does wait a long time before bothering to reconnect! But, once connected, everything works fine... until the next loss of connection. And no more modprobe -rrrrrrgh. This is definitely an improvement in making wifi usable and more painless, despite the ongoing and unresolved underlying issue. That issue might yet be the rtlwifi_new commits on github that might or might not have been completely successful and might or might not have made their way into the fedora 26 kernels.
It sure does take a long time for NM to realize that wifi is dead. The indicator drops to 0 bars displayed and the connection is dead and it just sits like that for a minute (I haven't time it), before a notification finally appears that the connection is deactivated. Then, it sits dead like that for another minute or so (again, no stopwatch), before NM finally initiates the reconnection, which typically goes fairly rapidly.
Do you want another logfile since enabling TRACE?
(In reply to Peter Gückel from comment #27) > Do you want another logfile since enabling TRACE? sure. Let's first stop NetworkManager and make sure that the MAC address of the Wi-Fi device is the permanent one. For that, compare the output of ethtool -P wlp3s0 and ip link show wlp3s0 (if it isn't, reset it with `ip link set wlp3s0 addr "$ADDR"`). Then start NM again, and provide the level=TRACE log. (don't attach all logs since 2 weeks ago. You can pass --since "$TIME" to journalctl)
NetworkManager-1.8.0-0.2.rc3.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.
(In reply to Thomas Haller from comment #28) > Let's first stop NetworkManager and make sure that the MAC address of the > Wi-Fi device is the permanent one. > > For that, compare the output of > > ethtool -P wlp3s0 > > and > > ip link show wlp3s0 > > (if it isn't, reset it with `ip link set wlp3s0 addr "$ADDR"`). Both commands show b0:c0:90:a1:91:79. > Then start NM again, and provide the level=TRACE log. > > > (don't attach all logs since 2 weeks ago. You can pass --since "$TIME" to > journalctl) OK. I didn't know how to limit. I did --since=today. Also, since this bug is now closed, despite being unresolved and in the process of being debugged, should we continue at my original bug report, bug 1435793, or will this bug automatically reopen itself, since it is clearly not solved?
Created attachment 1274795 [details] new output, since today
I should remind you that, while I verified the MAC address using both commands before restarting NM and generating the logfile output, the modified NM.conf with the added device section is still in place, as per your Comment 24.
In general, if you still have issues with the wifi.scan-rand-mac-address=no setting, your issue is not the same as this bug. If you still have issues with 1.8.0-0.2.rc3.fc26, it is also likely not this bug because this bug is considered closed. The logfile shows that after a while supplicant fails with <warn> [1493313695.1804] sup-iface[0x55e426f49010,wlp3s0]: connection disconnected (reason -4) which then leads to <info> [1493313710.6812] device (wlp3s0): state change: activated -> failed (reason 'ssid-not-found') [100 120 53] I am not sure what this means. Looks like a supplicant or driver issue. Getting logfiles for the supplicant would be helpful. See https://wiki.gnome.org/Projects/NetworkManager/Debugging#Debugging_wpa_supplicant_0.7_and_later (let's continue the discussion on the other bug, because the issue seems not to be the same).
(In reply to Thomas Haller from comment #33) > In general, if you still have issues with the wifi.scan-rand-mac-address=no > setting, your issue is not the same as this bug. > If you still have issues with 1.8.0-0.2.rc3.fc26, it is also likely not this > bug because this bug is considered closed. > > (let's continue the discussion on the other bug, because the issue seems not > to be the same). ok, over to bug 1435793.
Thomas, what happened to you? You said, "let's continue the discussion on the other bug," but I haven't seen you all day. Are you waiting for me to hear from upstream on github? Well, I sort of have, but we'll have to see what transpires. Nevertheless, there has been a new development: I have determined when the bug entered the kernel series...
(In reply to Peter Gückel from comment #35) > Thomas, what happened to you? You said, "let's continue the discussion on > the other bug," but I haven't seen you all day. I am not around every day :) Anyway, I don't have much to add to help with your problem. It does not look related to NetworkManager to me. As your investigation hints to, it seems to be a driver issue (great, that you were able to determine working and non-working versions). I don't have enough experience with such problems to contribute to a solution, neither I have to environment to reproduce the issue. The loglevel of NetworkManager you may reset or leave to your liking. Also, mac-address-randomization doesn't seem to matter for your issue either. But for now I would leave it disable (wifi.scan-rand-mac-address=no) to rules out a possible interference from this side. Good luck.
I left the randomization off, like you suggested. Yes, it could be the driver. There was supposed to have been a new one and others have complained, too. Something does seem to have improved, though. I don't know quite what it is. I got a new plasma 5.9.5-2 last night and I have the feeling that something improved dramatically. I am not sure if this included a new NetworkManager widget, but I believe so. I am still getting connectivity losses, but they sure are spaced out widely, into 5- or 10-minute intervals or seemingly even longer! It is really very liveable :-) (until the driver gets sorted out). There is some bizarre behaviour now that I didn't perceive before. When the network cuts out, I can sometimes rescue it, it seems to me. When I notice a page hang and the little box in the lower left of the tab says 'loading <some page>' and nothing happens, I open a new tab and load a second page and quite often that seems to jog the hung one. Very strange... or just subjective, but I think it is really happening. If I wait too long or I have done it a couple of times previously, then it doesn't seem to work. Maybe it is just subjective? lol Well, I discovered your web page this afternoon. You are the guy to talk to!!! I have been poring over the NM manual pages and I am confused about some possibly useful options. You might know the answers. When the NM widget suddenly displays a loss of connectivity, by fading out the signal strength bars, it just hangs for what seems to be a whole minute, before it finally says that the network cannot be found and it is disconnected. Is there a setting so that I can shorten this timeout? And, once it has finally realized that the connection is lost, instead of reconnecting right away, it waits for what must be a full minute before attempting to reconnect. Is there a setting so that I can shorten this timeout, too? I can handle the now much rarer connection losses, but why do I have to wait so long? Can't it just tell it to reconnect immediately, instead of waiting for 2 minutes? It always manages to reconnect, but why the lag? I would like to shorten it to almost nothing. Then, I would barely even notice it. Perhaps just need to click a link twice and that would be all, instead of waiting so long before I can resume my searches.
And another thing is very curious. When the NM widget shows that the signal has gone down to nothing and it is just hanging, if I click on the widget to display the avaibable connections, it shows absolutely *nothing*... it's just blank! It is as if it cannot even see the wifi signals in the area at all. Then, it sort of freezes or jerks after the subjective minute elapses and then all of the connections reappear and my connection is immediately tried automatically and the connection succeeds. Strange! Does this behaviour sound like something that would give you any further clues about what is happening?
(In reply to Peter Gückel from comment #38) > Does this behaviour sound like something that would give you any further > clues about what is happening? It is very unlikely that a GUI issue is related to this bug (issue in NM when driver cannot change the MAC address) or your issue (driver Wi-Fi issue).