Bug 1592596 - Regression: with stacked VPNs active, second VPN to activate uses "incorrect" routing to connect to its server
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: NetworkManager-openvpn
Version: 30
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: David Sommerseth
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-18 23:55 UTC by Dimitris
Modified: 2020-05-26 17:44 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-26 17:44:06 UTC
Type: Bug
Embargoed:


Attachments
NM log at TRACE level starting the devops (non-default-route) VPN (92.92 KB, text/plain)
2018-06-22 02:15 UTC, Dimitris

Description Dimitris 2018-06-18 23:55:57 UTC
Description of problem:

This is a regression from F27:

I use two VPNs: One is effectively "always on", pushing (and my client accepting) a default route.  All networks I connect to (WLAN, WWAN) are set up via nm-connection-editor to automatically activate this VPN as well.

The other VPN is used occasionally for work-related devops.  Although this server also pushes a default route, I'm using the "use this vpn only for connections on its network" setting in the NM openvpn UI to avoid this VPN being used as the default route.
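
For reference, that checkbox should map to NM's never-default setting; a rough command-line equivalent ("vpn_two" is a placeholder connection name):

# same effect as the "only for connections on its network" checkbox
nmcli connection modify "vpn_two" ipv4.never-default yes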

During F26 and F27, with both VPNs active my routing table would look like:

default via <vpn_one peer> dev tun0 proto static metric 50
default via <wlan gw> dev wlp3s0 proto dhcp metric 20600
<... some VPN-pushed routes>
<vpn_two_server_ip> via <vpn_one_peer> dev tun0 proto static metric <n>
<vpn_one_server_ip> via <wlan gw> dev wlp3s0 proto static metric <n>

Very, very occasionally I would see what I thought were fragmentation/MTU detection issues in this scenario, but otherwise everything worked great.

With the upgrade to F28, however, when I activate vpn_two (the secondary one) with vpn_one already running, I can't get traffic to any host behind a route that vpn_two is advertising.

The next-to-last routing table entry from above has now, with F28, changed to:

<vpn_two_server_ip> via <wlan gw> dev wlp3s0 proto static metric <n>

That seems "ok" in that it should avoid the MTU issues mentioned above.  However, it seems there's a race: this route is added to the table *after* the vpn_two connection is made.  I say this because, using tshark on the vpn_one tun0 interface, I can see the openvpn packets themselves flowing over that interface as the connection is established.

Then when I try to push some traffic through vpn_two, I also see packets destined for the vpn_two server, but originating on the "naked" WLAN interface/address (wlp3s0).

I don't have access to the vpn_two server's logs, but what I assume is happening is that, for whatever reason, this client's attempt to "float" doesn't work.

So the picture seems to be that the connection to vpn_two is established under the previous state of the routing table, talking via the existing tunnel.  But then the server-specific route is added, *ignoring* that and using the "real" interface (WLAN in this case), effectively breaking vpn_two.

It seems that the ideal scenario would be if the route to the vpn_two server were added *before* the connection to it is made.  Then we'd both avoid fragmentation/MTU issues and keep connectivity to both VPNs.

Version-Release number of selected component (if applicable):

2.4.6-1.fc28

How reproducible:

Every time

Steps to Reproduce:
1. Have two VPNs, one with a default route and one without.
2. Connect to the default-route VPN, then connect to the other one.
3. Hosts on the second VPN are not reachable.
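
The same steps as an nmcli sketch (connection names and addresses are placeholders):

nmcli connection up "vpn_one"      # the default-route VPN
nmcli connection up "vpn_two"      # the never-default VPN
ping <host_behind_vpn_two>         # fails
ip route get <vpn_two_server_ip>   # shows which route reaches the server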

Actual results:

Cannot reach hosts on the second VPN; see above.

Expected results:

Hosts on both VPNs are reachable, with only one VPN providing the default route, as worked under F27.

Additional info:

Comment 1 David Sommerseth 2018-06-20 16:26:33 UTC
How do you start the VPN tunnels?  Via NetworkManager?  systemd?

Comment 2 Dimitris 2018-06-20 16:29:17 UTC
Both are started with NetworkManager.  I'm under the impression, though, that the remote-host bypass route is added by OpenVPN itself, is that correct?

Comment 3 David Sommerseth 2018-06-20 16:37:03 UTC
I'm fuzzy about the implementation details between OpenVPN and NetworkManager.  I do know the NM-openvpn plugin picks up configuration details for the tun/tap device and lets NM do some of the work while some is handled by OpenVPN itself.  In addition, F-26, F-27 and F-28 use the exact same upstream OpenVPN version.

Where OpenVPN itself makes configuration changes, that is handled via iproute2 on Fedora.

Since this worked fine on Fedora versions older than F-28 ... this smells a bit like either iproute2 changed (perhaps OpenVPN needs to do something slightly differently) and/or NetworkManager-openvpn does something different.

Can you please try to start your VPN tunnels from the command line (openvpn --config /path/to/config) and see if it behaves differently?
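
Something along these lines should be enough for a standalone test (a minimal sketch; the remote, port and file paths are placeholders to be copied from your NM profile):

client
dev tun
proto udp
remote <vpn_two_server_ip> 1194
ca /path/to/ca.crt
cert /path/to/client.crt
key /path/to/client.key
# keep the secondary VPN from installing the pushed default route:
pull-filter ignore "redirect-gateway"
verb 4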

Comment 4 Dimitris 2018-06-20 16:56:25 UTC
Unfortunately under NetworkManager-openvpn there's no "real" openvpn config file.  The NM config is translated into command line options, which include:

--up /usr/libexec/nm-openvpn-service-openvpn-helper
--management /var/run/NetworkManager/nm-openvpn-<uuid>

so I think you're right, the interesting changes seem to be in NM and/or iproute2.
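
For completeness, the translated options can also be read back from the profile itself (a sketch; the connection name is a placeholder):

# show the full profile, including secrets:
nmcli -s connection show "vpn_two"
# the openvpn-specific options live in the vpn.data key/value list:
nmcli -g vpn.data connection show "vpn_two"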

Comment 5 David Sommerseth 2018-06-20 17:59:14 UTC
That's right.  You can extract most of what you need for a standalone configuration file from those command-line options.  The --up and --management options can be ignored for now (you might need to make DNS changes manually, though).

It would be interesting to know whether it is an NM issue, an iproute2 issue, or a problem in OpenVPN 2's iproute2 integration.  So if you have a chance to fully test this, it would be appreciated a lot!

Comment 6 Beniamino Galvani 2018-06-21 09:06:28 UTC
Hi, can you please provide NM journal logs for the issue, possibly at TRACE level [1]? Thanks.

[1] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28
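
For example (a sketch; the conf.d file name is arbitrary):

# at runtime (reverts when NM restarts):
nmcli general logging level TRACE domains ALL
# or persistently, e.g. in /etc/NetworkManager/conf.d/99-trace.conf:
#   [logging]
#   level=TRACE
# then collect the log with: journalctl -u NetworkManager -b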

Comment 7 Dimitris 2018-06-22 02:15:15 UTC
Created attachment 1453630 [details]
NM log at TRACE level starting the devops (non-default-route) VPN

Attaching a TRACE log captured while starting the second, non-default-route VPN.  I've sanitized some IP addresses:

<remote IP> is this VPN's remote IP address
<remote port> is the port for the same
<default VPN remote IP> is the already-connected VPN's remote IP address


The "offending" route is added after the VPN connection is established; 192.168.1.1 is the local WLAN gateway, even though the pre-existing tunnel's peer is in the routing table with a lower metric, and the UDP connection to the new remote started out over that route:

Jun 21 18:56:03 vimes NetworkManager[1602]: <trace> [1529632563.9994] platform: route: get IPv4 route for: <remote IP> oif 4
Jun 21 18:56:03 vimes NetworkManager[1602]: <trace> [1529632563.9995] platform-linux: event-notification: RTM_NEWROUTE, flags 0, seq 351: <remote IP>/32 via 192.168.1.1 dev 4 metric 0 mss 0 rt-src rt-unspec rtm_flags cloned scope global pref-src 192.168.1.169
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9995] platform: route: get IPv4 route for: <remote IP> succeeded: <remote IP>/32 via 192.168.1.1 dev 4 metric 0 mss 0 rt-src rt-unspec rtm_flags cloned scope global pref-src 192.168.1.169
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9996] device[0x55d5837b8050] (wlp3s0): ip4-config: update (commit=1, new-config=0x55d5838866c0)
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9996] platform: address: adding or updating IPv4 address: 192.168.1.169/24 lft 28402sec pref 28402sec lifetime 36197-0[28402,28402] dev 4 flags noprefixroute src unknown
Jun 21 18:56:03 vimes NetworkManager[1602]: <trace> [1529632563.9997] platform-linux: event-notification: RTM_NEWADDR, flags 0, seq 352: 192.168.1.169/24 lft 28402sec pref 28402sec lifetime 36197-36197[28402,28402] dev 4 flags noprefixroute src kernel
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9997] platform: signal: address 4 changed: 192.168.1.169/24 lft 28402sec pref 28402sec lifetime 36197-36197[28402,28402] dev 4 flags noprefixroute src kernel
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9997] device[0x55d5837b8050] (wlp3s0): queued IP4 config change
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9997] platform-linux: do-add-ip4-address[4: 192.168.1.169/24]: success
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9998] platform: route: append     IPv4 route: <remote IP>/32 via 192.168.1.1 dev 4 metric 600 mss 0 rt-src vpn
Jun 21 18:56:03 vimes NetworkManager[1602]: <trace> [1529632563.9998] platform-linux: event-notification: RTM_NEWROUTE, flags excl,create, seq 353: <remote IP>/32 via 192.168.1.1 dev 4 metric 600 mss 0 rt-src rt-static scope global
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9998] platform: signal: route   4   added: <remote IP>/32 via 192.168.1.1 dev 4 metric 600 mss 0 rt-src rt-static scope global
Jun 21 18:56:03 vimes NetworkManager[1602]: <debug> [1529632563.9999] platform-linux: do-add-ip4-route[<remote IP>/32 via 192.168.1.1 dev 4 metric 600 mss 0 rt-src rt-static scope global]: success

Comment 8 Beniamino Galvani 2018-06-23 09:32:24 UTC
When the first VPN activates, an in-memory connection for tun0 gets
created. On NM 1.8 that connection became the primary connection and NM
then used it to reach the 2nd VPN gateway.

Since 1.10, it seems we don't update the primary connection after the
first VPN connects, and so the 2nd VPN gateway is still reached
through the Wi-Fi interface.  But even if the primary connection were
properly set to tun0, the tun0 connection is 'external' and so
normally NM wouldn't add a new route through it. I'm investigating how
to fix these two issues.

Comment 9 Dimitris 2018-06-23 16:42:25 UTC
Thanks for looking into this.  For what it's worth, routing the second VPN's tunnel through the WLAN interface isn't bad - it actually improves performance (tunnel overhead and MTU).  If it's possible to move the addition of the /32 route to the VPN server to *before* the VPN connection is made, I think this will let us have the best of both worlds.  It should result in tun1 traffic also using the previous default route, despite its higher metric.
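
To illustrate the ordering I mean, a manual sketch (addresses, device and metric are placeholders):

# pin the route to vpn_two's server *before* bringing the tunnel up:
ip route add <vpn_two_server_ip>/32 via 192.168.1.1 dev wlp3s0 metric 50
nmcli connection up "vpn_two"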

What I'm seeing now is that, after the second VPN connection is made, outbound tun1 tunnel packets are going out of the WLAN interface, but the remote server sends back traffic over the route that has them arrive over tun0.

I've tried adding --float to the tun1 client's config.  The very first time I tried it, tun1 actually worked, and I thought that was a fix/workaround.  However it does seem to have been a race where I got lucky just that once; every time after that, --float didn't help.

Comment 10 David Sommerseth 2018-06-23 23:34:01 UTC
(In reply to Dimitris from comment #9)
> I've tried adding --float to the tun1 client's config.  The very first time
> I tried it, tun1 actually worked, and I thought that was a fix/workaround.
> However it does seem to have been a race where I got lucky just that once;
> every time after that, --float didn't help.

--float is a fairly obscure feature.  When used on the client side, it only makes the client a little more relaxed if the public server IP changes.  When used on the server side, the server is a bit more relaxed if the public client IP changes.  And by "relaxed" I mean that it won't do a full TLS reconnect, but will continue to use the negotiated TLS parameters and session keys.

With OpenVPN 2.4 servers and 2.3-and-newer clients, the --float feature is even less "useful", as those implement something called a per-session peer-id.  This peer-id is used as a key to reuse the negotiated TLS parameters and session keys if the client IP changes.  It does not need to be explicitly enabled; it activates automatically if both sides support it.

Bottom line is, I would not expect --float to make much difference in this context ... as one of the client sessions would have to be "moved" from one public IP address to another.

Comment 11 Beniamino Galvani 2018-07-02 14:10:39 UTC
(In reply to Dimitris from comment #9)
> Thanks for looking into this.  For what it's worth, routing the second VPN's
> tunnel through the WLAN interface isn't bad - it actually improves
> performance (tunnel overhead and MTU).  If it's possible to move the
> addition of the /32 route to the VPN server to *before* the VPN connection
> is made, I think this will let us have the best of both worlds.  It should
> result in tun1 traffic also using the previous default route, despite its
> higher metric.

This is not possible at the moment due to how NM and the plugins interact: the IP configuration, including the external gateway, is known to NetworkManager only after the connection is established.

I've pushed branch bg/stacked-vpn-rh1592596, which should restore the previous (NM 1.8) behavior. However, there are still some issues; for example, the fact that we now manage the tun device causes duplicate routes with different metrics to be added.

Comment 12 Dimitris 2018-08-21 01:02:30 UTC
FWIW, while waiting for this patch to hit a release, I can restore the previous behavior by adding an explicit host-route push to the "always on for privacy" VPN's server config:

push "route <devops VPN server IP> 255.255.255.255 vpn_gateway 50"

Unfortunately, trying to use net_gateway instead, to avoid the double tunneling and potential MTU issues (i.e. having the cake and eating it too), doesn't work; even though the route is pushed, per journald:

Static Route: <devops VPN server IP>/32  Next Hop: 192.168.1.1

the route doesn't actually make it to the routing table.

*Probably* a different bug though, correct?
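
For reference, a hand-rolled client-side equivalent of that push, added once the always-on VPN is up (a sketch; addresses are placeholders):

ip route add <devops_vpn_server_ip>/32 via <vpn_one_peer> dev tun0 metric 50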

Comment 13 Ben Cotton 2019-05-02 21:03:20 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 28 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Dimitris 2019-05-21 16:22:38 UTC
Still happening with F29

Comment 15 Ben Cotton 2019-10-31 19:17:48 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 29 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 16 Dimitris 2019-11-01 06:05:50 UTC
Still happening with F30

Comment 17 Ben Cotton 2020-04-30 20:48:37 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 30 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 18 Ben Cotton 2020-05-26 17:44:06 UTC
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

