Created attachment 947288 [details]
Kernel stack dump preceding deadlock

In brief: I'm trying to set up a VPN connection, using PPP over L2TP over IPSEC to a Cisco ASA server. The server works fine with Windows and OS-X clients, but not with NetworkManager-l2tp.

I have tracked down multiple reasons for these failures. Some I have worked around but others are baffling. I will need help to fix them.

Some information to start:

[stern@saphir ~]$ uname -a
Linux saphir.localdomain 3.16.4-200.fc20.i686 #1 SMP Mon Oct 6 13:22:51 UTC 2014 i686 i686 i386 GNU/Linux
[stern@saphir ~]$ rpm -q NetworkManager-l2tp xl2tpd NetworkManager-openswan libreswan ppp
NetworkManager-l2tp-0.9.8.7-1.fc20.i686
xl2tpd-1.3.6-1.fc20.i686
package NetworkManager-openswan is not installed
libreswan-3.10-3.fc20.i686
ppp-2.4.5-34.fc20.i686

First problem: selinux violations. This has been reported before in bug #887674 and supposedly was fixed. Not on my system. For now I work around this by running "setenforce 0" before starting the VPN connection. There doesn't seem to be any point in trying to fix this issue until everything else is working.

Second problem: The /usr/libexec/nm-l2tp-service program doesn't start the pluto daemon properly. Probably because it's not integrated with systemd, but I didn't try to investigate the reason for the failure. I edited the program's source code and in nm_l2tp_start_ipsec() changed

	"[ \"x$defaultrouteaddr\" = \"x\" ] && ipsec setup restart");

to

	"[ \"x$defaultrouteaddr\" = \"x\" ] && /usr/bin/systemctl restart ipsec.service");
	sleep(2);

The sleep() appears to be necessary because the daemon takes some time to get started. That worked (and so did starting the ipsec service by hand before turning on the VPN connection).

Third problem: The following line, where the program does

	sys += system("PATH=/usr/local/sbin:/usr/sbin:/sbin ipsec whack" " --listen");

always returns a nonzero value, which messes up the error handling.
I don't know why the return value is nonzero, but I changed the program to ignore the return value instead of adding it into sys.

Fourth problem: When the program creates a temporary /etc/ipsec.secrets file to store the PSK, the file it creates is world-readable! Even though the file persists for a short time, this is obviously a security breach -- especially when something goes wrong and the temporary file is not erased (which happened to me repeatedly).

Fifth problem: When the program creates the libreswan config file, it does not specify a rightprotoport parameter. The pluto daemon ends up using 17/0 by default, and the VPN server doesn't like this. Without a port number, the server doesn't realize that this connection will use L2TP. I added

	write_config_option (ipsec_fd, " rightprotoport=17/1701\n");

to the appropriate place in the program source, which resolved this issue.

Sixth problem: After all the previous fixes, the connection is established. xl2tpd and pppd are started. But they don't work right. For example, /var/log/secure contains this entry:

Oct 13 17:11:27 saphir pluto[2182]: ERROR: "nm-ipsec-l2tpd-2573" #1: sendto on wlan0 to 140.247.233.37:4500 failed in NAT-T Keep Alive. Errno 105: No buffer space available

That error 105 comes straight from the kernel, but I don't know the reason. A possibly related series of errors appears in /var/log/messages, multiple repetitions of:

Oct 13 17:10:28 saphir NetworkManager: xl2tpd[2616]: network_thread: select timeout

followed by:

Oct 13 17:11:32 saphir NetworkManager: xl2tpd[2616]: Maximum retries exceeded for tunnel 33716. Closing.

I'm at a loss to tell what the reason is for those failures. Also, as far as I can tell, no data ever gets transmitted through the PPP tunnel.

On numerous attempts, the computer deadlocked. I was able to capture a stack dump from one of those occasions; it is in the first attachment.
As near as I can interpret it, the routine ppp_channel_push() (near the end of the dump) acquires the pch->downl spinlock and then calls pch->chan->ops->start_xmit(), which indirectly calls down to ppp_push(), which tries to acquire the same spinlock. This seems to indicate that the kernel tried to route a PPP-encapsulated packet through the PPP tunnel itself, which is clearly wrong. I don't know what could cause this -- some sort of routing misconfiguration?

I've got detailed logs (probably *too* detailed) showing the setup of the IPSEC tunnel, which seems to work okay. (And indeed, the client does receive an IP address that's in the VPN server's DHCP range, so some information definitely is getting sent.) But even though I have enabled debugging for xl2tpd and pppd, the logs don't contain much to indicate what's going wrong.

I'll also attach the relevant portion of /var/log/messages.
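Regarding the third problem above: the value returned by system() is a wait status rather than a plain exit code, so adding it into a counter hides what actually happened. The following is only a sketch of how the status could be decoded to see why "ipsec whack --listen" reports failure. The helper name run_command() is hypothetical and does not appear in nm-l2tp-service.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not part of nm-l2tp-service.c).
 * Run a shell command and return its exit code (0-255), or -1 if the
 * command could not be run or was terminated by a signal. The raw value
 * from system() is a wait status, so it is decoded with the <sys/wait.h>
 * macros before being used.
 */
int run_command(const char *cmd)
{
    int status = system(cmd);

    if (status == -1) {
        perror("system");
        return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        fprintf(stderr, "'%s' killed by signal %d\n", cmd, WTERMSIG(status));
    return -1;
}
```

Logging the result of run_command("PATH=/usr/local/sbin:/usr/sbin:/sbin ipsec whack --listen") would at least show whether the command exits with a specific error code or never runs at all.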
Created attachment 947289 [details]
Log messages from VPN connection attempt

This is what appeared in /var/log/messages (with PSK and password redacted) during a recent VPN connection attempt.
Hi, Alan. Thank you for the great bug report.

Unfortunately, as the upstream developer, I don't use the ipsec part of the plugin. It was contributed before I joined the project, so I really have no expertise in it. But if you can make a patch, I'll definitely take a look at it and try to merge it.
I've got a series of six (!) patches fixing various aspects of this thing. With them in place, NetworkManager succeeds in setting up and tearing down the VPN tunnel. But the tunnel still doesn't work; I will need help to debug the underlying problem. I will attach the patches to this bug report. The first one changes a file in the libreswan package, but the others affect only nm-l2tp-service.c.
Created attachment 948191 [details]
Make "ipsec setup restart" work if the service isn't already running

Patch 1: change the "ipsec setup" script. It's not entirely clear that this is the right thing to do; however, there needs to be some way to (1) start the service if it isn't running and (2) restart it if it is. Currently the "start" command does (1) and the "restart" command does (2), but nothing does both.
Created attachment 948193 [details]
Add a delay after the ipsec service is started

Patch 2: Without this change, nm-l2tp-service tries to connect to the ipsec service before the service's daemon is fully initialized. I don't know how to determine a good length for the delay, but one second works on my system.
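An alternative to a fixed delay, shown here only as a sketch and not as what the attached patch does, would be to poll for some sign that the daemon is up and give up after a timeout. The helper wait_for_path() is hypothetical. Using the pluto control socket as the readiness indicator (for example /run/pluto/pluto.ctl) is an assumption that would need to be checked against the installed libreswan, and the socket's existence does not strictly guarantee that the daemon is ready to accept commands.

```c
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper (not part of nm-l2tp-service.c).
 * Poll every 100 ms until `path` exists or roughly `timeout_ms`
 * milliseconds have elapsed. Returns 0 once the path exists, -1 on timeout.
 */
int wait_for_path(const char *path, int timeout_ms)
{
    const struct timespec step = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    struct stat st;
    int waited_ms = 0;

    for (;;) {
        if (stat(path, &st) == 0)
            return 0;
        if (waited_ms >= timeout_ms)
            return -1;
        nanosleep(&step, NULL);
        waited_ms += 100;
    }
}
```

A call such as wait_for_path("/run/pluto/pluto.ctl", 5000) would return as soon as the socket appears, so fast systems would not wait the full interval and slow systems would get more than one second.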
Created attachment 948194 [details]
Make the ipsec.secrets file not world-readable

Patch 3: This one is a no-brainer.
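For illustration, one way to make sure a secrets file is never world-readable is to create it with mode 0600 and enforce that mode with fchmod() before any secret data is written. This is a minimal sketch under that assumption, not necessarily what the attached patch does. The function name write_secret_file() is hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of nm-l2tp-service.c).
 * Create or truncate `path`, force its mode to 0600, and only then
 * write the secret. fchmod() covers the case where the file already
 * existed with looser permissions, since open() applies its mode
 * argument only when it creates the file.
 * Returns 0 on success, -1 on error.
 */
int write_secret_file(const char *path, const char *secret)
{
    size_t len = strlen(secret);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);

    if (fd < 0)
        return -1;
    if (fchmod(fd, S_IRUSR | S_IWUSR) != 0 ||
        write(fd, secret, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Because the permissions are tightened before the write, the PSK is never on disk in a world-readable file, even if the program later fails to erase the file.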
Created attachment 948195 [details]
Remove an unnecessary command

Patch 4: It appears from the name that the person who added cmd11 in the first place knew that it did pretty much the same thing as cmd1. There's no explanation in the code for why both are present, and it works just as well without cmd11.
Created attachment 948196 [details]
Keep track of the IPsec connection's state

Patch 5: another no-brainer.
Created attachment 948197 [details]
Fix the parameters passed to the ipsec daemon

Patch 6: Without the rightprotoport parameter in particular, my server will never accept the VPN connection request.
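As a small illustration of the rightprotoport setting: 17 is the IP protocol number for UDP and 1701 is the L2TP port, so 17/1701 restricts the connection to L2TP traffic, whereas 17/0 matches any UDP port. The sketch below formats such a line into a buffer. The helper format_rightprotoport() is hypothetical and is not the write_config_option() function used in nm-l2tp-service.c.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of nm-l2tp-service.c).
 * Format a libreswan "rightprotoport" config line for the given IP
 * protocol number and port. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if buf is too small.
 */
int format_rightprotoport(char *buf, size_t size, int proto, int port)
{
    int n = snprintf(buf, size, "  rightprotoport=%d/%d\n", proto, port);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling format_rightprotoport(buf, sizeof buf, 17, 1701) produces the line "  rightprotoport=17/1701", which could then be written into the generated connection section.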
(In reply to Alan Stern from comment #3)
> I've got a series of six (!) patches fixing various aspects of this thing.
> With them in place, NetworkManager succeeds in setting up and tearing down
> the VPN tunnel. But the tunnel still doesn't work; I will need help to
> debug the underlying problem.
>
> I will attach the patches to this bug report. The first one changes a file
> in the libreswan package, but the others affect only nm-l2tp-service.c.

Hi. I've scheduled a task to review and apply your patches in the project's bug tracker here:
https://github.com/seriyps/NetworkManager-l2tp/issues/25
Okay. I don't know how the upstream developers for libreswan would react to proposed patch #1 (which isn't in the right format for a source-package change anyway). They might think the way things work currently is preferable, in which case nm-l2tp-service would have to work around the problem (for example, by stopping ipsec and then starting it instead of simply doing a restart).

I figured out the reason why my VPN connection didn't work even after all these changes. It turned out to be a configuration problem in the VPN server, which I was able to work around -- it has nothing to do with NetworkManager. Which is good, because it means that issue is irrelevant to this bug report.

On the other hand, I haven't yet tried to investigate the selinux violations. I'll add more information about that when I have a chance.
This message is a reminder that Fedora 20 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 20. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '20'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 20 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.