Bug 428994
Summary: libvirtd seems to interfere with SCTP connects
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: low
Priority: low
Target Milestone: rc
Reporter: Nate Straz <nstraz>
Assignee: Neil Horman <nhorman>
QA Contact: Martin Jenner <mjenner>
CC: djansa, herbert.xu, nhorman, xen-maint
Doc Type: Bug Fix
Last Closed: 2008-01-30 18:44:51 UTC
Description
Nate Straz
2008-01-16 17:17:48 UTC
Try disabling the libvirt default network with

    virsh net-destroy default
    virsh net-autostart --disable default

If that works, then collect a tcpdump of traffic when the default network is running, and another when it is not running.

I don't think those commands will do anything unless I'm running the -xen kernel with the xen hypervisor.

    [root@dash-03 ~]# virsh net-destroy default
    virsh: error: failed to connect to the hypervisor
    [root@dash-03 ~]# virsh net-autostart --disable default
    virsh: error: failed to connect to the hypervisor

I'm not running either. I'm just running the standard kernel and when libvirtd starts my SCTP connects stop working. I'll collect tcpdumps with and without libvirtd started.

Created attachment 293193 [details]
tcpdump from dash-03 with libvirtd running and sctp failing
Here is a tcpdump from dash-01 which was trying to start 5 d_doio processes.
Each process tries to connect to d_iogen on another machine. The first three
processes connect, the last two do not.
nstraz: you want to try:

    $> virsh -c qemu:///system net-destroy default

You don't need to be running a xen kernel.

Okay, I'm not at all familiar with SCTP, but here's what the dump shows:

- The INIT packets contain the address of the virbr0 bridge interface (192.168.122.1), which is an address we don't want exposed to the external network
- 3 of the connections succeed, but 2 receive an ABORT with a "Restart of an association with new addresses" (SCTP_ERROR_RESTART) cause, with the external facing IP address as the specified new address
- All 5 INIT packets contain the same list of IP addresses
- The way I interpret the SCTP code is that it thinks these two INITs are duplicate INITs on existing associations, but that the existing associations didn't contain the external address

Best guess is that each of the cluster nodes has virbr0 and they are simultaneously creating a number of connections originating from the same port, but because each of the connections specifies the IP address of the bridge as a valid transport, the connections are interpreted as restarts of an existing connection, e.g.

    Node A                       Server                       Node B
    ======                       ======                       ======
    INIT 10.15.89.98:32928
         192.168.122.1:32928 ->
                             <-  ACK
                                                INIT 10.15.89.97:32928
                                            <-       192.168.122.1:32928
                                 ABORT      ->
    INIT 10.15.89.98:32929
         192.168.122.1:32929 ->
                             <-  ACK
                                                INIT 10.15.89.97:32929
                                            <-       192.168.122.1:32929
                                 ABORT      ->
    INIT 10.15.89.98:32930
         192.168.122.1:32930 ->
                             <-  ACK
                                                INIT 10.15.89.97:32930
                                            <-       192.168.122.1:32930
                                 ABORT      ->
                                                INIT 10.15.89.97:32931
                                            <-       192.168.122.1:32931
                                 ACK        ->
    INIT 10.15.89.98:32931
         192.168.122.1:32931 ->
                             <-  ABORT
                                                INIT 10.15.89.97:32932
                                            <-       192.168.122.1:32932
                                 ACK        ->
    INIT 10.15.89.98:32932
         192.168.122.1:32932 ->
                             <-  ABORT

So, two things:

1) Need to verify this theory. nstraz: can you get a tcpdump from the server? Right now we can only see the "Node A" side of the interaction.
2) Need to figure out how to make sure that the virbr0 address doesn't appear in the list of addresses in the outgoing packets. herbert, nhorman?
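The theory above can be checked against a toy model. The sketch below is not the kernel's lookup logic. It assumes a server that remembers every (address, port) pair listed in an accepted INIT and aborts a later INIT from a different source that lists a pair it already knows. The `Server` class, `handle_init` and `run_scenario` are names made up for this illustration; the addresses and ports are the ones from the dump.

```python
VIRBR0 = "192.168.122.1"  # libvirt's default bridge address, identical on every node


class Server:
    """Toy model of the receiving side; an assumption, not the real kernel code."""

    def __init__(self):
        # (address, port) listed in an accepted INIT -> source address that owns it
        self.known = {}

    def handle_init(self, src_addr, port, init_addrs):
        for addr in init_addrs:
            owner = self.known.get((addr, port))
            if owner is not None and owner != src_addr:
                # Looks like a restart of an existing association coming
                # from an address that association never listed.
                return "ABORT"
        for addr in init_addrs:
            self.known[(addr, port)] = src_addr
        return "ACK"


def run_scenario():
    node_a, node_b = "10.15.89.98", "10.15.89.97"
    server = Server()
    results = []
    # Node A wins the race on the first three ports, node B on the last two.
    order = [
        (32928, node_a, node_b),
        (32929, node_a, node_b),
        (32930, node_a, node_b),
        (32931, node_b, node_a),
        (32932, node_b, node_a),
    ]
    for port, first, second in order:
        for src in (first, second):
            # Unbound socket: the INIT lists the real address and the bridge address.
            reply = server.handle_init(src, port, [src, VIRBR0])
            results.append((src, port, reply))
    return results


if __name__ == "__main__":
    for src, port, reply in run_scenario():
        print(f"INIT from {src}:{port} -> {reply}")
```

In this model node A gets three ACKs followed by two ABORTs, which matches the "first three connect, last two do not" observation. If each INIT lists only the sender's own address, no collision occurs.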
nstraz: if this is the problem, one thing that should fix it for you is to have d_doio bind its socket to its external IP address so that the virbr0 address won't be contained in the SCTP INIT messages.

That sucks as a solution, but I'm not sure the kernel or libvirt is doing anything "wrong" here - i.e. libvirt has set up a network interface which isn't reachable by the server, the kernel can't know which network interfaces are reachable by the server, and the whole thing blows up because each of the clients has a virbr0 using the same address.

Still curious whether herbert or nhorman have any ideas.

Mark, your theory looks pretty solid to me. If you send an INIT chunk containing the same IP address from the same port, SCTP will have no way to tell if it's from a separate host or not. I think the best thing to do is look at the code for the d_doio utility and see how it's generating the list of addresses to send in the INIT list (most likely it's just blindly adding all the node addresses via sctp_bindx or some such). We'll most likely have to change the d_doio code to be a little more intelligent about that. Can you point me to where I get a copy of that code? Thanks!

Created attachment 293272 [details]
tcpdump from d_iogen side
Here is the tcpdump of sctp from the d_iogen side.
nhorman: I found the source here:

    http://sts.lab.msp.redhat.com/dist/brewroot/repos/qe-rhel5/SRPMS/sts-rhel5.2-3-1.el5.src.rpm

see src/d_doio/d_doio.c:init_connection() - it's not binding to any address at all, which means the address list must be coming from sctp_copy_local_addr_list(), right?

Note one thing you might have missed - libvirt is configured by default to create a bridge interface assigned with 192.168.122.1, so it's perfectly possible that each node is sending out an INIT with the same address. You could imagine a similar situation if e.g. multiple linksys/DSL/NAT routers with a default configuration were trying to initiate an SCTP connection with a server. That's in effect what libvirt is implementing, a NAT router for virtual machines.

nstraz: thanks, I think the tcpdump confirms it ... the first and third INIT originate from the same port on different hosts, both contain the virbr0 address, and the second of those two gets an ABORT response complaining that its address is a new address for an existing connection.

I found /etc/libvirt/qemu/networks/default.xml on each node and changed them to talk on different subnets. The test program is now working as expected. This is probably something we will have to note in our cluster documentation.

nhorman: re-assigning to the kernel and you since I don't think libvirt is doing anything wrong here. It's probably fine to close this as WONTFIX with the conclusion that:

    SCTP requires the application to explicitly bind to the external IP address(es) it wishes to be contactable by, otherwise other (private) IP addresses assigned to network interfaces on the system may conflict with existing associations on the server that may have included that same IP address in its INIT chunk.

But two suggestions to first consider:

- Perhaps by default the Linux SCTP implementation should not include any IP addresses in the INIT chunk? I don't immediately see anything in the RFCs which requires this, and it might make sense to only include additional addresses if the application explicitly binds to them.
- Or alternatively, perhaps we should only look for existing associations based on the actual remote endpoint address of the existing associations (i.e. ignore the transport addresses supplied in their INIT chunks), i.e. rather than

      def find_existing_assoc(new_assoc):
          for assoc in existing_assocs:
              for addr in assoc.transport_addrs:
                  for new_addr in new_assoc.transport_addrs:
                      if addr == new_addr:
                          return assoc
          return None

  do:

      def find_existing_assoc(new_assoc):
          for assoc in existing_assocs:
              for new_addr in new_assoc.transport_addrs:
                  if assoc.peer_addr == new_addr:
                      return assoc
          return None

  Not sure whether this would still be correct, but e.g. Section 5.1.2 of RFC 4460 says:

      An INIT or INIT ACK chunk MUST be treated as belonging to an already established association (or one in the process of being established) if the use of any of the valid address parameters contained within the chunk would identify an existing TCB.

  which doesn't specify what addresses in the existing TCB you match against.

I guess that since these addresses are supposed to be valid "alternate addresses", then (1) is a more likely fix than (2) - i.e. they're not valid alternate addresses, so we shouldn't be sending them out at all.

I'll close it as NOTABUG. I agree that libvirt isn't doing anything wrong, it's just making assumptions about the default behavior regarding the sctp protocol. When using sctp, user space utilities need to call bind (or sctp_bindx) to specify which local address will be used by associations on the specified socket. This is mandated in section 3.1.2 of the sctp api:

    http://www3.tools.ietf.org/html/draft-ietf-tsvwg-sctpsocket-15#section-3.1.2

This behavior is implemented in sctp_autobind down in the kernel.
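The two lookup variants proposed in pseudocode above can be made runnable for comparison. This is only a sketch of the idea, not the kernel's association lookup: the `Assoc` record and both function names are hypothetical, and the port comparison is left out, as in the pseudocode, on the assumption that both INITs use the same source port.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assoc:
    peer_addr: str              # source address taken from the IP header
    transport_addrs: List[str]  # addresses listed in the INIT chunk


def find_by_any_transport_addr(existing: List[Assoc], new: Assoc) -> Optional[Assoc]:
    """First variant: any listed address in common identifies the association."""
    for assoc in existing:
        for addr in assoc.transport_addrs:
            if addr in new.transport_addrs:
                return assoc
    return None


def find_by_peer_addr(existing: List[Assoc], new: Assoc) -> Optional[Assoc]:
    """Second variant: match only against each association's actual peer address."""
    for assoc in existing:
        if assoc.peer_addr in new.transport_addrs:
            return assoc
    return None


VIRBR0 = "192.168.122.1"
node_a = Assoc("10.15.89.98", ["10.15.89.98", VIRBR0])
node_b = Assoc("10.15.89.97", ["10.15.89.97", VIRBR0])
existing = [node_a]

# Node B is a different host that happens to list the same bridge address.
print(find_by_any_transport_addr(existing, node_b) is node_a)  # True: false match
print(find_by_peer_addr(existing, node_b) is None)             # True: treated as new
# A genuine restart from node A is still recognised by the second variant.
print(find_by_peer_addr(existing, node_a) is node_a)           # True
```

The sketch only shows how the two variants differ on the virbr0 case; it says nothing about whether the second variant is acceptable for multihoming failover.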
Regarding the suggestions above, the first is definitely possible, and in fact available, but requires the user space utility to use bind to specify the address it will use. We can't just look for associations based on the remote endpoint address. I don't think that would allow us to receive frames on associations if one address failed and we had to use one of the alternative addresses specified in the INIT chunk during retransmits. From my view, (1) is the right solution, and that's currently available. It just requires the application to do the appropriate setup to get the desired behavior. The default behavior just isn't what d_doio wants in this case.

I don't think you understand how d_doio is trying to use SCTP. It is only trying to take advantage of the message boundary features of SCTP using a one-to-one style socket. That's section 4, not section 3. The automatic multi-homing is what is causing d_doio trouble.

The real trouble is from libvirtd always using the same IP for its NAT interface. When VMs are configured on a cluster of machines, this will cause problems. Each physical machine needs to use a different NAT IP or any SCTP applications will get confused as d_doio did.

Is this a bug in SCTP? No.
Is this a bug in libvirtd? Not really.
Is this a configuration issue which needs to be documented and addressed? Yes.

Requiring that each libvirt NAT network use a different IP address range is completely missing the point of this being *NAT*. The 192.168.122.* addresses are never supposed to leave the physical machine. This is why the virbr0 bridge is not connected to any physical devices, and why no routing rules will direct its traffic to a physical device. Any traffic from virbr0 (and thus the guests) must be NAT'd before leaving the machine. If SCTP can't do this, then it shouldn't try to use the IP address from virbr0 at all. Requiring that the libvirt network is changed on each machine is not a scalable answer to this problem.
Whatever in ClusterSuite is using SCTP should only generate packets from real physical interfaces (which it can discover via sysfs or HAL), and not blindly use all addresses configured on a machine.

Let me make this clear. d_doio is part of the test suite for Cluster Suite/GFS. Nothing _in_ Cluster Suite uses SCTP at the moment. cman was going to use it, but that was pulled because of problems with SCTP in VM clusters. I'm guessing because of this exact issue. So since this is a QA test suite issue only, there's no need to document changing libvirt networking setup for RHEL.

Correct me if I am mistaken -- wouldn't any use of SCTP as used in the test tools cause this issue in a VM config? If so, why shouldn't we document this?

I still think it needs to be addressed, though, and it needs to be addressed in userspace, since the kernel can't know in which cases to include what address in the multihoming list, and it seems to me that sctp in a clustered vm environment would be handy to have.

In response to Nate's comments in comment 15, yes. I'm on the same page as you. The default multihoming features of SCTP are what's causing you grief. I'm sorry I didn't realize that you were using a 1:1 style socket rather than a 1:many style socket, but the bind behavior in section 4.1.2 for 1:1 sockets is exactly the same as for 1:many style sockets, which is to use all interfaces if no list is specified. The fix for that is to use sctp_bindx to explicitly list the interfaces that you want d_doio to use, and to make sure the IP address of virbr0 is not in that list.

In response to Dan's comments in comment 16, this can work just fine. SCTP is a transport layered on IP, just like UDP and TCP. As such it will be NATed just like any other IP frame. The problem comes in with the semantics of the protocol. When sctp establishes a new connection, it sends an INIT chunk to its peer(s). This INIT chunk's payload contains a list of addresses on the sending system which can be used for multihoming.
Since this data is in the payload, rather than the IP header, it escapes any modification during the NAT process. The mandated default behavior in sctp, according to the implementation guide, is to bind to all available interfaces if no list is specified via sctp_bindx(). If libvirt wants to use SCTP, it's going to need to mask away the virbr0 interface IP address via sctp_bindx() when it creates an SCTP socket.

(In reply to comment #14)
> I agree that libvirt isn't doing anything wrong, it's just making assumptions
> about the default behavior regarding the sctp protocol.

libvirt isn't making any assumptions about sctp. You mean d_doio?

> Regarding the suggestions above, the first is definitely possible, and in fact
> available, but requires the user space utility to use bind to specify the
> address it will use.

The suggestion was:

    Perhaps by default the Linux SCTP implementation should not include any IP addresses in the INIT chunk?

Is there any way a user space application can do this? As you pointed out, NAT can't translate the addresses in the INIT chunk, so that's another reason why I'd think it would be better for most applications if they just let the endpoint address from the IP header be the only address used for creating the association.

>libvirt isn't making any assumptions about sctp. You mean d_doio?

Whoever is opening an sctp socket is making these assumptions regarding the default behavior of the protocol. I'm afraid I really don't know enough about the cluster setup to know which user space utilities that includes (although the source from d_doio certainly seems to be on that list). Regardless, any application being used in this clustered vm environment is going to need to be aware of this behavior.

>The suggestion was:
>
> Perhaps by default the Linux SCTP implementation should not include
> any IP addresses in the INIT chunk?
>
>Is there any way a user space application can do this? As you pointed out, NAT
>can't translate the addresses in the INIT chunk, so that's another reason why
>I'd think it would be better for most applications if they just let the
>endpoint address from the IP header be the only address used for creating the
>association.

Yes, it is possible for sctp to not include any additional IP addresses in the INIT chunk (or more to the point, to only include non-NATed addresses). I've explained how to do this in comment #20. One of the key features in sctp is the ability to multihome. The implementation guide referenced in comment 14 mandates that, unless an explicit list of addresses to multihome on is specified by the application using the socket (via sctp_bindx), the protocol is to use all available IP addresses for multihoming. The addresses in the INIT chunk are the list of addresses to multihome on. So by using sctp_bindx in the userspace application, you can add addresses to, or remove addresses from, that list. That's why I've been saying that userspace needs to be the one to correct this behavior, be it d_doio, or libvirt, or any application using an sctp socket in a clustered vm environment.

Okay, sounds like this is something fundamentally required by the protocol specs and so is a potential problem for any SCTP client application. The advice to application developers would then be:

Q. I'm getting reports that my application fails to connect to a server when another client has already connected from the same client port on another machine with an identical IP address to one of the IP addresses on the connecting machine. How do I make my application robust in this situation?

A. Easy. You need to explicitly bind the client to a set of addresses so that the server never sees these duplicate addresses.
The pseudo code goes like:

    fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)
    bind(fd, INADDR_ANY)
    for iface in get_network_ifaces():
        if iface == get_default_route_iface():
            continue
        for addr in get_iface_addrs(iface):
            sctp_bindx(fd, SCTP_BINDX_REM_ADDR, addr)
    connect(fd, server_addr)

Alternatively, you could make it configurable which addresses your application binds to and document how the user should configure your application if they see such connection failures.

Yes, I concur. This is an issue that will have to be addressed by any user of sctp in the referenced environment. ACK to the above documentation. Thanks!

Do we ship libraries in RHEL5 that provide the get_*ifaces*() or equivalent functions?

Not directly as such, but the rtnetlink message set can provide the needed info (man 7 rtnetlink).
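For a client that needs only one local address, a simpler variant of the advice above is to bind to a single address before connecting. The sketch below swaps in a different technique from the pseudocode: instead of enumerating interfaces and removing addresses with sctp_bindx (which the Python standard library does not expose), it asks the routing table which local address would be used to reach the server and binds to that one. `outbound_address` and `sctp_connect` are hypothetical helper names, and the approach assumes an IPv4 server on a Linux host with SCTP support.

```python
import socket


def outbound_address(server_ip, port=9):
    """Return the local IPv4 address the routing table selects to reach server_ip.

    connect() on a UDP socket only fixes the route; no packet is sent.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.connect((server_ip, port))
        return probe.getsockname()[0]


def sctp_connect(server_ip, server_port):
    """Open a one-to-one style SCTP socket bound to a single local address."""
    local_ip = outbound_address(server_ip)
    fd = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    try:
        # Binding explicitly replaces the default of using every local address,
        # so the virbr0 address is not offered to the server.
        fd.bind((local_ip, 0))
        fd.connect((server_ip, server_port))
    except OSError:
        fd.close()
        raise
    return fd
```

A caller would use `sock = sctp_connect(server_ip, server_port)` in place of the bare socket/connect pair. This gives up multihoming entirely, which is acceptable only for applications that, like d_doio, want just the message boundary features.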