Description of problem:
Cluster QE has a testing tool which uses the SCTP protocol to communicate
between nodes. When the libvirtd service is running, the tool
(d_iogen/d_doio) does not reliably connect. When the libvirtd service is
stopped, the tool connects correctly.
I ran d_doio, the client portion, under strace and found that the connect()
calls were hanging on some nodes some of the time.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start a cluster and mount a shared GFS file system
2. install sts-rhel5.2 from repo
3. on a node outside of the cluster run `d_iogen -I 111 -s read,write -i 0 -F
4. on the cluster nodes run `d_doio -I 111 -P <hostname from 3> -w <path to GFS>
Some d_doio processes connect and the rest hang in connect().
All d_doio processes should connect within a few seconds.
This was tested w/o the Xen kernel.
Try disabling the libvirt default network with
virsh net-destroy default
virsh net-autostart --disable default
If that works, then collect a tcpdump of traffic when the default network is
running, and another when it is not running.
I don't think those commands will do anything unless I'm running the -xen kernel
with the xen hypervisor.
[root@dash-03 ~]# virsh net-destroy default
virsh: error: failed to connect to the hypervisor
[root@dash-03 ~]# virsh net-autostart --disable default
virsh: error: failed to connect to the hypervisor
I'm not running either. I'm just running the standard kernel, and when libvirtd
starts my SCTP connects stop working. I'll collect tcpdumps with and without
libvirtd running.
Created attachment 293193 [details]
tcpdump from dash-03 with libvirtd running and sctp failing
Here is a tcpdump from dash-01 which was trying to start 5 d_doio processes.
Each process tries to connect to d_iogen on another machine. The first three
processes connect, the last two do not.
nstraz: you want to try:
$> virsh -c qemu:///system net-destroy default
You don't need to be running a xen kernel
Okay, I'm not at all familiar with SCTP, but here's what the dump shows:
- The INIT packets contain the address of the virbr0 bridge
interface (192.168.122.1), which is an address we don't want exposed
to the external network
- 3 of the connections succeed, but 2 receive an ABORT with a "Restart
of an association with new addresses" (SCTP_ERROR_RESTART) cause, with
the external-facing IP address as the specified new address
- All 5 INIT packets contain the same list of IP addresses
- The way I interpret the SCTP code is that it thinks these two INITs
are duplicate INITs on existing associations, but that the existing
associations didn't contain the external address
Best guess is that each of the cluster nodes has virbr0 and is simultaneously
creating a number of connections originating from the same port; because
each of the connections specifies the IP address of the bridge as a valid
transport, the connections are interpreted as restarts of an existing connection.
[ASCII diagram of the Node A / Server / Node B packet exchange; only the
column headings survived extraction.]
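The theory above can be sketched as a tiny, purely illustrative Python model of the server-side association lookup (nothing here is kernel code; the ports, addresses, and helper names are made up for the example):

```python
# Simplified model of the server-side association lookup that, per the
# theory above, mistakes Node B's INIT for a restart of Node A's
# association. Purely illustrative -- not the actual kernel logic.

SHARED_NAT_ADDR = "192.168.122.1"  # virbr0 default on every node

# Existing association: Node A connected from source port 5000 and its
# INIT advertised both its external address and the virbr0 address.
existing_assocs = [
    {"peer_port": 5000, "peer_addrs": {"10.0.0.1", SHARED_NAT_ADDR}},
]

def lookup(init_port, init_addrs):
    """Return an existing association if any advertised address plus the
    port matches -- the behavior that triggers the spurious RESTART abort."""
    for assoc in existing_assocs:
        if assoc["peer_port"] == init_port and assoc["peer_addrs"] & init_addrs:
            return assoc
    return None

# Node B, a different host, happens to use the same source port and also
# advertises 192.168.122.1 in its INIT:
match = lookup(5000, {"10.0.0.2", SHARED_NAT_ADDR})
print(match is not None)  # the INIT is treated as a restart -> ABORT
```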
So, two things:
1) Need to verify this theory
nstraz: can you get a tcpdump from the server? Right now we can only see
the "Node A" side of the interaction
2) Need to figure out how to make sure that the virbr0 address doesn't appear
in the list of addresses in the outgoing packets
herbert, nhorman ?
nstraz: if this is the problem, one thing that should fix it for you is to have
d_doio bind its socket to its external IP address so that the virbr0 address
won't be contained in the SCTP INIT messages.
That sucks as a solution, but I'm not sure the kernel or libvirt is doing
anything "wrong" here - i.e. libvirt has set up a network interface which isn't
reachable by the server, the kernel can't know which network interfaces are
reachable by the server, and the whole thing blows up because each of the clients
has a virbr0 using the same address.
Still curious whether herbert or nhorman have any ideas.
Mark, your theory looks pretty solid to me. If you send an INIT chunk
containing the same IP address from the same port, SCTP will have no way to tell
if it's from a separate host or not. I think the best thing to do is look at the
code for the d_doio utility and see how it's generating the list of addresses to
send in the INIT list (most likely it's just blindly adding all the node
addresses via sctp_bindx or some such). We'll most likely have to change the
d_doio code to be a little more intelligent about that. Can you point me to
where I can get a copy of that code? Thanks!
Created attachment 293272 [details]
tcpdump from d_iogen side
Here is the tcpdump of sctp from the d_iogen side.
nhorman: I found the source here:
see src/d_doio/d_doio.c:init_connection() - it's not binding to any address at
all, which means the address list must be coming from the kernel binding to
all local addresses by default.
Note one thing you might have missed - libvirt is configured by default to
create a bridge interface assigned the address 192.168.122.1, so it's perfectly
possible that each node is sending out an INIT with the same address.
You could imagine a similar situation if e.g. multiple linksys/DSL/NAT routers
with a default configuration were trying to initiate an SCTP connection with a
server. That's in effect what libvirt is implementing: a NAT router for virtual
machines.
nstraz: thanks, I think the tcpdump confirms it ... the first and third INIT
originate from the same port on different hosts, both contain the virbr0 address,
and the second of those two gets an ABORT response complaining that its address
is a new address for an existing connection.
I found /etc/libvirt/qemu/networks/default.xml on each node and changed them to
talk on different subnets. The test program is now working as expected. This
is probably something we will have to note in our cluster documentation.
nhorman: re-assigning to the kernel and you since I don't think libvirt is doing
anything wrong here.
It's probably fine to close this as WONTFIX with the conclusion that:
SCTP requires the application to explicitly bind to the external IP
address(es) it wishes to be contactable by, otherwise other (private)
IP addresses assigned to network interfaces on the system may conflict
with existing associations on the server that may have included that
same IP address in its INIT chunk.
But two suggestions to first consider:
- Perhaps by default the Linux SCTP implementation should not include
any IP addresses in the INIT chunk? I don't immediately see anything
in the RFCs which requires this, and it might make sense to only include
additional addresses if the application explicitly binds to them.
- Or alternatively, perhaps we should only look for existing associations
based on the actual remote endpoint address (i.e. ignore the transport
addresses supplied in the INIT chunk) of the existing associations - i.e.
instead of the current rule:
    for assoc in existing_assocs:
        for addr in assoc.transport_addrs:
            for new_addr in new_assoc.transport_addrs:
                if addr == new_addr:
                    # treat as an existing association
do:
    for assoc in existing_assocs:
        for new_addr in new_assoc.transport_addrs:
            if assoc.peer_addr == new_addr:
                # treat as an existing association
Not sure whether this would still be correct, but e.g. Section 5.1.2
of RFC 4460 says:
An INIT or INIT ACK chunk MUST be treated as belonging
to an already established association (or one in the
process of being established) if the use of any of the
valid address parameters contained within the chunk
would identify an existing TCB.
which doesn't specify what addresses in the existing TCB you match against.
I guess that since these addresses are supposed to be valid "alternate
addresses", then (1) is a more likely fix than (2) - i.e. they're not valid
alternate addresses, so we shouldn't be sending them out at all.
I'll close it as NOTABUG.
I agree that libvirt isn't doing anything wrong, it's just making assumptions
about the default behavior regarding the SCTP protocol. When using SCTP, user
space utilities need to call bind (or sctp_bindx) to specify which local addresses
will be used by associations on the specified socket. This is mandated in
section 3.1.2 of the SCTP API.
This behavior is implemented in sctp_autobind down in the kernel.
Regarding the suggestions above, the first is definitely possible, and in fact
available, but requires the user space utility to use bind to specify the
address it will use.
We can't just look for associations based on the remote endpoint address. I
don't think that would allow us to receive frames on associations if one address
failed and we had to use one of the alternative addresses specified in the INIT
chunk during retransmits.
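The failover objection above can be illustrated with a small hypothetical Python model (not kernel code; addresses and helper names are invented) comparing the two lookup rules:

```python
# Illustrative model of why matching associations only on the original
# remote endpoint address would break multihomed failover. Not kernel code.

PEER_PRIMARY = "10.0.0.1"
PEER_ALTERNATE = "10.0.1.1"  # alternate address advertised in the peer's INIT

assoc = {
    "peer_addr": PEER_PRIMARY,                           # proposed rule (2)
    "transport_addrs": {PEER_PRIMARY, PEER_ALTERNATE},   # current rule
}

def lookup_by_peer_addr(src):
    """Proposed rule (2): match only the original endpoint address."""
    return assoc if src == assoc["peer_addr"] else None

def lookup_by_transport_addrs(src):
    """Current rule: match any address advertised in the INIT chunk."""
    return assoc if src in assoc["transport_addrs"] else None

# The primary path fails; the peer retransmits from its alternate address.
src = PEER_ALTERNATE
print(lookup_by_peer_addr(src))        # None -> frame can't be delivered
print(lookup_by_transport_addrs(src))  # matches -> failover keeps working
```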
From my view, (1) is the right solution, and that's currently available. It just
requires the application to do the appropriate setup to get the desired
behavior. The default behavior just isn't what d_doio wants in this case.
I don't think you understand how d_doio is trying to use SCTP. It is only
trying to take advantage of the message boundary features of SCTP using a
one-to-one style socket. That's section 4, not section 3. The automatic
multi-homing is what is causing d_doio trouble.
The real trouble is from libvirtd always using the same IP for its NAT
interface. When VMs are configured on a cluster of machines, this will cause
problems. Each physical machine needs to use a different NAT IP or any SCTP
applications will get confused as d_doio did.
Is this a bug in SCTP? No. Is this a bug in libvirtd? Not really. Is this a
configuration issue which needs to be documented and addressed? Yes.
Requiring that each libvirt NAT network use a different IP address range is
completely missing the point of this being *NAT*. The 192.168.122.* addresses
are never supposed to leave the physical machine. This is why the virbr0 bridge
is not connected to any physical devices, and why no routing rules will direct
its traffic to a physical device. Any traffic from virbr0 (and thus the guests)
must be NAT'd before leaving the machine.
If SCTP can't do this, then it shouldn't try to use the IP address from virbr0
at all. Requiring that the libvirt network is changed on each machine is not a
scalable answer to this problem. Whatever in ClusterSuite is using SCTP should
only generate packets from real physical interfaces (which it can discover via
sysfs or HAL), and not blindly use all addresses configured on a machine.
Let me make this clear. d_doio is part of the test suite for Cluster Suite/GFS.
Nothing _in_ Cluster Suite uses SCTP at the moment. cman was going to use it,
but that was pulled because of problems with SCTP in VM clusters. I'm guessing
because of this exact issue.
So since this is a QA test suite issue only, there's no need to document
changing libvirt networking setup for RHEL.
Correct me if I am mistaken -- wouldn't any use of SCTP as used in the test
tools cause this issue in a VM config? If so, why shouldn't we document this?
I still think it needs to be addressed, though, and it needs to be addressed in
userspace, since the kernel can't know in which cases to include what address in
the multihoming list, and it seems to me that sctp in a clustered vm
environment would be handy to have.
In response to Nate's comments in comment 15: yes, I'm on the same page as you.
The default multihoming features of SCTP are what's causing you grief. I'm
sorry I didn't realize that you were using a 1:1 style socket rather than a 1:many
style socket, but the bind behavior in section 4.1.2 for 1:1 sockets is exactly
the same as for 1:many style sockets, which is to use all interfaces if no list
is specified. The fix for that is to use sctp_bindx to explicitly list the
interfaces that you want d_doio to use, and to make sure the IP address of
virbr0 is not in that list.
In response to Dan's comments in comment 16: this can work just fine. SCTP is a
transport under IP, just like UDP and TCP. As such it will be NATed just like
any other IP frame. The problem comes in with the semantics of the protocol.
When SCTP establishes a new connection, it sends an INIT chunk to its peer(s).
This INIT chunk's payload contains a list of addresses on the sending system
which can be used for multihoming. Since this data is in the payload, rather
than the IP header, it escapes any modification during the NAT process. The
mandated default behavior in SCTP, according to the implementation guide, is to
bind to all available interfaces if no list is specified via sctp_bindx(). If
libvirt wants to use SCTP, it's going to need to mask away the virbr0 interface
IP address via sctp_bindx() when it creates an SCTP socket.
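That NAT limitation can be sketched with a toy Python model (purely illustrative; real NAT rewrites raw packets, not dicts, and `nat_translate` is an invented helper):

```python
# Toy model of why NAT does not help here: NAT rewrites the IP header's
# source address, but the multihoming address list rides in the SCTP INIT
# chunk's *payload*, which a plain NAT leaves untouched.

def nat_translate(packet, public_addr):
    """Hypothetical NAT step: rewrite only the IP header source address."""
    translated = dict(packet)
    translated["ip_src"] = public_addr
    return translated

init = {
    "ip_src": "192.168.122.2",                            # guest behind virbr0
    "payload_addrs": ["192.168.122.2", "192.168.122.1"],  # INIT chunk list
}

out = nat_translate(init, "10.0.0.5")
print(out["ip_src"])         # header address is translated
print(out["payload_addrs"])  # private addresses still leak to the server
```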
(In reply to comment #14)
> I agree that libvirt isn't doing anything wrong, its just making assumptions
> about the default behavior regarding the sctp protocol.
libvirt isn't making any assumptions about sctp. You mean d_doio?
> Regarding the suggestions above, the first is definately possible, and in fact
> available, but requires the user space utility to use bind to specify the
> address it will use.
The suggestion was:
Perhaps by default the Linux SCTP implementation should not include
any IP addresses in the INIT chunk?
Is there any way a user space application can do this? As you pointed out, NAT
can't translate the addresses in the INIT chunk, so that's another reason why
I'd think it would be better for most applications if they just let the endpoint
address from the IP header be the only address used for creating the association.
>libvirt isn't making any assumptions about sctp. You mean d_doio?
Whoever is opening an SCTP socket is making these assumptions regarding the
default behavior of the protocol. I'm afraid I really don't know enough about
the cluster setup to know which user space utilities that includes (although the
source of d_doio certainly seems to be on that list). Regardless, any
application being used in this clustered VM environment is going to need to be
aware of this behavior.
>The suggestion was:
>   Perhaps by default the Linux SCTP implementation should not include
>   any IP addresses in the INIT chunk?
>Is there any way a user space application can do this? As you pointed out, NAT
>can't translate the addresses in the INIT chunk, so that's another reason why
>I'd think it would be better for most applications if they just let the endpoint
>address from the IP header be the only address used for creating the association.
Yes, it is possible for SCTP to not include any additional IP addresses in the
INIT chunk (or, more to the point, to only include non-NATed addresses). I've
explained how to do this in comment #20.
One of the key features in SCTP is the ability to multihome. The implementation
guide referenced in comment 14 mandates that, unless an explicit list of
addresses to multihome on is specified by the application using the socket (via
sctp_bindx), the protocol is to use all available IP addresses for multihoming.
The addresses in the INIT chunk are the list of addresses to multihome on. So
by using sctp_bindx in the userspace application, you can add addresses to, or
remove addresses from, that list. That's why I've been saying that userspace
needs to be the one to correct this behavior - be it d_doio, or libvirt, or any
application using an SCTP socket in a clustered VM environment.
Okay, sounds like this is something fundamentally required by the protocol specs
and, so, is a potential problem for any SCTP client application.
The advice to application developers would then be:
Q. I'm getting reports that my application fails to connect to a server
when another client has already connected from the same client port on
another machine with an identical IP address to one of the IP addresses
on the connecting machine. How do I make my application robust in
this situation?
A. Easy. You need to explicitly bind the client to a set of addresses
so that the server never sees these duplicate addresses. The pseudo
code goes like:
fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)
for iface in get_network_ifaces():
    if iface == get_default_route_iface():
        continue
    for addr in get_iface_addrs(iface):
        sctp_bindx(fd, SCTP_BINDX_REM_ADDR, addr)
Alternatively, you could make it configurable which addresses your
application binds to and document how the user should configure your
application if they see such connection failures.
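As a concrete illustration of the filtering step in that pseudo code, here is a hedged Python sketch using only the standard library. The input address list is hypothetical; a real client would obtain it from its own interface enumeration, and the hard-coded subnet is just the libvirt default:

```python
import ipaddress

# Sketch: given a candidate list of local addresses, keep only the ones an
# SCTP client should advertise, dropping loopback and anything in the
# libvirt default NAT subnet. The address list below is made up.
LIBVIRT_DEFAULT_NET = ipaddress.ip_network("192.168.122.0/24")

def advertisable(addrs):
    """Filter out addresses that must not appear in the INIT chunk."""
    return [a for a in addrs
            if ipaddress.ip_address(a) not in LIBVIRT_DEFAULT_NET
            and not ipaddress.ip_address(a).is_loopback]

local_addrs = ["10.15.89.3", "192.168.122.1", "127.0.0.1"]
print(advertisable(local_addrs))  # ['10.15.89.3']
```

The surviving addresses would then be the ones passed to bind()/sctp_bindx(), so the virbr0 address never reaches the server.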
Yes, I concur. This is an issue that will have to be addressed by any user of
sctp in the referenced environment. ACK to the above documentation. Thanks!
Do we ship libraries in RHEL5 that provide the get_*ifaces*() functions or
equivalent?
Not directly as such, but the rtnetlink message set can provide the needed info
(man 7 rtnetlink).
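For a quick illustration of interface enumeration without hand-rolled rtnetlink code, Python's standard library exposes the kernel's interface name/index table (this only yields names; per-interface addresses do need rtnetlink or getifaddrs(3), as noted above):

```python
import socket

# socket.if_nameindex() (Linux) returns a list of (index, name) pairs for
# the host's network interfaces -- e.g. lo, eth0, and virbr0 when libvirt's
# default network is up. Addresses are not included in this call.
ifaces = socket.if_nameindex()
names = [name for _, name in ifaces]
print(names)  # interface names vary by host, e.g. ['lo', 'eth0', 'virbr0']
```

An application could use such a listing to decide which interfaces' addresses to pass to sctp_bindx, skipping bridge interfaces like virbr0.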