Bug 428994

Summary: libvirtd seems to interfere with SCTP connects
Product: Red Hat Enterprise Linux 5 Reporter: Nate Straz <nstraz>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED NOTABUG QA Contact: Martin Jenner <mjenner>
Severity: low Docs Contact:
Priority: low    
Version: 5.1 CC: djansa, herbert.xu, nhorman, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-01-30 18:44:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
tcpdump from dash-03 with libvirtd running and sctp failing none
tcpdump from d_iogen side none

Description Nate Straz 2008-01-16 17:17:48 UTC
Description of problem:

Cluster QE has a testing tool which uses the SCTP protocol to communicate
between nodes.  When the libvirtd service is running, the tool
(d_iogen/d_doio) does not reliably connect.  When the libvirtd service is
stopped, the tool connects correctly.

I ran d_doio, the client portion, with strace and found that the connects were
hanging on some nodes some of the time.


Version-Release number of selected component (if applicable):
kernel-2.6.18-53.el5
libvirt-0.2.3-9.el5

How reproducible:
Every time

Steps to Reproduce:
1. start a cluster and mount a shared GFS file system
2. install sts-rhel5.2 from repo
http://sts.lab.msp.redhat.com/dist/brewroot/repos/qe-rhel5/$ARCH
3. on a node outside of the cluster run `d_iogen -I 111 -s read,write -i 0 -F
file:1m`
4. on the cluster nodes run `d_doio -I 111 -P <hostname from 3> -w <path to GFS>
-n 5`
  
Actual results:
Some d_doio processes connect and the rest hang in connect().

Expected results:
All d_doio processes should connect within a few seconds.

Additional info:

This was tested w/o the Xen kernel.

Comment 1 Daniel Berrangé 2008-01-24 18:38:57 UTC
Try disabling the libvirt default network with

 virsh net-destroy default
 virsh net-autostart --disable default

If that works, then collect a tcpdump of traffic when the default network is
running, and another when it is not running.


Comment 2 Nate Straz 2008-01-28 20:12:43 UTC
I don't think those commands will do anything unless I'm running the -xen kernel
with the xen hypervisor.

[root@dash-03 ~]# virsh net-destroy default
virsh: error: failed to connect to the hypervisor
[root@dash-03 ~]# virsh net-autostart --disable default
virsh: error: failed to connect to the hypervisor

I'm not running either.  I'm just running the standard kernel and when libvirtd
starts my SCTP connects stop working.  I'll collect tcpdumps with and without
libvirtd started.

Comment 3 Nate Straz 2008-01-28 20:40:19 UTC
Created attachment 293193 [details]
tcpdump from dash-03 with libvirtd running and sctp failing

Here is a tcpdump from dash-01 which was trying to start 5 d_doio processes. 
Each process tries to connect to d_iogen on another machine.  The first three
processes connect, the last two do not.

Comment 4 Mark McLoughlin 2008-01-29 09:13:59 UTC
nstraz: you want to try:

  $> virsh -c qemu:///system net-destroy default

You don't need to be running a xen kernel.

Comment 5 Mark McLoughlin 2008-01-29 10:33:18 UTC
Okay, I'm not at all familiar with SCTP, but here's what the dump shows:

  - The INIT packets contain the address of the virbr0 bridge 
    interface (192.168.122.1), which is an address we don't want exposed
    to the external network

  - 3 of the connections succeed, but 2 receive an ABORT with a "Restart 
    of an association with new addresses" (SCTP_ERROR_RESTART) cause with
    the external facing IP address as the specified new address

  - All 5 INIT packets contain the same list of IP addresses

  - The way I interpret the SCTP code is that it thinks these two INITs
    are duplicate INITs on existing associations, but that the existing
    associations didn't contain the external address

Best guess is that each of the cluster nodes has a virbr0 and they are
simultaneously creating a number of connections originating from the same port,
but because each of the connections specifies the IP address of the bridge as a
valid transport, the connections are interpreted as restarts of an existing
connection, e.g.

Node A                 Server                  Node B
======                 ======                  ======
INIT
 10.15.89.98:32928
 192.168.122.1:32928  ->
                     <-  ACK
                                                  INIT
                                    10.15.89.97:32928
                               <- 192.168.122.1:32928
                         ABORT  ->
INIT
 10.15.89.98:32929
 192.168.122.1:32929  ->
                     <-  ACK
                                                  INIT
                                    10.15.89.97:32929
                               <- 192.168.122.1:32929
                         ABORT  ->
INIT
 10.15.89.98:32930
 192.168.122.1:32930  ->
                     <-  ACK
                                                  INIT
                                    10.15.89.97:32930
                               <- 192.168.122.1:32930
                         ABORT  ->
                                                  INIT
                                    10.15.89.97:32931
                               <- 192.168.122.1:32931
                            ACK ->
INIT
 10.15.89.98:32931
 192.168.122.1:32931  ->
                     <-  ABORT
                                                  INIT
                                    10.15.89.97:32932
                               <- 192.168.122.1:32932
                            ACK ->
INIT
 10.15.89.98:32932
 192.168.122.1:32932  ->
                     <-  ABORT


So, two things:

  1) Need to verify this theory

     nstraz: can you get a tcpdump from the server? Right now we can only see
     the "Node A" side of the interaction

  2) Need to figure out how to make sure that the virbr0 address doesn't appear
     in the list of addresses in the outgoing packets

     herbert, nhorman ?

Comment 6 Mark McLoughlin 2008-01-29 11:26:46 UTC
nstraz: if this is the problem, one thing that should fix it for you is to have
d_doio bind its socket to its external IP address so that the virbr0 address
won't be contained in the SCTP INIT messages.

That sucks as a solution, but I'm not sure the kernel or libvirt is doing
anything "wrong" here - i.e. libvirt has set up a network interface which isn't
reachable by the server, the kernel can't know which network interfaces are
reachable by the server and the whole thing blows up because each of the clients
has a virbr0 using the same address.

Still curious whether herbert or nhorman have any ideas.
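
A minimal sketch of the workaround suggested in this comment: bind the client
socket to the one local address that faces the server before connecting.  It
assumes a Linux host with SCTP available and an IPv4 server.  The helper names
(local_address_for, connect_sctp_bound) are hypothetical and are not taken from
d_doio; picking the address with a UDP route probe is one possible approach,
not necessarily what the tool should do.

```python
import socket

# IANA protocol number for SCTP; fall back to the literal if this Python
# build does not expose the constant.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)


def local_address_for(server_host, server_port):
    """Return the local IPv4 address the kernel would use to reach the server.

    Connecting a UDP socket sends no packets; it only makes the kernel pick
    a route and a source address, which getsockname() then reports.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.connect((server_host, server_port))
        return probe.getsockname()[0]
    finally:
        probe.close()


def connect_sctp_bound(server_host, server_port):
    """Open a one-to-one SCTP socket bound to a single local address."""
    local_ip = local_address_for(server_host, server_port)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_SCTP)
    try:
        # Binding to one address (port 0 = any free port) keeps the socket
        # from being autobound to every address configured on the machine.
        sock.bind((local_ip, 0))
        sock.connect((server_host, server_port))
    except OSError:
        sock.close()
        raise
    return sock
```

Calling connect_sctp_bound(host, port) in place of a bare connect() would be
the only change needed in the client under these assumptions.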

Comment 7 Neil Horman 2008-01-29 12:20:37 UTC
Mark, your theory looks pretty solid to me.  If you send an INIT chunk
containing the same IP address from the same port, SCTP will have no way to tell
if it's from a separate host or not.  I think the best thing to do is look at the
code for the d_doio utility and see how it's generating the list of addresses to
send in the INIT list (most likely it's just blindly adding all the node
addresses via sctp_bindx or some such).  We'll most likely have to change the
d_doio code to be a little more intelligent about that.  Can you point me to
where I get a copy of that code?  Thanks!

Comment 9 Nate Straz 2008-01-29 13:30:38 UTC
Created attachment 293272 [details]
tcpdump from d_iogen side

Here is the tcpdump of sctp from the d_iogen side.

Comment 10 Mark McLoughlin 2008-01-29 13:31:52 UTC
nhorman: I found the source here:
 
http://sts.lab.msp.redhat.com/dist/brewroot/repos/qe-rhel5/SRPMS/sts-rhel5.2-3-1.el5.src.rpm

see src/d_doio/d_doio.c:init_connection() - it's not binding to any address at
all, which means the address list must be coming from
sctp_copy_local_addr_list(), right?

Note one thing you might have missed - libvirt is configured by default to
create a bridge interface assigned with 192.168.122.1, so it's perfectly
possible that each node is sending out an INIT with the same address.

You could imagine a similar situation if e.g. multiple linksys/DSL/NAT routers
with a default configuration were trying to initiate an SCTP connection with a
server. That's in effect what libvirt is implementing, a NAT router for virtual
machines.

Comment 11 Mark McLoughlin 2008-01-29 13:39:19 UTC
nstraz: thanks, I think the tcpdump confirms it ... the first and third INIT
originate from the same port on different hosts, both contain the virbr0 address
and the second of those two gets an ABORT response complaining that its address
is a new address for an existing connection.

Comment 12 Nate Straz 2008-01-29 13:47:15 UTC
I found /etc/libvirt/qemu/networks/default.xml on each node and changed them to
talk on different subnets.  The test program is now working as expected.  This
is probably something we will have to note in our cluster documentation.

Comment 13 Mark McLoughlin 2008-01-30 16:27:17 UTC
nhorman: re-assigning to the kernel and you since I don't think libvirt is doing
anything wrong here.

It's probably fine to close this as WONTFIX with the conclusion that:

  SCTP requires the application to explicitly bind to the external IP 
  address(es) it wishes to be contactable by, otherwise other (private)
  IP addresses assigned to network interfaces on the system may conflict
  with existing associations on the server that may have included that
  same IP address in its INIT chunk.

But two suggestions to first consider:

  - Perhaps by default the Linux SCTP implementation should not include
    any IP addresses in the INIT chunk? I don't immediately see anything
    in the RFCs which require this, and it might make sense to only include
    additional addresses if the application explicitly binds to them.

  - Or alternatively, perhaps we should only look for existing associations
    based on the actual remote endpoint address (i.e. ignore the transport
    addresses supplied in the INIT chunk) of the existing associations i.e.
    rather than

    def find_existing_assoc(new_assoc):
        for assoc in existing_assocs:
            for addr in assoc.transport_addrs:
                for new_addr in new_assoc.transport_addrs:
                    if addr == new_addr:
                        return assoc
        return None

    do:

    def find_existing_assoc(new_assoc):
        for assoc in existing_assocs:
            for new_addr in new_assoc.transport_addrs:
                if assoc.peer_addr == new_addr:
                    return assoc
        return None

    Not sure whether this would be still correct, but e.g. Section 5.1.2
    of RFC 4460 says:

      An INIT or INIT ACK chunk MUST be treated as belonging
      to an already established association (or one in the
      process of being established) if the use of any of the
      valid address parameters contained within the chunk
      would identify an existing TCB.

   which doesn't specify what addresses in the existing TCB you match against.

I guess that since these addresses are supposed to be valid "alternate
addresses", then (1) is a more likely fix than (2) - i.e. they're not valid
alternate addresses, so we shouldn't be sending them out at all.
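
The two lookup strategies sketched in this comment can be compared with a small
runnable model.  This is only an illustration using the addresses and port from
the diagram in comment 5; the Assoc tuple and both function names are
hypothetical and do not correspond to the kernel's data structures.

```python
from collections import namedtuple

# An association as the server might record it: the address the INIT
# actually arrived from, plus every transport address the peer listed.
Assoc = namedtuple("Assoc", ["peer_addr", "transport_addrs"])


def find_by_any_transport(existing, new_assoc):
    """First variant: match if any listed address is already known."""
    for assoc in existing:
        for addr in new_assoc.transport_addrs:
            if addr in assoc.transport_addrs:
                return assoc
    return None


def find_by_peer_addr(existing, new_assoc):
    """Second variant: match only against each association's source address."""
    for assoc in existing:
        for addr in new_assoc.transport_addrs:
            if addr == assoc.peer_addr:
                return assoc
    return None


# Node A and Node B use the same source port and both list the virbr0 address.
node_a = Assoc(("10.15.89.98", 32928),
               {("10.15.89.98", 32928), ("192.168.122.1", 32928)})
node_b = Assoc(("10.15.89.97", 32928),
               {("10.15.89.97", 32928), ("192.168.122.1", 32928)})

existing = [node_a]  # Node A connected first

# Variant 1 mistakes Node B for a restart of Node A's association.
print(find_by_any_transport(existing, node_b) is node_a)
# Variant 2 treats Node B as a new peer.
print(find_by_peer_addr(existing, node_b) is None)
```

In this model the first variant collides on 192.168.122.1:32928 while the
second does not, which matches the behavior described in the comment.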



Comment 14 Neil Horman 2008-01-30 18:44:51 UTC
I'll close it as NOTABUG.

I agree that libvirt isn't doing anything wrong, it's just making assumptions
about the default behavior regarding the sctp protocol.  When using sctp, user
space utilities need to call bind (or sctp_bindx) to specify which local address
will be used by associations on the specified socket.  This is mandated in
section 3.1.2 of the sctp api:
http://www3.tools.ietf.org/html/draft-ietf-tsvwg-sctpsocket-15#section-3.1.2

This behavior is implemented in sctp_autobind down in the kernel.

Regarding the suggestions above, the first is definitely possible, and in fact
available, but requires the user space utility to use bind to specify the
address it will use.

We can't just look for associations based on the remote endpoint address.  I
don't think that would allow us to receive frames on associations if one address
failed and we had to use one of the alternative addresses specified in the INIT
chunk during retransmits.

From my view, (1) is the right solution, and that's currently available.  It just
requires the application to do the appropriate setup to get the desired
behavior.  The default behavior just isn't what d_doio wants in this case.

Comment 15 Nate Straz 2008-01-30 20:14:58 UTC
I don't think you understand how d_doio is trying to use SCTP.  It is only
trying to take advantage of the message boundary features of SCTP using a
one-to-one style socket.  That's section 4, not section 3.  The automatic
multi-homing is what is causing d_doio trouble.

The real trouble is from libvirtd always using the same IP for its NAT
interface.  When VMs are configured on a cluster of machines, this will cause
problems.  Each physical machine needs to use a different NAT IP or any SCTP
applications will get confused as d_doio did.

Is this a bug in SCTP? No.  Is this a bug in libvirtd?  Not really.  Is this a
configuration issue which needs to be documented and addressed?  Yes.

Comment 16 Daniel Berrangé 2008-01-30 20:23:41 UTC
Requiring that each libvirt NAT network use a different IP address range is
completely missing the point of this being *NAT*.  The 192.168.122.* addresses
are never supposed to leave the physical machine. This is why the virbr0 bridge
is not connected to any physical devices, and why no routing rules will direct
its traffic to a physical device. Any traffic from virbr0 (and thus the guests)
must be NAT'd before leaving the machine. 

If SCTP can't do this, then it shouldn't try to use the IP address from virbr0
at all.  Requiring that the libvirt network is changed on each machine is not a
scalable answer to this problem. Whatever in ClusterSuite is using SCTP should
only generate packets from real physical interfaces (which it can discover via
sysfs or HAL), and not blindly use all addresses configured on a machine.

Comment 17 Nate Straz 2008-01-30 20:33:43 UTC
Let me make this clear.  d_doio is part of the test suite for Cluster Suite/GFS.
 Nothing _in_ Cluster Suite uses SCTP at the moment.  cman was going to use it,
but that was pulled because of problems with SCTP in VM clusters.  I'm guessing
because of this exact issue.

Comment 18 Daniel Berrangé 2008-01-30 20:49:05 UTC
So since this is a QA test suite issue only, there's no need to document
changing libvirt networking setup for RHEL.

Comment 19 Dean Jansa 2008-01-30 21:05:52 UTC
Correct me if I am mistaken -- wouldn't any use of SCTP as used in the test
tools cause this issue in a VM config?  If so, why shouldn't we document this?


Comment 20 Neil Horman 2008-01-30 21:14:24 UTC
I still think it needs to be addressed, though, and it needs to be addressed in
userspace, since the kernel can't know in which cases to include what address in
 the multihoming list, and it seems to me that sctp in a clustered vm
environment would be handy to have.

In response to Nate's comments in comment 15, yes.  I'm on the same page as you.
 The default multihoming features of SCTP are what's causing you grief.  I'm
sorry I didn't realize that you were using a 1:1 style socket rather than a 1:m
style socket, but the bind behavior in section 4.1.2 for 1:1 sockets is exactly
the same as for 1:many style sockets, which is to use all interfaces if no list
is specified.  The fix for that is to use sctp_bindx to explicitly list the
interfaces that you want d_doio to use, and to make sure the ip address of
virbr0 is not in that list.

In response to Dan's comments in comment 16, this can work just fine.  SCTP is a
transport under IP, just like UDP and TCP.  As such it will be natted just like
any other IP frame.  The problem comes in with the semantics of the protocol. 
When sctp establishes a new connection, it sends an INIT chunk to its peer(s).
 This INIT chunk's payload contains a list of addresses on the sending system
which can be used for multihoming.  Since this data is in the payload, rather
than the IP header, it escapes any modification during the NAT process.  The
mandated default behavior in sctp, according to the implementation guide, is to
bind to all available interfaces if no list is specified via sctp_bindx().  If
libvirt wants to use SCTP, it's going to need to mask away the virbr0 interface
ip address via sctp_bindx() when it creates an SCTP socket.
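
To make the point about the payload concrete, here is a small sketch that
builds an INIT chunk and reads back the IPv4 address parameters it carries.
The layout follows RFC 4960 section 3.3.2 (chunk type 1, 16 bytes of fixed
fields, then type/length/value parameters, with IPv4 Address being parameter
type 5).  The function names and the placeholder field values are hypothetical
and exist only for illustration.

```python
import socket
import struct

INIT_CHUNK_TYPE = 1      # RFC 4960, section 3.3.2
IPV4_ADDRESS_PARAM = 5   # RFC 4960, section 3.3.2.1


def build_init_chunk(addresses):
    """Build an INIT chunk listing the given IPv4 addresses (test data only)."""
    # Fixed part: initiate tag, a_rwnd, outbound streams, inbound streams,
    # initial TSN.  The values here are arbitrary placeholders.
    fixed = struct.pack("!IIHHI", 0x12345678, 65535, 10, 10, 1)
    params = b"".join(
        struct.pack("!HH4s", IPV4_ADDRESS_PARAM, 8, socket.inet_aton(a))
        for a in addresses
    )
    length = 4 + len(fixed) + len(params)
    return struct.pack("!BBH", INIT_CHUNK_TYPE, 0, length) + fixed + params


def init_chunk_ipv4_addresses(chunk):
    """Return the IPv4 address parameters carried in an INIT chunk."""
    chunk_type, _flags, length = struct.unpack_from("!BBH", chunk, 0)
    if chunk_type != INIT_CHUNK_TYPE:
        raise ValueError("not an INIT chunk")
    end = min(length, len(chunk))
    offset = 4 + 16  # chunk header + fixed INIT fields
    found = []
    while offset + 4 <= end:
        ptype, plen = struct.unpack_from("!HH", chunk, offset)
        if plen < 4:
            break  # malformed parameter; stop rather than loop forever
        if ptype == IPV4_ADDRESS_PARAM and plen == 8:
            found.append(socket.inet_ntoa(chunk[offset + 4:offset + 8]))
        offset += (plen + 3) & ~3  # parameters are padded to 4 bytes
    return found


chunk = build_init_chunk(["10.15.89.98", "192.168.122.1"])
print(init_chunk_ipv4_addresses(chunk))
```

A NAT device rewrites the source address in the IP header, but these address
parameters sit inside the SCTP payload, which is why the virbr0 address reaches
the server unchanged.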


Comment 21 Mark McLoughlin 2008-01-31 08:34:58 UTC
(In reply to comment #14)

> I agree that libvirt isn't doing anything wrong, its just making assumptions
> about the default behavior regarding the sctp protocol.

libvirt isn't making any assumptions about sctp. You mean d_doio?

> Regarding the suggestions above, the first is definately possible, and in fact
> available, but requires the user space utility to use bind to specify the
> address it will use.

The suggestion was:

  Perhaps by default the Linux SCTP implementation should not include
  any IP addresses in the INIT chunk?

Is there any way a user space application can do this? As you pointed out, NAT
can't translate the addresses in the INIT chunk, so that's another reason why
I'd think it would be better for most applications if they just let the endpoint
address from the IP header be the only address used for creating the association.


Comment 22 Neil Horman 2008-01-31 12:49:37 UTC
>libvirt isn't making any assumptions about sctp. You mean d_doio?
Whoever is opening an sctp socket is making these assumptions regarding the 
default behavior of the protocol.  I'm afraid I really don't know enough about
the cluster setup to know which user space utilities that includes (although the
source from d_doio certainly seems to be on that list). Regardless, any
application being used in this clustered vm environment is going to need to be
aware of this behavior.



>The suggestion was:
>
> Perhaps by default the Linux SCTP implementation should not include
>  any IP addresses in the INIT chunk?
>
>Is there any way a user space application can do this? As you pointed out, NAT
>can't translate the addresses in the INIT chunk, so that's another reason why
>I'd think it would be better for most applications if they just let the endpoint
>address from the IP header be the only address used for creating the association.


Yes it is possible for sctp to not include any additional IP addresses in the
INIT chunk (or more to the point, to only include non-natted addresses).  I've
explained how to do this in comment #20.

One of the key features in sctp is the ability to multihome.  The implementation
guide referenced in comment 14 mandates that, unless an explicit list of
addresses to multihome on is specified by the application using the socket (via
sctp_bindx), the protocol is to use all available ip addresses for multihoming.
 The addresses in the INIT chunk are the list of addresses to multihome on.  So
by using sctp_bindx in the userspace application, you can add addresses to, or
remove addresses from that list.  That's why I've been saying that userspace
needs to be the one to correct this behavior.  Be it d_doio, or libvirt, or any
application using an sctp socket in a clustered vm environment.


Comment 23 Mark McLoughlin 2008-01-31 14:07:08 UTC
Okay, sounds like this is something fundamentally required by the protocol specs
and, so, is a potential problem for any SCTP client application.

The advice to application developers would then be:

  Q. I'm getting reports that my application fails to connect to a server
     when another client has already connected from the same client port on
     another machine with an identical IP address to one of the IP addresses
     on the connecting machine. How do I make my application robust in
     this situation?

  A. Easy. You need to explicitly bind the client to a set of addresses
     so that the server never sees these duplicate addresses. The pseudo
     code goes like:

       fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)

       bind(fd, INADDR_ANY);

       for iface in get_network_ifaces():
           if iface == get_default_route_iface():
               continue

           for addr in get_iface_addrs(iface):
               sctp_bindx(fd, SCTP_BINDX_REM_ADDR, addr)

       connect(fd, server_addr)

     Alternatively, you could make it configurable which addresses your
     application binds to and document how the user should configure your 
     application if they see such connection failures.
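
A hedged, runnable variant of the advice above, written in the "explicitly list
the addresses you want" form described in comment 20 rather than removing
addresses from a wildcard bind.  It assumes Linux, IPv4 only, and that
lksctp-tools' libsctp is installed so that sctp_bindx() can be called through
ctypes.  The helper names and EXCLUDED_NETWORKS are hypothetical, and the
excluded range is simply libvirt's default 192.168.122.0/24.

```python
import ctypes
import ctypes.util
import ipaddress
import socket
import struct

SCTP_BINDX_ADD_ADDR = 0x01  # value from lksctp-tools' <netinet/sctp.h>
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)

# libvirt's default NAT network (see comment 10); adjust if yours differs.
EXCLUDED_NETWORKS = [ipaddress.ip_network("192.168.122.0/24")]


def usable_addresses(candidates, excluded=EXCLUDED_NETWORKS):
    """Drop loopback addresses and anything inside an excluded network."""
    keep = []
    for text in candidates:
        addr = ipaddress.ip_address(text)
        if addr.is_loopback or any(addr in net for net in excluded):
            continue
        keep.append(text)
    return keep


def pack_sockaddr_in(addresses, port=0):
    """Pack IPv4 addresses as a contiguous array of Linux struct sockaddr_in."""
    return b"".join(
        struct.pack("=H", socket.AF_INET)   # sin_family, host byte order
        + struct.pack("!H", port)           # sin_port, network byte order
        + socket.inet_aton(a)               # sin_addr
        + b"\0" * 8                         # sin_zero padding
        for a in addresses
    )


def sctp_socket_bound_to(addresses):
    """Create a one-to-one SCTP socket bound only to the given addresses."""
    name = ctypes.util.find_library("sctp")
    if name is None:
        raise OSError("libsctp (lksctp-tools) not found")
    libsctp = ctypes.CDLL(name, use_errno=True)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_SCTP)
    packed = pack_sockaddr_in(addresses)
    rc = libsctp.sctp_bindx(sock.fileno(), packed, len(addresses),
                            SCTP_BINDX_ADD_ADDR)
    if rc != 0:
        err = ctypes.get_errno()
        sock.close()
        raise OSError(err, "sctp_bindx failed")
    return sock
```

A client would pass its full address list through usable_addresses() and hand
the result to sctp_socket_bound_to() before calling connect().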

Comment 24 Neil Horman 2008-01-31 14:38:57 UTC
Yes, I concur.  This is an issue that will have to be addressed by any user of
sctp in the referenced environment.  ACK to the above documentation.  Thanks!

Comment 25 Nate Straz 2008-01-31 14:46:39 UTC
Do we ship libraries in RHEL5 that provide the get_*ifaces*() or equivalent
functions?

Comment 26 Neil Horman 2008-01-31 15:11:27 UTC
Not directly as such, but the rtnetlink message set can provide the needed info
(man 7 rtnetlink)
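
As a rough sketch of what that could look like, the following asks the kernel
for all IPv4 addresses with an RTM_GETADDR dump over a NETLINK_ROUTE socket and
parses the replies.  It assumes Linux; the constants are the values from
<linux/netlink.h>, <linux/rtnetlink.h> and <linux/if_addr.h>, and the two
function names are hypothetical.

```python
import socket
import struct

RTM_NEWADDR, RTM_GETADDR = 20, 22
NLMSG_ERROR, NLMSG_DONE = 2, 3
NLM_F_REQUEST, NLM_F_DUMP = 0x1, 0x300
IFA_LOCAL, IFA_LABEL = 2, 3
NLMSG_HDR_LEN, IFADDRMSG_LEN = 16, 8


def parse_addr_messages(data):
    """Return ([(label, ipv4), ...], done) parsed from netlink reply bytes."""
    pairs, done, offset = [], False, 0
    while offset + NLMSG_HDR_LEN <= len(data):
        msg_len, msg_type = struct.unpack_from("=IH", data, offset)
        if msg_len < NLMSG_HDR_LEN:
            raise ValueError("malformed netlink message")
        if msg_type == NLMSG_DONE:
            done = True
            break
        if msg_type == NLMSG_ERROR:
            raise OSError("netlink returned an error")
        if msg_type == RTM_NEWADDR:
            family = data[offset + NLMSG_HDR_LEN]  # ifaddrmsg.ifa_family
            attr = offset + NLMSG_HDR_LEN + IFADDRMSG_LEN
            end = offset + msg_len
            label = local = None
            while attr + 4 <= end:
                rta_len, rta_type = struct.unpack_from("=HH", data, attr)
                if rta_len < 4:
                    break
                value = data[attr + 4:attr + rta_len]
                if rta_type == IFA_LABEL:
                    label = value.rstrip(b"\0").decode()
                elif rta_type == IFA_LOCAL and family == socket.AF_INET:
                    local = socket.inet_ntoa(value)
                attr += (rta_len + 3) & ~3  # attributes are 4-byte aligned
            if local is not None:
                pairs.append((label, local))
        offset += (msg_len + 3) & ~3  # messages are 4-byte aligned
    return pairs, done


def list_ipv4_addresses():
    """Ask the kernel for every IPv4 address via an RTM_GETADDR dump."""
    request = struct.pack("=IHHII", NLMSG_HDR_LEN + IFADDRMSG_LEN,
                          RTM_GETADDR, NLM_F_REQUEST | NLM_F_DUMP, 1, 0)
    request += struct.pack("=BBBBI", socket.AF_INET, 0, 0, 0, 0)
    results = []
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                       socket.NETLINK_ROUTE) as nl:
        nl.bind((0, 0))
        nl.send(request)
        done = False
        while not done:
            pairs, done = parse_addr_messages(nl.recv(65536))
            results.extend(pairs)
    return results
```

list_ipv4_addresses() returns (interface label, address) pairs, so a caller
could drop the entry labelled virbr0 before deciding which addresses to bind.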