Bug 1160340 - ovs allows moving interfaces to different netns although it cannot work across netns
Summary: ovs allows moving interfaces to different netns although it cannot work acros...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch2.10
Version: 7.2
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Flavio Leitner
QA Contact: qding
URL:
Whiteboard:
: 1157776 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-11-04 15:11 UTC by Jiri Benc
Modified: 2019-07-11 08:19 UTC (History)
18 users (show)

Fixed In Version: openvswitch2.10-2.10.0-1.el7fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-20 20:13:35 UTC


Attachments (Terms of Use)
ovsdb-client dump log for OVS2.9 (7.82 KB, text/plain)
2018-08-23 05:22 UTC, qding
no flags Details
ovsdb-client dump log for OVS2.10 (8.00 KB, text/plain)
2018-08-23 05:24 UTC, qding
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:2735 None None None 2018-09-20 20:14:18 UTC

Description Jiri Benc 2014-11-04 15:11:10 UTC
When querying ovsdb, values for some columns (ifindex, mac_in_use, etc.) in Interface table are apparently calculated by the user space daemon based solely on the interface name. This leads to zero values reported when the interface has been moved to a different name space. Even worse, it leads to bogus values when an interface with the same name is added later in the root namespace.

Steps to reproduce:

ovs-vsctl add-br ovs0
ovs-vsctl add-port iface0 -- set interface iface0 type=internal
ip netns add ns0
ip l s iface0 netns ns0
ip l a type veth peer name iface0
ovsdb-client dump

Comment 1 Jiri Benc 2014-11-04 15:14:19 UTC
Side note: I suspect openvswitch is much more broken wrt. netns. And netns with ovs is being used by openstack...

Comment 3 Jiri Benc 2015-02-13 11:23:15 UTC
This is much broader (and worse) problem then the initial description suggests.

The problem is openvswitch does not support its ports to be moved to a different name space. The fact that it's possible to do that is a bug - such operation should have been denied. Unfortunately, due to a missing check, it's not been denied. Such setup does not work reliably, though, and leads to various issues from incorrect resource accounting to kernel crashes.

We have basically two options to fix this:

1. Disabling moving of ovs ports to different name spaces. Easy to do but would break OpenStack. Given that the kernel has no user space breakage policy, this would also have great chance of being rejected upstream.

2. Making ovs to work across net name spaces. This is hard and will require addition of kernel APIs and large changes in ovs user space daemons. This constitutes large amount of work.

Comment 4 Ben Pfaff 2015-02-13 22:26:47 UTC
(In reply to Jiri Benc from comment #3)
> 2. Making ovs to work across net name spaces. This is hard and will require
> addition of kernel APIs and large changes in ovs user space daemons. This
> constitutes large amount of work.

OVS is intended to work across namespaces.  Why do you think it is so hard?  It shouldn't be--I believe that the only requirement is for ovs-vswitchd to have a netlink socket in each namespace that matters, so that it can make netlink requests from the appropriate namespace.  (That requires having a thread setns to the namespace, create a netlink socket, then setns back to the original namespace.)  It's a bit of work but I don't see that it's a huge job.  Do you see big obstacles?  Until now we haven't bothered just because the stats aren't that important.

Comment 5 Jiri Benc 2015-02-16 08:48:02 UTC
(In reply to Ben Pfaff from comment #4)
> OVS is intended to work across namespaces.  Why do you think it is so hard? 

Let's assume a situation when somebody moves an internal port to a different netns using the ip command.

1. In the kernel, we'll need to call skb_scrub_packet whenever the packet crosses name spaces. This has to be done only when the packet actually crosses name spaces. One would think it would be easy to put this call into netdev_send, internal_dev_recv (or, rather, dev_forward_skb in this case) and maybe a few other places. Unfortunately, this doesn't work if you have two ports in the same netns (which is different to netns of all other ports) and you're forwarding between them. The information about the originating netns has to be propagated somehow to the send callback.

2. Querying statistics and other information about such interface would require switching to that name space and do the query there. For this, ovs needs to know the netns the interface was moved to. I'm currently aware of no way to do this. Listening on netlink would not help, as there's no event sent when a new netns appears and moving a device to another netns looks to the user space as if the device was destroyed in the old one and created in the new one (which is visible only in this new netns).

To solve this, we may use Nicolas Dichtel's work and communicate the peer netns from the kernel ovs module. Then we can listen for device deletion on netlink, ask the ovs module for the new location, switch to that name space and start tracking the netlink events there, too. This is racy, though.

Another way to solve this would be querying the stats (+ whatever needed) always through the ovs module. Then we won't have to care about name spaces in the ovs user space.

3. If I'm not missing something, it's unfortunately not possible "to have a netlink socket in each namespace that matters", see above.

Comment 6 Jiri Benc 2015-02-16 08:56:08 UTC
(In reply to Jiri Benc from comment #5)
> 1. In the kernel, we'll need to call skb_scrub_packet whenever the packet
> crosses name spaces. This has to be done only when the packet actually
> crosses name spaces. One would think it would be easy to put this call into
> netdev_send, internal_dev_recv (or, rather, dev_forward_skb in this case)
> and maybe a few other places. Unfortunately, this doesn't work if you have
> two ports in the same netns (which is different to netns of all other ports)
> and you're forwarding between them. The information about the originating
> netns has to be propagated somehow to the send callback.

One thing that's probably not that obvious is that this has to be done also for non-net_device backed ports, e.g. vxlan, when the packet comes from a different netns than the one that's used to send the encapsulated packet.

The right location to deal with this might be ovs_vport_send.

Comment 7 Ansis Atteka 2015-02-18 20:57:41 UTC
(In reply to Jiri Benc from comment #5)
> (In reply to Ben Pfaff from comment #4)
> > OVS is intended to work across namespaces.  Why do you think it is so hard? 
> 
> Let's assume a situation when somebody moves an internal port to a different
> netns using the ip command.
> 

> 
> 2. Querying statistics and other information about such interface would
> require switching to that name space and do the query there. For this, ovs
> needs to know the netns the interface was moved to. I'm currently aware of
> no way to do this. Listening on netlink would not help, as there's no event
> sent when a new netns appears and moving a device to another netns looks to
> the user space as if the device was destroyed in the old one and created in
> the new one (which is visible only in this new netns).

I have tested code that switches to namespace <X>, creates NETLINK socket in namespace <X> and then switches back to <root> namespace and then uses NETLINK socket created in namespace <X> from the <root> namespace. This seems to work fine. Of course there are some other system calls, where this trick would not work and ovs-vswitchd would have to nest them inside setns() calls (e.g. if_nametoindex(), ioctl()).

The namespace where interface would be moved into would receive RTM_NEWLINK notification. So, if we would have a NETLINK socket in every possible namespace then we would have to listen for RTM_NEWLINK call to figure out when interface was moved to different namespace. See below on how we might want to solve this issue with inotify().


> 
> To solve this, we may use Nicolas Dichtel's work and communicate the peer
> netns from the kernel ovs module. Then we can listen for device deletion on
> netlink, ask the ovs module for the new location, switch to that name space
> and start tracking the netlink events there, too. This is racy, though.
I will look up this to understand the actual implementation.

> 
> Another way to solve this would be querying the stats (+ whatever needed)
> always through the ovs module. Then we won't have to care about name spaces
> in the ovs user space.

Getting interface stats is one problem, but I believe sometimes ovs-vswitchd would also be interested in other NETLINK notifications(for example, if you renamed interface with "ip link" in non-root namespace).

> 
> 3. If I'm not missing something, it's unfortunately not possible "to have a
> netlink socket in each namespace that matters", see above.

While namespace theoretically can be mounted anywhere under filesystem, it seems that "ip netns" always mounts them under /var/run/netns. Could we use inotify() there so that ovs-vswitchd would get "namespace created" and "namespace removed" events?

Comment 8 Ben Pfaff 2015-02-18 21:07:58 UTC
> This seems to work fine. Of course there are some other system calls, where this trick would not work and ovs-vswitchd would have to nest them inside setns() calls (e.g. if_nametoindex(), ioctl()).

An equivalent to if_nametoindex() can be implemented via Netlink, as can some of the other operations for which OVS currently uses ioctls (e.g. get/set MTU).

Comment 9 Flavio Leitner 2015-02-19 15:41:47 UTC
(In reply to Ansis Atteka from comment #7)
> I have tested code that switches to namespace <X>, creates NETLINK socket in
> namespace <X> and then switches back to <root> namespace and then uses
> NETLINK socket created in namespace <X> from the <root> namespace. This
> seems to work fine. Of course there are some other system calls, where this
> trick would not work and ovs-vswitchd would have to nest them inside setns()
> calls (e.g. if_nametoindex(), ioctl()).

I am afraid that this works only one time.  If you move the interface once more to another netns then it's broken again, right? Also, there are things like MTU changing events, mac changing events, that I am not sure if that would be enough.


> The namespace where interface would be moved into would receive RTM_NEWLINK
> notification. So, if we would have a NETLINK socket in every possible
> namespace then we would have to listen for RTM_NEWLINK call to figure out
> when interface was moved to different namespace. See below on how we might
> want to solve this issue with inotify().

How would you know that the interface is the same? I mean, for every new interface you would have to check if it's backed by vport, and then do something.


> Getting interface stats is one problem, but I believe sometimes ovs-vswitchd
> would also be interested in other NETLINK notifications(for example, if you
> renamed interface with "ip link" in non-root namespace).

I agree here.  From OVS perspective, you would do 'ovs-vsctl show' and the interface there doesn't exist in the system because it is in another netns.  I think somehow that should work as expected by users. Perhaps show the netns id?

> > 3. If I'm not missing something, it's unfortunately not possible "to have a
> > netlink socket in each namespace that matters", see above.
> 
> While namespace theoretically can be mounted anywhere under filesystem, it
> seems that "ip netns" always mounts them under /var/run/netns. Could we use
> inotify() there so that ovs-vswitchd would get "namespace created" and
> "namespace removed" events?

It sounds racy and how would you handle when OVS is not running in the root netns? I.e. you don't have access to that mountpoint.

I discussed this with Jiri and we think differently.  My opinion is that OVS should forbid internal ports to be moved to another netns.  We should stick to veth pairs which does everything right and the overhead in the fast path is minimal.  Doing so, we have the local interface showing up on 'ovs-vsctl show' properly, events are ok, and the admin is free to move the other peer to any other netns and how many times is needed.

I was testing a new vport type to handle this properly which is half internal port and half veth.  The idea was to provide the flexibility of veth pair but having one side connected directly to DP (skip the initial of network stack) plus some other interesting features like the possibility to lock peers mac address and MTU. But at least speed-wise it didn't show any noticeable gain so I am inclined now to consider simply use standard veth pairs.

Comment 10 Jiri Benc 2015-02-19 19:25:42 UTC
(In reply to Ansis Atteka from comment #7)
> Getting interface stats is one problem, but I believe sometimes ovs-vswitchd
> would also be interested in other NETLINK notifications(for example, if you
> renamed interface with "ip link" in non-root namespace).

Yes. It would be better to rely on ifindex+netns id combination, though.

> While namespace theoretically can be mounted anywhere under filesystem, it
> seems that "ip netns" always mounts them under /var/run/netns. Could we use
> inotify() there so that ovs-vswitchd would get "namespace created" and
> "namespace removed" events?

This would be too fragile. There are other tools in the wild that don't use /var/run/netns. One example is OpenStack; it at least leaves a symlink in /var/run/netns as a courtesy, though.

This is of course solvable but needs kernel support. But we'll need a proper kernel interface to enumerate all net namespaces sooner or later anyway.

(In reply to Flavio Leitner from comment #9)
> I discussed this with Jiri and we think differently.  My opinion is that OVS
> should forbid internal ports to be moved to another netns.

Yes, our opinions differ on this one :-) I think this falls under the "don't break user space" rule, thus we can't do that.

Comment 11 Flavio Leitner 2015-02-19 20:07:25 UTC
(In reply to Jiri Benc from comment #10)
> (In reply to Flavio Leitner from comment #9)
> > I discussed this with Jiri and we think differently.  My opinion is that OVS
> > should forbid internal ports to be moved to another netns.
> 
> Yes, our opinions differ on this one :-) I think this falls under the "don't
> break user space" rule, thus we can't do that.

As you said, it's broken already. Even the fact that packets are flowing now is broken because of the lack of security.  Maybe instead of forbid, ovs can log a warning message when the internal port is moved to another netns and we update the documentation recommending to use veth instead.

Comment 12 Ansis Atteka 2015-02-20 21:52:03 UTC
(In reply to Flavio Leitner from comment #9)
> (In reply to Ansis Atteka from comment #7)
> > I have tested code that switches to namespace <X>, creates NETLINK socket in
> > namespace <X> and then switches back to <root> namespace and then uses
> > NETLINK socket created in namespace <X> from the <root> namespace. This
> > seems to work fine. Of course there are some other system calls, where this
> > trick would not work and ovs-vswitchd would have to nest them inside setns()
> > calls (e.g. if_nametoindex(), ioctl()).
> 
> I am afraid that this works only one time.  If you move the interface once
> more to another netns then it's broken again, right? Also, there are things
> like MTU changing events, mac changing events, that I am not sure if that
> would be enough.

Could you clarify what you mean with "works only one time"? I can move the same interface to different namespaces back and forth without any problems. Each time interface is moved to a different namespace then in that namespace RTM_NEWLINK notification is issued by the kernel.

> 
> 
> > The namespace where interface would be moved into would receive RTM_NEWLINK
> > notification. So, if we would have a NETLINK socket in every possible
> > namespace then we would have to listen for RTM_NEWLINK call to figure out
> > when interface was moved to different namespace. See below on how we might
> > want to solve this issue with inotify().
> 
> How would you know that the interface is the same? I mean, for every new
> interface you would have to check if it's backed by vport, and then do
> something.

I am using ifindex for this. Ifindex does not seem to change when I move interface from once namespace to another one. Just curious, are there scenarios when ifindex could change?

> 
> 
> > Getting interface stats is one problem, but I believe sometimes ovs-vswitchd
> > would also be interested in other NETLINK notifications(for example, if you
> > renamed interface with "ip link" in non-root namespace).
> 
> I agree here.  From OVS perspective, you would do 'ovs-vsctl show' and the
> interface there doesn't exist in the system because it is in another netns. 
> I think somehow that should work as expected by users. Perhaps show the
> netns id?

Yes, netns id (mount point on the file system, I guess) could be showed in 'ovs-vsctl show' command if ovs-vswitchd knows that interface is in different namespace.

> 
> > > 3. If I'm not missing something, it's unfortunately not possible "to have a
> > > netlink socket in each namespace that matters", see above.
> > 
> > While namespace theoretically can be mounted anywhere under filesystem, it
> > seems that "ip netns" always mounts them under /var/run/netns. Could we use
> > inotify() there so that ovs-vswitchd would get "namespace created" and
> > "namespace removed" events?
> 
> It sounds racy and how would you handle when OVS is not running in the root
> netns? I.e. you don't have access to that mountpoint.

Not sure what exactly you mean with "it is racy", but, yes, ovs-vswitchd would have to be careful to not to miss any NETLINK notifications by doing interface dump to get all interfaces immediately after it subscribed to NETLINK notifications. However, I think these would be ovs-vswitchd implementation details that are solvable.

I think the problem you are concerned about is not "what to do when ovs-vswitchd is not running in the root net-ns", but rather "what to do when ovs-vswitchd is not running in root mount-ns and root process-ns". I haven't played with these other types of namespace, but I would think that your concern is valid here that different ovs-vswitchd instances might get different results by looking at their /var/run/netns mount points.

On the other hand /init process is running in the root net-ns and ovs-vswitchd running in different namespace should be capable to subscribe to NETLINK notifications in the root namespace by using /proc/1/ns/net. I have to admit that I haven't tried this though.

> 
> I discussed this with Jiri and we think differently.  My opinion is that OVS
> should forbid internal ports to be moved to another netns.  We should stick
> to veth pairs which does everything right and the overhead in the fast path
> is minimal.  Doing so, we have the local interface showing up on 'ovs-vsctl
> show' properly, events are ok, and the admin is free to move the other peer
> to any other netns and how many times is needed.

This idea would be quite similar to the classical virtualization model (e.g. what KVM is using). With this I mean that one side of veth would represent interface on the hypervisor while the other side of veth would be like the interface that you can actually see in the guest VM. I can't judge about performance, so I can't say whether this is good or bad idea.


> 
> I was testing a new vport type to handle this properly which is half
> internal port and half veth.  The idea was to provide the flexibility of
> veth pair but having one side connected directly to DP (skip the initial of
> network stack) plus some other interesting features like the possibility to
> lock peers mac address and MTU. But at least speed-wise it didn't show any
> noticeable gain so I am inclined now to consider simply use standard veth
> pairs.

Comment 13 Jiri Benc 2015-02-22 13:15:22 UTC
(In reply to Ansis Atteka from comment #12)
> Could you clarify what you mean with "works only one time"? I can move the
> same interface to different namespaces back and forth without any problems.
> Each time interface is moved to a different namespace then in that namespace
> RTM_NEWLINK notification is issued by the kernel.

That's the problem, in *that* name space. How would you listen for such events? You don't get notifications about appearance of new name spaces.

> I am using ifindex for this. Ifindex does not seem to change when I move
> interface from once namespace to another one. Just curious, are there
> scenarios when ifindex could change?

ifindex can change when moving interface to a different netns. If the ifindex of the interface is currently used in the target netns, the interface gets a new value of ifindex.

> Yes, netns id (mount point on the file system, I guess)

Mount point and netns id are different things. Kernel does not actually care about the mount point, it uses file descriptors to specify the net ns. You can have multiple mount points or none.

netns id is a new concept added recently upstream. That one can be actually used to track interfaces across net name spaces (and it was the reason for adding it).

> could be showed in
> 'ovs-vsctl show' command if ovs-vswitchd knows that interface is in
> different namespace.

Which it has to learn somehow; this needs new API. The API currently provided by kernel does not allow doing that reliably.

> > It sounds racy and how would you handle when OVS is not running in the root
> > netns? I.e. you don't have access to that mountpoint.
> 
> Not sure what exactly you mean with "it is racy", but, yes, ovs-vswitchd
> would have to be careful to not to miss any NETLINK notifications by doing
> interface dump to get all interfaces immediately after it subscribed to
> NETLINK notifications. However, I think these would be ovs-vswitchd
> implementation details that are solvable.

Except that those notifications are per net ns. And it's currently impossible to listen in all net name spaces and it's impossible to learn about newly added net name spaces.

> I think the problem you are concerned about is not "what to do when
> ovs-vswitchd is not running in the root net-ns", but rather "what to do when
> ovs-vswitchd is not running in root mount-ns and root process-ns". I haven't
> played with these other types of namespace, but I would think that your
> concern is valid here that different ovs-vswitchd instances might get
> different results by looking at their /var/run/netns mount points.

/var/run/netns is not important, this cannot be used reliably even in case of a single mount ns and a single user ns.

But yes, so far we've been talking only about a single user name space. As net name spaces are per user name space and it's completely impossible to detect anything across user name spaces (for security reasons), there are other challenges here. I haven't even looked into them, yet, but I expect much more problems to be present here. Basically, what you want is ability to run multiple instances of ovs in different user name spaces; the kernel datapath has to be aware of this.

> On the other hand /init process is running in the root net-ns and
> ovs-vswitchd running in different namespace should be capable to subscribe
> to NETLINK notifications in the root namespace by using /proc/1/ns/net. I
> have to admit that I haven't tried this though.

Using netns of PID 1 should generally work, it's very unlikely someone will run a custom init that runs in a non-root netns. That is, if you're not in a container. But then we're talking about a different user name space, too, and the above applies.

Comment 15 Jiri Benc 2015-05-07 11:59:10 UTC
See bug 1157776 for another symptom of this problem.

Comment 16 Nir Magnezi 2015-05-07 12:48:14 UTC
*** Bug 1157776 has been marked as a duplicate of this bug. ***

Comment 26 sushil kulkarni 2017-10-24 14:25:03 UTC
Dropping this from 7.5 RPL and deferring to 7.6.. Per email discussions from dev/PM -  re: looking for feedback from OSP team.

-Sushil

Comment 31 sushil kulkarni 2018-07-12 13:37:47 UTC
Taking this out of the 7.6 RPL since it is for FDP.

-Sushil

Comment 34 qding 2018-08-23 05:22:30 UTC
Created attachment 1478036 [details]
ovsdb-client dump log for OVS2.9

Comment 35 qding 2018-08-23 05:24:07 UTC
Created attachment 1478037 [details]
ovsdb-client dump log for OVS2.10

Comment 36 qding 2018-08-23 05:31:49 UTC
Hi Jiri,

I'm doing verification with steps in Description #0. The logs for ovsdb-client dump are attached. Please help evaluate if the issue is fixed. Or if more test is needed, please let me know. Thanks.

ovs2.9 is for openvswitch-2.9.0-55.el7fdp.x86_64
ovs2.10 is for openvswitch2.10-2.10.0-1.el7fdp.x86_64

And ifindex for test with both ovs2.9 and ovs2.10 are same as below.

[root@dell-per730-04 tmp]# cat /sys/class/net/iface0/ifindex 
15

Qijun

Comment 37 Flavio Leitner 2018-08-28 03:05:59 UTC
Hi Qijun,

If you move an OVS internal port to another netns before the fix, the statistics and the iface status gets broken.

For instance, move the interface to another netns and bring it up/down to change the status and then check back with OVS. OVS will use ioctl and report errors on the log that the device doesn't exist.  It can't check MACADDR of the iface, so try change it and the stats also gets incorrect.

When you have the fixed OVS, the interface status remains accurate to what happens inside of the netns. The same for stats and MACADDR.

HTH,
fbl

Comment 38 qding 2018-08-28 06:25:37 UTC
Reproduced with openvswitch-2.9.0-55.el7fdp.x86_64
Verified with openvswitch2.10-2.10.0-1.el7fdp.x86_64

Comment 40 errata-xmlrpc 2018-09-20 20:13:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2735


Note You need to log in before you can comment on or make changes to this bug.