Bug 1680681 - [regression] no IPv6 communication possible on isolated virtual network
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-02-25 14:07 UTC by post+redhat
Modified: 2019-03-18 14:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-12 16:52:24 UTC


Attachments
ip6tables -S when it does not work (3.75 KB, text/plain)
2019-02-27 19:08 UTC, post+redhat

Description post+redhat 2019-02-25 14:07:18 UTC
Description of problem:


Version-Release number of selected component (if applicable): 5.0.0


How reproducible:
Fully


Steps to Reproduce:
1. Create two virtual machines connected by an isolated network, and run radvd on one.
2. Check the network interface for arriving RAs on the other.



Actual results:
No traffic.


Expected results:
The RAs should arrive at the other virtual host.


Additional info:

I have a test setup where I run a virtual router in one VM and a virtual client in another, connected via an isolated network. This worked fine for years, but the recent upgrade from version 4.10 to 5.0 broke this setup for IPv6: in Wireshark I could see that no IPv6 traffic was visible at all on the bridge, even though there should always be some base load of RAs and neighbor discovery going on.
After several hours of debugging, I finally found out that libvirt sets "disable_ipv6" on the bridge to 1. Manually setting it back to 0 works around the problem.
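For anyone retracing this step, here is a minimal sketch of reading that flag. The kernel exposes it per interface under /proc/sys/net/ipv6/conf/<iface>/disable_ipv6. The function name ipv6_disabled_on and the PROC_ROOT override are hypothetical helpers for illustration (the override exists only so the function can be tried against a fake directory tree); neither is part of libvirt.

```sh
#!/bin/sh
# Hypothetical helper: report whether the kernel has IPv6 disabled on an interface.
# PROC_ROOT is an assumption added for testing; on a real host it stays /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

ipv6_disabled_on() {
    iface="$1"
    flag_file="$PROC_ROOT/sys/net/ipv6/conf/$iface/disable_ipv6"
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
        return 1
    fi
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# Example (virbr1 is the bridge named later in this report):
#   ipv6_disabled_on virbr1
# The manual workaround described above would be (as root):
#   sysctl -w net.ipv6.conf.virbr1.disable_ipv6=0
```

Note that comment 4 below cautions that flipping this flag may not be the right fix, so the sketch is meant for inspection rather than as a recommended change.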

In the docs I found an "ipv6=yes" parameter mentioned for isolated networks. My isolated network (created with the virt-manager UI years ago) does not have this attribute set, and up until recently that was no problem. (It also seems like a rather awful default to block IPv6 communication between guests.) Also, setting this attribute now makes no difference at all. And indeed, looking at the code in bridge_driver.c, `ipv6nogw` is not involved in determining whether or not to set `disable_ipv6` to 1.

Comment 1 Laine Stump 2019-02-25 17:22:18 UTC
When you say "isolated network", do you mean a network with no IPv6 address assigned to the bridge? Or do you mean a network that has an IPv6 address, but no forward mode set?

In the latter case, you shouldn't need the ipv6='yes' attribute set - as long as the network definition has at least one IPv6 address in its config, disable_ipv6 should be set to 0. If it isn't, that is a bug.

In the former case (no IPv6 address in the network config), we *should have* always set disable_ipv6 to 1 unless ipv6="yes" was set. The attribute was initially added for exactly this reason: networks with no IPv6 address specified were blocking all IPv6 traffic, and a user wanted the ability to have IPv6 on the network even if the host wasn't participating. Having IPv6 magically turned on for all networks with no warning to the user was deemed to be a security problem, so rather than just mass-enabling IPv6 on all networks, we (actually it was the user who requested the feature) added the ipv6="yes" attribute to turn on IPv6 even when there is no IPv6 address in the network config. If that wasn't previously working, and it now is, then that is a case of a bug that has been fixed.

Comment 2 post+redhat 2019-02-25 20:22:09 UTC
> When you say "isolated network", do you mean a network with no IPv6 address assigned to the bridge? Or do you mean a network that has an IPv6 address, but no forward mode set?

I mean what virt-manager describes in its UI as "isolated network". The network XML definition is

<network ipv6='yes'>
  <name>ffnet</name>
  <uuid>cfd2c92a-db77-4b27-ad78-a8a81ace32b6</uuid>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:27:6c:42'/>
  <domain name='ffnet'/>
</network>

I added the "ipv6=yes" attribute manually, but that does not make a difference. So even if this is a bug that got fixed (I see no recent changes of this code in the git history), there are at least two more bugs:
* Setting ipv6='yes' has no effect, "disable_ipv6" is still set.
* When creating an "isolated network" in virt-manager, that should have working IPv6. It would be terribly surprising if such a network is IPv4-only. In general, I'd call an IPv4-only network that actively blocks IPv6 packets broken. What I am expecting is a switch between my VMs that just forwards all traffic and doesn't care about Layer 3 and up. Blocking IPv6 might be what you want in some special situations, but I don't understand how it can be considered a reasonable default. I understand the attribute was introduced for backwards compatibility with prior installations in case someone relied on the lack of IPv6 support in libvirt back then, but then it should be set by default on newly created networks.

Comment 3 post+redhat 2019-02-25 20:25:26 UTC
Also I just noticed that setting "disable_ipv6=0" is not sufficient to fix IPv6. The bridge still swallows neighbor solicitations that are sent from the client to the router, meaning that there is still no IPv6 connectivity. I am at a loss about what is causing these packets to get dropped. I tried setting multicast_router=2, which is supposed to mean that all multicast traffic is sent to all ports, to no avail.

And in any case, it is certainly a regression when a network that got created using the virt-manager UI several years ago always worked fine over multiple major versions of libvirt, and now suddenly IPv6 does not work any more. At this point you have backwards compatibility concerns here as well.

Comment 4 Laine Stump 2019-02-26 20:14:05 UTC
I've looked back through the history of this, and have found a few things:

1) I had forgotten the details of the "disable_ipv6" setting, but looking into this refreshed my memory. For this problem, it is a red herring. It has always been set unless there is an IPv6 address configured in the network definition. This is *not* (and never has been) changed by setting "ipv6='yes'". My understanding is that disable_ipv6 only affects IPv6 operation for the *host* network stack, but not for any guests that are attached to the bridge. Bug 501934 (filed and fixed nearly 10 years ago) explains why we must set disable_ipv6 to 1 on a bridge when the host isn't participating in an IPv6 network on that bridge. The setup of that bug's reporter is surprisingly similar to your own, leading me to believe that setting disable_ipv6=0 on your bridge could actually *break* your networking rather than fixing it.


2) The only thing that setting ipv6='yes' changes is the ip6tables rules that are added for the network - if ipv6='yes' is *not* specified, then no ip6tables rules are added. If ipv6='yes' is specified (or if there is an IPv6 address configured for the network), then the following 3 rules are added:

  -A FORWARD -i virbr0 -o virbr0 -j ACCEPT
  -A FORWARD -o virbr0 -j REJECT --reject-with icmp6-port-unreachable
  -A FORWARD -i virbr0 -j REJECT --reject-with icmp6-port-unreachable

(the first allows all IPv6 to be forwarded between any port on virbr0, the other 2 reject any traffic being forwarded anywhere else)
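To make points 1 and 2 concrete, here is a rough sketch that reads a network XML definition on stdin and prints what this comment says libvirt would do with it. It only encodes the description given here and is not taken from bridge_driver.c. The function name describe_network_ipv6 and the output keys are made up for illustration, and the grep patterns assume each element's attributes sit on one line.

```sh
#!/bin/sh
# Sketch of the behaviour described in this comment, NOT libvirt's actual code.
# Reads a libvirt network XML definition on stdin and prints two lines.
describe_network_ipv6() {
    xml="$(cat)"
    has_addr=no
    has_attr=no
    # An IPv6 address appears as <ip family='ipv6' .../> in the network XML.
    if printf '%s\n' "$xml" | grep -Eq "<ip[^>]*family=['\"]ipv6['\"]"; then
        has_addr=yes
    fi
    # The opt-in attribute sits on the root element: <network ipv6='yes'>.
    if printf '%s\n' "$xml" | grep -Eq "<network[^>]*ipv6=['\"]yes['\"]"; then
        has_attr=yes
    fi
    # Point 1: disable_ipv6 depends only on a configured IPv6 address.
    if [ "$has_addr" = yes ]; then
        echo "disable_ipv6=0"
    else
        echo "disable_ipv6=1"
    fi
    # Point 2: ip6tables rules are added for either the address or ipv6='yes'.
    if [ "$has_addr" = yes ] || [ "$has_attr" = yes ]; then
        echo "ip6tables_rules=added"
    else
        echo "ip6tables_rules=none"
    fi
}

# Example:
#   virsh net-dumpxml ffnet | describe_network_ipv6
```

For the ffnet definition from comment 2 (ipv6='yes', no IPv6 address), this would report disable_ipv6=1 together with ip6tables_rules=added, which matches what the reporter observed.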


3) Looking through the code history in git, I don't see any functional changes to the ip6tables rules added for IPv6 in at least 5 years, much less in the 1 year between libvirt-4.1.0 and libvirt-5.0.0.

My suspicion is that the behavior change you're seeing is due to a change outside of libvirt. You could verify this by downgrading *just* libvirt back to 4.1.0 to see if you get ipv6 functionality back on your network (I'm guessing it won't make a difference).

Can you provide more info on the rest of your system. For example, the distro and version. Also, are you running firewalld and, if so, what version, and does IPv6 begin to work on the isolated network if you stop firewalld.service? Did you possibly upgrade to a firewalld version that is 0.6.0 or greater, and had the firewalld backend switched from iptables to nftables?

If you can create a setup where IPv6 works, can you grab the output of "ip6tables -S" and attach it to the bug? Also do the same for a setup where it *doesn't* work.
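One possible way to compare the two dumps once both exist is sketched below. The helper name rules_only_in and the file names are placeholders. It prints the rules that appear in the first dump but not in the second.

```sh
#!/bin/sh
# Print rules present in the first dump but missing from the second.
# Both arguments are files holding saved "ip6tables -S" output.
rules_only_in() {
    a_sorted="$(mktemp)"
    b_sorted="$(mktemp)"
    # comm needs sorted input, so sort both dumps first.
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    comm -23 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Capturing the dumps (as root); the file names are placeholders:
#   ip6tables -S > broken.rules      # while IPv6 does not work
#   ip6tables -S > working.rules     # while IPv6 works
#   rules_only_in broken.rules working.rules
```

Keep in mind that "ip6tables -S" without "-t" lists only the filter table, so rules in other tables (raw, mangle, nat) would need their own dumps, for example "ip6tables -t raw -S".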

You can probably determine which rule in your ip6tables configuration is rejecting the RA packets by running a script like this:

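# Zero the counters on each pass and print only REJECT/DROP rules that matched since the last pass.
while true; do
   ip6tables -S -v -Z | grep -v ^Zero | grep -v -e "-c 0 0" | grep -v ^-N | grep -E 'REJECT|DROP'
   sleep 1
done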

Comment 5 post+redhat 2019-02-27 19:08:09 UTC
> My understanding is that disable_ipv6 only affects IPv6 operation for the *host* network stack, but not for any guests that are attached to the bridge. Bug 501934

That makes more sense. But still, I definitely see the RAs appear on the bridge the moment I set disable_ipv6=0. But maybe that's because then the kernel kicks in, becomes multicast-enabled, and that circumvents whatever filtering is otherwise in place?

> The only thing that setting ipv6='yes' changes is the ip6tables rules that are added for the network

I had briefly looked at iptables before, but -- why should bridge traffic even go through there? No routing is happening, just forwarding. `nf_call_ip6tables` is set to 0.
On the other hand, I *do* see the packets in vnet0 in wireshark, and then they are gone in virbr1. So this definitely looks like filtering, and "ebtables -L" says that one is empty.
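For reference, the per-bridge knobs touched on in this thread can be read from sysfs. The sketch below assumes the usual layout /sys/class/net/<bridge>/bridge/<key>. The function name show_bridge_settings and the SYSFS_ROOT override (used only to point the function at a fake tree) are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: print a few per-bridge settings relevant to this report.
# SYSFS_ROOT is an assumption added for testing; on a real host it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_bridge_settings() {
    bridge="$1"
    dir="$SYSFS_ROOT/class/net/$bridge/bridge"
    for key in nf_call_ip6tables multicast_snooping multicast_router; do
        if [ -r "$dir/$key" ]; then
            echo "$key=$(cat "$dir/$key")"
        else
            echo "$key=unavailable"
        fi
    done
}

# Example:
#   show_bridge_settings virbr1
```

A value of nf_call_ip6tables=0 only says that bridged frames are not handed to ip6tables by the bridge netfilter hook; it does not rule out other tables or hooks dropping the packets.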

> much less in the 1 year between libvirt-4.1.0 and libvirt-5.0.0.

I was on 4.10.0 before, not 4.1.0.
However, I just tried downgrading everything (libvirt, firewalld, linux) and that didn't help. I don't know why this stopped working...


> Can you provide more info on the rest of your system. For example, the distro and version.

Debian testing.

> Also, are you running firewalld and, if so, what version

Yes, firewalld is installed, version 0.6.3-5. I configured it to use the iptables backend due to issues with libvirt.

> If you can create a setup where IPv6 works, can you grab the output of "ip6tables -S" and attach it to the bug? Also do the same for a setup where it *doesn't* work.

I don't know how to make it work. I attached the output for when it does not work.

> You can probably determine which rule in your ip6tables configuration is rejecting the RA packets by running a script like this:

Is that looking for the DROP rule where the count keeps increasing? I tried that visually without much result. Your script prints

-A INPUT -c 4 432 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 4 432 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 152 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 4 432 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 216 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 1 127 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 5 830 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 216 -j REJECT --reject-with icmp6-adm-prohibited
-A INPUT -c 2 216 -j REJECT --reject-with icmp6-adm-prohibited
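Since the loop prints the same rule over and over, it can help to total the counters per rule. Here is a small sketch of that; the function name sum_rule_counters is made up, and it assumes the "-c PACKETS BYTES" form shown in the output above.

```sh
#!/bin/sh
# Sum the "-c PACKETS BYTES" counters per distinct rule.
# Reads lines such as "-A INPUT -c 4 432 -j REJECT ..." on stdin.
sum_rule_counters() {
    awk '
    {
        pkts = 0; bytes = 0; rule = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "-c" && i + 2 <= NF) {
                # Pull the counters out so identical rules group together.
                pkts = $(i + 1); bytes = $(i + 2); i += 2
            } else {
                rule = (rule == "" ? $i : rule " " $i)
            }
        }
        total_pkts[rule] += pkts
        total_bytes[rule] += bytes
    }
    END {
        for (r in total_pkts)
            printf "%d packets, %d bytes: %s\n", total_pkts[r], total_bytes[r], r
    }'
}

# Example: save the loop output to a file, then
#   sum_rule_counters < loop-output.txt
```

In the listing above every hit is on the final REJECT of the INPUT chain and none is on FORWARD, so this view mostly confirms that the filter table's FORWARD chain is not where the bridged packets are being counted.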

I will experiment with disabling firewalld entirely (just stopping it mid-operation didn't help, but I don't trust that that will do anything meaningful).

Comment 6 post+redhat 2019-02-27 19:08:30 UTC
Created attachment 1539252
ip6tables -S when it does not work

Comment 7 post+redhat 2019-02-27 19:33:49 UTC
I tried `systemctl disable firewalld` followed by a reboot. That indeed fixed the bridge! ip6tables -S says:

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT
-A FORWARD -o virbr1 -j REJECT --reject-with icmp6-port-unreachable
-A FORWARD -i virbr1 -j REJECT --reject-with icmp6-port-unreachable

However, now the other virtual network, the one set up as "NAT", stopped working -- so the router VM does not have internet...

Comment 8 post+redhat 2019-02-27 21:07:18 UTC
Interesting find: The "ip6tables -t raw -S" shows

  -A PREROUTING -m rpfilter --invert -j DROP

and when I remove that rule, things start to work.
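A cautious way to repeat this experiment is to generate the delete command from the listing instead of typing it by hand. The sketch below only prints the commands (a dry run). The function name rpfilter_delete_commands is made up for illustration.

```sh
#!/bin/sh
# Reads "ip6tables -t raw -S" output on stdin and prints, without running them,
# the commands that would delete every rule using the rpfilter match.
rpfilter_delete_commands() {
    grep -e '-m rpfilter' | sed -e 's/^ *-A /ip6tables -t raw -D /'
}

# Example (as root); review the printed commands before running any of them:
#   ip6tables -t raw -S | rpfilter_delete_commands
```

A rule deleted this way is only gone until whatever installed it puts it back, so this is a diagnostic step rather than a persistent fix.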

Comment 9 Laine Stump 2019-02-27 22:50:02 UTC
I just talked to a firewalld maintainer, and he says that rule is added by firewalld when IPv6_rpfilter=yes in /etc/firewalld/firewalld.conf. He also says that the rule has always been added, and the default hasn't changed, so you may have encountered an error in rpfilter in your kernel. He pointed out one BZ that could have caused your problem - Bug 1575431 - but that bug was introduced in kernel 4.16 and fixed in kernel 4.18, so unless you are using 4.16.* or 4.17.*, you may be seeing some *different* rpfilter bug (possibly one that hasn't been reported by anyone else yet).
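To check a kernel version against that window (4.16 up to, but not including, 4.18), something like the following sketch could be used. The function names are made up, the comparison relies on GNU "sort -V", and distro suffixes such as "-2-amd64" are simply cut off, which is an assumption about how the version string is formed.

```sh
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (relies on GNU "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# in_rpfilter_bug_range VERSION: succeeds for 4.16 <= VERSION < 4.18,
# the window given above for Bug 1575431.
in_rpfilter_bug_range() {
    ver="${1%%-*}"   # drop distro suffixes such as "-2-amd64"
    version_ge "$ver" 4.16 && ! version_ge "$ver" 4.18
}

# Example:
#   in_rpfilter_bug_range "$(uname -r)" && echo "kernel is in the affected window"
```

A kernel outside the window is not proof that rpfilter is fine; distro kernels carry backports, so this is only a first filter.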

So you should be able to get your setup working temporarily by setting IPv6_rpfilter=no in firewalld.conf, or if your kernel has the abovementioned rpfilter bug then maybe upgrading the kernel will help. But if you're already running kernel 4.18, we may need to report this to kernel network people.
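The config change could be scripted roughly as below. This is only a sketch: the function name set_ipv6_rpfilter is made up, and it takes the file path as an argument so it can be tried on a copy of firewalld.conf first.

```sh
#!/bin/sh
# set_ipv6_rpfilter FILE VALUE: set IPv6_rpfilter=VALUE in a firewalld.conf-style
# file, replacing an existing line or appending one if the key is absent.
set_ipv6_rpfilter() {
    conf="$1"
    value="$2"
    if grep -q '^IPv6_rpfilter=' "$conf"; then
        tmp="$(mktemp)"
        # Rewrite via a temp file instead of "sed -i" to stay portable.
        sed "s/^IPv6_rpfilter=.*/IPv6_rpfilter=$value/" "$conf" > "$tmp" && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        printf 'IPv6_rpfilter=%s\n' "$value" >> "$conf"
    fi
}

# Example (try on a copy first, then apply to the real file as root):
#   cp /etc/firewalld/firewalld.conf /tmp/firewalld.conf
#   set_ipv6_rpfilter /tmp/firewalld.conf no
```

firewalld has to be restarted (or the machine rebooted) before the changed setting is applied to the rule set.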

Comment 10 post+redhat 2019-02-28 08:31:36 UTC
I am on 4.19.16, so I should have that fix.

Comment 11 post+redhat 2019-02-28 08:46:25 UTC
I can confirm setting IPv6_rpfilter=no and rebooting fixes the problem.


So yes, seems like a bug report elsewhere is in order. Where do kernel net bug reports usually get sent? (My personal success rate with kernel bug reports is... spotty.)

Comment 12 Laine Stump 2019-02-28 17:51:43 UTC
Yeah, I've had/noticed much the same experience with kernel bugs, unless I happened to know and catch the attention of a kernel developer working on the related part of the kernel...

It's been suggested to me that you could do one of the following:

* file a kernel bug at https://www.debian.org/Bugs

* if we're certain it's a kernel bug that isn't yet fixed upstream (keep in mind upstream is beyond 4.19 now), file a bug at bugzilla.kernel.org (although that might get ignored)

* if it's "clearly netfilter" (and, again, you're sure it's not yet fixed upstream) then file it at bugzilla.netfilter.org.

We can leave this BZ open for now, until you've filed something somewhere else, or you've found the problem is resolved by upgrading the kernel.

Comment 13 post+redhat 2019-03-11 21:44:43 UTC
The problem is indeed fixed by using upstream kernel 4.20.14. I have reported a bug against the Debian kernel.

Comment 14 Laine Stump 2019-03-12 16:52:24 UTC
Okay, thanks for following up. I'm closing this one as NOTABUG then.

Comment 15 post+redhat 2019-03-18 14:14:51 UTC
For the record, the Debian kernel bug report is at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=924349

