Bug 1275468

Summary: configure many networks with ipv6; one will randomly not autostart. hits DAD timeout

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [Community] Virtualization Tools | Reporter | jean-christophe manciot <actionmystique> |
| Component | libvirt | Assignee | Libvirt Maintainers <libvirt-maint> |
| Status | CLOSED CURRENTRELEASE | QA Contact | |
| Severity | unspecified | Docs Contact | |
| Priority | unspecified | Version | unspecified |
| CC | actionmystique, berrange, crobinso, dyuan, fjin, jdenemar, laine, mzhan, rbalakri | Keywords | Reopened |
| Target Milestone | --- | Target Release | --- |
| Hardware | x86_64 | OS | Linux |
| Whiteboard | | Fixed In Version | |
| Doc Type | Bug Fix | Doc Text | |
| Story Points | --- | Clone Of | |
| Environment | | Last Closed | 2016-08-10 09:20:45 UTC |
| Type | Bug | Regression | --- |
| Mount Type | --- | Documentation | --- |
| CRM | | Verified Versions | |
| Category | --- | oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | | Cloudforms Team | --- |
| Target Upstream Version | | Embargoed | |

Attachments:
Description
jean-christophe manciot
2015-10-27 01:27:36 UTC
Same issue with 1.2.21.

> Uninstalling libvirt while there are some domains or networks running is not a very good idea, because they will still be running but libvirt will lose track of them. Just shut down all domains and destroy all networks before uninstalling libvirt and everything should work. I don't see any bug here.

Got it, thanks. Well, I tried to follow your guidelines:
- shut down all domains
- destroy all networks
- uninstall libvirt
- install libvirt 1.2.21

The issue remains, and each time the machine boots up I have to perform the following to try to get the first three networks active (the others are OK). Sometimes a new issue surfaces!

```
cd /etc/libvirt/qemu/networks
virsh connect qemu:///system
virsh net-list --all
 Name                 State      Autostart   Persistent
--------------------------------------------------------
 default              inactive   yes         yes
 loopback             inactive   yes         yes
 virtual-bridge-1     active     yes         yes
 virtual-bridge-2     active     yes         yes
 virtual-bridge-3     inactive   yes         yes
 virtual-bridge-4     active     yes         yes
 virtual-bridge-5     active     yes         yes
 virtual-bridge-6     active     yes         yes
 virtual-bridge-7     active     yes         yes
 virtual-bridge-8     active     yes         yes
 virtual-bridge-9     active     yes         yes
 virtual-router       inactive   yes         yes

brctl show
bridge name  bridge id          STP enabled  interfaces
virbr0       8000.000000000000  yes
virbr1       8000.000000000000  yes
virbr10      8000.525400a8ff01  yes          virbr10-nic
virbr11      8000.525400335153  yes          virbr11-nic
virbr2       8000.000000000000  yes
virbr3       8000.5254002a7ee2  yes          virbr3-nic
virbr4       8000.52540097ec52  yes          virbr4-nic
virbr6       8000.5254005f9a92  yes          virbr6-nic
virbr7       8000.525400573542  yes          virbr7-nic
virbr8       8000.525400960209  yes          virbr8-nic
virbr9       8000.525400e89e8c  yes          virbr9-nic

ifconfig virbr0 down
brctl delbr virbr0
virsh net-define default.xml
Network default defined from default.xml
virsh net-start default
Network default started

ifconfig virbr1 down
brctl delbr virbr1
virsh net-define loopback.xml
Network loopback defined from loopback.xml
virsh net-start loopback
Network default started

ifconfig virbr2 down
brctl delbr virbr2
virsh net-define virtual-router.xml
Network virtual-router defined from virtual-router.xml
virsh net-start virtual-router
Network virtual-router started

virsh net-list --all
 Name                 State     Autostart   Persistent
-------------------------------------------------------
 default              active    yes         yes
 loopback             active    yes         yes
 virtual-bridge-1     active    yes         yes
 virtual-bridge-2     active    yes         yes
 virtual-bridge-3     active    yes         yes
 virtual-bridge-4     active    yes         yes
 virtual-bridge-5     active    yes         yes
 virtual-bridge-6     active    yes         yes
 virtual-bridge-7     active    yes         yes
 virtual-bridge-8     active    yes         yes
 virtual-bridge-9     active    yes         yes
 virtual-router       active    yes         yes

brctl show
bridge name  bridge id          STP enabled  interfaces
virbr0       8000.5254009d4405  yes          virbr0-nic
virbr1       8000.525400ffebb0  yes          virbr1-nic
virbr10      8000.525400a8ff01  yes          virbr10-nic
virbr11      8000.525400335153  yes          virbr11-nic
virbr2       8000.5254000d35ae  yes          virbr2-nic
virbr3       8000.5254002a7ee2  yes          virbr3-nic
virbr4       8000.52540097ec52  yes          virbr4-nic
virbr5       8000.5254004491cb  yes          virbr5-nic
virbr6       8000.5254005f9a92  yes          virbr6-nic
virbr7       8000.525400573542  yes          virbr7-nic
virbr8       8000.525400960209  yes          virbr8-nic
virbr9       8000.525400e89e8c  yes          virbr9-nic
```

Sometimes, I get the following errors:

```
error: Disconnected from qemu:///system due to I/O error
error: Failed to start network loopback
error: Cannot recv data: Connection reset by peer
error: One or more references were leaked after disconnect from the hypervisor
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
```

Not solved with libvirt 1.3.0; exactly the same symptoms. If I stop and disable the GNOME NetworkManager systemd service, the symptoms mostly disappear, even though from time to time a random net is still not started after reboot despite being marked as Autostart:

```
virsh net-list --all
 Name                 State      Autostart   Persistent
--------------------------------------------------------
 default              active     yes         yes
 loopback             active     yes         yes
 ovs-net              active     yes         yes
 virl-data-flat       active     yes         yes
 virl-data-flat1      active     yes         yes
 virl-data-snat       active     yes         yes
 virl-openstack       active     yes         yes
 virtual-bridge-1     active     yes         yes
 virtual-bridge-2     active     yes         yes
 virtual-bridge-3     active     yes         yes
 virtual-bridge-4     active     yes         yes
 virtual-bridge-5     active     yes         yes
 virtual-bridge-6     active     yes         yes
 virtual-bridge-7     inactive   yes         yes
 virtual-bridge-8     active     yes         yes
 virtual-bridge-9     active     yes         yes
 virtual-router       active     yes         yes
```

Strange, not sure how NetworkManager would be affecting things here. What distro are you on? Is this still reproducing with the latest libvirt?

The behavior has improved with the latest releases of libvirt. With 1.3.3, I still experience the latest symptom (one virtual net not active after reboot) with network-manager active. I use Ubuntu server 15.10, kernel 4.2.0-35.

Can you check syslog for any libvirt errors? It may explain why the network is not starting at bootup.

With log_level = 1 in /etc/libvirt/libvirtd.conf and

```
sed '/libvirt/!d' syslog > libvirtd.debug.log
```

the result is attached.

Created attachment 1145997 [details]
log_level=debug
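The per-network recovery sequence quoted earlier in this thread (bring the bridge down, delete it, redefine and restart the network) can be scripted. The sketch below is a dry run using names from this report (default/virbr0, loopback/virbr1); the `run` wrapper only echoes each command, so replace it with `"$@"` and run as root to execute for real.

```shell
#!/bin/sh
# Dry-run sketch of the manual recovery steps quoted above.
# "run" only echoes each command; change its body to  "$@"  to execute.
run() { echo "+ $*"; }

recover_net() {   # $1 = libvirt network name, $2 = its bridge
    run ifconfig "$2" down
    run brctl delbr "$2"
    run virsh net-define "$1.xml"   # XML files live in /etc/libvirt/qemu/networks
    run virsh net-start "$1"
}

recover_net default virbr0
recover_net loopback virbr1
```

Note this is only a mechanical transcription of the workaround, not a fix: it assumes the network XML files sit in the current directory, as in the report.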
Global debug logging is extremely verbose... and I don't see messages about most of your networks in that log, just virtual-bridge-3 and virtual-bridge-4, so I'm not really sure what's going on. Default settings should have libvirtd send errors to syslog, which should catch the startup failure, I assume. dmesg may also have bits about network interfaces going up or down.

OK. This time the vnet "virtual-router" has not started despite being set to autostart:

```
sudo virsh net-list --all
 Name                 State      Autostart   Persistent
--------------------------------------------------------
 default              active     yes         yes
 loopback             active     yes         yes
 ovs-net              active     yes         yes
 virl-data-flat       active     yes         yes
 virl-data-flat1      active     yes         yes
 virl-data-snat       active     yes         yes
 virl-openstack       active     yes         yes
 virtual-bridge-3     active     yes         yes
 virtual-bridge-4     active     yes         yes
 virtual-mgt-5        active     yes         yes
 virtual-router       inactive   yes         yes
```

It is linked to the virbr2 bridge, which is nowhere to be found:

```
sudo brctl show
bridge name  bridge id          STP enabled  interfaces
docker0      8000.024230f861fd  no
lxcbr0       8000.000000000000  no
virbr0       8000.525400e18980  yes          virbr0-nic
virbr1       8000.525400ffebb0  yes          virbr1-nic
virbr12      8000.5254008826e1  yes          virbr12-nic
virbr13      8000.5254001b43cc  yes          virbr13-nic
virbr14      8000.525400cb1e3a  yes          virbr14-nic
virbr15      8000.525400d30f13  yes          virbr15-nic
virbr3       8000.5254006b9df4  yes          virbr3-nic
virbr4       8000.52540078ef37  yes          virbr4-nic
virbr5       8000.5254009aa68c  yes          virbr5-nic
```

There are the following entries in the log:

```
NetworkManager[1643]: <info> (virbr2): Activation: successful, device activated.
...
NetworkManager[1643]: <warn> (virbr2): failed to detach bridge port virbr2-nic
NetworkManager[1643]: <info> (virbr2): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
```

It's the only bridge which experiences the previous fate. syslog.libvirt.log & syslog.only-virbr2.log are attached.

Created attachment 1146206 [details]
Contains only lines with: libvirt, NetworkManager, dnsmasq, systemd, virtual-router and virbr2
Created attachment 1146207 [details]
syslog with only lines containing virbr2
I also do not know the consequences of:

```
libvirtd[1518]: Duplicate Address Detection not finished in 20 seconds
libvirtd[1518]: this function is not supported by the connection driver: virConnectGetCPUModelNames
```

I rebooted and confirmed that this issue happens randomly to another virtual net/bridge, this time to virtual-bridge-4/virbr4. I started virtual-bridge-4 with:

```
sudo virsh net-start virtual-bridge-4
```

We can note from the log "syslog.only_virbr4_virtual-bridge-4.log" that this time:

```
NetworkManager[1025]: <info> (virbr4): bridge port virbr4-nic was detached
```

I have no idea why it succeeds when done from the CLI.

Created attachment 1146225 [details]
Syslog with only lines containing virbr4 and virtual-bridge-4 after reboot & manual start of the vnet
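Since the failing network varies from boot to boot, the "start it by hand" workaround above can be generalized: filter `virsh net-list --all` for networks that are marked autostart yet ended up inactive, and start each one. This sketch demonstrates the filter on a captured sample of the output shown in this report; on a live host you would pipe the real command into it (e.g. `virsh net-list --all | list_inactive_autostart | xargs -r -n1 virsh net-start`). The column positions assumed by the awk filter match the net-list layout above.

```shell
#!/bin/sh
# Find networks marked Autostart=yes but State=inactive.
# NR>2 skips the two header lines; $1=name, $2=state, $3=autostart.
list_inactive_autostart() {
    awk 'NR>2 && $2=="inactive" && $3=="yes" {print $1}'
}

# Demonstrated on a captured sample of "virsh net-list --all" output:
cat <<'EOF' | list_inactive_autostart
 Name                 State      Autostart   Persistent
--------------------------------------------------------
 default              active     yes         yes
 virtual-router       inactive   yes         yes
EOF
```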
Do those failing network configs have IPv6 in them? When there's a failure, is there always an error in the logs from libvirt about the duplicate address detection timeout?

Yes, the virtual networks not defined with IPv6 never fail. And yes, it seems that the "Duplicate Address Detection not finished in 20 seconds" error triggers the removal of the virtual bridge. Also, "Interface virbrN.IPv4 no longer relevant for mDNS" always appears right after "Duplicate Address Detection not finished in 20 seconds", although in theory there should not be any link between the two.

```
libvirtd[1119]: libvirt version: 1.3.3
libvirtd[1119]: hostname: samsung-ubuntu.actionmystique.net
libvirtd[1119]: Duplicate Address Detection not finished in 20 seconds
avahi-daemon[1016]: Interface virbr5.IPv4 no longer relevant for mDNS.
avahi-daemon[1016]: Leaving mDNS multicast group on interface virbr5.IPv4 with address 172.21.100.1.
kernel: [   20.678465] virbr5: port 1(virbr5-nic) entered disabled state
avahi-daemon[1016]: Withdrawing address record for 172.21.100.1 on virbr5.
avahi-daemon[1016]: Joining mDNS multicast group on interface virbr5-nic.IPv6 with address fe80::5054:ff:fe9a:a68c.
avahi-daemon[1016]: New relevant interface virbr5-nic.IPv6 for mDNS.
avahi-daemon[1016]: Registering new address record for fe80::5054:ff:fe9a:a68c on virbr5-nic.*.
avahi-daemon[1016]: Interface virbr5-nic.IPv6 no longer relevant for mDNS.
avahi-daemon[1016]: Leaving mDNS multicast group on interface virbr5-nic.IPv6 with address fe80::5054:ff:fe9a:a68c.
avahi-daemon[1016]: Withdrawing address record for fe80::5054:ff:fe9a:a68c on virbr5-nic.
kernel: [   20.847127] device virbr5-nic left promiscuous mode
kernel: [   20.847130] virbr5: port 1(virbr5-nic) entered disabled state
avahi-daemon[1016]: Withdrawing workstation service for virbr5-nic.
NetworkManager[975]: <info> devices removed (path: /sys/devices/virtual/net/virbr5-nic, iface: virbr5-nic)
NetworkManager[975]: <info> (virbr5-nic): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
NetworkManager[975]: <warn> (virbr5): failed to detach bridge port virbr5-nic
NetworkManager[975]: <warn> (virbr5-nic): failed to disable userspace IPv6LL address handling
avahi-daemon[1016]: Withdrawing workstation service for virbr5.
NetworkManager[975]: <info> (virbr5): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
NetworkManager[975]: <info> devices removed (path: /sys/devices/virtual/net/virbr5, iface: virbr5)
```

The full log filtered with libvirtd, virbr5 & IPv6 is attached.

Created attachment 1146699 [details]
Syslog with only libvirtd, virbr5 & IPv6
Is there a way to change the DAD timeout? Nothing like that appears in libvirtd.conf. According to this post - https://www.redhat.com/archives/libvir-list/2015-October/msg00851.html - it should equal the following sum:

```
net.ipv6.conf.default.router_solicitation_delay
+ net.ipv6.conf.default.dad_transmits * net.ipv6.neigh.default.retrans_time_ms
```

On my system, that means 3 s:

```
net.ipv6.conf.all.router_solicitation_delay = 1
net.ipv6.conf.default.dad_transmits = 1
net.ipv6.neigh.default.retrans_time_ms = 1000
```

No, it's not configurable. But if you are building libvirt from source, you can manually adjust the code. If that works, maybe we should look at making it configurable. The line you need to edit is in src/util/virnetdev.c:

```
#define VIR_DAD_WAIT_TIMEOUT 20 /* seconds */
```

Try upping that to 120, recompile + install, and see what happens.

Changing VIR_DAD_WAIT_TIMEOUT to 60 makes the issue vanish. However, I'm still puzzled: I have 5 virtual networks defined with IPv6; 5x3x2 = 30 s (one DAD for each LLA and one for each static IPv6 address). That would explain why 20 s is too short. Why doesn't libvirtd test IPv6 DAD on all interfaces in parallel, instead of sequentially, which seems to be the case?

(In reply to jean-christophe manciot from comment #24)
> Changing VIR_DAD_WAIT_TIMEOUT to 60 makes the issue vanish.
>
> However, I'm still puzzled: I have 5 virtual networks defined with IPv6;
> 5x3x2=30 s (one DAD for each LLA and one for each static IPv6 address).
> That would explain that 20 s are too short.
>
> Why doesn't libvirtd test IPv6 DAD on all interfaces in parallel, instead of
> sequentially which seems to be the case?

laine, thoughts?

It is true that libvirt does the setup for each network sequentially, not in parallel (it would be too chaotic to try to do them in parallel - think about adding the iptables rules), but the DAD timeout is per bridge, not the combined time for all bridges.

I've found that setting a longer forwarding delay for a network (in the "bridge" element, usually set to 0) requires a longer timeout for DAD, but I couldn't determine an equation to describe it, and since almost nobody uses a non-0 forwarding delay (iow, probably only me in a test setup), I didn't take the time to figure it out. My recollection is that DAD on a single bridge took around 7 seconds if the forwarding delay was 0, and didn't vary.

The stupid part of all this is that this DAD happens when the bridge is first created, before there is anything at all attached to it, so there shouldn't be *any* addresses at all, much less duplicates. It would be better if we could disable DAD during the startup of the network. (Or does DAD look for duplicate addresses on *all* interfaces of the machine? If so, maybe that's what's causing the increase in required time here.)

I'd rather avoid adding a configuration knob for the timeout if at all possible - once something like that is in, we have to keep it even if we later determine it wasn't necessary. I'll try setting up a large number of networks with IPv6 and see if the time required for DAD changes as the network count increases.

Any progress on that front as of 2.1.0?

(In reply to jean-christophe manciot from comment #27)
> Any progress on that front as of 2.1.0?

None that I've seen. I found a while back that the one thing that changed the DAD timing was setting an STP forwarding delay - the amount of time required for DAD to complete is some function of that value (but it isn't linear - the time for DAD increases much more quickly than the forward delay). Do your networks have a non-0 delay set? If so, can you try setting those to 0? Unless you have a very unusual network setup, a 0 forward delay should cause no problems.

STP & DAD are 2 different things, although it is true that DAD can depend on STP timers. STP is used to prevent loops in a Layer 2 network. The default forwarding delay of 15 s (STP) or 2 s (RSTP) is spent in the listening and learning states before a port can reach the forwarding state, at which point DAD (and other network communication) is allowed to start.

Setting the forwarding delay to 0 simulates "portfast/edge" ports, usually facing a server - or a VM/container in our case - which is useful when there is no risk of a loop, to speed up the initial setup of the port by skipping the listening and learning STP port states. If you use the Linux bridge(s) as standalone elements, not connected to upstream ToR switches or to one another, i.e. if there is no risk of loops, you can safely disable STP since it is not useful. Otherwise, setting the forwarding delay to 0 should only be done on ports facing a VM/container, or you risk experiencing "broadcast storms" for instance. Unfortunately, we don't have that level of control on Linux bridges (no portfast command for individual ports), so setting the forwarding delay to 0 means setting all ports as portfast, which is "dangerous" if the bridge is connected to other bridges in loops.

All my virtual networks have been created with virt-manager, which does not offer the possibility to enable/disable STP or change the timers. So I left the default STP values, which on Ubuntu are:

```
sudo showstp virbr0
virbr0
 bridge id              8000.000000000000
 designated root        8000.000000000000
 root port                 0    path cost                  0
 max age                  20.00 bridge max age            20.00
 hello time                2.00 bridge hello time          2.00
 forward delay             2.00 bridge forward delay       2.00
 ageing time             300.00
 hello timer               1.31 tcn timer                  0.00
 topology change timer     0.00 gc timer                 129.40
```

The DAD default value should take into account the STP default values, since setting the forwarding delay to 0 is a special case. Anyway, I'll try to manually remove STP on my Linux bridges, recompile libvirt (2.0.0 and not 2.1.0, which has an issue, cf. https://bugzilla.redhat.com/show_bug.cgi?id=1365607) with the default VIR_DAD_WAIT_TIMEOUT, and see what happens.
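For reference, the sysctl sum quoted earlier in this thread is easy to check by hand. This sketch hard-codes the values reported on this system rather than reading live sysctls (on a real host you would read each with `sysctl -n <name>`), so it is just the arithmetic of the quoted formula; the exact total on a given machine depends on which per-interface sysctls actually apply.

```shell
#!/bin/sh
# Expected DAD window per the formula quoted earlier in this thread:
#   router_solicitation_delay + dad_transmits * retrans_time_ms
# Values below are the ones reported on this system; on a live host
# read them with e.g.:  sysctl -n net.ipv6.conf.default.dad_transmits
sol_delay=1        # net.ipv6.conf.all.router_solicitation_delay (s)
dad_transmits=1    # net.ipv6.conf.default.dad_transmits
retrans_ms=1000    # net.ipv6.neigh.default.retrans_time_ms (ms)
echo "$(( sol_delay + dad_transmits * retrans_ms / 1000 )) s"
```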
Trying to rebuild libvirt made me realize that something has changed:
- VIR_DAD_WAIT_TIMEOUT is not defined in src/util/virnetdev.c anymore, but in src/util/virnetdevip.c;
- my build script was expecting it in the original C file to change its value (with sed) to 60 s, which means my first 2.0.0 build was made with the default 20 s value without me being aware of it, and with STP on all bridges;
- since I have not experienced the symptoms described in this thread, something **has been changed in the code** that solves this issue, although it does not clearly appear in the release notes.

As a conclusion, as far as I'm concerned, this issue is closed. However, there is no way for me to close it with "FIXED" since that choice does not appear in the list below.
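The build-script tweak described in this thread amounts to one sed substitution before compiling. The sketch below demonstrates it on a stand-in temp file rather than a real libvirt tree; in a real checkout the define lives in src/util/virnetdevip.c (src/util/virnetdev.c before the 2.0.0 reorganization noted above).

```shell
#!/bin/sh
# Bump VIR_DAD_WAIT_TIMEOUT from 20 to 60 seconds before building,
# demonstrated on a stand-in file instead of a real source tree.
f=$(mktemp)
printf '#define VIR_DAD_WAIT_TIMEOUT 20 /* seconds */\n' > "$f"
sed -i 's/\(VIR_DAD_WAIT_TIMEOUT\) 20/\1 60/' "$f"
cat "$f"        # -> #define VIR_DAD_WAIT_TIMEOUT 60 /* seconds */
rm -f "$f"
```

In a real tree, point the sed at src/util/virnetdevip.c and then recompile and install as usual; as this thread shows, the define's location has moved between releases, so a build script should verify the substitution actually matched (e.g. with a follow-up `grep`).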
> does not offer the possibility to enable/disable STP or change the timers.
You can do that with "virsh net-edit $networkname" - just modify the settings in the <bridge> element, save the file, then destroy and restart the network (don't do it while any guests are connected to the network, as it destroys the guests' tap device connections to the bridge). The settings in the <bridge> element default to stp='on' delay='0'/>, and the kernel does allow setting the forward delay to 0 *before STP has been turned on*. But when stp is enabled, forward delay is clamped to a minimum value of 2 seconds, so that's why you see a 2.
It's good that you brought this up, because although I had gone through all the investigation to figure this out a few years ago, I had forgotten, and this may explain why I couldn't see a linear relationship between the STP setting and the time it takes for DAD to complete (on a simplistic level, you'd think it would take [some base time] + the STP delay). Taking the minimum 2-second forward delay into account may make it easier to compute a proper DAD wait timeout value.

(This hasn't been urgent, since it only affects libvirt-created bridges, and almost nobody plays with the default STP settings - STP is essentially pointless when there is no L2 connection to any other network, and that's almost always true for libvirt virtual networks' bridges.)
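The net-edit change described above boils down to flipping one attribute in the network's <bridge> element. The sketch below shows the substitution itself on the element as a string; the surrounding virsh steps (which need a live libvirtd and root, with no guests attached) are left as comments.

```shell
#!/bin/sh
# Turn STP off in a libvirt network's <bridge> element.
# Real workflow (root, no guests attached to the network):
#   virsh net-dumpxml virtual-router > net.xml
#   <apply the sed below to net.xml>
#   virsh net-destroy virtual-router
#   virsh net-define net.xml
#   virsh net-start virtual-router
# Demonstrated here on the element itself:
echo "<bridge name='virbr2' stp='on' delay='0'/>" \
  | sed "s/stp='on'/stp='off'/"
```

`virsh net-edit` does the dump/edit/define steps in one interactive command; the explicit pipeline above is just the non-interactive equivalent.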