Description of problem:

A remote server running Rawhide boots and runs for exactly an hour, but then loses network connectivity. In the logs at the same time I see messages like this:

Jan 20 12:58:26 trick avahi-daemon[878]: Withdrawing address record for 192.168.0.138 on br0.
Jan 20 12:58:26 trick avahi-daemon[878]: Leaving mDNS multicast group on interface br0.IPv4 with address 192.168.0.138.
Jan 20 12:58:26 trick avahi-daemon[878]: Interface br0.IPv4 no longer relevant for mDNS.

Version-Release number of selected component (if applicable):
avahi-0.6.31-30.fc22.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Nothing - just run the machine for about an hour.

Additional info:

This machine uses the libvirt bridged networking configuration outlined here:
http://wiki.libvirt.org/page/Networking#Bridged_networking_.28aka_.22shared_physical_device.22.29

NetworkManager is not running.

# cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
ONBOOT=yes
DELAY=0
NM_CONTROLLED=no

# cat /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=eth0
HWADDR=c8:60:00:0a:8d:02
ONBOOT=yes
BRIDGE=br0
NM_CONTROLLED=no
Ideally I'd like to tell avahi never to touch br0 ...
In the meantime you can set the "deny-interfaces=br0" configuration option in /etc/avahi/avahi-daemon.conf.
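For reference, a minimal sketch of where that line goes (the option is spelled deny-interfaces in avahi-daemon.conf(5), and lives in the [server] section):

```ini
# /etc/avahi/avahi-daemon.conf
[server]
deny-interfaces=br0
```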
(In reply to Michal Sekletar from comment #2)
> In the meantime you can set the "deny-interfaces=br0" configuration option
> in /etc/avahi/avahi-daemon.conf.

Thanks - I have now done that.

Since yesterday I also added systemd masks for all avahi-related services and sockets, and (of course) rebooted because the network had gone. This morning I found that the network disappeared again, exactly one hour after yesterday's reboot. However this time there was no accompanying avahi log message, nor any other obvious log message. So I suspect it's not avahi, it's something else -- which doesn't log -- but I have no idea what.
Since the IPv6 address of the machine is unaffected, I'm able to log in using IPv6. When the bug happens -- exactly one hour after boot, or one hour after the network service is started -- I am able to manually "fix" it by doing:

systemctl restart network

How can I know what modifies the IPv4 configuration of the network?
(In reply to Richard W.M. Jones from comment #4)
> Since the IPv6 address of the machine is unaffected, I'm able
> to log in using IPv6. When the bug happens -- exactly one hour
> after boot or one hour after the network service is started --
> I am able to manually "fix" it by doing:

Can you please put avahi-daemon into debug mode? Add --debug to the daemon command line in the service file. Once IPv4 network connectivity is borked, please log in via IPv6 and collect the avahi logs with "journalctl -b -u avahi-daemon.service > avahi-debug.log".

> systemctl restart network

Are you really using network initscripts on Fedora? If so, the above is fine; however the default is NetworkManager, in which case you should do "systemctl restart NetworkManager".

> How can I know what modifies the IPv4 configuration of the network?

You could probably load some systemtap probes, but I don't know at which probe points they should be placed. I am not familiar with the kernel's core networking code.
Yes, I'm really using the network scripts. NetworkManager cannot do the bridged networking configuration that we require for virtualization. (But in any case NM is not running, and is disabled, so I can't see how NM could be doing this.)

I'll re-enable avahi and try the suggested avahi debugging.
Created attachment 982962 [details] avahi.log Not much to see, but here is the log as requested.
Avahi /reacts/ to network configuration changes; it doesn't cause them itself. I doubt very much avahi has anything to do with this bug.

What do your DHCP client logs show? What does your DHCP lease look like? Is dhclient still running when this happens?

grep DHC /var/log/messages
ps auxw | grep dhclient

To find the lease file, look at the dhclient invocation:

/sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-wlp2s0.pid -lf /var/lib/NetworkManager/dhclient-e70fe341-9cf9-44a5-a016-91da24c593e9-wlp2s0.lease -cf /var/lib/NetworkManager/dhclient-wlp2s0.conf wlp2s0

-lf specifies the lease file.
I agree it's almost certainly not avahi. It's some other component, and avahi is just giving us the bad news. See comment 3.

I have just restarted the network, and:

Jan 26 09:45:48 trick dhclient[5126]: DHCPDISCOVER on br0 to 255.255.255.255 port 67 interval 4 (xid=0x3c68a171)
Jan 26 09:45:48 trick dhclient[5126]: DHCPREQUEST on br0 to 255.255.255.255 port 67 (xid=0x3c68a171)
Jan 26 09:45:48 trick dhclient[5126]: DHCPOFFER from 192.168.0.254
Jan 26 09:45:48 trick dhclient[5126]: DHCPACK from 192.168.0.254 (xid=0x3c68a171)
Jan 26 09:45:51 trick NET[5178]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Jan 26 09:45:51 trick dhclient[5126]: bound to 192.168.0.138 -- renewal in 1577 seconds.

I'm not using NetworkManager. However the network scripts are running dhclient:

$ ps ax | grep dhcl
 1518 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
 5190 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0

$ cat /var/lib/dhclient/dhclient--br0.lease
default-duid "\000\001\000\001\033\266\261/\310`\000\012\215\002";
lease {
  interface "br0";
  fixed-address 192.168.0.138;
  filename "pxelinux.0";
  server-name "tock";
  option subnet-mask 255.255.255.0;
  option dhcp-lease-time 3600;
  option routers 192.168.0.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 192.168.0.254;
  option domain-name-servers 192.168.0.254;
  option dhcp-renewal-time 1800;
  option dhcp-rebinding-time 3150;
  option broadcast-address 192.168.0.255;
  option host-name "trick";
  option domain-name "home.annexia.org";
  renew 5 2015/01/23 18:31:06;
  rebind 5 2015/01/23 18:56:14;
  expire 5 2015/01/23 19:03:44;
}
lease {
  interface "br0";
  fixed-address 192.168.0.138;
  filename "pxelinux.0";
  server-name "tock";
  option subnet-mask 255.255.255.0;
  option routers 192.168.0.1;
  option dhcp-lease-time 3600;
  option dhcp-message-type 5;
  option domain-name-servers 192.168.0.254;
  option dhcp-server-identifier 192.168.0.254;
  option dhcp-renewal-time 1800;
  option broadcast-address 192.168.0.255;
  option dhcp-rebinding-time 3150;
  option host-name "trick";
  option domain-name "home.annexia.org";
  renew 1 2015/01/26 10:12:08;
  rebind 1 2015/01/26 10:38:18;
  expire 1 2015/01/26 10:45:48;
}

I now need to wait ~1 hour for the network to drop again, to see if dhclient / the lease file goes away.
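As a sanity check on the lease above: its renewal and rebinding times match the RFC 2131 defaults (T1 = 0.5 × lease time, T2 = 0.875 × lease time), so the server isn't handing out anything unusual. A quick illustration, using the values from the lease file:

```python
# DHCP timer defaults per RFC 2131 section 4.4.5:
#   T1 (start renewing)  = 0.5   * lease duration
#   T2 (start rebinding) = 0.875 * lease duration
lease_time = 3600  # option dhcp-lease-time from the lease above

t1 = int(0.5 * lease_time)
t2 = int(0.875 * lease_time)

print(t1, t2)  # 1800 3150 -- matches dhcp-renewal-time / dhcp-rebinding-time
```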
One hour later...

The IPv4 address has gone away; the IPv6 address is still there. dhclient is still running, exactly as before (and with the same PIDs, so it didn't crash). The lease file still exists, with exactly the same content as above (even the renew/expire dates are the same -- which is strange, actually). There are no new dhclient-related messages in /var/log/messages.

I believe that dhclient *should* be renewing the lease, but it's not doing that.
Okay, the other thing to check... did the clock change between when the lease was acquired and one hour later? I believe dhclient uses wall-clock time rather than interval timers to track lease expiration. If, for example, your clock was updated into the past, or far into the future, dhclient may have missed the renewal.
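A minimal sketch of the failure mode being suggested here (a generic illustration of the wall-clock-scheduling pitfall, not dhclient's actual code): a renewal deadline stored as an absolute wall-clock timestamp is silently mis-scheduled if the system clock jumps, whereas elapsed-time (monotonic) scheduling would be immune. The timestamps are simulated:

```python
# Illustration only: why scheduling renewals against the wall clock is fragile.
lease_acquired_at = 1_000_000                 # simulated wall-clock time at DHCPACK
renewal_deadline = lease_acquired_at + 1800   # T1: renew after 30 minutes

# Normal case: wall clock advances 1800s, renewal is due on time.
now = lease_acquired_at + 1800
assert now >= renewal_deadline

# Clock-jump case: the clock is stepped BACK two hours (bad RTC, NTP step, ...).
# The daemon now believes the deadline is far in the future and sleeps right
# past the real (elapsed-time) expiry of the lease.
now = lease_acquired_at + 1800 - 7200
seconds_until_renewal = renewal_deadline - now
print(seconds_until_renewal)  # 7200 -- two hours late; the lease expires meanwhile
```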
I have no particular suspicions about the hardware clock. Right at this moment the hardware clock has the correct time. There are no odd time jumps in /var/log/messages. NTP is not running. It's standard PC hardware with a battery-backed clock. As ntp was not installed, I have now installed and started it.
(In reply to Richard W.M. Jones from comment #9)
> I'm not using NetworkManager. However the network scripts are running
> dhclient:
>
> $ ps ax | grep dhcl
>  1518 ?  Ss  0:00 /sbin/dhclient -H trick -1 -q -lf
>    /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
>  5190 ?  Ss  0:00 /sbin/dhclient -H trick -1 -q -lf
>    /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0

Any idea how to explain the two dhclient instances running with the same pid file? This shouldn't be possible, AFAIK.

(In reply to Richard W.M. Jones from comment #10)
> I believe that dhclient *should* be renewing the lease, but
> it's not doing that.

Yes, it looks like that, but I'm not sure how to debug it.
Sorry to butt in, but I'm wondering why an IPv6 address is being applied to the underlying em1 (eth0?) physical device of bridge br0, based on the avahi log posted above, which mentions eth0 and IPv6 on its last line. I was under the impression that any addressing should be done on the bridge only -- am I mistaken? Perhaps this IPv6 address on the bridge's physical device is preventing the bridge from renewing its lease.

You said that you are able to use IPv6 to access the box while IPv4 is unavailable; which address on which interface does that correspond to? I'm curious what 'ip addr' and 'ip route' show after a network restart and after a hang.
I am getting a similar problem on the Fedora 21 LXDE spin running in an ESXi VM. There is no bridge, just a single interface, which comes up on boot acquiring an IPv4 address and gateway via DHCP (the default route is set accordingly). After about 1 hour the interface loses its IPv4 address (the non-DHCP IPv6 address is still set) and the default route is missing from the routing table. This worked fine on Fedora 19.
(In reply to Jerry Vonau from comment #14)
> Sorry to butt in, I'm wondering why there is a ipv6 address being applied to
> the underlying em1(eth0?) physical device of bridge br0 [...] I was under
> the impression that any addressing should be done as part of the bridge
> only, am I mistaken?

The why is because radvd is running on the network, so an IPv6 address is assigned to anything attached to the network. I've no idea whether it's correct to assign it to eth0 rather than to the bridge.

> Perhaps this ipv6 address on the physical device of the
> bridge is not allowing the bridge to renew its lease.

I disabled radvd, but the problem still happens (except more inconveniently, because I'm now not able to ssh to the IPv6 address and recover the box remotely).
Bug 1188230 was filed which is superficially similar. However on my machine I do not have open-vm-tools installed.
Re comment #15 my problem seems to be related to the presence of the open-vm-tools daemon vmtoolsd. If it is disabled the problem goes away.
David, I don't think the two are related. Let's investigate open-vm-tools aspect in bug 1188230.
(In reply to Ravindra Kumar from comment #19)
> David, I don't think the two are related. Let's investigate open-vm-tools
> aspect in bug 1188230.

One clarification: I meant to say the two bugs could be related, but I don't think open-vm-tools could be causing either. AFAIK we don't have any 30 min or 1 hr timer in open-vm-tools. And this bug is seen on physical hardware, where the open-vm-tools service exits early.
I used systemtap (thanks Jamie Bainbridge) to determine what process is removing the network address. The kernel stack trace when the network address is removed is:

0xffffffff817b1960 __inet_del_ifa
0xffffffff817b31db check_lifetime
0xffffffff810ce005 process_one_work
0xffffffff810ce69b worker_thread
0xffffffff810d4b5f __init_kthread_worker
0xffffffff81877bfc ret_from_fork

From what I can tell, the check_lifetime function is the key here, involved in the 'lft' expiry. FWIW when the network is initialized, the IPv4 address has a lifetime of 3596 seconds, while the IPv6 address is valid "forever":

99: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether c8:60:00:0a:8d:02 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.138/24 brd 192.168.0.255 scope global dynamic br0
       valid_lft 3596sec preferred_lft 3596sec
    inet6 2001:8b0:9ad:8728:ca60:ff:fe0a:8d02/64 scope global mngtmpaddr dynamic
       valid_lft 86399sec preferred_lft 14399sec
    inet6 fe80::ca60:ff:fe0a:8d02/64 scope link
       valid_lft forever preferred_lft forever

For some reason dhclient isn't renewing the lease before the 'lft' runs out.
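To make the mechanism concrete (a simplified model, not the kernel code): an address installed with a finite valid_lft is deleted by the kernel's periodic lifetime check once that lifetime elapses, unless the DHCP client re-installs it (resetting valid_lft) at each successful renewal. In this bug the renewal never happens, so the countdown simply runs out:

```python
# Simplified model (assumption): the kernel's check_lifetime deletes an IPv4
# address whose valid_lft has elapsed; each successful dhclient renewal
# resets valid_lft, keeping the address alive indefinitely.

def address_alive(valid_lft, renewal_times, at_time):
    """Is the address still present at 'at_time' (seconds since it was added),
    given renewals that each reset the lifetime to 'valid_lft'?"""
    expires = valid_lft
    for t in sorted(renewal_times):
        if t > expires:
            break                 # renewal came too late; address already gone
        expires = t + valid_lft   # renewal resets the countdown
    return at_time < expires

# Healthy client: renews at T1 (1800s), so the address survives past one hour.
print(address_alive(3596, [1800], 3700))  # True

# This bug: dhclient never renews, so the address vanishes at ~3596s.
print(address_alive(3596, [], 3700))      # False
```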
Another data point. The machine has been up for 18 hours and I've manually restarted the network several times (because otherwise it loses IPv4 connectivity). There are now lots of copies of dhclient running...

$ uptime
 10:29:10 up 18:05,  2 users,  load average: 0.00, 0.36, 1.17
$ ps ax | grep dhcl
 1522 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
10953 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
11429 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
12439 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
14380 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
17065 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
23231 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
23935 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
31069 ?        Ss     0:00 /sbin/dhclient -H trick -1 -q -lf /var/lib/dhclient/dhclient--br0.lease -pf /var/run/dhclient-br0.pid br0
May be related to bug 1093803.
(In reply to Richard W.M. Jones from comment #21)
> FWIW when the network is initialized, the IPv4 address has a
> life time of 3596 seconds, while the IPv6 address is valid "forever":
>
> 99: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
>     inet 192.168.0.138/24 brd 192.168.0.255 scope global dynamic br0
>        valid_lft 3596sec preferred_lft 3596sec

obtained via dhclient

>     inet6 2001:8b0:9ad:8728:ca60:ff:fe0a:8d02/64 scope global mngtmpaddr dynamic
>        valid_lft 86399sec preferred_lft 14399sec

(most likely) obtained via SLAAC (from Router Advertisements), see [1]

>     inet6 fe80::ca60:ff:fe0a:8d02/64 scope link
>        valid_lft forever preferred_lft forever

link-local address, see also [1] and [2]

[1] https://en.wikipedia.org/wiki/IPv6_address#Stateless_address_autoconfiguration
[2] https://en.wikipedia.org/wiki/IPv6_address#Address_lifetime

(In reply to Richard W.M. Jones from comment #22)
> There are now lots of copies of dhclient running ...

Might be an initscripts problem?

(In reply to Jiri Jaburek from comment #23)
> May be related to bug 1093803.

Maybe.
For those who didn't read the earlier comments, the machine is an ordinary PC with a stable, battery-backed time source. There is no time strangeness observed.
*** Bug 1193944 has been marked as a duplicate of this bug. ***
*** Bug 1193988 has been marked as a duplicate of this bug. ***
(In reply to Richard W.M. Jones from comment #10)
> I believe that dhclient *should* be renewing the lease, but
> it's not doing that.

Yes, it seems to stop working after it goes to running in the background [1].

(In reply to Richard W.M. Jones from comment #22)
> I've
> manually restarted the network several times (because otherwise it
> loses IPv4 connectivity). There are now lots of copies of dhclient
> running ...

That's because the dhclient process can't be stopped; one has to 'kill -9' it [1].

[1] bug #1193944.
Moving to bind. Tomas promised to take a look at the possibility of building separate libraries with '--disable-epoll --disable-threads' for dhcp.

https://lists.isc.org/pipermail/bind-users/2015-February/094640.html

If that fails, the only long-term solution I have is to ask FPC for an exception to build dhcp against its bundled bind (dhcp-4.3.2 bundles bind-9.9.7, while we already ship bind-9.10.2 in F22). Both bind and dhcp are from ISC, and they release new dhcp versions whenever there's a related bind vulnerability in the bundled bind.
Commenting here, so as not to pollute bug #1181562.

(In reply to Tomas Hozza from bug #1181562, comment #3)
> 1. ... Downside is that although this makes
> DHCP server and client work in the background, the only way to stop them is
> by 'kill -9'.

I've been testing what problems this might cause:

'systemctl restart NetworkManager' seems to 'kill -9' dhclient if it can't stop it.

'systemctl restart network' also works OK, but it leaves the unstoppable dhclient instances behind - I'm not sure whether that might cause any big trouble.

'systemctl stop/restart dhcpd' doesn't work; one has to use 'systemctl kill --signal=SIGKILL dhcpd' instead. During shutdown systemd kills the dhcpd service after a minute when it's not able to stop it.

So even though I don't like it (and have no clue how to fix it so far), it doesn't look like a blocker to me.
I still use ifup/ifdown for my network, and have worked around the bug with a script that checks whether it can ping the gateway: if that works, it sleeps; otherwise it does 'kill -9' on dhclient and ifdown/ifup on the interface. I haven't noticed any downsides to the workaround other than the hiccups between when the lease is dropped and when the script picks it back up again.
I spent almost two days creating the special version of the bind libraries. However, I was not able to prepare a version that would work with DHCP; it always exited on startup with an assertion. Also, the patch was very hackish. We will proceed with discussing the possibilities with FESCo and FPC.
This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle. Changing version to '22'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22
I hope I'm not hijacking, but I think this is relevant information... I have experienced something similar in the F22 cloud TCs leading up to Alpha. If I leave the systems up for a day or so they lose their IPs and I can't get in to them. These systems are in an OpenStack environment that works fine for Fedora 20/21.

If I get in (via VNC, using the password I had set for the root account) I am able to check out the dhclient process and run strace on it:

[root@t8 ~]# ps -ef | \grep dhclient
root       495     1  0 Mar05 ?        00:00:00 /sbin/dhclient -H localhost -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
root      6909  6895  0 01:03 pts/0    00:00:00 grep dhclient
[root@t8 ~]#
[root@t9 ~]# strace -p 495
Process 495 attached
futex(0x7f0fb8a750a4, FUTEX_WAIT_PRIVATE, 5, NULL
^CProcess 495 detached
 <detached ...>

It's waiting on a futex, but there are no threads and no children. Here is where it is if you attach to it with gdb:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f0fb7a918c0 in run () from /lib64/libisc.so.148
#2  0x00007f0fb6e1c555 in start_thread (arg=0x7f0fb1eec700) at pthread_create.c:333
#3  0x00007f0fb7337f3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

And we can see dhclient never really renewed a lease ever since it first started up:

[root@t8 ~]# journalctl /usr/sbin/dhclient --no-pager
-- Logs begin at Thu 2015-03-05 03:57:59 UTC, end at Sat 2015-03-07 01:01:01 UTC. --
Mar 05 03:58:09 localhost.localdomain dhclient[435]: Created duid \000\001\000\001\034\212\220\321\372\026>\371\221\222.
Mar 05 03:58:09 localhost.localdomain dhclient[435]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0xa55de33e)
Mar 05 03:58:09 localhost.localdomain dhclient[435]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xa55de33e)
Mar 05 03:58:09 localhost.localdomain dhclient[435]: DHCPOFFER from 10.0.0.3
Mar 05 03:58:09 localhost.localdomain dhclient[435]: DHCPACK from 10.0.0.3 (xid=0xa55de33e)
Mar 05 03:58:12 host-10-0-0-73 dhclient[435]: bound to 10.0.0.73 -- renewal in 34408 seconds.

Some versions of relevant software:

[root@t8 ~]# rpm -q dhcp-client
dhcp-client-4.3.2-0.4b1.fc22.x86_64
[root@t8 ~]# rpm -qa | grep bind
bind-license-9.10.2-0.1.rc1.fc22.noarch
bind-libs-9.10.2-0.1.rc1.fc22.x86_64
bind-libs-lite-9.10.2-0.1.rc1.fc22.x86_64

Please let me know what else I can do to help narrow this down.
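The strace/gdb output above is consistent with a classic fork-vs-threads pitfall (my reading of it, not a confirmed diagnosis): dhclient links against a threaded libisc and then forks into the background, but fork() duplicates only the calling thread, so the libisc worker thread that would signal the condition variable no longer exists in the daemonized child, which then waits forever. A minimal, self-contained illustration of fork() not duplicating other threads (hypothetical analogue, not dhclient's code; Linux-only):

```python
import os
import threading
import time

# Start a helper thread, the way a threaded library might before the
# process daemonizes itself with fork().
threads_before = threading.active_count()
threading.Thread(target=lambda: time.sleep(60), daemon=True).start()
assert threading.active_count() == threads_before + 1  # helper alive in parent

pid = os.fork()
if pid == 0:
    # Child: fork() copied only the calling thread, so the helper is gone.
    # Report the child's thread count via the exit status.
    os._exit(threading.active_count())
else:
    _, status = os.waitpid(pid, 0)
    threads_in_child = os.WEXITSTATUS(status)
    print("threads in child after fork:", threads_in_child)  # 1
```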
Hey Rich/Others, can you confirm this still happens for you with the latest F22 composes/Alpha? If it does, then we should probably promote this as some sort of blocker.
I just tested this over the weekend and see the same thing on the F22 Alpha image. Tomorrow I'll try to get an exact time on when it fails.
Hi all. Yes, this should still happen on F22+ (including Alpha). It happens if you use the network initscript and not NM. We have been actively working on this issue. Once the package review for bind99 is finished, I'll rebuild DHCP with it and push an update; this will happen this week. Feel free to propose this issue as a blocker/exception.
Proposed as a Blocker for 22-beta by Fedora user dustymabe using the blocker tracking app because: I can't cite a specific release criterion that is violated by this bug, but there are several that are violated if networking ceases to exist (especially in a cloud image, where that is your only means of contact). This happens in the cloud image and other places where NM is not used, but network scripts are.
Discussed at 2015-03-09 blocker review meeting: http://meetbot.fedoraproject.org/fedora-blocker-review/2015-03-09/f22-blocker-review.2015-03-09-16.05.log.txt . Accepted as a blocker as a conditional violation of the Alpha criterion "The installed system must be able to download and install updates with the default console package manager.", the condition being use of the default Cloud network configuration (or any other use of network.service, which is still fairly common).
bind99-9.9.7-3.fc22, bind-9.10.2-1.fc22, dhcp-4.3.2-2.fc22, dnsperf-2.0.0.0-15.fc22, bind-dyndb-ldap-7.0-4.fc22 has been submitted as an update for Fedora 22. https://admin.fedoraproject.org/updates/FEDORA-2015-3717/bind99-9.9.7-3.fc22,bind-9.10.2-1.fc22,dhcp-4.3.2-2.fc22,dnsperf-2.0.0.0-15.fc22,bind-dyndb-ldap-7.0-4.fc22
bind99-9.9.7-3.fc22, bind-9.10.2-1.fc22, dhcp-4.3.2-2.fc22, dnsperf-2.0.0.0-15.fc22, bind-dyndb-ldap-7.0-4.fc22 has been pushed to the Fedora 22 testing repository.
bind99-9.9.7-3.fc22, bind-9.10.2-1.fc22, dhcp-4.3.2-2.fc22, dnsperf-2.0.0.0-15.fc22, bind-dyndb-ldap-7.0-4.fc22 has been pushed to the Fedora 22 stable repository.
I have tested this package and it does appear to work, thanks all.