Bug 1187244 - [RHEL 7.0 + 7.1] Host configure with DHCP is losing connectivity after some time - dhclient is not running
Summary: [RHEL 7.0 + 7.1] Host configure with DHCP is losing connectivity after some ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Dan Kenigsberg
QA Contact: Michael Burman
URL:
Whiteboard:
: 1189837 (view as bug list)
Depends On:
Blocks: 1035038 1082754 rhev35gablocker oVirt_3.5.2_tracker 1196220 1197441 1204782
TreeView+ depends on / blocked
 
Reported: 2015-01-29 15:39 UTC by Gil Klein
Modified: 2016-03-09 19:29 UTC (History)
27 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1204782 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:29:45 UTC
oVirt Team: Network
ylavi: Triaged+


Attachments (Terms of Use)
strace output from dhclient (1.18 MB, text/plain)
2015-01-29 15:39 UTC, Gil Klein
no flags Details
journalctl output (puma20.scl.lab.tlv.redhat.com) (6.47 MB, text/x-vhdl)
2015-02-01 13:16 UTC, Gil Klein
no flags Details
journalctl output (puma20.scl.lab.tlv.redhat.com) 3/Feb (175.08 KB, text/plain)
2015-02-03 16:07 UTC, Gil Klein
no flags Details
strace output (puma20.scl.lab.tlv.redhat.com) 3/Feb (3.21 MB, text/plain)
2015-02-03 16:08 UTC, Gil Klein
no flags Details
pcap output (puma20.scl.lab.tlv.redhat.com) 3/Feb (1.27 MB, application/octet-stream)
2015-02-03 16:09 UTC, Gil Klein
no flags Details
dmesg output (puma28.scl.lab.tlv.redhat.com) RHEL 7.0 (262.44 KB, text/plain)
2015-02-08 11:24 UTC, Gil Klein
no flags Details
journalctl output (puma28.scl.lab.tlv.redhat.com) RHEL 7.0 (6.75 MB, application/x-gzip)
2015-02-08 11:58 UTC, Gil Klein
no flags Details
strace output of a relevant dhclient (713 bytes, text/plain)
2015-02-11 11:10 UTC, Ido Barkan
no flags Details
alma07 logs (auditd/journalctl/dmesg) (626.64 KB, application/x-gzip)
2015-02-16 11:23 UTC, Gil Klein
no flags Details
server, engine and vdsm logs (2.57 MB, application/x-gzip)
2015-03-05 12:41 UTC, Kevin Alon Goldblatt
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 29441 master MERGED split network restoration from vdsmd.service Never
oVirt gerrit 39137 master MERGED Prevent systemd to kill dhclient once vdsm-network service is stopped Never

Description Gil Klein 2015-01-29 15:39:07 UTC
Created attachment 985664 [details]
strace output from dhclient

Description of problem:

Host that is configured which DHCP is loosing it's IP and connectivity after ~1 day.


Version-Release number of selected component (if applicable):
[root@puma20 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.1 Beta (Maipo)
[root@puma20 ~]# uname -a
Linux puma20.scl.lab.tlv.redhat.com 3.10.0-223.el7.x86_64 #1 SMP Tue Jan 13 10:49:13 EST 2015 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
100% on specific hosts


Steps to Reproduce:
1. Configure the host to use DHCP
2. Use the host for ~1 day


Actual results:
Host will lose it's IP, dhclient won't be running and host will lose its connectivity to the network. 


Expected results:
Host should not be losing it's IP address 


Additional info:
The interface that loses it's ip is part of a bridge.
Network manager is inactive.

systemctl status NetworkManager
NetworkManager.service - Network Manager
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; disabled)
   Active: inactive (dead)

# rpm -qa |grep dhc
dhcp-4.2.5-35.el7.x86_64
dhcp-debuginfo-4.2.5-35.el7.x86_64
dhclient-4.2.5-35.el7.x86_64
dhcp-devel-4.2.5-35.el7.x86_64
dhcp-libs-4.2.5-35.el7.x86_64
dhcp-common-4.2.5-35.el7.x86_64

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master rhevm state UP qlen 1000
    link/ether 00:9c:02:b0:df:b0 brd ff:ff:ff:ff:ff:ff
    inet 10.35.160.49/24 brd 10.35.160.255 scope global dynamic enp4s0f0
       valid_lft 28273sec preferred_lft 28273sec
    inet6 fe80::29c:2ff:feb0:dfb0/64 scope link 
       valid_lft forever preferred_lft forever
3: enp5s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 44:1e:a1:73:39:92 brd ff:ff:ff:ff:ff:ff
    inet 10.35.162.32/24 brd 10.35.162.255 scope global dynamic enp5s0f0
       valid_lft 26136sec preferred_lft 26136sec
    inet6 fe80::461e:a1ff:fe73:3992/64 scope link 
       valid_lft forever preferred_lft forever
4: enp4s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:9c:02:b0:df:b4 brd ff:ff:ff:ff:ff:ff
    inet 10.35.163.98/24 brd 10.35.163.255 scope global dynamic enp4s0f1
       valid_lft 42912sec preferred_lft 42912sec
    inet6 fe80::29c:2ff:feb0:dfb4/64 scope link 
       valid_lft forever preferred_lft forever
5: enp5s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether 44:1e:a1:73:39:93 brd ff:ff:ff:ff:ff:ff
6: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 42:87:eb:f0:40:44 brd ff:ff:ff:ff:ff:ff
8: ;vdsmdummy;: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 02:bf:b8:c7:e1:7e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::bf:b8ff:fec7:e17e/64 scope link 
       valid_lft forever preferred_lft forever
9: rhevm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 00:9c:02:b0:df:b0 brd ff:ff:ff:ff:ff:ff
    inet 10.35.160.49/24 brd 10.35.160.255 scope global dynamic rhevm
       valid_lft 25776sec preferred_lft 25776sec
    inet6 fe80::29c:2ff:feb0:dfb0/64 scope link 
       valid_lft forever preferred_lft forever

# brctl show
bridge name	bridge id		STP enabled	interfaces
;vdsmdummy;		8000.000000000000	no		
rhevm		8000.009c02b0dfb0	no		enp4s0f0


Please see the attached strace output for dhclient that shows a SIGTERM signal was sent to the process.

--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=9759, si_uid=0}
+++ killed by SIGTERM +++

Comment 1 Jiri Popelka 2015-01-29 17:06:12 UTC
(In reply to Gil Klein from comment #0)
> Network manager is inactive.
> ...
> --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=9759, si_uid=0}
> +++ killed by SIGTERM +++

To know what exactly stopped dhclient we'd need to see what is/was the process with PID 9759. For now my only idea is initscripts (since NM is not running), for example ifdown-eth stops dhclient:
https://git.fedorahosted.org/cgit/initscripts.git/tree/sysconfig/network-scripts/ifdown-eth#n98

Maybe Lukas has some idea -> let's start with reassigning to initscripts

Comment 2 Lukáš Nykrýn 2015-01-30 10:13:38 UTC
I really doubt that this is an initscripts bug. Initscripts are not doing anything on their own so even if it was ifdown who kill dhclient something else  had to call ifdown. Can you post here you journal logs?

Comment 3 Gil Klein 2015-02-01 13:16:42 UTC
Created attachment 986759 [details]
journalctl output (puma20.scl.lab.tlv.redhat.com)

Comment 4 Lukáš Nykrýn 2015-02-01 14:02:57 UTC
Network initscript was not restarted so there must be a different culprit.

Comment 6 Jiri Popelka 2015-02-02 12:31:05 UTC
(In reply to Gil Klein from comment #3)
> journalctl output (puma20.scl.lab.tlv.redhat.com)

I'm getting puzzled from the journalctl output.
It shows something completely different than what Nisim originally added to bug #1116004, comment #54.
According to your journalctl output there's been one dhclient instance (pid 5660) running since Jan 25th. and successfully renewing addresses for
enp4s0f0 (10.35.160.49)
enp4s0f1 (10.35.163.98)
enp5s0f0 (10.35.162.32)
rhevm (10.35.160.49)

What I'm wondering in a first place is why there's only one instance of dhclient running for all network interfaces
and what started that instance, because AFAIK initscripts run one dhclient instance per network interface.
Could you try to look at what's the parent process of this dhclient ?
# ps -o ppid= 5660
# ps -f <ppid>

Comment 7 Gil Klein 2015-02-02 14:01:08 UTC
The server was rebooted do to power outage, but indeed, only one dhclient is now running now:

# ps aux|grep "/sbin/dhclient"
root      6235  0.0  0.0 112640   968 pts/1    S+   15:39   0:00 grep --color=auto /sbin/dhclient
root     29361  0.0  0.0 103736 12720 ?        Ss   14:43   0:00 /sbin/dhclient -H puma20 -1 -q -lf /var/lib/dhclient/dhclient--rhevm.lease -pf /var/run/dhclient-rhevm.pid rhevm

# ps -o ppid=29361
29361
26761
26742
[root@puma20 ~]# ps -f 26761
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root     26761 26742  0 14:25 pts/1    Ss     0:00 -bash
[root@puma20 ~]# ps -f 26742
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
root     26742  1584  0 14:24 ?        Ss     0:03 sshd: root@pts/1

Comment 8 Jiri Popelka 2015-02-02 14:06:48 UTC
(In reply to Gil Klein from comment #7)
> # ps -o ppid=29361

There's a space between = and the pid, i.e.
# ps -o ppid= 29361

Comment 9 Gil Klein 2015-02-02 14:56:31 UTC
Here it is:

# ps -o ppid= 29361
    1

Comment 10 Jiri Popelka 2015-02-03 15:50:59 UTC
(In reply to Gil Klein from comment #0)
> attached strace output for dhclient that shows a SIGTERM signal was sent to the process.


Also 'dhclient -r/-x ${DEVICE}' sends a SIGTERM to the already running dhclient 'on' ${DEVICE}
but I don't think this info moves us any further.

As I stated in comment #6, the journalctl output from comment #3 looks good to me, i.e. the dhclient seems to have been running for couple days. So what next ?
If there's a problem it's most likely not a problem in dhclient itself, but I have no idea where to move from here.

Comment 11 Gil Klein 2015-02-03 16:06:39 UTC
I've tried to capture the SIGTERM while monitoring the process list frequently, but it seems hard to reproduce this exact problem.

The same host is still losing it's IP each and every day, but I'm not sure it is the same issue.

Please see the attached strace and pcap captures.

Comment 12 Gil Klein 2015-02-03 16:07:45 UTC
Created attachment 987668 [details]
journalctl output (puma20.scl.lab.tlv.redhat.com) 3/Feb

Comment 13 Gil Klein 2015-02-03 16:08:37 UTC
Created attachment 987669 [details]
strace output (puma20.scl.lab.tlv.redhat.com) 3/Feb

Comment 14 Gil Klein 2015-02-03 16:09:19 UTC
Created attachment 987671 [details]
pcap output (puma20.scl.lab.tlv.redhat.com) 3/Feb

Comment 15 Jiri Popelka 2015-02-03 17:31:15 UTC
(In reply to Gil Klein from comment #12)
> journalctl output (puma20.scl.lab.tlv.redhat.com) 3/Feb


> 16:12:08 network[1020]: Bringing up interface rhevm:
> 16:12:08 dhclient[1305]: DHCPREQUEST on rhevm to 255.255.255.255 port 67 (xid=0x1bb44b21)
> 16:12:08 dhclient[1305]: DHCPACK from 10.35.160.254 (xid=0x1bb44b21)
> 16:12:10 avahi-daemon[959]: Registering new address record for 10.35.160.49 on rhevm.IPv4.
> 16:12:11 dhclient[1305]: bound to 10.35.160.49 -- renewal in 16302 seconds.
> 16:12:11 network[1020]: Determining IP information for rhevm... done.

here the network service configured rhevm interface

> 16:12:11 systemd[1]: Reached target Network.
> 16:12:13 vdsmd_init_common.sh[990]: vdsm: Running unified_network_persistence_upgrade
> 16:12:13 vdsmd_init_common.sh[990]: vdsm: Running restore_nets
> 16:12:13 avahi-daemon[959]: Withdrawing address record for 10.35.160.49 on rhevm.

for some reason the configured address 10.35.160.49 on rhevm disappeared - maybe something (vdsmd_init_common.sh ?) called 'dhclient -r' ?

> 16:12:13 kernel: rhevm: port 1(enp4s0f0) entered disabled state
> 16:12:15 dhclient[3032]: DHCPREQUEST on rhevm to 255.255.255.255 port 67 (xid=0x128c4693)
> 16:12:15 dhclient[3032]: DHCPACK from 10.35.160.254 (xid=0x128c4693)
> 16:12:17 avahi-daemon[959]: Registering new address record for 10.35.160.49 on rhevm.IPv4.
> 16:12:17 dhclient[3032]: bound to 10.35.160.49 -- renewal in 18213 seconds.

and here new dhclient (pid 3032 instead of previous 1305) configured rhevm interface again

Might the vdsmd_init_common.sh be the one what later stops dhclient ?
What component does this belong to ?

Comment 16 Gil Klein 2015-02-04 07:45:40 UTC
Dan,

Is it possible that vdsmd_init_common.sh stops dhclient?

Comment 17 Dan Kenigsberg 2015-02-04 18:41:34 UTC
(In reply to Gil Klein from comment #16)
> Dan,
> 
> Is it possible that vdsmd_init_common.sh stops dhclient?

yes, when it calls `vdsm-tool restore-nets`. However, it should have started a new dhclient process again. Is that the issue that you see?

This is different from the reports I heard about dhclient dying during a normal server lifetime, with no restarts involved.

Comment 18 Gil Klein 2015-02-05 11:31:00 UTC
(In reply to Dan Kenigsberg from comment #17)
> (In reply to Gil Klein from comment #16)
> > Dan,
> > 
> > Is it possible that vdsmd_init_common.sh stops dhclient?
> 
> yes, when it calls `vdsm-tool restore-nets`. However, it should have started
> a new dhclient process again. Is that the issue that you see?
I'm still trying to reproduce it, but I'm pretty sure something has killed dhcleint, based on the strace output.
> 
> This is different from the reports I heard about dhclient dying during a
> normal server lifetime, with no restarts involved.
I'm not sure this is the same issue we see on RHEVH 7.0, I'm trying to reproduce it too

Comment 20 Gil Klein 2015-02-08 07:53:53 UTC
We were able to reproduce this issue on RHEL 7.0:

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.0 (Maipo)

# uname -a
Linux puma28.scl.lab.tlv.redhat.com 3.10.0-123.20.1.el7.x86_64 #1 SMP Wed Jan 21 09:45:55 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep dhc
dhcp-4.2.5-35.el7.x86_64
dhcp-common-4.2.5-35.el7.x86_64
dhclient-4.2.5-35.el7.x86_64
dhcp-libs-4.2.5-35.el7.x86_64

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp5s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 44:1e:a1:71:a2:fc brd ff:ff:ff:ff:ff:ff
3: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master rhevm state UP qlen 1000
    link/ether 00:9c:02:b0:ff:28 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::29c:2ff:feb0:ff28/64 scope link 
       valid_lft forever preferred_lft forever
4: enp5s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 44:1e:a1:71:a2:fd brd ff:ff:ff:ff:ff:ff
5: enp4s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:9c:02:b0:ff:2c brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether fe:3b:ab:d3:72:06 brd ff:ff:ff:ff:ff:ff
8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN 
    link/ether 2e:fe:c7:30:af:3e brd ff:ff:ff:ff:ff:ff
9: rhevm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 00:9c:02:b0:ff:28 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::29c:2ff:feb0:ff28/64 scope link 
       valid_lft forever preferred_lft forever

/etc/sysconfig/network-scripts/ifcfg-enp4s0f0
# Generated by VDSM version 4.16.8.1-6.el7ev
DEVICE=enp4s0f0
HWADDR=00:9c:02:b0:ff:28
BRIDGE=rhevm
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no
/etc/sysconfig/network-scripts/ifcfg-enp5s0f0
HWADDR=44:1E:A1:71:A2:FC
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=enp5s0f0
UUID=f1eb15d0-303e-49d8-8de9-a8d1b92b8f0e
ONBOOT=no
/etc/sysconfig/network-scripts/ifcfg-enp5s0f1
HWADDR=44:1E:A1:71:A2:FD
TYPE=Ethernet
BOOTPROTO=dhcp
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=enp5s0f1
UUID=6faa28a1-559f-4c80-b4ff-cdad27b7865d
ONBOOT=no
/etc/sysconfig/network-scripts/ifcfg-lo
DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
# If you're having problems with gated making 127.0.0.0/8 a martian,
# you can change this to something else (255.255.255.255, for example)
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback
/etc/sysconfig/network-scripts/ifcfg-rhevm
# Generated by VDSM version 4.16.8.1-6.el7ev
DEVICE=rhevm
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
BOOTPROTO=dhcp
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no

# brctl show
bridge name	bridge id		STP enabled	interfaces
;vdsmdummy;		8000.000000000000	no		
rhevm		8000.009c02b0ff28	no		enp4s0f0

Comment 21 Gil Klein 2015-02-08 11:24:06 UTC
Created attachment 989365 [details]
dmesg output (puma28.scl.lab.tlv.redhat.com)  RHEL 7.0

Comment 22 Gil Klein 2015-02-08 11:58:04 UTC
Created attachment 989369 [details]
journalctl output (puma28.scl.lab.tlv.redhat.com)  RHEL 7.0

Comment 23 Jiri Popelka 2015-02-08 18:05:48 UTC
I can't see any useful info in the last two dmesg & journalctl logs.

Anyway to summarize what we know so far:

Something (vdsm ?) stops/kills (we know that from strace) dhclient.
Dhclient (or rather dhclient-script) sets address lifetime when adding it to interface.
So when the address is not renewed (for example because something has stopped/killed dhclient),
kernel automatically removes the address when its lifetime ends.
Gil verified (as he wrote in e-mail) that setting the lifetime to 'forever' makes the problem not appear.

This setting of address lifetime was added with
http://pkgs.devel.redhat.com/cgit/rpms/dhcp/commit/?h=rhel-7.0&id=d62cff5d6c06b40aefe2c5839101305c54f6aba7
http://pkgs.devel.redhat.com/cgit/rpms/dhcp/commit/?h=rhel-7.0&id=dbe9812c437730d64c2ec39ba54d53b79201d16f
due to bug #1032809.
If there's no other way, it'd be possible to revert those commits, but that's just hiding of a bug somewhere else - but where I have no idea.

Comment 24 Dan Kenigsberg 2015-02-09 19:56:22 UTC
Jiri, I have another reproducer of the issue. Here, `vdsm-tool restore-nets` helper script calls `ifdown enp6s0` which ends up killing dhclient.

Then, the script calls `ifup enp6s0` which creates a new dhclient process. For some reason, this new process dies after 5 seconds or so, shortly after handling a DHCPREQUEST. This does not happen on el6 hosts.

Based on timing, it seems as if the dhclient process is killed after
ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start
finishes.

Do you happen to know what can be doing this? systemd?

Comment 25 Jiri Popelka 2015-02-10 10:02:32 UTC
No idea. We'd need to see strace output first to know whether it's still the same case of dhclient being killed/stopped (as in the previous reproducers) or it crashed or exited.

Comment 26 Lior Vernia 2015-02-11 09:32:42 UTC
We're not yet sure that this is really a vdsm issue, but it's definitely vdsm that's triggering it. Changing product, though it may be changed back soon...

Comment 28 Lior Vernia 2015-02-11 10:16:26 UTC
*** Bug 1189837 has been marked as a duplicate of this bug. ***

Comment 29 Ido Barkan 2015-02-11 11:10:30 UTC
Created attachment 990419 [details]
strace output of a relevant dhclient

Comment 34 Dan Kenigsberg 2015-02-16 09:48:17 UTC
Gil, in the reproduced scenario, initscripts starts dhclient on boot. Then, vdsm calls ifdown to kill it (I suspect that this is what you see in auditctl). Then, vdsm starts a new dhclient up, and it is killed by something after spawning into background. Can you double-check that you find the killer of the second dhclient?

Comment 36 Gil Klein 2015-02-16 11:18:29 UTC
(In reply to Dan Kenigsberg from comment #34)
> Gil, in the reproduced scenario, initscripts starts dhclient on boot. Then,
> vdsm calls ifdown to kill it (I suspect that this is what you see in
> auditctl). Then, vdsm starts a new dhclient up, and it is killed by
> something after spawning into background. Can you double-check that you find
> the killer of the second dhclient?
It seems to be ifdown-eth in all cases:

# ausearch -k teste_kill | grep -A 6 -B 6 dhclient
type=SYSCALL msg=audit(1423773423.638:91): arch=c000003e syscall=62 success=no exit=-3 a0=1247 a1=f a2=7f5a562bf640 a3=8 items=0 ppid=1330 pid=4674 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="teste_kill"
----
time->Thu Feb 12 22:37:07 2015
type=SYSCALL msg=audit(1423773427.327:103): arch=c000003e syscall=62 success=no exit=-3 a0=1208 a1=f a2=0 a3=1208 items=0 ppid=4740 pid=4743 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:initrc_t:s0 key="teste_kill"
----
time->Thu Feb 12 22:37:07 2015
type=OBJ_PID msg=audit(1423773427.300:102): opid=4616 oauid=-1 ouid=0 oses=-1 obj=system_u:system_r:dhcpc_t:s0 ocomm="dhclient"
type=SYSCALL msg=audit(1423773427.300:102): arch=c000003e syscall=62 success=yes exit=0 a0=1208 a1=f a2=0 a3=7fffb7b49490 items=0 ppid=4740 pid=4743 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:initrc_t:s0 key="teste_kill"
----
time->Thu Feb 12 22:37:14 2015
type=OBJ_PID msg=audit(1423773434.284:108): opid=4718 oauid=0 ouid=0 oses=24 obj=unconfined_u:system_r:systemd_passwd_agent_t:s0 ocomm="systemd-tty-ask"
type=SYSCALL msg=audit(1423773434.284:108): arch=c000003e syscall=62 success=yes exit=0 a0=126e a1=f a2=7fce6a7bb768 a3=7fffbb645310 items=0 ppid=4711 pid=4717 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=24 comm="systemctl" exe="/usr/bin/systemctl" subj=unconfined_u:system_r:initrc_t:s0 key="teste_kill"
----
--
----
time->Sun Feb 15 18:54:40 2015
type=OBJ_PID msg=audit(1424019280.153:33136): opid=1860 oauid=-1 ouid=107 oses=-1 obj=system_u:system_r:svirt_t:s0:c310,c1006 ocomm="qemu-kvm"
type=SYSCALL msg=audit(1424019280.153:33136): arch=c000003e syscall=62 success=yes exit=0 a0=744 a1=0 a2=7fc5f0000048 a3=0 items=0 ppid=1 pid=32424 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="libvirtd" exe="/usr/sbin/libvirtd" subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 key="teste_kill"
----
time->Sun Feb 15 18:54:54 2015
type=OBJ_PID msg=audit(1424019294.006:33164): opid=5241 oauid=-1 ouid=0 oses=-1 obj=system_u:system_r:dhcpc_t:s0 ocomm="dhclient"
type=SYSCALL msg=audit(1424019294.006:33164): arch=c000003e syscall=62 success=yes exit=0 a0=1479 a1=f a2=0 a3=7fffa4f7fce0 items=0 ppid=1946 pid=1949 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:initrc_t:s0 key="teste_kill"
----
time->Sun Feb 15 18:54:54 2015
type=SYSCALL msg=audit(1424019294.034:33165): arch=c000003e syscall=62 success=no exit=-3 a0=1479 a1=f a2=0 a3=1479 items=0 ppid=1946 pid=1949 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:initrc_t:s0 key="teste_kill"
----
time->Sun Feb 15 18:57:05 2015
--
----
time->Sun Feb 15 19:02:05 2015
type=OBJ_PID msg=audit(1424019725.007:66): opid=2464 oauid=-1 ouid=107 oses=-1 obj=system_u:system_r:virtd_t:s0-s0:c0.c1023 ocomm="qemu-kvm"
type=SYSCALL msg=audit(1424019725.007:66): arch=c000003e syscall=62 success=yes exit=0 a0=9a0 a1=9 a2=7f337c000040 a3=2 items=0 ppid=1 pid=2354 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="libvirtd" exe="/usr/sbin/libvirtd" subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 key="teste_kill"
----
time->Sun Feb 15 19:02:06 2015
type=OBJ_PID msg=audit(1424019726.445:69): opid=1287 oauid=-1 ouid=0 oses=-1 obj=system_u:system_r:dhcpc_t:s0 ocomm="dhclient"
type=SYSCALL msg=audit(1424019726.445:69): arch=c000003e syscall=62 success=yes exit=0 a0=507 a1=f a2=0 a3=7fff2e7f92a0 items=0 ppid=2496 pid=2503 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:init_t:s0 key="teste_kill"
----
time->Sun Feb 15 19:02:06 2015
type=SYSCALL msg=audit(1424019726.470:70): arch=c000003e syscall=62 success=no exit=-3 a0=507 a1=f a2=0 a3=507 items=0 ppid=2496 pid=2503 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ifdown-eth" exe="/usr/bin/bash" subj=system_u:system_r:init_t:s0 key="teste_kill"

Comment 37 Gil Klein 2015-02-16 11:23:11 UTC
Created attachment 992150 [details]
alma07 logs (auditd/journalctl/dmesg)

Comment 38 Fabian Deutsch 2015-02-19 13:35:33 UTC
One idea that just came to my mind: We can put systemd in debug mode to see if systemd is killing dhclient: Add systemd.log_level=debug to the kernel cmdline (maybe it can also be put in debug mode in userspace)

Comment 39 Eyal Edri 2015-02-25 08:59:23 UTC
this bugs weren't included in the 3.5.1 tracker, hence not marked as critical to the version. 3.5.1 was done and already has a high number of bugs, so postponing this to 3.5.2

Comment 42 Kevin Alon Goldblatt 2015-03-05 12:33:33 UTC
I have also just experienced this issue on both my hosts:

I am running:
rhel7.1 on my hosts with rhevm3.5.1 vt14
vdsm-4.16.12-2.el7ev.x86_64

ifconfig -a displays no IP

/var/log/messages reports:
Mar  4 18:25:27 nott-vds2 puppet: Error: NetworkManager is not running.
Mar  4 18:25:27 nott-vds2 puppet: Could not retrieve fact='ipaddress', resolution='<anonymous>': Network is unreachable - sendto(2)

Resolved the issue by running dhclient

Comment 43 Kevin Alon Goldblatt 2015-03-05 12:41:06 UTC
Created attachment 998341 [details]
server, engine and vdsm logs

Adding logs

Comment 44 Dan Kenigsberg 2015-03-19 14:29:37 UTC
Now that we have a fix almost ready for this deeply-troubling el7 bug, I would like to include it in the upcoming rhev-3.5.1

Comment 46 Ido Barkan 2015-03-25 12:13:47 UTC
the same behavior (systemd is killing dhclient) also occurres during service restart. this is a cavity in the posted solution that can be fixed simply by telling systemd not to do so.

Comment 47 Michael Burman 2015-05-05 06:45:01 UTC
Verified and tested on RHEL 7.1

3.6.0-0.0.master.20150412172306.git55ba764.el6.noarch
vdsm-4.17.0-632.git19a83a2.el7.x86_64
vdsm-4.17.0-633.git7ad88bc.el7.x86_64
systemd-208-20.el7_1.2.x86_64
systemd-208-20.el7_1.3.x86_64
dhclient-4.2.5-36.el7.x86_64

Comment 50 Lucy Bopf 2016-02-19 10:22:59 UTC
Moving 'requires_doc_text' to '-', as this bug is verified and no longer needs to be documented as a known issue.

Comment 52 errata-xmlrpc 2016-03-09 19:29:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html


Note You need to log in before you can comment on or make changes to this bug.