Bug 1267169 - systemd on overcloud nodes fail to start dhcp-interface@BRIDGE_NAME.services after reboot
Status: MODIFIED
Product: Red Hat OpenStack
Classification: Red Hat
Component: diskimage-builder
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Bob Fournier
QA Contact: Shai Revivo
Keywords: Reopened, Triaged, ZStream
Depends On:
Blocks: 1579831 1553099
Reported: 2015-09-29 04:31 EDT by Jaison Raju
Modified: 2018-05-18 08:45 EDT (History)
CC List: 16 users

See Also:
Fixed In Version: diskimage-builder-1.26.1-3.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1553099 1579831
Environment:
Last Closed: 2016-08-19 15:37:38 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Screenshot node start (29.26 KB, image/png)
2015-09-29 04:37 EDT, Jaison Raju
systemd journal ifcfg (90.82 KB, application/x-bzip)
2015-09-29 04:49 EDT, Jaison Raju


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 2741461 | None | None | None | 2016-11-01 08:25 EDT

Description Jaison Raju 2015-09-29 04:31:09 EDT
Description of problem:
systemd on overcloud nodes fails to start the dhcp-interface@BRIDGE_NAME services after a reboot.

# systemctl status dhcp-interface@br-storage.service -l
dhcp-interface@br-storage.service - DHCP interface br/storage
   Loaded: loaded (/usr/lib/systemd/system/dhcp-interface@.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-09-28 14:35:54 EDT; 16h ago
  Process: 3037 ExecStart=/sbin/ifup %I (code=exited, status=1/FAILURE)
  Process: 3034 ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I (code=exited, status=0/SUCCESS)
 Main PID: 3037 (code=exited, status=1/FAILURE)

Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Starting DHCP interface br/storage...
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain dhcp-all-interfaces.sh[3034]: cat: /sys/class/net/br/storage/addr_assign_type: No such file or directory
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain dhcp-all-interfaces.sh[3034]: Inspecting interface: br/storage...Device has generated MAC, skipping.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain ifup[3037]: /sbin/ifup: configuration for br/storage not found.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain ifup[3037]: Usage: ifup <configuration>
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: dhcp-interface@br-storage.service: main process exited, code=exited, status=1/FAILURE
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Failed to start DHCP interface br/storage.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Unit dhcp-interface@br-storage.service entered failed state.

# ls /etc/sysconfig/network-scripts/ifcfg-*
/etc/sysconfig/network-scripts/ifcfg-br-storage  /etc/sysconfig/network-scripts/ifcfg-eth1  /etc/sysconfig/network-scripts/ifcfg-vlan20
/etc/sysconfig/network-scripts/ifcfg-eth0        /etc/sysconfig/network-scripts/ifcfg-lo    /etc/sysconfig/network-scripts/ifcfg-vlan40
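
The journal above shows the pre-start script reading `/sys/class/net/br/storage/addr_assign_type`, while the configuration file on disk is `ifcfg-br-storage`. The lookup therefore uses `br/storage` rather than `br-storage`. A minimal Python sketch of why such a lookup finds nothing follows. The helper name `read_addr_assign_type` and the `sysfs_root` parameter are hypothetical and exist only for illustration; this is not the code of `dhcp-all-interfaces.sh`.

```python
from pathlib import Path
from typing import Optional


def read_addr_assign_type(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[str]:
    """Return the contents of <sysfs_root>/<iface>/addr_assign_type, or None.

    Hypothetical helper: it only mirrors the path seen in the journal.
    """
    path = Path(sysfs_root) / iface / "addr_assign_type"
    try:
        return path.read_text().strip()
    except OSError:
        # No such device directory, e.g. "br/storage" instead of "br-storage".
        return None


if __name__ == "__main__":
    # On an affected node the first name would be expected to resolve
    # and the second would not.
    for name in ("br-storage", "br/storage"):
        print(name, "->", read_addr_assign_type(name))
```

Passing a scratch directory as `sysfs_root` allows the behaviour to be tried without touching a real node.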


Version-Release number of selected component (if applicable):
RHOS 7
Director

How reproducible:
Always

Steps to Reproduce:
1. Set up overcloud nodes using templates.
2. Reboot node

Actual results:
systemd fails to start dhcp-interface@* service

Expected results:
systemd is able to start all services.

Additional info:
Comment 1 Jaison Raju 2015-09-29 04:37 EDT
Created attachment 1078232
Screenshot node start
Comment 3 Jaison Raju 2015-09-29 04:49 EDT
Created attachment 1078234
systemd journal ifcfg
Comment 4 Jaison Raju 2015-09-29 04:55:51 EDT
On the initial start of the nodes, I do not find these systemd interface services.
Subsequent reboots cause systemd to start these bridge interface services.

# egrep "dhcp|interface"  before_systemd_status.txt after_systemd_status.txt
after_systemd_status.txt:dhcp-interface@br-ex.service -> '/org/freedesktop/systemd1/unit/dhcp_2dinterface_40br_2dex_2eservice'
after_systemd_status.txt:dhcp-interface@br-ex.service - DHCP interface br/ex
after_systemd_status.txt:   Loaded: loaded (/usr/lib/systemd/system/dhcp-interface@.service; disabled)
after_systemd_status.txt:  Process: 1401 ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I (code=exited, status=0/SUCCESS)
after_systemd_status.txt:Sep 29 04:31:54 overcloud-controller-0.localdomain dhcp-all-interfaces.sh[1401]: cat: /sys/class/net/br/ex/addr_assign_type: No such file or directory
after_systemd_status.txt:Sep 29 04:31:54 overcloud-controller-0.localdomain dhcp-all-interfaces.sh[1401]: Inspecting interface: br/ex...Device has generated MAC, skipping.
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: dhcp-interface@br-ex.service: main process exited, code=exited, status=1/FAILURE
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: Failed to start DHCP interface br/ex.
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: Unit dhcp-interface@br-ex.service entered failed state.

Regards,
Jaison R
Comment 6 chris alfonso 2015-09-30 12:14:06 EDT
Other than observing the services, what is the net effect or issue this causes?
Comment 7 Jaison Raju 2015-10-01 23:58:42 EDT
(In reply to chris alfonso from comment #6)
> Other than observing the services, what is the net effect or issue this
> causes?

So far no issues have been noticed.
Networking, floating IPs, Glance, and Cinder work well.
Comment 10 Hugh Brock 2016-02-28 03:07:12 EST
Jaison, can you still observe this with 7.3?
Comment 11 Jaison Raju 2016-03-04 06:06:45 EST
(In reply to Hugh Brock from comment #10)
> Jaison, can you still observe this with 7.3?

Still noticed:
[root@overcloud-controller-0 ~]# systemctl | grep dhcp
● dhcp-interface@br-ex.service                                                             loaded failed     failed          DHCP interface br/ex
● dhcp-interface@br-int.service                                                            loaded failed     failed          DHCP interface br/int
● dhcp-interface@br-tun.service                                                            loaded failed     failed          DHCP interface br/tun
● dhcp-interface@ovs-system.service                                                        loaded failed     failed          DHCP interface ovs/system
  system-dhcp\x2dinterface.slice                                                           loaded active     active          system-dhcp\x2dinterface.slice

[root@overcloud-compute-0 ~]# systemctl | grep dhcp
● dhcp-interface@br-ex.service                                                             loaded failed failed    DHCP interface br/ex
● dhcp-interface@br-int.service                                                            loaded failed failed    DHCP interface br/int
● dhcp-interface@br-tun.service                                                            loaded failed failed    DHCP interface br/tun
  dhcp-interface@eth0.service                                                              loaded active exited    DHCP interface eth0
  dhcp-interface@eth1.service                                                              loaded active exited    DHCP interface eth1
● dhcp-interface@ovs-system.service                                                        loaded failed failed    DHCP interface ovs/system
  system-dhcp\x2dinterface.slice                                                           loaded active active    system-dhcp\x2dinterface.slice
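
To compare nodes, the failed instances can be pulled out of saved `systemctl` output such as the listings above. A small Python sketch follows, assuming the column order shown there (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION). The function name `failed_dhcp_units` is hypothetical, and the sample text is an abbreviated copy of the compute node listing.

```python
from typing import List

# Abbreviated copy of the compute node listing from this comment.
SAMPLE = r"""
● dhcp-interface@br-ex.service       loaded failed failed    DHCP interface br/ex
● dhcp-interface@br-int.service      loaded failed failed    DHCP interface br/int
● dhcp-interface@br-tun.service      loaded failed failed    DHCP interface br/tun
  dhcp-interface@eth0.service        loaded active exited    DHCP interface eth0
  dhcp-interface@eth1.service        loaded active exited    DHCP interface eth1
● dhcp-interface@ovs-system.service  loaded failed failed    DHCP interface ovs/system
  system-dhcp\x2dinterface.slice     loaded active active    system-dhcp\x2dinterface.slice
"""


def failed_dhcp_units(systemctl_output: str) -> List[str]:
    """Return dhcp-interface@ units whose ACTIVE column reads 'failed'."""
    failed = []
    for line in systemctl_output.splitlines():
        # Drop the status bullet so the unit name is always the first field.
        fields = line.replace("●", " ").split()
        if len(fields) < 4:
            continue
        unit, _load, active = fields[0], fields[1], fields[2]
        if unit.startswith("dhcp-interface@") and active == "failed":
            failed.append(unit)
    return failed


if __name__ == "__main__":
    for unit in failed_dhcp_units(SAMPLE):
        print(unit)
```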
Comment 12 Mike Burns 2016-04-07 16:50:54 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 31 Bob Fournier 2018-03-01 10:53:14 EST
Thanks Pablo.  From looking at the sosreport for the affected compute node - sosreport-esjc-ost1-cn01p.localdomain-20171115162956

1) The messages that appear to have linked the issue to this bug, namely:
Nov 14 12:39:10 esjc-ost1-cn01p ifup: /sbin/ifup: configuration for ovs/system not found.

are only a logging artifact. They do not indicate a functional problem and are not the source of the issue. A separate bug should be opened to handle the problem associated with the case.  These log messages are really just a red herring.

2) The actual issue can't be determined from the available info.  As requested in comment 21 by Dan we need:
a) the templates associated with the deployment, specifically the nic config yaml files
b) the contents of /etc/os-net-config/config.json on the affected node

3) It looks like network teaming is being used; we'd like to understand how it is configured, hence the request in #2.

bfournie-OSX:sosreport-esjc-ost1-cn01p.localdomain-20171115162956 bfournie$ cat etc/sysconfig/network-scripts/ifcfg-team1 
# This file is autogenerated by os-net-config
DEVICE=team1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-bond1
DEVICETYPE=ovs
TYPE=OVSBond
BOND_IFACES="enp7s0 enp8s0"
OVS_OPTIONS="bond_mode=active-backup"

4)  There are some unexpected logs in /var/log/messages, namely:

Nov 14 12:39:10 esjc-ost1-cn01p cloud-init: 2017-11-14 12:39:10,739 - stages.py[WARNING]: Failed to rename devices: [unknown] Error performing rename('enp7s0', 'br-bond1') for 00:25:b5:25:0a:6a, br-bond1: Unexpected error while running command.
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Command: ['ip', 'link', 'set', 'enp7s0', 'name', 'br-bond1']
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Exit code: 2
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Reason: -
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Stdout: -
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Stderr: RTNETLINK answers: File exists
Nov 14 17:31:25 esjc-ost1-cn01p /usr/bin/virt-who: [INFO] @main.py:183 - Using configuration ""esjc-ost1-cn01p.OST-JC1"" ("libvirt" mode)

From these messages it appears that Siggy's comment in the case (Sigwald, Siggy on Nov 20 2017 at 09:50 AM -08:00) is relevant and does not appear to have been answered.  His comments are:
"From this i can only assume that:
a) you have a problem with cloud-init
b) you have a configuration issue on your network-scripts"

Our recommendation is to:
a) create a separate BZ to separate this issue from the cosmetic issue in this BZ
b) get the nic config files and config.json
c) get answers to the questions that Siggy brought up in the case regarding cloud-init
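
The `ifcfg-team1` listing in item 3 assigns `DEVICETYPE` and `TYPE` twice. ifcfg files are read as shell variable assignments, so a later assignment to the same key overrides an earlier one. Whether that matters for this case is not established here; the following Python sketch only reports repeated keys so that files like this are easier to review. The function name `duplicate_keys` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Contents of ifcfg-team1 as quoted in item 3.
IFCFG_TEAM1 = """\
# This file is autogenerated by os-net-config
DEVICE=team1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-bond1
DEVICETYPE=ovs
TYPE=OVSBond
BOND_IFACES="enp7s0 enp8s0"
OVS_OPTIONS="bond_mode=active-backup"
"""


def duplicate_keys(ifcfg_text: str) -> Dict[str, List[str]]:
    """Map each key assigned more than once to its values, in file order."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for raw in ifcfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        seen[key.strip()].append(value.strip().strip('"'))
    return {key: values for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    for key, values in duplicate_keys(IFCFG_TEAM1).items():
        print(key, "assigned", len(values), "times; last value:", values[-1])
```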
Comment 32 Bob Fournier 2018-03-01 11:10:35 EST
Also, the initial problem that this bug was created for, namely these log messages:
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Failed to start DHCP interface br/storage.

has been resolved with this fix - https://bugzilla.redhat.com/show_bug.cgi?id=1403795 which fixes the problem with escaping '-' in names like br-ex, br-storage etc.
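
The `br/storage` spelling in those messages follows from systemd's unit-name escaping: in an instance name a plain `-` stands for `/`, and a literal `-` has to be written as `\x2d` (as in `system-dhcp\x2dinterface.slice` in comment 11). The `%I` specifier used in `ExecStart=/sbin/ifup %I` expands to the unescaped instance name, so `dhcp-interface@br-storage.service` yields `br/storage`. A simplified Python sketch of both directions follows, for ASCII names only. The function names are hypothetical and this is not systemd's implementation.

```python
import re
import string

# Characters that are kept as-is by the escaping rule sketched here.
_SAFE = set(string.ascii_letters + string.digits + ":_.")


def systemd_escape(name: str) -> str:
    """Approximate unit-name escaping: '/' becomes '-', unsafe bytes become \\xNN."""
    out = []
    for index, ch in enumerate(name):
        if ch == "/":
            out.append("-")
        elif ch in _SAFE and not (ch == "." and index == 0):
            out.append(ch)
        else:
            out.append("".join("\\x%02x" % byte for byte in ch.encode("utf-8")))
    return "".join(out)


def systemd_unescape(instance: str) -> str:
    """Approximate what %I yields: '-' becomes '/', \\xNN becomes the character."""
    with_slashes = instance.replace("-", "/")
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        with_slashes,
    )


if __name__ == "__main__":
    print(systemd_unescape("br-storage"))                  # unescaped bridge name
    print(systemd_escape("br-storage"))                    # escaped form
    print(systemd_unescape(systemd_escape("br-storage")))  # round trip
```

Under these rules, an instance that should resolve to `br-storage` would be named `dhcp-interface@br\x2dstorage.service`.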

See the linked upstream bug - https://bugs.launchpad.net/diskimage-builder/+bug/1649409 and BZ https://bugzilla.redhat.com/show_bug.cgi?id=1403795 (where it's noted that these log messages for bridges are cosmetic problems only).

Again, these log messages are not related to the cases that have been associated with this BZ.

We may want to backport fix https://bugzilla.redhat.com/show_bug.cgi?id=1403795 to OSP-10 if we want to fix these cosmetic issues.
Comment 33 Bob Fournier 2018-03-01 11:11:41 EST
Adding Needinfo for requests in Comment 31.
Comment 34 Pablo Iranzo Gómez 2018-03-02 05:49:05 EST
Moving needinfo to case owner
Comment 35 Bob Fournier 2018-03-07 17:34:39 EST
I'm marking this as Triaged, as the title and initial problem have been fixed, and a backport of https://bugzilla.redhat.com/show_bug.cgi?id=1403795 is needed to get this fix into OSP-10 (or we need to determine whether this fix is actually in rhos-10-patches, as there are no upstream branches for diskimage-builder).

For any issues related to case 01974327, please open a new bug.
Comment 37 Bob Fournier 2018-05-18 08:45:23 EDT
Eduard - yes, we can backport this fix to OSP-7.  However, we'd like to make sure that this fix (https://code.engineering.redhat.com/gerrit/#/c/136890/) will resolve the issue Telefonica is hitting.  The reason I ask is that this has been reported as just a cosmetic logging issue (see comments 7 and 15 above, for example).

So, assuming that this is the issue that Telefonica is hitting, I will start the backport. I've created https://bugzilla.redhat.com/show_bug.cgi?id=1579831 to track the backport.
