Bug 1267169 - systemd on overcloud nodes fail to start dhcp-interface after reboot
Summary: systemd on overcloud nodes fail to start dhcp-interface@BRIDGE_NAME.services ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: diskimage-builder
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
low
medium
Target Milestone: z9
: 10.0 (Newton)
Assignee: Bob Fournier
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks: 1553099 1579831 1585763 1585764
Reported: 2015-09-29 08:31 UTC by Jaison Raju
Modified: 2021-12-10 14:44 UTC (History)

Fixed In Version: diskimage-builder-1.26.1-3.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, systemd did not correctly handle DHCP service interface names that contained '-'. As a result, these interfaces failed to start and logged the error 'Failed to start DHCP interface'. With this fix, systemd now escapes interface names that contain '-'.
Clone Of:
: 1553099 1579831 1585763 1585764
Environment:
Last Closed: 2018-09-17 16:59:16 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot node start (29.26 KB, image/png)
2015-09-29 08:37 UTC, Jaison Raju
systemd journal ifcfg (90.82 KB, application/x-bzip)
2015-09-29 08:49 UTC, Jaison Raju


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-7933 0 None None None 2021-12-10 14:44:09 UTC
Red Hat Knowledge Base (Solution) 2741461 0 None None None 2016-11-01 12:25:01 UTC
Red Hat Product Errata RHBA-2018:2671 0 None None None 2018-09-17 17:00:47 UTC

Description Jaison Raju 2015-09-29 08:31:09 UTC
Description of problem:
systemd on overcloud nodes fails to start the dhcp-interface services after reboot.

# systemctl status dhcp-interface@br-storage.service -l
dhcp-interface@br-storage.service - DHCP interface br/storage
   Loaded: loaded (/usr/lib/systemd/system/dhcp-interface@.service; disabled)
   Active: failed (Result: exit-code) since Mon 2015-09-28 14:35:54 EDT; 16h ago
  Process: 3037 ExecStart=/sbin/ifup %I (code=exited, status=1/FAILURE)
  Process: 3034 ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I (code=exited, status=0/SUCCESS)
 Main PID: 3037 (code=exited, status=1/FAILURE)

Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Starting DHCP interface br/storage...
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain dhcp-all-interfaces.sh[3034]: cat: /sys/class/net/br/storage/addr_assign_type: No such file or directory
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain dhcp-all-interfaces.sh[3034]: Inspecting interface: br/storage...Device has generated MAC, skipping.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain ifup[3037]: /sbin/ifup: configuration for br/storage not found.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain ifup[3037]: Usage: ifup <configuration>
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: dhcp-interface@br-storage.service: main process exited, code=exited, status=1/FAILURE
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Failed to start DHCP interface br/storage.
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Unit dhcp-interface@br-storage.service entered failed state.

# ls /etc/sysconfig/network-scripts/ifcfg-*
/etc/sysconfig/network-scripts/ifcfg-br-storage  /etc/sysconfig/network-scripts/ifcfg-eth1  /etc/sysconfig/network-scripts/ifcfg-vlan20
/etc/sysconfig/network-scripts/ifcfg-eth0        /etc/sysconfig/network-scripts/ifcfg-lo    /etc/sysconfig/network-scripts/ifcfg-vlan40


Version-Release number of selected component (if applicable):
RHOS 7
Director

How reproducible:
Always

Steps to Reproduce:
1. Set up overcloud nodes using templates.
2. Reboot the node.

Actual results:
systemd fails to start dhcp-interface@* service

Expected results:
systemd is able to start all services.

Additional info:

Comment 1 Jaison Raju 2015-09-29 08:37:49 UTC
Created attachment 1078232 [details]
Screenshot node start

Comment 3 Jaison Raju 2015-09-29 08:49:15 UTC
Created attachment 1078234 [details]
systemd journal ifcfg

Comment 4 Jaison Raju 2015-09-29 08:55:51 UTC
On the initial start of the nodes, I do not see these systemd interface services.
On subsequent reboots, systemd tries to start them for the bridge interfaces.

# egrep "dhcp|interface"  before_systemd_status.txt after_systemd_status.txt
after_systemd_status.txt:dhcp-interface@br-ex.service -> '/org/freedesktop/systemd1/unit/dhcp_2dinterface_40br_2dex_2eservice'
after_systemd_status.txt:dhcp-interface@br-ex.service - DHCP interface br/ex
after_systemd_status.txt:   Loaded: loaded (/usr/lib/systemd/system/dhcp-interface@.service; disabled)
after_systemd_status.txt:  Process: 1401 ExecStartPre=/usr/local/sbin/dhcp-all-interfaces.sh %I (code=exited, status=0/SUCCESS)
after_systemd_status.txt:Sep 29 04:31:54 overcloud-controller-0.localdomain dhcp-all-interfaces.sh[1401]: cat: /sys/class/net/br/ex/addr_assign_type: No such file or directory
after_systemd_status.txt:Sep 29 04:31:54 overcloud-controller-0.localdomain dhcp-all-interfaces.sh[1401]: Inspecting interface: br/ex...Device has generated MAC, skipping.
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: dhcp-interface@br-ex.service: main process exited, code=exited, status=1/FAILURE
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: Failed to start DHCP interface br/ex.
after_systemd_status.txt:Sep 29 04:32:06 overcloud-controller-0.localdomain systemd[1]: Unit dhcp-interface@br-ex.service entered failed state.
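
The D-Bus object path in the first line of the grep output above encodes the full unit name. As a rough sketch (assuming the usual systemd convention that each non-alphanumeric byte in a bus path label is written as `_XX` in hex; the helper name `bus_label_unescape` is made up for this illustration), it can be decoded like this:

```python
import re


def bus_label_unescape(label: str) -> str:
    """Decode a systemd D-Bus path label where '_XX' is a hex-encoded ASCII byte."""
    return re.sub(
        r"_([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        label,
    )


if __name__ == "__main__":
    # Last path component taken from the grep output above.
    label = "dhcp_2dinterface_40br_2dex_2eservice"
    print(bus_label_unescape(label))
```

The decoded name shows which template instance systemd was asked to start, which is useful when the short name in the status output is ambiguous.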

Regards,
Jaison R

Comment 6 chris alfonso 2015-09-30 16:14:06 UTC
Other than observing the services, what is the net effect or issue this causes?

Comment 7 Jaison Raju 2015-10-02 03:58:42 UTC
(In reply to chris alfonso from comment #6)
> Other than observing the services, what is the net effect or issue this
> causes?

So far no issues have been noticed.
Networking, floating IPs, glance and cinder all work well.

Comment 10 Hugh Brock 2016-02-28 08:07:12 UTC
Jaison, can you still observe this with 7.3?

Comment 11 Jaison Raju 2016-03-04 11:06:45 UTC
(In reply to Hugh Brock from comment #10)
> Jaison, can you still observe this with 7.3?

Still noticed:
[root@overcloud-controller-0 ~]# systemctl | grep dhcp
● dhcp-interface@br-ex.service       loaded failed failed DHCP interface br/ex
● dhcp-interface@br-int.service      loaded failed failed DHCP interface br/int
● dhcp-interface@br-tun.service      loaded failed failed DHCP interface br/tun
● dhcp-interface@ovs-system.service  loaded failed failed DHCP interface ovs/system
  system-dhcp\x2dinterface.slice     loaded active active system-dhcp\x2dinterface.slice

[root@overcloud-compute-0 ~]# systemctl | grep dhcp
● dhcp-interface@br-ex.service       loaded failed failed DHCP interface br/ex
● dhcp-interface@br-int.service      loaded failed failed DHCP interface br/int
● dhcp-interface@br-tun.service      loaded failed failed DHCP interface br/tun
  dhcp-interface@eth0.service        loaded active exited DHCP interface eth0
  dhcp-interface@eth1.service        loaded active exited DHCP interface eth1
● dhcp-interface@ovs-system.service  loaded failed failed DHCP interface ovs/system
  system-dhcp\x2dinterface.slice     loaded active active system-dhcp\x2dinterface.slice

Comment 12 Mike Burns 2016-04-07 20:50:54 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 31 Bob Fournier 2018-03-01 15:53:14 UTC
Thanks Pablo.  From looking at the sosreport for the affected compute node - sosreport-esjc-ost1-cn01p.localdomain-20171115162956

1) The messages that appear to have linked the issue to this bug, namely:
Nov 14 12:39:10 esjc-ost1-cn01p ifup: /sbin/ifup: configuration for ovs/system not found.

are only due to logging and not a functional problem, and not the source of the issue. A separate bug should be opened to handle the problem associated with the case.  These log messages are really just a red herring.

2) The actual issue can't be determined from the available info.  As requested in comment 21 by Dan we need:
a) the templates associated with the deployment, specifically the nic config yaml files
b) the contents of /etc/os-net-config/config.json on the affected node

3) It looks like network teaming is being used; we'd like to understand how it is configured, hence the request in item 2.

bfournie-OSX:sosreport-esjc-ost1-cn01p.localdomain-20171115162956 bfournie$ cat etc/sysconfig/network-scripts/ifcfg-team1 
# This file is autogenerated by os-net-config
DEVICE=team1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-bond1
DEVICETYPE=ovs
TYPE=OVSBond
BOND_IFACES="enp7s0 enp8s0"
OVS_OPTIONS="bond_mode=active-backup"

4)  There are some unexpected logs in /var/log/messages, namely:

Nov 14 12:39:10 esjc-ost1-cn01p cloud-init: 2017-11-14 12:39:10,739 - stages.py[WARNING]: Failed to rename devices: [unknown] Error performing rename('enp7s0', 'br-bond1') for 00:25:b5:25:0a:6a, br-bond1: Unexpected error while running command.
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Command: ['ip', 'link', 'set', 'enp7s0', 'name', 'br-bond1']
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Exit code: 2
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Reason: -
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Stdout: -
Nov 14 17:31:25 esjc-ost1-cn01p cloud-init: Stderr: RTNETLINK answers: File exists
Nov 14 17:31:25 esjc-ost1-cn01p /usr/bin/virt-who: [INFO] @main.py:183 - Using configuration ""esjc-ost1-cn01p.OST-JC1"" ("libvirt" mode)

From these messages it appears that Siggy's comment in the case (Sigwald, Siggy on Nov 20 2017 at 09:50 AM -08:00) is relevant and does not appear to have been answered.  His comments are:
"From this i can only assume that:
a) you have a problem with cloud-init
b) you have a configuration issue on your network-scripts"

Our recommendation is to:
a) create a separate BZ to separate this issue from the cosmetic issue in this BZ
b) get the nic config files and config.json
c) get answers to the questions that Siggy brought up in the case regarding cloud-init

Comment 32 Bob Fournier 2018-03-01 16:10:35 UTC
Also, the initial problem that this bug was created for, namely these log messages:
Sep 28 14:35:54 overcloud-blockstorage-2.localdomain systemd[1]: Failed to start DHCP interface br/storage.

has been resolved with this fix - https://bugzilla.redhat.com/show_bug.cgi?id=1403795 which fixes the problem with escaping '-' in names like br-ex, br-storage etc.

See the linked upstream bug - https://bugs.launchpad.net/diskimage-builder/+bug/1649409 and BZ https://bugzilla.redhat.com/show_bug.cgi?id=1403795 (where it's noted that these log messages for bridges are cosmetic problems only).
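
To illustrate why an unescaped '-' produces "br/storage" in the logs, here is a minimal sketch of the unit-name escaping rules as described in systemd.unit(5). It is not the diskimage-builder patch itself; the function names are hypothetical and the unescape helper assumes ASCII-only names.

```python
import re


def unit_name_unescape(instance: str) -> str:
    """Approximate what %I expands to: '-' becomes '/', '\\xXX' becomes that character."""
    def repl(match):
        if match.group(0) == "-":
            return "/"
        return chr(int(match.group(1), 16))
    # Single pass, so a '-' decoded from \x2d is not turned into '/' afterwards.
    return re.sub(r"-|\\x([0-9a-fA-F]{2})", repl, instance)


def unit_name_escape(name: str) -> str:
    """Approximate systemd-escape: '/' becomes '-', other special bytes become '\\xXX'."""
    out = []
    for i, ch in enumerate(name):
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and (ch.isalnum() or ch in ":_")) or (ch == "." and i > 0):
            out.append(ch)
        else:
            out.append("".join(f"\\x{b:02x}" for b in ch.encode("utf-8")))
    return "".join(out)


if __name__ == "__main__":
    # Unescaped instance name: %I no longer matches the real interface.
    print(unit_name_unescape("br-storage"))
    # Escaped instance name round-trips back to the real interface name.
    escaped = unit_name_escape("br-storage")
    print(escaped, "->", unit_name_unescape(escaped))
```

With the instance name escaped, `%I` in `ExecStart=/sbin/ifup %I` would expand back to the real interface name, so ifup can find the matching ifcfg file.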

Again these log messages are not related to the cases that have been associated with this BZ.

We may want to backport fix https://bugzilla.redhat.com/show_bug.cgi?id=1403795 to OSP-10 if we want to fix these cosmetic issues.

Comment 33 Bob Fournier 2018-03-01 16:11:41 UTC
Adding Needinfo for requests in Comment 31.

Comment 34 Pablo Iranzo Gómez 2018-03-02 10:49:05 UTC
Moving needinfo to case owner

Comment 35 Bob Fournier 2018-03-07 22:34:39 UTC
I'm marking this as Triaged, as the title and initial problem have been fixed, and a backport of https://bugzilla.redhat.com/show_bug.cgi?id=1403795 is needed to get this fix into OSP-10 (or we need to determine if this fix is actually in rhos-10-patches, as there are no upstream branches for diskimage-builder).

For any issues related to case 01974327, please open a new bug.

Comment 37 Bob Fournier 2018-05-18 12:45:23 UTC
Eduard - yes, we can backport this fix to OSP-7.  However, we'd like to make sure that this fix (https://code.engineering.redhat.com/gerrit/#/c/136890/) will resolve the issue Telefonica is hitting.  The reason I ask is that this has been reported as just a cosmetic logging issue (see comments 7 and 15 above, for example).

So, assuming that this is the issue that Telefonica is hitting I will start the backport. I've created https://bugzilla.redhat.com/show_bug.cgi?id=1579831 to track the backport.

Comment 45 Alex McLeod 2018-09-03 08:01:13 UTC
Hi there,

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field.

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Thanks,
Alex

Comment 47 errata-xmlrpc 2018-09-17 16:59:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671

