Bug 1494132 - dhcp-all-interfaces.sh fails due to delayed link detection during introspection
Summary: dhcp-all-interfaces.sh fails due to delayed link detection during introspection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: diskimage-builder
Version: 10.0 (Newton)
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: z7
Target Release: 10.0 (Newton)
Assignee: Bob Fournier
QA Contact: mlammon
URL:
Whiteboard:
Duplicates: 1320034
Depends On:
Blocks: 1292691
Reported: 2017-09-21 14:07 UTC by Jaison Raju
Modified: 2020-12-14 10:10 UTC (History)

Fixed In Version: diskimage-builder-1.26.1-2.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: The algorithm that checks interface state on baremetal nodes does not have a proper retry mechanism.
Consequence: Under certain conditions, when the link is going up and down, the interfaces on baremetal nodes do not come up correctly and fail to get an IP address from DHCP. The following error can be seen in the logs: 'Invalid Argument'.
Fix: The retry mechanism was changed to ensure that interfaces are brought up correctly.
Result: Interfaces come up and get an IP address assigned via DHCP.
Clone Of:
Environment:
Last Closed: 2018-02-27 16:43:33 UTC
Target Upstream Version:
Embargoed:


Attachments
ipxe initialising devices (382.77 KB, image/png)
2017-09-21 19:15 UTC, Andreas Karis
screenshots (3.07 MB, application/x-xz)
2017-09-21 22:42 UTC, Andreas Karis


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1654046 0 None None None 2017-09-21 14:16:02 UTC
Red Hat Product Errata RHBA-2018:0365 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory 2018-02-27 21:42:55 UTC

Description Jaison Raju 2017-09-21 14:07:02 UTC
Description of problem:
Introspection fails because dhcp-all-interfaces.sh fails to bring up the interface.
The link comes up after this failure, but DHCP is never attempted on the interface again once the first dhcp-all-interfaces.sh run has failed.
We observed that NetworkManager does not help in obtaining a DHCP IP address either.

Version-Release number of selected component (if applicable):
RHOS10
ipa images 10.0-20170228.1.el7ost

How reproducible:
Always on customer env.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Jaison Raju 2017-09-21 14:17:18 UTC
The following patch fixed the issue, but I had to increase the retries to 35, as I noticed that the link usually took 25-30 seconds to come up on ens255f0.

https://review.openstack.org/#/c/419527/1/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh
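
For illustration, the retry approach discussed in this comment can be sketched as below. This is a minimal sketch under stated assumptions, not the shipped dhcp-all-interfaces.sh: the function names get_if_link and wait_for_link, the SYS_NET override, and the defaults of 20 retries at one-second intervals are hypothetical choices made for the example. Reading the carrier file fails with "Invalid argument" while an interface is administratively down, so the sketch treats any read failure as "no link" and re-asserts the link on every pass.

```shell
# Hypothetical sketch; names and defaults are illustrative only.
# SYS_NET can be overridden so the loop can be exercised without real NICs.
SYS_NET=${SYS_NET:-/sys/class/net}

# Print 1 if the interface reports carrier, otherwise 0.
# A failed read (for example EINVAL on a downed interface) counts as 0.
get_if_link() {
    cat "$SYS_NET/$1/carrier" 2>/dev/null || echo 0
}

# Poll for link. Usage: wait_for_link IFACE [RETRIES] [DELAY_SECONDS]
# Returns 0 as soon as carrier is seen, 1 if every retry is used up.
wait_for_link() {
    local iface="$1"
    local retries="${2:-20}"
    local delay="${3:-1}"
    local tries=0
    while [ "$tries" -lt "$retries" ]; do
        # Set the link up on each pass; errors are ignored on purpose.
        ip link set dev "$iface" up >/dev/null 2>&1 || true
        if [ "$(get_if_link "$iface")" = "1" ]; then
            return 0
        fi
        sleep "$delay"
        tries=$((tries + 1))
    done
    return 1
}
```

With these assumed defaults, a link that needs 25-30 seconds to appear would still be missed; calling `wait_for_link ens255f0 35` corresponds to the 35 retries mentioned above.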

Comment 4 Bob Fournier 2017-09-21 17:03:11 UTC
@Dmitry - yes, a backport is in progress - https://code.engineering.redhat.com/gerrit/#/c/118646/

I'm working with akaris to get more clarification on the long time (25-30 seconds) it takes for the link to be detected. Handling that would require another patch to the same code to increase the loop counter.

Comment 5 Andreas Karis 2017-09-21 19:15:00 UTC
Created attachment 1329142 [details]
ipxe initialising devices

Comment 6 Andreas Karis 2017-09-21 19:15:50 UTC
got there with:

nova boot --flavor baremetal --nic net-id=<uuid> --image overcloud-full test

Comment 9 Andreas Karis 2017-09-21 22:42:39 UTC
Created attachment 1329238 [details]
screenshots

Comment 10 Andreas Karis 2017-09-22 15:22:14 UTC
Hi,

The customer requested that https://review.openstack.org/#/c/419527/1/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh, plus an increased number of retries, be included in their images and shipped as a fix. In their words: "but 20 retries didn't help. I noticed link up takes 25-35 sec."

I don't know how realistic that is.

Comment 12 Bob Fournier 2017-09-22 17:00:11 UTC
Thanks Andreas.

The backport for the carrier check is in progress - https://code.engineering.redhat.com/gerrit/#/c/118646/

We'd prefer not to make the second change to increase the timeout, as this can have an effect on all deployments, especially if there are servers with unconnected NICs.
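
As a rough illustration of that concern: an interface with no cable never reports a link, so it uses up every retry before it is skipped. The sketch below computes an upper bound on the time spent waiting. The helper name worst_case_wait is hypothetical, and the one-second delay per retry and the one-after-another polling of interfaces are assumptions; this report does not say whether the per-interface waits add up or overlap.

```shell
# Hypothetical helper: upper bound, in seconds, on time spent polling
# interfaces that never report a link, assuming they are polled in turn.
# Usage: worst_case_wait RETRIES DELAY_SECONDS LINKLESS_NICS
worst_case_wait() {
    echo $(( $1 * $2 * $3 ))
}

worst_case_wait 20 1 4   # 20 retries, four unconnected NICs
worst_case_wait 35 1 4   # 35 retries, same host
```

Under those assumptions, a host with four unconnected NICs would wait up to 80 seconds at 20 retries and up to 140 seconds at 35.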

Comment 13 Bob Fournier 2017-10-20 11:54:05 UTC
*** Bug 1320034 has been marked as a duplicate of this bug. ***

Comment 14 Bob Fournier 2017-11-16 21:37:57 UTC
Moving this to POST as the fix has merged. Per discussion at the GSS weekly meeting, the change to increase the timeout for NICs will not be made.

Comment 26 Bob Fournier 2018-02-27 16:38:32 UTC
Pablo/Jaison - I don't think you need a hotfix to test out this change, as it's available here - https://errata.devel.redhat.com/advisory/32371/builds as part of the OSP-10z7 build.

The pkg is diskimage-builder-1.26.1-2.el7ost.

Comment 27 errata-xmlrpc 2018-02-27 16:43:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0365

