
Bug 1494132

Summary: dhcp-all-interfaces.sh fails due to delayed link detection during introspection
Product: Red Hat OpenStack
Reporter: Jaison Raju <jraju>
Component: diskimage-builder
Assignee: Bob Fournier <bfournie>
Status: CLOSED ERRATA
QA Contact: mlammon
Severity: urgent
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: akaris, bfournie, dbecker, jraju, mburns, pablo.iranzo, slinaber
Target Milestone: z7
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: diskimage-builder-1.26.1-2.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: The algorithm that checks interface state on baremetal nodes does not have a proper retry mechanism.
Consequence: Under certain conditions, when the link goes up and down, the interfaces on baremetal nodes do not come up correctly and fail to get an IP address from DHCP. The following error appears in the logs: 'Invalid Argument'.
Fix: The retry mechanism was changed to ensure that interfaces are brought up correctly.
Result: Interfaces come up and are assigned an IP address via DHCP.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-27 16:43:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1292691
Attachments:
Description                Flags
ipxe initialising devices  none
screenshots                none

Description Jaison Raju 2017-09-21 14:07:02 UTC
Description of problem:
Introspection fails because dhcp-all-interfaces.sh fails to bring up the interface.
After this failure, the link comes up, but DHCP is never retried on the interface once the first dhcp-all-interfaces.sh run has failed.
We observed that NetworkManager does not help in obtaining a DHCP address either.

Version-Release number of selected component (if applicable):
RHOS10
ipa images 10.0-20170228.1.el7ost

How reproducible:
Always on customer env.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Jaison Raju 2017-09-21 14:17:18 UTC
The following patch fixed the issue, but I had to increase the retries to 35, as I noticed the link usually took 25-30 seconds to come up on ens255f0:

https://review.openstack.org/#/c/419527/1/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh
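
For illustration, the carrier-wait idea in this comment can be sketched roughly as follows. This is a minimal sketch, not the code from the review linked above: the function name wait_for_carrier and the variables SYS_NET_DIR, MAX_TRIES and SLEEP_SECS are hypothetical, and a one-second poll interval is assumed so that 35 tries covers the 25-30 second link-up time reported here.

```shell
# Sketch only: poll an interface's carrier flag before attempting DHCP.
# SYS_NET_DIR, MAX_TRIES and SLEEP_SECS are illustrative names, not taken
# from dhcp-all-interfaces.sh.
SYS_NET_DIR="${SYS_NET_DIR:-/sys/class/net}"
MAX_TRIES="${MAX_TRIES:-35}"     # assumed: 35 tries at one second each
SLEEP_SECS="${SLEEP_SECS:-1}"    # assumed poll interval

wait_for_carrier() {
    local iface="$1"
    local carrier_file="${SYS_NET_DIR}/${iface}/carrier"
    local try=1
    local state

    while [ "$try" -le "$MAX_TRIES" ]; do
        # The read itself can fail (for example with "Invalid argument"
        # while the interface is down), so a failed read counts as no link.
        state="$(cat "$carrier_file" 2>/dev/null)" || state=""
        if [ "$state" = "1" ]; then
            return 0             # link detected
        fi
        sleep "$SLEEP_SECS"
        try=$((try + 1))
    done
    return 1                     # gave up: no carrier within MAX_TRIES
}
```

A caller would start DHCP on the interface only when wait_for_carrier returns 0, and skip the interface otherwise.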

Comment 4 Bob Fournier 2017-09-21 17:03:11 UTC
@Dmitry - yes, a backport is in progress - https://code.engineering.redhat.com/gerrit/#/c/118646/

I'm working with akaris to get more clarification on why the link takes so long (25-30 seconds) to be detected. Accommodating that would require another patch to the same code to increase the loop counter.

Comment 5 Andreas Karis 2017-09-21 19:15:00 UTC
Created attachment 1329142 [details]
ipxe initialising devices

Comment 6 Andreas Karis 2017-09-21 19:15:50 UTC
got there with:

nova boot --flavor baremetal --nic net-id=<uuid> --image overcloud-full test

Comment 9 Andreas Karis 2017-09-21 22:42:39 UTC
Created attachment 1329238 [details]
screenshots

Comment 10 Andreas Karis 2017-09-22 15:22:14 UTC
Hi,

The customer requested that the patch at https://review.openstack.org/#/c/419527/1/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh plus an increased number of retries be included in their images and shipped as a fix. In their words: "but 20 retries didnt help . i noticed link up takes 25-35 sec."

I don't know how realistic that is.

Comment 12 Bob Fournier 2017-09-22 17:00:11 UTC
Thanks Andreas.

The backport for the carrier check is in progress - https://code.engineering.redhat.com/gerrit/#/c/118646/

We'd prefer not to make the second change to increase the timeout, as this can have an effect on all deployments, especially if there are servers with unconnected NICs.
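
A rough way to see the concern, as a sketch under stated assumptions: a one-second poll interval, and interfaces without a link being waited on one after another (neither is confirmed in this bug; if the waits run in parallel, the cost is closer to a single timeout). The helper name worst_case_wait and the NIC count of 4 are hypothetical.

```shell
# Worst-case extra wait, in seconds, spent on NICs that never get a link.
# Assumes interfaces are polled serially; all values are illustrative.
worst_case_wait() {
    local retries="$1" sleep_secs="$2" unconnected_nics="$3"
    echo $((retries * sleep_secs * unconnected_nics))
}

worst_case_wait 20 1 4   # prints 80
worst_case_wait 35 1 4   # prints 140
```

Under these assumptions, every deployment with unconnected NICs would pay the higher timeout, not only the environment with slow link detection.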

Comment 13 Bob Fournier 2017-10-20 11:54:05 UTC
*** Bug 1320034 has been marked as a duplicate of this bug. ***

Comment 14 Bob Fournier 2017-11-16 21:37:57 UTC
Moving this to POST as the fix has merged. Per discussion at the GSS weekly meeting, the change to increase the timeout for NICs will not be made.

Comment 26 Bob Fournier 2018-02-27 16:38:32 UTC
Pablo/Jaison - I don't think you need a hotfix to test this change, as it's available here - https://errata.devel.redhat.com/advisory/32371/builds as part of the OSP-10z7 build.

The package is diskimage-builder-1.26.1-2.el7ost.

Comment 27 errata-xmlrpc 2018-02-27 16:43:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0365