Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1943258

Summary: [Assisted-4.7][Staging][Advanced Networking] Cluster install fails while waiting for control plane
Product: OpenShift Container Platform
Reporter: Trey West <trwest>
Component: assisted-installer
Assignee: vemporop
assisted-installer sub component: assisted-service
QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA
Docs Contact:
Severity: unspecified
Priority: high
CC: alazar, aos-bugs, itsoiref, yobshans
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Core
Fixed In Version: OCP-Metal-v1.0.20.1
Doc Type: Bug Fix
Doc Text:
Cause: In rare cases, an arbitrary service may take the .10 IP address in the service network, preventing the DNS service from starting properly because it requires this address.
Consequence: Many subsystems malfunction because the DNS service fails to bind to the .10 address in the service network. The error message "failed to create service for dns default: failed to create dns service: Service \"dns-default\" is invalid: spec.clusterIPs: Invalid value: []string{\"172.30.0.10\"}: failed to allocated ip:172.30.0.10 with error:provided IP is already allocated" appears in the OpenShift logs.
Fix: This issue should be fixed at the OpenShift level. As a workaround, Assisted Installer currently deletes and recreates any service that takes the .10 address, freeing the address for the DNS service.
Result: The cluster starts up properly. The DNS service can take the required IP address even if another service previously took it. That service is recreated and receives a free IP address in the service network.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:55:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Trey West 2021-03-25 16:52:17 UTC
Description of problem:

When creating a cluster with advanced networking options, the installation times out while waiting for the controller pod.


Version-Release number of selected component (if applicable):

v1.0.17.3

Steps to Reproduce:
1. Install a 5-node cluster with the following network settings:
Network CIDR: 10.128.0.0/20
Service CIDR: 172.30.0.0/24
Prefix: 24
2. Wait for cluster installation.

Actual results:

Cluster installation times out while waiting for the control plane.

Expected results:

Cluster installs successfully

Additional info:

Last cluster event 
"message": "Host test-infra-cluster-6d4de623-master-0: reached installation stage Waiting for control plane: waiting for controller pod",

Comment 2 Ronnie Lazar 2021-04-25 13:10:18 UTC
@itsoiref Is this similar to some of the other bugs you handled?

Comment 5 Igal Tsoiref 2021-04-27 10:04:21 UTC
This is the issue where the DNS pod does not start because its IP address is already taken.

From the DNS operator log:
2021-03-25T05:54:54.973779004Z time="2021-03-25T05:54:54Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to create service for dns default: failed to create dns service: Service \"dns-default\" is invalid: spec.clusterIPs: Invalid value: []string{\"172.30.0.10\"}: failed to allocated ip:172.30.0.10 with error:provided IP is already allocated"
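
For reference, the address in the error above is the tenth address of the service network from the reproduction steps. Below is a minimal Python sketch of that arithmetic, assuming the DNS service always asks for network address + 10. The function name dns_cluster_ip is hypothetical and not part of any OpenShift or Assisted Installer tooling.

```python
import ipaddress


def dns_cluster_ip(service_cidr: str) -> str:
    """Return the 10th address of the service network (network address + 10)."""
    network = ipaddress.ip_network(service_cidr)
    if network.num_addresses <= 10:
        raise ValueError(f"{service_cidr} is too small to contain a .10 address")
    # Indexing a network object yields the address at that offset.
    return str(network[10])


# Service CIDR taken from the reproduction steps in this bug.
print(dns_cluster_ip("172.30.0.0/24"))  # prints 172.30.0.10
```

Any service that already holds this address blocks the creation of dns-default.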

Comment 7 Igal Tsoiref 2021-05-02 09:22:25 UTC
*** Bug 1918531 has been marked as a duplicate of this bug. ***

Comment 8 Trey West 2021-05-04 12:57:46 UTC
@vemporop is there a fixed in version for this?

Comment 9 vemporop 2021-05-04 13:13:18 UTC
The fix is in the controller. The image quay.io/ocpmetal/assisted-installer-controller:latest already includes it, if you can test with a custom controller image. There is no release yet.
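
The workaround described in the Doc Text (delete and recreate any service that takes the .10 address) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the controller's actual code: services are modeled as a plain name-to-IP mapping, the function free_dns_ip is hypothetical, and the recreated service gets the first free address, whereas a real cluster picks the new address itself.

```python
import ipaddress
from typing import Dict


def free_dns_ip(services: Dict[str, str], service_cidr: str,
                dns_name: str = "dns-default") -> Dict[str, str]:
    """Return a copy of `services` in which no service other than
    `dns_name` holds the .10 address of the service network."""
    network = ipaddress.ip_network(service_cidr)
    dns_ip = str(network[10])
    result = dict(services)
    for name, ip in services.items():
        if ip == dns_ip and name != dns_name:
            # "Delete" the conflicting service ...
            del result[name]
            # ... and "recreate" it on a free address, keeping .10 reserved.
            used = set(result.values()) | {dns_ip}
            new_ip = next(
                (str(a) for a in network.hosts() if str(a) not in used), None
            )
            if new_ip is None:
                raise RuntimeError("no free address left in the service network")
            result[name] = new_ip
    return result


# Hypothetical cluster state: "some-service" has taken the address DNS needs.
before = {"kubernetes": "172.30.0.1", "some-service": "172.30.0.10"}
after = free_dns_ip(before, "172.30.0.0/24")
print(after)  # "some-service" no longer holds 172.30.0.10
```

After this step the .10 address is unallocated, so the DNS operator's next reconcile can create dns-default.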

Comment 10 Trey West 2021-06-01 19:37:18 UTC
Verified on v1.0.21.1

Comment 13 errata-xmlrpc 2021-07-27 22:55:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438