Bug 2098424

Summary: Metal Day 1 4.11 - deployments with bond fail - workers stuck in provisioning
Product: OpenShift Container Platform Reporter: Yoav Porag <yporagpa>
Component: Installer Assignee: aos-install
Installer sub component: openshift-installer QA Contact: Gaoyun Pei <gpei>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: unspecified CC: augol, awolff, yporagpa
Version: 4.11   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-06-22 05:29:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yoav Porag 2022-06-19 06:37:38 UTC
Version:
4.11.0-0.nightly-2022-06-15-222801

Platform:

IPI on virtual BM

What happened?

When testing 4.11, a deployment with a bond fails to complete. Multiple cluster operators do not become available and the workers are not provisioned. The same configuration works on 4.10. The failure is consistent across multiple bond modes, including 802.3ad and active-backup.

What did you expect to happen?

The deployment should have succeeded.

Anything else we need to know?

Must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/bond_failure_must_gather_160622.tar.gz 

A networkConfig segment was added to the install config under every node.
The no-DHCP workaround was applied at pre-deployment according to https://docs.google.com/document/d/1AiviH6t24tOs9vQELLvojpc6_5eff8OKUNY3ApsH8po/edit#
        networkConfig:
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.123.1
              next-hop-interface: bond0
          dns-resolver:
            config:
              server:
              - 192.168.123.1
          interfaces:
          - name: bond0
            type: bond
            state: up
            ipv4:
              address:
              - ip: 192.168.123.150
                prefix-length: 24
              enabled: true
              dhcp: false
            link-aggregation:
              mode: 802.3ad
              options:
                miimon: '100'
              port:
              - enp0s4
              - enp0s5



[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.nightly-2022-06-15-222801   False       False         True       22h     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found...
baremetal                                  4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
cloud-controller-manager                   4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
cloud-credential                           4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
cluster-autoscaler                         4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
config-operator                            4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
console                                    4.11.0-0.nightly-2022-06-15-222801   False       False         True       22h     RouteHealthAvailable: console route is not admitted
csi-snapshot-controller                    4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
dns                                        4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
etcd                                       4.11.0-0.nightly-2022-06-15-222801   True        False         True       22h     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-16 13:37:28 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-15-222801 registry.ci.openshift.org/ocp/release@sha256:bceac2ed723ce186c56b1db5e7b17cf0ef0a62e6bbfba5d545d419c3018498b2 false }]
image-registry                             4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
ingress                                                                         False       True          True       22h     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.11.0-0.nightly-2022-06-15-222801   True        False         False      5s      
kube-apiserver                             4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
kube-controller-manager                    4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
kube-scheduler                             4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
kube-storage-version-migrator              4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
machine-api                                4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
machine-approver                           4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
machine-config                             4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
marketplace                                4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
monitoring                                                                      False       True          True       21h     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.11.0-0.nightly-2022-06-15-222801   True        True          False      22h     Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
node-tuning                                4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
openshift-apiserver                        4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
openshift-controller-manager               4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
openshift-samples                          4.11.0-0.nightly-2022-06-15-222801   True        False         False      21h     
operator-lifecycle-manager                 4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
service-ca                                 4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
storage                                    4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h     
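
For triage, the unavailable operators can be pulled out of the table above with a short script. This is only a sketch: the function name `unavailable_operators` and the abridged `SAMPLE` text are illustrative assumptions, not part of the original report, and the parsing assumes the default column layout of `oc get co` shown above.

```python
# Hypothetical helper (not from the report): list cluster operators whose
# AVAILABLE column is not "True" in plain `oc get co` output.

# Abridged copy of a few rows from the output above; messages are truncated.
SAMPLE = """\
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.11.0-0.nightly-2022-06-15-222801   False       False         True       22h     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache
baremetal        4.11.0-0.nightly-2022-06-15-222801   True        False         False      22h
ingress                                               False       True          True       22h     The "default" ingress controller reports Available=False
monitoring                                            False       True          True       21h     Rollout of the monitoring stack failed and is degraded.
"""


def unavailable_operators(table: str) -> list[str]:
    """Return operator names whose AVAILABLE value is not 'True'."""
    flags = {"True", "False", "Unknown"}
    names = []
    for line in table.splitlines():
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] == "NAME":
            continue  # skip blank lines, short lines, and the header row
        # VERSION can be blank (as for ingress and monitoring), so take the
        # first status-like token after the name as the AVAILABLE value.
        available = next((t for t in tokens[1:] if t in flags), None)
        if available is not None and available != "True":
            names.append(tokens[0])
    return names


if __name__ == "__main__":
    for name in unavailable_operators(SAMPLE):
        print(name)
```

Taking the first status-like token instead of a fixed column index is what lets the rows with an empty VERSION column parse correctly.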

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-b8c7d-master-0         true             22h
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-b8c7d-master-1         true             22h
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-b8c7d-master-2         true             22h
openshift-machine-api   openshift-worker-0-0   provisioning             ocp-edge-cluster-0-b8c7d-worker-0-tq56h   true             22h
openshift-machine-api   openshift-worker-0-1   provisioning             ocp-edge-cluster-0-b8c7d-worker-0-47v92   true             22h
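
The hosts stuck in provisioning can be picked out of the listing above in the same way. Again this is a sketch under stated assumptions: `hosts_in_state` is a hypothetical helper name, and the parsing relies on columns being separated by two or more spaces, as in the pasted output.

```python
# Hypothetical helper (not from the report): find BareMetalHosts in a given
# STATE from plain `oc get bmh -A` output.
import re

# Rows copied from the output above.
SAMPLE = """\
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-b8c7d-master-0         true             22h
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-b8c7d-master-1         true             22h
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-b8c7d-master-2         true             22h
openshift-machine-api   openshift-worker-0-0   provisioning             ocp-edge-cluster-0-b8c7d-worker-0-tq56h   true             22h
openshift-machine-api   openshift-worker-0-1   provisioning             ocp-edge-cluster-0-b8c7d-worker-0-47v92   true             22h
"""


def hosts_in_state(table: str, state: str = "provisioning") -> list[str]:
    """Return host names whose STATE column equals `state` exactly."""
    names = []
    for line in table.splitlines():
        # Split on runs of 2+ spaces so "externally provisioned" stays one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "NAMESPACE":
            continue  # skip blank lines and the header row
        if fields[2] == state:
            names.append(fields[1])
    return names


if __name__ == "__main__":
    print(hosts_in_state(SAMPLE))
```

The state is compared exactly, not by substring, so the masters in "externally provisioned" are not reported as stuck.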

Comment 1 Yoav Porag 2022-06-22 05:29:45 UTC
Works now, probably as a result of the fix made for https://bugzilla.redhat.com/show_bug.cgi?id=2098430.
Closing the bug.

Comment 2 Yoav Porag 2022-06-22 06:35:54 UTC

*** This bug has been marked as a duplicate of bug 2092650 ***