Bug 1969212

Summary: [FJ OCP4.8 Bug - PUBLIC VERSION]: Masters repeat reboot every few minutes during workers provisioning
Product: OpenShift Container Platform Reporter: Jacob Anders <janders>
Component: Bare Metal Hardware Provisioning    Assignee: Jacob Anders <janders>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dtantsur, hyasuhar, jniu, rbartal, rpittau, song.shukun, tsedovic
Version: 4.8    Keywords: OtherQA, Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:11:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1920358    

Description Jacob Anders 2021-06-08 01:51:43 UTC
NOTE: this is a public version of the linked BZ1965168. This BZ was created to meet the valid-bug automation requirements for a downstream PR that includes an already merged upstream PR.

Description as per the original private BZ:

Version: 

$ openshift-install version

openshift-baremetal-install 4.8.0-0.nightly-2021-04-15-152737
built from commit d0462d8b5074448e1917da7f0a5d7a904bd60359
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:70fe4f1a828dcbe493dce6d199eb5d9e76300d053c477f0f4b4577ef7b7d2934

Platform:

baremetal

Please specify:

IPI 

What happened?

We are using a Fujitsu iRMC server to test OCP baremetal IPI deployment. The deployment failed because the master nodes rebooted repeatedly, every few minutes, during worker node provisioning. The master nodes were deployed successfully, the bootstrap VM was deleted, and its related services were moved onto the masters. Then, when the installer started to deploy the worker nodes, all of the master nodes began rebooting repeatedly. This made the ironic service unreachable and the deployment ultimately failed. More specifically, according to our observation the master nodes reboot shortly after the ironic-related pods are started; the time span between the two events is less than one minute.


```
E0526 14:41:26.719722 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:42:21.183589 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:43:15.455763 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:44:08.447631 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:44:49.983687 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:45:32.095693 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.30.201:6443: connect: no route to host 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
FATAL failed to initialize the cluster: Working towards 4.8.0-0.nightly-2021-04-15-152737: 655 of 677 done (96% complete) 
```  
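
For context, a hypothetical sketch of commands that could be used to observe the timing correlation described above (masters rebooting shortly after the ironic-related pods start). This is not part of the original report; the master hostname and cluster domain are assumptions based on the log output above.

```
# Hypothetical diagnostic sketch (not from the original report):
# watch the metal3/ironic pods come up in the machine-api namespace...
$ oc -n openshift-machine-api get pods -o wide -w | grep metal3

# ...and, on a master (hostname assumed), check its reboot history
$ ssh core@master-0.openshift.zz.local 'last reboot | head'
$ ssh core@master-0.openshift.zz.local 'journalctl --list-boots | tail'
```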


What did you expect to happen?

The master nodes do not reboot repeatedly and the baremetal IPI deployment completes successfully.

How to reproduce it (as minimally and precisely as possible)?

```
$ openshift-install --dir ~/clusterconfigs create manifests
$ cp ~/ipi/99_router-replicas.yaml ~/clusterconfigs/openshift/
$ openshift-install --dir ~/clusterconfigs --log-level debug create cluster
```
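
The 99_router-replicas.yaml manifest copied above is not included in this report. For reference, a minimal sketch of what such a manifest typically contains, assuming it sets the default IngressController replica count (a common extra manifest on baremetal IPI clusters with few initial workers); the real file may differ.

```
# Hypothetical contents, written as a shell heredoc; the actual manifest
# used in this deployment was not attached to the report.
$ cat > ~/ipi/99_router-replicas.yaml <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
EOF
```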


Anything else we need to know?

* We manually merged the related PRs during testing, working around the known [issue](https://github.com/openshift/installer/issues/4857).
* Because the patch related to [IPMI credentials](https://github.com/metal3-io/baremetal-operator/issues/879) is not yet merged into OCP, we cannot use the latest nightly version for testing. We hope that [PR880](https://github.com/metal3-io/baremetal-operator/pull/880) can be merged into OpenShift as soon as possible so that the latest version can be used for testing.

Comment 2 Jacob Anders 2021-06-09 13:02:47 UTC
*** Bug 1965168 has been marked as a duplicate of this bug. ***

Comment 3 Lubov 2021-06-14 13:19:20 UTC
We don't have a Fujitsu iRMC setup, so closing as OtherQA.
The problem is not reproduced on HP or Dell setups.
If the problem is seen again on iRMC, please reopen.

Comment 6 errata-xmlrpc 2021-07-27 23:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438