1969212 – [FJ OCP4.8 Bug - PUBLIC VERSION]: Masters repeat reboot every few minutes during workers provisioning

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Jacob Anders
QA Contact:	Lubov
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1965168
Depends On:
Blocks:	1920358

Reported:	2021-06-08 01:51 UTC by Jacob Anders
Modified:	2021-07-27 23:12 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:11:53 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ironic-image pull 178	0	None	open	Bug 1969212: remove irmc from enabled_bios_interfaces	2021-06-08 02:22:58 UTC
Red Hat Bugzilla	1965168	1	None	None	None	2021-06-09 13:03:59 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:12:11 UTC
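
The linked pull request is titled "remove irmc from enabled_bios_interfaces"; the change itself is not shown in this report. As a rough illustration only, the sketch below drops one entry from a comma-separated `enabled_bios_interfaces` option in an ironic-style INI file. The sample file contents, the list of interfaces, and the helper name `drop_interface` are assumptions made for the example, not the contents of the actual PR.

```python
import configparser

# Hypothetical config fragment; the real interface list in the image may differ.
SAMPLE_CONF = """\
[DEFAULT]
enabled_bios_interfaces = idrac-wsman,no-bios,redfish,irmc,ilo
"""


def drop_interface(value: str, name: str) -> str:
    """Return a comma-separated list with every occurrence of `name` removed."""
    kept = []
    for item in value.split(","):
        item = item.strip()
        if item and item != name:
            kept.append(item)
    return ",".join(kept)


cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE_CONF)

# Rewrite the option without the irmc entry.
current = cfg["DEFAULT"]["enabled_bios_interfaces"]
cfg["DEFAULT"]["enabled_bios_interfaces"] = drop_interface(current, "irmc")

print(cfg["DEFAULT"]["enabled_bios_interfaces"])
```

With the hypothetical input above, the remaining entries keep their original order and only `irmc` is removed.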

Description Jacob Anders 2021-06-08 01:51:43 UTC

NOTE: this is a public version of linked BZ1965168. This BZ is created to meet valid-bug automation requirements for a downstream PR which will include an already merged upstream PR.

Description as per the original private BZ:

Version: 

$ openshift-install version

openshift-baremetal-install 4.8.0-0.nightly-2021-04-15-152737
built from commit d0462d8b5074448e1917da7f0a5d7a904bd60359
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:70fe4f1a828dcbe493dce6d199eb5d9e76300d053c477f0f4b4577ef7b7d2934

Platform:

baremetal

Please specify:

IPI 

What happened?

We are using Fujitsu iRMC servers to test OCP baremetal IPI deployment. The deployment failed because the masters rebooted repeatedly, every few minutes. This happened during worker node deployment. The master nodes were deployed successfully, the bootstrap VM was deleted, and the related services were moved onto the masters. Then, when the installer started to deploy the worker nodes, all the master nodes began rebooting repeatedly. As a result, the ironic service became inaccessible and the deployment eventually failed. More specifically, according to our observation, the master nodes reboot after the ironic-related pods are started; the time between the pods starting and the reboot is less than 1 minute.


```
E0526 14:41:26.719722 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:42:21.183589 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:43:15.455763 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:44:08.447631 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:44:49.983687 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
E0526 14:45:32.095693 1584040 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&resourceVersion=39250": dial tcp 192.168.30.201:6443: connect: no route to host
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.openshift.zz.local:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.30.201:6443: connect: no route to host 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
FATAL failed to initialize the cluster: Working towards 4.8.0-0.nightly-2021-04-15-152737: 655 of 677 done (96% complete) 
```  
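
The reflector errors above recur at a fairly regular pace. As a minimal sketch for quantifying that from an installer log, the snippet below extracts the klog timestamps of the "no route to host" errors and computes the gaps between them. The function name `failure_intervals`, the `year` parameter (klog headers carry no year; 2021 is assumed from the report date), and the abbreviated sample lines are assumptions for illustration.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds
KLOG_TS = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def failure_intervals(lines, year=2021):
    """Return the seconds between consecutive 'no route to host' klog errors."""
    stamps = []
    for line in lines:
        match = KLOG_TS.match(line)
        if match and "no route to host" in line:
            text = f"{year}{match.group(1)} {match.group(2)}"
            stamps.append(datetime.strptime(text, "%Y%m%d %H:%M:%S.%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# First three errors from the log above, with the message body shortened.
sample = [
    "E0526 14:41:26.719722 1584040 reflector.go:138] (shortened) connect: no route to host",
    "E0526 14:42:21.183589 1584040 reflector.go:138] (shortened) connect: no route to host",
    "E0526 14:43:15.455763 1584040 reflector.go:138] (shortened) connect: no route to host",
]

intervals = failure_intervals(sample)
print(intervals)
```

For these three entries the gaps come out at roughly 54 seconds each. Note that this measures how often the installer retried the API, not the reboot cycle of the masters itself.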


What did you expect to happen?

The master nodes should not reboot repeatedly, and the baremetal IPI deployment should complete successfully.

How to reproduce it (as minimally and precisely as possible)?

$ openshift-install --dir ~/clusterconfigs create manifests
$ cp ~/ipi/99_router-replicas.yaml ~/clusterconfigs/openshift/
$ openshift-install --dir ~/clusterconfigs --log-level debug create cluster


Anything else we need to know?

* We manually merged the related PRs during testing to work around a known [issue](https://github.com/openshift/installer/issues/4857).
* Because the patch for the [IPMI credentials](https://github.com/metal3-io/baremetal-operator/issues/879) issue has not been merged into OCP, we cannot use the latest nightly version for testing. We hope that [PR880](https://github.com/metal3-io/baremetal-operator/pull/880) can be merged into OpenShift as soon as possible so that the latest version can be used for testing.

Comment 2 Jacob Anders 2021-06-09 13:02:47 UTC

*** Bug 1965168 has been marked as a duplicate of this bug. ***

Comment 3 Lubov 2021-06-14 13:19:20 UTC

We don't have a Fujitsu iRMC setup, so closing as OtherQA.
The problem is not reproduced on HP or Dell setups.
If the problem is seen again on iRMC, please reopen.

Comment 6 errata-xmlrpc 2021-07-27 23:11:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
