Bug 1829028 - 4.2.30 install on GCP fails consistently with machine-api-operator in CrashLoopBackoff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Release
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.2.z
Assignee: Luke Meyer
QA Contact: Milind Yadav
URL:
Whiteboard:
Duplicates: 1827706
Depends On:
Blocks: 1827779
Reported: 2020-04-28 18:03 UTC by Mike Fiedler
Modified: 2020-05-13 11:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 11:07:19 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID               Private   Priority   Status   Summary   Last Updated
Red Hat Product Errata   RHBA-2020:2023   0         None       None     None      2020-05-13 11:07:40 UTC

Description Mike Fiedler 2020-04-28 18:03:18 UTC
Description of problem:

4.2.30 installs (both IPI and UPI) fail consistently on GCP, with the machine-api-controllers pod crash looping:


root@ip-172-31-64-58: ~/must-gather # oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-7bf49dc494-kc4gx   1/1     Running            0          49m
machine-api-controllers-9fcf7ff59-4fzl5        2/3     CrashLoopBackOff   15         56m
machine-api-operator-759c87494c-cqx5j          1/1     Running            0          57m
root@ip-172-31-64-58: ~/must-gather # oc get clusteroperators
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       49m
cloud-credential                           4.2.30    True        False         False      56m
cluster-autoscaler                         4.2.30    True        False         False      49m
console                                    4.2.30    Unknown     True          False      51m
dns                                        4.2.30    True        False         False      55m
image-registry                                       False       True          False      50m
ingress                                    unknown   False       True          True       50m
insights                                   4.2.30    True        False         False      56m
kube-apiserver                             4.2.30    True        False         False      54m
kube-controller-manager                    4.2.30    True        False         False      53m
kube-scheduler                             4.2.30    True        False         False      53m
machine-api                                          True        True          False      56m
machine-config                             4.2.30    True        False         False      55m
marketplace                                4.2.30    True        False         False      50m
monitoring                                           False       True          True       45m
network                                    4.2.30    True        False         False      55m
node-tuning                                4.2.30    True        False         False      52m
openshift-apiserver                        4.2.30    True        False         False      52m
openshift-controller-manager               4.2.30    True        False         False      54m
openshift-samples                          4.2.30    True        False         False      50m
operator-lifecycle-manager                 4.2.30    True        False         False      49m
operator-lifecycle-manager-catalog         4.2.30    True        False         False      52m
operator-lifecycle-manager-packageserver   4.2.30    True        False         False      49m
service-ca                                 4.2.30    True        False         False      56m
service-catalog-apiserver                  4.2.30    True        False         False      52m
service-catalog-controller-manager         4.2.30    True        False         False      52m
storage                                    4.2.30    True        False         False      50m


4.2.29 installs OK.


Version-Release number of selected component (if applicable):  quay.io/openshift-release-dev/ocp-release:4.2.30-x86_64


How reproducible: Always on GCP


Steps to Reproduce:
1. Run an IPI install with quay.io/openshift-release-dev/ocp-release:4.2.30-x86_64 on GCP


Actual results:

See above.

Additional info:

Will provide location for must-gather.

Comment 4 Michael Gugino 2020-04-28 21:47:06 UTC
The must-gather is missing logs for the machine-controller container, even though only one of the pod's three containers is in CrashLoopBackOff. We need to follow up and figure out why the must-gather tool is not collecting logs for all the containers. Ideally, we should also get the most recent logs from the failed container.

Comment 5 Michael Gugino 2020-04-28 22:36:55 UTC
I used the cluster bot to try to launch a cluster with that release.

Logs from machine-controller container

panic: semver: Parse(doozer-failure-5ed92c18-055634): No Major.Minor.Patch elements found

goroutine 1 [running]:
github.com/blang/semver.MustParse(0x19a9890, 0x1e, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/cluster-api-provider-gcp/vendor/github.com/blang/semver/semver.go:356 +0x1c1
github.com/openshift/cluster-api-provider-gcp/pkg/version.init.ializers()
	/go/src/github.com/openshift/cluster-api-provider-gcp/pkg/version/version.go:16 +0
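
The trace shows the panic being raised from a package initializer (version.go:16) that calls semver.MustParse, so the process dies before main runs and the container restarts in a loop. A minimal, stdlib-only sketch of that failure mode follows. parseVersion, mustParse, and panics are hypothetical stand-ins written for this illustration; they are not the blang/semver API, and the error text is only modelled on the panic above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a simplified, hypothetical stand-in for a semver parser.
// It accepts only "MAJOR.MINOR.PATCH" with non-negative integer parts and
// does not handle pre-release or build metadata.
func parseVersion(s string) ([3]uint64, error) {
	var v [3]uint64
	parts := strings.SplitN(s, ".", 3)
	if len(parts) != 3 {
		return v, fmt.Errorf("parse(%s): no Major.Minor.Patch elements found", s)
	}
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 10, 64)
		if err != nil {
			return v, fmt.Errorf("parse(%s): invalid element %q", s, p)
		}
		v[i] = n
	}
	return v, nil
}

// mustParse follows the "Must" convention: it panics instead of returning
// an error. Used in a package-level initializer, a bad input aborts the
// program at startup.
func mustParse(s string) [3]uint64 {
	v, err := parseVersion(s)
	if err != nil {
		panic(err)
	}
	return v
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println(panics(func() { mustParse("4.2.30") }))                         // false
	fmt.Println(panics(func() { mustParse("doozer-failure-5ed92c18-055634") })) // true
}
```

The second call uses the exact string from the panic message; it contains no dots, so the parser cannot find three version elements.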

Comment 6 Michael Gugino 2020-04-28 23:32:54 UTC
Here are some more detailed logs, in case they are needed: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/615/artifacts/launch/pods/

Comment 7 W. Trevor King 2020-04-29 00:06:56 UTC
> panic: semver: Parse(doozer-failure-5ed92c18-055634): No Major.Minor.Patch elements found

ART's tooling drops these tags if a build fails for whatever reason.  Recommended fix is to have the build tooling expect the OS_GIT_VERSION tag [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1827706#c8
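
This bug does not describe how ART's tooling actually implements the fix. As a general illustration only, a build step could refuse to proceed when the version string it is about to inject does not look like a version. In the sketch below, checkBuildVersion and the accepted format are assumptions made for this example, not ART or doozer code; a real guard would read the value from the build environment (for example OS_GIT_VERSION) and exit non-zero on error.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRE is an assumed shape for an acceptable build version:
// optional "v", MAJOR.MINOR.PATCH, then an optional "-" or "+" suffix.
var versionRE = regexp.MustCompile(`^v?\d+\.\d+\.\d+([-+][0-9A-Za-z.+-]+)?$`)

// checkBuildVersion is a hypothetical pre-build guard. It returns an error
// when v would not survive a Major.Minor.Patch parse at runtime.
func checkBuildVersion(v string) error {
	if !versionRE.MatchString(v) {
		return fmt.Errorf("refusing to build: %q is not a Major.Minor.Patch version", v)
	}
	return nil
}

func main() {
	// Example inputs; the second is the string from the panic in this bug.
	for _, v := range []string{"4.2.30", "doozer-failure-5ed92c18-055634"} {
		if err := checkBuildVersion(v); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("build version ok:", v)
	}
}
```

Failing at build time turns a runtime crash loop in every installed cluster into a single failed build.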

Comment 8 W. Trevor King 2020-04-29 00:07:56 UTC
Ah, and this bug is a dup.

*** This bug has been marked as a duplicate of bug 1827706 ***

Comment 9 Justin Pierce 2020-04-29 00:23:10 UTC
The image must be rebuilt with a change to ART code: https://bugzilla.redhat.com/show_bug.cgi?id=1827706#c11

Comment 10 Mike Fiedler 2020-04-29 00:31:04 UTC
Still broken in 4.2.30. We need to re-spin 4.2.30 with a fix for this.

Comment 11 Scott Dodson 2020-04-29 12:32:03 UTC
Since this bug actually has supporting details, I'm inverting the duplicate relationship.

Comment 12 Scott Dodson 2020-04-29 12:32:52 UTC
*** Bug 1827706 has been marked as a duplicate of this bug. ***

Comment 13 Scott Dodson 2020-04-29 12:35:33 UTC
Moving this over to the Release component to carry out what Justin prescribes in comment 9.

Comment 16 Milind Yadav 2020-05-06 04:09:27 UTC

Steps:
1. Ran a flexi IPI installation on GCP [registry.svc.ci.openshift.org/ocp/release:4.2.30]
2. Installation failed.

Errors:
oc logs -f machine-api-controllers-9fcf7ff59-q6w5d -c machine-controller 
panic: semver: Parse(doozer-failure-5ed92c18-055634): No Major.Minor.Patch elements found

goroutine 1 [running]:
github.com/blang/semver.MustParse(0x19a9890, 0x1e, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/cluster-api-provider-gcp/vendor/github.com/blang/semver/semver.go:356 +0x1c1
github.com/openshift/cluster-api-provider-gcp/pkg/version.init.ializers()
	/go/src/github.com/openshift/cluster-api-provider-gcp/pkg/version/version.go:16 +0x78
[miyadav@miyadav bugzhsun1828704]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          79m     Unable to apply 4.2.30: some cluster operators have not yet rolled out

This is the same as what was reported, so I am moving the bug back to the ASSIGNED state.

Comment 17 W. Trevor King 2020-05-06 04:36:03 UTC
> 1.Ran flexi IPI installation on GCP [registry.svc.ci.openshift.org/ocp/release:4.2.30]

4.2.30 is the impacted release [1].  Use a more recent nightly to verify.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/203#issuecomment-621218274

Comment 18 Milind Yadav 2020-05-06 07:02:10 UTC
Thanks for the comment, Trevor. I used the latest 4.2 nightly, 'registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2020-05-04-072150', and the installation was successful.

Comment 20 errata-xmlrpc 2020-05-13 11:07:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2023

