Bug 1767152 - OCP 4.1 configuration status for pool master is empty, retrying: timed out waiting for the condition
Summary: OCP 4.1 configuration status for pool master is empty, retrying: timed out waiting for the condition
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Micah Abbott
QA Contact: Michael Nguyen
 
Reported: 2019-10-30 19:27 UTC by Chance Zibolski
Modified: 2019-11-07 18:58 UTC
CC: 6 users

Last Closed: 2019-11-07 18:58:45 UTC




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1726370 0 medium CLOSED Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.nightly-2019-07-01-102521 2021-02-22 00:41:40 UTC

Description Chance Zibolski 2019-10-30 19:27:23 UTC
Description of problem: 

Seeing the following error in release-openshift-ocp-installer-e2e-aws-4.1 tests:

level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-0.nightly-2019-10-30-153934 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty, retrying: timed out waiting for the condition"


Failing CI run is: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/477#1:build-log.txt%3A48


This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1726370

Version-Release number of selected component (if applicable): 4.1.z


How reproducible: Not very, it's occurred once in the last 24 hours in e2e as indicated in https://ci-search-ci-search-next.svc.ci.openshift.org/?search=configuration+status+for+pool+master+is+empty&maxAge=336h&context=2&type=all. There are similar failures for 4.2 AWS proxy e2e though.



Comment 1 Kirsten Garrison 2019-11-06 18:01:20 UTC
Seeing pivot errors; was this a bad image?
```
I1030 17:00:28.755509   66671 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592
error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: Error reading manifest sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown
W1030 17:00:29.132239   66671 run.go:40] podman failed: exit status 125; retrying...
I1030 17:01:49.132470   66671 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592
E1030 17:01:49.479058    2900 writer.go:132] Marking Degraded due to: during bootstrap: failed to run pivot: failed to start pivot.service: exit status 1
I1030 17:01:49.525280    2900 update.go:737] logger doesn't support --jounald, grepping the journal
I1030 17:01:49.565653    2900 update.go:848] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I1030 17:01:49.568287    2900 daemon.go:667] In bootstrap mode
I1030 17:01:49.568309    2900 daemon.go:695] Current+desired config: rendered-worker-d27b6c7c52762f83fea9ce3683379f5d
I1030 17:01:49.573460    2900 daemon.go:865] Bootstrap pivot required to: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592
I1030 17:01:49.573529    2900 update.go:715] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592
error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: Error reading manifest sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown
error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: Error reading manifest sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown
W1030 17:01:49.475241   66671 run.go:40] podman failed: exit status 125; retrying...
F1030 17:01:49.475272   66671 run.go:48] podman: timed out waiting for the condition
pivot.service: Main process exited, code=exited, status=255/n/a
pivot.service: Failed with result 'exit-code'.
Failed to start Pivot Tool.
pivot.service: Consumed 940ms CPU time
```
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/477/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-daemon-244df_machine-config-daemon.log
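For triaging similar reports, the telltale lines can be pulled out of a saved copy of that machine-config-daemon log. A minimal sketch (the local filename is hypothetical, and the heredoc just stands in for a downloaded copy of the log linked above):

```shell
# Sketch: extract the markers that identify this failure mode from a
# saved machine-config-daemon pod log. LOG is a hypothetical local copy.
LOG=machine-config-daemon.log
cat > "$LOG" <<'EOF'
W1030 17:01:49.475241   66671 run.go:40] podman failed: exit status 125; retrying...
E1030 17:01:49.479058    2900 writer.go:132] Marking Degraded due to: during bootstrap: failed to run pivot: failed to start pivot.service: exit status 1
EOF
# Surface podman pull retries, unknown-manifest errors, and the Degraded marker.
grep -E 'podman failed|manifest unknown|Marking Degraded' "$LOG"
```

Seeing "manifest unknown" alongside "Marking Degraded" during bootstrap is what points at a missing release payload rather than a genuine MCO bug.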

Comment 2 Micah Abbott 2019-11-07 15:15:37 UTC
```
I1030 17:00:28.755509   66671 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592
error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: unable to pull image: Error determining manifest MIME type for docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592: Error reading manifest sha256:0ed6c90b9738103a56fe9c06794e6cfab0ef1258b415286aca57aaedd454d592 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown
```

This looks like the release payload was GC'ed on Quay.

Does this reproduce reliably with other 4.1 nightly payloads?

Recent runs of the same job look green - https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1

Comment 3 Chance Zibolski 2019-11-07 18:48:43 UTC
I'm unable to provide more info on this, since it was something I found in CI while on buildcop duty.

Comment 4 Micah Abbott 2019-11-07 18:58:45 UTC
I think this was a flake due to a GC'ed release payload, since jobs after the reported failed job were passing just fine.

