Bug 1764556 - OCP4.3 machine-config operator failing in UPI mode
Summary: OCP4.3 machine-config operator failing in UPI mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-23 10:14 UTC by Lukas Bednar
Modified: 2020-01-23 11:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:09:02 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:09:34 UTC

Description Lukas Bednar 2019-10-23 10:14:06 UTC
Description of problem:

I am trying to deploy ocp-4.3 in UPI mode.

INSTALLER_VERSION="4.3.0-0.nightly-2019-10-21-082726"
IMAGE_NAME="rhcos-43.80.20191020.3"

Unfortunately, the machine-config cluster operator doesn't come up.

Version-Release number of the following components:
4.3.0-0.nightly-2019-10-21-082726
rhcos-43.80.20191020.3

How reproducible: 100%

Steps to Reproduce:
1. Install cluster in UPI mode

Actual results:
15:33:06 level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.3.0-0.nightly-2019-10-15-040346 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node host-172-16-0-15 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-0b38dac160d804b64495b1b7ebd6510b\\\\\\\" not found\\\", Node host-172-16-0-14 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-0b38dac160d804b64495b1b7ebd6510b\\\\\\\" not found\\\", Node host-172-16-0-19 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-0b38dac160d804b64495b1b7ebd6510b\\\\\\\" not found\\\"\", retrying"
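The nested quoting in the fatal message above makes it hard to see which nodes are affected. The following is a minimal Go sketch, assuming only the message format shown in this report, that pulls out the reporting node names and the rendered config they cannot find. The names `degradedNodes`, `nodeRe`, and `configRe` are illustrative and are not part of the installer or the MCO.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeRe matches "Node <name> is reporting:" as it appears in the installer message.
var nodeRe = regexp.MustCompile(`Node (\S+) is reporting:`)

// configRe matches a rendered MachineConfig name: rendered-<pool>-<32 hex chars>.
var configRe = regexp.MustCompile(`rendered-[a-z]+-[0-9a-f]{32}`)

// degradedNodes returns the node names and the distinct rendered configs found in msg.
func degradedNodes(msg string) (nodes []string, configs []string) {
	for _, m := range nodeRe.FindAllStringSubmatch(msg, -1) {
		nodes = append(nodes, m[1])
	}
	seen := map[string]bool{}
	for _, c := range configRe.FindAllString(msg, -1) {
		if !seen[c] {
			seen[c] = true
			configs = append(configs, c)
		}
	}
	return nodes, configs
}

func main() {
	// Excerpt of the message from this report; the raw string keeps the backslashes as-is.
	msg := `Node host-172-16-0-15 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-0b38dac160d804b64495b1b7ebd6510b\\\" not found\", ` +
		`Node host-172-16-0-14 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-0b38dac160d804b64495b1b7ebd6510b\\\" not found\"`
	nodes, configs := degradedNodes(msg)
	fmt.Println("nodes:", nodes)
	fmt.Println("missing configs:", configs)
}
```

In this report all three masters reference the same `rendered-master-0b38dac160d804b64495b1b7ebd6510b`, so the config list collapses to a single entry.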

Expected results:
The machine-config operator comes up and the installation succeeds.

Additional info:

Comment 2 Lukas Bednar 2019-10-24 11:20:02 UTC
The following combination works: 4.3.0-0.nightly-2019-10-24-040910 & rhcos-43.81.20191023.1

Comment 3 Scott Dodson 2019-10-24 16:53:25 UTC
Lukas,

If bootstrap-complete has been achieved please file bugs like this against the operator that's failing, in this case MCO.

Comment 4 Lukas Bednar 2019-10-24 17:55:12 UTC
(In reply to Scott Dodson from comment #3)
> Lukas,
> 
> If bootstrap-complete has been achieved please file bugs like this against
> the operator that's failing, in this case MCO.

I see, that makes sense. Next time I will do so!

Comment 5 Kirsten Garrison 2019-11-05 21:05:18 UTC
I see:
`parsing booted osImageURL: parsing reference: "": invalid reference format` <-- this seems to be an rpm-ostree error that is bubbled up to the MCO.

From the daemon logs:
```
2019-10-22T23:23:17.388531875Z I1022 23:23:17.388472 2959986 daemon.go:208] Booted osImageURL:  (43.80.20191020.3)
2019-10-22T23:23:17.389989456Z I1022 23:23:17.389933 2959986 metrics.go:20] Starting metrics listener on 127.0.0.1:8797
2019-10-22T23:23:17.391115527Z I1022 23:23:17.391041 2959986 update.go:1027] Starting to manage node: host-172-16-0-18
2019-10-22T23:23:17.396883559Z I1022 23:23:17.396852 2959986 rpm-ostree.go:364] Running captured: rpm-ostree status
2019-10-22T23:23:17.443270516Z I1022 23:23:17.443202 2959986 daemon.go:759] State: idle
2019-10-22T23:23:17.443270516Z AutomaticUpdates: disabled
2019-10-22T23:23:17.443270516Z Deployments:
2019-10-22T23:23:17.443270516Z * ostree://7c459d948e2e3b8e50f816fca6b9932e113b11df92f81e038e3d61285d3810b4
2019-10-22T23:23:17.443270516Z                    Version: 43.80.20191020.3 (2019-10-20T23:57:48Z)

...
2019-10-22T23:24:23.31471718Z I1022 23:24:23.314667 2959986 daemon.go:687] In bootstrap mode
2019-10-22T23:24:23.31471718Z I1022 23:24:23.314689 2959986 daemon.go:715] Current+desired config: rendered-master-0f22a71e0f359a3d6218bd01cf170f4a
2019-10-22T23:24:23.319407573Z I1022 23:24:23.319375 2959986 daemon.go:891] Bootstrap pivot required to: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef2d8bb78a570126d2ed5012b1887c3128f67b6a81c36fd7a4f08f71beeaac47
2019-10-22T23:24:23.319438616Z I1022 23:24:23.319404 2959986 update.go:906] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef2d8bb78a570126d2ed5012b1887c3128f67b6a81c36fd7a4f08f71beeaac47
2019-10-22T23:24:23.335830911Z I1022 23:24:19.263036 2961562 rpm-ostree.go:356] Running captured: rpm-ostree status --json
2019-10-22T23:24:23.335830911Z client(id:cli dbus:1.2997 unit:machine-config-daemon-host.service uid:0) added; new total=1
2019-10-22T23:24:23.335830911Z client(id:cli dbus:1.2997 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2019-10-22T23:24:23.335830911Z In idle state; will auto-exit in 60 seconds
2019-10-22T23:24:23.335830911Z error: parsing booted osImageURL: parsing reference: "": invalid reference format
2019-10-22T23:24:23.335830911Z machine-config-daemon-host.service: Main process exited, code=exited, status=1/FAILURE
2019-10-22T23:24:23.335830911Z machine-config-daemon-host.service: Failed with result 'exit-code'.
2019-10-22T23:24:23.335830911Z Failed to start Machine Config Daemon Initial.
2019-10-22T23:24:23.335830911Z machine-config-daemon-host.service: Consumed 43ms CPU time


```

Not exactly sure what the issue is - is the RHCOS version somehow incompatible?  Could you PTAL and let me know what you think?
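The `Booted osImageURL:  (43.80.20191020.3)` line in the log above already shows an empty URL before the pivot fails. The following is a minimal Go sketch, assuming the log line format shown here, for spotting that condition when scanning daemon logs. `bootedImage` and `bootedRe` are hypothetical names and are not part of the MCO.

```go
package main

import (
	"fmt"
	"regexp"
)

// bootedRe captures the URL (possibly empty) and the OS version from a
// "Booted osImageURL: <url> (<version>)" daemon log line.
var bootedRe = regexp.MustCompile(`Booted osImageURL: (\S*) \(([^)]+)\)`)

// bootedImage returns the booted osImageURL and version found in line.
// ok is false when the line is not a "Booted osImageURL" line.
func bootedImage(line string) (url, version string, ok bool) {
	m := bootedRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := "I1022 23:23:17.388472 2959986 daemon.go:208] Booted osImageURL:  (43.80.20191020.3)"
	url, version, ok := bootedImage(line)
	if ok && url == "" {
		fmt.Printf("version %s booted with an empty osImageURL\n", version)
	}
}
```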

Comment 6 Micah Abbott 2019-11-07 18:51:54 UTC
This looks like an error from the `imgref` library - https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L1205

Not sure how an empty string got passed as the osImageURL, though
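To illustrate the failure mode, here is a minimal Go sketch of a guard that rejects an empty reference before it reaches a parser, so the error names the real problem instead of `invalid reference format`. `checkOSImageURL` and `errEmptyOSImageURL` are hypothetical and are not MCO code. The sketch does not call the real reference-parsing library, and the example image reference is made up.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errEmptyOSImageURL is returned when no booted image reference is available.
var errEmptyOSImageURL = errors.New("booted osImageURL is empty")

// checkOSImageURL is a hypothetical pre-check. It rejects an empty or
// whitespace-only reference before any real parser sees it.
func checkOSImageURL(ref string) error {
	if strings.TrimSpace(ref) == "" {
		return errEmptyOSImageURL
	}
	return nil
}

func main() {
	// The second reference is a made-up example, not an image from this report.
	for _, ref := range []string{"", "quay.io/example/os@sha256:abc"} {
		if err := checkOSImageURL(ref); err != nil {
			fmt.Printf("%q: %v\n", ref, err)
			continue
		}
		fmt.Printf("%q: ok to parse\n", ref)
	}
}
```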

Comment 8 Micah Abbott 2019-12-04 20:14:24 UTC
@Lukas we changed some of the mechanics of how RHCOS was built and how the MCO parses the previously booted OS around the same time this BZ was reported:

https://github.com/openshift/machine-config-operator/pull/1155
https://github.com/openshift/os/issues/386#issuecomment-544992959

We haven't heard any other reports of this problem since this BZ was filed.

Could you please try to retest with more recent 4.3 artifacts and see if the problem persists?

Comment 9 Lukas Bednar 2019-12-05 13:29:42 UTC
At the moment we are using 4.3.0-0.nightly-2019-12-04-054458 with rhcos-43.81.201912040340.0 and it works for us.

Should I go and try something even newer?

Comment 10 Micah Abbott 2019-12-05 14:02:47 UTC
@Lukas That's pretty darn new.  :)

Thanks for confirming that things are working.  I'm going to mark this as fixed.

Comment 12 Michael Nguyen 2019-12-18 21:50:52 UTC
I am closing this as verified as the reporter confirmed UPI installations to be working on 4.3.0-0.nightly-2019-12-04-054458 with the fix.

Comment 14 errata-xmlrpc 2020-01-23 11:09:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

