Bug 1901577
Summary: | operator conditions machine-config | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5>
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | jlebon, kgarriso, skumari, walters
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | operator conditions machine-config
Last Closed: | 2020-12-04 21:13:23 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Lokesh Mandvekar
2020-11-25 15:10:55 UTC
There seem to be a couple of issues here.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:

```
message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"",
```

Sinny is looking into this.

From the proxy job instead, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331565103664861184:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```

We need to look into these.
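As background for the second failure, those "not found" messages boil down to a failed Get of a cluster-scoped MachineConfig object. Below is a minimal client-go sketch of the equivalent lookup using a dynamic client; getRenderedConfig is a hypothetical helper (the MCO itself uses a generated clientset), but the group/version/resource comes straight from the error text.

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// The GVR matches the reported error:
// machineconfig.machineconfiguration.openshift.io "rendered-..." not found.
var machineConfigGVR = schema.GroupVersionResource{
	Group:    "machineconfiguration.openshift.io",
	Version:  "v1",
	Resource: "machineconfigs",
}

// getRenderedConfig is hypothetical; the failing lookup on the node is
// equivalent to this cluster-scoped Get.
func getRenderedConfig(cfg *rest.Config, name string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	_, err = client.Resource(machineConfigGVR).Get(context.TODO(), name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The node's desired rendered config no longer exists in the
		// cluster, which is exactly what the nodes report above.
		return fmt.Errorf("machineconfig %q not found", name)
	}
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := getRenderedConfig(cfg, "rendered-master-eae7820c783ea9f895860667adc7111a"); err != nil {
		fmt.Println(err)
	}
}
```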
---

(In reply to Antonio Murdaca from comment #1)
> From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:
>
> ```
> message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"",
> ```

This seems to be an rpm-ostree related issue. Looking at the MCD log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/887/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331260885397671936/artifacts/e2e-openstack-ovn/pods/openshift-machine-config-operator_machine-config-daemon-48v2l_machine-config-daemon.log and the corresponding rpm-ostree journal log from that node, rpm-ostree doesn't report success for the previous deployment cleanup:

```
Nov 24 16:15:41.598920 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Succeeded.
Nov 24 16:15:41.599958 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Consumed 5.523s CPU time
Nov 24 16:15:41.602135 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602039 2478 manager.go:1044] Destroyed container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:15:41.602203 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602072 2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:41.602063983 +0000 UTC m=+114.238727138 containerDeletion {<nil>}}
Nov 24 16:15:59.997292 i223f7yy-3d5e8-z6qkt-worker-0-xghwt dbus-daemon[1229]: [system] Activating via systemd: service name='org.projectatomic.rpmostree1' unit='rpm-ostreed.service' requested by ':1.244' (uid=0 pid=7086 comm="rpm-ostree cleanup -r " label="system_u:system_r:spc_t:s0")
Nov 24 16:16:00.000807 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000705 2478 factory.go:216] Factory "crio" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000810 2478 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000837 2478 factory.go:212] Using factory "raw" for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001917 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.001831 2478 manager.go:987] Added container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:16:00.002732 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002642 2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:59.999092998 +0000 UTC containerCreation {<nil>}}
Nov 24 16:16:00.002845 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002780 2478 container.go:490] Start housekeeping for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.003678 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Starting rpm-ostree System Management Daemon...
Nov 24 16:16:00.046162 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Reading config file '/etc/rpm-ostreed.conf'
Nov 24 16:16:00.057568 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: In idle state; will auto-exit in 63 seconds
Nov 24 16:16:00.058531 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Started rpm-ostree System Management Daemon.
Nov 24 16:16:00.066603 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:16:30.168560 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Nov 24 16:16:30.168979 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) vanished; remaining=0
Nov 24 16:16:30.183694 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.249 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:28:14.754412 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd-tmpfiles[27734]: [/usr/lib/tmpfiles.d/rpm-ostree-1-autovar.conf:1] Duplicate line for path "/var/lib/alternatives", ignoring.
```
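For context on the quoted failure, the daemon's cleanup amounts to shelling out to the rpm-ostree CLI. A minimal sketch follows; runRpmOstree is a hypothetical helper, not the MCD's actual code, but it shows how "error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1" is assembled from the command's combined output plus the exit error.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// runRpmOstree runs the CLI and wraps any failure the way the reported
// message is shaped: "error running rpm-ostree <args>: <output>: <exit err>".
func runRpmOstree(args ...string) error {
	// Generous outer deadline for the process itself.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	out, err := exec.CommandContext(ctx, "rpm-ostree", args...).CombinedOutput()
	if err != nil {
		// "error: Timeout was reached" in the bug is printed by rpm-ostree
		// itself (its D-Bus client gave up waiting on the daemon), so it
		// lands in `out`, and the process exits 1, producing the message.
		return fmt.Errorf("error running rpm-ostree %s: %s: %w",
			strings.Join(args, " "), string(out), err)
	}
	return nil
}

func main() {
	if err := runRpmOstree("cleanup", "-r"); err != nil {
		fmt.Println(err)
	}
}
```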
---

Well, the proxy error comes from:

```
level=error msg=Cluster operator authentication Degraded is True with ProxyConfigController_SyncError: ProxyConfigControllerDegraded: endpoint("https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz") is unreachable with proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": x509: certificate signed by unknown authority) and without proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": dial tcp 35.163.63.55:443: connect: connection timed out)
```

and surely, between the bootstrap and cluster runs, what's returned from the proxy controller differs, which causes:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```

---

cc'ing Jonathan: thoughts on why rpm-ostree would time out while deleting the rollback deployment?

---

Hmm, not sure what's going on here. Interesting that the "Initiated txn Cleanup" message is timestamped 5s after the client timed out. We've had a few changes related to D-Bus transaction handling (originally motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1865839) in the recent rpm-ostree releases, which are not yet in 4.7, though we're planning to tag them in. We'll see if those help.

---

Hey all, the proxy issue is being tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1901034. So when the rpm-ostree work is done, feel free to dupe this into the tracking issue for the proxy problems: https://bugzilla.redhat.com/show_bug.cgi?id=1901034

---

Spoke to Jonathan, and as there's already a BZ tracking the rpm-ostree issue, I'm going to dupe this to the overall proxy install failure BZ to keep that discussion centralized.

*** This bug has been marked as a duplicate of bug 1901034 ***

---

This looks more like https://bugzilla.redhat.com/show_bug.cgi?id=1888565#c3. TL;DR: your storage is too slow.
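To illustrate the client/daemon race visible in the journal above (the client "vanished" in the same instant the daemon initiated the Cleanup txn, which slow storage makes more likely): a client-side D-Bus deadline can expire while the daemon is still preparing the transaction. A minimal godbus sketch follows; the bus name and object path are taken from the journal, but the OS.Cleanup interface/method name and the "rollback" argument are assumptions, not confirmed against the rpm-ostree D-Bus API.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/godbus/dbus/v5"
)

func main() {
	// Connect to the system bus, where rpm-ostreed is D-Bus activated.
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}

	// Bus name and object path as seen in the journal above.
	osProxy := conn.Object("org.projectatomic.rpmostree1",
		"/org/projectatomic/rpmostree1/rhcos")

	// A client-side deadline: if the daemon is slow (e.g. slow disk I/O),
	// this expires and the client drops off the bus even though the daemon
	// may initiate the transaction a moment later.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// ASSUMPTION: method name and argument are illustrative only.
	var txn dbus.ObjectPath
	call := osProxy.CallWithContext(ctx,
		"org.projectatomic.rpmostree1.OS.Cleanup", 0, []string{"rollback"})
	if call.Err != nil {
		// This is where a caller would see "Timeout was reached".
		fmt.Println("cleanup failed:", call.Err)
		return
	}
	if err := call.Store(&txn); err != nil {
		panic(err)
	}
	fmt.Println("cleanup transaction:", txn)
}
```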