test: operator conditions machine-config is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=operator+conditions+machine-config

For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331565103664861184:

```
level=info msg=Waiting up to 40m0s for the cluster at https://api.ci-op-68ywflb6-2659c.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize...
level=error msg=Cluster operator authentication Degraded is True with ProxyConfigController_SyncError: ProxyConfigControllerDegraded: endpoint("https://oauth-openshift.apps.ci-op-68ywflb6-2659c.origin-ci-int-aws.dev.rhcloud.com/healthz") is unreachable with proxy(Get "https://oauth-openshift.apps.ci-op-68ywflb6-2659c.origin-ci-int-aws.dev.rhcloud.com/healthz": x509: certificate signed by unknown authority) and without proxy(Get "https://oauth-openshift.apps.ci-op-68ywflb6-2659c.origin-ci-int-aws.dev.rhcloud.com/healthz": dial tcp 35.164.147.170:443: connect: connection timed out)
level=info msg=Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-11-25-114114
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-11-25-114114: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-eae7820c783ea9f895860667adc7111a\\\" not found\", Node ip-10-0-69-44.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-eae7820c783ea9f895860667adc7111a\\\" not found\", Node ip-10-0-55-67.us-west-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-eae7820c783ea9f895860667adc7111a\\\" not found\"", retrying
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-11-25-114114
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating
2020/11/25 12:59:31 Failed to upload $KUBECONFIG: timed out waiting for the condition
error: failed to execute wrapped command: exit status 1
2020/11/25 12:59:34 Container test in pod e2e-aws-proxy-ipi-install-install failed, exit code 1, reason Error
2020/11/25 12:59:34 Copied 4.99MB of artifacts from e2e-aws-proxy-ipi-install-install to /logs/artifacts/e2e-aws-proxy/ipi-install-install
2020/11/25 13:00:14 Container artifacts in pod e2e-aws-proxy-ipi-install-install completed successfully
2020/11/25 13:00:14 Executing pod "e2e-aws-proxy-gather-aws-console"
2020/11/25 13:00:34 Container cp-secret-wrapper in pod e2e-aws-proxy-gather-aws-console completed successfully
2020/11/25 13:22:04 Pod e2e-aws-proxy-ipi-deprovision-proxy succeeded after 1m46s
2020/11/25 13:22:04 Releasing leases for "e2e-aws-proxy"
2020/11/25 13:22:04 Releasing lease "b315453f-673a-4c36-a4d3-64dd02315ee8" for "aws-quota-slice"
2020/11/25 13:22:04 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/11/25 13:22:04 Ran for 1h34m28s
error: some steps failed:
* could not run steps: step e2e-aws-proxy failed: "e2e-aws-proxy" pre steps failed: "e2e-aws-proxy" pod "e2e-aws-proxy-ipi-install-install" failed: the pod ci-op-68ywflb6/e2e-aws-proxy-ipi-install-install failed after 59m50s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
```

Link to step on registry info site: https://steps.ci.openshift.org/reference/ipi-install-install
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=release&branch=master&test=e2e-aws-proxy&variant=ocp-4.7

time="2020-11-25T13:22:04Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'"
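The "rendered-master-eae7820c783ea9f895860667adc7111a not found" errors appear to mean the masters' desired-config annotation points at a rendered MachineConfig that does not exist in the cluster. A quick way to confirm that from a live cluster is to compare the node annotations against the MachineConfig objects; the Go sketch below is only illustrative (the kubeconfig path is a placeholder, and it assumes the standard machineconfiguration.openshift.io/desiredConfig annotation), not part of any existing tooling:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build clients from a kubeconfig (path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	kube := kubernetes.NewForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// MachineConfigs are cluster-scoped custom resources.
	mcGVR := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigs",
	}
	mcs, err := dyn.Resource(mcGVR).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	existing := map[string]bool{}
	for _, mc := range mcs.Items {
		existing[mc.GetName()] = true
	}

	// Flag any node whose desired config references a MachineConfig
	// that is not present in the cluster.
	nodes, err := kube.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		desired := n.Annotations["machineconfiguration.openshift.io/desiredConfig"]
		if desired != "" && !existing[desired] {
			fmt.Printf("node %s wants %s, which does not exist\n", n.Name, desired)
		}
	}
}
```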
There seem to be a couple of issues here, though.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:

```
message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"",
```

Sinny is looking into this one.

From the proxy job instead, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331565103664861184:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found", Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found", Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```

We need to look into these.
(In reply to Antonio Murdaca from comment #1)
> There seem to be a couple of issues here, though.
>
> From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:
>
> ```
> message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to
> remove rollback: error running rpm-ostree cleanup -r: error: Timeout was
> reached\n: exit status 1"",
> ```

This seems to be an rpm-ostree related issue. Looking at the MCD log (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/887/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331260885397671936/artifacts/e2e-openstack-ovn/pods/openshift-machine-config-operator_machine-config-daemon-48v2l_machine-config-daemon.log) and the corresponding rpm-ostree journal log from that node, rpm-ostree doesn't report success for the previous deployment cleanup:

```
Nov 24 16:15:41.598920 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Succeeded.
Nov 24 16:15:41.599958 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Consumed 5.523s CPU time
Nov 24 16:15:41.602135 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602039    2478 manager.go:1044] Destroyed container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:15:41.602203 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602072    2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:41.602063983 +0000 UTC m=+114.238727138 containerDeletion {<nil>}}
Nov 24 16:15:59.997292 i223f7yy-3d5e8-z6qkt-worker-0-xghwt dbus-daemon[1229]: [system] Activating via systemd: service name='org.projectatomic.rpmostree1' unit='rpm-ostreed.service' requested by ':1.244' (uid=0 pid=7086 comm="rpm-ostree cleanup -r " label="system_u:system_r:spc_t:s0")
Nov 24 16:16:00.000807 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000705    2478 factory.go:216] Factory "crio" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000810    2478 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000837    2478 factory.go:212] Using factory "raw" for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001917 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.001831    2478 manager.go:987] Added container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:16:00.002732 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002642    2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:59.999092998 +0000 UTC containerCreation {<nil>}}
Nov 24 16:16:00.002845 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002780    2478 container.go:490] Start housekeeping for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.003678 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Starting rpm-ostree System Management Daemon...
Nov 24 16:16:00.046162 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Reading config file '/etc/rpm-ostreed.conf'
Nov 24 16:16:00.057568 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: In idle state; will auto-exit in 63 seconds
Nov 24 16:16:00.058531 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Started rpm-ostree System Management Daemon.
Nov 24 16:16:00.066603 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:16:30.168560 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Nov 24 16:16:30.168979 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) vanished; remaining=0
Nov 24 16:16:30.183694 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.249 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:28:14.754412 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd-tmpfiles[27734]: [/usr/lib/tmpfiles.d/rpm-ostree-1-autovar.conf:1] Duplicate line for path "/var/lib/alternatives", ignoring.
```
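For context on where the "error running rpm-ostree cleanup -r: error: Timeout was reached: exit status 1" string comes from: the daemon removes the rollback deployment by shelling out to the rpm-ostree CLI (the journal above shows comm="rpm-ostree cleanup -r"), and when the client gives up waiting on rpm-ostreed over D-Bus it prints "error: Timeout was reached" and exits non-zero, which the caller then wraps together with the exit status. A minimal sketch of that kind of exec wrapper, assuming a hypothetical helper name and error wrapping rather than the actual MCD code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runRpmOstree is only a stand-in for the kind of exec wrapper the daemon
// uses; the helper name and exact error wrapping are assumptions for
// illustration. If the rpm-ostree client times out waiting on rpm-ostreed,
// it exits 1 with "error: Timeout was reached" on its output, and wrapping
// the combined output plus the exit error reproduces the shape of the
// message reported by the node.
func runRpmOstree(args ...string) error {
	out, err := exec.Command("rpm-ostree", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("error running rpm-ostree %s: %s: %w",
			strings.Join(args, " "), string(out), err)
	}
	return nil
}

func main() {
	// Remove the rollback deployment, as the daemon does after an update.
	if err := runRpmOstree("cleanup", "-r"); err != nil {
		// e.g. "error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"
		fmt.Println(err)
	}
}
```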
Well, the proxy error comes from:

```
level=error msg=Cluster operator authentication Degraded is True with ProxyConfigController_SyncError: ProxyConfigControllerDegraded: endpoint("https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz") is unreachable with proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": x509: certificate signed by unknown authority) and without proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": dial tcp 35.163.63.55:443: connect: connection timed out)
```

and surely what the proxy controller returns differs between the bootstrap run and the in-cluster run, which causes:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found", Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found", Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```
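To make the two failure modes in that condition concrete: the message shows the controller probing the route both through the cluster proxy and directly, so a re-encrypting proxy whose CA the client does not trust surfaces as the x509 error on the proxied attempt, while blocked direct egress surfaces as the dial timeout on the direct attempt. Below is a rough, self-contained sketch of that kind of dual probe; the URLs and the helper name are placeholders, not the actual ProxyConfigController code:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// probe hits the same endpoint twice: once through the given proxy and
// once directly. The endpoint and proxy URLs below are placeholders.
func probe(endpoint, proxy string) {
	proxyURL, err := url.Parse(proxy)
	if err != nil {
		panic(err)
	}

	viaProxy := &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}
	direct := &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{Proxy: nil},
	}

	// A re-encrypting proxy whose CA is not in the client trust bundle
	// fails here with "x509: certificate signed by unknown authority".
	if resp, err := viaProxy.Get(endpoint); err != nil {
		fmt.Printf("with proxy:    %v\n", err)
	} else {
		resp.Body.Close()
	}

	// With direct egress blocked, this one fails with
	// "dial tcp ...:443: connect: connection timed out".
	if resp, err := direct.Get(endpoint); err != nil {
		fmt.Printf("without proxy: %v\n", err)
	} else {
		resp.Body.Close()
	}
}

func main() {
	probe("https://oauth-openshift.apps.example.com/healthz",
		"http://proxy.example.com:3128")
}
```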
cc'ing Jonathan: any thoughts on why rpm-ostree would time out while deleting the rollback deployment?
Hmm, not sure what's going on here. It's interesting that the "Initiated txn Cleanup" message is timestamped 5s after the client timed out. We've made a few changes related to D-Bus transaction handling (originally motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1865839) in the recent rpm-ostree releases; they're not yet in 4.7, though we're planning to tag them in. We'll see if those help.
Hey all, the proxy issue is being tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1901034. So once the rpm-ostree work is done, feel free to dupe this into that tracking bug for the proxy problems: https://bugzilla.redhat.com/show_bug.cgi?id=1901034
Spoke to Jonathan, and since there's already a bz tracking the rpm-ostree issue, I'm going to dupe this to the overall proxy install failure bz to keep that discussion centralized.

*** This bug has been marked as a duplicate of bug 1901034 ***
This looks more like https://bugzilla.redhat.com/show_bug.cgi?id=1888565#c3. TL;DR: your storage is too slow.