Bug 1901577
Summary: | operator conditions machine-config | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5>
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | jlebon, kgarriso, skumari, walters
Target Milestone: | --- | |
Target Release: | 4.7.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | operator conditions machine-config
Last Closed: | 2020-12-04 21:13:23 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Lokesh Mandvekar
2020-11-25 15:10:55 UTC
There seem to be a couple of issues here.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:

```
message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"",
```

Sinny is looking into this.

From the proxy job instead, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-aws-proxy/1331565103664861184:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```

We need to look into these.
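As background for the second failure, those "not found" messages boil down to a failed Get of a cluster-scoped MachineConfig object. Below is a minimal client-go sketch of the equivalent lookup using a dynamic client; getRenderedConfig is a hypothetical helper (the MCO itself uses a generated clientset), but the group/version/resource comes straight from the error text.

```go
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// The GVR matches the reported error:
// machineconfig.machineconfiguration.openshift.io "rendered-..." not found.
var machineConfigGVR = schema.GroupVersionResource{
	Group:    "machineconfiguration.openshift.io",
	Version:  "v1",
	Resource: "machineconfigs",
}

// getRenderedConfig is hypothetical; the failing lookup on the node is
// equivalent to this cluster-scoped Get.
func getRenderedConfig(cfg *rest.Config, name string) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	_, err = client.Resource(machineConfigGVR).Get(context.TODO(), name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The node's desired rendered config no longer exists in the
		// cluster, which is exactly what the nodes report above.
		return fmt.Errorf("machineconfig %q not found", name)
	}
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := getRenderedConfig(cfg, "rendered-master-eae7820c783ea9f895860667adc7111a"); err != nil {
		fmt.Println(err)
	}
}
```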
---

(In reply to Antonio Murdaca from comment #1)
> From https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331865761811009536:
>
> ```
> message: "Node yxcvrbbk-3d5e8-bzjds-master-2 is reporting: "Failed to remove rollback: error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1"",
> ```

This seems to be an rpm-ostree related issue. Looking at the MCD log https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/887/pull-ci-openshift-cluster-network-operator-master-e2e-openstack-ovn/1331260885397671936/artifacts/e2e-openstack-ovn/pods/openshift-machine-config-operator_machine-config-daemon-48v2l_machine-config-daemon.log and the corresponding rpm-ostree journal log from that node, rpm-ostree doesn't report success for the previous deployment cleanup:

```
Nov 24 16:15:41.598920 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Succeeded.
Nov 24 16:15:41.599958 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: rpm-ostreed.service: Consumed 5.523s CPU time
Nov 24 16:15:41.602135 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602039 2478 manager.go:1044] Destroyed container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:15:41.602203 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:15:41.602072 2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:41.602063983 +0000 UTC m=+114.238727138 containerDeletion {<nil>}}
Nov 24 16:15:59.997292 i223f7yy-3d5e8-z6qkt-worker-0-xghwt dbus-daemon[1229]: [system] Activating via systemd: service name='org.projectatomic.rpmostree1' unit='rpm-ostreed.service' requested by ':1.244' (uid=0 pid=7086 comm="rpm-ostree cleanup -r " label="system_u:system_r:spc_t:s0")
Nov 24 16:16:00.000807 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000705 2478 factory.go:216] Factory "crio" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000810 2478 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001353 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.000837 2478 factory.go:212] Using factory "raw" for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.001917 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.001831 2478 manager.go:987] Added container: "/system.slice/rpm-ostreed.service" (aliases: [], namespace: "")
Nov 24 16:16:00.002732 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002642 2478 handler.go:325] Added event &{/system.slice/rpm-ostreed.service 2020-11-24 16:15:59.999092998 +0000 UTC containerCreation {<nil>}}
Nov 24 16:16:00.002845 i223f7yy-3d5e8-z6qkt-worker-0-xghwt hyperkube[2478]: I1124 16:16:00.002780 2478 container.go:490] Start housekeeping for container "/system.slice/rpm-ostreed.service"
Nov 24 16:16:00.003678 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Starting rpm-ostree System Management Daemon...
Nov 24 16:16:00.046162 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Reading config file '/etc/rpm-ostreed.conf'
Nov 24 16:16:00.057568 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: In idle state; will auto-exit in 63 seconds
Nov 24 16:16:00.058531 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd[1]: Started rpm-ostree System Management Daemon.
Nov 24 16:16:00.066603 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:16:30.168560 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: Initiated txn Cleanup for client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Nov 24 16:16:30.168979 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.244 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) vanished; remaining=0
Nov 24 16:16:30.183694 i223f7yy-3d5e8-z6qkt-worker-0-xghwt rpm-ostree[7089]: client(id:machine-config-operator dbus:1.249 unit:crio-677ff1cbfb02ba7d8e38ec0706667eea4c36b31360ce47262c884925cffbc681.scope uid:0) added; new total=1
Nov 24 16:28:14.754412 i223f7yy-3d5e8-z6qkt-worker-0-xghwt systemd-tmpfiles[27734]: [/usr/lib/tmpfiles.d/rpm-ostree-1-autovar.conf:1] Duplicate line for path "/var/lib/alternatives", ignoring.
```
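For context on the quoted failure, the daemon's cleanup amounts to shelling out to the rpm-ostree CLI. A minimal sketch follows; runRpmOstree is a hypothetical helper, not the MCD's actual code, but it shows how "error running rpm-ostree cleanup -r: error: Timeout was reached\n: exit status 1" is assembled from the command's combined output plus the exit error.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// runRpmOstree runs the CLI and wraps any failure the way the reported
// message is shaped: "error running rpm-ostree <args>: <output>: <exit err>".
func runRpmOstree(args ...string) error {
	// Generous outer deadline for the process itself.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	out, err := exec.CommandContext(ctx, "rpm-ostree", args...).CombinedOutput()
	if err != nil {
		// "error: Timeout was reached" in the bug is printed by rpm-ostree
		// itself (its D-Bus client gave up waiting on the daemon), so it
		// lands in `out`, and the process exits 1, producing the message.
		return fmt.Errorf("error running rpm-ostree %s: %s: %w",
			strings.Join(args, " "), string(out), err)
	}
	return nil
}

func main() {
	if err := runRpmOstree("cleanup", "-r"); err != nil {
		fmt.Println(err)
	}
}
```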
---

Well, the proxy error comes from:

```
level=error msg=Cluster operator authentication Degraded is True with ProxyConfigController_SyncError: ProxyConfigControllerDegraded: endpoint("https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz") is unreachable with proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": x509: certificate signed by unknown authority) and without proxy(Get "https://oauth-openshift.apps.ci-op-wcv60d19-7403f.origin-ci-int-aws.dev.rhcloud.com/healthz": dial tcp 35.163.63.55:443: connect: connection timed out)
```

and surely, between the bootstrap and cluster runs, what's returned from the proxy controller differs, which causes:

```
message: "Node ip-10-0-55-247.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-69-44.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found",
Node ip-10-0-55-67.us-west-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-eae7820c783ea9f895860667adc7111a\" not found"",
```

---

cc'ing Jonathan: thoughts on why rpm-ostree would time out while deleting the rollback deployment?

---

Hmm, not sure what's going on here. Interesting that the "Initiated txn Cleanup" message is timestamped 5s after the client timed out. We've had a few changes related to D-Bus transaction handling (originally motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1865839) in the recent rpm-ostree releases, which are not yet in 4.7, though we're planning to tag them in. We'll see if those help.

---

Hey all, the proxy issue is being tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1901034. So when the rpm-ostree work is done, feel free to dupe this into the tracking issue for the proxy problems: https://bugzilla.redhat.com/show_bug.cgi?id=1901034

---

Spoke to Jonathan, and as there's already a BZ tracking the rpm-ostree issue, I'm going to dupe this to the overall proxy install failure BZ to keep that discussion centralized.

*** This bug has been marked as a duplicate of bug 1901034 ***

---

This looks more like https://bugzilla.redhat.com/show_bug.cgi?id=1888565#c3. TL;DR: your storage is too slow.
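To illustrate the client/daemon race visible in the journal above (the client "vanished" in the same instant the daemon initiated the Cleanup txn, which slow storage makes more likely): a client-side D-Bus deadline can expire while the daemon is still preparing the transaction. A minimal godbus sketch follows; the bus name and object path are taken from the journal, but the OS.Cleanup interface/method name and the "rollback" argument are assumptions, not confirmed against the rpm-ostree D-Bus API.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/godbus/dbus/v5"
)

func main() {
	// Connect to the system bus, where rpm-ostreed is D-Bus activated.
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}

	// Bus name and object path as seen in the journal above.
	osProxy := conn.Object("org.projectatomic.rpmostree1",
		"/org/projectatomic/rpmostree1/rhcos")

	// A client-side deadline: if the daemon is slow (e.g. slow disk I/O),
	// this expires and the client drops off the bus even though the daemon
	// may initiate the transaction a moment later.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// ASSUMPTION: method name and argument are illustrative only.
	var txn dbus.ObjectPath
	call := osProxy.CallWithContext(ctx,
		"org.projectatomic.rpmostree1.OS.Cleanup", 0, []string{"rollback"})
	if call.Err != nil {
		// This is where a caller would see "Timeout was reached".
		fmt.Println("cleanup failed:", call.Err)
		return
	}
	if err := call.Store(&txn); err != nil {
		panic(err)
	}
	fmt.Println("cleanup transaction:", txn)
}
```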