Description of problem:

The MCO e2e-gcp-op test has been failing for a while, which we are tracking in bug https://bugzilla.redhat.com/show_bug.cgi?id=1834895 .

MCD log from a failing test run: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1723/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2174/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-mbwc4_machine-config-daemon.log

I0512 08:56:00.121039 1921 daemon.go:564] Starting MachineConfigDaemon
I0512 08:56:00.121283 1921 daemon.go:571] Enabling Kubelet Healthz Monitor
E0512 08:56:03.463330 1921 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0512 08:56:03.463519 1921 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0512 09:00:58.611428 1921 daemon.go:767] Current config: rendered-worker-0365931436ffa46ac1323f21d763a58e
I0512 09:00:58.611456 1921 daemon.go:768] Desired config: rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.622168 1921 update.go:1346] Disk currentConfig rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6 overrides node annotation rendered-worker-0365931436ffa46ac1323f21d763a58e
I0512 09:00:58.626043 1921 daemon.go:991] Validating against pending config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.811251 1921 daemon.go:1007] Validated on-disk state
I0512 09:00:58.830398 1921 daemon.go:1041] Completing pending config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.848929 1921 update.go:1346] completed update for config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.852519 1921 daemon.go:1057] In desired config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6

We noticed that the MCD takes around 5 minutes to sync to the desired config whenever we apply a MachineConfig. This adds up to a lot of time, as we run multiple tests on 3 worker nodes.

While debugging, we noticed that between the passing test and the failing test, a few packages were updated on the RHCOS node with the latest machine-os-content (https://releases-rhcos-art.cloud.privileged.psi.redhat.com/diff.html?first_release=45.81.202004241728-0&first_stream=releases%2Frhcos-4.5&second_release=45.81.202005131029-0&second_stream=releases%2Frhcos-4.5) .

To narrow down the issue, we checked whether any particular package update could have caused it. It was tracked down to the cri-o update from cri-o-1.18.0-4.dev.rhaos4.5.git7d79f42.el8.x86_64.rpm to cri-o-1.18.0-17.dev.rhaos4.5.gitdea34b9.el8.x86_64.rpm

We verified this by installing a 4.5 cluster, running `rpm-ostree override replace cri-o-1.18.0-4.dev.rhaos4.5.git7d79f42.el8.x86_64.rpm` on one of the nodes, rebooting the machine, and then applying a kernel-argument MachineConfig. The remaining two worker nodes take an extra 5 minutes to sync to the desired config, while the node with the downgraded cri-o finishes within 1 minute.

How reproducible:
-----------------
Install a 4.5 cluster and apply a MachineConfig. Monitor the MCD log: after the start of the MachineConfigDaemon, it takes almost 5 minutes to reach the desired config.

Note that we fixed this problem another way in PR https://github.com/openshift/machine-config-operator/pull/1731, so to reproduce the bug one will have to use an MCO image which doesn't have that change.

Not sure what changed in cri-o to cause the delay. It may have an impact in other places too.
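For reference, the kernel-argument MachineConfig used in the verification step above can be sketched roughly as follows. The object name and the specific kernel argument here are placeholders, not the exact MC used in the test:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    # Targets the worker pool, matching the worker nodes discussed above.
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-test-kargs   # placeholder name
spec:
  kernelArguments:
    - loglevel=7               # placeholder kernel argument
```

Applying any MC like this (e.g. via `oc create -f`) triggers a new rendered-worker config and a rolling reboot of the pool, after which the MCD sync time can be observed in its log.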
Created attachment 1688482 [details]
MCD and crio journal log from local cluster run

For verbose logs, sample jobs:

failing test run - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1724/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2262
passing test run - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2123

Maybe it will be easier to find something useful in the crio service log (I couldn't find anything alarming myself). For that, I have attached the MCD pod log and the crio service journal log from a local cluster I tried.
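When comparing logs like these, the delay can be quantified by taking the gap between the "Starting MachineConfigDaemon" and "In desired config" entries. A minimal sketch, assuming standard klog-style timestamps and using the two lines from the failing run quoted earlier:

```python
from datetime import datetime

# Two entries copied from the failing MCD log above
# (klog header format: "Lmmdd hh:mm:ss.ffffff threadid file:line] msg").
start_line = "I0512 08:56:00.121039 1921 daemon.go:564] Starting MachineConfigDaemon"
end_line = ("I0512 09:00:58.852519 1921 daemon.go:1057] In desired config "
            "rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6")

def klog_time(line: str) -> datetime:
    # The second whitespace-separated field is the time of day.
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")

delay = klog_time(end_line) - klog_time(start_line)
print(delay.total_seconds())  # ~298.7 seconds, i.e. just under 5 minutes
```

Running this over both the failing and passing runs makes the roughly 5-minute difference in MCD sync time stand out without reading the logs by hand.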
I have tested this with cri-o version 1.18.0-17.dev.rhaos4.5.gitdea34b9.el8, on a build which I believe also includes https://github.com/openshift/machine-config-operator/commit/45e35584cfec097a4d23381129e13851ee3a60be (hard to tell; the MCD version `4.5.0-202005150317.git.1.d3fe948.el8` doesn't seem to match upstream commits). I am no longer encountering these timeouts and thus think this is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409