Bug 1835805

Summary: crio update to 1.18.0-17.dev.rhaos4.5.gitdea34b9.el8.x86_64.rpm lead to failing MCO e2e-gcp-op test
Product: OpenShift Container Platform Reporter: Sinny Kumari <skumari>
Component: NodeAssignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.5CC: aos-bugs, jokerman, miabbott, mpatel, pehunt, smilner
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:38:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
MCD and crio journal log from local cluster run none

Description Sinny Kumari 2020-05-14 14:09:19 UTC
Description of problem:
MCO e2e-gcp-op test has been failing for a while which we are tracking in bug https://bugzilla.redhat.com/show_bug.cgi?id=1834895 .

MCD log from  failing test run - https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1723/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2174/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-mbwc4_machine-config-daemon.log


I0512 08:56:00.121039    1921 daemon.go:564] Starting MachineConfigDaemon
I0512 08:56:00.121283    1921 daemon.go:571] Enabling Kubelet Healthz Monitor
E0512 08:56:03.463330    1921 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0512 08:56:03.463519    1921 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0512 09:00:58.611428    1921 daemon.go:767] Current config: rendered-worker-0365931436ffa46ac1323f21d763a58e
I0512 09:00:58.611456    1921 daemon.go:768] Desired config: rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.622168    1921 update.go:1346] Disk currentConfig rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6 overrides node annotation rendered-worker-0365931436ffa46ac1323f21d763a58e
I0512 09:00:58.626043    1921 daemon.go:991] Validating against pending config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.811251    1921 daemon.go:1007] Validated on-disk state
I0512 09:00:58.830398    1921 daemon.go:1041] Completing pending config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.848929    1921 update.go:1346] completed update for config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6
I0512 09:00:58.852519    1921 daemon.go:1057] In desired config rendered-worker-e4e53a482a88fc59ad4004d8920ba9e6

Noticed that MCD takes around 5 minutes to sync to desired config whenever we apply a MachineConfig. This adds up lot of time as we run multiple tests on 3 workers nodes.

While debugging, we noticed that between passing test and failing test, there were few packages updated on RHCOS node with the latest machine-os-content (https://releases-rhcos-art.cloud.privileged.psi.redhat.com/diff.html?first_release=45.81.202004241728-0&first_stream=releases%2Frhcos-4.5&second_release=45.81.202005131029-0&second_stream=releases%2Frhcos-4.5) . To narrow down the issue, checked if any particular update could have caused this issue. It was tracked down to cri-o update from cri-o-1.18.0-4.dev.rhaos4.5.git7d79f42.el8.x86_64.rpm to cri-o-1.18.0-17.dev.rhaos4.5.gitdea34b9.el8.x86_64.rpm


Verified this by installing a 4.5 cluster and did `rpm-ostree override replace cri-o-1.18.0-4.dev.rhaos4.5.git7d79f42.el8.x86_64.rpm` on one of the node -> reboot machine -> apply a karg MC. Rest two worker node will take 5 minutes extra to sync to desired config while the downgraded cri-o node finishes within 1 minute


How reproducible:
-----------------

Install a 4.5 cluster and apply a MachineConfig. Monitor the MCD log, after the start of MachineConfigDaemon it takes almost 5 minutes to reach to desired config 

Note that we went another way to fix this problem which is in PR https://github.com/openshift/machine-config-operator/pull/1731, so to reproduce the bug one will have to get a MCO image which doesn't have this change.

Not sure what could have changed in cri-o which is causing the delay. It may have impact other places too.

Comment 1 Sinny Kumari 2020-05-14 14:25:45 UTC
Created attachment 1688482 [details]
MCD and crio journal log from local cluster run

For verbose log, sample jobs:
failing test run - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1724/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2262
passing test run - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2123

Maybe it will easier to find something useful from crio service log(couldn't find myself something alarming). For that I have added in attachment MCD pod log and crio service journal log from a local cluster I tried

Comment 2 Peter Hunt 2020-05-15 20:21:37 UTC
I have tested this with cri-o version 1.18.0-17.dev.rhaos4.5.gitdea34b9.el8 which I believe also has https://github.com/openshift/machine-config-operator/commit/45e35584cfec097a4d23381129e13851ee3a60be (hard to tell, the mcd version `4.5.0-202005150317.git.1.d3fe948.el8` doesn't seem to match upstream commits)
I am no longer encountering these timeouts and thus think this is fixed.

Comment 6 errata-xmlrpc 2020-07-13 17:38:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409