Bug 1834895

Summary: MCO e2e-gcp-op tests fail consistently on timeouts
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.5
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Yu Qi Zhang <jerzhang>
Assignee: Yu Qi Zhang <jerzhang>
QA Contact: Michael Nguyen <mnguyen>
CC: kgarriso, nmoraiti, pasik, skumari, wking
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-07-13 17:37:59 UTC

Description Yu Qi Zhang 2020-05-12 15:40:44 UTC
Description of problem:

Currently the e2e-gcp-op CI test fails 100% of the time, see: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

The failures have been traced to a timeout caused by a behaviour change elsewhere over the past few days: after a reboot, the MCD takes about 5 minutes (instead of roughly 30 seconds) to become ready.


Version-Release number of selected component (if applicable):
4.5


How reproducible:
100%


Steps to Reproduce:
CI

Actual results:
Tests Fail

Expected results:
Tests Pass

Additional info:

Comment 1 Kirsten Garrison 2020-05-13 01:12:19 UTC
Also opened: https://bugzilla.redhat.com/show_bug.cgi?id=1835042

Something seems to have changed along the way, and these odd log messages have started appearing.

Comment 2 Kirsten Garrison 2020-05-13 17:14:30 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1835368

Comment 3 Kirsten Garrison 2020-05-13 17:58:45 UTC
Wondering if this is somehow related to https://bugzilla.redhat.com/show_bug.cgi?id=1802534

Comment 4 W. Trevor King 2020-05-14 18:32:06 UTC
I don't think this needs a doc update.  When we tripped over the bug fixed by mco#1731, the downside was a potential kubelet-heartbeat delay (5 minutes?) before the MCD noticed the new desiredConfig.  Eventually that heartbeat (or another node change) would come through, and the MCD would notice and apply the desiredConfig.  It only bit us because we have tight timeout limits in the e2e suite that customers are unlikely to have in production clusters.  Any customer limits on desiredConfig application that require <5m latencies are already brittle, so it doesn't seem worth a doc callout to say "maybe under some conditions we will exceed your overly-strict desiredConfig latency assumptions", or whatever a doc update would look like ;).

Comment 7 Ryan Phillips 2020-05-18 14:53:33 UTC
*** Bug 1817465 has been marked as a duplicate of this bug. ***

Comment 8 Michael Nguyen 2020-05-18 15:02:42 UTC
CI is no longer failing consistently on timeouts.  Considering this verified, as the original issue was reported against CI.

Comment 9 errata-xmlrpc 2020-07-13 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409