Bug 1834895 - MCO e2e-gcp-op tests fail consistently on timeouts
Summary: MCO e2e-gcp-op tests fail consistently on timeouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1817465
Depends On:
Blocks:
 
Reported: 2020-05-12 15:40 UTC by Yu Qi Zhang
Modified: 2020-07-13 17:38 UTC
CC: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:37:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1731 0 None closed Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well 2021-02-05 23:46:51 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:38:13 UTC

Description Yu Qi Zhang 2020-05-12 15:40:44 UTC
Description of problem:

Currently, the e2e-gcp-op CI job fails 100% of the time; see: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

This has been determined to be a timeout caused by a behaviour change elsewhere within the past few days. Post-reboot, the MCD takes 5 minutes (instead of 30 seconds) before it becomes ready.


Version-Release number of selected component (if applicable):
4.5


How reproducible:
100%


Steps to Reproduce:
CI

Actual results:
Tests Fail

Expected results:
Tests Pass

Additional info:

Comment 1 Kirsten Garrison 2020-05-13 01:12:19 UTC
Also opened: https://bugzilla.redhat.com/show_bug.cgi?id=1835042

Something seems to have changed along the way, and these weird logs have started appearing.

Comment 2 Kirsten Garrison 2020-05-13 17:14:30 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1835368

Comment 3 Kirsten Garrison 2020-05-13 17:58:45 UTC
Wondering if this is somehow related to https://bugzilla.redhat.com/show_bug.cgi?id=1802534

Comment 4 W. Trevor King 2020-05-14 18:32:06 UTC
I don't think this needs a doc update.  When we tripped over the bug fixed by mco#1731, the downside was a potential kubelet-heartbeat (5 minute?) delay before the MCD noticed the new desiredConfig.  Eventually that heartbeat (or some other node change) would come through, and the MCD would notice and apply the desiredConfig.  It only bit us because we have tight timeout limits in the e2e suite that customers are unlikely to have in production clusters.  Or at least, any customer limits on desiredConfig application that require <5m latency are already brittle, so it doesn't seem worth a doc callout to say "maybe under some conditions we will exceed your overly strict desiredConfig latency assumptions", or whatever a doc update would look like ;).

Comment 7 Ryan Phillips 2020-05-18 14:53:33 UTC
*** Bug 1817465 has been marked as a duplicate of this bug. ***

Comment 8 Michael Nguyen 2020-05-18 15:02:42 UTC
CI is no longer failing consistently on timeouts.  Considering this verified, since the original issue was reported against CI.

Comment 9 errata-xmlrpc 2020-07-13 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

