Bug 1834895 - MCO e2e-gcp-op tests fail consistently on timeouts
Summary: MCO e2e-gcp-op tests fail consistently on timeouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1817465
Depends On:
Blocks:
 
Reported: 2020-05-12 15:40 UTC by Yu Qi Zhang
Modified: 2020-07-13 17:38 UTC
CC: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:37:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1731 0 None closed Bug 1834895: pkg/daemon: Set AddFunc on the nodeInformer as well 2021-02-05 23:46:51 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:38:13 UTC

Description Yu Qi Zhang 2020-05-12 15:40:44 UTC
Description of problem:

Currently, the e2e-gcp-op CI job fails 100% of the time; see: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

This has been determined to be a timeout caused by a behaviour change elsewhere within the past few days. Post-reboot, the MCD takes 5 minutes (instead of 30 seconds) before it becomes ready.


Version-Release number of selected component (if applicable):
4.5


How reproducible:
100%


Steps to Reproduce:
CI

Actual results:
Tests Fail

Expected results:
Tests Pass

Additional info:

Comment 1 Kirsten Garrison 2020-05-13 01:12:19 UTC
Also opened: https://bugzilla.redhat.com/show_bug.cgi?id=1835042

Something seems to have changed along the way, and these weird logs have started appearing.

Comment 2 Kirsten Garrison 2020-05-13 17:14:30 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1835368

Comment 3 Kirsten Garrison 2020-05-13 17:58:45 UTC
Wondering if this is somehow related to https://bugzilla.redhat.com/show_bug.cgi?id=1802534

Comment 4 W. Trevor King 2020-05-14 18:32:06 UTC
I don't think this needs a doc update.  When we tripped over the bug fixed by mco#1731, the downside was a potential kubelet-heartbeat (5 minute?) delay before the MCD noticed the new desiredConfig.  Eventually that heartbeat (or some other node change) would come through, and the MCD would notice and apply the desiredConfig.  It only bit us because we have tight timeout limits in the e2e suite that customers are unlikely to have in production clusters.  Or at least, any customer limits on desiredConfig application that require <5m latency are already brittle, so it doesn't seem worth a doc callout to say "maybe under some conditions we will exceed your overly strict desiredConfig latency assumptions", or whatever a doc update would look like ;).

Comment 7 Ryan Phillips 2020-05-18 14:53:33 UTC
*** Bug 1817465 has been marked as a duplicate of this bug. ***

Comment 8 Michael Nguyen 2020-05-18 15:02:42 UTC
CI is no longer failing consistently on timeouts.  Considering this verified, since the original issue was reported against CI.

Comment 9 errata-xmlrpc 2020-07-13 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

