Bug 1666084 - e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts
Summary: e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-14 20:35 UTC by Vikas Laad
Modified: 2023-09-14 04:45 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-29 13:37:42 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
  System ID: Github openshift/origin pull 21796
  Status: closed
  Summary: Fix pod teardown bug that eventually causes e2e lossage
  Last Updated: 2020-10-08 16:57:50 UTC

Description Vikas Laad 2019-01-14 20:35:59 UTC
Description of problem:
This covers the current set of e2e failures where the master drops out. The issue appears to be that the kubelet can't perform a liveness check on some of the pods, which causes a restart and a resulting blip of failures. That is the major known source of flakes right now (the symptom is that e2e tests fail because they can't hit OpenShift API resources, or you get "OpenShift API is down").
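For context, the mechanism described here is the kubelet's normal liveness-probe loop. The following is a minimal plain-Go sketch, not kubelet source; the endpoint, timeout, period, and failure threshold are illustrative assumptions. It shows how a probe that can't get an answer within its timeout accumulates consecutive failures until the threshold is hit and the container is restarted, which is the restart/blip described above.

package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	probeURL         = "http://127.0.0.1:8080/healthz" // hypothetical pod endpoint
	probeTimeout     = 1 * time.Second                 // give up on a single probe after this long
	probePeriod      = 10 * time.Second                // how often the kubelet probes
	failureThreshold = 3                               // consecutive failures before a restart
)

// probeOnce performs a single HTTP liveness check and reports any failure.
func probeOnce(client *http.Client) error {
	resp, err := client.Get(probeURL)
	if err != nil {
		return err // connection refused, timeout, etc.
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("unhealthy status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: probeTimeout}
	failures := 0
	for {
		if err := probeOnce(client); err != nil {
			failures++
			fmt.Printf("liveness probe failed (%d/%d): %v\n", failures, failureThreshold, err)
			if failures >= failureThreshold {
				// This is the point at which the real kubelet kills and restarts
				// the container, producing the brief API outage the tests see.
				fmt.Println("restarting container")
				failures = 0
			}
		} else {
			failures = 0
		}
		time.Sleep(probePeriod)
	}
}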

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-01-12-000105	

Actual results:
e2e test failures

Expected results:
e2e tests should pass

Additional info:

Comment 4 Clayton Coleman 2019-01-15 14:59:12 UTC
1/3 of CI runs fail due to etcdserver timeouts (possible etcd loss).
Other components fail readiness checks periodically.
1/5 of installs fail.

https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-4.0

We only pass ~25% of qualifying runs.

Comment 5 Michal Fojtik 2019-01-15 15:20:27 UTC
The kubelet's inability to communicate with the API server looks like a networking issue (unless the API server is down).

Comment 6 Ben Bennett 2019-01-15 15:42:31 UTC
The kubelet uses host networking to get to the apiserver... it doesn't use anything from the SDN. So while the networking could be down, that would mean the node networking is down, and that feels unlikely to happen this often.

Comment 7 Casey Callendrello 2019-01-15 15:45:25 UTC
The OpenShift apiserver isn't host-networked.

I wonder if this is openvswitch taking a nap.
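To make the host-network vs. SDN distinction in comments 6 and 7 concrete, here is a small diagnostic sketch. It assumes client-go recent enough that List takes a context, a working kubeconfig, and the usual openshift-apiserver namespace name; none of that comes from this bug. It prints whether each pod in a namespace sets hostNetwork, i.e. whether its traffic bypasses the SDN/OVS datapath or rides it.

package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	namespace := flag.String("namespace", "openshift-apiserver", "namespace to inspect")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods(*namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		// hostNetwork=true pods share the node's network namespace and bypass
		// the SDN; hostNetwork=false pods (such as the OpenShift apiserver pods
		// discussed here) depend on the OVS datapath being healthy.
		fmt.Printf("%s/%s hostNetwork=%v node=%s\n",
			pod.Namespace, pod.Name, pod.Spec.HostNetwork, pod.Spec.NodeName)
	}
}

Pods reporting hostNetwork=false depend on the OVS datapath, which is why apiserver connectivity can suffer when OVS misbehaves even though the kubelet's own path to the apiserver does not.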

Comment 8 Dan Winship 2019-01-15 19:47:13 UTC
(In reply to Casey Callendrello from comment #7)
> I wonder if this is openvswitch taking a nap.

Close, but exactly the opposite. It's openvswitch TOTALLY FREAKING OUT! Pod teardown is failing (probably fallout from the code reorg in the restart fix) and leaving cruft behind, so OVS is doing the "I'll log about these 300 missing veths attached to the bridge once a second forever" thing.
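For anyone reproducing this, here is a hedged diagnostic sketch of the symptom. It assumes the openshift-sdn bridge is named br0, that ovs-vsctl is on the node's PATH (run it on the node as root), and that pod ports are the veth* entries; none of those names are confirmed by this bug beyond the general description above. It lists the ports OVS still has attached to br0 and flags any whose host-side veth no longer exists, i.e. the leftovers from failed pod teardown.

package main

import (
	"fmt"
	"net"
	"os/exec"
	"strings"
)

func main() {
	// Ask OVS which ports it believes are attached to the SDN bridge.
	out, err := exec.Command("ovs-vsctl", "list-ports", "br0").Output()
	if err != nil {
		panic(err)
	}

	// Build a set of interfaces that actually exist on the host right now.
	ifaces, err := net.Interfaces()
	if err != nil {
		panic(err)
	}
	exists := make(map[string]bool, len(ifaces))
	for _, iface := range ifaces {
		exists[iface.Name] = true
	}

	// Only look at veth* pod ports; non-pod ports such as tun0 and vxlan0 are skipped.
	for _, port := range strings.Fields(string(out)) {
		if strings.HasPrefix(port, "veth") && !exists[port] {
			fmt.Printf("stale OVS port (no matching host interface): %s\n", port)
		}
	}
}

Each hit corresponds to a port left behind by a failed teardown.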

Comment 9 Dan Winship 2019-01-15 20:55:28 UTC
https://github.com/openshift/origin/pull/21796

Comment 10 Clayton Coleman 2019-01-16 02:30:49 UTC
The fix did not resolve the e2e issues.  It appears to be cleaning up devices, but we're still seeing the same symptoms as before.

Comment 11 Clayton Coleman 2019-01-16 02:58:11 UTC
e2e etcdserver timeouts were greatly reduced by https://github.com/openshift/installer/pull/1069

Comment 13 Ben Bennett 2019-01-17 19:27:50 UTC
Clayton, do you think there's still a networking problem here even after bumping the AWS machines to more powerful ones?

Comment 14 Dan Winship 2019-01-29 13:37:42 UTC
Some problems have been fixed. Others have been found. This bug isn't tracking anything useful at this point.

Comment 15 Red Hat Bugzilla 2023-09-14 04:45:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

