Bug 1666084 - e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts
Summary: e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-14 20:35 UTC by Vikas Laad
Modified: 2023-09-14 04:45 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-29 13:37:42 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
  System ID: Github openshift/origin pull 21796
  Status: closed
  Summary: Fix pod teardown bug that eventually causes e2e lossage
  Last Updated: 2020-10-08 16:57:50 UTC

Description Vikas Laad 2019-01-14 20:35:59 UTC
Description of problem:
This covers the current set of e2e failures where the master drops out. The issue appears to be that the kubelet can't perform a liveness check on some of the pods, which causes a restart and a resulting blip of failures. That is the major known source of flakes right now (the symptom is that e2e tests fail because they can't hit OpenShift API resources, or you get "OpenShift API is down").
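For context, the mechanism described here is the kubelet's normal liveness-probe loop. The following is a minimal plain-Go sketch, not kubelet source; the endpoint, timeout, period, and failure threshold are illustrative assumptions. It shows how a probe that can't get an answer within its timeout accumulates consecutive failures until the threshold is hit and the container is restarted, which is the restart/blip described above.

package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	probeURL         = "http://127.0.0.1:8080/healthz" // hypothetical pod endpoint
	probeTimeout     = 1 * time.Second                 // give up on a single probe after this long
	probePeriod      = 10 * time.Second                // how often the kubelet probes
	failureThreshold = 3                               // consecutive failures before a restart
)

// probeOnce performs a single HTTP liveness check and reports any failure.
func probeOnce(client *http.Client) error {
	resp, err := client.Get(probeURL)
	if err != nil {
		return err // connection refused, timeout, etc.
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("unhealthy status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: probeTimeout}
	failures := 0
	for {
		if err := probeOnce(client); err != nil {
			failures++
			fmt.Printf("liveness probe failed (%d/%d): %v\n", failures, failureThreshold, err)
			if failures >= failureThreshold {
				// This is the point at which the real kubelet kills and restarts
				// the container, producing the brief API outage the tests see.
				fmt.Println("restarting container")
				failures = 0
			}
		} else {
			failures = 0
		}
		time.Sleep(probePeriod)
	}
}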

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-01-12-000105	

Actual results:
e2e test failures

Expected results:
e2e tests should pass

Additional info:

Comment 4 Clayton Coleman 2019-01-15 14:59:12 UTC
1/3 of CI runs fail due to etcdserver timeouts (possible etcd loss).
Other components fail readiness checks periodically.
1/5 of installs fail.

https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-4.0

We only pass ~25% of qualifying runs.

Comment 5 Michal Fojtik 2019-01-15 15:20:27 UTC
The kubelet's inability to communicate with the API server looks like a networking issue (unless the API server is down).

Comment 6 Ben Bennett 2019-01-15 15:42:31 UTC
The kubelet uses host networking to get to the apiserver... it doesn't use anything from the SDN. So while the networking could be down, that would mean the node networking is down, and that feels unlikely to happen this often.

Comment 7 Casey Callendrello 2019-01-15 15:45:25 UTC
The OpenShift apiserver isn't host-networked.

I wonder if this is openvswitch taking a nap.
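To make the host-network vs. SDN distinction in comments 6 and 7 concrete, here is a small diagnostic sketch. It assumes client-go recent enough that List takes a context, a working kubeconfig, and the usual openshift-apiserver namespace name; none of that comes from this bug. It prints whether each pod in a namespace sets hostNetwork, i.e. whether its traffic bypasses the SDN/OVS datapath or rides it.

package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to kubeconfig")
	namespace := flag.String("namespace", "openshift-apiserver", "namespace to inspect")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := clientset.CoreV1().Pods(*namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		// hostNetwork=true pods share the node's network namespace and bypass
		// the SDN; hostNetwork=false pods (such as the OpenShift apiserver pods
		// discussed here) depend on the OVS datapath being healthy.
		fmt.Printf("%s/%s hostNetwork=%v node=%s\n",
			pod.Namespace, pod.Name, pod.Spec.HostNetwork, pod.Spec.NodeName)
	}
}

Pods reporting hostNetwork=false depend on the OVS datapath, which is why apiserver connectivity can suffer when OVS misbehaves even though the kubelet's own path to the apiserver does not.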

Comment 8 Dan Winship 2019-01-15 19:47:13 UTC
(In reply to Casey Callendrello from comment #7)
> I wonder if this is openvswitch taking a nap.

Close, but exactly the opposite. It's openvswitch TOTALLY FREAKING OUT! Pod teardown is failing (probably fallout from the code reorg in the restart fix) and leaving cruft behind, so OVS is doing the "I'll log about these 300 missing veths attached to the bridge once a second forever" thing.
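For anyone reproducing this, here is a hedged diagnostic sketch of the symptom. It assumes the openshift-sdn bridge is named br0, that ovs-vsctl is on the node's PATH (run it on the node as root), and that pod ports are the veth* entries; none of those names are confirmed by this bug beyond the general description above. It lists the ports OVS still has attached to br0 and flags any whose host-side veth no longer exists, i.e. the leftovers from failed pod teardown.

package main

import (
	"fmt"
	"net"
	"os/exec"
	"strings"
)

func main() {
	// Ask OVS which ports it believes are attached to the SDN bridge.
	out, err := exec.Command("ovs-vsctl", "list-ports", "br0").Output()
	if err != nil {
		panic(err)
	}

	// Build a set of interfaces that actually exist on the host right now.
	ifaces, err := net.Interfaces()
	if err != nil {
		panic(err)
	}
	exists := make(map[string]bool, len(ifaces))
	for _, iface := range ifaces {
		exists[iface.Name] = true
	}

	// Only look at veth* pod ports; non-pod ports such as tun0 and vxlan0 are skipped.
	for _, port := range strings.Fields(string(out)) {
		if strings.HasPrefix(port, "veth") && !exists[port] {
			fmt.Printf("stale OVS port (no matching host interface): %s\n", port)
		}
	}
}

Each hit corresponds to a port left behind by a failed teardown.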

Comment 9 Dan Winship 2019-01-15 20:55:28 UTC
https://github.com/openshift/origin/pull/21796

Comment 10 Clayton Coleman 2019-01-16 02:30:49 UTC
The fix did not resolve the e2e issues.  It appears to be cleaning up devices, but we're still seeing the same symptoms as before.

Comment 11 Clayton Coleman 2019-01-16 02:58:11 UTC
e2e etcdserver timeouts were greatly reduced by https://github.com/openshift/installer/pull/1069

Comment 13 Ben Bennett 2019-01-17 19:27:50 UTC
Clayton, do you think there's still a networking problem here even after bumping the AWS machines to more powerful ones?

Comment 14 Dan Winship 2019-01-29 13:37:42 UTC
Some problems have been fixed. Others have been found. This bug isn't tracking anything useful at this point.

Comment 15 Red Hat Bugzilla 2023-09-14 04:45:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

