Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1666084

Summary: e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Networking
Assignee: Dan Winship <danw>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Meng Bo <bmeng>
Severity: urgent
Priority: urgent
Version: 4.1.0
CC: aos-bugs, bbennett, ccoleman, danw, jokerman, mifiedle, mmccomas, sponnaga, wking, xtian
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2019-01-29 13:37:42 UTC
Type: Bug

Description Vikas Laad 2019-01-14 20:35:59 UTC
Description of problem:
In the current set of e2e failures, the master drops out. The issue appears to be that the kubelet cannot complete liveness checks on some of the pods, which triggers restarts that cause brief windows of failures. That is the major known source of flakes right now (the symptom is that e2e tests fail because they cannot reach OpenShift API resources, or they report that the OpenShift API is down).
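
For illustration only (not taken from this bug's logs): the check a liveness probe performs is essentially an HTTP GET against the container with a short timeout, and a request that does not complete in time counts as a failure; enough consecutive failures make the kubelet restart the container, producing the blips described above. A minimal sketch in Go, assuming a hypothetical /healthz endpoint on 127.0.0.1:8080 and a 1-second timeout:

// probe_sketch.go: illustrative only; endpoint and timeout are assumptions.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe performs one HTTP-style liveness check. A timeout, connection
// error, or non-2xx/3xx status is treated as a probe failure.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err // slow or unreachable endpoint counts as a failure
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("unhealthy status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical endpoint; substitute the real pod address and port.
	if err := probe("http://127.0.0.1:8080/healthz", 1*time.Second); err != nil {
		fmt.Println("liveness probe failed:", err)
		return
	}
	fmt.Println("liveness probe succeeded")
}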

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-01-12-000105	

Actual results:
e2e tests failures

Expected results:
e2e tests should pass

Additional info:

Comment 4 Clayton Coleman 2019-01-15 14:59:12 UTC
1/3 of CI runs fail due to etcdserver timeouts (possible etcd loss).
Other components fail readiness checks periodically.
1/5 of installs fail.

https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-4.0

We only pass ~25% of qualifying runs.

Comment 5 Michal Fojtik 2019-01-15 15:20:27 UTC
The kubelet's inability to communicate with the API server looks like a networking issue (unless the API server is down).

Comment 6 Ben Bennett 2019-01-15 15:42:31 UTC
The kubelet uses host networking to get to the apiserver... it doesn't use anything from the SDN. So while the networking could be down, that would mean the node networking is down, and that seems unlikely to happen this often.

Comment 7 Casey Callendrello 2019-01-15 15:45:25 UTC
The OpenShift apiserver isn't host network.

I wonder if this is openvswitch taking a nap.

Comment 8 Dan Winship 2019-01-15 19:47:13 UTC
(In reply to Casey Callendrello from comment #7)
> I wonder if this is openvswitch taking a nap.

Close, but exactly the opposite. It's openvswitch TOTALLY FREAKING OUT! Pod teardown is failing (probably fallout from the code reorg in the restart fix) and leaving cruft behind, so OVS is doing the "I'll log about these 300 missing veths attached to the bridge once a second forever" thing.
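
For context on that symptom: when teardown fails, the veth stays registered as a port on the SDN bridge even though the host-side interface is gone, and OVS logs about each missing port over and over. A rough diagnostic sketch (purely illustrative, not part of the fix; it assumes the openshift-sdn bridge is named br0 and that ovs-vsctl is on PATH) that lists bridge ports with no matching host interface:

// stale_veths.go: illustrative only; bridge name br0 is an assumption.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Ask OVS for every port currently attached to the bridge.
	out, err := exec.Command("ovs-vsctl", "list-ports", "br0").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "ovs-vsctl failed:", err)
		os.Exit(1)
	}
	for _, port := range strings.Fields(string(out)) {
		// A port whose interface no longer exists on the host is the
		// kind of leftover cruft OVS keeps logging about. (Internal
		// ports such as vxlan0 may show up here too; this is a rough
		// check, not a precise one.)
		if _, err := os.Stat("/sys/class/net/" + port); os.IsNotExist(err) {
			fmt.Println("stale OVS port (no matching host interface):", port)
		}
	}
}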

Comment 9 Dan Winship 2019-01-15 20:55:28 UTC
https://github.com/openshift/origin/pull/21796

Comment 10 Clayton Coleman 2019-01-16 02:30:49 UTC
The fix did not resolve the e2e issues.  It appears to be cleaning up devices, but we're still seeing the same symptoms as before.

Comment 11 Clayton Coleman 2019-01-16 02:58:11 UTC
e2e etcdserver timeouts were greatly reduced by https://github.com/openshift/installer/pull/1069

Comment 13 Ben Bennett 2019-01-17 19:27:50 UTC
Clayton, do you think there's still a networking problem here even after bumping the AWS machines to more powerful ones?

Comment 14 Dan Winship 2019-01-29 13:37:42 UTC
Some problems have been fixed. Others have been found. This bug isn't tracking anything useful at this point.

Comment 15 Red Hat Bugzilla 2023-09-14 04:45:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days