Bug 1666084
| Summary: | e2e tests are failing on OCP 4.0 cluster due to API server drop out, etcd timeouts, and intermittent pod liveness probe timeouts | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.0 | CC: | aos-bugs, bbennett, ccoleman, danw, jokerman, mifiedle, mmccomas, sponnaga, wking, xtian |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-29 13:37:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vikas Laad
2019-01-14 20:35:59 UTC
1/3 of CI runs fail due to etcdserver timeouts (possible etcd loss). Other components fail readiness checks periodically. 1/5 of installs fail. https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-4.0 We only pass ~25% of qualifying runs.

The kubelet's inability to communicate with the API server looks like a networking issue (unless the API server is down).

The kubelet uses host networking to get to the apiserver... it doesn't use anything from the SDN. So while the networking could be down, that would mean the node networking is down, and it feels unlikely for that to happen this often. The OpenShift apiserver isn't host network, though.

I wonder if this is openvswitch taking a nap.

(In reply to Casey Callendrello from comment #7)
> I wonder if this is openvswitch taking a nap.

Close, but exactly the opposite. It's openvswitch TOTALLY FREAKING OUT! Pod teardown is failing (probably fallout from the code reorg in the restart fix) and leaving cruft behind, so OVS is doing the "I'll log about these 300 missing veths attached to the bridge once a second forever" thing.

The fix did not resolve the e2e issues. It appears to be cleaning up devices, but we're still seeing the same symptoms as before.

e2e etcdserver timeouts were greatly reduced by https://github.com/openshift/installer/pull/1069

Clayton, do you think there's still a networking problem here even after bumping the AWS machines to more powerful ones?

Some problems have been fixed. Others have been found. This bug isn't tracking anything useful at this point.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
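The OVS symptom described above (stale veth ports left attached to the bridge after failed pod teardowns) can be checked for directly on a node. The following is a minimal, hypothetical sketch, not part of the fix: it assumes the openshift-sdn bridge is named `br0`, that `ovs-vsctl` is on the PATH, and that a port is "stale" when its backing host device no longer exists under /sys/class/net.

```python
#!/usr/bin/env python3
"""Hypothetical helper: count OVS bridge ports whose backing veth device
no longer exists on the host (the leftover "cruft" described in the
comments above). Assumes the openshift-sdn bridge name is br0."""

import os
import subprocess

BRIDGE = "br0"  # assumption: default openshift-sdn bridge name


def bridge_ports(bridge: str) -> list[str]:
    # `ovs-vsctl list-ports <bridge>` prints one port name per line.
    out = subprocess.run(
        ["ovs-vsctl", "list-ports", bridge],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def main() -> None:
    stale = [
        port for port in bridge_ports(BRIDGE)
        if port.startswith("veth") and not os.path.exists(f"/sys/class/net/{port}")
    ]
    print(f"{len(stale)} stale veth port(s) still attached to {BRIDGE}")
    for port in stale:
        print("  ", port)


if __name__ == "__main__":
    main()
```

On a node showing the behavior described above, the stale-port count would be large and growing, which is consistent with OVS repeatedly logging about devices it can no longer find.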