Description of problem:
The current set of e2e failures is where the master drops out. The issue appears to be that the kubelet can't perform a liveness check on some of the pods, which triggers a restart and causes a blip of failures. That is the major known source of flakes right now (the symptom is that e2e tests fail because they can't reach OpenShift API resources, or you get an "OpenShift API is down" error).

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-01-12-000105

Actual results:
e2e test failures

Expected results:
e2e tests should pass

Additional info:
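For context on how a failed liveness check turns into a restart: the kubelet probes the container on a fixed period and kills it after a number of consecutive failures, which is exactly the window in which the e2e tests see the API disappear. Below is a minimal sketch of that probe loop, assuming a hypothetical /healthz endpoint and illustrative period/timeout/threshold values (not the actual probe configuration of these pods):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// Illustrative values only; the real kubelet takes these from the pod's
// livenessProbe spec (periodSeconds, timeoutSeconds, failureThreshold).
const (
	period           = 10 * time.Second
	probeTimeout     = 1 * time.Second
	failureThreshold = 3
	healthzURL       = "https://localhost:6443/healthz" // hypothetical endpoint
)

func main() {
	client := &http.Client{Timeout: probeTimeout}
	failures := 0
	for range time.Tick(period) {
		resp, err := client.Get(healthzURL)
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			failures = 0
			continue
		}
		if resp != nil {
			resp.Body.Close()
		}
		failures++
		fmt.Printf("probe failed (%d/%d): %v\n", failures, failureThreshold, err)
		if failures >= failureThreshold {
			// The real kubelet would kill and restart the container here,
			// which is the "blip" the e2e tests see.
			fmt.Println("failure threshold reached: container would be restarted")
			failures = 0
		}
	}
}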
1/3 of CI runs fail due to etcdserver timeouts (possible etcd loss). Other components fail readiness checks periodically. 1/5 of installs fail.
https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-4.0
We only pass ~25% of qualifying runs.
The kubelet's inability to communicate with the API server looks like a networking issue (unless the API server is down).
The kubelet uses host networking to get to the apiserver; it doesn't use anything from the SDN. So while the networking could be down, that would mean the node networking is down, and that feels unlikely to happen this often.
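One rough way to separate "node networking is down" from "SDN is down" is to probe both paths from a node: the kubelet-to-apiserver path goes over the host network to the apiserver's external/LB address, while service traffic goes through the SDN. A sketch under that assumption, with hypothetical addresses for the two paths (adjust to the real apiserver address and a service IP in the cluster network):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical endpoints: the apiserver LB address reached over the host
	// network, and the default kubernetes service IP reached over the SDN.
	targets := map[string]string{
		"host network (kubelet -> apiserver)": "api.example.openshift.com:6443",
		"SDN (service network)":               "172.30.0.1:443",
	}
	for name, addr := range targets {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			fmt.Printf("%-40s unreachable: %v\n", name, err)
			continue
		}
		conn.Close()
		fmt.Printf("%-40s reachable in %v\n", name, time.Since(start))
	}
}

If the host-network path fails too, the problem is node networking rather than the SDN.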
The OpenShift apiserver isn't host-networked, though. I wonder if this is openvswitch taking a nap.
(In reply to Casey Callendrello from comment #7)
> I wonder if this is openvswitch taking a nap.

Close, but exactly the opposite. It's openvswitch TOTALLY FREAKING OUT! Pod teardown is failing (probably fallout from the code reorg in the restart fix) and leaving cruft behind, so OVS is doing the "I'll log about these 300 missing veths attached to the bridge once a second forever" thing.
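A rough way to confirm the "cruft left behind" symptom on a node is to list the ports on the SDN bridge and flag any whose veth no longer exists on the host. This is only a diagnostic sketch, assuming the bridge is named br0 (the openshift-sdn default) and that ovs-vsctl is available on the node:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// br0 is the openshift-sdn bridge; adjust the name if your SDN differs.
	out, err := exec.Command("ovs-vsctl", "list-ports", "br0").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "ovs-vsctl failed:", err)
		os.Exit(1)
	}
	for _, port := range strings.Fields(string(out)) {
		// A bridge port whose network device is gone is the "missing veth"
		// that OVS keeps logging about once a second.
		if _, err := os.Stat("/sys/class/net/" + port); os.IsNotExist(err) {
			fmt.Println("stale OVS port (veth gone):", port)
		}
	}
}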
https://github.com/openshift/origin/pull/21796
The fix did not resolve the e2e issues. It appears to be cleaning up devices, but we're still seeing the same symptoms as before.
e2e etcdserver timeouts were greatly reduced by https://github.com/openshift/installer/pull/1069
Clayton, do you think there's still a networking problem here even after bumping the AWS machines to more powerful ones?
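For checking whether etcd is still the bottleneck after the machine-type bump, one option is to time requests against etcd's /health endpoint from a master and watch for the same timeouts the apiserver reports. A sketch only, assuming hypothetical client cert paths and the local client URL (substitute the real etcd client credentials for your cluster):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	// Hypothetical paths and endpoint; point these at the real etcd client
	// certs and client URL on the master.
	const (
		caFile    = "/etc/ssl/etcd/ca.crt"
		certFile  = "/etc/ssl/etcd/client.crt"
		keyFile   = "/etc/ssl/etcd/client.key"
		healthURL = "https://localhost:2379/health"
	)
	ca, err := ioutil.ReadFile(caFile)
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(ca)
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		panic(err)
	}
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}},
		},
	}
	// Sample the health endpoint a few times and report the latency of each
	// check; consistently slow or failing checks point back at etcd.
	for i := 0; i < 10; i++ {
		start := time.Now()
		resp, err := client.Get(healthURL)
		if err != nil {
			fmt.Println("etcd health check failed:", err)
		} else {
			resp.Body.Close()
			fmt.Printf("etcd /health: %s in %v\n", resp.Status, time.Since(start))
		}
		time.Sleep(time.Second)
	}
}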
Some problems have been fixed. Others have been found. This bug isn't tracking anything useful at this point.