Bug 1539987
Summary: | Under load, openshift-sdn reports "link not found" and fails health check and restarts itself on GCP | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Networking | Assignee: | Dan Winship <danw> |
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.9.0 | CC: | aos-bugs, bbennett, hongli |
Target Milestone: | --- | ||
Target Release: | 3.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Bugs in golang's handling of network namespaces could cause problems after setting up a new pod.
Consequence: Multiple forms of sporadic networking-related failures in the OpenShift node service, including (1) the atomic-openshift-node service spontaneously restarting due to the OVS health check failing, (2) the node reporting an incorrect IP address to the master and thus becoming inaccessible for a period of time, (3) temporary pod creation failures that would eventually resolve themselves without user intervention.
Fix: All pod-setup and pod-teardown operations in a pod's network namespace are now performed in a separate process from the main atomic-openshift-node service.
Result: No sporadic network failures
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-03-28 14:23:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Clayton Coleman
2018-01-30 01:28:31 UTC
Looks like a non-trivial amount of our flakes (1/3? 1/4?) are from this or something like it.

Also https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18323/test_pull_request_origin_extended_conformance_gce/15217/

Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh origin-node[2102]: F0129 16:18:21.842955 2102 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting:
Jan 29 16:18:21 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Unit origin-node.service entered failed state.
Jan 29 16:18:22 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service failed.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 29 16:18:27 ci-prtest-5a37c28-15217-ig-n-sqrh systemd[1]: Starting OpenShift Node...

This looks like another symptom of https://github.com/openshift/origin/issues/15991 (which has suddenly started popping up everywhere, though we don't know why now). The "link not found" suggests that the healthcheck is getting run from a thread that has leaked into the wrong network namespace. (And in the 2nd and 3rd examples linked above, the SDN health check failure is preceded by errors like "CNI request failed with status 400: 'error on port veth70fa69b6: "could not open network device veth70fa69b6 (No such device)"'" when the logs suggest that that device definitely should exist, again suggesting that code was running in the wrong namespace.)

Will be fixed by https://github.com/openshift/origin/pull/18355.

Verified in openshift v3.9.0-0.48.0; cannot reproduce the issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
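
The diagnosis above (a health check running on a thread that has leaked into the wrong network namespace) can be illustrated with a small Linux-only Go sketch. This is not the code from the origin pull request; the function name checkThreadNetns is hypothetical. It assumes /proc is mounted and that /proc/thread-self is available, and it only compares the calling thread against the process's main thread, so it would not notice a leak on the main thread itself.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// checkThreadNetns (hypothetical helper) reports the ID of the calling OS
// thread and whether that thread is in the same network namespace as the
// process's main thread.
func checkThreadNetns() (tid int, same bool, err error) {
	// Pin the goroutine so the thread ID and the namespace link below are
	// read from the same OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	tid = syscall.Gettid()

	// /proc/self/ns/net describes the main thread of the process.
	procNS, err := os.Readlink("/proc/self/ns/net")
	if err != nil {
		return tid, false, err
	}
	// /proc/thread-self/ns/net describes the calling thread.
	threadNS, err := os.Readlink("/proc/thread-self/ns/net")
	if err != nil {
		return tid, false, err
	}
	return tid, procNS == threadNS, nil
}

func main() {
	tid, same, err := checkThreadNetns()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read namespace links:", err)
		os.Exit(1)
	}
	if !same {
		fmt.Printf("thread %d is in a different network namespace than the process\n", tid)
		return
	}
	fmt.Printf("thread %d shares the process network namespace\n", tid)
}
```

A check like this can only detect the symptom. The fix described in the Doc Text avoids the problem by doing the namespace work in a separate process, so that any namespace change ends when that process exits.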