Bug 1820118
Summary: | Kuryr-cni restarts during conformance tests due to namespace not found | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jon Uriarte <juriarte> | |
Component: | Node | Assignee: | Peter Hunt <pehunt> | |
Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.4 | CC: | aos-bugs, bbennett, jokerman, mdulko, mpatel, nagrawal, pehunt, rphillips | |
Target Milestone: | --- | |||
Target Release: | 4.5.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1825339 1838116 | Environment: | |
Last Closed: | 2020-07-13 17:25:01 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
Bug Blocks: | 1825339 |
Description
Jon Uriarte
2020-04-02 10:10:05 UTC
One correction: it's reproduced when running conformance tests (not NP tests).

I believe this is a problem with CRI-O. As you can see in the CNI specification [1], a namespace must exist when CNI is called. Kuryr is merely an implementation of a CNI plugin, so it relies on the guarantees in the CNI spec being upheld. Analyzing the full log [2], we can see that the namespace did not exist when the CNI requests were handled.

[1] https://github.com/containernetworking/cni/blob/master/SPEC.md#general-considerations
[2] http://pastebin.test.redhat.com/851168

This is part of what https://github.com/openshift/machine-config-operator/pull/1568 was trying to solve. Currently, CRI-O references the network namespace of a pod by the pid of its infra container. This is inherently racy, and prone to issues where the process has been cleaned up, but the container remains in CRI-O's state. A more reliable way is for CRI-O to manage the namespace lifecycle.

This capability is in 4.4, but as found, it currently doesn't play well with third party networking plugins, so further work is needed.

Does this happen every test run, or is it intermittent (indicative of a race)?

(In reply to Peter Hunt from comment #3)
Hi Peter,
I could reproduce it each time I ran the kubernetes/conformance tests (280 tests) from the origin repo. I noticed it because the kuryr-cni pod restarted about 6-8 times during the tests. I saw it in two different environments, so I believe it is reproducible.
Jon

(In reply to Peter Hunt from comment #3)
Alright, so I guess we'll need to figure out how to make Kuryr SDN work fine with that change. We had issues with getting SDN pods to access network namespaces when they're in another directory. It's all interconnected, it seems!

Note: I am working on a PR (https://github.com/cri-o/cri-o/pull/3509) that will use /var/run/netns for network namespaces, instead of /var/run/crio/ns/. That *should* mean you won't need changes to Kuryr to accommodate CRI-O managing its namespace lifecycle.

(In reply to Peter Hunt from comment #6)
I don't think this will help with the core of the issues we had with the patch putting namespaces into /var/run/crio/ns/. In general we can mount whatever we like from the host into the kuryr-cni containers, and I had a patch mounting /var/run/crio. The problem we had was with permissions: somehow our code couldn't access the network namespaces due to file permission issues, even though kuryr-daemon runs as root. Also, the SELinux logs showed no issues. Any idea why that could happen?

Oops, I know what's happening there: https://github.com/cri-o/cri-o/blob/33f0cafcd2e81eae9c1be723d6b1ccc44d70838b/pkg/config/config.go#L755
This will also be fixed by the PR changing the location for 1.17: https://github.com/cri-o/cri-o/pull/3530

So the above PR merged. The status of this bug is as follows:
We are working on getting CRI-O to manage namespace lifecycle into 4.5. There's one known blocking bug there, which some version of https://github.com/openshift/cluster-network-operator/pull/573 will fix. Once that gets in, and we switch CRI-O to do so, we're going to let it sit for a moment to make sure it does not break anything else. Once we know we didn't break anyone, we will also make the switch in 4.4.
I would estimate that all can be done in the next two weeks.

Dup'ed this to 4.4.z; this one can be for 4.5.

CRI-O is now managing namespace lifecycle in 4.5 after https://github.com/openshift/machine-config-operator/pull/1689 merged. Moving this to modified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
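
For anyone reading this thread later, the race discussed in comment #3 can be illustrated with a rough sketch. This is not CRI-O or Kuryr code. The helper names (`pid_netns_path`, `pinned_netns_path`, `netns_available`) and the `proc_root`/`netns_dir` parameters are made up for illustration; only the `/proc/<pid>/ns/net` layout (standard Linux) and the `/var/run/netns` directory (from comment #6) come from real sources. A pid-based reference stops resolving as soon as the infra container process is gone, while a pinned namespace path stays valid until whoever manages it removes it.

```python
from pathlib import Path


def pid_netns_path(pid: int, proc_root: str = "/proc") -> Path:
    """Namespace reference derived from a process id (the racy approach).

    The path only resolves while the process with this pid is alive.
    """
    return Path(proc_root) / str(pid) / "ns" / "net"


def pinned_netns_path(name: str, netns_dir: str = "/var/run/netns") -> Path:
    """Namespace reference by a pinned path (the managed approach).

    `name` is a hypothetical identifier; the real naming scheme is not
    described in this bug.
    """
    return Path(netns_dir) / name


def netns_available(path: Path) -> bool:
    """Return True if the namespace reference currently resolves."""
    return path.exists()
```

A CNI plugin handed a pid-based path can see `netns_available` turn false between the runtime recording the pid and the plugin handling the request, which would match the "namespace not found" symptom reported here.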