Bug 1712851
| Summary: | linux-bridge and kubemacpool-system projects are deleted for unknown reason | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yan Du <yadu> |
| Component: | Networking | Assignee: | Petr Horáček <phoracek> |
| Status: | CLOSED ERRATA | QA Contact: | Meni Yakove <myakove> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 2.0 | CC: | atragler, cnv-qe-bugs, danken, dollierp, fsimonce, myakove, ncredi, sgordon, stirabos |
| Target Milestone: | --- | Target Release: | 2.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry:v2.0.0-29 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-24 20:16:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | log (attachment 1571970 [details]) | | |
Description
Yan Du, 2019-05-22 11:19:54 UTC

cluster-network-addons-operator:v2.0.0-13
hyperconverged-cluster-operator:v2.0.0-22

Created attachment 1571970 [details]: log
Thanks for reporting this, Yan. I took a look at the environment and wasn't able to find anything interesting, except maybe "cannot change kubemacpool config". I will try to run the latest U/S to see if that magically fixes the issue.

We have seen this a couple of times; I would like to consider it a 2.0.0 blocker.

Updated description of this bug: the CNV network components Linux bridge (CNI + marker) and kubemacpool are deployed in their respective namespaces by cluster-network-addons-operator. It appears that these namespaces are spontaneously removed on the QE environment overnight. On top of that, due to a bug in the operator, we are not able to recreate these namespaces.

There are two suspicious errors in the log. First, we have RBAC not allowing us to list some resources; that is IMO unrelated and not a problem at all, just noise:

> E0522 05:21:21.021235 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Network: Get https://172.30.0.1:443/apis/operator.openshift.io/v1/networks?resourceVersion=14555&timeoutSeconds=301&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021301 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/daemonsets?resourceVersion=71437&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021306 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.Deployment: Get https://172.30.0.1:443/apis/apps/v1/deployments?resourceVersion=494303&timeoutSeconds=575&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021235 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?resourceVersion=72433&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

Second, there is an error where we are not able to reapply the kubemacpool configuration. Note that this happens only after those namespaces were removed, so it is probably not the cause of this issue:

> 2019/05/28 15:20:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:25:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:25:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:30:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:30:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:35:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:35:52 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:37:21 reconciling NetworkAddonsConfig
> 2019/05/28 15:37:21 not applying unsafe change: invalid configuration: cannot modify KubeMacPool configuration once it is deployed

The kubemacpool error has been addressed in https://github.com/kubevirt/cluster-network-addons-operator/pull/99 and is yet to be tested by QE. Hopefully it fixes the issue. To me it is unclear what could cause the reinstallation of all network components; hopefully it will be easier to spot once we have the kubemacpool fix in.

Should be available to test now.
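For context, the "cannot modify KubeMacPool configuration once it is deployed" message refers to the kubeMacPool block of the cluster-wide NetworkAddonsConfig CR that cluster-network-addons-operator reconciles. A minimal sketch of such a CR is below; the apiVersion, field names, and MAC range values are illustrative and may differ from the version shipped in CNV 2.0:

```yaml
# Hedged sketch of a NetworkAddonsConfig CR; field names and values are
# illustrative and may not match the deployed operator version exactly.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig
metadata:
  name: cluster          # the operator reconciles a single cluster-scoped CR
spec:
  linuxBridge: {}        # deploys the linux-bridge CNI plugin and bridge-marker
  kubeMacPool:           # modifying this block after deployment is rejected as an unsafe change
    rangeStart: "02:00:00:00:00:00"
    rangeEnd: "02:00:00:00:FF:FF"
```

Deploying the operator with only one of these blocks at a time (as requested later in this report) helps isolate which component is tied to the namespace removal.
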
Got this issue again with OCP-4.1 / CNV-2.0 deployed using HCO_BUNDLE_REGISTRY_TAG=v2.0.0-22
> oc get namespaces
>
> NAME                      STATUS        AGE
> default                   Active        19h
> kube-public               Active        19h
> kube-system               Active        19h
> kubemacpool-system        Terminating   16h
> kubevirt-hyperconverged   Active        17h
> kubevirt-web-ui           Active        17h
> linux-bridge              Terminating   16h
> local-storage             Active        17h
> openshift                 Active        19h
BTW, it's using cluster-network-addons-operator:v2.0.0-16. Logs from cluster-network-addons-operator:
> 2019/06/06 13:09:43 reconciling (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 does not exist, creating (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 could not apply (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: could not create (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: daemonsets.extensions "bridge-marker" is forbidden: unable to create new content in namespace linux-bridge because it is being terminated
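The "namespace ... is being terminated" error above is the API server refusing to create new objects in a namespace whose deletion is still in progress. Dumping the namespace object (for example with `oc get namespace linux-bridge -o yaml`) shows the phase and finalizers; a namespace stuck in Terminating looks roughly like the following sketch, which is illustrative and not captured from the affected cluster:

```yaml
# Hedged sketch of a namespace stuck in Terminating; timestamp is illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: linux-bridge
  deletionTimestamp: "2019-06-05T21:00:00Z"   # set when the delete request was issued
spec:
  finalizers:
  - kubernetes            # removed only after all namespaced content is gone
status:
  phase: Terminating      # no new objects can be created until deletion completes
```
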
Thanks, Denis. I think I found another clue; I should be able to provide another possible fix soon.

Meni, could you please deploy the network addons operator directly, without HCO? Once requesting only the linuxBridge component and once with kubeMacPool only, then let it run overnight. Thanks a lot.

Little update: the NetworkAddonsConfig cycles in the following failing loop:

1. HCO creates the CR.
2. The CR goes through Progressing until all components are deployed.
3. The second it turns Ready, someone removes the CR.
4. The CR is recreated by HCO, failing for a moment because the old namespaces are still being removed by garbage collection, and the cycle repeats.

Investigating who is removing the CR.

Little update #2: this error happens only when the NetworkAddonsConfig has an owner reference pointing to HyperConverged; when created without that reference, it is fine. Maybe it is the fact that we set an owner reference on a cluster-wide resource pointing to a namespaced resource (a sketch of such a reference is appended at the end of this report)? This doesn't seem to be an issue with older OpenShift versions. It looks like we cannot set the reference.

Changed HCO to use finalizers for NetworkAddonsConfig cleanup: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/122. Since the problem was reproduced only with the owner reference, this patch should fix it.

The issue has been fixed in hco-bundle-registry:v2.0.0-29, and pods in the linux-bridge and kubemacpool projects work well after the environments have been running for a few days.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1850
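As a footnote to the root cause discussed above: the failing pattern amounts to a cluster-scoped NetworkAddonsConfig carrying an ownerReference that points at the namespaced HyperConverged CR, roughly like the sketch below (the CR name, uid, and apiVersions are illustrative). Kubernetes owner references are not meant to cross the namespaced/cluster-scope boundary, and an owner the garbage collector cannot resolve is consistent with the maintainer's hypothesis that the CR is deleted the moment it turns Ready. Per the comment above, the merged fix replaces the owner reference with finalizer-based cleanup, so HCO deletes the NetworkAddonsConfig explicitly when the HyperConverged CR is removed instead of relying on garbage collection.

```yaml
# Hedged sketch of the problematic owner reference; names, uid, and apiVersions
# are illustrative. NetworkAddonsConfig is cluster-scoped while HyperConverged
# is namespaced, and owner references must not cross that boundary.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig
metadata:
  name: cluster
  ownerReferences:
  - apiVersion: hco.kubevirt.io/v1alpha1
    kind: HyperConverged                     # namespaced owner of a cluster-scoped dependent
    name: hyperconverged-cluster             # lives in the kubevirt-hyperconverged namespace
    uid: 00000000-0000-0000-0000-000000000000
spec:
  linuxBridge: {}
```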