Description of problem:
The linux-bridge and kubemacpool-system projects are deleted for an unknown reason.

Version-Release number of selected component (if applicable):
CNV 2.0

How reproducible:
Sometimes

Steps to Reproduce:
1. Run CNV 2.0 on OCP 4.1 for 1-3 days (no manual project deletion performed on it)
2. Check the linux-bridge and kubemacpool-system projects

Actual results:
The linux-bridge and kubemacpool-system projects are deleted for an unknown reason.

Expected results:
The linux-bridge and kubemacpool-system projects should not be deleted.

Additional info:
1. Please check the cluster-network-addons-operator pod log below:

2019/05/22 08:50:41 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
2019/05/22 08:50:41 reconciling (/v1, Kind=Namespace) /linux-bridge
2019/05/22 08:50:41 reconciling (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
2019/05/22 08:50:41 does not exist, creating (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
2019/05/22 08:50:41 could not apply (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: could not create (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: daemonsets.extensions "bridge-marker" is forbidden: unable to create new content in namespace linux-bridge because it is being terminated

2. I didn't see any error for the kubemacpool-system project, but that project is deleted as well. For more logs, please refer to the attachment.
3. I tried to get some useful logs from openshift-api or kube-api, but unfortunately some audit logs seem to be missing from around the API pod restarts.
4. I don't have exact steps to reproduce it, but I have hit this issue at least 4 times.
cluster-network-addons-operator:v2.0.0-13 hyperconverged-cluster-operator:v2.0.0-22
Created attachment 1571970 [details] log
Thanks for reporting this, Yan. I took a look at the environment and wasn't able to find anything interesting, except maybe the "cannot change kubemacpool config" error. I will try to run the latest upstream build to see if that magically fixes the issue.
We have seen this a couple of times; I would like to consider it a 2.0.0 blocker.
Updated description of this bug:

The CNV network components Linux bridge (CNI + marker) and kubemacpool are deployed in their respective namespaces by cluster-network-addons-operator. It appears that these namespaces are spontaneously removed on the QE environment overnight. On top of that, due to a bug in the operator, we are not able to recreate these namespaces.

There are two suspicious errors in the log. First, we have RBAC not allowing us to list some resources; that is IMO unrelated and not a problem at all, just noise:

E0522 05:21:21.021235       1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Network: Get https://172.30.0.1:443/apis/operator.openshift.io/v1/networks?resourceVersion=14555&timeoutSeconds=301&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0522 05:21:21.021301       1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/daemonsets?resourceVersion=71437&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0522 05:21:21.021306       1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.Deployment: Get https://172.30.0.1:443/apis/apps/v1/deployments?resourceVersion=494303&timeoutSeconds=575&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
E0522 05:21:21.021235       1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?resourceVersion=72433&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

Then we have an error where we are not able to reapply the kubemacpool configuration. Note that this happens only after those namespaces were removed, so it is probably not the cause of this issue.
2019/05/28 15:20:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
2019/05/28 15:25:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
2019/05/28 15:25:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
2019/05/28 15:30:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
2019/05/28 15:30:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
2019/05/28 15:35:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
2019/05/28 15:35:52 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
2019/05/28 15:37:21 reconciling NetworkAddonsConfig
2019/05/28 15:37:21 not applying unsafe change: invalid configuration: cannot modify KubeMacPool configuration once it is deployed

The kubemacpool error has been addressed in https://github.com/kubevirt/cluster-network-addons-operator/pull/99 and is yet to be tested by QE. Hopefully it fixes the issue. It is still unclear to me what could cause the reinstallation of all network components; hopefully it will be easier to spot once we have the kubemacpool fix in.
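For context, kubemacpool (and Linux bridge) are requested through the cluster-scoped NetworkAddonsConfig CR, and once the MAC pool is deployed the operator refuses to change it, which produces the "unsafe change" error above. A minimal sketch of such a config follows; the MAC range values are illustrative placeholders based on my reading of the operator's API, not values taken from this cluster:

```yaml
# Illustrative sketch of a NetworkAddonsConfig requesting both components.
# The rangeStart/rangeEnd values are placeholders, not from the affected cluster.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig
metadata:
  name: cluster          # the CR is cluster-scoped
spec:
  linuxBridge: {}        # deploys the Linux bridge CNI plugin and bridge-marker
  kubeMacPool:
    rangeStart: "02:00:00:00:00:00"
    rangeEnd: "02:00:00:00:FF:FF"
```

Attempting to edit the kubeMacPool section of an already-deployed config is what the operator rejects as an unsafe change.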
Should be available to test now.
Got this issue again with OCP 4.1 / CNV 2.0 deployed using HCO_BUNDLE_REGISTRY_TAG=v2.0.0-22

> oc get namespaces
>
> NAME                      STATUS        AGE
> default                   Active        19h
> kube-public               Active        19h
> kube-system               Active        19h
> kubemacpool-system        Terminating   16h
> kubevirt-hyperconverged   Active        17h
> kubevirt-web-ui           Active        17h
> linux-bridge              Terminating   16h
> local-storage             Active        17h
> openshift                 Active        19h
BTW, it's using cluster-network-addons-operator:v2.0.0-16.
Logs from cluster-network-addons-operator:

> 2019/06/06 13:09:43 reconciling (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 does not exist, creating (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 could not apply (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: could not create (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: daemonsets.extensions "bridge-marker" is forbidden: unable to create new content in namespace linux-bridge because it is being terminated
Thanks Denis. I think I found another clue. Should provide another possible fix soon.
Meni, could you please deploy the network addons operator directly, without HCO? Once requesting only the linuxBridge component and once with kubeMacPool only. Then let it run overnight. Thanks a lot.
Little update: The NetworkAddonsConfig cycles through the following failing loop:

1. HCO creates the CR
2. The CR goes through Progressing until all components are deployed
3. The second it turns Ready, someone removes the CR
4. The CR is recreated by HCO, failing for a while (because the old namespaces are still being removed by garbage collection)
... and the loop repeats.

Investigating who is removing the CR.
Little update #2: This error happens only when the NetworkAddonsConfig has an owner reference set to HyperConverged. When it is created without that reference, everything is fine. Could the problem be that we set an owner reference on a cluster-scoped resource pointing to a namespaced resource? This does not seem to be an issue with older OpenShift versions.
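To illustrate the suspected pattern, here is a hypothetical reconstruction (not the actual manifest from the affected cluster; the uid is a placeholder): a cluster-scoped NetworkAddonsConfig carrying an ownerReference that points at the namespaced HyperConverged CR. Kubernetes garbage collection does not support a namespaced owner for a cluster-scoped dependent; the collector cannot resolve such an owner and may delete the dependent, which would explain the spontaneous removals.

```yaml
# Hypothetical reconstruction of the problematic object; values are placeholders.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig                 # cluster-scoped resource
metadata:
  name: cluster
  ownerReferences:
  - apiVersion: hco.kubevirt.io/v1alpha1
    kind: HyperConverged                  # namespaced owner: invalid for a cluster-scoped dependent
    name: hyperconverged-cluster
    uid: 00000000-0000-0000-0000-000000000000   # placeholder
```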
Looks like we indeed cannot set that reference. I changed HCO to use finalizers for NetworkAddonsConfig cleanup instead: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/122. Since the problem was reproduced only with the owner reference set, this patch should fix it.
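For illustration, finalizer-based cleanup works roughly as follows (the finalizer string below is hypothetical, not necessarily the one used in the PR): HCO puts a finalizer on its own HyperConverged CR; when that CR is deleted, HCO first deletes the NetworkAddonsConfig it created and only then removes the finalizer, letting the HyperConverged deletion complete. No owner reference is needed, so the garbage collector never touches the cluster-scoped CR.

```yaml
# Hypothetical sketch; the finalizer name is illustrative only.
apiVersion: hco.kubevirt.io/v1alpha1
kind: HyperConverged
metadata:
  name: hyperconverged-cluster
  namespace: kubevirt-hyperconverged
  finalizers:
  - hco.kubevirt.io/cleanup    # removed by HCO only after it has deleted the NetworkAddonsConfig
```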
The issue has been fixed in hco-bundle-registry:v2.0.0-29; the pods in the linux-bridge and kubemacpool projects have been working well after the environments ran for a few days.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1850