Bug 1712851
| Summary: | linux-bridge and kubemacpool-system projects are deleted for unknown reason | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yan Du <yadu> |
| Component: | Networking | Assignee: | Petr Horáček <phoracek> |
| Status: | CLOSED ERRATA | QA Contact: | Meni Yakove <myakove> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 2.0 | CC: | atragler, cnv-qe-bugs, danken, dollierp, fsimonce, myakove, ncredi, sgordon, stirabos |
| Target Milestone: | --- | Target Release: | 2.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry:v2.0.0-29 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-24 20:16:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | log (attachment 1571970 [details]) | | |
Description
Yan Du, 2019-05-22 11:19:54 UTC

cluster-network-addons-operator:v2.0.0-13
hyperconverged-cluster-operator:v2.0.0-22

Created attachment 1571970 [details]: log
Thanks for reporting this, Yan. I took a look at the environment and wasn't able to find anything interesting, except maybe "cannot change kubemacpool config". I will try to run the latest U/S to see if that magically fixes the issue.

We have seen this a couple of times; I would like to consider it a 2.0.0 blocker.

Updated description of this bug: the CNV network components Linux bridge (CNI + marker) and kubemacpool are deployed in their respective namespaces by cluster-network-addons-operator. It appears that these namespaces are spontaneously removed on the QE environment overnight. On top of that, due to a bug in the operator, we are not able to recreate these namespaces.

There are two suspicious errors in the log. First, we have RBAC not allowing us to list some resources; that is IMO unrelated and not a problem at all, just noise:

> E0522 05:21:21.021235 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Network: Get https://172.30.0.1:443/apis/operator.openshift.io/v1/networks?resourceVersion=14555&timeoutSeconds=301&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021301 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/daemonsets?resourceVersion=71437&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021306 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to watch *v1.Deployment: Get https://172.30.0.1:443/apis/apps/v1/deployments?resourceVersion=494303&timeoutSeconds=575&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
> E0522 05:21:21.021235 1 reflector.go:251] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to watch *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?resourceVersion=72433&timeoutSeconds=506&watch=true: dial tcp 172.30.0.1:443: connect: connection refused

Second, there is an error where we are not able to reapply the kubemacpool configuration. Note that this happens only after those namespaces were removed, so it is probably not the cause of this issue:

> 2019/05/28 15:20:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:25:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:25:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:30:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:30:51 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:35:32 Reconciling update to kubemacpool-system/kubemacpool-mac-controller-manager
> 2019/05/28 15:35:52 Reconciling update to linux-bridge/kube-cni-linux-bridge-plugin
> 2019/05/28 15:37:21 reconciling NetworkAddonsConfig
> 2019/05/28 15:37:21 not applying unsafe change: invalid configuration: cannot modify KubeMacPool configuration once it is deployed

The kubemacpool error has been addressed in https://github.com/kubevirt/cluster-network-addons-operator/pull/99 and is yet to be tested by QE. Hopefully it fixes the issue. To me it is unclear what could cause the reinstallation of all network components; hopefully it will be easier to spot once we have the kubemacpool fix in.

Should be available to test now.
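For context, the "cannot modify KubeMacPool configuration once it is deployed" message refers to the kubeMacPool block of the cluster-wide NetworkAddonsConfig CR that cluster-network-addons-operator reconciles. A minimal sketch of such a CR is below; the apiVersion, field names, and MAC range values are illustrative and may differ from the version shipped in CNV 2.0:

```yaml
# Hedged sketch of a NetworkAddonsConfig CR; field names and values are
# illustrative and may not match the deployed operator version exactly.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig
metadata:
  name: cluster          # the operator reconciles a single cluster-scoped CR
spec:
  linuxBridge: {}        # deploys the linux-bridge CNI plugin and bridge-marker
  kubeMacPool:           # modifying this block after deployment is rejected as an unsafe change
    rangeStart: "02:00:00:00:00:00"
    rangeEnd: "02:00:00:00:FF:FF"
```

Deploying the operator with only one of these blocks at a time (as requested later in this report) helps isolate which component is tied to the namespace removal.
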
Got this issue again with OCP-4.1 / CNV-2.0 deployed using HCO_BUNDLE_REGISTRY_TAG=v2.0.0-22
> oc get namespaces
>
> NAME                      STATUS        AGE
> default                   Active        19h
> kube-public               Active        19h
> kube-system               Active        19h
> kubemacpool-system        Terminating   16h
> kubevirt-hyperconverged   Active        17h
> kubevirt-web-ui           Active        17h
> linux-bridge              Terminating   16h
> local-storage             Active        17h
> openshift                 Active        19h
BTW, it's using cluster-network-addons-operator:v2.0.0-16. Logs from cluster-network-addons-operator:
> 2019/06/06 13:09:43 reconciling (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 does not exist, creating (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker
> 2019/06/06 13:09:43 could not apply (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: could not create (extensions/v1beta1, Kind=DaemonSet) linux-bridge/bridge-marker: daemonsets.extensions "bridge-marker" is forbidden: unable to create new content in namespace linux-bridge because it is being terminated
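The "namespace ... is being terminated" error above is the API server refusing to create new objects in a namespace whose deletion is still in progress. Dumping the namespace object (for example with `oc get namespace linux-bridge -o yaml`) shows the phase and finalizers; a namespace stuck in Terminating looks roughly like the following sketch, which is illustrative and not captured from the affected cluster:

```yaml
# Hedged sketch of a namespace stuck in Terminating; timestamp is illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: linux-bridge
  deletionTimestamp: "2019-06-05T21:00:00Z"   # set when the delete request was issued
spec:
  finalizers:
  - kubernetes            # removed only after all namespaced content is gone
status:
  phase: Terminating      # no new objects can be created until deletion completes
```
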
Thanks, Denis. I think I found another clue; I should be able to provide another possible fix soon.

Meni, could you please deploy the network addons operator directly, without HCO? Once requesting only the linuxBridge component and once with kubeMacPool only, then let it run overnight. Thanks a lot.

Little update: the NetworkAddonsConfig cycles in the following failing loop:

1. HCO creates the CR.
2. The CR goes through Progressing until all components are deployed.
3. The second it turns Ready, someone removes the CR.
4. The CR is recreated by HCO, failing for a moment because the old namespaces are still being removed by garbage collection, and the cycle repeats.

Investigating who is removing the CR.

Little update #2: this error happens only when the NetworkAddonsConfig has an owner reference pointing to HyperConverged; when created without that reference, it is fine. Maybe it is the fact that we set an owner reference on a cluster-wide resource pointing to a namespaced resource (a sketch of such a reference is appended at the end of this report)? This doesn't seem to be an issue with older OpenShift versions. It looks like we cannot set the reference.

Changed HCO to use finalizers for NetworkAddonsConfig cleanup: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/122. Since the problem was reproduced only with the owner reference, this patch should fix it.

The issue has been fixed in hco-bundle-registry:v2.0.0-29, and pods in the linux-bridge and kubemacpool projects work well after the environments have been running for a few days.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1850
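As a footnote to the root cause discussed above: the failing pattern amounts to a cluster-scoped NetworkAddonsConfig carrying an ownerReference that points at the namespaced HyperConverged CR, roughly like the sketch below (the CR name, uid, and apiVersions are illustrative). Kubernetes owner references are not meant to cross the namespaced/cluster-scope boundary, and an owner the garbage collector cannot resolve is consistent with the maintainer's hypothesis that the CR is deleted the moment it turns Ready. Per the comment above, the merged fix replaces the owner reference with finalizer-based cleanup, so HCO deletes the NetworkAddonsConfig explicitly when the HyperConverged CR is removed instead of relying on garbage collection.

```yaml
# Hedged sketch of the problematic owner reference; names, uid, and apiVersions
# are illustrative. NetworkAddonsConfig is cluster-scoped while HyperConverged
# is namespaced, and owner references must not cross that boundary.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1alpha1
kind: NetworkAddonsConfig
metadata:
  name: cluster
  ownerReferences:
  - apiVersion: hco.kubevirt.io/v1alpha1
    kind: HyperConverged                     # namespaced owner of a cluster-scoped dependent
    name: hyperconverged-cluster             # lives in the kubevirt-hyperconverged namespace
    uid: 00000000-0000-0000-0000-000000000000
spec:
  linuxBridge: {}
```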