Bug 1880425 - controllerconfig is not completed: status for ControllerConfig machine-config-controller is being reported for 0, expecting it for 1
Keywords:
Status: CLOSED DUPLICATE of bug 1874696
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Importance: high high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-18 13:53 UTC by David Eads
Modified: 2020-09-18 15:30 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-18 15:30:51 UTC
Target Upstream Version:
Embargoed:


Attachments:

Description David Eads 2020-09-18 13:53:26 UTC
I've seen a couple of runs with a new failure on our promotion jobs.

level=error msg="Cluster operator machine-config Degraded is True with MachineConfigControllerFailed: Unable to apply 4.6.0-0.nightly-2020-09-18-042646: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: status for ControllerConfig machine-config-controller is being reported for 0, expecting it for 1

This is impacting promotion across Azure and GCP, so marking this high until we can find a cause.

1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1306889915383943168
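
For anyone wanting to poke at the same check by hand, here is a minimal Python sketch. It assumes the completion gate compares metadata.generation against status.observedGeneration (which is what the "reported for 0, expecting it for 1" wording suggests) and uses the usual MCO CRD coordinates (machineconfiguration.openshift.io/v1, cluster-scoped controllerconfigs):

  # Sketch: inspect the ControllerConfig the MCO is waiting on.
  # Assumes machineconfiguration.openshift.io/v1, cluster-scoped "controllerconfigs",
  # and that "completed" means status.observedGeneration has caught up to
  # metadata.generation.
  from kubernetes import client, config

  config.load_kube_config()  # or load_incluster_config() when run inside a pod
  api = client.CustomObjectsApi()

  cc = api.get_cluster_custom_object(
      group="machineconfiguration.openshift.io",
      version="v1",
      plural="controllerconfigs",
      name="machine-config-controller",
  )

  generation = cc["metadata"]["generation"]
  observed = cc.get("status", {}).get("observedGeneration", 0)
  print(f"generation={generation} observedGeneration={observed}")
  if observed < generation:
      # This is the state the error above is describing.
      print("controllerconfig not completed: controller has not observed the latest spec")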

Comment 1 Yu Qi Zhang 2020-09-18 14:24:57 UTC
I don't think it's the MCO. Taking a look at both clusterversion.json files:

ovirt:
Multiple errors are preventing progress:\n* Could not update oauthclient \"console\" (366 of 605): the server does not recognize this resource, check extension API servers\n* Could not update role \"openshift-console-operator/prometheus-k8s\" (542 of 605): resource may have been deleted\n* Could not update rolebinding \"openshift/cluster-samples-operator-openshift-edit\" (336 of 605): resource may have been deleted

gcp:
"Multiple errors are preventing progress:\n* Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending\nStaticPodsDegraded: pods \"kube-apiserver-ci-op-84lfg31r-0dde3-rhpr8-master-2\" not found\nStaticPodsDegraded: pods \"kube-apiserver-ci-op-84lfg31r-0dde3-rhpr8-master-0\" not found\n* Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending\nStaticPodsDegraded: pods \"kube-controller-manager-ci-op-84lfg31r-0dde3-rhpr8-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-ci-op-84lfg31r-0dde3-rhpr8-master-0\" not found\n* Cluster operator kube-scheduler is reporting a failure: StaticPodsDegraded: pods \"openshift-kube-scheduler-ci-op-84lfg31r-0dde3-rhpr8-master-2\" not found\nStaticPodsDegraded: pods \"openshift-kube-scheduler-ci-op-84lfg31r-0dde3-rhpr8-master-0\" not found\nNodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending\n* Could not update oauthclient \"console\" (366 of 605): the server does not recognize this resource, check extension API servers\n* Could not update role \"openshift-console-operator/prometheus-k8s\" (542 of 605): resource may have been deleted\n* Could not update rolebinding \"openshift/cluster-samples-operator-openshift-edit\" (336 of 605): resource may have been deleted"

Not sure what the root cause is from that. Back to looking at the MCC logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1306889915383943168/artifacts/e2e-gcp-upgrade/pods/openshift-machine-config-operator_machine-config-controller-584778f8d-q6db7_machine-config-controller.log

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-machine-config-operator_machine-config-controller-6cddc9c6ff-grzz2_machine-config-controller.log

Both show "no route to host", so they cannot render the controllerconfig from the cluster-level configs. Maybe an SDN problem? Sending over to triage.
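
If the SDN theory is right, a plain TCP probe from the MCC pod to the apiserver service VIP (172.30.0.1:443 on the default service network, the same address that appears in the installer-pod error quoted in comment 2) should fail the same way; a quick sketch:

  # Sketch: probe the in-cluster apiserver service VIP from a pod on the affected
  # node, to separate "no route to host" from a TLS/auth problem.
  import socket

  def probe(host: str = "172.30.0.1", port: int = 443, timeout: float = 3.0) -> None:
      try:
          with socket.create_connection((host, port), timeout=timeout):
              print(f"TCP connect to {host}:{port} succeeded")
      except OSError as err:
          # EHOSTUNREACH here matches the "no route to host" in the MCC logs.
          print(f"TCP connect to {host}:{port} failed: {err}")

  probe()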

Comment 2 Dan Winship 2020-09-18 15:21:01 UTC
In the ovirt log, something is doing something terrible on master-0, e.g.:

Sep 18 05:13:48.951632 ovirt14-qndh7-master-0 kernel: device veth7b5922a3 entered promiscuous mode
Sep 18 05:13:48.956631 ovirt14-qndh7-master-0 kernel: device veth7b5922a3 left promiscuous mode
Sep 18 05:13:48.961478 ovirt14-qndh7-master-0 kernel: device vethe7ac967f entered promiscuous mode
Sep 18 05:13:48.961545 ovirt14-qndh7-master-0 kernel: device vethad6d4f49 entered promiscuous mode
Sep 18 05:13:48.968635 ovirt14-qndh7-master-0 kernel: device vethe7ac967f left promiscuous mode
Sep 18 05:13:48.968725 ovirt14-qndh7-master-0 kernel: device vethec7f5fb8 entered promiscuous mode
Sep 18 05:13:48.970749 ovirt14-qndh7-master-0 kernel: device veth16042bf4 entered promiscuous mode
Sep 18 05:13:48.973465 ovirt14-qndh7-master-0 kernel: device vethad6d4f49 left promiscuous mode
Sep 18 05:13:48.973524 ovirt14-qndh7-master-0 kernel: device vethec7f5fb8 left promiscuous mode
Sep 18 05:13:48.975071 ovirt14-qndh7-master-0 kernel: device vethbbfc3824 entered promiscuous mode
Sep 18 05:13:48.977713 ovirt14-qndh7-master-0 kernel: device veth16042bf4 left promiscuous mode
Sep 18 05:13:48.977763 ovirt14-qndh7-master-0 kernel: device vethbbfc3824 left promiscuous mode
Sep 18 05:13:48.978644 ovirt14-qndh7-master-0 kernel: device veth5fa8018c entered promiscuous mode
Sep 18 05:13:48.980674 ovirt14-qndh7-master-0 kernel: device vethf874aade entered promiscuous mode
Sep 18 05:13:48.983431 ovirt14-qndh7-master-0 kernel: device veth5fa8018c left promiscuous mode
Sep 18 05:13:48.983497 ovirt14-qndh7-master-0 kernel: device vethf874aade left promiscuous mode
Sep 18 05:13:48.984633 ovirt14-qndh7-master-0 kernel: device vethc486caea entered promiscuous mode

This happens constantly with every single veth, from the first time one is created; the very first veth is bounced a total of 43,147 times over half an hour. This is obviously not going to be good for network connectivity. (It's not clear exactly what is happening but probably either the veths are being taken down and brought back up or they're being removed from the bridge and added back?)
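
A rough way to reproduce that tally from the node journal (the journal filename here is just a placeholder for wherever the gathered logs end up):

  # Sketch: count promiscuous-mode transitions per veth in a node journal dump.
  import re
  from collections import Counter

  PATTERN = re.compile(r"device (veth\w+) (entered|left) promiscuous mode")

  counts = Counter()
  with open("master-0-journal.log", encoding="utf-8", errors="replace") as journal:
      for line in journal:
          match = PATTERN.search(line)
          if match:
              counts[match.group(1)] += 1

  # Devices that flap tens of thousands of times stand out immediately.
  for device, total in counts.most_common(10):
      print(f"{device}: {total} transitions")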

The ovs logs from that node (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/) are truncated. The sdn-previous logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-xxmlk_sdn_previous.log) show that it failed fairly quickly after startup and then restarted. The second run (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-xxmlk_sdn.log) doesn't show much besides the fact that something keeps running openshift-etcd/installer-2-ovirt14-qndh7-master-0 over and over again. (Not 43,147 times, just 54.) The last set of logs for that shows that it failed because of "Failed to get secret openshift-etcd/etcd-all-peer-2: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/secrets/etcd-all-peer-2": dial tcp 172.30.0.1:443: connect: no route to host", so that's just more effect, not more cause.


And the e2e-gcp-upgrade shows the same problem on master-2... WTF?

Oh, I think this is warring system-and-containerized OVS daemons; the current OVS log is truncated but the previous log shows:

  Failed to connect to bus: No data available
  openvswitch is running in container
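
If that's the case, the node journal should also show a host-level Open vSwitch coming up alongside the containerized one (which the pod log above already confirms); a hedged sketch of that check, where the ovs-vswitchd unit name and journal filename are assumptions:

  # Sketch: look for host-level OVS activity in the node journal, to confirm a
  # host ovs-vswitchd was running next to the containerized OVS.
  HOST_MARKERS = ("ovs-vswitchd", "Starting Open vSwitch")

  hits = []
  with open("master-0-journal.log", encoding="utf-8", errors="replace") as journal:
      for line in journal:
          if any(marker in line for marker in HOST_MARKERS):
              hits.append(line.rstrip())

  print(f"host OVS journal entries: {len(hits)}")
  for line in hits[:20]:
      print(line)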

Comment 3 Dan Winship 2020-09-18 15:30:51 UTC

*** This bug has been marked as a duplicate of bug 1874696 ***

