I've seen a couple of runs with a new failure on our promotion jobs:

level=error msg="Cluster operator machine-config Degraded is True with MachineConfigControllerFailed: Unable to apply 4.6.0-0.nightly-2020-09-18-042646: timed out waiting for the condition during waitForControllerConfigToBeCompleted: controllerconfig is not completed: status for ControllerConfig machine-config-controller is being reported for 0, expecting it for 1"

This is impacting promotion across Azure and GCP, so marking high until we can find a cause.

1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1306889915383943168
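(For anyone who wants to see the same status the operator is checking on a live cluster rather than in the CI artifacts, a quick sketch, assuming cluster-admin access; the generation comparison is my reading of the "reported for 0, expecting it for 1" message, not something confirmed from the artifacts:)

# Show the Degraded condition the promotion job is tripping on
oc get clusteroperator machine-config -o yaml

# Show the ControllerConfig the MCO is waiting on; the "reported for 0, expecting it for 1"
# message appears to come from comparing its generation and observedGeneration
oc get controllerconfig machine-config-controller -o yaml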
I don't think it's the MCO. Taking a look at both clusterversion.json files:

ovirt:

Multiple errors are preventing progress:
* Could not update oauthclient "console" (366 of 605): the server does not recognize this resource, check extension API servers
* Could not update role "openshift-console-operator/prometheus-k8s" (542 of 605): resource may have been deleted
* Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (336 of 605): resource may have been deleted

gcp:

Multiple errors are preventing progress:
* Cluster operator kube-apiserver is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 2:
NodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending
StaticPodsDegraded: pods "kube-apiserver-ci-op-84lfg31r-0dde3-rhpr8-master-2" not found
StaticPodsDegraded: pods "kube-apiserver-ci-op-84lfg31r-0dde3-rhpr8-master-0" not found
* Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 3:
NodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending
StaticPodsDegraded: pods "kube-controller-manager-ci-op-84lfg31r-0dde3-rhpr8-master-2" not found
StaticPodsDegraded: pods "kube-controller-manager-ci-op-84lfg31r-0dde3-rhpr8-master-0" not found
* Cluster operator kube-scheduler is reporting a failure: StaticPodsDegraded: pods "openshift-kube-scheduler-ci-op-84lfg31r-0dde3-rhpr8-master-2" not found
StaticPodsDegraded: pods "openshift-kube-scheduler-ci-op-84lfg31r-0dde3-rhpr8-master-0" not found
NodeInstallerDegraded: 1 nodes are failing on revision 3:
NodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending
* Could not update oauthclient "console" (366 of 605): the server does not recognize this resource, check extension API servers
* Could not update role "openshift-console-operator/prometheus-k8s" (542 of 605): resource may have been deleted
* Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (336 of 605): resource may have been deleted

Not sure what the root cause is from that. Back to looking at the MCC logs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/1306889915383943168/artifacts/e2e-gcp-upgrade/pods/openshift-machine-config-operator_machine-config-controller-584778f8d-q6db7_machine-config-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-machine-config-operator_machine-config-controller-6cddc9c6ff-grzz2_machine-config-controller.log

Both are full of "no route to host" errors, so the controller cannot render a ControllerConfig from the cluster-level configs. Maybe an SDN problem? Sending over to triage.
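(For reference, roughly how I'm pulling those messages out of the downloaded artifacts; file names here are just whatever the local copies are called, not the exact artifact layout:)

# Failing condition message from the gathered ClusterVersion object
jq -r '.status.conditions[] | select(.type=="Failing") | .message' clusterversion.json

# Connectivity errors in the machine-config-controller logs
grep -i 'no route to host' machine-config-controller*.log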
In the ovirt log, something is doing something terrible on master-0, e.g.:

Sep 18 05:13:48.951632 ovirt14-qndh7-master-0 kernel: device veth7b5922a3 entered promiscuous mode
Sep 18 05:13:48.956631 ovirt14-qndh7-master-0 kernel: device veth7b5922a3 left promiscuous mode
Sep 18 05:13:48.961478 ovirt14-qndh7-master-0 kernel: device vethe7ac967f entered promiscuous mode
Sep 18 05:13:48.961545 ovirt14-qndh7-master-0 kernel: device vethad6d4f49 entered promiscuous mode
Sep 18 05:13:48.968635 ovirt14-qndh7-master-0 kernel: device vethe7ac967f left promiscuous mode
Sep 18 05:13:48.968725 ovirt14-qndh7-master-0 kernel: device vethec7f5fb8 entered promiscuous mode
Sep 18 05:13:48.970749 ovirt14-qndh7-master-0 kernel: device veth16042bf4 entered promiscuous mode
Sep 18 05:13:48.973465 ovirt14-qndh7-master-0 kernel: device vethad6d4f49 left promiscuous mode
Sep 18 05:13:48.973524 ovirt14-qndh7-master-0 kernel: device vethec7f5fb8 left promiscuous mode
Sep 18 05:13:48.975071 ovirt14-qndh7-master-0 kernel: device vethbbfc3824 entered promiscuous mode
Sep 18 05:13:48.977713 ovirt14-qndh7-master-0 kernel: device veth16042bf4 left promiscuous mode
Sep 18 05:13:48.977763 ovirt14-qndh7-master-0 kernel: device vethbbfc3824 left promiscuous mode
Sep 18 05:13:48.978644 ovirt14-qndh7-master-0 kernel: device veth5fa8018c entered promiscuous mode
Sep 18 05:13:48.980674 ovirt14-qndh7-master-0 kernel: device vethf874aade entered promiscuous mode
Sep 18 05:13:48.983431 ovirt14-qndh7-master-0 kernel: device veth5fa8018c left promiscuous mode
Sep 18 05:13:48.983497 ovirt14-qndh7-master-0 kernel: device vethf874aade left promiscuous mode
Sep 18 05:13:48.984633 ovirt14-qndh7-master-0 kernel: device vethc486caea entered promiscuous mode

This happens constantly with every single veth, from the first time one is created; the very first veth is bounced a total of 43,147 times over half an hour. This is obviously not going to be good for network connectivity. (It's not clear exactly what is happening, but probably either the veths are being taken down and brought back up, or they're being removed from the bridge and added back?)

The ovs logs from that node (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/) are truncated.

The sdn-previous logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-xxmlk_sdn_previous.log) show that it failed fairly quickly after startup and then restarted. The second run (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306813402781323264/artifacts/e2e-ovirt/pods/openshift-sdn_sdn-xxmlk_sdn.log) doesn't show much besides something running openshift-etcd/installer-2-ovirt14-qndh7-master-0 over and over again. (Not 43,147 times, just 54.) Although the last set of logs for that show that it failed because of "Failed to get secret openshift-etcd/etcd-all-peer-2: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/secrets/etcd-all-peer-2": dial tcp 172.30.0.1:443: connect: no route to host", so that's just more effect, not more cause.

And the e2e-gcp-upgrade run shows the same problem on master-2... WTF?
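(In case anyone wants to reproduce the bounce count, something along these lines against the node journal should do it; the journal file name is just whatever the local artifact copy is called:)

# Count bounces of the first veth
grep -c 'veth7b5922a3 entered promiscuous mode' master-0-journal.log

# Per-veth counts, highest first
grep 'entered promiscuous mode' master-0-journal.log \
  | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn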
Oh, I think this is warring system and containerized OVS daemons; the current OVS log is truncated, but the previous log shows:

Failed to connect to bus: No data available
openvswitch is running in container
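(Rough way to confirm the two-daemons theory on an affected node; the host unit names are from memory, so treat them as assumptions:)

# Is the host-level OVS running on the node?
oc debug node/<node-name> -- chroot /host systemctl status openvswitch ovs-vswitchd ovsdb-server

# Is a containerized OVS also running under openshift-sdn?
oc -n openshift-sdn get pods -o wide | grep -i ovs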
*** This bug has been marked as a duplicate of bug 1874696 ***