Description of problem:
Set up a cluster on Azure and found that the machine-config-daemon pods on some workers cannot reach https://172.30.0.1:443 (the Kubernetes API service IP):

for i in $(oc get pod --no-headers -n openshift-machine-config-operator -l k8s-app=machine-config-daemon | awk '{print $1}') ; do oc exec -n openshift-machine-config-operator $i -- curl -I --connect-timeout 10 https://172.30.0.1:443 -k ; done

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-8hsdt -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   234    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/2 403
audit-id: 8d912ed1-d2f0-483f-bbab-5c33908443af
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 214bbdaa-0ee6-402b-9a81-9dda1b0af0a3
x-kubernetes-pf-prioritylevel-uid: 54ac92b7-9e38-4408-9f7a-f12f914731e5
content-length: 234
date: Mon, 21 Sep 2020 10:04:45 GMT

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-qjttp -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   234    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/2 403
audit-id: 879d725e-ab73-419d-a5c2-24420b43c1be
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 214bbdaa-0ee6-402b-9a81-9dda1b0af0a3
x-kubernetes-pf-prioritylevel-uid: 54ac92b7-9e38-4408-9f7a-f12f914731e5
content-length: 234
date: Mon, 21 Sep 2020 10:04:47 GMT

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-qlmp2 -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10001 milliseconds
command terminated with exit code 28

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-tmvm7 -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10000 milliseconds
command terminated with exit code 28

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-znhl4 -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   234    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/2 403
audit-id: 94b326d6-a912-443e-9803-861fe8b69d0b
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 214bbdaa-0ee6-402b-9a81-9dda1b0af0a3
x-kubernetes-pf-prioritylevel-uid: 54ac92b7-9e38-4408-9f7a-f12f914731e5
content-length: 234
date: Mon, 21 Sep 2020 10:05:16 GMT

The HTTP/2 403 responses show that those pods can reach the API service IP (the anonymous request is simply rejected by the apiserver); the two pods that time out cannot reach 172.30.0.1:443 at all.

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Set up the cluster on Azure with OVN.
2. Observe that hostNetwork pods on some workers cannot access 172.30.0.1:443.

Actual results:
hostNetwork pods (e.g. machine-config-daemon) on some workers cannot reach the API service at 172.30.0.1:443.

Expected results:
hostNetwork pods on every worker can reach the API service at 172.30.0.1:443.

Additional info:
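For anyone re-running the check, here is a slightly tidier variant of the loop above (a sketch only, assuming the same namespace, label selector, and service IP shown in the output). It passes -c machine-config-daemon to oc exec so the "Defaulting container name" notice is suppressed, and swaps curl -I for -w '%{http_code}' so reachable nodes print 403 and broken ones fall through to the error branch:

# Sketch: same connectivity check, with the container selected explicitly.
for i in $(oc get pod --no-headers -n openshift-machine-config-operator -l k8s-app=machine-config-daemon | awk '{print $1}') ; do
  echo "=== $i"
  # -k skips TLS verification as in the original command; -s -o /dev/null -w prints only the HTTP status code.
  oc exec -n openshift-machine-config-operator "$i" -c machine-config-daemon -- \
    curl -k -s -o /dev/null -w '%{http_code}\n' --connect-timeout 10 https://172.30.0.1:443 \
    || echo "unreachable (exit $?)"
done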
Verified this bug on 4.6.0-0.nightly-2020-09-24-015627
*** Bug 1881660 has been marked as a duplicate of this bug. ***
This is still happening with 4.6.0-0.ci-2020-09-28-113704
@Christian Do you have a kubeconfig for that cluster? I think it might be linked to https://bugzilla.redhat.com/show_bug.cgi?id=1883513, but I need to confirm. /Alex
I think this bug had the problem described in #comment 3, but it might also have been hitting the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1883513 (for which the node deletion, I suspect, is not really the root cause)
@Alexander That cluster is down now, unfortunately, but https://bugzilla.redhat.com/show_bug.cgi?id=1883513 matches exactly. The issue first occurred on a re-deployment of the windows-mco pod, after deleting the Windows MachineSets (and the machines and nodes with them) from the first deployment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196