Versions:
-----------
[kni@provisionhost-0-0 ~]$ oc version
Client Version: 4.7.0-0.nightly-2021-02-04-031352
Server Version: 4.7.0-rc.2
Kubernetes Version: v1.20.0+bd9e442

Platform:
-----------
libvirt IPI (automated install with `openshift-baremetal-install`)

Description of problem:
--------------------------
Steps as below:

1. Create a MachineHealthCheck with a Ready unhealthy condition, as follows:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-event
  namespace: openshift-machine-api
  annotations:
    machine.openshift.io/remediation-strategy: external-baremetal
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: "Ready"
    status: "Unknown"
    timeout: "60s"

$ oc create -f example_mhc.yaml

2. Trigger the unhealthy Ready condition: suspend a worker node with the virsh suspend command.
3. Wait for the worker to come up again after it finishes the remediation flow (NotReady > reboot > removed from the node list > Ready).
4. After the worker node returns to Ready, repeat step 2 for a different worker node (a scripted sketch of this loop appears after the Additional info section below).
5. At this stage, we can no longer access the OpenShift web console, and some cluster operators are degraded (ingress, network, console, ...):

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-rc.2   False       True          True       85m
baremetal                                  4.7.0-rc.2   True        False         False      4h2m
cloud-credential                           4.7.0-rc.2   True        False         False      4h22m
cluster-autoscaler                         4.7.0-rc.2   True        False         False      4h1m
config-operator                            4.7.0-rc.2   True        False         False      4h2m
console                                    4.7.0-rc.2   True        False         True       3h32m
csi-snapshot-controller                    4.7.0-rc.2   True        False         False      4h1m
dns                                        4.7.0-rc.2   True        False         False      4h1m
etcd                                       4.7.0-rc.2   True        False         False      4h
image-registry                             4.7.0-rc.2   True        False         False      83m
ingress                                    4.7.0-rc.2   False       True          True       85m
insights                                   4.7.0-rc.2   True        False         False      3h55m
kube-apiserver                             4.7.0-rc.2   True        False         False      3h59m
kube-controller-manager                    4.7.0-rc.2   True        False         False      3h59m
kube-scheduler                             4.7.0-rc.2   True        False         False      4h
kube-storage-version-migrator              4.7.0-rc.2   True        False         False      83m
machine-api                                4.7.0-rc.2   True        False         False      3h58m
machine-approver                           4.7.0-rc.2   True        False         False      4h1m
machine-config                             4.7.0-rc.2   True        False         False      4h1m
marketplace                                4.7.0-rc.2   True        False         False      4h1m
monitoring                                 4.7.0-rc.2   False       True          True       52m
network                                    4.7.0-rc.2   True        True          False      4h2m
node-tuning                                4.7.0-rc.2   True        False         False      4h1m
openshift-apiserver                        4.7.0-rc.2   True        False         False      3h54m
openshift-controller-manager               4.7.0-rc.2   True        False         False      3h54m
openshift-samples                          4.7.0-rc.2   True        False         False      3h22m
operator-lifecycle-manager                 4.7.0-rc.2   True        False         False      4h1m
operator-lifecycle-manager-catalog         4.7.0-rc.2   True        False         False      4h1m
operator-lifecycle-manager-packageserver   4.7.0-rc.2   True        False         False      3h55m
service-ca                                 4.7.0-rc.2   True        False         False      4h2m
storage                                    4.7.0-rc.2   True        False         False      4h2m

[kni@provisionhost-0-0 ~]$ oc get pods -A | grep -v Running | grep -v Completed
NAMESPACE                        NAME                                     READY   STATUS             RESTARTS   AGE
openshift-ingress                router-default-86c576bb5b-k4znz          0/1     CrashLoopBackOff   61         3h17m
openshift-ingress                router-default-86c576bb5b-wfk2p          0/1     CrashLoopBackOff   31         90m
openshift-marketplace            certified-operators-4zp9j                0/1     ErrImagePull       0          90m
openshift-marketplace            certified-operators-bnrvj                0/1     ImagePullBackOff   0          90m
openshift-marketplace            community-operators-5qz5r                0/1     ImagePullBackOff   0          90m
openshift-marketplace            community-operators-qvxrt                0/1     ImagePullBackOff   0          90m
openshift-marketplace            redhat-marketplace-c9xkj                 0/1     ImagePullBackOff   0          90m
openshift-marketplace            redhat-marketplace-qxhpm                 0/1     ImagePullBackOff   0          90m
openshift-marketplace            redhat-operators-grcbr                   0/1     ImagePullBackOff   0          90m
openshift-marketplace            redhat-operators-k4rzs                   0/1     ImagePullBackOff   0          90m
openshift-monitoring             kube-state-metrics-7fc68b8968-85dpc      2/3     CrashLoopBackOff   20         90m
openshift-monitoring             prometheus-adapter-f86bd4948-rkkcg       0/1     CrashLoopBackOff   20         90m
openshift-monitoring             prometheus-adapter-f86bd4948-w2xbt       0/1     CrashLoopBackOff   20         90m
openshift-network-diagnostics    network-check-source-6f6554f7fb-694qp    0/1     CrashLoopBackOff   18         90m

After investigation, we found that the router pods are repeatedly logging this error:

message: "plate \"msg\"=\"router will coalesce reloads within an interval of each other\" \"interval\"=\"5s\"\nI0218 12:21:11.519906 1 router.go:332] template \"msg\"=\"watching for changes\" \"path\"=\"/etc/pki/tls/private\"\nI0218 12:21:11.519987 1 router.go:262] router \"msg\"=\"router is including routes in all namespaces\" \nI0218 12:21:41.515672 1 trace.go:205] Trace[871148119]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/template/service_lookup.go:33 (18-Feb-2021 12:21:11.514) (total time: 30000ms):\nTrace[871148119]: [30.000720147s] [30.000720147s] END\nE0218 12:21:41.515707 1 reflector.go:138] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0\": dial tcp 172.30.0.1:443: i/o timeout\nI0218 12:21:41.520965 1 trace.go:205] Trace[51718540]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:125 (18-Feb-2021 12:21:11.520) (total time: 30000ms):\nTrace[51718540]: [30.000513217s] [30.000513217s] END\nI0218 12:21:41.520964 1 trace.go:205] Trace[1061279151]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:125 (18-Feb-2021 12:21:11.520) (total time: 30000ms):\nTrace[1061279151]: [30.000580224s] [30.000580224s] END\nE0218 12:21:41.521011 1 reflector.go:138] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1.Route: failed to list *v1.Route: Get \"https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0\": dial tcp 172.30.0.1:443: i/o timeout\nE0218 12:21:41.521028 1 reflector.go:138] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Get \"https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1/endpointslices?limit=500&resourceVersion=0\": dial tcp 172.30.0.1:443: i/o timeout\n"

So it seems that the router pods fail to communicate with the default/kubernetes service IP (172.30.0.1).

install-config.yaml:
---------------------
apiVersion: v1
baseDomain: qe.lab.redhat.com
networking:
  networkType: OVNKubernetes
  machineCIDR: 192.168.123.0/24
metadata:
  name: ocp-edge-cluster-0
compute:
- name: worker
  replicas: 2
controlPlane:
  name: master
  replicas: 3
  platform:
    baremetal: {}
platform:
  baremetal:
    provisioningNetwork: Managed
    externalBridge: baremetal-0
    provisioningBridge: provisioning-0
    libvirtURI: qemu+ssh://root.qe.lab.redhat.com/system
    provisioningNetworkInterface: enp4s0
    provisioningNetworkCIDR: fd00:1101::/64
    bootstrapOSImage: http://registry.ocp-edge-cluster-0.qe.lab.redhat.com:8080/images/rhcos-47.83.202102090044-0-qemu.x86_64.qcow2.gz?sha256=5d31652c7856a87450dce1bbbb561b578ee75443c190096cb977a814e5f35935
    clusterOSImage: http://registry.ocp-edge-cluster-0.qe.lab.redhat.com:8080/images/rhcos-47.83.202102090044-0-openstack.x86_64.qcow2.gz?sha256=c1b93a426d0f74f0059193439e306f3356b788302a825cfadd870460e543028e
    apiVIP: 192.168.123.5
    ingressVIP: 192.168.123.10
    hosts:
    - name: openshift-master-0-0
      role: master
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/a5b3cf0b-4c70-47ad-9427-86cdf8e1b121
        disableCertificateVerification: True
        username: admin
        password: password
      bootMACAddress: 52:54:00:a8:69:7c
      rootDeviceHints:
        deviceName: /dev/sda
    - name: openshift-master-0-1
      role: master
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/3db9582b-b0fd-4249-99c5-3df495fdef3f
        disableCertificateVerification: True
        username: admin
        password: password
      bootMACAddress: 52:54:00:bc:ca:a6
      rootDeviceHints:
        deviceName: /dev/sda
    - name: openshift-master-0-2
      role: master
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/8b5fce40-c779-4d17-9776-0a87620ad053
        disableCertificateVerification: True
        username: admin
        password: password
      bootMACAddress: 52:54:00:0b:6c:ba
      rootDeviceHints:
        deviceName: /dev/sda
    - name: openshift-worker-0-0
      role: worker
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/a0a810ab-c615-485d-9345-6d1e9f5135d1
        disableCertificateVerification: True
        username: admin
        password: password
      bootMACAddress: 52:54:00:fb:83:0e
      rootDeviceHints:
        deviceName: /dev/sda
    - name: openshift-worker-0-1
      role: worker
      bmc:
        address: redfish://192.168.123.1:8000/redfish/v1/Systems/b5245f42-53b6-4169-81f5-49eaab7af5e7
        disableCertificateVerification: True
        username: admin
        password: password
      bootMACAddress: 52:54:00:de:f2:e2
      rootDeviceHints:
        deviceName: /dev/sda
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  MIIECjCCAvKgAwIBAgIUWwcKhSMb/lOkCLHkR0XaQBGEGFcwDQYJKoZIhvcNAQEL
  BQAwgYgxCzAJBgNVBAYTAlVTMQswCQYDVQQIDAJOQzEQMA4GA1UEBwwHUmFsZWln
  aDEVMBMGA1UECgwMVGVzdCBDb21wYW55MRAwDgYDVQQLDAdUZXN0aW5nMTEwLwYD
  VQQDDChzZWFsdXNhMjkubW9iaXVzLmxhYi5lbmcucmR1Mi5yZWRoYXQuY29tMB4X
  DTIxMDIxODA4NDE0NloXDTIyMDIxODA4NDE0NlowgYgxCzAJBgNVBAYTAlVTMQsw
  CQYDVQQIDAJOQzEQMA4GA1UEBwwHUmFsZWlnaDEVMBMGA1UECgwMVGVzdCBDb21w
  YW55MRAwDgYDVQQLDAdUZXN0aW5nMTEwLwYDVQQDDChzZWFsdXNhMjkubW9iaXVz
  LmxhYi5lbmcucmR1Mi5yZWRoYXQuY29tMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A
  MIIBCgKCAQEAzZ2liQQTiyM3xq7tulv21aZuuHUXxsw/vGfga/vxHomOjh0Y6moV
  EM3oOcbMpk7GTW+spfn7g1RJIU0o1xfPtMwU5MfSQxh1D98q43bpsLsxzCBWXSNI
  CNl7owA8ULsHB4l3/nfuXBpS6i1N2Vbw1DaBYnW9uHU7HYE7xZ1Zn85qWI57jMLG
  jKspn4OWQQk4bJIvxWxRPkJhPc3TMHKqpz5oNdJ7u2ex3Ru9LTjFKFHTXs6aiefc
  GqN9KIdZHFu3ctB+Pf0OmejAzDnCh/+hdFWCRYQLuD6UaGTq6a/SPh7KpgPQcHFk
  oNdYdx2Rvf33KBlr6ypK6tArgPU1nNRtKwIDAQABo2owaDAPBgNVHRMECDAGAQH/
  AgEAMFUGA1UdEQROMEyCLXJlZ2lzdHJ5Lm9jcC1lZGdlLWNsdXN0ZXItMC5xZS5s
  YWIucmVkaGF0LmNvbYIJc2VhbHVzYTI5hwTAqHsBhwQKCUyXhwTAqHoBMA0GCSqG
  SIb3DQEBCwUAA4IBAQBdep80bMzRlobDguMAMJ6RXBoIN8u2qic+P7lJU9V4dZcq
  FU+ThkXeOqa1YvWlsSPFL4JXLlluY6iEM4gpkRxfB14eDTaylkkWcmQMyhD4ZkRb
  MxDsjjE8RKRouHQm5UFPZiHMeiKo9pkdqyXQMidHNAQLutpvme/jT15yjH5Za6+l
  hHn0/N5TYNGKLfuwJrNFRcP3+gHU+IjVNtg2+Bt4XJK6IU51TYLVz2gaiv2x4LYz
  4AEWLNV96LA7sRsObO+k0ylesZbKNMVh3dizjMddfx8nByUGs1ukjO7XxdDpVYdy
  7z6vr8yBqB1WvHOhRT1YZ7+Jel3E4ChQV4mTUuxb
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  MIIENDCCAxygAwIBAgIJANunI0D662cnMA0GCSqGSIb3DQEBCwUAMIGlMQswCQYD
  VQQGEwJVUzEXMBUGA1UECAwOTm9ydGggQ2Fyb2xpbmExEDAOBgNVBAcMB1JhbGVp
  Z2gxFjAUBgNVBAoMDVJlZCBIYXQsIEluYy4xEzARBgNVBAsMClJlZCBIYXQgSVQx
  GzAZBgNVBAMMElJlZCBIYXQgSVQgUm9vdCBDQTEhMB8GCSqGSIb3DQEJARYSaW5m
  b3NlY0ByZWRoYXQuY29tMCAXDTE1MDcwNjE3MzgxMVoYDzIwNTUwNjI2MTczODEx
  WjCBpTELMAkGA1UEBhMCVVMxFzAVBgNVBAgMDk5vcnRoIENhcm9saW5hMRAwDgYD
  VQQHDAdSYWxlaWdoMRYwFAYDVQQKDA1SZWQgSGF0LCBJbmMuMRMwEQYDVQQLDApS
  ZWQgSGF0IElUMRswGQYDVQQDDBJSZWQgSGF0IElUIFJvb3QgQ0ExITAfBgkqhkiG
  9w0BCQEWEmluZm9zZWNAcmVkaGF0LmNvbTCCASIwDQYJKoZIhvcNAQEBBQADggEP
  ADCCAQoCggEBALQt9OJQh6GC5LT1g80qNh0u50BQ4sZ/yZ8aETxt+5lnPVX6MHKz
  bfwI6nO1aMG6j9bSw+6UUyPBHP796+FT/pTS+K0wsDV7c9XvHoxJBJJU38cdLkI2
  c/i7lDqTfTcfLL2nyUBd2fQDk1B0fxrskhGIIZ3ifP1Ps4ltTkv8hRSob3VtNqSo
  GxkKfvD2PKjTPxDPWYyruy9irLZioMffi3i/gCut0ZWtAyO3MVH5qWF/enKwgPES
  X9po+TdCvRB/RUObBaM761EcrLSM1GqHNueSfqnho3AjLQ6dBnPWlo638Zm1VebK
  BELyhkLWMSFkKwDmne0jQ02Y4g075vCKvCsCAwEAAaNjMGEwHQYDVR0OBBYEFH7R
  4yC+UehIIPeuL8Zqw3PzbgcZMB8GA1UdIwQYMBaAFH7R4yC+UehIIPeuL8Zqw3Pz
  bgcZMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgGGMA0GCSqGSIb3DQEB
  CwUAA4IBAQBDNvD2Vm9sA5A9AlOJR8+en5Xz9hXcxJB5phxcZQ8jFoG04Vshvd0e
  LEnUrMcfFgIZ4njMKTQCM4ZFUPAieyLx4f52HuDopp3e5JyIMfW+KFcNIpKwCsak
  oSoKtIUOsUJK7qBVZxcrIyeQV2qcYOeZhtS5wBqIwOAhFwlCET7Ze58QHmS48slj
  S9K0JAcps2xdnGu0fkzhSQxY8GPQNFTlr6rYld5+ID/hHeS76gq0YG3q6RLWRkHf
  4eTkRjivAlExrFzKcljC4axKQlnOvVAzz+Gm32U0xPBF4ByePVxCJUHw1TsyTmel
  RxNEp7yHoXcwn+fXna+t5JWh1gxUZty3
  -----END CERTIFICATE-----
pullSecret: |
  { "auths": { "registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000": { "auth": "b2NwLWVkZ2U6b2NwLWVkZ2UtcGFzcw==" } }}
fips: false
sshKey: |
  ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDh3V1FQqLaxn8MInLCRAd8tD6vnug8uiC6WSyfx6jbXGg4jYXkQGdpShB7uLzZFhfzIlsmQmY8JteEIQ7kCLo5v6K9hxhgS2bza+phgP+YmRrRk8+vCORWH9A4mmQxtJxc+y4AOK4wIqMsEW0Tea4ckG//vVet/n4YzCndnVqNdK4OB4kVXVrNHZexIzx3TvUqkBgClH7uo9OfnBdtaNU5rJBKm0rELsI8P6gBMbeDp3zm+0NSj/mR2+yrFJ55gh2IWZnCcZx1kDcrHakri0V8YCBvxzI3Rs9TUYxldoNskOGWqgiJCo5aLzAbkxgTIH4+DzZ9EZKkv0EURSCJTGy9v5W1UGKbZ3zEbJ4totRL2kwl2HpTTblyebYbycDv1XB3p0cg9tj0tCQiB8pvUo6qXUs7DwPLsWRDoDnKRziel3a+eQ2hdyyNgBkwXJywnI8z8Tf2NjSd946Hy6WiS7HIUHqmsg8u8s0RZGiJaJcO3Q62E87PHi9Pf0D0q/rAfbc= kni.qe.lab.redhat.com
imageContentSources:
- mirrors:
  - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - registry.ocp-edge-cluster-0.qe.lab.redhat.com:5000/localimages/local-release-image
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev

How reproducible:
-----------------
100%

Actual results:
------------------
We can't access the web console after the second node's remediation, and the cluster is not stable.

Expected results:
---------------------
We can access the web console and the cluster is stable.

Additional info:
---------------------
must-gather: https://drive.google.com/drive/folders/19euIuB4sqM7NjwcuxoOR-yWE2tePe5R7?usp=sharing
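For convenience, here is a minimal shell sketch of the reproduction loop in steps 2-4. It assumes the libvirt domain names match the node names and that virsh is run on the hypervisor host; both are assumptions for illustration, not commands taken from the actual test run.

#!/bin/bash
# Reproduction sketch: suspend each worker in turn and wait for the remediation
# flow (NotReady > reboot > node deleted > node re-registers > Ready) to finish.
for domain in openshift-worker-0-0 openshift-worker-0-1; do
  sudo virsh suspend "${domain}"   # Ready goes Unknown after the 60s MHC timeout
  # First wait for the node to stop reporting Ready (or to be deleted by remediation)...
  while oc get node "${domain}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q '^True$'; do
    sleep 30
  done
  # ...then wait for it to re-register and report Ready again.
  until oc get node "${domain}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q '^True$'; do
    sleep 30
  done
  oc get co   # check cluster operator health before suspending the next worker
done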
I'm not sure this is specifically a machine health check problem. Would it be more appropriate to involve the networking or console team for this bug?
You are probably right. I'm not sure, though, which subcomponent I should reassign the issue to. Reassigning to Routing for now.
This looks like an OVN-specific issue: I couldn't reproduce it on an openshift-sdn cluster, but I could reproduce it on an OVN cluster. I see that OVN puts some network-related annotations on the node, and when we remediate bare-metal nodes we delete the node from the cluster. Before we delete the node we save all of its labels and annotations and restore them once the node re-registers itself (we don't override existing annotations, though). I suspect this could be related.
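For reference, a rough way to see the annotations in question on a worker before and after it is remediated (a diagnostic sketch, not the remediation controller's actual code; the node name is taken from this cluster):

# Snapshot the OVN-Kubernetes annotations on a worker before triggering remediation;
# these are among the annotations the bare-metal remediation flow saves and restores.
oc get node openshift-worker-0-0 -o yaml | grep 'k8s.ovn.org' > ovn-annotations-before.txt

# After the node re-registers and becomes Ready again, compare against the saved copy:
oc get node openshift-worker-0-0 -o yaml | grep 'k8s.ovn.org' | diff ovn-annotations-before.txt -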
We've managed to figure out what was causing this. A fix has been posted upstream to ovn-org/ovn-kubernetes; I will use this bug to track the integration into OCP. /Alex
Thanks for submitting a fix, Alex. Can you please explain the root cause for the bug?
This made it in with the latest downstream merge (https://github.com/openshift/ovn-kubernetes/pull/440), so I am setting the bug to MODIFIED.
Alex, what is the root cause, please? Or which upstream commit in the merge addressed this? Thanks, Ross
Hi Ross, this was the commit that fixed it: https://github.com/openshift/ovn-kubernetes/pull/440/commits/97575ad96a8a2ba7a506a6e468afaa7c8af12578. Essentially, the problem was that when the node was deleted and re-created, we didn't sync the node subnet, leading to a misconfigured ovn-k8s-mp0. I guess you can ask the reporter to help you reproduce it if needed. /Alex
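For anyone verifying this, one way to spot the mismatch on an affected worker might be to compare the subnet recorded in the node annotation with what is actually configured on the management port. This is a diagnostic sketch: the annotation key k8s.ovn.org/node-subnets is the one OVN-Kubernetes uses for node subnets, ovn-k8s-mp0 is the interface named above, and the node name is taken from this report.

# Subnet that OVN-Kubernetes recorded for the node in its annotation:
oc get node openshift-worker-0-0 -o yaml | grep 'k8s.ovn.org/node-subnets'

# Address actually configured on the node's OVN management port; after the
# faulty node re-create it would not match the subnet shown above.
oc debug node/openshift-worker-0-0 -- chroot /host ip addr show ovn-k8s-mp0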
OK, take two: upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/2115. I will set this to MODIFIED once the downstream PR gets in.
I am setting the bug back to POST. The reason for this juggling is that PR openshift/ovn-kubernetes#496 did provide a correct fix, which should have resolved the bug. However, it was noticed last week that there could be cases not covered by that patch, so we've implemented an additional fix that should improve the general solution: PR openshift/ovn-kubernetes#497. I would like to have both fixes tracked by this bug and both back-ported once they've been verified on 4.8, which is why I am doing this. Conclusion: this bug can already be tested with what was delivered in openshift/ovn-kubernetes#496, and that should fix it; the new PR openshift/ovn-kubernetes#497 should improve the general solution.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438