Description of problem:
Upgrade from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418 with two Windows nodes is blocked by the dns upgrade. Describing the dns-default pod that is not running shows this error:

Warning  FailedCreatePodSandBox  <invalid> (x131 over 159m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] [openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

profile: 53_IPI on AWS & OVN & WindowsContainer
upgrade job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/17007/

FYI: This issue is not related to the Windows nodes, since the nodeSelector for the dns-default pods is kubernetes.io/os=linux and the Windows nodes do not carry that label.
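The daemonset's node selector can also be confirmed directly (a quick sketch using standard jsonpath output; the path below is the usual DaemonSet pod-template field):

# oc -n openshift-dns get ds dns-default -o jsonpath='{.spec.template.spec.nodeSelector}'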
# oc get node -o wide
NAME                                         STATUS   ROLES    AGE     VERSION                       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-131-61.us-east-2.compute.internal    Ready    worker   4h54m   v1.21.1-1397+a678cfd2c37e87   10.0.131.61    <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-149-100.us-east-2.compute.internal   Ready    worker   4h48m   v1.21.1-1397+a678cfd2c37e87   10.0.149.100   <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   5h40m   v1.21.1+9807387               10.0.153.190   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   5h30m   v1.21.1+9807387               10.0.153.192   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   5h32m   v1.21.1+9807387               10.0.164.193   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   5h40m   v1.21.1+9807387               10.0.170.80    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   5h40m   v1.21.1+9807387               10.0.195.55    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   5h30m   v1.21.1+9807387               10.0.213.141   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8

# oc get node ip-10-0-131-61.us-east-2.compute.internal --show-labels
NAME                                        STATUS   ROLES    AGE    VERSION                       LABELS
ip-10-0-131-61.us-east-2.compute.internal   Ready    worker   5h1m   v1.21.1-1397+a678cfd2c37e87   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-e5o52r8,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h5m
baremetal                                  4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m
cloud-controller-manager                   4.9.0-0.nightly-2021-08-26-164418   True        False         False      4h1m
cloud-credential                           4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h42m
cluster-autoscaler                         4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h30m
config-operator                            4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m
console                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h46m
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m
dns                                        4.8.6                               True        True          False      5h32m   DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."...
etcd                                       4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m
image-registry                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m
ingress                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h25m
insights                                   4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m
kube-apiserver                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m
kube-controller-manager                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m
kube-scheduler                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h30m
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m
machine-api                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h28m
machine-approver                           4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m
machine-config                             4.8.6                               True        False         False      5h31m
marketplace                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m
monitoring                                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h24m
network                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m
node-tuning                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m
openshift-apiserver                        4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h28m
openshift-controller-manager               4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m
openshift-samples                          4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h42m
service-ca                                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m
storage                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m

# oc get co dns -oyaml
...
  - lastTransitionTime: "2021-08-27T03:16:14Z"
    message: |-
      DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
      Upgrading operator to "4.9.0-0.nightly-2021-08-26-164418".
      Upgrading coredns to "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd96e0593ca8040f5d7f361ea51182f02899bda1b7df3345f7f5d5998887e555".
      Upgrading kube-rbac-proxy to "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940".
    reason: DNSReportsProgressingIsTrueAndUpgrading
    status: "True"
    type: Progressing
...

# oc -n openshift-dns get pod -o wide | grep dns-default
dns-default-4vg2r   2/2   Running             0   5h31m   10.128.2.6    ip-10-0-213-141.us-east-2.compute.internal   <none>   <none>
dns-default-g2dxv   2/2   Running             0   5h37m   10.129.0.18   ip-10-0-195-55.us-east-2.compute.internal    <none>   <none>
dns-default-hrhpn   2/2   Running             0   5h37m   10.130.0.22   ip-10-0-170-80.us-east-2.compute.internal    <none>   <none>
dns-default-j7wdt   0/2   ContainerCreating   0   171m    <none>        ip-10-0-164-193.us-east-2.compute.internal   <none>   <none>
dns-default-qswnf   2/2   Running             0   5h30m   10.129.2.6    ip-10-0-153-192.us-east-2.compute.internal   <none>   <none>
dns-default-rchd6   2/2   Running             0   5h37m   10.128.0.21   ip-10-0-153-190.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-dns get ds dns-default
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         5       1            5           kubernetes.io/os=linux   5h38m

# oc -n openshift-dns describe pod dns-default-j7wdt
...
  Warning  FailedCreatePodSandBox  161m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14] [openshift-dns/dns-default-j7wdt b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
'
  Warning  FailedCreatePodSandBox  <invalid> (x131 over 159m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] [openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

Checked: only the ContainerCreating pod dns-default-j7wdt does not have annotations.

# for i in $(oc -n openshift-dns get pod | grep dns-default | awk '{print $1}'); do echo $i; oc -n openshift-dns get pod $i -oyaml | grep annotations -A2; echo -e "\n"; done
dns-default-4vg2r
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.6/23"],"mac_address":"0a:58:0a:80:02:06","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.2.3"}],"ip_address":"10.128.2.6/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-

dns-default-g2dxv
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.18/23"],"mac_address":"0a:58:0a:81:00:12","gateway_ips":["10.129.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.0.3"}],"ip_address":"10.129.0.18/23","gateway_ip":"10.129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-

dns-default-hrhpn
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.0.22/23"],"mac_address":"0a:58:0a:82:00:16","gateway_ips":["10.130.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.130.0.3"}],"ip_address":"10.130.0.22/23","gateway_ip":"10.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-

dns-default-j7wdt

dns-default-qswnf
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.6/23"],"mac_address":"0a:58:0a:81:02:06","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.6/23","gateway_ip":"10.129.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-

dns-default-rchd6
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.0.21/23"],"mac_address":"0a:58:0a:80:00:15","gateway_ips":["10.128.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.0.3"}],"ip_address":"10.128.0.21/23","gateway_ip":"10.128.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
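A quicker per-pod spot check is also possible (a sketch; the jsonpath expression prints the pod's annotation map and prints nothing for the stuck pod, which has no annotations at all):

# oc -n openshift-dns get pod dns-default-j7wdt -o jsonpath='{.metadata.annotations}'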
Version-Release number of selected component (if applicable):
upgrade from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418

How reproducible:
not sure

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
blocked by dns upgrade

Expected results:
no block for upgrade

Additional info:
omg logs -n openshift-ovn-kubernetes ovnkube-master-h8x2k -c ovnkube-master:

> namespace.go:585] Failed to get join switch port IP address for node ip-10-0-164-193.us-east-2.compute.internal: provided IP is already allocated

In ovnkube-master, while the nodes cache is syncing, it fails when an attempt is made to reserve logical router port IPs that already exist. This node is the one where the pod k8s_dns-default-j7wdt fails to get its IP from its annotation during CNI invocation, because OVN is unhealthy for this particular node. This is being worked on upstream so that the reservation does not fail if the IPs that already exist are the expected ones. See here: https://github.com/ovn-org/ovn-kubernetes/pull/2456

I have yet to determine blocker status - consulting Dan W.
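On a live cluster the same symptom can be located directly in the master logs (a sketch; the ovnkube-master pod name here comes from this cluster's must-gather and will differ elsewhere):

# oc -n openshift-ovn-kubernetes logs ovnkube-master-h8x2k -c ovnkube-master | grep "provided IP is already allocated"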
A possible fix has been merged in upstream ovn-kubernetes and a downstream cherry-pick has been opened. Waiting on CI to confirm the fix.
*** Bug 1999894 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  This happens independently of upgrades (regular clusters can hit it too), but only in a very specific scenario that most customers are very unlikely to hit. I can't estimate the percentage, but it must be less than 10%.

What is the impact? Is it serious enough to warrant blocking edges?
  One or several nodes can be impacted. The result is that the node is completely hosed: no pod can start, and all existing pods will lose networking.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Restarting the existing ovnkube-master leader instance __should__ fix it (a rough command sketch follows below).

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes. Only customers upgrading to or running the affected z-stream versions are impacted; the next z-stream releases will include the fix.
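A minimal sketch of that workaround, assuming the `app=ovnkube-master` label selects the master pods and that the current leader has already been identified (for example from the ovnkube-master container logs); the leader pod name below is a placeholder:

# oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o wide
# oc -n openshift-ovn-kubernetes delete pod <current-leader-pod>

The deleted pod is recreated automatically by its controller, and a new leader election should recover the node.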
Expanding on my previous comment:

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Only 4.8.10 and 4.7.29 are impacted by the problem, so any customer upgrading to those versions can be affected.
As this issue only occurs when there is a change in node count, it is not a typical scenario during upgrades; in general, users do not attempt to increase the node count during an upgrade. Hence we are removing the UpgradeBlocker keyword from the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759