Bug 1998423

Summary:	upgrade from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418, blocked by dns upgrade due to FailedCreatePodSandBox for pods
Product:	OpenShift Container Platform	Reporter:	Junqi Zhao <juzhao>
Component:	Networking	Assignee:	Alexander Constantinescu <aconstan>
Networking sub component:	ovn-kubernetes	QA Contact:	huirwang
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aconstan, anbhat, astoycos, danw, dhellmann, lmohanty, mkennell, scuppett, wking, zzhao
Version:	4.9	Keywords:	Upgrades
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1999894 (view as bug list)		Environment:
Last Closed:	2021-10-18 17:49:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1999895

Description Junqi Zhao 2021-08-27 06:50:01 UTC

Description of problem:
upgrade from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418 with two windows nodes, blocked by dns upgrade, describe the not running dns-default pod
error is
  Warning  FailedCreatePodSandBox  <invalid> (x131 over 159m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] [openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

profile: 53_IPI on AWS & OVN & WindowsContainer
upgrade job:  https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/17007/
FYI: This issue is not related to windows nodes, since nodeSelector for dns-default pods are kubernetes.io/os=linux, windows nodes do not have such label
# oc get node -o wide
NAME                                         STATUS   ROLES    AGE     VERSION                       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-131-61.us-east-2.compute.internal    Ready    worker   4h54m   v1.21.1-1397+a678cfd2c37e87   10.0.131.61    <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-149-100.us-east-2.compute.internal   Ready    worker   4h48m   v1.21.1-1397+a678cfd2c37e87   10.0.149.100   <none>        Windows Server 2019 Datacenter                                 10.0.17763.2114                docker://20.10.6
ip-10-0-153-190.us-east-2.compute.internal   Ready    master   5h40m   v1.21.1+9807387               10.0.153.190   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-153-192.us-east-2.compute.internal   Ready    worker   5h30m   v1.21.1+9807387               10.0.153.192   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-164-193.us-east-2.compute.internal   Ready    worker   5h32m   v1.21.1+9807387               10.0.164.193   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-170-80.us-east-2.compute.internal    Ready    master   5h40m   v1.21.1+9807387               10.0.170.80    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-195-55.us-east-2.compute.internal    Ready    master   5h40m   v1.21.1+9807387               10.0.195.55    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8
ip-10-0-213-141.us-east-2.compute.internal   Ready    worker   5h30m   v1.21.1+9807387               10.0.213.141   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108161759-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-11.rhaos4.8.git5d31399.el8

# oc get node ip-10-0-131-61.us-east-2.compute.internal --show-labels
NAME                                        STATUS   ROLES    AGE    VERSION                       LABELS
ip-10-0-131-61.us-east-2.compute.internal   Ready    worker   5h1m   v1.21.1-1397+a678cfd2c37e87   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-e5o52r8,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h5m    
baremetal                                  4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m   
cloud-controller-manager                   4.9.0-0.nightly-2021-08-26-164418   True        False         False      4h1m    
cloud-credential                           4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h42m   
cluster-autoscaler                         4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h30m   
config-operator                            4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m   
console                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h46m   
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   
dns                                        4.8.6                               True        True          False      5h32m   DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."...
etcd                                       4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m   
image-registry                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m   
ingress                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h25m   
insights                                   4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m   
kube-apiserver                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h26m   
kube-controller-manager                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m   
kube-scheduler                             4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h30m   
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   
machine-api                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h28m   
machine-approver                           4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h31m   
machine-config                             4.8.6                               True        False         False      5h31m   
marketplace                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   
monitoring                                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h24m   
network                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m   
node-tuning                                4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m   
openshift-apiserver                        4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h28m   
openshift-controller-manager               4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m   
openshift-samples                          4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h47m   
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-26-164418   True        False         False      3h42m   
service-ca                                 4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h33m   
storage                                    4.9.0-0.nightly-2021-08-26-164418   True        False         False      5h32m   

# oc get co dns -oyaml
...
  - lastTransitionTime: "2021-08-27T03:16:14Z"
    message: |-
      DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."
      Upgrading operator to "4.9.0-0.nightly-2021-08-26-164418".
      Upgrading coredns to "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fd96e0593ca8040f5d7f361ea51182f02899bda1b7df3345f7f5d5998887e555".
      Upgrading kube-rbac-proxy to "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940".
    reason: DNSReportsProgressingIsTrueAndUpgrading
    status: "True"
    type: Progressing
...


# oc -n openshift-dns get pod -o wide | grep dns-default
dns-default-4vg2r     2/2     Running             0          5h31m   10.128.2.6     ip-10-0-213-141.us-east-2.compute.internal   <none>           <none>
dns-default-g2dxv     2/2     Running             0          5h37m   10.129.0.18    ip-10-0-195-55.us-east-2.compute.internal    <none>           <none>
dns-default-hrhpn     2/2     Running             0          5h37m   10.130.0.22    ip-10-0-170-80.us-east-2.compute.internal    <none>           <none>
dns-default-j7wdt     0/2     ContainerCreating   0          171m    <none>         ip-10-0-164-193.us-east-2.compute.internal   <none>           <none>
dns-default-qswnf     2/2     Running             0          5h30m   10.129.2.6     ip-10-0-153-192.us-east-2.compute.internal   <none>           <none>
dns-default-rchd6     2/2     Running             0          5h37m   10.128.0.21    ip-10-0-153-190.us-east-2.compute.internal   <none>           <none>

# oc -n openshift-dns get ds dns-default
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         5       1            5           kubernetes.io/os=linux   5h38m

# oc -n openshift-dns describe pod dns-default-j7wdt
...
  Warning  FailedCreatePodSandBox  161m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14] [openshift-dns/dns-default-j7wdt b5ab6d6c3462ba546ee8e0d43514ac11136867bd54de4bd13f93a25839039e14] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
'
  Warning  FailedCreatePodSandBox  <invalid> (x131 over 159m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-j7wdt_openshift-dns_2201ca51-82d5-44e0-a349-343d736e9952_0(15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3): error adding pod openshift-dns_dns-default-j7wdt to CNI network "multus-cni-network": [openshift-dns/dns-default-j7wdt:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] [openshift-dns/dns-default-j7wdt 15bc5c77116d4974058682359262dfea2d6d6a63fa39d33ba612cfd5c2e758a3] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

checked, only the ContainerCreating pod dns-default-j7wdt does not have annotations
# for i in $(oc -n openshift-dns get pod | grep dns-default | awk '{print $1}'); do echo $i; oc -n openshift-dns get pod $i -oyaml | grep annotations -A2; echo -e "\n"; done
dns-default-4vg2r
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.6/23"],"mac_address":"0a:58:0a:80:02:06","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.2.3"}],"ip_address":"10.128.2.6/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-


dns-default-g2dxv
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.18/23"],"mac_address":"0a:58:0a:81:00:12","gateway_ips":["10.129.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.0.3"}],"ip_address":"10.129.0.18/23","gateway_ip":"10.129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-


dns-default-hrhpn
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.0.22/23"],"mac_address":"0a:58:0a:82:00:16","gateway_ips":["10.130.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.130.0.3"}],"ip_address":"10.130.0.22/23","gateway_ip":"10.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-


dns-default-j7wdt


dns-default-qswnf
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.6/23"],"mac_address":"0a:58:0a:81:02:06","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.129.2.3"}],"ip_address":"10.129.2.6/23","gateway_ip":"10.129.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-


dns-default-rchd6
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.0.21/23"],"mac_address":"0a:58:0a:80:00:15","gateway_ips":["10.128.0.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.0.3"}],"ip_address":"10.128.0.21/23","gateway_ip":"10.128.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-

Version-Release number of selected component (if applicable):
upgrade from 4.8.6 to 4.9.0-0.nightly-2021-08-26-164418

How reproducible:
not sure

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
blocked by dns upgrade

Expected results:
no block for upgrade

Additional info:

Comment 2 Martin Kennelly 2021-08-29 14:49:04 UTC

omg logs -n openshift-ovn-kubernetes ovnkube-master-h8x2k  -c ovnkube-master:
> namespace.go:585] Failed to get join switch port IP address for node ip-10-0-164-193.us-east-2.compute.internal: provided IP is already allocated

In ovnkube-master, while a nodes cache is syncing. It fails when an attempt is made to reserve logical router port IPs that already exists. This node is the node where pod k8s_dns-default-j7wdt fails to get its IP from its annotation during CNI invocation because OVN is unhealthy for this particular node.

This is being worked on upstream to not fail if the IPs that exists is what is expected: See here: https://github.com/ovn-org/ovn-kubernetes/pull/2456

I have yet to determine blocker status - consulting Dan W.

Comment 8 Martin Kennelly 2021-08-31 12:30:05 UTC

Possible fix merged in upstream ovn and downstream cherry pick opened. Waiting on CI to confirm fix.

Comment 11 Aniket Bhat 2021-09-01 00:02:55 UTC

*** Bug 1999894 has been marked as a duplicate of this bug. ***

Comment 17 Lalatendu Mohanty 2021-09-03 16:03:38 UTC

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 18 Alexander Constantinescu 2021-09-03 16:15:44 UTC

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

  This happens independently of upgrades or regular clusters, but only happens in a very specific scenario that most customers will very unlikely hit. I can't estimate the percentage though, but must be less 10 %.

What is the impact?  Is it serious enough to warrant blocking edges?

  One or several nodes can be impacted. The result is, the node is completely hosed (no pod can start, and all existing will lose networking) 

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  
  Restarting the existing ovnkube-master leader instance __should__ fix it. 

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
   
  Yes. Only customers upgrading or using this z-stream version with this fix are impacted. The next will include the fix.

Comment 19 Alexander Constantinescu 2021-09-03 16:25:21 UTC

Expanding on my previous comment: 

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

Only 4.8.10 and 4.7.29 are impacted by the problem. So any customer upgrading to those version, can be affected.

Comment 20 Lalatendu Mohanty 2021-09-03 17:11:33 UTC

As this issue only occurs when there is change in node count, this is not a typical scenario during upgrades. In general users do not attempt to increase the node count during upgrade. Hence we are removing the upgrade blocker keyword from the bug.

Comment 21 Ben Parees 2021-09-03 17:37:47 UTC

*** Bug 1999894 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2021-10-18 17:49:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759