1868259 – [OVN]Upgrading from 4.5.5 to 4.6 latest nightly build failed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Tim Rozet
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1868083
Depends On:
Blocks:

Reported:	2020-08-12 07:11 UTC by huirwang
Modified:	2021-04-05 17:47 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:28:06 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 756	0	None	closed	Bug 1868259: Fixes OVN upgrade with shared gateway mode	2021-02-19 12:27:53 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:28:21 UTC

Description huirwang 2020-08-12 07:11:05 UTC

Description of problem:
On an OVN cluster, upgrading from 4.5.5 to the latest 4.6 nightly build failed. One ovnkube-node-XXXXX pod is in CrashLoopBackOff, and most of the ovnkube-node-metrics-XXXXX pods are Pending with the scheduling error "node(s) didn't have free ports for the requested pod ports". The required port 9103 is already occupied by ovnkube.


*How reproducible:*
Found this failure in upgrade CI, then reproduced it with a manual upgrade.


*Version-Release number of selected components (if applicable):*
Base version: 4.5.5
Target version: 4.6.0-0.nightly-2020-08-11-134736

Steps to Reproduce:
1. Install a 4.5.5 bare-metal OVN cluster.
2. Upgrade to the latest 4.6 nightly build with the command below:
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-11-134736 --force=true --allow-explicit-upgrade=true

Actual Result:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.5     True        True          29m     Unable to apply 4.6.0-0.nightly-2020-08-11-134736: the cluster operator network has not yet successfully rolled out

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
cloud-credential                           4.6.0-0.nightly-2020-08-11-134736   True        False         False      145m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-11-134736   True        False         False      123m
config-operator                            4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
console                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
dns                                        4.5.5                               True        False         False      128m
etcd                                       4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
image-registry                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
ingress                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      25m
insights                                   4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
kube-apiserver                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-controller-manager                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-scheduler                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      127m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-11-134736   True        False         False      119m
machine-api                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
machine-approver                           4.6.0-0.nightly-2020-08-11-134736   True        False         False      126m
machine-config                             4.5.5                               True        False         False      29m
marketplace                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
monitoring                                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      113m
network                                    4.5.5                               True        True          True       129m
node-tuning                                4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
openshift-apiserver                        4.6.0-0.nightly-2020-08-11-134736   True        False         False      124m
openshift-controller-manager               4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
openshift-samples                          4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-11-134736   True        False         False      128m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-11-134736   True        False         False      24m
service-ca                                 4.6.0-0.nightly-2020-08-11-134736   True        False         False      129m
storage                                    4.6.0-0.nightly-2020-08-11-134736   True        False         False      25m
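
The table above can be scanned mechanically for operators that are behind or unhealthy. The following is a minimal sketch, assuming the six-column layout shown in this report; SAMPLE is a trimmed copy of the output above, and the helper name lagging_operators is hypothetical.

```python
from typing import List, Tuple

TARGET = "4.6.0-0.nightly-2020-08-11-134736"

# Trimmed copy of the `oc get co` output from this report.
SAMPLE = """\
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-11-134736   True        False         False      23m
dns                                        4.5.5                               True        False         False      128m
machine-config                             4.5.5                               True        False         False      29m
network                                    4.5.5                               True        True          True       129m
"""


def lagging_operators(table: str, target: str) -> List[Tuple[str, str]]:
    """Return (name, reason) for operators that are behind the target or unhealthy."""
    findings = []
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:  # assumption: exactly six whitespace-separated columns
            continue
        name, version, available, progressing, degraded, _since = fields
        reasons = []
        if version != target:
            reasons.append(f"version {version}")
        if available != "True":
            reasons.append("unavailable")
        if progressing == "True":
            reasons.append("progressing")
        if degraded == "True":
            reasons.append("degraded")
        if reasons:
            findings.append((name, ", ".join(reasons)))
    return findings


for name, reason in lagging_operators(SAMPLE, TARGET):
    print(f"{name}: {reason}")
```

On the sample rows this reports dns, machine-config, and network, with network being the only one that is also progressing and degraded.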




oc get co network -o yaml
  - lastTransitionTime: "2020-08-12T05:43:28Z"
    message: |-
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" update is rolling out (1 out of 6 updated)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node-metrics" is not available (awaiting 5 nodes)
    reason: Deploying
    status: "True"
    type: Progressing

Check the OVN pods in the openshift-ovn-kubernetes namespace:
oc get pods -n openshift-ovn-kubernetes
NAME                           READY   STATUS             RESTARTS   AGE
ovnkube-master-khpnv           4/4     Running            0          21m
ovnkube-master-kzhd8           4/4     Running            0          23m
ovnkube-master-metrics-2jqqm   1/1     Running            0          23m
ovnkube-master-metrics-gjwqj   1/1     Running            0          23m
ovnkube-master-metrics-xckrq   1/1     Running            0          23m
ovnkube-master-qqb8d           4/4     Running            0          22m
ovnkube-node-5t95k             2/2     Running            0          132m
ovnkube-node-67drc             2/2     Running            0          123m
ovnkube-node-9smvf             2/2     Running            0          123m
ovnkube-node-fwblq             1/2     CrashLoopBackOff   9          23m
ovnkube-node-metrics-5hq5j     0/1     Pending            0          23m
ovnkube-node-metrics-f9spb     0/1     Pending            0          23m
ovnkube-node-metrics-fwmhb     0/1     Pending            0          23m
ovnkube-node-metrics-ksfjc     0/1     Pending            0          23m
ovnkube-node-metrics-ldvmx     0/1     Pending            0          23m
ovnkube-node-metrics-rxdp6     1/1     Running            0          23m
ovnkube-node-mtj9b             2/2     Running            0          123m
ovnkube-node-r8vn4             2/2     Running            0          132m
ovs-node-525v5                 1/1     Running            0          23m
ovs-node-jnn2h                 1/1     Running            0          23m
ovs-node-ncmt6                 1/1     Running            0          21m
ovs-node-pxvh9                 1/1     Running            0          22m
ovs-node-tr6rl                 1/1     Running            0          21m
ovs-node-xk4xl                 1/1     Running            0          21m
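
To pick the problem pods out of a listing like the one above, the rows can be grouped by status. This is a rough sketch, assuming the five-column `oc get pods` layout shown here (a RESTARTS value containing spaces would not parse); SAMPLE is a trimmed copy of the output above and unhealthy_pods is a hypothetical helper.

```python
from collections import defaultdict

# Trimmed copy of the `oc get pods -n openshift-ovn-kubernetes` output from this report.
SAMPLE = """\
NAME                           READY   STATUS             RESTARTS   AGE
ovnkube-node-5t95k             2/2     Running            0          132m
ovnkube-node-fwblq             1/2     CrashLoopBackOff   9          23m
ovnkube-node-metrics-5hq5j     0/1     Pending            0          23m
ovnkube-node-metrics-rxdp6     1/1     Running            0          23m
"""


def unhealthy_pods(table):
    """Map status -> pod names for pods that are not Running with all containers ready."""
    by_status = defaultdict(list)
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 5:  # assumption: exactly five whitespace-separated columns
            continue
        name, ready, status, _restarts, _age = fields
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            by_status[status].append(name)
    return dict(by_status)


for status, names in unhealthy_pods(SAMPLE).items():
    print(status, names)
```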


oc logs ovnkube-node-fwblq -c ovn-controller -n openshift-ovn-kubernetes
2020-08-12T06:05:45Z|00176|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00177|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00178|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:05:45Z|00179|pinctrl|WARN|Dropped 173 log messages in last 45 seconds (most recently, 0 seconds ago) due to excessive rate
2020-08-12T06:05:45Z|00180|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-12T06:05:45Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:06:12Z|00182|patch|ERR|Dropped 16 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2020-08-12T06:06:12Z|00183|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:42Z|00184|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:45Z|00185|pinctrl|WARN|Dropped 41 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2020-08-12T06:06:45Z|00186|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-12T06:07:12Z|00187|patch|ERR|Dropped 4 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2020-08-12T06:07:12Z|00188|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:07:42Z|00189|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:07:45Z|00190|pinctrl|WARN|Dropped 35 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2020-08-12T06:07:45Z|00191|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
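
The log lines above follow a `timestamp|sequence|module|level|message` layout, so the recurring errors can be tallied once the rate-limiter notices are skipped. The snippet below is only a sketch built on that observed layout; SAMPLE is a trimmed copy of the log above and count_problems is a hypothetical helper.

```python
from collections import Counter

# Trimmed copy of the ovn-controller log from this report.
SAMPLE = """\
2020-08-12T06:05:45Z|00181|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (70% CPU usage)
2020-08-12T06:06:12Z|00182|patch|ERR|Dropped 16 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2020-08-12T06:06:12Z|00183|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:42Z|00184|patch|ERR|bridge not found for localnet port 'lnet-node_local_switch' with network name 'locnet'
2020-08-12T06:06:45Z|00186|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
"""


def count_problems(log):
    """Count distinct WARN/ERR messages per (module, level, message)."""
    counts = Counter()
    for line in log.splitlines():
        parts = line.strip().split("|", 4)  # the message itself may contain '|'
        if len(parts) != 5:
            continue
        _timestamp, _sequence, module, level, message = parts
        if level not in ("ERR", "WARN"):
            continue
        if message.startswith("Dropped "):  # rate-limiter notice, not a real error
            continue
        counts[(module, level, message)] += 1
    return counts


for (module, level, message), n in count_problems(SAMPLE).most_common():
    print(f"{n}x {module} {level}: {message}")
```

On this sample the most frequent entry is the patch module's "bridge not found for localnet port" error.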

Check the pending pods:
oc describe pod ovnkube-node-metrics-ldvmx -n openshift-ovn-kubernetes
Containers:
  kube-rbac-proxy:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31114cd64b9dc44e9d61cec370f9c96acdb7dd391f8228552e552e6605aba735
    Port:       9103/TCP
    Host Port:  9103/TCP


  Warning  FailedScheduling  28s (x31 over 25m)  default-scheduler  0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.
  

Check the nodes for TCP port 9103; it was occupied by ovnkube:
[root@huir-upg1-jp5v8-compute-2 ~]# netstat -ntlp | grep 9103
tcp6       0      0 :::9103                 :::*                    LISTEN      2608/ovnkube
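
The same check can be scripted when several nodes need to be inspected. The sketch below assumes the `netstat -ntlp` column layout shown above (run as root so the PID/program column is populated); SAMPLE is the line from this report and listeners_on_port is a hypothetical helper.

```python
# The netstat line captured on the compute node in this report.
SAMPLE = "tcp6       0      0 :::9103                 :::*                    LISTEN      2608/ovnkube\n"


def listeners_on_port(netstat_output, port):
    """Return (pid, program) pairs for sockets listening on the given TCP port."""
    owners = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        local_address = fields[3]
        # The port is whatever follows the last ':' (works for "0.0.0.0:9103" and ":::9103").
        if local_address.rsplit(":", 1)[-1] != str(port):
            continue
        pid, _, program = fields[6].partition("/")
        if not pid.isdigit():  # "-" is shown when the owner is not visible
            continue
        owners.append((int(pid), program))
    return owners


print(listeners_on_port(SAMPLE, 9103))
```

A non-empty result for port 9103 on a node explains why the scheduler cannot place a pod that requests host port 9103 there.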

Comment 6 Tim Rozet 2020-08-13 14:53:43 UTC

I think there are 2 different issues here. For the original bug, the issue is that we are not upgrading ovnkube-node before we launch the metrics pods, so the metrics pods cannot deploy because they need port 9103. For the other comment, it looks like something is wrong with configuring the br-ex bridge with the ovs-configuration service; however, I cannot launch an oc debug node pod on that cluster, so I'm unable to investigate further. Either way, Weinan, please open a new BZ for the issue you are encountering, as it is a separate bug.

Comment 7 Tim Rozet 2020-08-13 16:03:42 UTC

(In reply to Tim Rozet from comment #6)
> I think there are 2 different issues here. For the original bug the issue is
> that we are not upgrading ovnkube-node before we launch the metrics pods, so
> metrics cannot deploy because it needs port 9103. For the other comment,
> it looks like something is wrong with configuring br-ex bridge with the
> ovs-configuration service, however I cannot launch an oc debug node pod on
> that cluster, so I'm unable to investigate further. Either way, Weinan
> please open a new BZ for the issue you are encountering as it is a separate
> bug.

Actually, the metrics port problem might not be the real issue. I see ovs-configuration problems in the manually upgraded cluster.

Comment 9 Anurag saxena 2020-08-13 18:29:43 UTC

It seems we can't escape the x509 error. As per the auth team: "no.  It's reasonably safeish to ignore a cert error for getting logs (so we built that), but it's considerably less safe cases where users send data to the potentially unsafe endpoint"

The other way, I believe, is to leverage a bastion host here, which is not working for me at the moment :(

Comment 10 Anurag saxena 2020-08-13 19:28:44 UTC

@huiran Could you try to repro this issue on Monday on a new setup and share? Had a hard time with oc debug and bastion host on this one. Thanks

Comment 12 Tim Rozet 2020-08-13 21:37:44 UTC

OK, so after further investigation, it looks like the problem is that CNO upgrades before MCO. That means MCO never has a chance to start system OVS and run the ovs-configuration service, so OVN fails to start. We have a couple of options here:
1. Add the same detection we use for the ovs-node DS to check whether OVS is running on the host, and use that to determine whether we should run in local GW mode. That would allow CNO to "upgrade"; then, when MCO runs, it would reboot the node, and ovn-kube would run the right way after that.
2. Move CNO to run after MCO in the upgrade path.

Need to figure out if #2 is feasible, otherwise we go with #1.
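
Option 1 can be pictured as a small decision made per node. The sketch below is purely illustrative and is not the CNO code: the function name, the boolean input, and the "shared"/"local" strings are assumptions, and how host OVS is actually detected is left out because the report does not describe it.

```python
def choose_gateway_mode(host_ovs_running: bool) -> str:
    """Pick a gateway mode for ovnkube-node on one node (hypothetical helper).

    Assumption drawn from option 1: if OVS is not yet running on the host
    (MCO has not upgraded and rebooted the node), fall back to local gateway
    mode; once host OVS is running, use shared gateway mode.
    """
    return "shared" if host_ovs_running else "local"


# Upgrade sequence as described in option 1:
# CNO upgrades first, while the node has no system OVS yet.
print(choose_gateway_mode(host_ovs_running=False))
# MCO then upgrades and reboots the node, which starts system OVS.
print(choose_gateway_mode(host_ovs_running=True))
```

The point of the fallback is that CNO can finish rolling out on nodes MCO has not touched yet, instead of ovnkube-node crash-looping until the reboot.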

Comment 16 Alexander Constantinescu 2020-08-25 15:49:16 UTC

*** Bug 1868083 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2020-10-27 16:28:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 19 W. Trevor King 2021-04-05 17:47:33 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
