Description of problem:
[OVN] Upgrade from 4.5.8 to 4.6.0-fc.5 failed on Bare Metal

Version-Release number of the following components:
4.5.8 to 4.6.0-fc.5

How reproducible:
Always

Steps to Reproduce:
[weliang@weliang verification-tests]$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
[weliang@weliang verification-tests]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64 --allow-explicit-upgrade --force=true
Updating to release image quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64

Actual results:
[weliang@weliang verification-tests]$ oc get nodes
NAME                                STATUS   ROLES    AGE     VERSION
weliang-182-8vzkb-compute-0         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-compute-1         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-compute-2         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-0   Ready    master   5h21m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-1   Ready    master   5h21m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-2   Ready    master   5h21m   v1.18.3+6c42de8

[weliang@weliang verification-tests]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-fc.5   False       True          True       3h57m
cloud-credential                           4.6.0-fc.5   True        False         False      5h22m
cluster-autoscaler                         4.6.0-fc.5   True        False         False      5h13m
config-operator                            4.6.0-fc.5   True        False         False      5h13m
console                                    4.6.0-fc.5   True        False         True       4h11m
csi-snapshot-controller                    4.6.0-fc.5   True        False         False      4h1m
dns                                        4.5.8        True        False         False      5h18m
etcd                                       4.6.0-fc.5   True        False         False      5h17m
image-registry                             4.6.0-fc.5   True        False         False      5h9m
ingress                                    4.6.0-fc.5   True        False         False      4h12m
insights                                   4.6.0-fc.5   True        False         False      5h13m
kube-apiserver                             4.6.0-fc.5   True        False         False      5h17m
kube-controller-manager                    4.6.0-fc.5   True        False         False      5h17m
kube-scheduler                             4.6.0-fc.5   True        False         False      5h17m
kube-storage-version-migrator              4.6.0-fc.5   True        False         False      5h10m
machine-api                                4.6.0-fc.5   True        False         False      5h13m
machine-approver                           4.6.0-fc.5   True        False         False      5h15m
machine-config                             4.5.8        True        False         False      5h17m
marketplace                                4.6.0-fc.5   True        False         False      4h11m
monitoring                                 4.6.0-fc.5   False       False         True       3h55m
network                                    4.6.0-fc.5   True        False         False      5h19m
node-tuning                                4.6.0-fc.5   True        False         False      4h12m
openshift-apiserver                        4.6.0-fc.5   False       False         False      3h57m
openshift-controller-manager               4.6.0-fc.5   True        False         False      5h8m
openshift-samples                          4.6.0-fc.5   True        False         False      4h12m
operator-lifecycle-manager                 4.6.0-fc.5   True        False         False      5h19m
operator-lifecycle-manager-catalog         4.6.0-fc.5   True        False         False      5h19m
operator-lifecycle-manager-packageserver   4.6.0-fc.5   False       False         False      125m
service-ca                                 4.6.0-fc.5   True        False         False      5h19m
storage                                    4.6.0-fc.5   True        False         False      4h12m

[weliang@weliang verification-tests]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        True          4h27m   Unable to apply 4.6.0-fc.5: the cluster operator openshift-apiserver has not yet successfully rolled out
[weliang@weliang verification-tests]$

Expected results:
Upgrade succeeds.
Additional info:
Cannot run must-gather in the broken cluster:

[weliang@weliang verification-tests]$ oc adm must-gather --dest-dir=/tmp/log_must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8ec5afb544381d78514ea0367ca29edac1a7aadfa323d191cf84d60d01045e0
[must-gather      ] OUT namespace/openshift-must-gather-sjm8s created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jjcdv created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8ec5afb544381d78514ea0367ca29edac1a7aadfa323d191cf84d60d01045e0 created
[must-gather-zp26b] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
[must-gather-zp26b] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deploymentconfigs.apps.openshift.io)
[must-gather-zp26b] OUT gather logs unavailable: unexpected EOF
[must-gather-zp26b] OUT waiting for gather to complete
[must-gather-zp26b] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jjcdv deleted
[must-gather      ] OUT namespace/openshift-must-gather-sjm8s deleted
error: gather never finished for pod must-gather-zp26b: timed out waiting for the condition
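When must-gather cannot complete because the aggregated OpenShift API is down, a narrower collection sometimes still works against the resources the kube-apiserver serves directly. This is only a suggestion (the target resources below are examples, not required):

# Collect operator-level data without running the full must-gather pod
oc adm inspect clusteroperator/openshift-apiserver --dest-dir=/tmp/inspect
oc adm inspect clusteroperator/network --dest-dir=/tmp/inspect
oc adm inspect ns/openshift-ovn-kubernetes --dest-dir=/tmp/inspect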
What does oc get co openshift-apiserver -o yaml say?
Same for all other operators with Available=False or Degraded=True, please.
Created attachment 1715553 [details] Test Results Log
Test env: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113589/artifact/workdir/install-dir/auth/kubeconfig/*view*/
This is a network problem:

$ kubectl get -n openshift-apiserver endpoints
NAME   ENDPOINTS                                             AGE
api    10.128.0.20:8443,10.129.0.13:8443,10.130.0.29:8443   155m

Then oc debug node/weliang-211-wz4nn-control-plane-0 and curl -k <endpoint>:8443. The second endpoint "10.129.0.13:8443" blocks.
Then oc debug node/weliang-211-wz4nn-control-plane-1 and the same curls again. The endpoint "10.129.0.13:8443" works.
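A concrete form of that per-node check (just a sketch; the node name and endpoints are the ones from the output above, and /healthz is only used to force a short HTTP response):

oc debug node/weliang-211-wz4nn-control-plane-0 -- chroot /host sh -c '
  for ep in 10.128.0.20:8443 10.129.0.13:8443 10.130.0.29:8443; do
    echo "== $ep"
    curl -k -sS --max-time 5 -o /dev/null -w "%{http_code}\n" https://$ep/healthz || echo "timeout/refused"
  done'
# Repeat from the other control-plane nodes; an endpoint that hangs from one node
# but answers from another points at east-west (pod network) connectivity.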
In ovn controller logs on master-1 I see those messages around every minute: 2020-09-21T16:18:59Z|06924|ovsdb_idl|WARN|transaction error: {"details":"No column other_config in table Chassis.","error":"unknown column","syntax":"{\"encaps\":[\"named-uuid\",\"row451c853e_d3f1_4da8_ac8a_390587e3bc5f\"],\"external_ids\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]],\"hostname\":\"weliang-211-wz4nn-control-plane-1\",\"name\":\"31473c2e-568a-4970-b865-fd3932174b18\",\"other_config\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]]}"}
And this nice message: 2020-09-21T16:18:59Z|06923|ovsdb_idl|WARN|Dropped 34665 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
At the same time:

NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.6.0-fc.5   True        False         False      169m
*** Bug 1880449 has been marked as a duplicate of this bug. ***
Hi Weibin, could you reproduce again and provide a kubeconfig? The one attached in #comment 4 is not valid anymore it seems. /Alex
Hi, also, I am wondering if this is a valid upgrade path? 4.6.0-fc.5 contains ovn-kubernetes code from a month ago. This predates the changes introduced a week ago, which essentially means no one will ever upgrade to this version in the future.

@Weibin, could you try an upgrade to 4.6.0-fc.7: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.6.0-fc.7?from=4.6.0-fc.4

I see it was built 4 days ago, so it should contain the current 4.6 networking code (I can't verify, though, as that page returns an error for that release).

/Alex
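For reference, the corresponding command would presumably mirror the one in the bug description, just pointing at the fc.7 payload (assuming the same pullspec pattern as the fc.5 command above; not verified against the release page):

oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.6.0-fc.7-x86_64 --allow-explicit-upgrade --force=true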
Wondering if this might be connected to https://github.com/ovn-org/ovn-kubernetes/pull/1720
> Wondering if this might be connected to https://github.com/ovn-org/ovn-kubernetes/pull/1720 No, for sure not on a cluster upgrade to 4.6.0-fc.5. That PR fixes a new bug found on CI caused by the changes from last week, which 4.6.0-fc.5 does not have
I don't think upgrade is going to work from 4.5 -> latest 4.6. CNO will upgrade before MCO, and we will be left in the situation where the node has not run ovs-configuration yet, but ovn-k8s has upgraded and will attempt to create the bridge. It would still be good for QE to verify this is the case. I'll come up with a fix in parallel.
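One way QE could check that on a node (a sketch; <node-name> is a placeholder): if ovn-k8s is already on 4.6 but the unit below is missing or has never run, that matches the race described above.

oc debug node/<node-name> -- chroot /host systemctl status ovs-configuration.service
oc debug node/<node-name> -- chroot /host systemctl status openvswitch.service ovs-vswitchd.service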
Looking into comment 6, it seems like the ovn dbs are not upgraded to the latest schema, and hence the ovn-controllers are not seeing the newly added column, other_config.

Probably this fix is required - https://github.com/openshift/cluster-network-operator/commit/eaff68c539225391a7cf7a3d21edd8283426b7e8#diff-54b09156f80dfb820afa115de13b8f32

This patch makes sure that the ovn dbs are updated.

There is a workaround for this issue - log in to the nbdb/sbdb container and run the commands from here - https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L171
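A rough sketch of that workaround (the pod label and container names below are assumptions based on the ovnkube-master manifest; the actual schema-upgrade commands are the ones in the linked ovnkube-master.yaml and are not reproduced here):

# Find the master pods that carry the nbdb/sbdb containers
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o wide
# Open a shell in the nbdb (and likewise the sbdb) container and run the
# db-upgrade commands from the linked manifest there
oc -n openshift-ovn-kubernetes exec -it <ovnkube-master-pod> -c nbdb -- bash
oc -n openshift-ovn-kubernetes exec -it <ovnkube-master-pod> -c sbdb -- bash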
Clarifying #comment 15:
> Probably this fix is required - https://github.com/openshift/cluster-network-operator/commit/eaff68c539225391a7cf7a3d21edd8283426b7e8#diff-54b09156f80dfb820afa115de13b8f32

4.6.0-fc.5 does not have that patch either, which is probably why those errors are showing up.
4.5.8 -> 4.6.0-fc.7 upgrade testing failed too.

[weliang@weliang ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-fc.7   False       True          True       13m
cloud-credential                           4.6.0-fc.7   True        False         False      127m
cluster-autoscaler                         4.6.0-fc.7   True        False         False      103m
config-operator                            4.6.0-fc.7   True        False         False      103m
console                                    4.6.0-fc.7   False       False         True       11m
csi-snapshot-controller                    4.6.0-fc.7   False       True          False      11m
dns                                        4.5.8        True        True          True       109m
etcd                                       4.6.0-fc.7   True        False         False      108m
image-registry                             4.6.0-fc.7   False       True          False      13m
ingress                                    4.6.0-fc.7   True        False         False      24m
insights                                   4.6.0-fc.7   True        False         False      104m
kube-apiserver                             4.6.0-fc.7   True        False         False      108m
kube-controller-manager                    4.6.0-fc.7   True        False         False      109m
kube-scheduler                             4.6.0-fc.7   True        False         False      108m
kube-storage-version-migrator              4.6.0-fc.7   True        False         False      99m
machine-api                                4.6.0-fc.7   True        False         False      104m
machine-approver                           4.6.0-fc.7   True        False         False      107m
machine-config                             4.5.8        True        False         False      108m
marketplace                                4.6.0-fc.7   True        False         False      23m
monitoring                                 4.6.0-fc.7   True        False         False      22m
network                                    4.6.0-fc.7   True        False         False      110m
node-tuning                                4.6.0-fc.7   True        False         False      24m
openshift-apiserver                        4.6.0-fc.7   True        False         False      14m
openshift-controller-manager               4.6.0-fc.7   True        False         False      104m
openshift-samples                          4.6.0-fc.7   True        False         False      23m
operator-lifecycle-manager                 4.6.0-fc.7   True        False         False      109m
operator-lifecycle-manager-catalog         4.6.0-fc.7   True        False         False      110m
operator-lifecycle-manager-packageserver   4.6.0-fc.7   True        False         False      23m
service-ca                                 4.6.0-fc.7   True        False         False      110m
storage                                    4.6.0-fc.7   True        False         False      24m
[weliang@weliang ~]$

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        True          45m     Unable to apply 4.6.0-fc.7: an unknown error has occurred: MultipleErrors
[weliang@weliang ~]$

kubeconfig: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113869/artifact/workdir/install-dir/auth/kubeconfig/*view*/
Thanks Weibin. As I suspected it looks like we are creating a bridge, which wont work during upgrade: Bridge brens3 fail_mode: standalone Port patch-br-local_weliang-2211-48l5l-compute-0-to-br-int Interface patch-br-local_weliang-2211-48l5l-compute-0-to-br-int type: patch options: {peer=patch-br-int-to-br-local_weliang-2211-48l5l-compute-0} I'll have a fix to try shortly. What's the easiest way to test this out? Can I do it with cluster bot?
`test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work. Suggest scheduling a couple of runs, e.g. on aws and gcp.
(In reply to Anurag saxena from comment #19)
> `test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work.
> Suggest scheduling a couple of runs, e.g. on aws and gcp.

Thanks, Anurag!

For a Bare Metal cluster, the command needs to use metal instead, like this:

test upgrade 4.5.8 openshift/ovn-kubernetes#xxx metal,ovn
(In reply to Weibin Liang from comment #20)
> (In reply to Anurag saxena from comment #19)
> > `test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work.
> > Suggest scheduling a couple of runs, e.g. on aws and gcp.
>
> Thanks, Anurag!
>
> For a Bare Metal cluster, the command needs to use metal instead, like this:
>
> test upgrade 4.5.8 openshift/ovn-kubernetes#xxx metal,ovn

Oh yeah, it's BM so that makes sense. Not sure how the bot handles BM though. Worth a try.
*** Bug 1880514 has been marked as a duplicate of this bug. ***
This should not be limited to bare metal, I guess. I tried on AWS with OVN and hit the same issue.
It looks like the problem is that we have a reject ACL present on the kapi service:

_uuid               : 5b37d90d-8a86-4dc7-a17b-8afe370f9154
action              : reject
direction           : from-lport
external_ids        : {}
log                 : false
match               : "ip4.dst==172.30.0.1 && tcp && tcp.dst==443"
meter               : []
name                : "948626bb-6702-427a-89f5-afeaf2600552-172.30.0.1:443"
priority            : 1000
severity            : []

Even though the endpoints are all there:

[root@huir-upg3-xsm9v-master-1 ~]# ovn-nbctl list load_balancer 948626bb-6702-427a-89f5-afeaf2600552 | grep 172.30.0.1
vips                : {"172.30.0.10:53"="10.128.0.20:5353,10.128.2.3:5353,10.129.0.11:5353,10.129.2.3:5353,10.130.0.9:5353,10.131.0.3:5353", "172.30.0.10:9154"="10.128.0.20:9154,10.128.2.3:9154,10.129.0.11:9154,10.129.2.3:9154,10.130.0.9:9154,10.131.0.3:9154", "172.30.0.1:443"="192.168.0.123:6443,192.168.0.74:6443,192.168.1.183:6443"

This means something got out of sync with our services/endpoints handling logic. Because all of those log messages are debug level only, it's hard to know whether these ACLs were present before the ovnkube-masters were upgraded or whether they were added afterwards when the new ovnkube-master came up. It would be helpful if you could reproduce it with debug-level logs on.

Either way we can make improvements by:
1. Having syncServices remove stale reject ACLs. Since we only use reject ACLs on services now, we can poll all reject ACLs in OVN and decide whether to remove them. This would fix the case where these ACLs are left over from a previous instance of ovn-k8s.
2. Making deleteLoadBalancerRejectACL better. Right now we only look at the cache to see if there is an ACL configured for that service. If we don't find one, we should still generate the name and try to delete it anyway.
3. Moving a lot of these debug-level service/endpoints logging messages to Info.
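For anyone trying to reproduce, a quick way to see whether any reject ACLs are present in the NB database (a sketch; the pod name is a placeholder, and ovn-nbctl is run from the nbdb container as in the output above):

oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl --no-leader-only find acl action=reject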
Huiran or Anurag, can you please try upgrade again and include: https://github.com/openshift/ovn-kubernetes/pull/295 ? See if that resolves the problem.
Thanks Huiran. My patch was flawed, but either it is coincidence or it exposed two OVN crashes. OVS also looks hosed. Filing a new bug on OVN, and will update my patch with the correct logic.
Update: The bug description says bare metal, but the behavior in comment 29 is now seen across all platforms.
(In reply to Tim Rozet from comment #28)
> Huiran or Anurag, can you please try upgrade again and include:
> https://github.com/openshift/ovn-kubernetes/pull/295 ?
>
> See if that resolves the problem.

Tim, my gcp job succeeded with PR 295:

`test upgrade 4.5.0-0.nightly-2020-09-28-124031 openshift/ovn-kubernetes#295 gcp,ovn` succeeded

The bot doesn't seem to be working well with metal, so I believe the fix should work for all platforms.
Thanks Anurag. I just updated 295 to test out latest changes. Following upstream PR here: https://github.com/ovn-org/ovn-kubernetes/pull/1738
test upgrade 4.5.0-0.nightly-2020-09-28-124031 openshift/ovn-kubernetes#297 gcp,ovn succeeded
*** Bug 1883521 has been marked as a duplicate of this bug. ***
Upgrading still fails from latest 4.5 to 4.6 on a BM OVN cluster; openvswitch.service is inactive on some nodes.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-10-01-011427   True        True          106m    Unable to apply 4.6.0-0.nightly-2020-10-02-160623: the cluster operator openshift-apiserver is degraded

[weliang@weliang ~]$ oc get pods --all-namespaces -o wide | egrep -v "Runn|Comp"
NAMESPACE                             NAME                                                READY   STATUS              RESTARTS   AGE    IP            NODE                              NOMINATED NODE   READINESS GATES
openshift-apiserver                   apiserver-5c5558d85-6vgwh                           0/2     Pending             0          58m    <none>        <none>                            <none>           <none>
openshift-controller-manager          controller-manager-rqk8g                            0/1     ContainerCreating   1          83m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-dns                         dns-default-lk2n8                                   0/3     ContainerCreating   0          66m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-etcd                        etcd-quorum-guard-7986975d98-cq72h                  0/1     Pending             0          58m    <none>        <none>                            <none>           <none>
openshift-kube-apiserver              revision-pruner-8-weliang26-nwdc5-control-plane-2   0/1     ContainerCreating   0          56m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-kube-descheduler-operator   cluster-6fb776d674-x2drl                            0/1     ImagePullBackOff    0          130m   10.131.0.14   weliang26-nwdc5-compute-1         <none>           <none>
openshift-multus                      multus-admission-controller-j5scn                   0/2     ContainerCreating   0          74m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-multus                      network-metrics-daemon-dglwf                        0/2     ContainerCreating   0          76m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-oauth-apiserver             apiserver-5988648dd4-wsq6h                          0/1     Pending             0          58m    <none>        <none>                            <none>           <none>

[weliang@weliang ~]$ oc get co | grep -v "True.*False.*False"
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.6.0-0.nightly-2020-10-02-160623   True        False         True       13s
machine-config        4.5.0-0.nightly-2020-10-01-011427   False       True          True       67m
network               4.6.0-0.nightly-2020-10-02-160623   True        True          True       155m
openshift-apiserver   4.6.0-0.nightly-2020-10-02-160623   True        False         True       2s

[weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host systemctl status ovs-vswitchd openvswitch ; done
Starting pod/weliang26-nwdc5-compute-0-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-compute-1-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-compute-2-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-control-plane-0-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
Starting pod/weliang26-nwdc5-control-plane-1-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-control-plane-2-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           └─10-ovs-vswitchd-restart.conf
   Active: active (running) since Fri 2020-10-02 20:39:56 UTC; 58min ago
  Process: 1331 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVS_USER_OPT} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1324 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 1322 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 1383 (ovs-vswitchd)
    Tasks: 10 (limit: 102082)
   Memory: 47.7M
      CPU: 23.842s
   CGroup: /system.slice/ovs-vswitchd.service
           └─1383 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Oct 02 21:36:02 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00388|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:36:02 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00389|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00391|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00393|stream_unix|ERR|/var/run/openvswitch/br-int.mgmt: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00395|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00396|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00398|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00400|stream_unix|ERR|/var/run/openvswitch/br-int.mgmt: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00402|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00403|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory

● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: active (exited) since Fri 2020-10-02 20:39:56 UTC; 58min ago
  Process: 1392 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 1392 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 102082)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/openvswitch.service

Oct 02 20:39:56 weliang26-nwdc5-control-plane-2 systemd[1]: Starting Open vSwitch...
Oct 02 20:39:56 weliang26-nwdc5-control-plane-2 systemd[1]: Started Open vSwitch.
Removing debug pod ...
I think it's normal for OVS not to be running in systemd on all nodes yet, since your MCO still shows 4.5:

machine-config   4.5.0-0.nightly-2020-10-01-011427

Some nodes probably have not been rebooted yet to fully upgrade to 4.6. The way to check is to run cat /etc/*release* and see what OS the nodes that are missing systemd OVS are using; they should still be on 4.5. Regardless, we need to know why the upgrade failed. Do you have a setup I can debug, or can you attach a must-gather please?
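For example, a per-node check could look like this (a sketch reusing the debug loop from the earlier comment; nothing here is specific to this cluster):

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do
  echo "== ${f}"
  oc debug node/"${f}" -- chroot /host sh -c \
    'grep -E "^(NAME|VERSION)=" /etc/os-release ; systemctl is-active ovs-vswitchd openvswitch ; true'
done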
I think there is still a race condition with laying down/enabling new files during upgrades. Instead of using the file to determine whether OVS should be in systemd, I'm proposing using the OS version itself. Will move this bug to 4.7 and clone it for a backport.
Hi Tim, here is the new cluster you can use to debug: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/116379/artifact/workdir/install-dir/auth/kubeconfig/*view*/

must-gather still does not work on this broken cluster.
Thanks Weibin. Looking at your cluster I can see why this failed. The broken node is on 4.6. However:

[root@weliang52-9pll7-compute-0 ~]# stat /host/etc/systemd/system/network-online.target.wants/ovs-configuration.service
stat: cannot stat '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service': No such file or directory

^ We don't find the ovs-configuration service, so we start the OVS containers. Now OVS is running in both a container and systemd, and things break. However, I see ovs-configuration was placed into:

/host/etc/systemd/system/multi-user.target.wants/ovs-configuration.service

even though the service specifies installation into network-online.target:

[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
# This service is used to move a physical NIC into OVS and reconfigure OVS to use the host IP
Requires=openvswitch.service
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service openvswitch.service network.service
Before=network-online.target kubelet.service crio.service node-valid-hostname.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target

sh-4.4# systemctl disable ovs-configuration.service
Removed /etc/systemd/system/multi-user.target.wants/ovs-configuration.service.
sh-4.4# systemctl enable ovs-configuration.service
Created symlink /etc/systemd/system/network-online.target.wants/ovs-configuration.service → /etc/systemd/system/ovs-configuration.service.
sh-4.4# stat /etc/systemd/system/network-online.target.wants/ovs-configuration.service
  File: /etc/systemd/system/network-online.target.wants/ovs-configuration.service -> /etc/systemd/system/ovs-configuration.service
  Size: 45              Blocks: 0          IO Block: 4096   symbolic link
Device: fd00h/64768d    Inode: 180451495   Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_file_t:s0
Access: 2020-10-05 17:49:31.598094614 +0000
Modify: 2020-10-05 17:49:31.497090653 +0000
Change: 2020-10-05 17:49:31.498090692 +0000
 Birth: -
sh-4.4#

^ By simply disabling/re-enabling, we can see the file is now installed into the right place. This means there is some bug with service installation in either MCO or RHCOS. Either way, my proposed fix should address the issue. I'll also bring this up with the MCO team to understand why it did not install the service correctly.
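To see which nodes are affected by the mis-enabled unit, a read-only check of where the symlink ended up could look like this (a sketch, reusing the node debug loop from the earlier comments):

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do
  echo "== ${f}"
  oc debug node/"${f}" -- chroot /host sh -c \
    'ls -l /etc/systemd/system/network-online.target.wants/ovs-configuration.service /etc/systemd/system/multi-user.target.wants/ovs-configuration.service 2>&1 ; true'
done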
MCO confirmed it is a bug. Filed https://bugzilla.redhat.com/show_bug.cgi?id=1885365
*** Bug 1885517 has been marked as a duplicate of this bug. ***
Moving back to ASSIGNED until https://github.com/openshift/machine-config-operator/pull/2145 merges and we verify this again.
Anurag, we don't need MCO 2145 with the changes in CNO 825. That change now checks for a created file rather than a systemd unit location: https://github.com/openshift/cluster-network-operator/pull/825/files#diff-21a36954dbfb576c0c9b366428438e83R55
As discussed with Tim a bit, the OVS systemd units discussed in comment 38 still look inactive on 4.6.0-0.nightly-2020-10-08-043318, which contains both 2140 and 825. Moving this back to ASSIGNED; it is being troubleshot between Tim and QE.

cluster: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/116786/artifact/workdir/install-dir/auth/kubeconfig
It looks like the cause of this upgrade stalling is that connections to the API server service are being rejected. I can see the reject ACL is still there after the initial bring-up of the 4.6 ovn pods:

[root@ip-10-0-215-110 ~]# ovn-nbctl --no-leader-only list acl 84ffba22-20a9-42ad-a8e4-5ad7785e0890-172.30.0.1:443
_uuid               : 463ccb51-4f5c-42ae-bc6b-c557cc55f51e
action              : reject
direction           : from-lport
external_ids        : {}
log                 : false
match               : "ip4.dst==172.30.0.1 && tcp && tcp.dst==443"
meter               : []
name                : "84ffba22-20a9-42ad-a8e4-5ad7785e0890-172.30.0.1:443"
priority            : 1000
severity            : []

This should have been removed via https://github.com/ovn-org/ovn-kubernetes/pull/1738, which would have removed it during the initial service sync or when the endpoint add event happened. I can see in the logs that no reject service ACLs were removed during sync, and also that the endpoints for kubernetes were added:

I1008 10:03:49.584268       1 service.go:244] Creating service kubernetes
I1008 10:03:49.584297       1 endpoints.go:46] Adding endpoints: kubernetes for namespace: default

This also should have removed the reject ACL.
Created attachment 1720035 [details] master logs from acl removal failed
Looks like this is a string-match problem during service sync: generateACLName produces a name with escaped backslashes (\\) in it, but the string we actually compare against does not contain the escaped slashes: https://github.com/ovn-org/ovn-kubernetes/pull/1749
*** Bug 1886786 has been marked as a duplicate of this bug. ***
Looks like there is a potential issue when the 4.5 ovnkube-node pod upgrades to 4.6 before the ovnkube-master pod on the master node. Pushed a fix: https://github.com/openshift/ovn-kubernetes/pull/307

Also noticed we were missing the new CNO check on ovnkube-master: https://github.com/openshift/cluster-network-operator/pull/836
*** Bug 1887055 has been marked as a duplicate of this bug. ***
FYI, cluster bot job `test upgrade 4.5.0-0.nightly-2020-10-10-030038 openshift/ovn-kubernetes#307,openshift/cluster-network-operator#836 aws,ovn` succeeded
*** Bug 1888222 has been marked as a duplicate of this bug. ***
*** Bug 1888075 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
OVN is currently Tech Preview on 4.5 and is not guaranteed to upgrade to 4.6 (the GA release). There will be a subsequent 4.6.z release that will support upgrade from 4.5.
As OVN is Tech Preview, we will not remove an edge because of this bug. Hence removing the UpgradeBlocker keyword.
*** Bug 1888959 has been marked as a duplicate of this bug. ***
*** Bug 1889393 has been marked as a duplicate of this bug. ***
Upgrading from 4.5.0-0.nightly-2020-10-31-200727 to 4.6.0-0.nightly-2020-11-02-081936 on AWS passed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.3 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4339