Description of problem:
[OVN] Upgrade from 4.5.8 to 4.6.0-fc.5 failed on Bare Metal

Version-Release number of the following components:
4.5.8 to 4.6.0-fc.5

How reproducible:
Always

Steps to Reproduce:
[weliang@weliang verification-tests]$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
[weliang@weliang verification-tests]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64 --allow-explicit-upgrade --force=true
Updating to release image quay.io/openshift-release-dev/ocp-release:4.6.0-fc.5-x86_64

Actual results:
[weliang@weliang verification-tests]$ oc get nodes
NAME                                STATUS   ROLES    AGE     VERSION
weliang-182-8vzkb-compute-0         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-compute-1         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-compute-2         Ready    worker   5h11m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-0   Ready    master   5h21m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-1   Ready    master   5h21m   v1.18.3+6c42de8
weliang-182-8vzkb-control-plane-2   Ready    master   5h21m   v1.18.3+6c42de8

[weliang@weliang verification-tests]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-fc.5   False       True          True       3h57m
cloud-credential                           4.6.0-fc.5   True        False         False      5h22m
cluster-autoscaler                         4.6.0-fc.5   True        False         False      5h13m
config-operator                            4.6.0-fc.5   True        False         False      5h13m
console                                    4.6.0-fc.5   True        False         True       4h11m
csi-snapshot-controller                    4.6.0-fc.5   True        False         False      4h1m
dns                                        4.5.8        True        False         False      5h18m
etcd                                       4.6.0-fc.5   True        False         False      5h17m
image-registry                             4.6.0-fc.5   True        False         False      5h9m
ingress                                    4.6.0-fc.5   True        False         False      4h12m
insights                                   4.6.0-fc.5   True        False         False      5h13m
kube-apiserver                             4.6.0-fc.5   True        False         False      5h17m
kube-controller-manager                    4.6.0-fc.5   True        False         False      5h17m
kube-scheduler                             4.6.0-fc.5   True        False         False      5h17m
kube-storage-version-migrator              4.6.0-fc.5   True        False         False      5h10m
machine-api                                4.6.0-fc.5   True        False         False      5h13m
machine-approver                           4.6.0-fc.5   True        False         False      5h15m
machine-config                             4.5.8        True        False         False      5h17m
marketplace                                4.6.0-fc.5   True        False         False      4h11m
monitoring                                 4.6.0-fc.5   False       False         True       3h55m
network                                    4.6.0-fc.5   True        False         False      5h19m
node-tuning                                4.6.0-fc.5   True        False         False      4h12m
openshift-apiserver                        4.6.0-fc.5   False       False         False      3h57m
openshift-controller-manager               4.6.0-fc.5   True        False         False      5h8m
openshift-samples                          4.6.0-fc.5   True        False         False      4h12m
operator-lifecycle-manager                 4.6.0-fc.5   True        False         False      5h19m
operator-lifecycle-manager-catalog         4.6.0-fc.5   True        False         False      5h19m
operator-lifecycle-manager-packageserver   4.6.0-fc.5   False       False         False      125m
service-ca                                 4.6.0-fc.5   True        False         False      5h19m
storage                                    4.6.0-fc.5   True        False         False      4h12m

[weliang@weliang verification-tests]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        True          4h27m   Unable to apply 4.6.0-fc.5: the cluster operator openshift-apiserver has not yet successfully rolled out
[weliang@weliang verification-tests]$

Expected results:
Upgrade succeeds.
Additional info:
Cannot run must-gather in the broken cluster:

[weliang@weliang verification-tests]$ oc adm must-gather --dest-dir=/tmp/log_must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8ec5afb544381d78514ea0367ca29edac1a7aadfa323d191cf84d60d01045e0
[must-gather      ] OUT namespace/openshift-must-gather-sjm8s created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jjcdv created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b8ec5afb544381d78514ea0367ca29edac1a7aadfa323d191cf84d60d01045e0 created
[must-gather-zp26b] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get routes.route.openshift.io)
[must-gather-zp26b] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get deploymentconfigs.apps.openshift.io)
[must-gather-zp26b] OUT gather logs unavailable: unexpected EOF
[must-gather-zp26b] OUT waiting for gather to complete
[must-gather-zp26b] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jjcdv deleted
[must-gather      ] OUT namespace/openshift-must-gather-sjm8s deleted
error: gather never finished for pod must-gather-zp26b: timed out waiting for the condition
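When must-gather cannot complete because the aggregated OpenShift API is down, a narrower collection sometimes still works against the resources the kube-apiserver serves directly. This is only a suggestion (the target resources below are examples, not required):

# Collect operator-level data without running the full must-gather pod
oc adm inspect clusteroperator/openshift-apiserver --dest-dir=/tmp/inspect
oc adm inspect clusteroperator/network --dest-dir=/tmp/inspect
oc adm inspect ns/openshift-ovn-kubernetes --dest-dir=/tmp/inspect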
What does oc get co openshift-apiserver -o yaml say?
Same for all other operators with Available=False or Degraded=True, please.
Created attachment 1715553 [details] Test Results Log
Test env: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113589/artifact/workdir/install-dir/auth/kubeconfig/*view*/
This is a network problem:

$ kubectl get -n openshift-apiserver endpoints
NAME   ENDPOINTS                                             AGE
api    10.128.0.20:8443,10.129.0.13:8443,10.130.0.29:8443   155m

Then oc debug node/weliang-211-wz4nn-control-plane-0 and curl -k <endpoint>:8443. The second endpoint "10.129.0.13:8443" blocks.
Then oc debug node/weliang-211-wz4nn-control-plane-1 and the same curls again. The endpoint "10.129.0.13:8443" works.
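A concrete form of that per-node check (just a sketch; the node name and endpoints are the ones from the output above, and /healthz is only used to force a short HTTP response):

oc debug node/weliang-211-wz4nn-control-plane-0 -- chroot /host sh -c '
  for ep in 10.128.0.20:8443 10.129.0.13:8443 10.130.0.29:8443; do
    echo "== $ep"
    curl -k -sS --max-time 5 -o /dev/null -w "%{http_code}\n" https://$ep/healthz || echo "timeout/refused"
  done'
# Repeat from the other control-plane nodes; an endpoint that hangs from one node
# but answers from another points at east-west (pod network) connectivity.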
In ovn controller logs on master-1 I see those messages around every minute: 2020-09-21T16:18:59Z|06924|ovsdb_idl|WARN|transaction error: {"details":"No column other_config in table Chassis.","error":"unknown column","syntax":"{\"encaps\":[\"named-uuid\",\"row451c853e_d3f1_4da8_ac8a_390587e3bc5f\"],\"external_ids\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]],\"hostname\":\"weliang-211-wz4nn-control-plane-1\",\"name\":\"31473c2e-568a-4970-b865-fd3932174b18\",\"other_config\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]]}"}
And this nice message: 2020-09-21T16:18:59Z|06923|ovsdb_idl|WARN|Dropped 34665 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
At the same time:

NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.6.0-fc.5   True        False         False      169m
*** Bug 1880449 has been marked as a duplicate of this bug. ***
Hi Weibin, could you reproduce again and provide a kubeconfig? The one attached in #comment 4 is not valid anymore it seems. /Alex
Hi, also, I am wondering if this is a valid upgrade path? 4.6.0-fc.5 contains ovn-kubernetes code from a month ago. This predates the changes introduced a week ago, which essentially means no one will ever upgrade to this version in the future.

@Weibin, could you try an upgrade to 4.6.0-fc.7: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.6.0-fc.7?from=4.6.0-fc.4

I see it was built 4 days ago, so it should contain the current 4.6 networking code (I can't verify, though, as that page returns an error for that release).

/Alex
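For reference, the corresponding command would presumably mirror the one in the bug description, just pointing at the fc.7 payload (assuming the same pullspec pattern as the fc.5 command above; not verified against the release page):

oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.6.0-fc.7-x86_64 --allow-explicit-upgrade --force=true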
Wondering if this might be connected to https://github.com/ovn-org/ovn-kubernetes/pull/1720
> Wondering if this might be connected to https://github.com/ovn-org/ovn-kubernetes/pull/1720 No, for sure not on a cluster upgrade to 4.6.0-fc.5. That PR fixes a new bug found on CI caused by the changes from last week, which 4.6.0-fc.5 does not have
I don't think upgrade is going to work from 4.5 -> latest 4.6. CNO will upgrade before MCO, and we will be left in the situation where the node has not run ovs-configuration yet, but ovn-k8s has upgraded and will attempt to create the bridge. It would still be good for QE to verify this is the case. I'll come up with a fix in parallel.
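One way QE could check that on a node (a sketch; <node-name> is a placeholder): if ovn-k8s is already on 4.6 but the unit below is missing or has never run, that matches the race described above.

oc debug node/<node-name> -- chroot /host systemctl status ovs-configuration.service
oc debug node/<node-name> -- chroot /host systemctl status openvswitch.service ovs-vswitchd.service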
Looking into comment 6, it seems like the ovn dbs are not upgraded to the latest schema, and hence the ovn-controllers are not seeing the newly added column, other_config.

Probably this fix is required - https://github.com/openshift/cluster-network-operator/commit/eaff68c539225391a7cf7a3d21edd8283426b7e8#diff-54b09156f80dfb820afa115de13b8f32

This patch makes sure that the ovn dbs are updated.

There is a workaround for this issue - log in to the nbdb/sbdb container and run the commands from here - https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L171
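A rough sketch of that workaround (the pod label and container names below are assumptions based on the ovnkube-master manifest; the actual schema-upgrade commands are the ones in the linked ovnkube-master.yaml and are not reproduced here):

# Find the master pods that carry the nbdb/sbdb containers
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o wide
# Open a shell in the nbdb (and likewise the sbdb) container and run the
# db-upgrade commands from the linked manifest there
oc -n openshift-ovn-kubernetes exec -it <ovnkube-master-pod> -c nbdb -- bash
oc -n openshift-ovn-kubernetes exec -it <ovnkube-master-pod> -c sbdb -- bash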
Clarifying #comment 15:
> Probably this fix is required - https://github.com/openshift/cluster-network-operator/commit/eaff68c539225391a7cf7a3d21edd8283426b7e8#diff-54b09156f80dfb820afa115de13b8f32

4.6.0-fc.5 does not have that patch either, which is probably why those errors are showing up.
4.5.8 -> 4.6.0-fc.7 upgrade testing failed too.

[weliang@weliang ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-fc.7   False       True          True       13m
cloud-credential                           4.6.0-fc.7   True        False         False      127m
cluster-autoscaler                         4.6.0-fc.7   True        False         False      103m
config-operator                            4.6.0-fc.7   True        False         False      103m
console                                    4.6.0-fc.7   False       False         True       11m
csi-snapshot-controller                    4.6.0-fc.7   False       True          False      11m
dns                                        4.5.8        True        True          True       109m
etcd                                       4.6.0-fc.7   True        False         False      108m
image-registry                             4.6.0-fc.7   False       True          False      13m
ingress                                    4.6.0-fc.7   True        False         False      24m
insights                                   4.6.0-fc.7   True        False         False      104m
kube-apiserver                             4.6.0-fc.7   True        False         False      108m
kube-controller-manager                    4.6.0-fc.7   True        False         False      109m
kube-scheduler                             4.6.0-fc.7   True        False         False      108m
kube-storage-version-migrator              4.6.0-fc.7   True        False         False      99m
machine-api                                4.6.0-fc.7   True        False         False      104m
machine-approver                           4.6.0-fc.7   True        False         False      107m
machine-config                             4.5.8        True        False         False      108m
marketplace                                4.6.0-fc.7   True        False         False      23m
monitoring                                 4.6.0-fc.7   True        False         False      22m
network                                    4.6.0-fc.7   True        False         False      110m
node-tuning                                4.6.0-fc.7   True        False         False      24m
openshift-apiserver                        4.6.0-fc.7   True        False         False      14m
openshift-controller-manager               4.6.0-fc.7   True        False         False      104m
openshift-samples                          4.6.0-fc.7   True        False         False      23m
operator-lifecycle-manager                 4.6.0-fc.7   True        False         False      109m
operator-lifecycle-manager-catalog         4.6.0-fc.7   True        False         False      110m
operator-lifecycle-manager-packageserver   4.6.0-fc.7   True        False         False      23m
service-ca                                 4.6.0-fc.7   True        False         False      110m
storage                                    4.6.0-fc.7   True        False         False      24m
[weliang@weliang ~]$

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        True          45m     Unable to apply 4.6.0-fc.7: an unknown error has occurred: MultipleErrors
[weliang@weliang ~]$

kubeconfig: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113869/artifact/workdir/install-dir/auth/kubeconfig/*view*/
Thanks Weibin. As I suspected it looks like we are creating a bridge, which wont work during upgrade: Bridge brens3 fail_mode: standalone Port patch-br-local_weliang-2211-48l5l-compute-0-to-br-int Interface patch-br-local_weliang-2211-48l5l-compute-0-to-br-int type: patch options: {peer=patch-br-int-to-br-local_weliang-2211-48l5l-compute-0} I'll have a fix to try shortly. What's the easiest way to test this out? Can I do it with cluster bot?
`test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work. Suggest scheduling a couple of runs, e.g. on aws and gcp.
(In reply to Anurag saxena from comment #19)
> `test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work.
> Suggest scheduling a couple of runs, e.g. on aws and gcp.

Thanks, Anurag!

For a Bare Metal cluster, the command needs to use metal instead, like this:

test upgrade 4.5.8 openshift/ovn-kubernetes#xxx metal,ovn
(In reply to Weibin Liang from comment #20)
> (In reply to Anurag saxena from comment #19)
> > `test upgrade 4.5.8 openshift/ovn-kubernetes#xxx aws,ovn` should work.
> > Suggest scheduling a couple of runs, e.g. on aws and gcp.
>
> Thanks, Anurag!
>
> For a Bare Metal cluster, the command needs to use metal instead, like this:
>
> test upgrade 4.5.8 openshift/ovn-kubernetes#xxx metal,ovn

Oh yeah, it's BM so that makes sense. Not sure how the bot handles BM though. Worth a try.
*** Bug 1880514 has been marked as a duplicate of this bug. ***
This should not be limited to bare metal, I guess. I tried on AWS with OVN and hit the same issue.
It looks like the problem is that we have a reject ACL present on the kapi service:

_uuid               : 5b37d90d-8a86-4dc7-a17b-8afe370f9154
action              : reject
direction           : from-lport
external_ids        : {}
log                 : false
match               : "ip4.dst==172.30.0.1 && tcp && tcp.dst==443"
meter               : []
name                : "948626bb-6702-427a-89f5-afeaf2600552-172.30.0.1:443"
priority            : 1000
severity            : []

Even though the endpoints are all there:

[root@huir-upg3-xsm9v-master-1 ~]# ovn-nbctl list load_balancer 948626bb-6702-427a-89f5-afeaf2600552 | grep 172.30.0.1
vips                : {"172.30.0.10:53"="10.128.0.20:5353,10.128.2.3:5353,10.129.0.11:5353,10.129.2.3:5353,10.130.0.9:5353,10.131.0.3:5353", "172.30.0.10:9154"="10.128.0.20:9154,10.128.2.3:9154,10.129.0.11:9154,10.129.2.3:9154,10.130.0.9:9154,10.131.0.3:9154", "172.30.0.1:443"="192.168.0.123:6443,192.168.0.74:6443,192.168.1.183:6443"

This means something got out of sync with our services/endpoints handling logic. Because all of those log messages are debug level only, it's hard to know whether these ACLs were present before the ovnkube-masters were upgraded or whether they were added afterwards when the new ovnkube-master came up. It would be helpful if you could reproduce it with debug-level logs on.

Either way we can make improvements by:
1. Having syncServices remove stale reject ACLs. Since we only use reject ACLs on services now, we can poll all reject ACLs in OVN and decide whether to remove them. This would fix the case where these ACLs are left over from a previous instance of ovn-k8s.
2. Making deleteLoadBalancerRejectACL better. Right now we only look at the cache to see if there is an ACL configured for that service. If we don't find one, we should still generate the name and try to delete it anyway.
3. Moving a lot of these debug-level service/endpoints logging messages to Info.
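For anyone trying to reproduce, a quick way to see whether any reject ACLs are present in the NB database (a sketch; the pod name is a placeholder, and ovn-nbctl is run from the nbdb container as in the output above):

oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl --no-leader-only find acl action=reject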
Huiran or Anurag, can you please try upgrade again and include: https://github.com/openshift/ovn-kubernetes/pull/295 ? See if that resolves the problem.
Thanks Huiran. My patch was flawed, but either it is coincidence or it exposed two OVN crashes. OVS also looks hosed. Filing a new bug on OVN, and will update my patch with the correct logic.
Update: The bug description says bare metal, but the behavior in comment 29 is now seen across all platforms.
(In reply to Tim Rozet from comment #28)
> Huiran or Anurag, can you please try upgrade again and include:
> https://github.com/openshift/ovn-kubernetes/pull/295 ?
>
> See if that resolves the problem.

Tim, my gcp job succeeded with PR 295:

`test upgrade 4.5.0-0.nightly-2020-09-28-124031 openshift/ovn-kubernetes#295 gcp,ovn` succeeded

The bot doesn't seem to be working well with metal, so I believe the fix should work for all platforms.
Thanks Anurag. I just updated 295 to test out latest changes. Following upstream PR here: https://github.com/ovn-org/ovn-kubernetes/pull/1738
test upgrade 4.5.0-0.nightly-2020-09-28-124031 openshift/ovn-kubernetes#297 gcp,ovn succeeded
*** Bug 1883521 has been marked as a duplicate of this bug. ***
Upgrading still fails from latest 4.5 to 4.6 on a BM OVN cluster; openvswitch.service is inactive on some nodes.

[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-10-01-011427   True        True          106m    Unable to apply 4.6.0-0.nightly-2020-10-02-160623: the cluster operator openshift-apiserver is degraded

[weliang@weliang ~]$ oc get pods --all-namespaces -o wide | egrep -v "Runn|Comp"
NAMESPACE                             NAME                                                READY   STATUS              RESTARTS   AGE    IP            NODE                              NOMINATED NODE   READINESS GATES
openshift-apiserver                   apiserver-5c5558d85-6vgwh                           0/2     Pending             0          58m    <none>        <none>                            <none>           <none>
openshift-controller-manager          controller-manager-rqk8g                            0/1     ContainerCreating   1          83m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-dns                         dns-default-lk2n8                                   0/3     ContainerCreating   0          66m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-etcd                        etcd-quorum-guard-7986975d98-cq72h                  0/1     Pending             0          58m    <none>        <none>                            <none>           <none>
openshift-kube-apiserver              revision-pruner-8-weliang26-nwdc5-control-plane-2   0/1     ContainerCreating   0          56m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-kube-descheduler-operator   cluster-6fb776d674-x2drl                            0/1     ImagePullBackOff    0          130m   10.131.0.14   weliang26-nwdc5-compute-1         <none>           <none>
openshift-multus                      multus-admission-controller-j5scn                   0/2     ContainerCreating   0          74m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-multus                      network-metrics-daemon-dglwf                        0/2     ContainerCreating   0          76m    <none>        weliang26-nwdc5-control-plane-2   <none>           <none>
openshift-oauth-apiserver             apiserver-5988648dd4-wsq6h                          0/1     Pending             0          58m    <none>        <none>                            <none>           <none>

[weliang@weliang ~]$ oc get co | grep -v "True.*False.*False"
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.6.0-0.nightly-2020-10-02-160623   True        False         True       13s
machine-config        4.5.0-0.nightly-2020-10-01-011427   False       True          True       67m
network               4.6.0-0.nightly-2020-10-02-160623   True        True          True       155m
openshift-apiserver   4.6.0-0.nightly-2020-10-02-160623   True        False         True       2s

[weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host systemctl status ovs-vswitchd openvswitch ; done
Starting pod/weliang26-nwdc5-compute-0-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-compute-1-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-compute-2-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-control-plane-0-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
Starting pod/weliang26-nwdc5-control-plane-1-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead)
● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/weliang26-nwdc5-control-plane-2-debug ...
To use host binaries, run `chroot /host`
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           └─10-ovs-vswitchd-restart.conf
   Active: active (running) since Fri 2020-10-02 20:39:56 UTC; 58min ago
  Process: 1331 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVS_USER_OPT} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1324 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 1322 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 1383 (ovs-vswitchd)
    Tasks: 10 (limit: 102082)
   Memory: 47.7M
      CPU: 23.842s
   CGroup: /system.slice/ovs-vswitchd.service
           └─1383 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Oct 02 21:36:02 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00388|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:36:02 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00389|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00391|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00393|stream_unix|ERR|/var/run/openvswitch/br-int.mgmt: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00395|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00396|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00398|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00400|stream_unix|ERR|/var/run/openvswitch/br-int.mgmt: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00402|stream_unix|ERR|/var/run/openvswitch/br-int.snoop: binding failed: No such file or directory
Oct 02 21:37:40 weliang26-nwdc5-control-plane-2 ovs-vswitchd[1383]: ovs|00403|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br-int.snoop: No such file or directory

● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
   Active: active (exited) since Fri 2020-10-02 20:39:56 UTC; 58min ago
  Process: 1392 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 1392 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 102082)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/openvswitch.service

Oct 02 20:39:56 weliang26-nwdc5-control-plane-2 systemd[1]: Starting Open vSwitch...
Oct 02 20:39:56 weliang26-nwdc5-control-plane-2 systemd[1]: Started Open vSwitch.
Removing debug pod ...
I think it's normal for OVS not to be running in systemd on all nodes yet, since your MCO still shows 4.5:

machine-config   4.5.0-0.nightly-2020-10-01-011427

Some nodes probably have not been rebooted yet to fully upgrade to 4.6. The way to check is to run cat /etc/*release* and see what OS the nodes that are missing systemd OVS are using; they should still be on 4.5. Regardless, we need to know why the upgrade failed. Do you have a setup I can debug, or can you attach a must-gather please?
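For example, a per-node check could look like this (a sketch reusing the debug loop from the earlier comment; nothing here is specific to this cluster):

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do
  echo "== ${f}"
  oc debug node/"${f}" -- chroot /host sh -c \
    'grep -E "^(NAME|VERSION)=" /etc/os-release ; systemctl is-active ovs-vswitchd openvswitch ; true'
done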
I think there is still a race condition with laying down/enabling new files during upgrades. Instead of using the file to determine whether OVS should be in systemd, I'm proposing using the OS version itself. Will move this bug to 4.7 and clone it for a backport.
Hi Tim, here is the new cluster you can use to debug: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/116379/artifact/workdir/install-dir/auth/kubeconfig/*view*/

must-gather still does not work on this broken cluster.
Thanks Weibin. Looking at your cluster I can see why this failed. The broken node is on 4.6. However:

[root@weliang52-9pll7-compute-0 ~]# stat /host/etc/systemd/system/network-online.target.wants/ovs-configuration.service
stat: cannot stat '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service': No such file or directory

^ We don't find the ovs-configuration service, so we start the OVS containers. Now OVS is running in both a container and systemd, and things break. However, I see ovs-configuration was placed into:

/host/etc/systemd/system/multi-user.target.wants/ovs-configuration.service

even though the service specifies installation into network-online.target:

[Unit]
Description=Configures OVS with proper host networking configuration
# Removal of this file signals firstboot completion
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json
# This service is used to move a physical NIC into OVS and reconfigure OVS to use the host IP
Requires=openvswitch.service
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service openvswitch.service network.service
Before=network-online.target kubelet.service crio.service node-valid-hostname.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes
StandardOutput=journal+console
StandardError=journal+console

[Install]
WantedBy=network-online.target

sh-4.4# systemctl disable ovs-configuration.service
Removed /etc/systemd/system/multi-user.target.wants/ovs-configuration.service.
sh-4.4# systemctl enable ovs-configuration.service
Created symlink /etc/systemd/system/network-online.target.wants/ovs-configuration.service → /etc/systemd/system/ovs-configuration.service.
sh-4.4# stat /etc/systemd/system/network-online.target.wants/ovs-configuration.service
  File: /etc/systemd/system/network-online.target.wants/ovs-configuration.service -> /etc/systemd/system/ovs-configuration.service
  Size: 45              Blocks: 0          IO Block: 4096   symbolic link
Device: fd00h/64768d    Inode: 180451495   Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:container_file_t:s0
Access: 2020-10-05 17:49:31.598094614 +0000
Modify: 2020-10-05 17:49:31.497090653 +0000
Change: 2020-10-05 17:49:31.498090692 +0000
 Birth: -
sh-4.4#

^ By simply disabling/re-enabling, we can see the file is now installed into the right place. This means there is some bug with service installation in either MCO or RHCOS. Either way, my proposed fix should address the issue. I'll also bring this up with the MCO team to understand why it did not install the service correctly.
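To see which nodes are affected by the mis-enabled unit, a read-only check of where the symlink ended up could look like this (a sketch, reusing the node debug loop from the earlier comments):

for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do
  echo "== ${f}"
  oc debug node/"${f}" -- chroot /host sh -c \
    'ls -l /etc/systemd/system/network-online.target.wants/ovs-configuration.service /etc/systemd/system/multi-user.target.wants/ovs-configuration.service 2>&1 ; true'
done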
MCO confirmed it is a bug. Filed https://bugzilla.redhat.com/show_bug.cgi?id=1885365
*** Bug 1885517 has been marked as a duplicate of this bug. ***
Moving back to ASSIGNED until https://github.com/openshift/machine-config-operator/pull/2145 merges and we verify this again.
Anurag, we don't need MCO 2145 with the changes in CNO 825. That change now checks for a created file rather than a systemd unit location: https://github.com/openshift/cluster-network-operator/pull/825/files#diff-21a36954dbfb576c0c9b366428438e83R55
As discussed with Tim a bit, the OVS systemd units discussed in comment 38 still look inactive on 4.6.0-0.nightly-2020-10-08-043318, which contains both 2140 and 825. Moving this back to ASSIGNED; it is being troubleshot between Tim and QE.

cluster: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/116786/artifact/workdir/install-dir/auth/kubeconfig
It looks like the cause of this upgrade stalling is that connections to the API server service are being rejected. I can see the reject ACL is still there after the initial bring-up of the 4.6 ovn pods:

[root@ip-10-0-215-110 ~]# ovn-nbctl --no-leader-only list acl 84ffba22-20a9-42ad-a8e4-5ad7785e0890-172.30.0.1:443
_uuid               : 463ccb51-4f5c-42ae-bc6b-c557cc55f51e
action              : reject
direction           : from-lport
external_ids        : {}
log                 : false
match               : "ip4.dst==172.30.0.1 && tcp && tcp.dst==443"
meter               : []
name                : "84ffba22-20a9-42ad-a8e4-5ad7785e0890-172.30.0.1:443"
priority            : 1000
severity            : []

This should have been removed via https://github.com/ovn-org/ovn-kubernetes/pull/1738, which would have removed it during the initial service sync or when the endpoint add event happened. I can see in the logs that no reject service ACLs were removed during sync, and also that the endpoints for kubernetes were added:

I1008 10:03:49.584268       1 service.go:244] Creating service kubernetes
I1008 10:03:49.584297       1 endpoints.go:46] Adding endpoints: kubernetes for namespace: default

This also should have removed the reject ACL.
Created attachment 1720035 [details] master logs from acl removal failed
Looks like this is a string-match problem during service sync: generateACLName produces a name with escaped backslashes (\\) in it, but the string we actually compare against does not contain the escaped slashes: https://github.com/ovn-org/ovn-kubernetes/pull/1749
*** Bug 1886786 has been marked as a duplicate of this bug. ***
Looks like there is a potential issue when the 4.5 ovnkube-node pod upgrades to 4.6 before the ovnkube-master pod on the master node. Pushed a fix: https://github.com/openshift/ovn-kubernetes/pull/307

Also noticed we were missing the new CNO check on ovnkube-master: https://github.com/openshift/cluster-network-operator/pull/836
*** Bug 1887055 has been marked as a duplicate of this bug. ***
FYI, cluster bot job `test upgrade 4.5.0-0.nightly-2020-10-10-030038 openshift/ovn-kubernetes#307,openshift/cluster-network-operator#836 aws,ovn` succeeded
*** Bug 1888222 has been marked as a duplicate of this bug. ***
*** Bug 1888075 has been marked as a duplicate of this bug. ***
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1
OVN is currently Tech Preview on 4.5 and is not guaranteed to upgrade to 4.6 (the GA release). There will be a subsequent 4.6.z release that will support upgrade from 4.5.
As OVN is Tech Preview, we will not remove an edge because of this bug. Hence removing the UpgradeBlocker keyword.
*** Bug 1888959 has been marked as a duplicate of this bug. ***
*** Bug 1889393 has been marked as a duplicate of this bug. ***
Upgrading from 4.5.0-0.nightly-2020-10-31-200727 to 4.6.0-0.nightly-2020-11-02-081936 on AWS passed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.3 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4339