Bug 1874385
| Summary: | [ovn][upgrade][4.6] Failed to upgrade for ovn network plugin | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | zhaozhanqi <zzhao> |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aconstan, anusaxen, bhershbe, huirwang, rbrattai, sdodson, vlaad, weliang, wking, zhsun |
| Version: | 4.5 | Keywords: | Regression, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1874737 1874928 1882085 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:36:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1874737, 1874928, 1882085, 1882092 | | |
Description  zhaozhanqi  2020-09-01 08:48:54 UTC
OK, so an update on this. Quick recap of what is going on: an OVN cluster cannot upgrade from 4.5.7 -> 4.5.0-0.nightly-2020-08-29-080432. Why?

1) Because the openshift-apiserver is degraded, which in turn degrades other operators that depend on OpenShift-specific resources (like routes.route.openshift.io), as those resources are inaccessible given the state of the openshift-apiserver.

2) Because the communication between the openshift-apiserver and etcd is having issues; the logs are filled with:

```
I0901 15:42:26.575373       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:27.600071       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:27.600073       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255006       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255451       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255678       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255833       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
```

I've been trying to make sense of whether the problem is:

a) in the apiserver
b) in the connection to etcd
c) in etcd

To investigate a) I had a look at the apiserver pods and, as cited above, the logs are filled with "transport is closing", which according to:

https://chromium.googlesource.com/external/github.com/grpc/grpc-go/#:~:text=The%20RPC%20failed%20with%20error,are%20many%20possible%20reasons%2C%20including%3A&text=If%20this%20is%20the%20case,longer%20RPC%20calls%20to%20finish

seems to point to a server/connection problem rather than a client problem. Thus the problem is more likely b) or c).
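A quick way to confirm these errors are not limited to a single replica is to count them per apiserver pod. This is a minimal sketch, not part of the original report; it assumes `oc` access to the cluster and the default openshift-apiserver namespace.

```bash
# Minimal sketch (not from the bug report): count "transport is closing" errors
# in each openshift-apiserver pod to see whether every replica is affected.
for pod in $(oc get pods -n openshift-apiserver -o name); do
  count=$(oc logs -n openshift-apiserver "$pod" --all-containers 2>/dev/null \
            | grep -c 'transport is closing')
  echo "$pod: $count occurrences"
done
```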
b) I had a look at networking and DNS. DNS seems to be problematic from the apiserver:

```
[root@apiserver-578f858cb-glj7s /]# curl -w "@curlformat.txt" -o /dev/null -s "http://google.com/"
    time_namelookup:  10.521s
       time_connect:  10.531s
    time_appconnect:  0.000s
   time_pretransfer:  10.531s
      time_redirect:  0.000s
 time_starttransfer:  10.545s
                      ----------
         time_total:  10.545s

[root@apiserver-578f858cb-glj7s /]# getent ahosts google.com
172.217.12.238           STREAM google.com
172.217.12.238           DGRAM
172.217.12.238           RAW
2607:f8b0:4004:807::200e STREAM
2607:f8b0:4004:807::200e DGRAM
2607:f8b0:4004:807::200e RAW

[root@apiserver-578f858cb-glj7s /]# curl -w "@curlformat.txt" -o /dev/null -s "http://172.217.12.238:80/"
    time_namelookup:  0.000s
       time_connect:  1.023s
    time_appconnect:  0.000s
   time_pretransfer:  1.023s
      time_redirect:  0.000s
 time_starttransfer:  1.036s
                      ----------
         time_total:  1.036s

[root@apiserver-578f858cb-glj7s /]# cat curlformat.txt
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
    time_appconnect:  %{time_appconnect}s\n
   time_pretransfer:  %{time_pretransfer}s\n
      time_redirect:  %{time_redirect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
                      ----------\n
         time_total:  %{time_total}s\n
```

I.e., DNS takes around 10 seconds to resolve. I've been trying to understand why that is, and I am seeing that the apiserver cannot communicate with some DNS server pods:

```
$ oc get svc -n openshift-dns dns-default-health
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
dns-default-health   ClusterIP   172.30.48.106   <none>        88/TCP    36m

aconstan@linux-3 ~ $ oc get ep -n openshift-dns dns-default-health
NAME                 ENDPOINTS                                                       AGE
dns-default-health   10.128.0.19:8080,10.128.2.3:8080,10.129.0.10:8080 + 3 more...   36m
```

From the apiserver I've tried to curl every DNS health probe individually; for two DNS pods the connection does not go through:

```
[root@apiserver-578f858cb-glj7s /]# curl 10.128.2.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.131.0.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.130.0.4:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.129.0.10:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.128.2.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.128.0.19:8080/health
^C
[root@apiserver-578f858cb-glj7s /]# curl 10.129.2.3:8080/health
^C
```

I.e., the apiserver cannot communicate with 10.128.0.19 and 10.129.2.3 directly. This *might* be the reason behind the slow DNS responses above, and *possibly* why the apiserver is having problems communicating with etcd (that communication goes through a cluster service, which obviously needs to be resolved).

Also from the apiserver pod, we have:

```
[root@apiserver-578f858cb-glj7s /]# curl dns-default-health.openshift-dns.svc.cluster.local/health
curl: (6) Could not resolve host: dns-default-health.openshift-dns.svc.cluster.local; Unknown error
```
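A small loop can automate the per-endpoint check above. This is a minimal sketch, not part of the original report; it assumes `oc` access and that the dns-default-health endpoints serve /health on port 8080 as shown above. To reproduce the exact failure path, the curl part would need to run from inside the affected apiserver pod.

```bash
# Minimal sketch (not from the bug report): probe every dns-default-health endpoint
# and flag the ones that do not answer within a couple of seconds.
for ip in $(oc get ep -n openshift-dns dns-default-health \
              -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  if curl -s --max-time 2 "http://${ip}:8080/health" >/dev/null; then
    echo "${ip}: OK"
  else
    echo "${ip}: UNREACHABLE"
  fi
done
```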
c) I am seeing logs in etcd (oc logs -c etcd ETCD_POD -n openshift-etcd):

```
range request "key:\"/kubernetes.io/$SOME_RESOURCE\" " with result "range_response_count:1 size:828" took too long (166.233937ms) to execute
```

Not sure what to make of that.

All in all, there are three clues:

1) I will take care of investigating why the apiserver pod cannot communicate with those two DNS server pods.
2) Once 1) is done, and if the problem persists after that fix, I will assign this to the Routing team to check the slow DNS queries.
3) Once 2) is done (or in parallel), someone from the etcd team might help figure out what those etcd logs in c) mean and whether they can have an impact on the apiserver errors above.

/Alex

Since this issue may affect the customer Verizon's upgrade, I have set the target to 4.6 and cloned one bug to 4.5.z. Please correct me if I'm wrong, thanks.

We first discovered this on two vSphere clusters with OVN, upgrading from 4.5.7 -> 4.5.0-0.nightly-2020-08-29-080432, then reproduced it on OSP, AWS and GCP.

```
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h18m
cloud-credential                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h37m
cluster-autoscaler                         4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h28m
config-operator                            4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h28m
console                                    4.5.0-0.nightly-2020-08-29-080432   True        False         True       120m
csi-snapshot-controller                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      155m
dns                                        4.5.7                               True        False         False      3h32m
etcd                                       4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h33m
image-registry                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      121m
ingress                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h23m
insights                                   4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h29m
kube-apiserver                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
kube-controller-manager                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
kube-scheduler                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h30m
kube-storage-version-migrator              4.5.0-0.nightly-2020-08-29-080432   True        False         False      154m
machine-api                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h26m
machine-approver                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
machine-config                             4.5.7                               True        False         False      3h27m
marketplace                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
monitoring                                 4.5.0-0.nightly-2020-08-29-080432   False       False         True       27m
network                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h35m
node-tuning                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
openshift-apiserver                        4.5.0-0.nightly-2020-08-29-080432   False       False         False      24m
openshift-controller-manager               4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h21m
openshift-samples                          4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h34m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h34m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-08-29-080432   False       True          False      4m50s
service-ca                                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h35m
storage                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      123m
```

I've continued looking at the cluster from yesterday (i.e. comment #1). With the help of someone from OVN we've been able to trace the upgrade problem, and it seems what we're missing is: https://github.com/openshift/cluster-network-operator/pull/755

The ovn-controller logs are filled with:

```
2020-09-01T08:51:35Z|04744|ovsdb_idl|WARN|transaction error: {"details":"No column other_config in table Chassis.","error":"unknown column","syntax":"{\"encaps\":[\"named-uuid\",\"rowf3a08f1f_0d25_4a80_843d_c5c70f0e27b6\"],\"external_ids\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]],\"hostname\":\"ugdci01091820-xctdf-master-2\",\"name\":\"37c5871d-b843-45d6-b39c-9a107a04344d\",\"other_config\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]]}"}
```

suggesting that "Upgraded ovn-controllers will not see any new columns because of this resulting in datapath disruptions.", as the PR mentions - and which completely fits our picture after these upgrades. I will discuss this internally within our team to see how to handle it.

/Alex
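To see how widespread the schema mismatch is, the same error can be counted on every node. This is a minimal sketch, not part of the original report; it assumes the ovn-controller container runs inside the ovnkube-node pods in the openshift-ovn-kubernetes namespace (pod label and container name may differ between releases).

```bash
# Minimal sketch (not from the bug report): count the "unknown column" transaction
# errors emitted by ovn-controller on each node. Namespace, pod label, and container
# name are assumptions about the default OVN-Kubernetes layout.
for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o name); do
  echo -n "$pod: "
  oc logs -n openshift-ovn-kubernetes "$pod" -c ovn-controller 2>/dev/null \
    | grep -c 'No column other_config in table Chassis' || true
done
```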
@Alex, I believe the verification is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1878030. Will keep an eye on this, thanks.

Anurag, 4.6.0-0.nightly-2020-09-10-121352 should not include the fixed PR; the PR merged in 4.6.0-0.nightly-2020-09-12-164537.

I tried an upgrade from 4.5.9 to 4.6.0-0.nightly-2020-09-13-023938 and it works well: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5079/console

Moving this bug to 'verified'.

That's great, thanks for verifying that, Zhanqi. Yes, since the latest builds were suspected to be blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1878030, that's where the confusion came from.

*** Bug 1871068 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
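For checking which nightly actually contains the cluster-network-operator fix, `oc adm release info --commits` lists the commit each payload component was built from. This is a minimal sketch, not part of the original report; the release image pullspec is a placeholder to be taken from the release controller for the nightly in question.

```bash
# Minimal sketch (not from the bug report): show which cluster-network-operator
# commit a given release payload was built from, so it can be compared against the
# PR's merge commit. RELEASE_IMAGE is a placeholder pullspec for the nightly to check.
RELEASE_IMAGE="<release-image-pullspec>"
oc adm release info --commits "$RELEASE_IMAGE" | grep cluster-network-operator
```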