Bug 1874385

Summary: [ovn][upgrade][4.6] Failed to upgrade for ovn network plugin

Product: OpenShift Container Platform
Reporter: zhaozhanqi <zzhao>
Component: Networking
Networking sub component: ovn-kubernetes
Assignee: Alexander Constantinescu <aconstan>
QA Contact: Anurag saxena <anusaxen>
Docs Contact:
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aconstan, anusaxen, bhershbe, huirwang, rbrattai, sdodson, vlaad, weliang, wking, zhsun
Version: 4.5
Keywords: Regression, TestBlocker
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1874737, 1874928, 1882085 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:36:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1874737, 1874928, 1882085, 1882092

Description zhaozhanqi 2020-09-01 08:48:54 UTC
Description of problem:
Upgrading from 4.5.7 to 4.5.0-0.nightly-2020-08-29-080432 with an OVN cluster,
openshift-apiserver pods do not work correctly because pods cannot communicate between nodes.

Version-Release number of selected component (if applicable):
4.5.7  ---> 4.5.0-0.nightly-2020-08-29-080432

How reproducible:


Steps to Reproduce:
1. Set up a cluster with the OVN network plugin
2. Upgrade to 4.5.0-0.nightly-2020-08-29-080432
3. 

Actual results:

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-08-29-080432   True        False         True       6h1m
cloud-credential                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h28m
cluster-autoscaler                         4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h17m
config-operator                            4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h18m
console                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      116m
csi-snapshot-controller                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h12m
dns                                        4.5.7                               True        False         False      6h24m
etcd                                       4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h23m
image-registry                             4.5.0-0.nightly-2020-08-29-080432   True        True          False      6h13m
ingress                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h12m
insights                                   4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h21m
kube-apiserver                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h23m
kube-controller-manager                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h22m
kube-scheduler                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h23m
kube-storage-version-migrator              4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h13m
machine-api                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h17m
machine-approver                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h23m
machine-config                             4.5.7                               True        False         False      6h15m
marketplace                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      5h22m
monitoring                                 4.5.0-0.nightly-2020-08-29-080432   False       True          True       132m
network                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h25m
node-tuning                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      5h22m
openshift-apiserver                        4.5.0-0.nightly-2020-08-29-080432   False       False         False      3m35s
openshift-controller-manager               4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h20m
openshift-samples                          4.5.0-0.nightly-2020-08-29-080432   True        False         False      5h22m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h24m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h24m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-08-29-080432   False       True          False      3m43s
service-ca                                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      6h25m
storage                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      5h23m


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.7     True        True          5h59m   Unable to apply 4.5.0-0.nightly-2020-08-29-080432: the cluster operator openshift-apiserver has not yet successfully rolled out

Expected results:
No issues during the upgrade.

Additional info:

Comment 3 Alexander Constantinescu 2020-09-01 17:06:09 UTC
OK, so an update on this. 

Quick recap of what is going on:

OVN cluster cannot upgrade from 4.5.7 -> 4.5.0-0.nightly-2020-08-29-080432

Why?

1) The openshift-apiserver is degraded, which in turn degrades other operators that depend on OpenShift-specific resources (like routes.route.openshift.io), since those resources are inaccessible given the state of the openshift-apiserver (see the sketch after the log excerpt below).
2) The communication between the openshift-apiserver and etcd is having issues; the logs are filled with:

I0901 15:42:26.575373       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:27.600071       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:27.600073       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255006       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255451       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255678       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0901 15:42:34.255833       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
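
For 1), here is a hedged sketch (not part of the original report) to list each operator's Degraded condition message and confirm which ones fail because OpenShift API resources served by the openshift-apiserver are unreachable:

oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="Degraded")].message}{"\n"}{end}'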

I've been trying to work out whether the problem is:

a) in the apiserver 
b) connection to etcd 
c) in etcd

To investigate:

a) I had a look in the apiserver pods and, as cited above, the logs are filled with "transport is closing", which according to https://chromium.googlesource.com/external/github.com/grpc/grpc-go/#:~:text=The%20RPC%20failed%20with%20error,are%20many%20possible%20reasons%2C%20including%3A&text=If%20this%20is%20the%20case,longer%20RPC%20calls%20to%20finish points more to a server/connection problem than to a client problem.

Thus more likely b) or c)

b) I had a look at networking and DNS; DNS resolution seems to be problematic from the apiserver:

[root@apiserver-578f858cb-glj7s /]# curl -w "@curlformat.txt" -o /dev/null -s "http://google.com/"
time_namelookup:  10.521s
       time_connect:  10.531s
    time_appconnect:  0.000s
   time_pretransfer:  10.531s
      time_redirect:  0.000s
 time_starttransfer:  10.545s
                    ----------
         time_total:  10.545s
[root@apiserver-578f858cb-glj7s /]# getent ahosts google.com
172.217.12.238  STREAM google.com
172.217.12.238  DGRAM  
172.217.12.238  RAW    
2607:f8b0:4004:807::200e STREAM 
2607:f8b0:4004:807::200e DGRAM  
2607:f8b0:4004:807::200e RAW    
[root@apiserver-578f858cb-glj7s /]# curl -w "@curlformat.txt" -o /dev/null -s "http://172.217.12.238:80/"
time_namelookup:  0.000s
       time_connect:  1.023s
    time_appconnect:  0.000s
   time_pretransfer:  1.023s
      time_redirect:  0.000s
 time_starttransfer:  1.036s
                    ----------
         time_total:  1.036s
[root@apiserver-578f858cb-glj7s /]# cat curlformat.txt 
time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
    time_appconnect:  %{time_appconnect}s\n
   time_pretransfer:  %{time_pretransfer}s\n
      time_redirect:  %{time_redirect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
                    ----------\n
         time_total:  %{time_total}s\n

I.e., DNS resolution takes around 10 seconds.
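
As a hedged follow-up (not part of the original report), the same curl timing option can be used in a loop to check whether the ~10 second delay is consistent across lookups, which would point to repeated DNS server timeouts rather than a one-off hiccup:

# Repeat the name-lookup timing a few times from inside the apiserver pod.
for i in 1 2 3; do
  curl -o /dev/null -s -w "attempt $i name lookup: %{time_namelookup}s\n" "http://google.com/"
done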

I've been trying to understand why that is, and I am seeing that the apiserver cannot communicate with some DNS server pods:

$ oc get svc -n openshift-dns dns-default-health
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
dns-default-health   ClusterIP   172.30.48.106   <none>        88/TCP    36m
$ oc get ep -n openshift-dns dns-default-health
NAME                 ENDPOINTS                                                       AGE
dns-default-health   10.128.0.19:8080,10.128.2.3:8080,10.129.0.10:8080 + 3 more...   36m

From the apiserver I've tried to curl every DNS health probe individually; for two DNS pods the connection does not go through:

[root@apiserver-578f858cb-glj7s /]# curl 10.128.2.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.131.0.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.130.0.4:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.129.0.10:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.128.2.3:8080/health
OK
[root@apiserver-578f858cb-glj7s /]# curl 10.128.0.19:8080/health
^C
[root@apiserver-578f858cb-glj7s /]# curl 10.129.2.3:8080/health
^C
[root@apiserver-578f858cb-glj7s /]# 
^C

I.e., the apiserver cannot communicate with 10.128.0.19 and 10.129.2.3 directly.
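
A hedged sketch (not part of the original report) that automates the same check: list the dns-default-health endpoints and probe each one from inside the apiserver pod with a short timeout, so unreachable pods fail fast instead of hanging on ^C. The pod, namespace and service names are the ones observed above.

# Sweep every dns-default-health endpoint from the apiserver pod's network namespace.
for ep in $(oc get ep -n openshift-dns dns-default-health \
    -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'); do
  printf '%s: ' "$ep"
  oc rsh -n openshift-apiserver apiserver-578f858cb-glj7s \
    curl -s --max-time 3 "http://$ep:8080/health" || printf 'unreachable'
  echo
done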

This *might* be the reason behind the slow DNS response above and *possibly* why it's having problems communicating with etcd (that is done through a cluster service, which obviously needs to be resolved)

Also from the api-server pod, we have:

[root@apiserver-578f858cb-glj7s /]# curl dns-default-health.openshift-dns.svc.cluster.local/health
curl: (6) Could not resolve host: dns-default-health.openshift-dns.svc.cluster.local; Unknown error
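
A hedged follow-up check (not part of the original report): time service-name resolution from the same pod against kubernetes.default, which exists in every cluster, to see whether the failure is specific to the openshift-dns service record or affects all cluster DNS lookups:

# Run inside the apiserver pod, mirroring the getent check above.
time getent hosts kubernetes.default.svc.cluster.local
time getent hosts dns-default-health.openshift-dns.svc.cluster.local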

c) I am seeing logs in etcd (oc logs -c etcd ETCD_POD -n openshift-etcd):

range request "key:\"/kubernetes.io/$SOME_RESOURCE\" " with result "range_response_count:1 size:828" took too long (166.233937ms) to execute

Not sure what to make of that. 
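
A hedged sketch (not part of the original report) to gauge how widespread these slow requests are, assuming the etcd pods in openshift-etcd carry the app=etcd label:

# Count "took too long" entries per etcd member.
for pod in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
  echo "$pod: $(oc logs -n openshift-etcd "$pod" -c etcd | grep -c 'took too long')"
done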

All in all, there are three threads to follow up on:

1) I will take care of investigating why the apiserver pod cannot communicate with those two DNS server pods.
2) Once 1) is done, and if the problem persists after that fix, I will assign this to the Routing team to check the slow DNS queries.
3) Once 2) is done (or in parallel), someone from the etcd team might help figure out what those etcd logs in c) mean and whether they could have an impact on the api-server errors above.

/Alex

Comment 6 zhaozhanqi 2020-09-02 06:35:19 UTC
Since this issue may affect the customer Verizon's upgrade, I have set the target to 4.6 and cloned one to 4.5.z. Please correct me if I'm wrong, thanks.

Comment 7 sunzhaohua 2020-09-02 07:28:52 UTC
We first discovered this on two vSphere clusters with OVN, upgrading from 4.5.7 -> 4.5.0-0.nightly-2020-08-29-080432, then reproduced it on OSP, AWS, and GCP.

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h18m
cloud-credential                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h37m
cluster-autoscaler                         4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h28m
config-operator                            4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h28m
console                                    4.5.0-0.nightly-2020-08-29-080432   True        False         True       120m
csi-snapshot-controller                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      155m
dns                                        4.5.7                               True        False         False      3h32m
etcd                                       4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h33m
image-registry                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      121m
ingress                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h23m
insights                                   4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h29m
kube-apiserver                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
kube-controller-manager                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
kube-scheduler                             4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h30m
kube-storage-version-migrator              4.5.0-0.nightly-2020-08-29-080432   True        False         False      154m
machine-api                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h26m
machine-approver                           4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h32m
machine-config                             4.5.7                               True        False         False      3h27m
marketplace                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
monitoring                                 4.5.0-0.nightly-2020-08-29-080432   False       False         True       27m
network                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h35m
node-tuning                                4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
openshift-apiserver                        4.5.0-0.nightly-2020-08-29-080432   False       False         False      24m
openshift-controller-manager               4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h21m
openshift-samples                          4.5.0-0.nightly-2020-08-29-080432   True        False         False      122m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h34m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h34m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-08-29-080432   False       True          False      4m50s
service-ca                                 4.5.0-0.nightly-2020-08-29-080432   True        False         False      3h35m
storage                                    4.5.0-0.nightly-2020-08-29-080432   True        False         False      123m

Comment 8 Alexander Constantinescu 2020-09-02 10:14:27 UTC
I've continued looking at the cluster from yesterday (i.e. comment 1). With the help of someone from OVN we've been able to trace the upgrade problem, and it seems what we're missing is https://github.com/openshift/cluster-network-operator/pull/755

The ovn-controller logs are filled with:

2020-09-01T08:51:35Z|04744|ovsdb_idl|WARN|transaction error: {"details":"No column other_config in table Chassis.","error":"unknown column","syntax":"{\"encaps\":[\"named-uuid\",\"rowf3a08f1f_0d25_4a80_843d_c5c70f0e27b6\"],\"external_ids\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]],\"hostname\":\"ugdci01091820-xctdf-master-2\",\"name\":\"37c5871d-b843-45d6-b39c-9a107a04344d\",\"other_config\":[\"map\",[[\"datapath-type\",\"\"],[\"iface-types\",\"erspan,geneve,gre,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan\"],[\"is-interconn\",\"false\"],[\"ovn-bridge-mappings\",\"physnet:br-local\"],[\"ovn-chassis-mac-mappings\",\"\"],[\"ovn-cms-options\",\"\"]]]}"}

suggesting, as the PR mentions, that "Upgraded ovn-controllers will not see any new columns because of this resulting in datapath disruptions." - which completely fits the picture we see after these upgrades.
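
A hedged sketch (not part of the original report) to check every node for this symptom, assuming the default openshift-ovn-kubernetes namespace and the ovnkube-node daemonset with its ovn-controller container:

# Count "unknown column" transaction errors per ovn-controller.
for pod in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o name); do
  echo "$pod: $(oc logs -n openshift-ovn-kubernetes "$pod" -c ovn-controller | grep -c 'unknown column')"
done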

I will discuss this internally within our team to see how to handle that. 

/Alex

Comment 12 Anurag saxena 2020-09-14 14:04:36 UTC
@Alex, I believe the verification is currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1878030. Will keep an eye on this, thanks.

Comment 14 zhaozhanqi 2020-09-15 02:41:14 UTC
Anurag, 4.6.0-0.nightly-2020-09-10-121352 should not include the fixed PR; the PR was merged in 4.6.0-0.nightly-2020-09-12-164537.

I tried upgrading from 4.5.9 to 4.6.0-0.nightly-2020-09-13-023938 and it works well: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5079/console

Moving this bug to 'verified'.

Comment 15 Anurag saxena 2020-09-15 15:14:14 UTC
That's great, thanks for verifying that, Zhanqi. Since the latest builds were suspected to be blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1878030, that explains the confusion.

Comment 16 Alexander Constantinescu 2020-09-17 13:15:01 UTC
*** Bug 1871068 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2020-10-27 16:36:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 20 W. Trevor King 2021-04-05 17:46:48 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475