Bug 1943413 - [OVN] Pod stuck in containerCreating for ovn cluster when upgrade from 4.5->4.6->4.7
Summary: [OVN] Pod stuck in containerCreating for ovn cluster when upgrade from 4.5->4.6->4.7
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Mohamed Mahmoud
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1941840 1951502
Blocks:
 
Reported: 2021-03-26 02:28 UTC by zhaozhanqi
Modified: 2023-09-15 01:04 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-13 17:44:46 UTC
Target Upstream Version:
Embargoed:


Attachments
relevant pod and ovn-controller output (80.00 KB, application/x-tar)
2021-04-15 20:52 UTC, Tim Rozet
no flags Details
locked up node's sbdb logs (4.48 MB, text/plain)
2021-04-16 15:35 UTC, Tim Rozet
no flags Details
journald and ovn-controller coredump (6.46 MB, application/gzip)
2021-04-16 18:46 UTC, Tim Rozet
no flags Details
coredump for northd (6.82 MB, application/gzip)
2021-04-23 02:04 UTC, Tim Rozet
no flags Details
ovndb from the NotReady node, cluster 230024 (6.76 MB, application/octet-stream)
2021-04-24 02:28 UTC, Ross Brattain
no flags Details
ovn, etcd logs from NotReady node 230024 (523.03 KB, application/octet-stream)
2021-04-24 02:29 UTC, Ross Brattain
no flags Details
first top command on NotReady node, cluster 230024 (7.08 KB, text/plain)
2021-04-24 02:33 UTC, Ross Brattain
no flags Details
authentication-operator-logs (1.14 MB, application/gzip)
2021-04-24 03:15 UTC, Ross Brattain
no flags Details

Description zhaozhanqi 2021-03-26 02:28:35 UTC
Description of problem:
When upgrading the cluster from 4.5.36 -> 4.6.23 -> 4.7.4:

4.5.36 -> 4.6.23 succeeded.

4.6.23 -> 4.7.4 failed because the monitoring operator was degraded.

10:52:39  
10:52:39  Post action: # oc get co
10:52:39  NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
10:52:39  authentication                             4.7.4     True        False         False      88m
10:52:39  baremetal                                  4.7.4     True        False         False      152m
10:52:39  cloud-credential                           4.7.4     True        False         False      6h2m
10:52:39  cluster-autoscaler                         4.7.4     True        False         False      5h49m
10:52:39  config-operator                            4.7.4     True        False         False      5h49m
10:52:39  console                                    4.7.4     True        False         False      94m
10:52:39  csi-snapshot-controller                    4.7.4     True        False         False      94m
10:52:39  dns                                        4.7.4     True        False         False      100m
10:52:39  etcd                                       4.7.4     True        False         False      5h53m
10:52:39  image-registry                             4.7.4     True        False         False      96m
10:52:39  ingress                                    4.7.4     True        False         False      3h49m
10:52:39  insights                                   4.7.4     True        False         False      5h50m
10:52:39  kube-apiserver                             4.7.4     True        False         False      5h52m
10:52:39  kube-controller-manager                    4.7.4     True        False         False      5h53m
10:52:39  kube-scheduler                             4.7.4     True        False         False      5h52m
10:52:39  kube-storage-version-migrator              4.7.4     True        False         False      70m
10:52:39  machine-api                                4.7.4     True        False         False      5h47m
10:52:39  machine-approver                           4.7.4     True        False         False      5h51m
10:52:39  machine-config                             4.7.4     True        False         False      86m
10:52:39  marketplace                                4.7.4     True        False         False      93m
10:52:39  monitoring                                 4.7.4     False       True          True       92m
10:52:39  network                                    4.7.4     True        False         False      5h55m
10:52:39  node-tuning                                4.7.4     True        False         False      149m
10:52:39  openshift-apiserver                        4.7.4     True        False         False      92m
10:52:39  openshift-controller-manager               4.7.4     True        False         False      5h50m
10:52:39  openshift-samples                          4.7.4     True        False         False      149m
10:52:39  operator-lifecycle-manager                 4.7.4     True        False         False      5h54m
10:52:39  operator-lifecycle-manager-catalog         4.7.4     True        False         False      5h54m
10:52:39  operator-lifecycle-manager-packageserver   4.7.4     True        False         False      94m
10:52:39  service-ca                                 4.7.4     True        False         False      5h55m
10:52:39  storage                                    4.7.4     True        False         False      95m
10:52:39  


Checked the namespaces/openshift-monitoring/core/events.yaml file in the must-gather:

- apiVersion: v1
  count: 211
  eventTime: null
  firstTimestamp: "2021-03-25T01:31:53Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: alertmanager-main-0
    namespace: openshift-monitoring
    resourceVersion: "202247"
    uid: 779cf751-d74e-47ba-8bac-ed843b1d9c62
  kind: Event
  lastTimestamp: "2021-03-25T02:52:32Z"
  message: |-
    (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-0_openshift-monitoring_779cf751-d74e-47ba-8bac-ed843b1d9c62_0(6ecac37778c989860a95eb9144289f6ff0a54cf39dd0e252cb83ebaf4c70f3bb): [openshift-monitoring/alertmanager-main-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-0 6ecac37778c989860a95eb9144289f6ff0a54cf39dd0e252cb83ebaf4c70f3bb] [openshift-monitoring/alertmanager-main-0 6ecac37778c989860a95eb9144289f6ff0a54cf39dd0e252cb83ebaf4c70f3bb] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
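For reference, a hedged way to pull the same sandbox-creation events from a live cluster (the reason filter assumes the usual kubelet FailedCreatePodSandBox event reason):

  oc get events -n openshift-monitoring \
      --field-selector involvedObject.name=alertmanager-main-0,reason=FailedCreatePodSandBox \
      --sort-by=.lastTimestamp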

Version-Release number of selected component (if applicable):
IPI AWS

How reproducible:


Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:

must-gather logs: 

http://10.73.131.57:9000/openshift-must-gather/2021-03-25-02-57-15/must-gather.local.70866397952984103.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openshift%2F20210325%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210325T025746Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=bec73077a1ba3c4df4490c576fcc9afe87b7755ad5118b213618138d730bf82b

Comment 1 zhaozhanqi 2021-03-26 08:13:52 UTC
FYI: this issue does not always reproduce; trying the upgrade again, it works.

Comment 2 Mohamed Mahmoud 2021-03-31 12:58:10 UTC
@zzhao do you have a cluster in the failing state that I can access?

Comment 4 zhaozhanqi 2021-04-01 02:04:32 UTC
(In reply to Mohamed Mahmoud from comment #2)
> @zzhao do you have a cluster in the failing state that I can access?

Sorry, the cluster had already been released when I found this issue. We only have the must-gather logs for now, and the issue has not happened again, so I do not have a running cluster on hand.

Comment 26 Tim Rozet 2021-04-15 20:52:44 UTC
Created attachment 1772252 [details]
relevant pod and ovn-controller output

Comment 27 Tim Rozet 2021-04-15 20:57:03 UTC
So there are multiple problems here. One is that one of the master nodes is not functioning correctly. It looks like kubelet is no longer responding and we see coredumps in journald. Besides that issue though, we see that ovn-controller is not behaving correctly. We can see that the pod in question, openshift-monitoring_alertmanager-main-1, is scheduled on pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in SBDB that the pod has its IP set to:

[root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port openshift-monitoring_alertmanager-main-1
_uuid               : da946c76-61d4-459a-a070-1949a56a4c63
addresses           : ["0a:58:0a:80:02:22 10.128.2.34"]


However, in ovn-controller:
2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport openshift-monitoring_alertmanager-main-1 for this chassis.
2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26

CNI request around the same time:
I0409 07:34:37.563128    3132 cni.go:195] [openshift-monitoring/alertmanager-main-1 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD finished CNI request [openshift-monitoring/alertmanager-main-1 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af:8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\":\"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\",\"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil>
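For reference, a rough sketch of how one could cross-check where Kubernetes scheduled the pod against the chassis OVN bound it to (assumes cluster access and a shell where ovn-sbctl can reach the SB; the <chassis-uuid> placeholder comes from the find output):

  oc get pod -n openshift-monitoring alertmanager-main-1 -o wide
  ovn-sbctl --no-leader-only --columns=logical_port,chassis,mac \
      find port_binding logical_port=openshift-monitoring_alertmanager-main-1
  ovn-sbctl --no-leader-only get chassis <chassis-uuid> hostname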

Also, we think that the node was updated before the masters, however we still see:
2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No column up in table Port_Binding.","error":"unknown column","syntax":"{\"up\":true}"}

and a lot of:
2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered database server is disconnected from cluster; trying another server
2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute next time.

I've attached the ovn-controller log and relevant output. We also have a live cluster with the problem. Dumitru can you PTAL?

Comment 28 Dumitru Ceara 2021-04-15 21:39:38 UTC
(In reply to Tim Rozet from comment #27)
> So there are multiple problems here. One is that one of the master nodes is
> not functioning correctly. It looks like kubelet is no longer responding and
> we see coredumps in journald. Besides that issue though, we see that
> ovn-controller is not behaving correctly. We can see that the pod in
> question, openshift-monitoring_alertmanager-main-1, is scheduled on
> pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is
> on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in
> SBDB that the pod has its IP set to:
> 
> [root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port
> openshift-monitoring_alertmanager-main-1
> _uuid               : da946c76-61d4-459a-a070-1949a56a4c63
> addresses           : ["0a:58:0a:80:02:22 10.128.2.34"]
> 
> 
> However, in ovn-controller:
> 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport
> openshift-monitoring_alertmanager-main-1 for this chassis.
> 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-
> main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26
> 
> CNI request around the same time:
> I0409 07:34:37.563128    3132 cni.go:195]
> [openshift-monitoring/alertmanager-main-1
> 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD
> finished CNI request [openshift-monitoring/alertmanager-main-1
> 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result
> "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af:
> 8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\":
> \"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\":
> [{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\",
> \"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil>
> 
> Also, we think that the node was updated before the masters, however we
> still see:
> 2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No
> column up in table Port_Binding.","error":"unknown
> column","syntax":"{\"up\":true}"}

Are we 100% sure that the node was updated before the masters?  I'm asking
because, unless there's a bug, ovn-controller should not try to write to
Port_Binding.up if Port_Binding.up is not already set to a value in the SB
(therefore the SB was already upgraded).

> 
> and a lot of:
> 2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered
> database server is disconnected from cluster; trying another server
> 2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute
> next time.

What is the state of the SB cluster?  The logs seem to indicate that SB
servers 10.0.0.5 and 10.0.0.6 are disconnected from each other and that
10.0.0.7 is not reachable.

ovn-appctl -t ovnsb_db cluster/status OVN_Southbound

It might also be useful to check the openshift-monitoring_alertmanager-main-1
port binding on both 10.0.0.5 and 10.0.0.6 with:

ovn-sbctl --no-leader-only list port_binding openshift-monitoring_alertmanager-main-1

> 
> I've attached the ovn-controller log and relevant output. We also have a
> live cluster with the problem. Dumitru can you PTAL?

Comment 29 Dumitru Ceara 2021-04-16 08:54:15 UTC
(In reply to Dumitru Ceara from comment #28)
> (In reply to Tim Rozet from comment #27)
> > So there are multiple problems here. One is that one of the master nodes is
> > not functioning correctly. It looks like kubelet is no longer responding and
> > we see coredumps in journald. Besides that issue though, we see that
> > ovn-controller is not behaving correctly. We can see that the pod in
> > question, openshift-monitoring_alertmanager-main-1, is scheduled on
> > pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is
> > on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in
> > SBDB that the pod has its IP set to:
> > 
> > [root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port
> > openshift-monitoring_alertmanager-main-1
> > _uuid               : da946c76-61d4-459a-a070-1949a56a4c63
> > addresses           : ["0a:58:0a:80:02:22 10.128.2.34"]
> > 
> > 
> > However, in ovn-controller:
> > 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport
> > openshift-monitoring_alertmanager-main-1 for this chassis.
> > 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-
> > main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26
> > 
> > CNI request around the same time:
> > I0409 07:34:37.563128    3132 cni.go:195]
> > [openshift-monitoring/alertmanager-main-1
> > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD
> > finished CNI request [openshift-monitoring/alertmanager-main-1
> > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result
> > "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af:
> > 8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\":
> > \"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\":
> > [{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\",
> > \"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil>

From ovn-controller logs:
2021-04-09T07:29:39Z|00009|reconnect|INFO|ssl:10.0.0.7:9642: connected <--- connected to .7 SB
...
2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport openshift-monitoring_alertmanager-main-1 for this chassis.
2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26 <--- It can be that .7 had stale Port_binding information
...
2021-04-09T07:38:12Z|00158|reconnect|ERR|ssl:10.0.0.7:9642: no response to inactivity probe after 30 seconds, disconnecting <--- Connection dropped to .7, was it upgraded here?
2021-04-09T07:38:12Z|00159|reconnect|INFO|ssl:10.0.0.7:9642: connection dropped
2021-04-09T07:38:12Z|00160|reconnect|INFO|ssl:10.0.0.5:9642: connecting...
2021-04-09T07:38:12Z|00161|reconnect|INFO|ssl:10.0.0.5:9642: connected
...
2021-04-15T20:56:53Z|00248|reconnect|INFO|ssl:10.0.0.6:9642: connected <--- last successful connect (quite a few days later)

We don't see any more "Claiming lport openshift-monitoring_alertmanager-main-1"
but that's because it was already claimed before by the same chassis.  When
the port_binding is updated with the new MAC addresses, ovn-controller will
just update the openflows.

Checking the current flows, they look fine:

ovs-ofctl dump-flows br-int table=8 | grep metadata=0x17 
 cookie=0x7c8567e9, duration=609325.521s, table=8, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=100,metadata=0x17,vlan_tci=0x1000/0x1000 actions=drop
 cookie=0xb03bf160, duration=609325.539s, table=8, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=100,metadata=0x17,dl_src=01:00:00:00:00:00/01:00:00:00:00:00 actions=drop
 cookie=0xf0e785eb, duration=609325.513s, table=8, n_packets=1922566, n_bytes=930519010, idle_age=0, hard_age=65534, priority=50,reg14=0x1,metadata=0x17 actions=resubmit(,9)
 cookie=0xef54856a, duration=609029.531s, table=8, n_packets=359589, n_bytes=15107770, idle_age=0, hard_age=65534, priority=50,reg14=0x2d,metadata=0x17,dl_src=0a:58:0a:80:02:22 actions=resubmit(,9)  <--- This matches the most recent SB values.
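(For reference, a hedged way to map those OpenFlow values back to SB objects: metadata carries the logical datapath tunnel key and reg14 the ingress logical port tunnel key, so something along these lines; note port tunnel keys are only unique per datapath, so cross-check the datapath column.)

  ovn-sbctl --no-leader-only find datapath_binding tunnel_key=23   <--- metadata=0x17
  ovn-sbctl --no-leader-only find port_binding tunnel_key=45       <--- reg14=0x2d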

> > 
> > Also, we think that the node was updated before the masters, however we
> > still see:
> > 2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No
> > column up in table Port_Binding.","error":"unknown
> > column","syntax":"{\"up\":true}"}
> 
> Are we 100% sure that the node was updated before the masters?  I'm asking
> because, unless there's a bug, ovn-controller should not try to write to
> Port_Binding.up if Port_Binding.up is not already set to a value in the SB
> (therefore the SB was already upgraded).
> 
> > 
> > and a lot of:
> > 2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered
> > database server is disconnected from cluster; trying another server
> > 2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute
> > next time.
> 
> What is the state of the SB cluster?  The logs seem to indicate that SB
> servers 10.0.0.5 and 10.0.0.6 are disconnected from each other and that
> 10.0.0.7 is not reachable.
> 
> ovn-appctl -t ovnsb_db cluster/status OVN_Southbound
> 
> It might also be useful to check the openshift-monitoring_alertmanager-main-1
> port binding on both 10.0.0.5 and 10.0.0.6 with:
> 
> ovn-sbctl --no-leader-only list port_binding
> openshift-monitoring_alertmanager-main-1
> 
> > 
> > I've attached the ovn-controller log and relevant output. We also have a
> > live cluster with the problem. Dumitru can you PTAL?

I tried to investigate further what's happening with the SB on
10.0.0.7, pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal, but I can't
seem to connect to any of the pods running there.  Starting a
debug pod on the node fails too.

Another weird thing is that 10.0.0.6 is connected to both .5 and .7:
[root@pdhamdhe091202-7x7q4-m-1 ~]# ss -tn | grep :9644
ESTAB    0      0                10.0.0.6:47944                  10.0.0.7:9644  
ESTAB    0      0                10.0.0.6:9644                   10.0.0.5:51602 
ESTAB    0      0                10.0.0.6:9644                   10.0.0.7:40618 
ESTAB    0      0                10.0.0.6:53388                  10.0.0.5:9644

But .5 is only connected to .6:
[root@pdhamdhe091202-7x7q4-m-0 ~]# ss -tn | grep :9644
ESTAB    0      0                10.0.0.5:51602                  10.0.0.6:9644  
ESTAB    0      0                10.0.0.5:9644                   10.0.0.6:45172 
ESTAB    0      0                10.0.0.5:9644                   10.0.0.6:53388 
SYN-SENT 0      1                10.0.0.5:46114                  10.0.0.7:9644

@Tim any idea on how to get the SB logs off 10.0.0.7?

Comment 30 Tim Rozet 2021-04-16 15:35:25 UTC
Created attachment 1772475 [details]
locked up node's sbdb logs

Hi Dumitru, see attachment. Ross was able to reboot the node for us and we have the logs now.

Comment 31 Mohamed Mahmoud 2021-04-16 16:08:00 UTC
ovn-controller crash (note: the journal excerpt below is in reverse timestamp order)

Apr 13 21:46:45 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd-coredump[3043824]: Process 5658 (ovn-controller) of user 0 dumped core.
                                                                                            
                                                                                            Stack trace of thread 5658:
                                                                                            #0  0x00007fed2fcf284f raise (libc.so.6)
                                                                                            #1  0x00007fed2fcdcc45 abort (libc.so.6)
                                                                                            #2  0x0000556945a939a4 n/a (/usr/bin/ovn-controller)
                                                                                            #3  0x00007ffd05af0bf0 n/a (n/a)
Apr 13 21:46:18 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: 10.0.0.3 is manually marked as down, skipping for internal clients...
Apr 13 21:46:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-conmon-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Consumed 832ms CPU time
Apr 13 21:46:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-conmon-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Succeeded.
Apr 13 21:46:01 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Consumed 2min 9.986s CPU time
Apr 13 21:46:00 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Succeeded.
Apr 13 21:41:52 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: Started Process Core Dump (PID 3043789/UID 0).
Apr 13 21:40:33 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: done applying vip rules
Apr 13 21:40:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03669|rconn|ERR|br-int<->unix#66810: no response to inactivity probe after 60 seconds, disconnecting
Apr 13 21:39:57 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 34.123.66.142 for internal clients
Apr 13 21:39:26 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 34.123.66.142 for external clients
Apr 13 21:38:50 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 10.0.0.3 for external clients
Apr 13 21:38:35 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03668|rconn|ERR|br-int<->unix#66844: no response to inactivity probe after 60 seconds, disconnecting
Apr 13 21:37:42 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: Processing route for NIC 0/42:01:0a:00:00:07 for 34.123.66.142
Apr 13 21:35:15 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: 10.0.0.3 is manually marked as down, skipping for internal clients...
Apr 13 21:35:06 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03666|rconn|ERR|br-int<->unix#66843: no response to inactivity probe after 60 seconds, disconnecting
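A hedged sketch of how to pull that core off the node with systemd's coredumpctl, assuming the dump is still retained under /var/lib/systemd/coredump (on RHCOS this would be run from a debug shell on the node, e.g. via `oc debug node/... -- chroot /host`):

  coredumpctl list ovn-controller
  coredumpctl dump 5658 -o /tmp/ovn-controller.core   <--- PID from the journal entry above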

Comment 32 Tim Rozet 2021-04-16 18:46:24 UTC
Created attachment 1772554 [details]
journald and ovn-controller coredump

Comment 33 Tim Rozet 2021-04-16 18:46:54 UTC
Dumitru, can you PTAL at the coredump?

Comment 35 Gianluca 2021-04-19 16:13:49 UTC
Upgrading a cluster from 4.7.0 -> 4.7.3 -> 4.7.6 seems to show the same behavior.

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_6fe22f18-6d30-449b-b6fc-949992ed0688_0(8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825): [openshift-monitoring/alertmanager-main-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1 8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825] [openshift-monitoring/alertmanager-main-1 8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '

must-gather log attached to Case 02920947

Comment 36 Dumitru Ceara 2021-04-20 09:41:11 UTC
(In reply to Tim Rozet from comment #33)
> Dumitru, can you PTAL at the coredump?

I had a look at the coredump. Adding the steps I followed here for reference;
there's probably a better way to do this, but this worked for me:

1. Bring up an OCP CI cluster with clusterbot running the version the
   bug was reported on (4.7 nightly).

2. Spin up a debug pod on one of the nodes, e.g.:
   # oc debug node/ci-ln-kn1stvb-f76d1-llzzm-master-0

3. Copy the ovn-controller core dump to the debug pod, along with the
   OVN rpms (including debuginfo) corresponding to the OVN version that
   was running (ovn2.13-20.12.0-24):
   # for r in $(ls *.rpm); do oc cp $r ci-ln-kn1stvb-f76d1-llzzm-master-0-debug:/tmp; done
   # oc cp core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000.lz4 ci-ln-kn1stvb-f76d1-llzzm-master-0-debug:/tmp/

4. On the debug pod install gdb and the OVN rpms:
   # yum install -y gdb
   # rpm -Uhv --force --nodeps /tmp/ovn2.13*

5. Load the coredump in gdb:
   # gdb /usr/bin/ovn-controller /tmp/core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000
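(Note: the core is copied as an .lz4 file in step 3 but loaded uncompressed in step 5, so a decompress step is implied in between; assuming the lz4 tool is installed in the debug pod, something like:)

   # lz4 -d /tmp/core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000.lz4 /tmp/core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000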


(gdb) bt
#0  0x00007fed2fcf284f in raise () from /lib64/libc.so.6
#1  0x00007fed2fcdcc45 in abort () from /lib64/libc.so.6
#2  0x0000556945a939a4 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x556945b73eb0 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd05af0b10) at lib/util.c:419
#3  0x0000556945a9b794 in vlog_abort_valist (module_=<optimized out>, message=0x556945b73eb0 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd05af0b10) at lib/vlog.c:1249
#4  0x0000556945a9b83a in vlog_abort (module=module@entry=0x556945e30e80 <this_module>, message=message@entry=0x556945b73eb0 "%s: assertion %s failed in %s()") at lib/vlog.c:1263
#5  0x0000556945a936bb in ovs_assert_failure (where=where@entry=0x556945b4e9a6 "controller/ofctrl.c:1917", function=function@entry=0x556945b4eff0 <__func__.34362> "merge_tracked_flows", 
    condition=condition@entry=0x556945b4e990 "del_f->installed_flow") at lib/util.c:86
#6  0x00005569459bfbaa in merge_tracked_flows (flow_table=0x5569470c5b40) at /usr/src/debug/ovn2.13-20.12.0-24.el8fdp.x86_64/openvswitch-2.14.90/include/openvswitch/hmap.h:283
#7  update_installed_flows_by_track (msgs=0x7ffd05af0c40, flow_table=0x5569470c5b40) at controller/ofctrl.c:1946
#8  ofctrl_put (flow_table=flow_table@entry=0x5569470c5b40, pending_ct_zones=pending_ct_zones@entry=0x5569470ee560, meter_table=<optimized out>, req_cfg=req_cfg@entry=0, flow_changed=flow_changed@entry=true) at controller/ofctrl.c:2130
#9  0x00005569459abdae in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:2931

(gdb) frame 6
#6  0x00005569459bfbaa in merge_tracked_flows (flow_table=0x5569470c5b40) at /usr/src/debug/ovn2.13-20.12.0-24.el8fdp.x86_64/openvswitch-2.14.90/include/openvswitch/hmap.h:283
283             hmap_expand_at(hmap, where);

(gdb) p del_f->installed_flow
$2 = (struct installed_flow *) 0x0
(gdb) list ofctrl.c:1917
1912                    continue;
1913                }
1914
1915                /* del_f must have been installed, otherwise it should have been
1916                 * removed during track_flow_add_or_modify. */
1917                ovs_assert(del_f->installed_flow);
1918
1919                if (!f->installed_flow) {
1920                    /* f is not installed yet. */
1921                    replace_installed_to_desired(del_f->installed_flow, del_f, f);

There were a few fixes to the OF incremental processing in ovn-controller
recently:

[1] https://github.com/ovn-org/ovn/commit/c6c61b4e3462fb5201a61a226c2acaf6f4caf917
[2] https://github.com/ovn-org/ovn/commit/858d1dd716db1a1e664a7c1737fd34f04fcbda5e
[3] https://github.com/ovn-org/ovn/commit/6975c649f932633042ca54df2d8f8f0eb866c344

IIRC, without [2], flood_remove_flows_for_sb_uuid() was potentially
expanding an hmap while iterating through it, which may cause entries to
be visited twice or skipped.  That might explain what we're seeing above.

All the above fixes are available downstream since version 20.12.0-63.
It might be a good idea anyway to bump the OVN version in ovn-kubernetes to
something >= 20.12.0-63 and retest.
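For reference, a hedged way to confirm which OVN build the cluster is actually running (container and package names are taken from the versions discussed in this bug and may differ between releases):

  oc -n openshift-ovn-kubernetes exec ds/ovnkube-master -c northd -- rpm -q ovn2.13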

@Tim, could you please help out with this?

I'll open an OVN BZ too in order to investigate and make sure that the
crash above is not a new one though.

I didn't analyze the SB logs from 10.0.0.7 yet, but the crash seems
unrelated to that.  I'll have a look at the SB logs soon.

Regards,
Dumitru

Comment 37 Dumitru Ceara 2021-04-20 09:50:19 UTC
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1951502 for further
investigation of the ovn-controller core dump.

Comment 38 Dumitru Ceara 2021-04-20 12:44:12 UTC
Bug 1951502 is a new crash; the changelog has been rewritten for ovn2.13,
but the -24 version that was running on the cluster includes all current
upstream ofctrl-related fixes.

Comment 42 Tim Rozet 2021-04-22 15:58:53 UTC
In addition to ovn-controller crashing here, we see the node lock up and journald crash. We cannot ssh to the node, only ping it. It looks like the system is running OOM. After speaking with Ryan, the fix for the known kernel OOM bug did not make it into 4.7.4; it is only in 4.7.6 and later:
https://bugzilla.redhat.com/show_bug.cgi?id=1857446

It is most likely we are also hitting that issue here. Ross, can you help us try to reproduce this issue on the latest 4.7 and see if the node locks up? OVN team would also like to poke around there if we can reproduce the ovn-controller crash and get the NBDB.

Also, the original bug says 4.5->4.6->4.7, but it seems like some folks have reproduced this just going from something like 4.7.2->4.7.3. Are there other paths to hit this bug? Is it simply any 4.6 or 4.7 upgrade to 4.7.3 or 4.7.4?

Comment 44 Tim Rozet 2021-04-23 01:58:05 UTC
Ross helped us attempt two upgrades. In both of these upgrades the ovn-controller crash was not seen, but there were other OVN crashes in northd and ovsdb. Also, we did not see any kernel OOM, and the cluster appears to upgrade successfully (the northd and ovsdb containers are restarted). This gives us some confidence that the node lockup we saw before on an earlier 4.7 is most likely the kernel bug that was fixed in later 4.7 releases.

1. 4.6.26 --> 4.7.0-0.nightly-2021-04-22-105612. OVN versions: ovn2.13-20.06.2-11.el8fdp --> 20.12.0-24.el8fdp
OVN northd and ovsdb crashed in this case on two master nodes, with coredumps:

http://file.rdu.redhat.com/~rbrattai/logs/must-gather.local.557225639646412986.tar
http://file.rdu.redhat.com/~rbrattai/logs/must-gather.local.2229250353150456641.tar.zst

2. 4.6.23 --> 4.7.0-0.nightly-2021-04-22-105612. OVN versions: ovn2.13-20.09.0-7.el8fdn --> 20.12.0-24.el8fdp
OVN northd crashed on two master nodes, with coredumps. I have those cores on my laptop and can share them, or Ross can provide a link.

We can try a few more times to reproduce the ovn-controller crash on latest 4.7 and also ensure the node lockup doesn't happen, but we also need OVN team to look into these new coredumps.

Comment 45 Tim Rozet 2021-04-23 02:04:04 UTC
Created attachment 1774638 [details]
coredump for northd

Actually, the 4.6.23->4.7 cores are small enough that I can attach them here.

Comment 46 zhaozhanqi 2021-04-23 09:31:46 UTC
Adding two more upgrade paths; they seem to be good. The links below are the must-gather core dumps from `oc adm must-gather -- gather_core_dumps`:

1. 4.6.23-->4.7.3 

http://file.apac.redhat.com/~zzhao/must-gather.local.4251437677671016869.tar.gz

2. 4.6.23--> 4.7.0-0.nightly-2021-04-22-105612

http://file.apac.redhat.com/~zzhao/must-gather.local.507640507921899081.tar.gz

Comment 47 Ross Brattain 2021-04-23 18:43:21 UTC
On 4.7.0-0.nightly-2021-04-22-105612 it seems that `oc delete daemonsets ovnkube-master --wait=false` triggers ovn-northd coredumps reliably, but the ovnkube-master pods recover.

Fri Apr 23 17:15:25.0250636310 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.5930.1619198124000000.lz4
Fri Apr 23 17:18:49.2774560510 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1667572.1619198328000000.lz4
Fri Apr 23 17:32:57.3799121480 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1672345.1619199176000000.lz4
Fri Apr 23 17:38:50.6163434580 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1690961.1619199529000000.lz4
Fri Apr 23 17:48:24.9831908680 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1699300.1619200104000000.lz4
Fri Apr 23 18:23:35.1131995010 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1712140.1619202213000000.lz4
Fri Apr 23 18:26:55.6223641060 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1760256.1619202414000000.lz4
Fri Apr 23 18:35:45.4445084530 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1765504.1619202944000000.lz4
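(For anyone reproducing: a rough sketch of one way to list these on a node via a debug pod; the path assumes the default systemd-coredump location.)

  oc debug node/<master-node> -- chroot /host ls -ltr /var/lib/systemd/coredump/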

Comment 48 Ross Brattain 2021-04-24 02:25:17 UTC

After 28 hrs the 4.6.26 --> 4.7.0-0.nightly-2021-04-22-105612 cluster had master-0 enter `NotReady` state.

I was able to ssh into the node.  The system was mostly unresponsive, load avg 177.71.
After stopping `kubelet` I observed various processes with high CPU utilization.  `etcd` consistently used 12% cpu.

Also observed a core.systemd-journal coredump:

[77045.656418] [Fri Apr 23 17:15:18 2021] traps: ovn-northd[4950] general protection fault ip:7f68e4a83d21 sp:7fff6692fd50 error:0 in libc-2.28.so[7f68e4a62000+1b9000]
[77249.821609] [Fri Apr 23 17:18:42 2021] traps: ovn-northd[1713666] general protection fault ip:7f6de542fd21 sp:7fffbe200e30 error:0 in libc-2.28.so[7f6de540e000+1b9000]
[77805.333438] [Fri Apr 23 17:27:58 2021] traps: ovn-northd[1719579] general protection fault ip:7fa260ea0d21 sp:7fff23dee380 error:0 in libc-2.28.so[7fa260e7f000+1b9000]
[78600.214465] [Fri Apr 23 17:41:13 2021] traps: ovn-northd[1733228] general protection fault ip:7f6760f61d21 sp:7fffce55f120 error:0 in libc-2.28.so[7f6760f40000+1b9000]
[79025.462822] [Fri Apr 23 17:48:18 2021] traps: ovn-northd[1753077] general protection fault ip:7f2cefde5d21 sp:7ffc20f249e0 error:0 in libc-2.28.so[7f2cefdc4000+1b9000]
[81135.198549] [Fri Apr 23 18:23:28 2021] traps: ovn-northd[1764871] general protection fault ip:7f0255198d21 sp:7ffd2a84b730 error:0 in libc-2.28.so[7f0255177000+1b9000]
[81335.999986] [Fri Apr 23 18:26:48 2021] traps: ovn-northd[1813835] general protection fault ip:7fc852689d21 sp:7ffff63b7b60 error:0 in libc-2.28.so[7fc852668000+1b9000]
[81865.552301] [Fri Apr 23 18:35:38 2021] traps: ovn-northd[1820142] general protection fault ip:7efecb808d21 sp:7ffd98acbe50 error:0 in libc-2.28.so[7efecb7e7000+1b9000]
[82372.566797] [Fri Apr 23 18:44:05 2021] traps: ovn-northd[1834086] general protection fault ip:7fa77dd1dd21 sp:7ffced119070 error:0 in libc-2.28.so[7fa77dcfc000+1b9000]
[82574.761351] [Fri Apr 23 18:47:27 2021] traps: ovn-northd[1846875] general protection fault ip:7ffa9d6b9d21 sp:7fff00cf93b0 error:0 in libc-2.28.so[7ffa9d698000+1b9000]
[87315.468193] [Fri Apr 23 20:06:28 2021] systemd[1]: systemd-journald.service: State 'stop-sigabrt' timed out. Terminating.
[87322.226438] [Fri Apr 23 20:06:35 2021] systemd[1]: Starting Cleanup of Temporary Directories...
[87346.312381] [Fri Apr 23 20:06:59 2021] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
[87346.753929] [Fri Apr 23 20:06:59 2021] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[87350.565027] [Fri Apr 23 20:07:03 2021] systemd[1]: systemd-journald.service: Consumed 55.603s CPU time
[87351.639607] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[87351.654227] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2.
[87351.684096] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journal-flush.service: Succeeded.
[87351.703414] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopped Flush Journal to Persistent Storage.
[87351.710239] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journal-flush.service: Consumed 0 CPU time
[87351.741727] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopping Flush Journal to Persistent Storage...
[87351.758087] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopped Journal Service.
[87351.765212] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Consumed 55.603s CPU time
[87353.483285] [Fri Apr 23 20:07:06 2021] systemd[1]: Starting Journal Service...
[87366.172243] [Fri Apr 23 20:07:19 2021] systemd-journald[1952234]: File /var/log/journal/41a9ccb005fa25b06249ece132b36d26/system.journal corrupted or uncleanly shut down, renaming and replacing.
[87369.562936] [Fri Apr 23 20:07:22 2021] systemd[1]: Started Journal Service.
[87400.134351] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Process 892 (systemd-journal) of user 0 dumped core.
[87400.143186] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.8a8ac32fd64c4e5f90eec19afbfb0dcb.892.1619208301000000.lz4
[87400.159442] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Stack trace of thread 892:
[87400.165967] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #0  0x00007f52af6b80e4 journal_file_find_data_object_with_hash (libsystemd-shared-239.so)
[87400.178615] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #1  0x00007f52af6b8359 journal_file_append_data (libsystemd-shared-239.so)
[87400.190137] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #2  0x00007f52af6ba701 journal_file_append_entry (libsystemd-shared-239.so)
[87400.201681] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #3  0x0000555ee6f2cd1c dispatch_message_real (systemd-journald)
[87400.212421] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #4  0x0000555ee6f3239d stdout_stream_log (systemd-journald)
[87400.221925] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #5  0x0000555ee6f32654 stdout_stream_line (systemd-journald)
[87400.232066] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #6  0x0000555ee6f32def stdout_stream_scan (systemd-journald)
[87413.034295] [Fri Apr 23 20:08:06 2021] printk: systemd-coredum: 7 output lines suppressed due to ratelimiting

I reniced etcd to +20 and was then able to generate a `sosreport`; the load average dropped to <1.0.
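(Roughly, the renice step, assuming etcd is the only process matching that name on the node:)

  renice -n 20 -p $(pgrep -x etcd)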

The systemd-journald.service abort occurred at (87315.468193 / 3600) = ~24.25 hours after boot.

The Fri Apr 23 ovn-northd coredumps listed above were triggered after testing. The command `oc delete daemonsets ovnkube-master --wait=false` reliably triggers the coredumps, after which the cluster seemed to recover, until I noticed the `NotReady` state at around Fri Apr 23 21:36.

The ovn*_db.db files were recovered, along with the OVN logs.


OVN and etcd are still running in their original state.

In the current cluster all the alertmanager-main-? pods respond to a metrics curl, so this doesn't seem to be the same issue:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.128.2.35:9095/metrics'


Otherwise, these events match the previous cluster's 4.6.23-x86_64 --> 4.7.0-0.nightly-2021-04-05-201242 `NotReady` lockup.  Those nodes became `NotReady` after a few days.

I am attempting to reproduce on a 4.7.0-0.nightly-2021-04-23-165253 without upgrading, letting it soak without interaction for 24 hours.

This cluster and the previous cluster are UPI on GCP with OVN and etcd encryption enabled.

Comment 49 Ross Brattain 2021-04-24 02:28:12 UTC
Created attachment 1774965 [details]
ovndb from the NotReady node, cluster 230024

Comment 50 Ross Brattain 2021-04-24 02:29:36 UTC
Created attachment 1774966 [details]
ovn, etcd logs from NotReady node 230024

Comment 51 Ross Brattain 2021-04-24 02:33:27 UTC
Created attachment 1774967 [details]
first top command on NotReady node, cluster 230024

Comment 52 Ross Brattain 2021-04-24 02:39:33 UTC
Looks like the authentication-operator may be the culprit: 7.8g RES and 53.5% MEM.


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17742 root      20   0 9734552   7.8g      0 S   0.7  53.5 144:51.38 authentication-
  17742 root      20   0 9734552   7.8g      0 S   0.3  53.5 144:57.52 authentication-

Apr 23 20:01:52 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:01:51.967365    1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe>
Apr 23 20:07:29 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:02:18.253039    1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe>
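A hedged cross-check of the operator's memory usage from the API side (requires the metrics stack to be reachable, which may not be the case while the node is NotReady):

  oc adm top pod -n openshift-authentication-operator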

Comment 53 Ross Brattain 2021-04-24 03:15:16 UTC
Created attachment 1774968 [details]
authentication-operator-logs

Comment 54 Dan Williams 2021-04-24 15:25:20 UTC
Seems like around Apr 23rd, 20:01:50 things start to go south on that node.

2021-04-23 20:01:55.028926 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 148.74685ms, to e57937edfb61fd39)
2021-04-23 20:01:55.028953 W | etcdserver: server is likely overloaded

around that time ovn-controller starts having trouble too:

2021-04-23T18:49:09Z|00312|reconnect|INFO|ssl:10.0.0.7:9642: connected
...
2021-04-23T20:02:09Z|00313|timeval|WARN|Unreasonably long 3932ms poll interval (0ms user, 3ms system)
2021-04-23T20:02:10Z|00314|timeval|WARN|faults: 19 minor, 11 major
2021-04-23T20:02:10Z|00315|timeval|WARN|disk: 2528 reads, 0 writes
2021-04-23T20:02:10Z|00316|timeval|WARN|context switches: 29 voluntary, 4 involuntary
...
2021-04-23T20:02:10Z|00339|coverage|INFO|util_xalloc                0.8/sec   208.033/sec      132.1406/sec   total: 106648956

eventually our connections to OVS have problems (though it's 45m later):

2021-04-23T20:46:49Z|00639|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2021-04-23T20:46:49Z|00640|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected

that continues through to the end of the logs.

I think at this point we'd need a full must-gather covering etcd, the authentication-operator, the systemd journal (for OVS and kubelet), and the OVN components on a couple of nodes to better root-cause this. But quite possibly OOM pressure?
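Roughly what that collection could look like (gather_core_dumps was already used in comment 46; the node-logs unit names are assumptions and may differ on a given release):

  oc adm must-gather
  oc adm must-gather -- gather_core_dumps
  oc adm node-logs <node-name> -u kubelet
  oc adm node-logs <node-name> -u ovs-vswitchd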

Comment 56 Gianluca 2021-04-26 06:55:56 UTC
If someone wants access to a real cluster with the problem, I can provide it.
I've also opened case #02920882, but the engineer handling it seems to be concentrating on other things. I have journald and ovn-controller coredumps from all three master nodes; the last one is quite big, around 50 to 70 MB.
CPU utilization on the cluster - now idle, with no user workload - rose to 20-25%, compared to 5% on the 4.7.3 version.

Comment 58 Lalatendu Mohanty 2021-04-26 12:57:20 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 59 Numan Siddique 2021-04-26 13:07:59 UTC
Regarding the ovn-northd crash seen, below are the investigation details from Dumitru Ceara:

-----
I had a look at the ovn-northd core dumps and it might be an issue in the OVS lib/fatal-signal.c implementation, but it's definitely nothing critical; ovn-northd was terminating anyway with signal 15 (SIGTERM): 


Reading symbols from /usr/bin/ovn-northd...Reading symbols from /usr/lib/debug/usr/bin/ovn-northd-20.12.0-24.el8fdp.x86_64.debug...done.
done.
[New LWP 1]
 
warning: Could not load shared library symbols for /lib64/libunbound.so.2.
Do you need "set solib-search-path" or "set sysroot"?
 
warning: Loadable section ".note.gnu.property" outside of ELF segments
 
warning: Loadable section ".note.gnu.property" outside of ELF segments
 
warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
 
warning: Loadable section ".note.gnu.property" outside of ELF segments
 
warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.0.3:9641,ss'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f525d04ed21 in abort () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-127.el8_3.2.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 python3-libs-3.6.8-31.el8.x86_64 zlib-1.2.11-16.2.el8_3.x86_64
(gdb) bt
#0  0x00007f525d04ed21 in abort () from /lib64/libc.so.6
#1  0x00005638777ebe9c in fatal_signal_run () at lib/fatal-signal.c:313
#2  0x0000563877831110 in poll_block () at lib/poll-loop.c:388
#3  0x0000563877775cd0 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14267
(gdb) frame 0
#0  0x00007f525d04ed21 in abort () from /lib64/libc.so.6
(gdb) frame 1
#1  0x00005638777ebe9c in fatal_signal_run () at lib/fatal-signal.c:313
313             OVS_NOT_REACHED();
(gdb) list
308              * termination status reflects that we were killed by this signal */
309             signal(sig_nr, SIG_DFL);
310             raise(sig_nr);
311
312             ovs_mutex_unlock(&mutex);
313             OVS_NOT_REACHED();
314         }
315     }
316
317     void


-----

From the above investigation, this is definitely not a big concern. I think the ovsdb-server crash could be due to the same reason, since lib/fatal-signal.c is part of
the OVS code base (which OVN also uses).

Comment 60 W. Trevor King 2021-04-26 22:07:30 UTC
Adding ImpactStatementRequested per comment 58 and [1].

[1]: https://github.com/openshift/enhancements/pull/475

Comment 61 Feng Pan 2021-04-28 19:12:34 UTC
Who is impacted?  
Customers upgrading from 4.6.z to 4.7.3+.

What is the impact?  Is it serious enough to warrant blocking edges?
OVN coredumps (ovn-northd) will be observed when the northd process is already exiting during container termination. Functionality of OVN is not affected.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
No remediation is required; the upgrade will be successful.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
No. The coredumps come from 4.6 components during the upgrade process.

Comment 64 Mohamed Mahmoud 2021-05-13 17:44:46 UTC
Fixed by bug 1941840 (https://bugzilla.redhat.com/show_bug.cgi?id=1941840).

Comment 65 Red Hat Bugzilla 2023-09-15 01:04:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

