Bug 1943413
Summary: [OVN] Pod stuck in ContainerCreating on an OVN cluster when upgrading from 4.5 -> 4.6 -> 4.7
Product: OpenShift Container Platform
Component: Networking (sub-component: ovn-kubernetes)
Version: 4.7
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Keywords: Upgrades
Reporter: zhaozhanqi <zzhao>
Assignee: Mohamed Mahmoud <mmahmoud>
QA Contact: zhaozhanqi <zzhao>
CC: anbhat, astoycos, dcbw, dceara, fpan, gianluca.lini, gparente, lmohanty, markmc, mfojtik, nusiddiq, rbrattai, trozet, wking
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-05-13 17:44:46 UTC
Bug Depends On: 1941840, 1951502
Description by zhaozhanqi, 2021-03-26 02:28:35 UTC
FYI: this issue is not always reproduced; trying again, it works.

@zzhao do you have a cluster in the failing state that I can access?

(In reply to Mohamed Mahmoud from comment #2)
> @zzhao do you have a cluster in the failing state that I can access?

Sorry, the cluster had already been released when I found this issue. We only have the must-gather logs for now, and the issue is not currently reproducing, so I do not have a running cluster on hand.

Created attachment 1772252 [details]
relevant pod and ovn-controller output
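The summary describes pods stuck in ContainerCreating with a sandbox creation failure. A minimal sketch for spotting the stuck pods and pulling the CNI error on a live cluster (standard oc commands; the namespace and pod name below are just the ones discussed later in this bug):

# Find pods stuck in ContainerCreating across the cluster
oc get pods -A -o wide | grep -i containercreating

# Pull the sandbox/CNI failure events for one of them
oc describe pod -n openshift-monitoring alertmanager-main-1 | grep -i -A 3 'failed to create pod sandbox'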
So there are multiple problems here. One is that one of the master nodes is not functioning correctly. It looks like kubelet is no longer responding and we see coredumps in journald. Besides that issue though, we see that ovn-controller is not behaving correctly. We can see that the pod in question, openshift-monitoring_alertmanager-main-1, is scheduled on pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in SBDB that the pod has its IP set to: [root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port openshift-monitoring_alertmanager-main-1 _uuid : da946c76-61d4-459a-a070-1949a56a4c63 addresses : ["0a:58:0a:80:02:22 10.128.2.34"] However, in ovn-controller: 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport openshift-monitoring_alertmanager-main-1 for this chassis. 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26 CNI request around the same time: I0409 07:34:37.563128 3132 cni.go:195] [openshift-monitoring/alertmanager-main-1 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD finished CNI request [openshift-monitoring/alertmanager-main-1 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af:8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\":\"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\",\"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil> Also, we think that the node was updated before the masters, however we still see: 2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No column up in table Port_Binding.","error":"unknown column","syntax":"{\"up\":true}"} and a lot of: 2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered database server is disconnected from cluster; trying another server 2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute next time. I've attached the ovn-controller log and relevant output. We also have a live cluster with the problem. Dumitru can you PTAL? (In reply to Tim Rozet from comment #27) > So there are multiple problems here. One is that one of the master nodes is > not functioning correctly. It looks like kubelet is no longer responding and > we see coredumps in journald. Besides that issue though, we see that > ovn-controller is not behaving correctly. We can see that the pod in > question, openshift-monitoring_alertmanager-main-1, is scheduled on > pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is > on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in > SBDB that the pod has its IP set to: > > [root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port > openshift-monitoring_alertmanager-main-1 > _uuid : da946c76-61d4-459a-a070-1949a56a4c63 > addresses : ["0a:58:0a:80:02:22 10.128.2.34"] > > > However, in ovn-controller: > 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport > openshift-monitoring_alertmanager-main-1 for this chassis. 
> 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager- > main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26 > > CNI request around the same time: > I0409 07:34:37.563128 3132 cni.go:195] > [openshift-monitoring/alertmanager-main-1 > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD > finished CNI request [openshift-monitoring/alertmanager-main-1 > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result > "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af: > 8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\": > \"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\": > [{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\", > \"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil> > > Also, we think that the node was updated before the masters, however we > still see: > 2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No > column up in table Port_Binding.","error":"unknown > column","syntax":"{\"up\":true}"} Are we 100% sure that the node was updated before the masters? I'm asking because, unless there's a bug, ovn-controller should not try to write to Port_Binding.up if Port_Binding.up is not already set to a value in the SB (therefore the SB was already upgraded). > > and a lot of: > 2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered > database server is disconnected from cluster; trying another server > 2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute > next time. What is the state of the SB cluster? The logs seem to indicate that SB servers 10.0.0.5 and 10.0.0.6 are disconnected from each other and that 10.0.0.7 is not reachable. ovn-appctl -t ovnsb_db cluster/status OVN_Southbound It might also be useful to check the openshift-monitoring_alertmanager-main-1 port binding on both 10.0.0.6 and 10.0.0.6 with: ovn-sbctl --no-leader-only list port_binding openshift-monitoring_alertmanager-main-1 > > I've attached the ovn-controller log and relevant output. We also have a > live cluster with the problem. Dumitru can you PTAL? (In reply to Dumitru Ceara from comment #28) > (In reply to Tim Rozet from comment #27) > > So there are multiple problems here. One is that one of the master nodes is > > not functioning correctly. It looks like kubelet is no longer responding and > > we see coredumps in journald. Besides that issue though, we see that > > ovn-controller is not behaving correctly. We can see that the pod in > > question, openshift-monitoring_alertmanager-main-1, is scheduled on > > pdhamdhe091202-7x7q4-w-b-1.c.openshift-qe.internal. However OVN thinks it is > > on pdhamdhe091202-7x7q4-w-c-2.c.openshift-qe.internal. We can also see in > > SBDB that the pod has its IP set to: > > > > [root@pdhamdhe091202-7x7q4-m-1 ~]# ovn-nbctl list logical_switch_port > > openshift-monitoring_alertmanager-main-1 > > _uuid : da946c76-61d4-459a-a070-1949a56a4c63 > > addresses : ["0a:58:0a:80:02:22 10.128.2.34"] > > > > > > However, in ovn-controller: > > 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport > > openshift-monitoring_alertmanager-main-1 for this chassis. 
> > 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager- > > main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26 > > > > CNI request around the same time: > > I0409 07:34:37.563128 3132 cni.go:195] > > [openshift-monitoring/alertmanager-main-1 > > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3] ADD > > finished CNI request [openshift-monitoring/alertmanager-main-1 > > 6c7e5405c74e1fa4acf2624fdd0d8c7065790119e980006d912c88133ead02e3], result > > "{\"Result\":{\"interfaces\":[{\"name\":\"6c7e5405c74e1fa\",\"mac\":\"56:af: > > 8a:18:f0:19\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:80:02:22\",\"sandbox\": > > \"/var/run/netns/998fc171-fb3a-406f-bb3f-bbb2f7bc2ad1\"}],\"ips\": > > [{\"version\":\"4\",\"interface\":1,\"address\":\"10.128.2.34/23\", > > \"gateway\":\"10.128.2.1\"}],\"dns\":{}},\"PodIFInfo\":null}", err <nil> From ovn-controller logs: 2021-04-09T07:29:39Z|00009|reconnect|INFO|ssl:10.0.0.7:9642: connected <--- connected to .7 SB ... 2021-04-09T07:34:36Z|00101|binding|INFO|Claiming lport openshift-monitoring_alertmanager-main-1 for this chassis. 2021-04-09T07:34:36Z|00102|binding|INFO|openshift-monitoring_alertmanager-main-1: Claiming 0a:58:0a:81:02:1a 10.129.2.26 <--- It can be that .7 had stale Port_binding information ... 2021-04-09T07:38:12Z|00158|reconnect|ERR|ssl:10.0.0.7:9642: no response to inactivity probe after 30 seconds, disconnecting <--- Connection dropped to .7, was it upgraded here? 2021-04-09T07:38:12Z|00159|reconnect|INFO|ssl:10.0.0.7:9642: connection dropped 2021-04-09T07:38:12Z|00160|reconnect|INFO|ssl:10.0.0.5:9642: connecting... 2021-04-09T07:38:12Z|00161|reconnect|INFO|ssl:10.0.0.5:9642: connected ... 2021-04-15T20:56:53Z|00248|reconnect|INFO|ssl:10.0.0.6:9642: connected <--- last successful connect (quite a few days later) We don't see any more "Claiming lport openshift-monitoring_alertmanager-main-1" but that's because it was already claimed before by the same chassis. When the port_binding is updated with the new MAC addresses, ovn-controller will just update the openflows. Checking the current flows, they look fine: ovs-ofctl dump-flows br-int table=8 | grep metadata=0x17 cookie=0x7c8567e9, duration=609325.521s, table=8, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=100,metadata=0x17,vlan_tci=0x1000/0x1000 actions=drop cookie=0xb03bf160, duration=609325.539s, table=8, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=100,metadata=0x17,dl_src=01:00:00:00:00:00/01:00:00:00:00:00 actions=drop cookie=0xf0e785eb, duration=609325.513s, table=8, n_packets=1922566, n_bytes=930519010, idle_age=0, hard_age=65534, priority=50,reg14=0x1,metadata=0x17 actions=resubmit(,9) cookie=0xef54856a, duration=609029.531s, table=8, n_packets=359589, n_bytes=15107770, idle_age=0, hard_age=65534, priority=50,reg14=0x2d,metadata=0x17,dl_src=0a:58:0a:80:02:22 actions=resubmit(,9) <--- This matches the most recent SB values. > > > > Also, we think that the node was updated before the masters, however we > > still see: > > 2021-04-09T06:50:52Z|00018|ovsdb_idl|WARN|transaction error: {"details":"No > > column up in table Port_Binding.","error":"unknown > > column","syntax":"{\"up\":true}"} > > Are we 100% sure that the node was updated before the masters? I'm asking > because, unless there's a bug, ovn-controller should not try to write to > Port_Binding.up if Port_Binding.up is not already set to a value in the SB > (therefore the SB was already upgraded). 
> > > > > and a lot of: > > 2021-04-15T14:49:59Z|00240|ovsdb_idl|INFO|ssl:10.0.0.5:9642: clustered > > database server is disconnected from cluster; trying another server > > 2021-04-15T14:49:59Z|00241|main|INFO|OVNSB commit failed, force recompute > > next time. > > What is the state of the SB cluster? The logs seem to indicate that SB > servers 10.0.0.5 and 10.0.0.6 are disconnected from each other and that > 10.0.0.7 is not reachable. > > ovn-appctl -t ovnsb_db cluster/status OVN_Southbound > > It might also be useful to check the openshift-monitoring_alertmanager-main-1 > port binding on both 10.0.0.6 and 10.0.0.6 with: > > ovn-sbctl --no-leader-only list port_binding > openshift-monitoring_alertmanager-main-1 > > > > > I've attached the ovn-controller log and relevant output. We also have a > > live cluster with the problem. Dumitru can you PTAL? I tried to investigate some more what's happening with the SB on 10.0.0.7, pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal, but I can't seem to be able to connect to any of the pods running there. Starting a debug pod on the node fails too. Another weird thing is that 10.0.0.6 is connected to both .5 and .7: [root@pdhamdhe091202-7x7q4-m-1 ~]# ss -tn | grep :9644 ESTAB 0 0 10.0.0.6:47944 10.0.0.7:9644 ESTAB 0 0 10.0.0.6:9644 10.0.0.5:51602 ESTAB 0 0 10.0.0.6:9644 10.0.0.7:40618 ESTAB 0 0 10.0.0.6:53388 10.0.0.5:9644 But .5 is only connected to .6: [root@pdhamdhe091202-7x7q4-m-0 ~]# ss -tn | grep :9644 ESTAB 0 0 10.0.0.5:51602 10.0.0.6:9644 ESTAB 0 0 10.0.0.5:9644 10.0.0.6:45172 ESTAB 0 0 10.0.0.5:9644 10.0.0.6:53388 SYN-SENT 0 1 10.0.0.5:46114 10.0.0.7:9644 @Tim any idea on how to get the SB logs off 10.0.0.7? Created attachment 1772475 [details]
locked up node's sbdb logs
Hi Dumitru, see attachment. Ross was able to reboot the node for us and we have the logs now.
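For reference, the SB checks suggested above can be scripted roughly as follows. This is a hedged sketch: the ovnkube-master pod and container names and the control socket path are assumptions that may differ per deployment, and ovn-sbctl may need an explicit --db argument.

# Raft status of the Southbound DB on every master
for pod in $(oc -n openshift-ovn-kubernetes get pods -o name | grep ovnkube-master); do
  echo "== $pod =="
  oc -n openshift-ovn-kubernetes exec "$pod" -c sbdb -- \
    ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
done

# Compare the port binding as seen by leader and followers
oc -n openshift-ovn-kubernetes exec <an-ovnkube-master-pod> -c sbdb -- \
  ovn-sbctl --no-leader-only list port_binding openshift-monitoring_alertmanager-main-1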
ovn-controller crash (this is printed in reverse TS) Apr 13 21:46:45 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd-coredump[3043824]: Process 5658 (ovn-controller) of user 0 dumped core. Stack trace of thread 5658: #0 0x00007fed2fcf284f raise (libc.so.6) #1 0x00007fed2fcdcc45 abort (libc.so.6) #2 0x0000556945a939a4 n/a (/usr/bin/ovn-controller) #3 0x00007ffd05af0bf0 n/a (n/a) Apr 13 21:46:18 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: 10.0.0.3 is manually marked as down, skipping for internal clients... Apr 13 21:46:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-conmon-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Consumed 832ms CPU time Apr 13 21:46:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-conmon-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Succeeded. Apr 13 21:46:01 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Consumed 2min 9.986s CPU time Apr 13 21:46:00 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: crio-e037ecced891ee9734ed17043dfaf69d9e7bb6f5c8ce0bcfaa8e4c251c83e802.scope: Succeeded. Apr 13 21:41:52 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal systemd[1]: Started Process Core Dump (PID 3043789/UID 0). Apr 13 21:40:33 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: done applying vip rules Apr 13 21:40:03 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03669|rconn|ERR|br-int<->unix#66810: no response to inactivity probe after 60 seconds, disconnecting Apr 13 21:39:57 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 34.123.66.142 for internal clients Apr 13 21:39:26 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 34.123.66.142 for external clients Apr 13 21:38:50 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: ensuring rule for 10.0.0.3 for external clients Apr 13 21:38:35 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03668|rconn|ERR|br-int<->unix#66844: no response to inactivity probe after 60 seconds, disconnecting Apr 13 21:37:42 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: Processing route for NIC 0/42:01:0a:00:00:07 for 34.123.66.142 Apr 13 21:35:15 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal bash[3038394]: 10.0.0.3 is manually marked as down, skipping for internal clients... Apr 13 21:35:06 pdhamdhe091202-7x7q4-m-2.c.openshift-qe.internal ovs-vswitchd[1327]: ovs|03666|rconn|ERR|br-int<->unix#66843: no response to inactivity probe after 60 seconds, disconnecting Created attachment 1772554 [details]
journald and ovn-controller coredump
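For completeness, a hedged sketch of pulling coredumps and the surrounding journal off a node when ssh is not available (standard oc debug plus systemd tooling; the node name and time window are placeholders; later in this bug `oc adm must-gather -- gather_core_dumps` is used for the same purpose):

# List recent coredumps on the host
oc debug node/<master-node> -- chroot /host coredumpctl list

# Export the most recent ovn-controller core to the host's /tmp
oc debug node/<master-node> -- chroot /host \
  coredumpctl dump ovn-controller -o /tmp/ovn-controller.core

# Capture the journal around the crash window
oc debug node/<master-node> -- chroot /host \
  journalctl --since "2021-04-13 21:30" --until "2021-04-13 22:00" > node-journal.log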
Dumitru, can you PTAL at the coredump? Upgrading cluster from 4.7.0 -> 4.7.3 -> 4.7.6 seems to have the same behavior. (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-1_openshift-monitoring_6fe22f18-6d30-449b-b6fc-949992ed0688_0(8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825): [openshift-monitoring/alertmanager-main-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-1 8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825] [openshift-monitoring/alertmanager-main-1 8b3c4637e89146b1937ef3bbc5bf6fc946be5aafd39aeadbfbfb9d9e01a65825] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows ' must gather log attached to Case 02920947 (In reply to Tim Rozet from comment #33) > Dumitru, can you PTAL at the coredump? I had a look at the coredump, adding the steps I did here for reference, there's probably a better way to do this but this worked for me: 1. Bring up an OCP CI cluster with clusterbot running the version the bug was reported on (4.7 nightly). 2. Spin up a debug pod on one of the nodes, e.g.: # oc debug node/ci-ln-kn1stvb-f76d1-llzzm-master-0 3. Copy the ovn-controller core dump to the debug pod, along with the OVN rpms (including debuginfo) corresponding to the OVN version that was running (ovn2.13-20.12.0-24): # for r in $(ls *.rpm); do oc cp $r ci-ln-kn1stvb-f76d1-llzzm-master-0-debug:/tmp; done # oc cp core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000.lz4 ci-ln-kn1stvb-f76d1-llzzm-master-0-debug:/tmp/ 4. On the debug pod install gdb and the OVN rpms: # yum install -y gdb # rpm -Uhv --force --nodeps /tmp/ovn2.13* 5. 
Load the coredump in gdb: # gdb /usr/bin/ovn-controller /tmp/core.ovn-controller.0.bf9ed7e2bc934d0a9e5db8b29b87997a.5658.1618349886000000 (gdb) bt #0 0x00007fed2fcf284f in raise () from /lib64/libc.so.6 #1 0x00007fed2fcdcc45 in abort () from /lib64/libc.so.6 #2 0x0000556945a939a4 in ovs_abort_valist (err_no=err_no@entry=0, format=format@entry=0x556945b73eb0 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd05af0b10) at lib/util.c:419 #3 0x0000556945a9b794 in vlog_abort_valist (module_=<optimized out>, message=0x556945b73eb0 "%s: assertion %s failed in %s()", args=args@entry=0x7ffd05af0b10) at lib/vlog.c:1249 #4 0x0000556945a9b83a in vlog_abort (module=module@entry=0x556945e30e80 <this_module>, message=message@entry=0x556945b73eb0 "%s: assertion %s failed in %s()") at lib/vlog.c:1263 #5 0x0000556945a936bb in ovs_assert_failure (where=where@entry=0x556945b4e9a6 "controller/ofctrl.c:1917", function=function@entry=0x556945b4eff0 <__func__.34362> "merge_tracked_flows", condition=condition@entry=0x556945b4e990 "del_f->installed_flow") at lib/util.c:86 #6 0x00005569459bfbaa in merge_tracked_flows (flow_table=0x5569470c5b40) at /usr/src/debug/ovn2.13-20.12.0-24.el8fdp.x86_64/openvswitch-2.14.90/include/openvswitch/hmap.h:283 #7 update_installed_flows_by_track (msgs=0x7ffd05af0c40, flow_table=0x5569470c5b40) at controller/ofctrl.c:1946 #8 ofctrl_put (flow_table=flow_table@entry=0x5569470c5b40, pending_ct_zones=pending_ct_zones@entry=0x5569470ee560, meter_table=<optimized out>, req_cfg=req_cfg@entry=0, flow_changed=flow_changed@entry=true) at controller/ofctrl.c:2130 #9 0x00005569459abdae in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:2931 (gdb) frame 6 #6 0x00005569459bfbaa in merge_tracked_flows (flow_table=0x5569470c5b40) at /usr/src/debug/ovn2.13-20.12.0-24.el8fdp.x86_64/openvswitch-2.14.90/include/openvswitch/hmap.h:283 283 hmap_expand_at(hmap, where); (gdb) p del_f->installed_flow $2 = (struct installed_flow *) 0x0 (gdb) list ofctrl.c:1917 1912 continue; 1913 } 1914 1915 /* del_f must have been installed, otherwise it should have been 1916 * removed during track_flow_add_or_modify. */ 1917 ovs_assert(del_f->installed_flow); 1918 1919 if (!f->installed_flow) { 1920 /* f is not installed yet. */ 1921 replace_installed_to_desired(del_f->installed_flow, del_f, f); There were a few fixes to the OF incremental processing in ovn-controller recently: [1] https://github.com/ovn-org/ovn/commit/c6c61b4e3462fb5201a61a226c2acaf6f4caf917 [2] https://github.com/ovn-org/ovn/commit/858d1dd716db1a1e664a7c1737fd34f04fcbda5e [3] https://github.com/ovn-org/ovn/commit/6975c649f932633042ca54df2d8f8f0eb866c344 IIRC, without [2], the flood_remove_flows_for_sb_uuid() was potentially expanding a hmap while iterating through it which may cause entries to be visited twice or skipped. That might explain what we're seeing above. All the above fixes are available downstream since version 20.12.0-63. It might be a good idea anyway to bump ovn version in ovn-kubernetes to something >= 20.12.0-63 and retest. @Tim, could you please help out with this? I'll open an OVN BZ too in order to investigate and make sure that the crash above is not a new one though. I didn't analyze the SB logs from 10.0.0.7 yet, but the crash seems unrelated to that. I'll have a look at the SB logs soon. Regards, Dumitru I opened https://bugzilla.redhat.com/show_bug.cgi?id=1951502 for further investigation of the ovn-controller core dump. 
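Condensing the analysis steps above into one hedged sequence (the RPM and core file names are the ones from this bug; lz4 is only needed if the core is still compressed, and the debug pod name pattern is an assumption):

# 1. Start a debug pod on a node running the matching OVN version
oc debug node/<node>

# 2. From another terminal, copy the core plus the ovn2.13 RPMs (incl. debuginfo) into it
oc cp ovn2.13-20.12.0-24.el8fdp.x86_64.rpm <node>-debug:/tmp/
oc cp core.ovn-controller.<...>.lz4 <node>-debug:/tmp/

# 3. Inside the debug pod: install gdb (and lz4), force-install the matching RPMs
yum install -y gdb lz4
rpm -Uhv --force --nodeps /tmp/ovn2.13*

# 4. Decompress and load the core
lz4 -d /tmp/core.ovn-controller.<...>.lz4 /tmp/core.ovn-controller
gdb /usr/bin/ovn-controller /tmp/core.ovn-controller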
Bug 1951502 is a new crash, the changelog has been rewritten for ovn2.13 but the -24 version that was running on the cluster includes all current upstream ofctrl related fixes. In addition to ovn-controller crashing here, we see the node lock up and journald crash. We cannot ssh to the node, only ping it. It looks like the system is running OOM. After speaking with Ryan, the known kernel OOM bug did not make it into 4.7.4. It is only in 4.7.6 and later: https://bugzilla.redhat.com/show_bug.cgi?id=1857446 It is most likely we are also hitting that issue here. Ross, can you help us try to reproduce this issue on the latest 4.7 and see if the node locks up? OVN team would also like to poke around there if we can reproduce the ovn-controller crash and get the NBDB. Also, the original bug says 4.5->4.6->4.7, but it seems like some folks have reproduced this just going from something like 4.7.2->4.7.3. Are there other paths to hit this bug? Is it simply any 4.6 or 4.7 upgrade to 4.7.3 or 4.7.4? Ross helped us attempt two upgrades. In both of these upgrades the ovn-controller crash was not seen, but there were other OVN crashes in northd and the ovsdb. Also we did not see any kernel OOM and the cluster appears to upgrade successfully (northd and ovsdb containers are restarted). This gives us some confidence that the node lockup we saw before on an earlier 4.7 is most likely the kernel bug that was fixed in later 4.7. 1. 4.6.26 --> 4.7.0-0.nightly-2021-04-22-105612. OVN versions: ovn2.13-20.06.2-11.el8fdp --> 20.12.0-24.el8fdp OVN northd and ovsdb crashes in this case on two master nodes with coredumps: http://file.rdu.redhat.com/~rbrattai/logs/must-gather.local.557225639646412986.tar http://file.rdu.redhat.com/~rbrattai/logs/must-gather.local.2229250353150456641.tar.zst 2. 4.6.23 --> 4.7.0-0.nightly-2021-04-22-105612. OVN versions: ovn2.13-20.09.0-7.el8fdn --> 20.12.0-24.el8fdp OVN northd crash on two master nodes with coredumps. I have those cores on my laptop and can share or Ross can provide a link. We can try a few more times to reproduce the ovn-controller crash on latest 4.7 and also ensure the node lockup doesn't happen, but we also need OVN team to look into these new coredumps. Created attachment 1774638 [details]
coredump for northd
Actually the 4.6.23->4.7 cores are small enough I can attach here.
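Regarding the suggestion above to retest with ovn2.13 >= 20.12.0-63, a quick hedged way to confirm which OVN build a cluster is actually running (the ovn-controller container name is an assumption; adjust as needed):

# Pick one ovnkube-node pod
oc -n openshift-ovn-kubernetes get pods -o name | grep ovnkube-node | head -1

# Query the packaged version and the binary itself
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c ovn-controller -- rpm -q ovn2.13
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c ovn-controller -- ovn-controller --version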
add another upgrade paths,seems they are good, the link below are the must-gather coredumps from `oc adm must-gather -- gather_core_dumps` 1. 4.6.23-->4.7.3 http://file.apac.redhat.com/~zzhao/must-gather.local.4251437677671016869.tar.gz 2. 4.6.23--> 4.7.0-0.nightly-2021-04-22-105612 http://file.apac.redhat.com/~zzhao/must-gather.local.507640507921899081.tar.gz On 4.7.0-0.nightly-2021-04-22-105612 it seems that `oc delete daemonsets ovnkube-master --wait=false` triggers ovn-northd coredumps reliably, but the ovnkube-master pods recover. Fri Apr 23 17:15:25.0250636310 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.5930.1619198124000000.lz4 Fri Apr 23 17:18:49.2774560510 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1667572.1619198328000000.lz4 Fri Apr 23 17:32:57.3799121480 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1672345.1619199176000000.lz4 Fri Apr 23 17:38:50.6163434580 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1690961.1619199529000000.lz4 Fri Apr 23 17:48:24.9831908680 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1699300.1619200104000000.lz4 Fri Apr 23 18:23:35.1131995010 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1712140.1619202213000000.lz4 Fri Apr 23 18:26:55.6223641060 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1760256.1619202414000000.lz4 Fri Apr 23 18:35:45.4445084530 2021 core.ovn-northd.0.aab7de61aed44c39b5cb8ecd82433323.1765504.1619202944000000.lz4 After 28 hrs the 4.6.26 --> 4.7.0-0.nightly-2021-04-22-105612 cluster had master-0 enter `NotReady` state. I was able to ssh into the node. The system was mostly unresponsive, load avg 177.71. After stopping `kubelet` I observed various processes with high CPU utilization. `etcd` consistently used 12% cpu. Also observed core.systemd-journal coredump [77045.656418] [Fri Apr 23 17:15:18 2021] traps: ovn-northd[4950] general protection fault ip:7f68e4a83d21 sp:7fff6692fd50 error:0 in libc-2.28.so[7f68e4a62000+1b9000] [77249.821609] [Fri Apr 23 17:18:42 2021] traps: ovn-northd[1713666] general protection fault ip:7f6de542fd21 sp:7fffbe200e30 error:0 in libc-2.28.so[7f6de540e000+1b9000] [77805.333438] [Fri Apr 23 17:27:58 2021] traps: ovn-northd[1719579] general protection fault ip:7fa260ea0d21 sp:7fff23dee380 error:0 in libc-2.28.so[7fa260e7f000+1b9000] [78600.214465] [Fri Apr 23 17:41:13 2021] traps: ovn-northd[1733228] general protection fault ip:7f6760f61d21 sp:7fffce55f120 error:0 in libc-2.28.so[7f6760f40000+1b9000] [79025.462822] [Fri Apr 23 17:48:18 2021] traps: ovn-northd[1753077] general protection fault ip:7f2cefde5d21 sp:7ffc20f249e0 error:0 in libc-2.28.so[7f2cefdc4000+1b9000] [81135.198549] [Fri Apr 23 18:23:28 2021] traps: ovn-northd[1764871] general protection fault ip:7f0255198d21 sp:7ffd2a84b730 error:0 in libc-2.28.so[7f0255177000+1b9000] [81335.999986] [Fri Apr 23 18:26:48 2021] traps: ovn-northd[1813835] general protection fault ip:7fc852689d21 sp:7ffff63b7b60 error:0 in libc-2.28.so[7fc852668000+1b9000] [81865.552301] [Fri Apr 23 18:35:38 2021] traps: ovn-northd[1820142] general protection fault ip:7efecb808d21 sp:7ffd98acbe50 error:0 in libc-2.28.so[7efecb7e7000+1b9000] [82372.566797] [Fri Apr 23 18:44:05 2021] traps: ovn-northd[1834086] general protection fault ip:7fa77dd1dd21 sp:7ffced119070 error:0 in libc-2.28.so[7fa77dcfc000+1b9000] [82574.761351] [Fri Apr 23 18:47:27 2021] traps: ovn-northd[1846875] general protection fault ip:7ffa9d6b9d21 sp:7fff00cf93b0 error:0 in libc-2.28.so[7ffa9d698000+1b9000] [87315.468193] [Fri Apr 23 
20:06:28 2021] systemd[1]: systemd-journald.service: State 'stop-sigabrt' timed out. Terminating. [87322.226438] [Fri Apr 23 20:06:35 2021] systemd[1]: Starting Cleanup of Temporary Directories... [87346.312381] [Fri Apr 23 20:06:59 2021] systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT [87346.753929] [Fri Apr 23 20:06:59 2021] systemd[1]: systemd-journald.service: Failed with result 'watchdog'. [87350.565027] [Fri Apr 23 20:07:03 2021] systemd[1]: systemd-journald.service: Consumed 55.603s CPU time [87351.639607] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart. [87351.654227] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 2. [87351.684096] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journal-flush.service: Succeeded. [87351.703414] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopped Flush Journal to Persistent Storage. [87351.710239] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journal-flush.service: Consumed 0 CPU time [87351.741727] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopping Flush Journal to Persistent Storage... [87351.758087] [Fri Apr 23 20:07:04 2021] systemd[1]: Stopped Journal Service. [87351.765212] [Fri Apr 23 20:07:04 2021] systemd[1]: systemd-journald.service: Consumed 55.603s CPU time [87353.483285] [Fri Apr 23 20:07:06 2021] systemd[1]: Starting Journal Service... [87366.172243] [Fri Apr 23 20:07:19 2021] systemd-journald[1952234]: File /var/log/journal/41a9ccb005fa25b06249ece132b36d26/system.journal corrupted or uncleanly shut down, renaming and replacing. [87369.562936] [Fri Apr 23 20:07:22 2021] systemd[1]: Started Journal Service. [87400.134351] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Process 892 (systemd-journal) of user 0 dumped core. [87400.143186] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.8a8ac32fd64c4e5f90eec19afbfb0dcb.892.1619208301000000.lz4 [87400.159442] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: Stack trace of thread 892: [87400.165967] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #0 0x00007f52af6b80e4 journal_file_find_data_object_with_hash (libsystemd-shared-239.so) [87400.178615] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #1 0x00007f52af6b8359 journal_file_append_data (libsystemd-shared-239.so) [87400.190137] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #2 0x00007f52af6ba701 journal_file_append_entry (libsystemd-shared-239.so) [87400.201681] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #3 0x0000555ee6f2cd1c dispatch_message_real (systemd-journald) [87400.212421] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #4 0x0000555ee6f3239d stdout_stream_log (systemd-journald) [87400.221925] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #5 0x0000555ee6f32654 stdout_stream_line (systemd-journald) [87400.232066] [Fri Apr 23 20:07:53 2021] systemd-coredump[1952188]: #6 0x0000555ee6f32def stdout_stream_scan (systemd-journald) [87413.034295] [Fri Apr 23 20:08:06 2021] printk: systemd-coredum: 7 output lines suppressed due to ratelimiting I reniced -n 20 etcd and was able to generate `sosreport` and reduce load avg to <1.0 systemd-journald.service abort occured at (87315.468193 / 3600) = 24.27 hours after boot. The Fri Apr 23 ovn-northd coredumps listed above were triggered after testing. 
The command `oc delete daemonsets ovnkube-master --wait=false` reliably triggers the coredumps, after which the cluster seemed to recover, until I noticed the `NotReady` state at around Fri Apr 23 21:36. ovn*_db.db were recovered, along with ovn logs. OVN and etcd are still running in their original state. In the current cluster all the alertmanager-main-? pods respond to metric curl, so that doesn't seem to be the same issue. # token=`oc sa get-token prometheus-k8s -n openshift-monitoring` # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.128.2.35:9095/metrics' Otherwise, these events match the previous cluster 4.6.23-x86_64--> 4.7.0-0.nightly-2021-04-05-201242 `NotReady` lockup. Those nodes became `NotReady` after a few days. I am attempting to reproduce on a 4.7.0-0.nightly-2021-04-23-165253 without upgrading, letting it soak without interaction for 24 hours. This cluster and the previous cluster are UPI on GCP with OVN and etcd encryption enabled. Created attachment 1774965 [details]
ovndb from the NotReady node, cluster 230024
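For anyone retrying Ross's reproducer above, a hedged sketch; the daemonset delete and the coredump path are taken from the comments and logs in this bug, while the master node label is an assumption:

# Trigger the daemonset recreation that reliably produced the northd cores
oc -n openshift-ovn-kubernetes delete daemonset ovnkube-master --wait=false

# Once the operator has recreated the pods, look for fresh cores on each master
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "== $node =="
  oc debug "$node" -- chroot /host ls -lt /var/lib/systemd/coredump | head
done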
Created attachment 1774966 [details]
ovn, etcd logs from NotReady node 230024
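The attached etcd logs can be scanned for the overload symptoms that show up later in the analysis; a hedged sketch (the live-cluster variant assumes the usual openshift-etcd pod and container naming):

# From the attached log files
grep -E 'failed to send out heartbeat|likely overloaded' etcd.log | tail

# Or live, from the etcd pod on the affected master
oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep -iE 'overloaded|heartbeat' | tail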
Created attachment 1774967 [details]
first top command on NotReady node, cluster 230024
Looks like authentication-operator maybe, 7.8g and 53.5% MEM PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 17742 root 20 0 9734552 7.8g 0 S 0.7 53.5 144:51.38 authentication- 17742 root 20 0 9734552 7.8g 0 S 0.3 53.5 144:57.52 authentication- Apr 23 20:01:52 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:01:51.967365 1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe> Apr 23 20:07:29 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:02:18.253039 1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe> Apr 23 20:01:52 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:01:51.967365 1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe> Apr 23 20:07:29 rbrattai230024-8tzfq-master-0.c.openshift-qe.internal hyperkube[1752]: I0423 20:02:18.253039 1752 prober.go:117] Liveness probe for "authentication-operator-7f78b8b8cd-hjhqc_openshift-authentication-operator(ed3af4f4-8852-48b6-a410-8150eb608eaf):authe> Created attachment 1774968 [details]
authentication-operator-logs
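Given the memory numbers in the top output above, and the OOM-pressure suspicion raised below, a hedged sketch for ranking resident memory and checking for OOM-killer activity on a suspect node (the node name is a placeholder):

# Top resident-memory consumers on the host
oc debug node/<master-node> -- chroot /host ps -eo pid,rss,comm --sort=-rss | head -15

# Any OOM-killer activity in the kernel log
oc debug node/<master-node> -- chroot /host dmesg -T | grep -iE 'out of memory|oom' | tail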
Seems like around Apr 23rd, 20:01:50 things start to go south on that node. 2021-04-23 20:01:55.028926 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 148.74685ms, to e57937edfb61fd39) 2021-04-23 20:01:55.028953 W | etcdserver: server is likely overloaded around that time ovn-controller starts having trouble too: 2021-04-23T18:49:09Z|00312|reconnect|INFO|ssl:10.0.0.7:9642: connected ... 2021-04-23T20:02:09Z|00313|timeval|WARN|Unreasonably long 3932ms poll interval (0ms user, 3ms system) 2021-04-23T20:02:10Z|00314|timeval|WARN|faults: 19 minor, 11 major 2021-04-23T20:02:10Z|00315|timeval|WARN|disk: 2528 reads, 0 writes 2021-04-23T20:02:10Z|00316|timeval|WARN|context switches: 29 voluntary, 4 involuntary ... 2021-04-23T20:02:10Z|00339|coverage|INFO|util_xalloc 0.8/sec 208.033/sec 132.1406/sec total: 106648956 eventually our connections to OVS have problems (though it's 45m later): 2021-04-23T20:46:49Z|00639|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting... 2021-04-23T20:46:49Z|00640|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected that continues through to the end of the logs. I think at this point we'd need a full must-gather on etcd, authentication-operator, systemd journal (for OVS and kubelet), and OVN components on a couple nodes to better root-cause. But quite possibly OOM pressure? If someone wants access to a real cluster with the problem I can provide it. I've also opened case #02920882 but the guy seems to concentrate on other things, I have jounald and ovn-controller coredump on all the three master node, the last is quite big at around 50 to 70Mb. CPU utilization on the cluster - now void without user workload - raised to 20-25% compared to 5% of the 4.7.3 version. We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions. Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking? example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time What is the impact? Is it serious enough to warrant blocking edges? example: Up to 2 minute disruption in edge routing example: Up to 90seconds of API downtime example: etcd loses quorum and you have to restore from backup How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)? example: Issue resolves itself after five minutes example: Admin uses oc to fix things example: Admin must SSH to hosts, restore from backups, or other non standard admin activities Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)? 
example: No, it’s always been like this we just never noticed example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1 Regarding the ovn-northd crash seen, below is the investigation details from Dumitru Ceara ----- I had a look at the ovn-northd core dumps and it might be an issue in the OVS lib/fatal-signal.c implementation, but it's definitely nothing critical, ovn-northd was terminating anyway with signal 15 (SIGTERM): Reading symbols from /usr/bin/ovn-northd...Reading symbols from /usr/lib/debug/usr/bin/ovn-northd-20.12.0-24.el8fdp.x86_64.debug...done. done. [New LWP 1] warning: Could not load shared library symbols for /lib64/libunbound.so.2. Do you need "set solib-search-path" or "set sysroot"? warning: Loadable section ".note.gnu.property" outside of ELF segments warning: Loadable section ".note.gnu.property" outside of ELF segments warning: Loadable section ".note.gnu.property" outside of ELF segments [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". warning: Loadable section ".note.gnu.property" outside of ELF segments warning: Loadable section ".note.gnu.property" outside of ELF segments Core was generated by `ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.0.3:9641,ss'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007f525d04ed21 in abort () from /lib64/libc.so.6 Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-127.el8_3.2.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 python3-libs-3.6.8-31.el8.x86_64 zlib-1.2.11-16.2.el8_3.x86_64 (gdb) bt #0 0x00007f525d04ed21 in abort () from /lib64/libc.so.6 #1 0x00005638777ebe9c in fatal_signal_run () at lib/fatal-signal.c:313 #2 0x0000563877831110 in poll_block () at lib/poll-loop.c:388 #3 0x0000563877775cd0 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14267 (gdb) frame 0 #0 0x00007f525d04ed21 in abort () from /lib64/libc.so.6 (gdb) frame 1 #1 0x00005638777ebe9c in fatal_signal_run () at lib/fatal-signal.c:313 313 OVS_NOT_REACHED(); (gdb) list 308 * termination status reflects that we were killed by this signal */ 309 signal(sig_nr, SIG_DFL); 310 raise(sig_nr); 311 312 ovs_mutex_unlock(&mutex); 313 OVS_NOT_REACHED(); 314 } 315 } 316 317 void ----- From the above investigation, this definitely not a big concern. I think ovsdb-server crash could be due to the same reason as lib/fatal-signal.c is part of ovs code base (which OVN also uses). Adding ImpactStatementRequested per comment 58 and [1]. [1]: https://github.com/openshift/enhancements/pull/475 Who is impacted? Customers upgrading from 4.6.z to 4.7.3+. What is the impact? Is it serious enough to warrant blocking edges? OVN coredumps (ovn-northd) will be observed when the northd process is already exiting during container termination. Functionality of OVN is not affected. How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)? No remediation is required, upgrade will be successful. Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)? No. The coredumps come from 4.6 components during upgrade process. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days |