Created attachment 1861766 [details]
ovn containers logs

Description of problem:
This is one of the issues found in BZ#2055857. We found that an SNO node in our lab could not recover automatically from a DHCP service outage, and we managed to reproduce the issue by stopping the DHCP service for a while. Thanks to @dcbw, we eventually found that ovn-northd apparently does not wake up for its poll intervals when the DB IP cannot be reached.

Version-Release number of selected component (if applicable):
OCP 4.9.18

How reproducible:

Steps to Reproduce:
1. Stop the DHCP service for 10 minutes.
2. Start the DHCP service again.
3. Keep monitoring whether the cluster recovers automatically.
(A command-level sketch of these steps is included at the end of this report.)

Actual results:
The cluster could not recover, due to several errors mentioned in BZ#2055857; the root cause is that pods hit the error below:

Warning FailedCreatePodSandBox 3m49s (x46 over 50m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx_default_93b27f51-58bf-4f80-916b-e10ae03ae10e_0(72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11): error adding pod default_nginx to CNI network "multus-cni-network": [default/nginx/93b27f51-58bf-4f80-916b-e10ae03ae10e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] [default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:19 [10.128.0.25/23]

Expected results:
The cluster recovers automatically without any manual intervention.

Additional info:
Besides the attached logs, at some point the cluster appeared to recover automatically after we ran the command below, seemingly from a 'reload'?

kni@jumphost ~/sno/sno147 $ oc rsh -c northd -n openshift-ovn-kubernetes ovnkube-master-p7v7g
sh-4.4# ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg
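As referenced in the steps above, a minimal reproduction sketch. It assumes the lab DHCP server is a separate host reachable over SSH and running dhcpd under systemd, and that the jumphost has oc access to the cluster; host name, service name, and timings are placeholders to adapt to your environment.

# 1. Stop DHCP on the lab DHCP server (assumed host and service name).
ssh root@dhcp-server 'systemctl stop dhcpd'
# Leave it down long enough for the SNO node's lease to expire (10 minutes here).
sleep 600
# 2. Start the DHCP service again.
ssh root@dhcp-server 'systemctl start dhcpd'
# 3. Watch whether the cluster recovers on its own: look for pods stuck in
#    ContainerCreating and for FailedCreatePodSandBox events.
watch -n 30 'oc get pods -A | grep -vE "Running|Completed"'
oc get events -A --field-selector reason=FailedCreatePodSandBox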
First we thought the northd probe interval was set to 0, but it was 5000: sh-4.4# ovn-nbctl get NB_GLOBAL . options:northd_probe_interval "5000" nbctl daemon logs (from ovnkube-master pod): 2022-02-17T18:22:10Z|00002|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:22:11Z|00005|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:22:13Z|00009|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:22:17Z|00013|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable <snip> 2022-02-17T18:28:17.387Z|00062|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:28:25.395Z|00063|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:28:33.404Z|00064|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:28:41.412Z|00065|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:28:49.420Z|00066|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable 2022-02-17T18:28:57.431Z|00067|reconnect|INFO|ssl:192.168.58.108:9641: connected <eg, connectivity is restored at 18:28:57 at least> ovn-northd logs: 2022-02-17T18:20:05Z|17665|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20529|reconnect|DBG|ssl:192.168.58.108:9642: idle 29997 ms, sending inactivity probe 2022-02-17T18:20:05Z|20530|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE 2022-02-17T18:20:05Z|20531|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo" 2022-02-17T18:20:05Z|20532|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T18:20:05Z|20533|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T18:20:05Z|17666|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17667|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20534|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage) 2022-02-17T18:20:05Z|17668|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20535|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T18:20:05Z|20536|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T18:20:05Z|20537|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[], id="echo" 2022-02-17T18:20:05Z|20538|reconnect|DBG|ssl:192.168.58.108:9642: entering ACTIVE 2022-02-17T18:20:05Z|17669|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20539|poll_loop|DBG|wakeup due to 0-ms timeout at lib/stream-ssl.c:834 (0% CPU usage) 2022-02-17T18:20:05Z|17670|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17671|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20540|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% 
CPU usage) 2022-02-17T18:20:05Z|20541|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T18:20:05Z|20542|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T18:20:05Z|17672|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20543|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update2", params=[["monid","_Server"],{"Database":{"91a38ed7-d7ac-45c2-8123-659db7650ff6":{"modify":{"index":48732}} }}] 2022-02-17T18:20:05Z|20544|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T18:20:05Z|20545|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T18:20:05Z|20546|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update3", params=[["monid","OVN_Southbound"],"0b74fdca-f575-42d3-ba2a-533e98ab2df2",{"SB_Global":{"72de3344-ac6f-4f2 e-b26f-0b2087b05614":{"modify":{"options":["map",[["e2e_timestamp","1645122005"]]]}}}}] 2022-02-17T18:20:05Z|20547|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage) 2022-02-17T18:20:05Z|17673|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|20548|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T18:20:05Z|20549|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T18:20:05Z|20550|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[{"count":1},{"count":1},{},{}], id=4200 2022-02-17T18:20:05Z|17674|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17675|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17676|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17677|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17678|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17679|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17680|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17681|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17682|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T18:20:05Z|17683|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) < connectivity is interrupted by allowing the DHCP lease to expire, and the IP to be removed from the node's interface > < northd goes to sleep and doesn't wake up for a long time > < then allow DHCP to re-acquire the lease for the same IP and assign it back to the interface around 18:28:57, but northd doesn't wake up for an hour > 2022-02-17T19:28:20Z|20551|poll_loop|DBG|wakeup due to [POLLIN] on fd 6 
(/var/run/ovn/ovn-northd.1.ctl<->) at lib/stream-fd.c:274 (0% CPU usage) < attempt to wake northd up by running "ovn-appctl status", it resurrects itself > 2022-02-17T19:28:20Z|17684|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20552|hmap|DBG|Dropped 9 log messages in last 4095 seconds (most recently, 4095 seconds ago) due to excessive rate 2022-02-17T19:28:20Z|20553|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets) 2022-02-17T19:28:20Z|20554|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets) 2022-02-17T19:28:20Z|20555|hmap|DBG|lib/ovn-util.c:613: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets) 2022-02-17T19:28:20Z|17685|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17686|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17687|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17688|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20556|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (64 nodes total across 64 buckets) 2022-02-17T19:28:20Z|20557|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 7 nodes (128 nodes total across 128 buckets) 2022-02-17T19:28:20Z|20558|hmap|DBG|lib/hmapx.c:41: 3 buckets with 6+ nodes, including 1 bucket with 8 nodes (256 nodes total across 256 buckets) 2022-02-17T19:28:20Z|20559|hmap|DBG|lib/hmapx.c:41: 4 buckets with 6+ nodes, including 3 buckets with 7 nodes (512 nodes total across 512 buckets) 2022-02-17T19:28:20Z|20560|hmap|DBG|lib/hmapx.c:41: 10 buckets with 6+ nodes, including 3 buckets with 7 nodes (1024 nodes total across 1024 buckets) 2022-02-17T19:28:20Z|17689|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17690|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20561|jsonrpc|DBG|unix#1: received request, method="status", params=[], id=0 2022-02-17T19:28:20Z|20562|unixctl|DBG|received request status[], id=0 2022-02-17T19:28:20Z|20563|unixctl|DBG|replying with success, id=0: "Status: active " 2022-02-17T19:28:20Z|20564|jsonrpc|DBG|unix#1: send reply, result="Status: active\n", id=0 2022-02-17T19:28:20Z|20565|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:664 (0% CPU usage) 2022-02-17T19:28:20Z|17691|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20566|reconnect|DBG|ssl:192.168.58.108:9641: idle 4094953 ms, sending inactivity probe 2022-02-17T19:28:20Z|20567|reconnect|DBG|ssl:192.168.58.108:9641: entering IDLE 2022-02-17T19:28:20Z|20568|jsonrpc|DBG|ssl:192.168.58.108:9641: send request, method="echo", params=[], id="echo" 2022-02-17T19:28:20Z|20569|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes) 
2022-02-17T19:28:20Z|20570|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes) 2022-02-17T19:28:20Z|20571|stream_ssl|WARN|SSL_read: system error (Connection reset by peer) 2022-02-17T19:28:20Z|20572|jsonrpc|WARN|ssl:192.168.58.108:9641: receive error: Connection reset by peer 2022-02-17T19:28:20Z|20573|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:20Z|20574|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes) 2022-02-17T19:28:20Z|20575|reconnect|DBG|ssl:192.168.58.108:9642: idle 4094944 ms, sending inactivity probe 2022-02-17T19:28:20Z|20576|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE 2022-02-17T19:28:20Z|20577|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo" 2022-02-17T19:28:20Z|20578|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T19:28:20Z|20579|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T19:28:20Z|20580|stream_ssl|WARN|SSL_read: system error (Connection reset by peer) 2022-02-17T19:28:20Z|20581|jsonrpc|WARN|ssl:192.168.58.108:9642: receive error: Connection reset by peer 2022-02-17T19:28:20Z|20582|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T19:28:20Z|20583|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes) 2022-02-17T19:28:20Z|17692|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20584|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets) 2022-02-17T19:28:20Z|20585|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets) 2022-02-17T19:28:20Z|17693|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17694|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17695|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17696|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17697|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17698|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17699|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17700|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20586|poll_loop|DBG|wakeup due to 0-ms timeout at ssl:192.168.58.108:9641 (0% CPU usage) 2022-02-17T19:28:20Z|17701|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|17702|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:20Z|20587|reconnect|WARN|ssl:192.168.58.108:9641: connection dropped (Connection reset by peer) 
2022-02-17T19:28:20Z|20588|reconnect|DBG|ssl:192.168.58.108:9641: entering BACKOFF 2022-02-17T19:28:20Z|20589|ovsdb_cs|DBG|ssl:192.168.58.108:9641: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419 2022-02-17T19:28:20Z|20590|reconnect|WARN|ssl:192.168.58.108:9642: connection dropped (Connection reset by peer) 2022-02-17T19:28:20Z|20591|reconnect|DBG|ssl:192.168.58.108:9642: entering BACKOFF 2022-02-17T19:28:20Z|20592|ovsdb_cs|DBG|ssl:192.168.58.108:9642: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419 2022-02-17T19:28:20Z|20593|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby. 2022-02-17T19:28:20Z|17703|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:21Z|20594|poll_loop|DBG|wakeup due to 1000-ms timeout at lib/reconnect.c:664 (0% CPU usage) 2022-02-17T19:28:21Z|20595|reconnect|INFO|ssl:192.168.58.108:9641: connecting... 2022-02-17T19:28:21Z|20596|reconnect|DBG|ssl:192.168.58.108:9641: entering CONNECTING 2022-02-17T19:28:21Z|17704|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:21Z|20597|reconnect|INFO|ssl:192.168.58.108:9642: connecting... 2022-02-17T19:28:21Z|20598|reconnect|DBG|ssl:192.168.58.108:9642: entering CONNECTING 2022-02-17T19:28:21Z|20599|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:21Z|20600|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 handshake: client_hello (340 bytes) 2022-02-17T19:28:21Z|20601|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 type 256 (5 bytes) 2022-02-17T19:28:21Z|20602|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 handshake: client_hello (340 bytes) 2022-02-17T19:28:21Z|17705|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:21Z|20603|poll_loop|DBG|wakeup due to [POLLIN] on fd 13 (192.168.58.108:60464<->192.168.58.108:9641) at lib/stream-ssl.c:821 (0% CPU usage) 2022-02-17T19:28:21Z|17706|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage) 2022-02-17T19:28:21Z|20604|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:21Z|20605|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: server_hello (122 bytes) 2022-02-17T19:28:21Z|20606|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:21Z|20607|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:21Z|20608|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes) 2022-02-17T19:28:21Z|20609|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: <unknown> (6 bytes) 2022-02-17T19:28:21Z|20610|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes) 2022-02-17T19:28:21Z|20611|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes) 2022-02-17T19:28:21Z|20612|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: certificate_request (112 bytes)
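Based on the log above (northd resurrects itself as soon as it receives the "ovn-appctl status" unixctl request), a minimal workaround sketch for a stuck northd on the SNO node. The pod name is the one from this report and will differ per cluster; any unixctl request appears sufficient to wake the poll loop, and "status" is the least intrusive one.

oc rsh -c northd -n openshift-ovn-kubernetes ovnkube-master-p7v7g \
    ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl status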
My 3-node cluster recovered after the same command:

kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS              RESTARTS   AGE
nginx2   0/1     ContainerCreating   0          4h57m

kni@jumphost ~/rhacm $ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS        AGE
ovnkube-master-dqhtm   6/6     Running   2 (22h ago)     22h
ovnkube-master-dwrn4   6/6     Running   1 (18h ago)     22h
ovnkube-master-pmc2j   6/6     Running   3 (3h27m ago)   22h
ovnkube-node-8lxdj     4/4     Running   0               23h
ovnkube-node-q8t4d     4/4     Running   0               23h
ovnkube-node-zt549     4/4     Running   0               23h

kni@jumphost ~/rhacm $ oc exec -it ovnkube-master-dqhtm -n openshift-ovn-kubernetes -- ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg
Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker

kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
nginx2   1/1     Running   0          4h58m

kni@jumphost ~/rhacm $ oc describe pod nginx2
Name:         nginx2
Namespace:    default
Priority:     0
Node:         openshift-master-0/192.168.58.21
Start Time:   Thu, 17 Feb 2022 11:50:00 -0500
......
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                   Age                      From     Message
  ----     ------                   ----                     ----     -------
  Warning  FailedCreatePodSandBox   3m49s (x238 over 4h50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx2_default_97b7c90b-238e-41da-b54b-5bedb1cfc5b7_0(b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac): error adding pod default_nginx2 to CNI network "multus-cni-network": [default/nginx2/97b7c90b-238e-41da-b54b-5bedb1cfc5b7:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] [default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:3c [10.128.0.60/24] '
  Normal   AddedInterface           2m40s                    multus   Add eth0 [10.128.0.60/24] from ovn-kubernetes
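Once northd is awake, the FailedCreatePodSandBox error should stop recurring for the affected pod. A quick check sketch, assuming the same pod name and namespace as in the output above:

# Confirm the sandbox error has cleared and the interface was added.
oc get pod nginx2 -o wide
oc describe pod nginx2 | grep -E 'FailedCreatePodSandBox|AddedInterface' | tail -n 5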
I think I found the root cause, and I posted the fix for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220221141622.767208-1-i.maximets@ovn.org/

The problem is caused by the following OVS patch:
  4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long poll intervals.")
It effectively breaks the inactivity probe functionality when the application has no other sources of wake-up events, which is exactly the case for northd.
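One way to see whether a given ovn-northd build is affected is to watch for the periodic inactivity probes in its debug log while the daemon is otherwise idle. A sketch, assuming the standalone log path and control socket used elsewhere in this bug (on OCP the socket and log live inside the northd container); the "sending inactivity probe" messages are logged by the reconnect module at debug level:

# Raise reconnect logging to debug in the log file...
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set reconnect:file:dbg
# ...then watch for "sending inactivity probe" lines. On an affected build they
# stop appearing once northd has no other wake-up events; on a fixed build they
# keep showing up roughly once per probe interval.
tail -f /var/log/ovn/ovn-northd.log | grep 'inactivity probe'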
### Reproduced on:
[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
ovn2.13-central-20.12.0-193.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.13-2.13.0-162.el8fdp.x86_64
ovn2.13-20.12.0-193.el8fdp.x86_64
ovn2.13-host-20.12.0-193.el8fdp.x86_64

### Created a cluster and started ovn-northd
[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

### Check ovn-northd.log
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

### Look up the process IDs of ovnsb_db and ovnnb_db. The objective is to block communication between the ovsdb servers and ovn-northd in such a way that ovn-northd does not know about the blockage (SIGSTOP, signal 19).
### This lets us see whether the inactivity probes are enabled or disabled.
[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
67571
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
67550
[root@bz_2055865 ~]# kill -19 67571
[root@bz_2055865 ~]# kill -19 67550

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:31:34.457Z|00008|memory|INFO|3800 kB peak resident set size after 10.0 seconds

<============ Inactivity probes are disabled. ovn-northd did not detect the disconnection, so it did not attempt to reconnect.

### Verified on:
[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
ovn2.13-20.12.0-195.el8fdp.x86_64
ovn2.13-central-20.12.0-195.el8fdp.x86_64
ovn2.13-host-20.12.0-195.el8fdp.x86_64
openvswitch2.13-2.13.0-162.el8fdp.x86_64

### Cluster created
[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds

[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
1793479
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
1793458
[root@bz_2055865 ~]# kill -19 1793479
[root@bz_2055865 ~]# kill -19 1793458

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected

<============= Inactivity probes are enabled. ovn-northd is trying to reconnect to the DB servers.

### Resume the northbound DB and southbound DB processes (SIGCONT, signal 18)
[root@bz_2055865 ~]# kill -18 1793479
[root@bz_2055865 ~]# kill -18 1793458
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:36:27.164Z|00014|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:27.164Z|00015|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:36:28.165Z|00016|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:36:28.165Z|00017|reconnect|INFO|tcp:172.16.1.1:6641: connected
...........
2022-03-08T21:38:16.247Z|00066|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:38:16.247Z|00067|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:38:19.245Z|00068|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:38:24.251Z|00069|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:38:24.626Z|00070|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

<======== ovn-northd reconnected again with the northbound and southbound DB servers.
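A compact pass/fail check sketch for this verification, assuming the default /var/log/ovn/ovn-northd.log path used above: on a fixed build the probe timeouts must be logged while the ovsdb processes are SIGSTOPped.

# At least one match means the inactivity probes are active (fixed build).
grep -c 'no response to inactivity probe' /var/log/ovn/ovn-northd.log \
    && echo "inactivity probes active (expected on the fixed build)" \
    || echo "no probe timeouts logged (behavior of the broken build)"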
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1143