Bug 2055865
| Field | Value |
|---|---|
| Summary | northd doesn't wake up for poll intervals when DB IP cannot be reached |
| Product | Red Hat Enterprise Linux Fast Datapath |
| Component | OVN |
| Version | FDP 21.H |
| Target Release | FDP 22.B |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | unspecified |
| Keywords | CustomerScenariosInitiative |
| Reporter | bzhai |
| Assignee | Ilya Maximets <i.maximets> |
| QA Contact | Ehsan Elahi <eelahi> |
| CC | achernet, ctrautma, dcbw, i.maximets, jiji, mmichels |
| Fixed In Version | ovn2.13-20.12.0-195 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2022-03-30 16:28:12 UTC |
Description
bzhai
2022-02-17 20:18:23 UTC
First we thought the northd probe interval was set to 0, but it was 5000:

sh-4.4# ovn-nbctl get NB_GLOBAL . options:northd_probe_interval
"5000"

nbctl daemon logs (from ovnkube-master pod):

2022-02-17T18:22:10Z|00002|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:11Z|00005|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:13Z|00009|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:22:17Z|00013|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
<snip>
2022-02-17T18:28:17.387Z|00062|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:25.395Z|00063|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:33.404Z|00064|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:41.412Z|00065|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:49.420Z|00066|stream_ssl|ERR|ssl:192.168.58.108:9641: connect: Network is unreachable
2022-02-17T18:28:57.431Z|00067|reconnect|INFO|ssl:192.168.58.108:9641: connected
<i.e., connectivity is restored at 18:28:57 at the latest>

ovn-northd logs:

2022-02-17T18:20:05Z|17665|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20529|reconnect|DBG|ssl:192.168.58.108:9642: idle 29997 ms, sending inactivity probe
2022-02-17T18:20:05Z|20530|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE
2022-02-17T18:20:05Z|20531|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo"
2022-02-17T18:20:05Z|20532|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20533|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|17666|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17667|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20534|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|17668|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20535|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20536|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20537|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[], id="echo"
2022-02-17T18:20:05Z|20538|reconnect|DBG|ssl:192.168.58.108:9642: entering ACTIVE
2022-02-17T18:20:05Z|17669|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20539|poll_loop|DBG|wakeup due to 0-ms timeout at lib/stream-ssl.c:834 (0% CPU usage)
2022-02-17T18:20:05Z|17670|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17671|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20540|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|20541|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20542|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|17672|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20543|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update2", params=[["monid","_Server"],{"Database":{"91a38ed7-d7ac-45c2-8123-659db7650ff6":{"modify":{"index":48732}}}}]
2022-02-17T18:20:05Z|20544|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20545|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20546|jsonrpc|DBG|ssl:192.168.58.108:9642: received notification, method="update3", params=[["monid","OVN_Southbound"],"0b74fdca-f575-42d3-ba2a-533e98ab2df2",{"SB_Global":{"72de3344-ac6f-4f2e-b26f-0b2087b05614":{"modify":{"options":["map",[["e2e_timestamp","1645122005"]]]}}}}]
2022-02-17T18:20:05Z|20547|poll_loop|DBG|wakeup due to [POLLIN] on fd 14 (192.168.58.108:59440<->192.168.58.108:9642) at lib/stream-ssl.c:832 (0% CPU usage)
2022-02-17T18:20:05Z|17673|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|20548|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T18:20:05Z|20549|stream_ssl|DBG|client6<--ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T18:20:05Z|20550|jsonrpc|DBG|ssl:192.168.58.108:9642: received reply, result=[{"count":1},{"count":1},{},{}], id=4200
2022-02-17T18:20:05Z|17674|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17675|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17676|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17677|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17678|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17679|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17680|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17681|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17682|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T18:20:05Z|17683|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)

< connectivity is interrupted by allowing the DHCP lease to expire, and the IP to be removed from the node's interface >
< northd goes to sleep and doesn't wake up for a long time >
< then allow DHCP to re-acquire the lease for the same IP and assign it back to the interface around 18:28:57, but northd doesn't wake up for an hour >
2022-02-17T19:28:20Z|20551|poll_loop|DBG|wakeup due to [POLLIN] on fd 6 (/var/run/ovn/ovn-northd.1.ctl<->) at lib/stream-fd.c:274 (0% CPU usage)

< attempt to wake northd up by running "ovn-appctl status", it resurrects itself >

2022-02-17T19:28:20Z|17684|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20552|hmap|DBG|Dropped 9 log messages in last 4095 seconds (most recently, 4095 seconds ago) due to excessive rate
2022-02-17T19:28:20Z|20553|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|20554|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|20555|hmap|DBG|lib/ovn-util.c:613: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|17685|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17686|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17687|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17688|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20556|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|20557|hmap|DBG|lib/hmapx.c:41: 1 bucket with 6+ nodes, including 1 bucket with 7 nodes (128 nodes total across 128 buckets)
2022-02-17T19:28:20Z|20558|hmap|DBG|lib/hmapx.c:41: 3 buckets with 6+ nodes, including 1 bucket with 8 nodes (256 nodes total across 256 buckets)
2022-02-17T19:28:20Z|20559|hmap|DBG|lib/hmapx.c:41: 4 buckets with 6+ nodes, including 3 buckets with 7 nodes (512 nodes total across 512 buckets)
2022-02-17T19:28:20Z|20560|hmap|DBG|lib/hmapx.c:41: 10 buckets with 6+ nodes, including 3 buckets with 7 nodes (1024 nodes total across 1024 buckets)
2022-02-17T19:28:20Z|17689|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17690|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20561|jsonrpc|DBG|unix#1: received request, method="status", params=[], id=0
2022-02-17T19:28:20Z|20562|unixctl|DBG|received request status[], id=0
2022-02-17T19:28:20Z|20563|unixctl|DBG|replying with success, id=0: "Status: active "
2022-02-17T19:28:20Z|20564|jsonrpc|DBG|unix#1: send reply, result="Status: active\n", id=0
2022-02-17T19:28:20Z|20565|poll_loop|DBG|wakeup due to 0-ms timeout at lib/reconnect.c:664 (0% CPU usage)
2022-02-17T19:28:20Z|17691|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20566|reconnect|DBG|ssl:192.168.58.108:9641: idle 4094953 ms, sending inactivity probe
2022-02-17T19:28:20Z|20567|reconnect|DBG|ssl:192.168.58.108:9641: entering IDLE
2022-02-17T19:28:20Z|20568|jsonrpc|DBG|ssl:192.168.58.108:9641: send request, method="echo", params=[], id="echo"
2022-02-17T19:28:20Z|20569|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:20Z|20570|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:20Z|20571|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2022-02-17T19:28:20Z|20572|jsonrpc|WARN|ssl:192.168.58.108:9641: receive error: Connection reset by peer
2022-02-17T19:28:20Z|20573|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:20Z|20574|stream_ssl|DBG|client5-->ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:20Z|20575|reconnect|DBG|ssl:192.168.58.108:9642: idle 4094944 ms, sending inactivity probe
2022-02-17T19:28:20Z|20576|reconnect|DBG|ssl:192.168.58.108:9642: entering IDLE
2022-02-17T19:28:20Z|20577|jsonrpc|DBG|ssl:192.168.58.108:9642: send request, method="echo", params=[], id="echo"
2022-02-17T19:28:20Z|20578|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:20Z|20579|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T19:28:20Z|20580|stream_ssl|WARN|SSL_read: system error (Connection reset by peer)
2022-02-17T19:28:20Z|20581|jsonrpc|WARN|ssl:192.168.58.108:9642: receive error: Connection reset by peer
2022-02-17T19:28:20Z|20582|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:20Z|20583|stream_ssl|DBG|client6-->ssl:192.168.58.108:9642 type 257 (1 bytes)
2022-02-17T19:28:20Z|17692|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20584|hmap|DBG|northd/northd.c:1571: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (32 nodes total across 32 buckets)
2022-02-17T19:28:20Z|20585|hmap|DBG|northd/northd.c:1571: 2 buckets with 6+ nodes, including 1 bucket with 7 nodes (64 nodes total across 64 buckets)
2022-02-17T19:28:20Z|17693|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17694|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17695|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17696|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17697|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17698|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17699|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17700|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20586|poll_loop|DBG|wakeup due to 0-ms timeout at ssl:192.168.58.108:9641 (0% CPU usage)
2022-02-17T19:28:20Z|17701|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|17702|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:20Z|20587|reconnect|WARN|ssl:192.168.58.108:9641: connection dropped (Connection reset by peer)
2022-02-17T19:28:20Z|20588|reconnect|DBG|ssl:192.168.58.108:9641: entering BACKOFF
2022-02-17T19:28:20Z|20589|ovsdb_cs|DBG|ssl:192.168.58.108:9641: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-02-17T19:28:20Z|20590|reconnect|WARN|ssl:192.168.58.108:9642: connection dropped (Connection reset by peer)
2022-02-17T19:28:20Z|20591|reconnect|DBG|ssl:192.168.58.108:9642: entering BACKOFF
2022-02-17T19:28:20Z|20592|ovsdb_cs|DBG|ssl:192.168.58.108:9642: MONITORING -> SERVER_SCHEMA_REQUESTED at lib/ovsdb-cs.c:419
2022-02-17T19:28:20Z|20593|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-02-17T19:28:20Z|17703|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20594|poll_loop|DBG|wakeup due to 1000-ms timeout at lib/reconnect.c:664 (0% CPU usage)
2022-02-17T19:28:21Z|20595|reconnect|INFO|ssl:192.168.58.108:9641: connecting...
2022-02-17T19:28:21Z|20596|reconnect|DBG|ssl:192.168.58.108:9641: entering CONNECTING
2022-02-17T19:28:21Z|17704|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20597|reconnect|INFO|ssl:192.168.58.108:9642: connecting...
2022-02-17T19:28:21Z|20598|reconnect|DBG|ssl:192.168.58.108:9642: entering CONNECTING
2022-02-17T19:28:21Z|20599|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20600|stream_ssl|DBG|client7-->ssl:192.168.58.108:9641 handshake: client_hello (340 bytes)
2022-02-17T19:28:21Z|20601|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 type 256 (5 bytes)
2022-02-17T19:28:21Z|20602|stream_ssl|DBG|client8-->ssl:192.168.58.108:9642 handshake: client_hello (340 bytes)
2022-02-17T19:28:21Z|17705|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20603|poll_loop|DBG|wakeup due to [POLLIN] on fd 13 (192.168.58.108:60464<->192.168.58.108:9641) at lib/stream-ssl.c:821 (0% CPU usage)
2022-02-17T19:28:21Z|17706|poll_loop(stopwatch0)|DBG|wakeup due to [POLLIN] on fd 7 (FIFO pipe:[237268213]) at lib/stopwatch.c:458 (0% CPU usage)
2022-02-17T19:28:21Z|20604|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20605|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: server_hello (122 bytes)
2022-02-17T19:28:21Z|20606|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20607|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20608|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:21Z|20609|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: <unknown> (6 bytes)
2022-02-17T19:28:21Z|20610|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 256 (5 bytes)
2022-02-17T19:28:21Z|20611|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 type 257 (1 bytes)
2022-02-17T19:28:21Z|20612|stream_ssl|DBG|client7<--ssl:192.168.58.108:9641 handshake: certificate_request (112 bytes)
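The checks used above can be condensed into a short diagnostic sequence. This is only a sketch: it assumes the /var/run/ovn/ovn-northd.1.ctl control socket seen in these logs and a shell with access to the NB database (for example the northd container of an ovnkube-master pod).

# Confirm the configured probe interval (5000 ms in this case, i.e. not disabled).
ovn-nbctl get NB_Global . options:northd_probe_interval

# Any unixctl request wakes the blocked main loop; this is the "ovn-appctl status"
# poke noted in the log annotations above, after which northd resurrects itself.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl status

# Optionally raise the log level to capture the reconnect activity that follows.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg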
My 3-node cluster recovered with the same command:

kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS              RESTARTS   AGE
nginx2   0/1     ContainerCreating   0          4h57m
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $ oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS        AGE
ovnkube-master-dqhtm   6/6     Running   2 (22h ago)     22h
ovnkube-master-dwrn4   6/6     Running   1 (18h ago)     22h
ovnkube-master-pmc2j   6/6     Running   3 (3h27m ago)   22h
ovnkube-node-8lxdj     4/4     Running   0               23h
ovnkube-node-q8t4d     4/4     Running   0               23h
ovnkube-node-zt549     4/4     Running   0               23h
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $
kni@jumphost ~/rhacm $ oc exec -it ovnkube-master-dqhtm -n openshift-ovn-kubernetes -- ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set dbg
Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
kni@jumphost ~/rhacm $ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
nginx2   1/1     Running   0          4h58m
kni@jumphost ~/rhacm $ oc describe pod nginx2
Name:         nginx2
Namespace:    default
Priority:     0
Node:         openshift-master-0/192.168.58.21
Start Time:   Thu, 17 Feb 2022 11:50:00 -0500
......
              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                      From     Message
  ----     ------                  ----                     ----     -------
  Warning  FailedCreatePodSandBox  3m49s (x238 over 4h50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx2_default_97b7c90b-238e-41da-b54b-5bedb1cfc5b7_0(b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac): error adding pod default_nginx2 to CNI network "multus-cni-network": [default/nginx2/97b7c90b-238e-41da-b54b-5bedb1cfc5b7:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] [default/nginx2 b5b325344f7949c3e330896e96a529658b4695a3883c2ac6ec85b4cec82b41ac] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:3c [10.128.0.60/24] '
  Normal   AddedInterface          2m40s                    multus   Add eth0 [10.128.0.60/24] from ovn-kubernetes

I think I found the root cause and I posted the fix for review:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220221141622.767208-1-i.maximets@ovn.org/

The problem is caused by the following OVS patch: 4241d652e465 ("jsonrpc: Avoid disconnecting prematurely due to long poll intervals."). It effectively breaks the inactivity probe functionality when the application doesn't have any other sources of wake-up events (like northd).
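One way to observe whether the inactivity-probe machinery is still doing its job is to watch for probe messages from the reconnect module. This is a sketch only, reusing the vlog/set command and the /var/log/ovn/ovn-northd.log path that appear elsewhere in this report:

# Raise only the reconnect module to debug level in the log file, to keep the noise down.
ovn-appctl -t /var/run/ovn/ovn-northd.1.ctl vlog/set reconnect:file:dbg

# A healthy connection logs "sending inactivity probe" roughly once per probe interval;
# with the regression described above, these messages stop as soon as there are no other
# wake-up events, even though the configured interval is unchanged.
grep 'sending inactivity probe' /var/log/ovn/ovn-northd.log | tail -n 5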
### Reproduced on:

[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
ovn2.13-central-20.12.0-193.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
openvswitch2.13-2.13.0-162.el8fdp.x86_64
ovn2.13-20.12.0-193.el8fdp.x86_64
ovn2.13-host-20.12.0-193.el8fdp.x86_64

### Created a cluster and started ovn-northd

[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:31:24Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:31:24Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

### Check ovn-northd.log

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

### Look up the process IDs of ovnsb_db and ovnnb_db. The objective is to block the communication between ovnsb_db and ovn-northd in such a way that ovn-northd does not know about this blockage.
### This way we can see whether the inactivity probes are disabled or enabled.

[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
67571
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
67550
[root@bz_2055865 ~]# kill -19 67571
[root@bz_2055865 ~]# kill -19 67550
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:30:12.435Z|00004|daemon_unix(monitor)|INFO|pid 67309 died, exit status 0, exiting
2022-03-08T21:31:24.448Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:31:24.449Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:31:24.449Z|00003|reconnect|INFO|tcp:172.16.1.2:6641: connecting...
2022-03-08T21:31:24.449Z|00004|reconnect|INFO|tcp:172.16.1.2:6642: connecting...
2022-03-08T21:31:24.449Z|00005|reconnect|INFO|tcp:172.16.1.2:6641: connected
2022-03-08T21:31:24.449Z|00006|reconnect|INFO|tcp:172.16.1.2:6642: connected
2022-03-08T21:31:24.449Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:31:34.457Z|00008|memory|INFO|3800 kB peak resident set size after 10.0 seconds

<============ Inactivity probes are disabled. ovn-northd did not detect the disconnection, so it did not attempt to reconnect.
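The reproduction boils down to freezing the database servers with SIGSTOP so their TCP connections stay open but go silent, then watching whether ovn-northd ever notices. A condensed sketch of the same check, assuming the default pid and log file locations used above:

# Freeze both database servers without closing their sockets (kill -19 above is SIGSTOP).
kill -STOP "$(cat /var/run/ovn/ovnsb_db.pid)" "$(cat /var/run/ovn/ovnnb_db.pid)"

# Broken build (-193): the log stays silent and northd never reconnects.
# Fixed build (-195): "no response to inactivity probe after 5 seconds, disconnecting"
# shows up within roughly the probe interval.
tail -f /var/log/ovn/ovn-northd.log

# Unfreeze when done (kill -18 below is SIGCONT).
kill -CONT "$(cat /var/run/ovn/ovnsb_db.pid)" "$(cat /var/run/ovn/ovnnb_db.pid)"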
### Verified on:

[root@bz_2055865 ~]# rpm -qa |grep -E 'ovn|openvswitch'
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
ovn2.13-20.12.0-195.el8fdp.x86_64
ovn2.13-central-20.12.0-195.el8fdp.x86_64
ovn2.13-host-20.12.0-195.el8fdp.x86_64
openvswitch2.13-2.13.0-162.el8fdp.x86_64

### Cluster created

[root@bz_2055865 ~]# /usr/share/ovn/scripts/ovn-ctl --db-nb-addr=172.16.1.1 --db-nb-create-insecure-remote=yes --db-sb-addr=172.16.1.1 --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr=172.16.1.1 --db-nb-cluster-local-addr=172.16.1.1 --ovn-northd-nb-db=tcp:172.16.1.1:6641 --ovn-northd-sb-db=tcp:172.16.1.1:6642 start_northd
Creating cluster database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
Waiting for OVN_Northbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
[ OK ]
Creating cluster database /etc/ovn/ovnsb_db.db [ OK ]
Starting ovsdb-sb [ OK ]
Waiting for OVN_Southbound to come up
2022-03-08T21:35:22Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2022-03-08T21:35:22Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
[ OK ]
Starting ovn-northd [ OK ]

[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds

[root@bz_2055865 ~]# cat /var/run/ovn/ovnsb_db.pid
1793479
[root@bz_2055865 ~]# cat /var/run/ovn/ovnnb_db.pid
1793458
[root@bz_2055865 ~]# kill -19 1793479
[root@bz_2055865 ~]# kill -19 1793458
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected

<============= Inactivity probes are enabled. ovn-northd is trying to reconnect to the DB servers.
### Resume the northbound DB and southbound DB processes

[root@bz_2055865 ~]# kill -18 1793479
[root@bz_2055865 ~]# kill -18 1793458
[root@bz_2055865 ~]# cat /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00003|vlog(monitor)|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:34:47.965Z|00004|daemon_unix(monitor)|INFO|pid 1793028 died, exit status 0, exiting
2022-03-08T21:35:22.139Z|00001|vlog|INFO|opened log file /var/log/ovn/ovn-northd.log
2022-03-08T21:35:22.140Z|00002|ovn_northd|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
2022-03-08T21:35:22.140Z|00003|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:35:22.140Z|00004|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:35:22.140Z|00005|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:35:22.140Z|00006|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:35:22.140Z|00007|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-03-08T21:35:32.145Z|00008|memory|INFO|3936 kB peak resident set size after 10.0 seconds
2022-03-08T21:36:22.160Z|00009|reconnect|ERR|tcp:172.16.1.1:6642: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:22.160Z|00010|reconnect|INFO|tcp:172.16.1.1:6642: connection dropped
2022-03-08T21:36:22.160Z|00011|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2022-03-08T21:36:23.161Z|00012|reconnect|INFO|tcp:172.16.1.1:6642: connecting...
2022-03-08T21:36:23.161Z|00013|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:36:27.164Z|00014|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:36:27.164Z|00015|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:36:28.165Z|00016|reconnect|INFO|tcp:172.16.1.1:6641: connecting...
2022-03-08T21:36:28.165Z|00017|reconnect|INFO|tcp:172.16.1.1:6641: connected
...........
2022-03-08T21:38:16.247Z|00066|reconnect|ERR|tcp:172.16.1.1:6641: no response to inactivity probe after 5 seconds, disconnecting
2022-03-08T21:38:16.247Z|00067|reconnect|INFO|tcp:172.16.1.1:6641: connection dropped
2022-03-08T21:38:19.245Z|00068|reconnect|INFO|tcp:172.16.1.1:6642: connected
2022-03-08T21:38:24.251Z|00069|reconnect|INFO|tcp:172.16.1.1:6641: connected
2022-03-08T21:38:24.626Z|00070|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.

<======== ovn-northd reconnected again with the northbound and southbound DB servers.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1143
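For anyone checking an existing deployment against this erratum, a quick sketch (using the FDP package names from the rpm output above) to confirm the installed build is at or past the Fixed In Version, ovn2.13-20.12.0-195:

# The fix ships in ovn2.13-20.12.0-195 and later; compare the installed NVRs against it.
rpm -q ovn2.13 ovn2.13-central ovn2.13-host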