Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 2077078

Summary: stuck ovn-controller
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Jakub Libosvar <jlibosva>
Component: ovn2.13
Assignee: Mohammad Heib <mheib>
Status: CLOSED UPSTREAM
QA Contact: Ehsan Elahi <eelahi>
Severity: unspecified
Docs Contact:
Priority: high
Version: FDP 22.A
CC: awalsh, ctrautma, jiji, jishi, jmelvin, ljozsa, mheib, mmichels, ralongi
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2025-02-10 04:01:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jakub Libosvar 2022-04-20 16:05:54 UTC
Description of problem: 
The initial symptom was an unstable floating IP (FIP) with intermittent failures. Tracking this down in a big cloud was not easy, because the stuck ovn-controller was on a different compute node than the node hosting the VM with the FIP. Once we found where on the fabric the packets were being sent, we observed that this node had ARP responder flows installed. ovn-controller was at 100% CPU utilization and was not writing to its logs, and its connection to the southbound database was in the CLOSE_WAIT state.

Version-Release number of selected component (if applicable):
ovn2.13-20.12.0-192.el8fdp.x86_64

How reproducible:
Once in a very long time

Steps to Reproduce:
1. Unknown

Actual results:


Expected results:


Additional info:
The OVN DBs, local OVS DB, ovn-controller log file, and core dump will be linked rather than attached, because the core dump alone is 6 GB and won't fit in the BZ.

Comment 9 Jakub Libosvar 2022-07-19 17:26:38 UTC
(In reply to Mohammad Heib from comment #8)
> Hi Jakub,
> 
> After a deep investigation of the issue and the core dump, I saw that the
> issue happens for the following reasons:
> 
> 1. ovn-controller tries to send RAs to the router ports that have the
> option ipv6_ra_send_periodic set to true, in the function prepare_ipv6_ras.
> 2. In your case those ports are stored in struct local_datapath as
> peers (newer versions of OVN handle them in a different way).
> 3. For an unknown reason, some of those ports point to an invalid memory
> address, and when we execute this code:
>             if (!smap_get_bool(&pb->options, "ipv6_ra_send_periodic",
>                                false)) {
>                 continue;
>             }
> 
> inside the function prepare_ipv6_ras, it causes an infinite loop in the
> controller.
> 
> Unfortunately,
> I wasn't able to reproduce the issue in my setup. I also tried to check
> whether we are facing any memory issues using valgrind, but couldn't find
> anything useful that helps narrow down the issue.
> 
> I don't know what to do from here. If you keep seeing the issue in your env,
> maybe I can jump in and debug it there; otherwise we can close the
> issue and reopen it if you see it again.
> 
> 
> thanks

Thanks, Mohammad, for the detailed look. This was observed on a production cluster of about 200 nodes, and we experienced it on only one node, once in more than a year. It's unknown what caused the issue, and it hasn't reproduced since. I have no idea how pointers to an invalid address could arise, nor how to even try to reproduce it. I think we can close this as INSUFFICIENT_DATA if we don't know what happened.

Is there anything I could have done better in terms of collecting data for troubleshooting, in case it happens again? What data would help you get further with troubleshooting this issue?

Comment 10 Mohammad Heib 2022-07-20 13:40:28 UTC
Hi Jakub,


Please ignore my last comment. We finally found a way to reproduce the issue, and apparently it exists in OVN upstream as well.
I will add a fix and let you know when it is available.

Comment 11 Mohammad Heib 2022-07-20 13:42:55 UTC
self_note:

how to reproduce upstream:

1. add the following code to controller/ovn-controller.c:
                                         time_msec());
+                        const struct local_datapath *ld;
+                        HMAP_FOR_EACH (ld, hmap_node, &runtime_data->local_datapaths) {
+
+                            for (size_t i = 0; i < ld->n_peer_ports; i++) {
+                                const struct sbrec_port_binding *peer = ld->peer_ports[i].remote;
+                                const struct sbrec_port_binding *mypb = ld->peer_ports[i].local;
+                                VLOG_INFO("PEER NAME = %s\n", peer->logical_port);
+                                VLOG_INFO("pb NAME = %s\n", mypb->logical_port);
+                            }
+                        }

2. execute the following commands:
ovn-nbctl ls-add sw0
ovn-nbctl lr-add ro0
ovn-nbctl lsp-add sw0 tmp
ovn-nbctl lsp-add sw0 lsp
ovn-nbctl lsp-set-type lsp router
ovn-nbctl lsp-set-options lsp router-port=lrp
ovn-nbctl lsp-set-addresses lsp  00:00:00:00:00:1
ovn-nbctl lrp-add ro0 lrp 00:00:00:00:00:1 aef0:0:0:0:0:0:0:1/64
ovs-vsctl add-port br-int tmp -- set interface tmp type=internal -- set interface tmp external_ids:iface-id=tmp
ovn-nbctl set Logical_Router_Port lrp ipv6_ra_configs:send_periodic=true -- set Logical_Router_Port lrp ipv6_ra_configs:address_mode=slaac -- set Logical_Router_Port lrp ipv6_ra_configs:mtu=1280 -- set Logical_Router_Port lrp ipv6_ra_configs:max_interval=2 -- set Logical_Router_Port lrp ipv6_ra_configs:min_interval=1
ovn-nbctl lsp-set-type lsp localnet
ovn-nbctl lsp-del lsp
ovn-nbctl --wait=hv sync 



3. check ovn-controller.log for signs of a pointer to an invalid memory address.
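
For illustration only, here is a minimal, self-contained C sketch of one defensive pattern for this class of bug. It assumes (the thread does not confirm this) that the invalid pointers are cached references to rows that were deleted later. All names in it (port_row, peer_ref, row_add, row_find, row_del, count_ra_peers) are hypothetical stand-ins, not OVN's structures or API, and this is not the upstream fix. The idea is to cache port names rather than raw row pointers and to look each one up again before use, so a deleted row is skipped instead of dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; these are NOT OVN's
 * sbrec_port_binding or local_datapath structures. */
#define MAX_ROWS 8
#define NAME_LEN 16

struct port_row {
    char name[NAME_LEN];
    bool in_use;            /* Cleared when the row is deleted. */
    bool ra_send_periodic;  /* Models the ipv6_ra_send_periodic option. */
};

/* Cache entry that stores port names instead of raw row pointers. */
struct peer_ref {
    char local[NAME_LEN];
    char remote[NAME_LEN];
};

static struct port_row table[MAX_ROWS];

/* Adds a row; returns false if the table is full. */
bool
row_add(const char *name, bool ra_send_periodic)
{
    for (size_t i = 0; i < MAX_ROWS; i++) {
        if (!table[i].in_use) {
            snprintf(table[i].name, NAME_LEN, "%s", name);
            table[i].ra_send_periodic = ra_send_periodic;
            table[i].in_use = true;
            return true;
        }
    }
    return false;
}

/* Looks up a live row by name; returns NULL if no such row exists. */
const struct port_row *
row_find(const char *name)
{
    for (size_t i = 0; i < MAX_ROWS; i++) {
        if (table[i].in_use && !strcmp(table[i].name, name)) {
            return &table[i];
        }
    }
    return NULL;
}

/* Deletes a row by name; a no-op if the row does not exist. */
void
row_del(const char *name)
{
    for (size_t i = 0; i < MAX_ROWS; i++) {
        if (table[i].in_use && !strcmp(table[i].name, name)) {
            table[i].in_use = false;
        }
    }
}

/* Counts peers that still exist and want periodic RAs.  Every reference is
 * re-validated on each pass, so a deleted row is skipped, never
 * dereferenced. */
size_t
count_ra_peers(const struct peer_ref *peers, size_t n_peers)
{
    size_t n = 0;

    for (size_t i = 0; i < n_peers; i++) {
        const struct port_row *pb = row_find(peers[i].remote);

        if (!pb || !pb->ra_send_periodic) {
            continue;
        }
        n++;
    }
    return n;
}
```

In this sketch, a caller would record the names "lsp" and "lrp" in a peer_ref, add both rows, and call count_ra_peers() on every iteration. After row_del("lrp"), the same cached peer_ref is simply skipped. The trade-off is an extra lookup per use in exchange for never holding a pointer that can outlive its row.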

Comment 13 OVN Bot 2022-08-20 04:04:16 UTC
ovn22.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119940
ovn22.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119941
ovn22.06 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119942
ovn22.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119943

Comment 14 Jianlin Shi 2023-07-10 08:08:35 UTC
Hi Mohammad,

Can we reproduce the issue without modifying the OVN code?

Comment 15 Mohammad Heib 2023-07-10 08:18:56 UTC
Hi,
AFAIR, we couldn't reproduce it directly, since it's an invalid memory access issue,
so in order to see it we had to change the OVN code and confirm that we are really hitting this invalid memory area.

Unfortunately, we couldn't find a way to reproduce it without patching the code.

Comment 16 Red Hat Bugzilla 2025-02-10 04:01:10 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.