Bug 2077078
| Summary: | stuck ovn-controller | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Jakub Libosvar <jlibosva> |
| Component: | ovn2.13 | Assignee: | Mohammad Heib <mheib> |
| Status: | CLOSED UPSTREAM | QA Contact: | Ehsan Elahi <eelahi> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | ||
| Version: | FDP 22.A | CC: | awalsh, ctrautma, jiji, jishi, jmelvin, ljozsa, mheib, mmichels, ralongi |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2025-02-10 04:01:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Jakub Libosvar
2022-04-20 16:05:54 UTC
(In reply to Mohammad Heib from comment #8)
> Hi Jakub,
>
> after a deep investigation of the issue and the core dump, I saw that the
> issue happens for the following reasons:
>
> 1. ovn-controller tries to send RAs to the router ports that have the
>    option ipv6_ra_send_periodic set to true, in function prepare_ipv6_ras.
> 2. those ports in your case are stored in the struct local_datapath as
>    peers (in newer versions of OVN we handle them in a different way).
> 3. for an unknown reason some of those ports point to an invalid memory
>    address, and when we execute this line of code:
>        if (!smap_get_bool(&pb->options, "ipv6_ra_send_periodic", false)) {
>            continue;
>        }
>    inside function prepare_ipv6_ras, it causes an infinite loop in the
>    controller.
>
> Unfortunately, I wasn't able to reproduce the issue in my setup. I also
> tried to see if we are facing any memory issues using valgrind, but couldn't
> find anything useful that can help to reduce the issue.
>
> I don't know what to do from here. If you keep seeing the issue in your env,
> maybe I can jump in and try debugging in your env, or we can just close the
> issue and reopen it if you see it again.
>
> thanks

Thanks Mohammad for the detailed look. This was observed on a production cluster of about 200 nodes, and we experienced it only once, on one node, in more than a year. It's unknown what caused this issue, and it hasn't reproduced since. I have no idea how it's possible to have pointers to invalid addresses, nor how to even try to reproduce it. I think we can close this as INSUFFICIENT_DATA if we don't know what happened.

Is there anything I could have done better in terms of collecting data for troubleshooting in case it happens again? What data would help you proceed further in troubleshooting this issue?
Hi Jakub, please ignore my last comment. We finally found a way to reproduce the issue, and apparently it exists in OVN upstream as well. I will add a fix and let you know when it is available.

self_note:
how to reproduce upstream:
1. add the following code to controller/ovn-controller.c:
                             time_msec());
+    const struct local_datapath *ld;
+    HMAP_FOR_EACH (ld, hmap_node, &runtime_data->local_datapaths) {
+
+        for (size_t i = 0; i < ld->n_peer_ports; i++) {
+            const struct sbrec_port_binding *peer = ld->peer_ports[i].remote;
+            const struct sbrec_port_binding *mypb = ld->peer_ports[i].local;
+            VLOG_INFO("PEER NAME = %s\n", peer->logical_port);
+            VLOG_INFO("pb NAME = %s\n", mypb->logical_port);
+        }
+    }
2. execute the following commands:
ovn-nbctl ls-add sw0
ovn-nbctl lr-add ro0
ovn-nbctl lsp-add sw0 tmp
ovn-nbctl lsp-add sw0 lsp
ovn-nbctl lsp-set-type lsp router
ovn-nbctl lsp-set-options lsp router-port=lrp
ovn-nbctl lsp-set-addresses lsp 00:00:00:00:00:1
ovn-nbctl lrp-add ro0 lrp 00:00:00:00:00:1 aef0:0:0:0:0:0:0:1/64
ovs-vsctl add-port br-int tmp -- set interface tmp type=internal -- set interface tmp external_ids:iface-id=tmp
ovn-nbctl set Logical_Router_Port lrp ipv6_ra_configs:send_periodic=true -- set Logical_Router_Port lrp ipv6_ra_configs:address_mode=slaac -- set Logical_Router_Port lrp ipv6_ra_configs:mtu=1280 -- set Logical_Router_Port lrp ipv6_ra_configs:max_interval=2 -- set Logical_Router_Port lrp ipv6_ra_configs:min_interval=1
ovn-nbctl lsp-set-type lsp localnet
ovn-nbctl lsp-del lsp
ovn-nbctl --wait=hv sync
3. check ovn-controller.log; the added log lines show that some of the peer port pointers refer to invalid memory.
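
The reproduction steps above end by deleting a port that had earlier been recorded as a peer. The following is a minimal, self-contained sketch of that general hazard, not OVN code and not the upstream patch: a cache of local/remote pointers becomes dangling when a record is freed, unless the cache entries are dropped first. Every `toy_` name here is hypothetical and exists only for illustration; only the `local`/`remote`/`n_peer_ports` field names mirror the debug loop in step 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Port_Binding row; not OVN's real struct. */
struct toy_port_binding {
    char *logical_port;
    bool ra_send_periodic;
};

/* One cached local/remote pair, as dereferenced by the debug loop. */
struct toy_peer_ports {
    const struct toy_port_binding *local;
    const struct toy_port_binding *remote;
};

struct toy_local_datapath {
    struct toy_peer_ports *peer_ports;
    size_t n_peer_ports;
};

static struct toy_port_binding *
toy_pb_create(const char *name, bool ra_send_periodic)
{
    struct toy_port_binding *pb = malloc(sizeof *pb);
    if (!pb) {
        return NULL;
    }
    size_t len = strlen(name) + 1;
    pb->logical_port = malloc(len);
    if (!pb->logical_port) {
        free(pb);
        return NULL;
    }
    memcpy(pb->logical_port, name, len);
    pb->ra_send_periodic = ra_send_periodic;
    return pb;
}

static void
toy_pb_destroy(struct toy_port_binding *pb)
{
    if (pb) {
        free(pb->logical_port);
        free(pb);
    }
}

/* Cache a local/remote pair; returns false if memory allocation fails. */
static bool
toy_ld_add_peer(struct toy_local_datapath *ld,
                const struct toy_port_binding *local,
                const struct toy_port_binding *remote)
{
    struct toy_peer_ports *grown =
        realloc(ld->peer_ports, (ld->n_peer_ports + 1) * sizeof *grown);
    if (!grown) {
        return false;
    }
    grown[ld->n_peer_ports].local = local;
    grown[ld->n_peer_ports].remote = remote;
    ld->peer_ports = grown;
    ld->n_peer_ports++;
    return true;
}

/* Drop every cached pair that references 'pb'.  Must be called BEFORE
 * 'pb' is freed, otherwise the cache keeps a dangling pointer.
 * Returns the number of pairs removed. */
static size_t
toy_ld_forget_port(struct toy_local_datapath *ld,
                   const struct toy_port_binding *pb)
{
    assert(ld != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < ld->n_peer_ports; i++) {
        if (ld->peer_ports[i].local == pb || ld->peer_ports[i].remote == pb) {
            continue;
        }
        ld->peer_ports[kept++] = ld->peer_ports[i];
    }
    size_t removed = ld->n_peer_ports - kept;
    ld->n_peer_ports = kept;
    return removed;
}

/* Count cached remote peers with periodic RAs enabled.  This dereference
 * is safe only while every cached pointer refers to a live record. */
static size_t
toy_count_ra_peers(const struct toy_local_datapath *ld)
{
    size_t n = 0;
    for (size_t i = 0; i < ld->n_peer_ports; i++) {
        if (ld->peer_ports[i].remote->ra_send_periodic) {
            n++;
        }
    }
    return n;
}
```

In this toy model, calling `toy_pb_destroy()` on a port without first calling `toy_ld_forget_port()` would leave `toy_count_ra_peers()` reading freed memory, which is the kind of access the added `VLOG_INFO` lines are meant to expose. Whether the real bug follows exactly this path is an assumption.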
patch posted upstream: https://patchwork.ozlabs.org/project/ovn/patch/20220803143455.2867615-1-mheib@redhat.com/

ovn22.03 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119940
ovn22.03 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119941
ovn22.06 fast-datapath-rhel-8 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119942
ovn22.06 fast-datapath-rhel-9 clone created at https://bugzilla.redhat.com/show_bug.cgi?id=2119943

Hi Mohammad, can we reproduce the issue without modifying the code for ovn?

Hi, AFAIR, we couldn't reproduce it since it's an invalid memory access issue, so in order to see it we had to change the OVN code and see that we are really hitting this invalid memory area. Unfortunately, we couldn't find a way to reproduce it without patching the code.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.