Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under the FDP project in Jira. Thanks.

Bug 2052945

Summary: [ovn-controller] ovn-controller falls into a dead loop
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: yili <33186108>
Component: ovn22.03
Assignee: Mohammad Heib <mheib>
Status: CLOSED ERRATA
QA Contact: ying xu <yinxu>
Severity: urgent
Docs Contact:
Priority: medium
Version: RHEL 8.0
CC: ctrautma, dceara, jiji, mheib, mmichels, yinxu
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ovn22.03-22.03.0-15
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2079055 2079059 (view as bug list)
Environment:
Last Closed: 2022-05-27 18:14:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2079055, 2079059
Attachments:
    Description    Flags
    gdb info       none
    SB Database    none
    NB Database    none
    conf.db        none

Description yili 2022-02-10 10:19:28 UTC
Created attachment 1860311 [details]
gdb info

Description of problem:

When OVN is used with IPv6, the ovn-controller process falls into a dead loop.


Version-Release number of selected component (if applicable):

ovn: ovn-21.09.0
openvswitch: openvswitch-2.15.2


How reproducible:

I have no idea how to reproduce it.


Additional info:

According to the attached gdb output (gdb-ovn_controller.png),
we find that ovn-controller is stuck in a dead loop in hmap_next_with_hash__() (include/openvswitch/hmap.h):

static inline struct hmap_node *
hmap_next_with_hash__(const struct hmap_node *node, size_t hash)
{
    while (node != NULL && node->hash != hash) {  ---------------->dead loop
        node = node->next;
    }
    return CONST_CAST(struct hmap_node *, node);
}

Comment 1 yili 2022-02-10 10:44:04 UTC
(gdb) bt
#0  hmap_next_with_hash__ (hash=915080580, node=0x56283b830ed8)
    at include/openvswitch/hmap.h:324
#1  hmap_first_with_hash (hmap=hmap@entry=0x56283b830ed8,
    hmap=hmap@entry=0x56283b830ed8, hash=915080580)
    at include/openvswitch/hmap.h:334
#2  smap_find__ (smap=smap@entry=0x56283b830ed8,
    key=key@entry=0x56283a60a796 "peer", key_len=4, hash=915080580)
    at lib/smap.c:418
#3  0x000056283a5154f6 in smap_get_node (smap=smap@entry=0x56283b830ed8,
    key=key@entry=0x56283a60a796 "peer") at lib/smap.c:217
#4  0x000056283a515559 in smap_get_def (def=0x0,
    key=key@entry=0x56283a60a796 "peer", smap=smap@entry=0x56283b830ed8)
    at lib/smap.c:208
#5  smap_get (smap=smap@entry=0x56283b830ed8,
    key=key@entry=0x56283a60a796 "peer") at lib/smap.c:200
#6  0x000056283a4557d5 in prepare_ipv6_ras (
    sbrec_port_binding_by_name=0x56283b791110,
    local_active_ports_ras=0x56283b79bed8) at controller/pinctrl.c:3950
#7  pinctrl_run (ovnsb_idl_txn=ovnsb_idl_txn@entry=0x56283bb3da40,
    sbrec_datapath_binding_by_key=sbrec_datapath_binding_by_key@entry=0x56283b78e800,
    sbrec_port_binding_by_datapath=sbrec_port_binding_by_datapath@entry=0x56283b78faf0,
    sbrec_port_binding_by_key=sbrec_port_binding_by_key@entry=0x56283b792500,
    sbrec_port_binding_by_name=sbrec_port_binding_by_name@entry=0x56283b791110, sbrec_mac_binding_by_lport_ip=sbrec_mac_binding_by_lport_ip@entry=0x56283b78eee0, sbrec_igmp_groups=sbrec_igmp_groups@entry=0x56283b78c7b0,
    sbrec_ip_multicast_opts=sbrec_ip_multicast_opts@entry=0x56283b78d780,
    sbrec_fdb_by_dp_key_mac=sbrec_fdb_by_dp_key_mac@entry=0x56283b78bfd0,
    dns_table=0x56283b79e460, ce_table=ce_table@entry=0x56283b79e460,
    svc_mon_table=svc_mon_table@entry=0x56283b79e460,
    bfd_table=bfd_table@entry=0x56283b79e460,
    br_int=br_int@entry=0x56283b7c8230, chassis=chassis@entry=0x56283b8d2810,
    local_datapaths=local_datapaths@entry=0x56283b79bd70,
    active_tunnels=active_tunnels@entry=0x56283b79be30,
    local_active_ports_ipv6_pd=local_active_ports_ipv6_pd@entry=0x56283b79beb8, local_active_ports_ras=local_active_ports_ras@entry=0x56283b79bed8)
    at controller/pinctrl.c:3477
#8  0x000056283a42d230 in main (argc=11, argv=0x7ffd8b6f9ad8)
    at controller/ovn-controller.c:3693
(gdb) frame 2
#2  smap_find__ (smap=smap@entry=0x56283b830ed8,
    key=key@entry=0x56283a60a796 "peer", key_len=4, hash=915080580)
    at lib/smap.c:418
418         HMAP_FOR_EACH_WITH_HASH (node, node, hash, &smap->map) {
(gdb) p *smap
$8 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0,
    n = 94730797125360}}
(gdb) frame 0
#0  hmap_next_with_hash__ (hash=915080580, node=0x56283b830ed8)
    at include/openvswitch/hmap.h:324
324             node = node->next;
(gdb) p *node
$9 = {hash = 94730799034240, next = 0x56283ba02f80}
(gdb) ptype node
type = const struct hmap_node {
    size_t hash;
    struct hmap_node *next;
} *
(gdb) p (struct hmap_node*)0x56283ba02f80
$10 = (struct hmap_node *) 0x56283ba02f80
(gdb) p *(struct hmap_node*)0x56283ba02f80
$11 = {hash = 94730797125336, next = 0x56283b830ed8}
(gdb)


1:
node(0x56283b830ed8)->next is 0x56283ba02f80, and node(0x56283b830ed8)->hash is 94730799034240.
node(0x56283ba02f80)->next is 0x56283b830ed8, and node(0x56283ba02f80)->hash is 94730797125336.
The two nodes point at each other, so the bucket's list forms a cycle.

The hash being searched for is 915080580, which matches neither node,

so we loop forever in:
static inline struct hmap_node *
hmap_next_with_hash__(const struct hmap_node *node, size_t hash)
{
    while (node != NULL && node->hash != hash) {  ---------------->dead loop
        node = node->next;
    }
    return CONST_CAST(struct hmap_node *, node);
}
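
To make the failure mode concrete, here is a minimal, self-contained sketch (not the original reproducer; the hash values are copied from the gdb output above, and the helper name next_with_hash() is made up for illustration):

/* Minimal sketch: two corrupted hmap_nodes that point at each other, neither
 * carrying the hash being searched for, keep the walk going forever. */
#include <stddef.h>

struct hmap_node {
    size_t hash;
    struct hmap_node *next;
};

static const struct hmap_node *
next_with_hash(const struct hmap_node *node, size_t hash)
{
    /* Same loop as hmap_next_with_hash__(): it only terminates on a matching
     * hash or on NULL -- a cycle provides neither. */
    while (node != NULL && node->hash != hash) {
        node = node->next;
    }
    return node;
}

int main(void)
{
    struct hmap_node a, b;

    a.hash = (size_t) 94730799034240ULL;   /* node 0x56283b830ed8 in the dump */
    a.next = &b;
    b.hash = (size_t) 94730797125336ULL;   /* node 0x56283ba02f80 in the dump */
    b.next = &a;                           /* points back: a two-node cycle */

    /* Searching for hash 915080580 matches neither node and never reaches
     * NULL, so this call never returns -- a dead loop. */
    next_with_hash(&a, 915080580);
    return 0;
}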
 


2: the value smap->n == 94730797125360 is puzzling:
(gdb) p *smap
$8 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0,
    n = 94730797125360}}
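
For reference, here is a sketch of the map types involved (field names match the gdb dump; treat this as an approximation of the definitions in include/openvswitch/hmap.h rather than a verbatim copy), together with the invariants a healthy map would satisfy:

/* Sketch of the OVS hash-map types; field names match the gdb dump above. */
#include <stddef.h>

struct hmap_node {
    size_t hash;                /* Hash value. */
    struct hmap_node *next;     /* Next node in the same bucket. */
};

struct hmap {
    struct hmap_node **buckets; /* Points to 'one' when 'mask' == 0. */
    struct hmap_node *one;
    size_t mask;                /* Number of buckets minus one. */
    size_t n;                   /* Number of nodes: expected to be a small count. */
};

Note that 94730797125360 is 0x56283b830ef0, a value in the same range as the heap pointers elsewhere in the dump, so the smap appears to have been overwritten with pointer data rather than holding an element count.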

Comment 2 Mohammad Heib 2022-02-10 11:07:43 UTC
Hi Yili,
thank you for reporting this bug.
I see that you don't know how to reproduce the issue.
Can you please describe your use case in more detail? Is it an OpenStack/OpenShift deployment?
And can you please attach the OVS database of the affected node as well as the SB/NB databases?

Thanks,

Comment 3 yili 2022-02-10 12:16:41 UTC
(In reply to Mohammad Heib from comment #2)
> Hi Yili,
> thank you for reporting this bug.
> I see that you don't know how to reproduce the issue.
> Can you please describe your use case in more detail? Is it an
> OpenStack/OpenShift deployment?

Hmm, I am not using OpenStack/OpenShift, just RHEL 8.0 with the openvswitch and ovn RPMs installed,
and libvirt (qemu/kvm) to create the VMs.

> And can you please attach the OVS database of the affected node as well as
> the SB/NB databases?

I have uploaded the SB/NB databases.

>
> Thanks,

Comment 4 yili 2022-02-10 12:17:36 UTC
Created attachment 1860340 [details]
SB Database

Comment 5 yili 2022-02-10 12:18:06 UTC
Created attachment 1860341 [details]
NB Database

Comment 6 yili 2022-02-17 03:39:18 UTC
(gdb) frame 6
#6  0x000056283a4557d5 in prepare_ipv6_ras (sbrec_port_binding_by_name=0x56283b791110, local_active_ports_ras=0x56283b79bed8) at controller/pinctrl.c:3950
3950            const char *peer_s = smap_get(&pb->options, "peer");
(gdb) list
3945        bool changed = false;
3946        SHASH_FOR_EACH (iter, local_active_ports_ras) {
3947            const struct pb_ld_binding *ras = iter->data;
3948            const struct sbrec_port_binding *pb = ras->pb;
3949
3950            const char *peer_s = smap_get(&pb->options, "peer");
3951            if (!peer_s) {
3952                continue;
3953            }
3954
(gdb) p pb
$15 = (const struct sbrec_port_binding *) 0x56283b830d80
(gdb) p *pb
$16 = {header_ = {hmap_node = {hash = 139878501501195, next = 0x56283baf86e0}, uuid = {parts = {1001359112, 22056, 0, 0}}, src_arcs = {prev = 0x4b5b1877,
      next = 0x56283ba37160}, dst_arcs = {prev = 0x0, next = 0x460342f3}, table = 0x0, old_datum = 0x4b5b1877, parsed = false, reparse_node = {prev = 0x56283b830dd8,
      next = 0x56283b830dd8}, new_datum = 0x56283b830e30, prereqs = 0x56283b830e30, written = 0x0, txn_node = {hash = 94730797125120, next = 0x56283b830e00},
    map_op_written = 0x56283bdd6f60, map_op_lists = 0x56283b830f00, set_op_written = 0xffffffff00000000, set_op_lists = 0x41, change_seqno = {998444520, 22056, 998444520},
    track_node = {prev = 0x56283b97a110, next = 0x56283ba02f90}, updated = 0x56283b830d80, tracked_old_datum = 0x9dae4afb460342f3}, chassis = 0xbadcf52db051844a,
  datapath = 0xb1, encap = 0x7f380064c70b, external_ids = {map = {buckets = 0x56283b836260, one = 0x56283b836288, mask = 0, n = 2351977226}}, gateway_chassis = 0x56283bbc8180,
  n_gateway_chassis = 0, ha_chassis_group = 0x460342f3, logical_port = 0x562800000000 <Address 0x562800000000 out of bounds>, mac = 0x8c30530a, n_mac = 94730797032680,
  nat_addresses = 0x56283b830ec8, n_nat_addresses = 94730797125320, options = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0, n = 94730797125360}},
  parent_port = 0x56283b830ef0 "\360\016\203;(V", requested_chassis = 0x56283b830e10, tag = 0x56283bb4d290, n_tag = 0, tunnel_key = 49, type = 0x56283bdd7170 "",
  up = 0x56283bb4d390, n_up = 0, virtual_parent = 0x0}
(gdb) ptype *pb
type = const struct sbrec_port_binding {
    struct ovsdb_idl_row header_;
    struct sbrec_chassis *chassis;
    struct sbrec_datapath_binding *datapath;
    struct sbrec_encap *encap;
    struct smap external_ids;
    struct sbrec_gateway_chassis **gateway_chassis;
    size_t n_gateway_chassis;
    struct sbrec_ha_chassis_group *ha_chassis_group;
    char *logical_port;
    char **mac;
    size_t n_mac;
    char **nat_addresses;
    size_t n_nat_addresses;
    struct smap options;
    char *parent_port;
    struct sbrec_chassis *requested_chassis;
    int64_t *tag;
    size_t n_tag;
    int64_t tunnel_key;
    char *type;
    _Bool *up;
    size_t n_up;
    char *virtual_parent;
}

(gdb) p pb->mac[0]
Cannot access memory at address 0x8c30530a
(gdb) p pb->mac[1]
Cannot access memory at address 0x8c305312
(gdb) p pb->n_mac
$17 = 94730797032680
(gdb) p pb->logical_port
$18 = 0x562800000000 <Address 0x562800000000 out of bounds>
(gdb) p pb->options
$19 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0, n = 94730797125360}}
(gdb) p (struct smap_node *) 0x56283ba02f80
$20 = (struct smap_node *) 0x56283ba02f80
(gdb) p *(struct smap_node *) 0x56283ba02f80
$21 = {node = {hash = 94730797125336, next = 0x56283b830ed8}, key = 0x56283b830e40 "\020\241\227;(V", value = 0x56283bb4d2c0 "\220/\240;(V"}
(gdb) p *(struct smap_node *) 0x56283b830ed8
$22 = {node = {hash = 94730799034240, next = 0x56283ba02f80}, key = 0x0, value = 0x56283b830ef0 "\360\016\203;(V"}
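
The dump shows a port_binding whose fields are mostly pointer-sized garbage, which is what one would expect if prepare_ipv6_ras() is dereferencing a cached pb pointer whose underlying row has already been freed and its memory reused. A purely illustrative sketch of that pattern (hypothetical types, not ovn-controller code):

/* Illustrative only: a cache that keeps a raw pointer to a row after the
 * row has been freed may read whatever the allocator reused that memory
 * for, producing pointer-sized garbage like the values seen above.  This
 * is undefined behavior, shown only to demonstrate the failure pattern. */
#include <stdio.h>
#include <stdlib.h>

struct row {
    char name[16];
    long value;
};

int main(void)
{
    struct row *r = malloc(sizeof *r);
    snprintf(r->name, sizeof r->name, "lsp1");
    r->value = 42;

    struct row *cached = r;                      /* stale reference kept elsewhere */

    free(r);                                     /* row deleted, cache never updated */

    long long *reused = malloc(sizeof *reused);  /* heap memory may get recycled */
    *reused = 0x56283b830ef0LL;

    /* Dereferencing the stale pointer now yields garbage rather than the
     * original row contents. */
    printf("stale value: %ld\n", cached->value);

    free(reused);
    return 0;
}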

Comment 7 Dumitru Ceara 2022-02-17 10:03:59 UTC
I think it would also help pinpoint this issue if we had the OVS database (/etc/openvswitch/conf.db or similar) from the node where ovn-controller hangs.

Comment 8 yili 2022-02-18 02:42:29 UTC
Created attachment 1861813 [details]
conf.db

Comment 9 yili 2022-02-18 02:47:27 UTC
1: The ovn-controller process occupies 100% CPU when this issue happens.
2: The environment recovers after restarting ovn-controller on the affected OVN chassis.

Comment 11 Mohammad Heib 2022-03-15 08:26:15 UTC
Hi @33186108,

I was looking at this issue but unfortunately couldn't reproduce the case on my setup.

I was able to reproduce something similar that causes some invalid memory use and seems to be related to your issue,
so I submitted a patch to fix the memory issue and would really appreciate it if you could apply it to your setup and try to reproduce the issue.
If you need any help with applying the patch or rerunning the controller, please let me know.

The patch link:
https://patchwork.ozlabs.org/project/ovn/patch/20220314110928.471986-1-mheib@redhat.com/

Comment 12 yili 2022-03-16 02:44:50 UTC
(In reply to Mohammad Heib from comment #11)
> Hi @33186108,
> 
> I was looking at this issue but unfortunately couldn't reproduce the case
> on my setup.
> 
> I was able to reproduce something similar that causes some invalid memory
> use and seems to be related to your issue, so I submitted a patch to fix
> the memory issue and would really appreciate it if you could apply it to
> your setup and try to reproduce the issue.
> If you need any help with applying the patch or rerunning the controller,
> please let me know.
> 
> The patch link:
> https://patchwork.ozlabs.org/project/ovn/patch/20220314110928.471986-1-mheib@redhat.com/

Thanks for your help ...

I will apply your patch and report the result.

Comment 13 yili 2022-03-29 01:16:30 UTC
After applying the patch https://patchwork.ozlabs.org/project/ovn/patch/20220314110928.471986-1-mheib@redhat.com/,

this issue has not happened during two weeks of automated stress testing.

Thanks to all of you.

Comment 14 OVN Bot 2022-04-14 13:38:18 UTC
This issue is fixed in ovn22.03-22.03.0-15

Comment 17 ying xu 2022-05-06 00:36:28 UTC
I talked with Numan; this is not easy to reproduce and test, so I just ran some regression tests to make sure there are no regressions.
As noted in comment 13, the reporter has verified the patch, so I am setting this bug to verified as sanity-only.

Comment 19 errata-xmlrpc 2022-05-27 18:14:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn22.03 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4785