Bug 2052945
| Summary: | [ovn-controller] ovn-controller fall into dead loop | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | yili <33186108> |
| Component: | ovn22.03 | Assignee: | Mohammad Heib <mheib> |
| Status: | CLOSED ERRATA | QA Contact: | ying xu <yinxu> |
| Severity: | urgent | Priority: | medium |
| Version: | RHEL 8.0 | CC: | ctrautma, dceara, jiji, mheib, mmichels, yinxu |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | ovn22.03-22.03.0-15 | Type: | Bug |
| Bug Blocks: | 2079055, 2079059 | Last Closed: | 2022-05-27 18:14:13 UTC |
Description (yili, 2022-02-10 10:19:28 UTC)
```
(gdb) bt
#0 hmap_next_with_hash__ (hash=915080580, node=0x56283b830ed8)
at include/openvswitch/hmap.h:324
#1 hmap_first_with_hash (hmap=hmap@entry=0x56283b830ed8,
hmap=hmap@entry=0x56283b830ed8, hash=915080580)
at include/openvswitch/hmap.h:334
#2 smap_find__ (smap=smap@entry=0x56283b830ed8,
key=key@entry=0x56283a60a796 "peer", key_len=4, hash=915080580)
at lib/smap.c:418
#3 0x000056283a5154f6 in smap_get_node (smap=smap@entry=0x56283b830ed8,
key=key@entry=0x56283a60a796 "peer") at lib/smap.c:217
#4 0x000056283a515559 in smap_get_def (def=0x0,
key=key@entry=0x56283a60a796 "peer", smap=smap@entry=0x56283b830ed8)
at lib/smap.c:208
#5 smap_get (smap=smap@entry=0x56283b830ed8,
key=key@entry=0x56283a60a796 "peer") at lib/smap.c:200
#6 0x000056283a4557d5 in prepare_ipv6_ras (
sbrec_port_binding_by_name=0x56283b791110,
local_active_ports_ras=0x56283b79bed8) at controller/pinctrl.c:3950
#7 pinctrl_run (ovnsb_idl_txn=ovnsb_idl_txn@entry=0x56283bb3da40,
sbrec_datapath_binding_by_key=sbrec_datapath_binding_by_key@entry=0x56283b78e800,
sbrec_port_binding_by_datapath=sbrec_port_binding_by_datapath@entry=0x56283b78faf0,
sbrec_port_binding_by_key=sbrec_port_binding_by_key@entry=0x56283b792500,
sbrec_port_binding_by_name=sbrec_port_binding_by_name@entry=0x56283b791110, sbrec_mac_binding_by_lport_ip=sbrec_mac_binding_by_lport_ip@entry=0x56283b78eee0, sbrec_igmp_groups=sbrec_igmp_groups@entry=0x56283b78c7b0,
sbrec_ip_multicast_opts=sbrec_ip_multicast_opts@entry=0x56283b78d780,
sbrec_fdb_by_dp_key_mac=sbrec_fdb_by_dp_key_mac@entry=0x56283b78bfd0,
dns_table=0x56283b79e460, ce_table=ce_table@entry=0x56283b79e460,
svc_mon_table=svc_mon_table@entry=0x56283b79e460,
bfd_table=bfd_table@entry=0x56283b79e460,
br_int=br_int@entry=0x56283b7c8230, chassis=chassis@entry=0x56283b8d2810,
local_datapaths=local_datapaths@entry=0x56283b79bd70,
active_tunnels=active_tunnels@entry=0x56283b79be30,
local_active_ports_ipv6_pd=local_active_ports_ipv6_pd@entry=0x56283b79beb8, local_active_ports_ras=local_active_ports_ras@entry=0x56283b79bed8)
at controller/pinctrl.c:3477
#8 0x000056283a42d230 in main (argc=11, argv=0x7ffd8b6f9ad8)
at controller/ovn-controller.c:3693
(gdb) frame 2
#2 smap_find__ (smap=smap@entry=0x56283b830ed8,
key=key@entry=0x56283a60a796 "peer", key_len=4, hash=915080580)
at lib/smap.c:418
418 HMAP_FOR_EACH_WITH_HASH (node, node, hash, &smap->map) {
(gdb) p *smap
$8 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0,
n = 94730797125360}}
(gdb) frame 0
#0 hmap_next_with_hash__ (hash=915080580, node=0x56283b830ed8)
at include/openvswitch/hmap.h:324
324 node = node->next;
(gdb) p *node
$9 = {hash = 94730799034240, next = 0x56283ba02f80}
(gdb) ptype node
type = const struct hmap_node {
size_t hash;
struct hmap_node *next;
} *
(gdb) p (struct hmap_node*)0x56283ba02f80
$10 = (struct hmap_node *) 0x56283ba02f80
(gdb) p *(struct hmap_node*)0x56283ba02f80
$11 = {hash = 94730797125336, next = 0x56283b830ed8}
```
Analysis:

1. The two nodes point at each other: node 0x56283b830ed8 has next = 0x56283ba02f80 and hash = 94730799034240, while node 0x56283ba02f80 has next = 0x56283b830ed8 and hash = 94730797125336. The lookup is searching for hash 915080580, which neither node carries, so the walk can never hit a match or a NULL terminator, and we dead-loop in the function below (a standalone demonstration follows the snippet):

```c
static inline struct hmap_node *
hmap_next_with_hash__(const struct hmap_node *node, size_t hash)
{
    while (node != NULL && node->hash != hash) {    /* <-- dead loop */
        node = node->next;
    }
    return CONST_CAST(struct hmap_node *, node);
}
```
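To make the failure mode concrete, here is a minimal standalone sketch (not OVN code; `struct hmap_node` is reduced to the two fields involved, and the loop is capped so the demo terminates) reproducing the two-node cycle from the dump:

```c
#include <stddef.h>
#include <stdio.h>

/* Reduced stand-in for OVS's struct hmap_node: just the two fields
 * that matter here. */
struct hmap_node {
    size_t hash;
    struct hmap_node *next;
};

int
main(void)
{
    struct hmap_node a, b;

    /* The corrupted state from the dump: a -> b -> a, and neither node
     * carries the probed hash (915080580). */
    a.hash = 94730799034240ULL;  a.next = &b;
    b.hash = 94730797125336ULL;  b.next = &a;

    /* hmap_next_with_hash__() runs this loop with no step limit; we cap
     * it at 8 steps to show the cycle instead of hanging. */
    const struct hmap_node *node = &a;
    size_t want = 915080580;
    for (int step = 0; step < 8 && node && node->hash != want; step++) {
        printf("step %d: node %p, hash %zu\n", step, (void *) node,
               node->hash);
        node = node->next;   /* with a cycle, this never reaches NULL */
    }
    /* The output alternates between &a and &b; without the cap, the
     * loop (and ovn-controller) spins forever at 100% CPU. */
    return 0;
}
```

Without the step cap, this is exactly the spin at `node = node->next;` shown in frame 0 of the backtrace.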
2. The value smap->n == 94730797125360 is puzzling; an smap element count should be tiny, not in the tens of trillions:

```
(gdb) p *smap
$8 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0,
    n = 94730797125360}}
```
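One possible reading of these numbers (my interpretation; the ticket does not spell this out): converted to hex, they are heap addresses in the same 0x5628... region as every pointer in the dump, i.e. the smap appears to have been overwritten with pointer data rather than a count and hashes. A quick standalone check:

```c
#include <stdio.h>

int
main(void)
{
    /* Values copied from the gdb session above. */
    unsigned long long n  = 94730797125360ULL; /* smap->map.n             */
    unsigned long long h1 = 94730799034240ULL; /* "hash" of node 0x...ed8 */
    unsigned long long h2 = 94730797125336ULL; /* "hash" of node 0x...f80 */

    printf("n  = %#llx\n", n);  /* 0x56283b830ef0                         */
    printf("h1 = %#llx\n", h1); /* 0x56283ba02f80 == buckets/one above    */
    printf("h2 = %#llx\n", h2); /* 0x56283b830ed8 == the smap's address   */
    return 0;
}
```

Notably, 0x56283ba02f80 is exactly the buckets/one pointer, and 0x56283b830ed8 is the address of the smap itself (&pb->options), which is consistent with this memory having been freed and reused for pointer-bearing data.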
Comment 2 (Mohammad Heib)

Hi Yili, thank you for reporting this bug. I see that you don't know how to reproduce the issue. Can you describe your use case (is this an OpenStack/OpenShift deployment?), and can you please attach the OVS database of the affected node as well as the SB/NB databases?

Thanks.

Comment (yili), in reply to Mohammad Heib from comment 2

> Can you describe your use case (is this an OpenStack/OpenShift deployment?)

I have not used OpenStack/OpenShift. I am just using Red Hat OS 8.0 with the openvswitch and ovn RPMs installed, and creating VMs with libvirt (qemu/kvm).

> Can you please attach the OVS database of the affected node as well as the SB/NB databases?

I have uploaded the SB/NB databases.

Created attachment 1860340 [details]
SB Database
Created attachment 1860341 [details]
NB Database
```
(gdb) frame 6
#6 0x000056283a4557d5 in prepare_ipv6_ras (sbrec_port_binding_by_name=0x56283b791110, local_active_ports_ras=0x56283b79bed8) at controller/pinctrl.c:3950
3950 const char *peer_s = smap_get(&pb->options, "peer");
(gdb) list
3945 bool changed = false;
3946 SHASH_FOR_EACH (iter, local_active_ports_ras) {
3947 const struct pb_ld_binding *ras = iter->data;
3948 const struct sbrec_port_binding *pb = ras->pb;
3949
3950 const char *peer_s = smap_get(&pb->options, "peer");
3951 if (!peer_s) {
3952 continue;
3953 }
3954
(gdb) p pb
$15 = (const struct sbrec_port_binding *) 0x56283b830d80
(gdb) p *pb
$16 = {header_ = {hmap_node = {hash = 139878501501195, next = 0x56283baf86e0}, uuid = {parts = {1001359112, 22056, 0, 0}}, src_arcs = {prev = 0x4b5b1877,
next = 0x56283ba37160}, dst_arcs = {prev = 0x0, next = 0x460342f3}, table = 0x0, old_datum = 0x4b5b1877, parsed = false, reparse_node = {prev = 0x56283b830dd8,
next = 0x56283b830dd8}, new_datum = 0x56283b830e30, prereqs = 0x56283b830e30, written = 0x0, txn_node = {hash = 94730797125120, next = 0x56283b830e00},
map_op_written = 0x56283bdd6f60, map_op_lists = 0x56283b830f00, set_op_written = 0xffffffff00000000, set_op_lists = 0x41, change_seqno = {998444520, 22056, 998444520},
track_node = {prev = 0x56283b97a110, next = 0x56283ba02f90}, updated = 0x56283b830d80, tracked_old_datum = 0x9dae4afb460342f3}, chassis = 0xbadcf52db051844a,
datapath = 0xb1, encap = 0x7f380064c70b, external_ids = {map = {buckets = 0x56283b836260, one = 0x56283b836288, mask = 0, n = 2351977226}}, gateway_chassis = 0x56283bbc8180,
n_gateway_chassis = 0, ha_chassis_group = 0x460342f3, logical_port = 0x562800000000 <Address 0x562800000000 out of bounds>, mac = 0x8c30530a, n_mac = 94730797032680,
nat_addresses = 0x56283b830ec8, n_nat_addresses = 94730797125320, options = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0, n = 94730797125360}},
parent_port = 0x56283b830ef0 "\360\016\203;(V", requested_chassis = 0x56283b830e10, tag = 0x56283bb4d290, n_tag = 0, tunnel_key = 49, type = 0x56283bdd7170 "",
up = 0x56283bb4d390, n_up = 0, virtual_parent = 0x0}
(gdb) ptype *pb
type = const struct sbrec_port_binding {
struct ovsdb_idl_row header_;
struct sbrec_chassis *chassis;
struct sbrec_datapath_binding *datapath;
struct sbrec_encap *encap;
struct smap external_ids;
struct sbrec_gateway_chassis **gateway_chassis;
size_t n_gateway_chassis;
struct sbrec_ha_chassis_group *ha_chassis_group;
char *logical_port;
char **mac;
size_t n_mac;
char **nat_addresses;
size_t n_nat_addresses;
struct smap options;
char *parent_port;
struct sbrec_chassis *requested_chassis;
int64_t *tag;
size_t n_tag;
int64_t tunnel_key;
char *type;
_Bool *up;
size_t n_up;
char *virtual_parent;
}
(gdb) p pb->mac[0]
Cannot access memory at address 0x8c30530a
(gdb) p pb->mac[1]
Cannot access memory at address 0x8c305312
(gdb) p pb->n_mac
$17 = 94730797032680
(gdb) p pb->logical_port
$18 = 0x562800000000 <Address 0x562800000000 out of bounds>
(gdb) p pb->options
$19 = {map = {buckets = 0x56283ba02f80, one = 0x56283ba02f80, mask = 0, n = 94730797125360}}
(gdb) p (struct smap_node *) 0x56283ba02f80
$20 = (struct smap_node *) 0x56283ba02f80
(gdb) p *(struct smap_node *) 0x56283ba02f80
$21 = {node = {hash = 94730797125336, next = 0x56283b830ed8}, key = 0x56283b830e40 "\020\241\227;(V", value = 0x56283bb4d2c0 "\220/\240;(V"}
(gdb) p *(struct smap_node *) 0x56283b830ed8
$22 = {node = {hash = 94730799034240, next = 0x56283ba02f80}, key = 0x0, value = 0x56283b830ef0 "\360\016\203;(V"}
```
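Nearly every field in the pb dump above is garbage: logical_port points out of bounds, n_mac is a pointer-sized number, and options contains the cyclic chain we are stuck on. That is the signature of a dangling pointer to a row whose memory has been freed and reused. As an illustration only (hypothetical types and names, not the actual OVN code or fix), the failure pattern looks like this:

```c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical stand-ins: a "row" owned by the database layer, and a
 * long-lived cache entry that keeps a raw pointer to it (like the
 * pb pointer reached through local_active_ports_ras). */
struct row {
    char *options_peer;   /* stands in for pb->options */
};

struct cache_entry {
    const struct row *pb; /* cached raw pointer */
};

int
main(void)
{
    struct row *r = malloc(sizeof *r);
    r->options_peer = strdup("lrp-peer");

    struct cache_entry cache = { .pb = r };

    /* The row is deleted, but the cache entry is not invalidated... */
    free(r->options_peer);
    free(r);

    /* ...so the next iteration dereferences freed memory. Whatever the
     * allocator reused the block for is read as row data; in the
     * reported crash, that garbage formed a cyclic hmap chain. */
    printf("%s\n", cache.pb->options_peer);  /* use-after-free */
    return 0;
}
```

Per Mohammad's description in the comments below, his patch fixes an invalid memory use of this kind; the durable pattern is to drop or refresh cached entries whenever the rows they point at are deleted, so a cached pointer can never outlive its row.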
Comment

I think it would also help pinpoint this issue if we had the OVS database (/etc/openvswitch/conf.db or similar) from the node where ovn-controller hangs.

Created attachment 1861813 [details]
conf.db
Comment (yili)

1. The ovn-controller process occupies 100% CPU when this issue happens.
2. The environment recovers after restarting ovn-controller on the affected OVN chassis.

Comment 11 (Mohammad Heib)

Hi @33186108,

I was looking at this issue but unfortunately couldn't reproduce it on my setup. I was able to reproduce something similar that causes some invalid memory use and seems to be related to your issue, so I submitted a patch to fix the memory issue. I would really appreciate it if you could apply it to your setup and try to reproduce the issue. If you need any help with applying it or rerunning the controller, please let me know.

The patch: https://patchwork.ozlabs.org/project/ovn/patch/20220314110928.471986-1-mheib@redhat.com/

Comment (yili), in reply to Mohammad Heib from comment 11

Thanks for your help. I will apply your patch and report the result.

Comment (yili)

After applying the patch (https://patchwork.ozlabs.org/project/ovn/patch/20220314110928.471986-1-mheib@redhat.com/), this issue has not happened in two weeks of automated pressure testing. Thanks, all of you.

Comment

This issue is fixed in ovn22.03-22.03.0-15.

Comment

I talked with Numan; this is not easy to reproduce and test, so I just ran some regression tests to make sure there are no regressions. As noted in comment 13, the reporter has verified the patch, so I am setting this bug to verified as sanity-only.

Comment

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn22.03 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4785