Description of problem:
While testing the kernel with LTP, we are seeing sssd_be sporadically crash on s390x:

Core was generated by `/usr/libexec/sssd/sssd_be --domain implicit_files --uid 0 --gid 0 --logger=file'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000002aa08d969c2 in dp_client_register ()
(gdb) bt -a
No symbol "a" in current context.
(gdb) bt
#0 0x000002aa08d969c2 in dp_client_register ()
#1 0x000003ffb508cdaa in _sbus_sss_invoke_in_s_out__step () from /usr/lib64/sssd/libsss_iface.so
#2 0x000003ffb4f0cee8 in tevent_common_invoke_timer_handler (te=te@entry=0x2aa1586e020, current_time=..., removed=removed@entry=0x0) at ../../tevent_timed.c:370
#3 0x000003ffb4f0d0a6 in tevent_common_loop_timer_delay (ev=0x2aa15820860) at ../../tevent_timed.c:442
#4 0x000003ffb4f0e71c in epoll_event_loop (tvalp=0x3ffe3e7e9b0, epoll_ev=0x2aa1582a920) at ../../tevent_epoll.c:667
#5 epoll_event_loop_once (ev=<optimized out>, location=<optimized out>) at ../../tevent_epoll.c:937
#6 0x000003ffb4f0c314 in std_event_loop_once (ev=0x2aa15820860, location=0x3ffb5d6f238 "src/util/server.c:718") at ../../tevent_standard.c:110
#7 0x000003ffb4f06cb4 in _tevent_loop_once (ev=ev@entry=0x2aa15820860, location=location@entry=0x3ffb5d6f238 "src/util/server.c:718") at ../../tevent.c:772
#8 0x000003ffb4f06f66 in tevent_common_loop_wait (ev=0x2aa15820860, location=0x3ffb5d6f238 "src/util/server.c:718") at ../../tevent.c:893
#9 0x000003ffb4f0c294 in std_event_loop_wait (ev=0x2aa15820860, location=0x3ffb5d6f238 "src/util/server.c:718") at ../../tevent_standard.c:141
#10 0x000003ffb5d4d8ba in server_loop () from /usr/lib64/sssd/libsss_util.so
#11 0x000002aa08d89588 in main ()

Version-Release number of selected component (if applicable):
sssd-2.2.3-6.el8.s390x

How reproducible:
Sporadically, about 3 times in the last ~6 weeks.

Steps to Reproduce:
Unknown. The test (LTP) that was running when this occurred has no known interaction with sssd. It does, however, collect all cores, which is what makes the crash visible.

Actual results:
sssd_be sporadic crashes

Expected results:
no crashes

Additional info:
#0 0x000002aa08d969c2 in dp_client_register (mem_ctx=<optimized out>, sbus_req=<optimized out>, provider=0x2aa1583a700, name=0x2aa1589dc60 "nss") at src/providers/data_provider/dp_client.c:107
#1 0x000003ffb508cdaa in _sbus_sss_invoke_in_s_out__step (ev=<optimized out>, te=<optimized out>, tv=<error reading variable: value has been optimized out>, private_data=<optimized out>) at src/sss_iface/sbus_sss_invokers.c:682
#2 0x000003ffb4f0cee8 in tevent_common_invoke_timer_handler (te=te@entry=0x2aa1586e020, current_time=..., removed=removed@entry=0x0) at ../../tevent_timed.c:370

(gdb) frame 0
#0 0x000002aa08d969c2 in dp_client_register (mem_ctx=<optimized out>, sbus_req=<optimized out>, provider=0x2aa1583a700, name=0x2aa1589dc60 "nss") at src/providers/data_provider/dp_client.c:107
107 dp_cli->name = talloc_strdup(dp_cli, name);
(gdb) p name
$2 = 0x2aa1589dc60 "nss"
(gdb) p dp_cli
$3 = (struct dp_client *) 0x0

```
    cli_conn = sbus_server_find_connection(dp_sbus_server(provider),
                                           sbus_req->sender->name);
    if (cli_conn == NULL) {
        DEBUG(SSSDBG_CRIT_FAILURE, "Unknown client: %s\n",
              sbus_req->sender->name);
        return ENOENT;
    }

    dp_cli = sbus_connection_get_data(cli_conn, struct dp_client);

    dp_cli->name = talloc_strdup(dp_cli, name);
```

(gdb) p cli_conn
$5 = <optimized out>

So `cli_conn != NULL` but either `cli_conn->data == NULL` or typeof `cli_conn->data` != `struct dp_client`

(gdb) p *provider
$9 = {uid = 0, gid = 0, be_ctx = 0x2aa15839060, ev = 0x2aa15820860, sbus_server = 0x2aa15852520, sbus_conn = 0x2aa158551f0, clients = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, terminating = false, requests = {index = 0, num_active = 0, active = 0x0}, modules = 0x2aa1586aff0, targets = 0x2aa1586b0b0}

(gdb) p *provider->sbus_server
$11 = {ev = 0x2aa15820860, server = 0x2aa1583dcf0, symlink = 0x2aa1583d7f0 "/var/lib/sss/pipes/private/sbus-dp_implicit_files", watch_ctx = 0x2aa15838070, router = 0x2aa15851e80, data_slot = 0, last_activity = 0x0, names = 0x2aa15838d30, match_rules = 0x2aa15852610, max_connections = 1000, uid = 0, gid = 0, on_connection = 0x2aa1583dae0, disconnecting = false, name = {major = 1, minor = 2}}

p sizeof(struct talloc_chunk)
$15 = 88

(gdb) p *((struct talloc_chunk *)((char *)provider->sbus_server - 96))
$17 = {flags = 3404304376, next = 0x0, prev = 0x2aa15855190, parent = 0x0, child = 0x2aa15861560, refs = 0x0, destructor = 0x3ffb5029d20 <sbus_server_destructor>, name = 0x3ffb5034f08 "struct sbus_server", size = 112, limit = 0x0, pool = 0x2aa15852280}

(gdb) p *((struct talloc_chunk *)((char *)provider->sbus_server->names - 96))
$18 = {flags = 3404304368, next = 0x2aa1583da80, prev = 0x2aa158525b0, parent = 0x0, child = 0x2aa158980e0, refs = 0x0, destructor = 0x0, name = 0x3ffb5d6e436 "src/util/util.c:374", size = 168, limit = 0x0, pool = 0x0}

=> pointers are valid (not freed)

(gdb) frame 1
(gdb) p state
$20 = <optimized out>
(gdb) p *((struct _sbus_sss_invoke_in_s_out__state*)req->data)->sbus_req->sender
$26 = {name = 0x2aa1587b3c0 "sssd.nss", uid = 0}

cli_conn = sbus_server_find_connection = sss_ptr_hash_lookup(provider->sbus_server->names, sbus_req->sender->name == "sssd.nss")

(gdb) p *provider->sbus_server->names
$14 = {p = 0, maxp = 4, entry_count = 4, bucket_count = 4, segment_count = 1, min_load_factor = 1, max_load_factor = 5, directory_size = 4, directory_size_shift = 2, segment_size = 4, segment_size_shift = 2, delete_callback = 0x3ffb5d59ef8 <sss_ptr_hash_delete_cb>, delete_pvt = 0x2aa1583e210, halloc = 0x3ffb5d490b0 <hash_talloc>, hfree = 0x3ffb5d490a0 <hash_talloc_free>, halloc_pvt = 0x2aa1583d200, directory = 0x2aa1583f610, statistics = {hash_accesses = 14, hash_collisions = 2, table_expansions = 0, table_contractions = 0}}

(gdb) p *provider->sbus_server->names->directory[0][0]
$45 = {entry = {key = {type = HASH_KEY_STRING, {str = 0x2aa15869aa0 "sssd.domain_implicit_5ffiles", c_str = 0x2aa15869aa0 "sssd.domain_implicit_5ffiles", ul = 2929528838816}}, value = {type = HASH_VALUE_PTR, {ptr = 0x2aa15869880, ..}}}, next = 0x2aa15869720}

-- key of this entry - "sssd.domain_implicit_5ffiles" - looks strange

(gdb) p *provider->sbus_server->names->directory[0][0]->next
$57 = {entry = {key = {type = HASH_KEY_STRING, {str = 0x2aa1586a5a0 ":1.2", c_str = 0x2aa1586a5a0 ":1.2", ul = 2929528841632}}, value = {type = HASH_VALUE_PTR, {ptr = 0x2aa15866f60, ...}}}, next = 0x0}

(gdb) p *provider->sbus_server->names->directory[0][1]
$46 = {entry = {key = {type = HASH_KEY_STRING, {str = 0x2aa15867760 ":1.1", c_str = 0x2aa15867760 ":1.1", ul = 2929528829792}}, value = {type = HASH_VALUE_PTR, {ptr = 0x2aa15867560, ...}}}, next = 0x0}

(gdb) p *provider->sbus_server->names->directory[0][3]
$47 = {entry = {key = {type = HASH_KEY_STRING, {str = 0x2aa158764a0 "sssd.nss", c_str = 0x2aa158764a0 "sssd.nss", ul = 2929528890528}}, value = {type = HASH_VALUE_PTR, {ptr = 0x2aa15898140, ...}}}, next = 0x0}

(gdb) p *((struct sss_ptr_hash_value *)provider->sbus_server->names->directory[0][3]->entry->value->ptr)
$55 = {spy = 0x2aa15872a60, ptr = 0x2aa158615c0}

(gdb) p *((struct talloc_chunk *)((char *)((struct sss_ptr_hash_value *)provider->sbus_server->names->directory[0][3]->entry->value->ptr)->ptr - 96))
$60 = {flags = 3404304368, next = 0x2aa15859e40, prev = 0x0, parent = 0x2aa158524c0, child = 0x2aa158e7eb0, refs = 0x0, destructor = 0x3ffb50147c8 <sbus_connection_destructor>, name = 0x3ffb502d0ca "struct sbus_connection", size = 128, limit = 0x0, pool = 0x0}

=> pointer is valid (not freed) and has proper type `struct sbus_connection`

(gdb) p *((struct sbus_connection *)((struct sss_ptr_hash_value *)provider->sbus_server->names->directory[0][3]->entry->value->ptr)->ptr)
$63 = {ev = 0x2aa15820860, connection = 0x2aa15862930, type = SBUS_CONNECTION_CLIENT, address = 0x0, wellknown_name = 0x2aa1586f2f0 "sssd.nss", unique_name = 0x2aa1586a4c0 ":1.2", disconnecting = false, access = 0x2aa158654a0, destructor = 0x2aa15865520, requests = 0x2aa15863240, reconnect = 0x2aa15863710, router = 0x2aa158637a0, watch = 0x2aa158614f0, data = 0x0, senders = 0x2aa15861070, last_activity = 0x0}

=> dp_client *dp_cli == sbus_connection *cli_conn->data == NULL
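To make the failure mode above concrete, here is a minimal, self-contained sketch of the same pattern with a defensive check added. It is only an illustration: the struct definitions are cut-down, hypothetical stand-ins for the SSSD types (only the fields relevant to the crash are modelled), `register_client` is a made-up name, plain `malloc` replaces talloc, and returning `ENOENT` for a connection without attached data is an assumption. It does not reproduce the upstream change, and a guard like this would only hide the symptom if the real cause is a race.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the SSSD structures. */
struct dp_client {
    char *name;
};

struct sbus_connection {
    const char *wellknown_name;
    void *data;              /* NULL in the core dump: no dp_client attached */
};

/*
 * Simplified model of the step that crashed at dp_client.c:107.
 * The original dereferences the connection's private data unconditionally;
 * this sketch checks it first and reports the client as unknown.
 */
static int register_client(struct sbus_connection *cli_conn, const char *name)
{
    struct dp_client *dp_cli;
    char *copy;

    if (cli_conn == NULL) {
        return ENOENT;
    }

    dp_cli = cli_conn->data;
    if (dp_cli == NULL) {
        /* Assumption: treat "connection known, data not yet set" as an error. */
        fprintf(stderr, "No client data attached yet for: %s\n", name);
        return ENOENT;
    }

    copy = malloc(strlen(name) + 1);   /* plain malloc instead of talloc_strdup */
    if (copy == NULL) {
        return ENOMEM;
    }
    strcpy(copy, name);

    free(dp_cli->name);                /* free(NULL) is a no-op */
    dp_cli->name = copy;
    return 0;
}
```

With a connection whose `data` is NULL, as in `$63` above, `register_client` returns `ENOENT` instead of writing through a null pointer.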
I think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1775766#c14
Other relevant tickets:

* bz 1768670
* bz 1770467
* bz 1684824
Hi,

In RHEL-8.3 on s390x, we hit a problem with sssd_be during VM provisioning. The VM crashed with this error in the dmesg log:

User process fault: interruption code 003b ilc:3 in sssd_be[2aa21b80000+37000]
Failing address: 0000000000000000 TEID: 0000000000000400
Fault in primary space mode while using user ASCE.
AS:00000003e4e001c7 R3:0000000000000024

Can you provide a status update for this bug?

Thanks
Upstream PR: https://github.com/SSSD/sssd/pull/5299
Pushed PR: https://github.com/SSSD/sssd/pull/5299

* `master`
    * 4a84f8e18ea5604ac7e69849dee492718fd96296 - dp: fix potential race condition in provider's sbus server
Pushed PR: https://github.com/SSSD/sssd/pull/5344

* `master`
    * 7fbcaa8feeb968711ff52f51705c45062fd81394 - be: remove accidental sleep
*** Bug 1895794 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sssd bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1666