Bug 1660939
Summary: | coredump on unlock after applying updates | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Patrick C. F. Ernzer <pcfe> |
Component: | sssd | Assignee: | Michal Zidek <mzidek> |
Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 29 | CC: | abokovoy, jhrozek, lslebodn, mzidek, pbrezina, pcfe, rharwood, sbose, ssorce |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-07-08 15:45:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Patrick C. F. Ernzer
2018-12-19 16:31:42 UTC
gdb tells me the following:

```
Core was generated by `/usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --logger=files'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000560b1625e1a1 in dp_client_register (mem_ctx=<optimized out>, sbus_req=<optimized out>,
    provider=0x560b1762e230, name=0x560b176a0530 "autofs")
    at src/providers/data_provider/dp_client.c:107
107         dp_cli->name = talloc_strdup(dp_cli, name);
(gdb) list
102             return ENOENT;
103         }
104
105         dp_cli = sbus_connection_get_data(cli_conn, struct dp_client);
106
107         dp_cli->name = talloc_strdup(dp_cli, name);
108         if (dp_cli->name == NULL) {
109             talloc_free(dp_cli);
110             return ENOMEM;
111         }
(gdb) p dp_cli
$1 = (struct dp_client *) 0x0
```

So cli_conn is not NULL, but cli_conn->data is NULL. I'm not sure whether this is an expected state and just a NULL check is missing, or whether it is unexpected and more investigation is needed into how we got into this state. Pavel knows the SBus code best, so I set Needinfo for him.

In the update to sssd-2.0.0-5.fc29.x86_64 only the SBus timeout was changed, from (IIRC) 25s, the DBus default, to 120s. I wonder if there is maybe some dependent timeout which has to be increased as well?

No, it is not expected.
I thought this was a race condition in the initialization code, where we set the on-connection function after the server is already created (dp_client_init creates the dp_cli):

```c
static void dp_init_done(struct tevent_req *subreq)
{
    struct dp_init_state *state;
    struct tevent_req *req;
    errno_t ret;

    req = tevent_req_callback_data(subreq, struct tevent_req);
    state = tevent_req_data(req, struct dp_init_state);

    ret = sbus_server_create_and_connect_recv(state->provider, subreq,
                                              &state->provider->sbus_server,
                                              &state->provider->sbus_conn);
    talloc_zfree(subreq);
    if (ret != EOK) {
        tevent_req_error(req, ret);
        return;
    }

    sbus_server_set_on_connection(state->provider->sbus_server,
                                  dp_client_init, state->provider);
```

However, responders are started well past this point, in sss_monitor_service_init, which is called after dp_init_done (in dp_initialized). I even tried this with setting some delay before dp_init_done is called, but it only proved that responders will not start that soon. Unfortunately the sssd logs are empty, so they do not tell us anything. I suppose this was a one-time event and it is not reproducible, right? I'm afraid we can't do much without logs (ideally level 0x3ff0).

Yes, I have not seen the problem since the initial occurrence. Since there was nothing useful in the logs I provided and I do not have a reproducer for you, I'll close this now.