Bug 1767159

Summary: SSSD is crashing: dbus_watch_handle() is invoked with corrupted 'watch' value
Product: Red Hat Enterprise Linux 7
Reporter: Vinay Mishra <vmishra>
Component: sssd
Assignee: Alexey Tikhonov <atikhono>
Status: CLOSED ERRATA
QA Contact: sssd-qe <sssd-qe>
Severity: high
Priority: high
Version: 7.7
CC: abroy, apeetham, atikhono, dominik.mierzejewski, grajaiya, jhrozek, lslebodn, mkosek, mzidek, pbrezina, sgoveas, thalman, tscherf
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: 7.9
Hardware: All
OS: Linux
Whiteboard: sync-to-jira
Fixed In Version: sssd-1.16.5-1.el7
Doc Type: If docs needed, set a value
Clones: 1816031 (view as bug list)
Last Closed: 2020-09-29 19:49:11 UTC
Type: Bug
Bug Blocks: 1816031

Comment 3 Alexey Tikhonov 2019-11-04 15:48:19 UTC
(In reply to Vinay Mishra from comment #2)
> (gdb) bt
> #0  dbus_watch_handle (watch=0xa0, flags=2) at ../../dbus/dbus-watch.c:740


This looks similar to:
1) https://bugzilla.redhat.com/show_bug.cgi?id=1144765#c12 (I do not think resolution was correct)
2) https://pagure.io/SSSD/sssd/issue/2660 (no resolution)
3) https://bugzilla.redhat.com/show_bug.cgi?id=1731577 (no resolution)


Crashing process:
1) monitor
2) sssd_nss
3) sssd_be, sssd_ssh
And in this case, `cmdline`: /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
 => the crash does not seem to be specific to any one service.


Package versions:
1) sssd - 1.12.0, dbus - 1.8.6 (released 2014-09-15)
2) sssd - 1.12.4, dbus - ?
3) sssd - 1.16.2, dbus - 1.10.24-12 (released 2017-09-27)
And in this case: same as in (3)


I was not able to find any similar issues with other apps using dbus.
I was not able to find any similar issues with the sssd 2.x release (the `sbus` code was refactored in the 2.x branch).
Probably there is a bug in the sssd 1.x sbus code...

What really surprises me is that the corrupted pointer is "0xa0" in all 4 cases I was able to find. That doesn't look like a coincidence.

Comment 4 Alexey Tikhonov 2019-11-04 15:56:47 UTC
(In reply to Alexey Tikhonov from comment #3)
> I was not able to find any similar issues with other apps using dbus.

I mean I was not able to find "dbus_watch_handle (watch=0xa0)" with other apps.
Of course, a lot of generic SIGSEGVs in dbus_watch_handle() can be found.

Comment 5 Alexey Tikhonov 2019-11-04 18:31:22 UTC
My guess is the following is happening:

 - libdbus removes the dbus watch by calling the `sbus_remove_watch()` callback
 - the watch is freed, but `watch->fde` (the tevent fd monitor set up in `sbus_add_watch()`) is not removed
( https://tevent.samba.org/group__tevent.html#gadc52787f3daf49e589066d37a5cdb18c : "To cancel the monitoring of a file descriptor, call talloc_free() on the object returned by this function.")
 - tevent calls `sbus_watch_handler()` to report that the fd became writable, but the watch context is already freed => use-after-free

An indirect confirmation is that sssd 2.x has this fixed: https://github.com/SSSD/sssd/blob/master/src/sbus/connection/sbus_watch.c#L283
(though I would prefer to put this into the watch destructor)


I need to think about it yet.
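To make the suspected lifetime gap concrete, here is a minimal self-contained model. All names here are illustrative stand-ins, not the real sssd code (which uses talloc/tevent objects, where cancelling the fd monitor means talloc_free() on the tevent_fd). The point it demonstrates: freeing the watch without first cancelling its fde leaves the event loop with a registered handler for freed memory, while the sssd 2.x-style teardown cancels the fde first.

```c
/* Minimal model of the suspected watch/fde lifetime bug. Names are
 * illustrative; the real code uses talloc/tevent objects. */
#include <assert.h>
#include <stdlib.h>

struct fde   { int active; };          /* stands in for a tevent_fd */
struct watch { struct fde *fde; };     /* stands in for the sbus watch ctx */

static int dangling_handlers;          /* counts fds left registered */

/* Suspected bug: the watch is freed, but its fde was never cancelled,
 * so the event loop still holds a handler for the now-freed watch. */
static void free_watch_buggy(struct watch *w)
{
    if (w->fde->active) {
        dangling_handlers++;           /* fde still registered in the loop */
    }
    free(w->fde);
    free(w);
}

/* The sssd 2.x-style teardown: cancel the fde first (talloc_free(fde)
 * in the real code), then release the watch itself. */
static void free_watch_fixed(struct watch *w)
{
    w->fde->active = 0;                /* deregister from the event loop */
    free(w->fde);
    free(w);
}
```

With `free_watch_buggy()` the dangling-handler count goes up; with `free_watch_fixed()` it does not, because the monitor is cancelled before the memory goes away.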

Comment 7 Alexey Tikhonov 2019-11-05 15:55:09 UTC
(In reply to Alexey Tikhonov from comment #5)
> My guess is following is happening:
> 
>  - libdbus removes the dbus watch calling `sbus_remove_watch()` callback
>  - watch is freed, but `watch->fde()` (tevent fd monitoring; set up in
> `sbus_add_watch()`) is not removed
>  - tevent calls `sbus_watch_handler()` to report fd became writable, but
> watch context is already freed => use-after-free
> 
> I need to think about it yet.

I can't be 100% sure that the described bug is the root cause of this bz (IMO it most probably is), but I am pretty sure it is a bug.
I will reuse (reopen) the existing upstream ticket https://pagure.io/SSSD/sssd/issue/2660 to track fixing it.

Comment 10 Alexey Tikhonov 2019-11-06 18:51:15 UTC
```
(gdb) frame 1
#1  0x00007f5b40066398 in sbus_watch_handler (ev=<optimized out>, fde=<optimized out>, 
    flags=<optimized out>, data=<optimized out>) at src/sbus/sssd_dbus_common.c:94
94                  dbus_watch_handle(watch->dbus_write_watch, DBUS_WATCH_WRITABLE);


(gdb) p *watch
$1 = {prev = 0x0, next = 0x0, conn = 0x55efe0f1e2f0, fde = 0x55efe0f1ed70, fd = 12, 
  dbus_read_watch = 0x0, dbus_write_watch = 0xa0}


(gdb) p *(watch->conn)
$2 = {ev = 0x55efe0f0e9b0, type = SBUS_CONNECTION, dbus = {server = 0x55efe0f1de10, 
    conn = 0x55efe0f1de10}, 
  address = 0x55efe0f1ef10 "unix:path=/var/lib/sss/pipes/private/sbus-dp_XXXXX.com", 
  connection_type = 2, disconnect = 0, managed_paths = 0x55efe0f1e3f0, 
  nodes_fns = 0x55efe0f1e620, incoming_signals = 0x55efe0f1e850, retries = 0, max_retries = 3, 
  reconnect_callback = 0x55efe0c21bf0 <nss_dp_reconnect_init>, reconnect_pvt = 0x55efe0f1d440, 
  symlink = 0x0, srv_init_fn = 0x0, srv_init_data = 0x0, clients = 0x55efe0f1ea80, 
  watch_list = 0x0, last_request_time = 0x0, client_destructor_data = 0x0}


(gdb) p *(watch->conn->watch_list)
Cannot access memory at address 0x0
```

sbus_connection::watch_list = 0x0  --  this assures me that `watch_destructor()` had been called and `watch` had been removed from the list (leaving it empty) before `sbus_watch_handler()` was called (i.e. this really is a use-after-free).


But now I am not so sure how this happens.

(In reply to Alexey Tikhonov from comment #5)
>  - watch is freed, but `watch->fde()` (tevent fd monitoring; set up in
> `sbus_add_watch()`) is not removed

  -- the tevent fd should be removed alongside `watch`, because `watch` is the parent context of the tevent fd...

And even if `tevent_common_fd_destructor()` fails (for example, because it was called from the context of `tevent_common_invoke_fd_handler()`), `epoll_event_fd_destructor()` should still call `epoll_update_event()` -> `epoll_del_event()` -> `epoll_ctl(EPOLL_CTL_DEL)` unconditionally...

But it seems that upstream PR930 doesn't make much sense, because calling talloc_free(fde) from `watch_destructor()` is more or less the same as what libtalloc would do on its own. So maybe the order in which it is done in sssd 2.x - https://github.com/SSSD/sssd/commit/f1f9af528f71f42ac41bb7a272f4f7d940fd3a0f - makes some sense, but I do not have an explanation at the moment.

Comment 12 Alexey Tikhonov 2019-11-07 12:14:49 UTC
(1) OK, it seems I figured out what is happening. The mechanics are different from what I assumed initially.
```
#3  0x00007f5b3c93e527 in epoll_event_loop (tvalp=0x7ffe3ad15da0, epoll_ev=0x55efe0f0ec40) at ../tevent_epoll.c:736
        ...
        events = {{events = 29, data = {ptr = 0x55efe0f1ed70, ...}}}
```
events = 29 = 0x1D = 0x1 + 0x4 + 0x8 + 0x10 = EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP
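As a quick sanity check on that decoding (Linux-only, flag values from <sys/epoll.h>):

```c
/* Decode the "events = 29" value seen in the epoll_event_loop() frame. */
#include <assert.h>
#include <sys/epoll.h>

static unsigned epoll_mask_from_backtrace(void)
{
    /* EPOLLIN (0x1) | EPOLLOUT (0x4) | EPOLLERR (0x8) | EPOLLHUP (0x10)
     * == 0x1D == 29 decimal */
    return EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP;
}
```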

tevent_common_invoke_fd_handler()
 -> sbus_watch_handler()
 -> dbus_watch_handle(R)
 -> ...libdbus detects the connection is closed...
 -> sbus_remove_watch()
 -> talloc_free(watch)
 ...control returns through libdbus back to sbus_watch_handler()
 -> `if (watch->dbus_write_watch) dbus_watch_handle(W)`  =>  *use-after-free*
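The hazard and one way to guard it can be sketched in a self-contained model (illustrative names only, not the real sssd code; the guard mirrors the sssd 2.x approach of re-checking the connection's watch list before the write dispatch):

```c
/* Sketch of the reentrancy hazard in a watch handler: the read dispatch
 * can call back into the remove path and free the watch, so the write
 * dispatch must re-validate instead of blindly dereferencing it.
 * All names are illustrative stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct watch {
    struct watch *next;
    bool has_read_watch;
    bool has_write_watch;
};

static struct watch *watch_list;       /* stands in for conn->watch_list */
static int write_dispatch_skipped;     /* counts avoided use-after-free */

static bool watch_is_live(const struct watch *w)
{
    for (const struct watch *i = watch_list; i != NULL; i = i->next) {
        if (i == w) {
            return true;
        }
    }
    return false;
}

/* Dispatching the read side may tear the connection down, which removes
 * and frees the watch (sbus_remove_watch() -> talloc_free() in sssd). */
static void dispatch_read(struct watch *w)
{
    watch_list = NULL;                 /* the remove callback ran */
    free(w);
}

static void watch_handler(struct watch *w, bool readable, bool writable)
{
    bool had_write = w->has_write_watch;   /* read flags before dispatch */

    if (readable && w->has_read_watch) {
        dispatch_read(w);              /* may free w! */
    }
    if (writable && had_write) {
        /* Re-validate via the list instead of dereferencing w: if the
         * read dispatch removed the watch, touching it here would be
         * exactly the use-after-free seen in the backtrace. */
        if (!watch_is_live(w)) {
            write_dispatch_skipped++;
            return;
        }
        /* ... dbus_watch_handle(w->dbus_write_watch, WRITABLE) ... */
    }
}
```

When both readable and writable are reported in one epoll event (as in the `events = 29` trace above) and the read dispatch closes the connection, the guard skips the write dispatch instead of touching freed memory.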


(2) It seems there is an additional issue: a call to `talloc_free(watch)` in the context of `tevent_common_invoke_fd_handler()` will make the destructor of the attached `tevent_fd` fail:
https://gitlab.com/samba-team/samba/blob/master/lib/tevent/tevent_fd.c#L124
and
https://gitlab.com/samba-team/samba/blob/master/lib/tevent/tevent_fd.c#L52
thus leaking memory. Fixing this leak is tricky, because libtevent requires the `tevent_fd` to be destroyed before close(fd), and the close will most probably be executed by libdbus right after sbus_remove_watch()...
But I have to check the details with a libtevent maintainer or another knowledgeable person.

Comment 16 Alexey Tikhonov 2019-11-12 14:42:20 UTC
*** Bug 1731577 has been marked as a duplicate of this bug. ***

Comment 24 Alexey Tikhonov 2019-11-27 13:27:07 UTC
 * master
  * 337a1adf78beefb5dde93a733ab2caf86d65edb9
 * sssd-1-16
  * f845355e32127c5e8f2bf700cdaa5b8721804232

Comment 32 Abhijit Roy 2020-02-03 14:01:47 UTC
*** Bug 1797159 has been marked as a duplicate of this bug. ***

Comment 43 errata-xmlrpc 2020-09-29 19:49:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3904