1638774 – winbind crashes in wb_lookupsid_send


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	samba
Sub Component:
Version:	6.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Andreas Schneider
QA Contact:	Andrej Dzilský
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-10-12 12:28 UTC by amitkuma
Modified:	2019-09-04 09:29 UTC
CC List:	18 users
Fixed In Version:	samba-3.6.23-52.el6_10
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-08-13 14:59:15 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2472	0	None	None	None	2019-08-13 14:59:17 UTC

Comment 2 amitkuma 2018-10-12 13:26:35 UTC

More information
1. The issue happens intermittently.
2. The winbind service was also crashing on RHEL 6.9.

Comment 3 Andreas Schneider 2018-10-17 14:18:07 UTC

It reports a memory corruption. Can you run winbind under valgrind? That should detect it.

valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.%p.log /usr/sbin/winbindd -F

Comment 7 Andreas Schneider 2018-10-29 15:12:13 UTC

Yes, we need to reproduce the error under valgrind so it tells us where something went wrong.

Comment 10 Andreas Schneider 2018-11-07 15:08:06 UTC

The valgrind log doesn't show any error. We need the issue reproduced so that valgrind can catch it.

Comment 11 amitkuma 2018-11-08 06:50:17 UTC

Dear asn,
We have asked the customer to reproduce the issue, i.e. to generate a coredump while winbind is running under valgrind.
But the crash is occasional; the issue cannot be reproduced on demand.

I was also wondering whether, with the coredump file available, we can work out why winbind goes wrong while creating a new async request with tevent.
As I understand _tevent_req_create(), we want this state structure to be allocated and zeroed:
struct wb_lookupsid_state {
        struct tevent_context *ev;
        struct winbindd_domain *lookup_domain;
        struct dom_sid sid;
        enum lsa_SidType type;
        const char *domname;
        const char *name;
};

The backtrace shows 869 bytes being requested from __libc_malloc:
#12 0x00007fbb19d3faac in __libc_malloc (bytes=869) at malloc.c:3667
I am not sure whether 869 is the correct size, but it can be calculated.

As I read it, the size then qualifies for a fastbin; the fastbin index is calculated and converted to an mfastbinptr*:
  if ((unsigned long)(nb) <= (unsigned long)(get_max_fast ())) {
    idx = fastbin_index(nb);
    mfastbinptr* fb = &fastbin (av, idx);
    mchunkptr pp = *fb;
    do
      {
        victim = pp;
        if (victim == NULL)
          break;
      }

It then fails the fastbin index check:
    if (victim != 0) {
      if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
        {
..
          malloc_printerr (check_action, errstr, chunk2mem (victim));
          return NULL;
....
    }
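
The size arithmetic behind the fastbin test can be checked with a short sketch. It assumes the x86-64 glibc conventions (8-byte size field, 16-byte alignment, 32-byte minimum chunk) and takes the upper bound that the mallopt(3) man page documents for M_MXFAST, 80*sizeof(size_t)/4 = 160 bytes, as the largest possible fastbin chunk. The variable and function names are illustrative and do not come from glibc. Under those assumptions an 869-byte request rounds up to an 880-byte chunk, which would be outside the fastbin range, so it may be worth confirming which corruption check actually fired.

```shell
# Assumed x86-64 glibc constants (not read from the running system).
SIZE_SZ=8        # sizeof(size_t)
ALIGN_MASK=15    # 16-byte malloc alignment, minus one
MINSIZE=32       # smallest chunk malloc hands out
MAX_FAST=160     # 80 * SIZE_SZ / 4, documented upper bound for M_MXFAST

# Round a request size up to the chunk size malloc would use.
chunk_size() (
    sz=$(( ( $1 + $SIZE_SZ + $ALIGN_MASK ) & ~$ALIGN_MASK ))
    [ "$sz" -lt "$MINSIZE" ] && sz=$MINSIZE
    echo "$sz"
)

# Succeeds if the request could be served from a fastbin at all.
in_fastbin_range() {
    [ "$(chunk_size "$1")" -le "$MAX_FAST" ]
}

# Example:
#   chunk_size 869        prints 880
#   in_fastbin_range 869  returns non-zero (not a fastbin size)
```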

It is difficult, but can't we check the memory allocated in the dumped call stack against the sizeof wb_lookupsid_state requested in a good case, where wb_lookupsid_send() gets a valid chunk?

Comment 12 Andreas Schneider 2018-11-08 07:05:41 UTC

If you look at the git history of source3/winbindd/wb_lookupsid.c, there has been no real change or fix for years. I think the problem is that something overwrites memory, and when we call wb_lookupsid_send() we end up accessing invalid memory and crash.

valgrind is normally good at finding the culprit for these things, if you're able to reproduce it :-)

I guess the customer can't move to RHEL7?

Comment 13 amitkuma 2018-11-08 09:19:15 UTC

Dear asn,

The customer is not ready for RHEL7.

Also, out of n servers, he is facing the issue on only one.
I have asked how this server differs from the other servers that run winbind without crashing:
is any specific user or group queried on this server, and is it joined to a different OU in Active Directory?

Comment 14 amitkuma 2018-11-14 16:48:17 UTC

Dear asn,
I have asked the customer to run winbind under valgrind in the background using this command:
# valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.log /usr/sbin/winbindd &
This is because the customer was not willing to run winbind in a separate terminal.
I have asked the customer to provide the coredump and the valgrind log file generated at the time of the crash.
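
One possible way to stay in the background while keeping the options from comment 3 (-F so that winbindd does not daemonize, and %p so that each process writes its own log) is sketched below. The function name and the default log directory are assumptions made for illustration; the valgrind and winbindd options are the ones already quoted in this bug, and nohup is the standard utility. The helper only prints the command so that it can be reviewed before it is run on a production host.

```shell
# Hypothetical helper: print (not run) a background valgrind command line.
winbind_valgrind_cmd() (
    # $1: directory for the per-process logs (assumed default: /var/tmp)
    logdir=${1:-/var/tmp}
    printf '%s\n' "nohup valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=$logdir/winbind.valgrind.%p.log /usr/sbin/winbindd -F &"
)

# Example: review the command, then paste it into a root shell.
#   winbind_valgrind_cmd /var/log/samba
```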

Comment 15 amitkuma 2018-11-22 06:57:34 UTC

Dear asn,
winbind crashed and the customer provided the valgrind report, but its stack frames were full of ???:
==30623==    at 0x2E5EB2: sid_copy (in /usr/sbin/winbindd)
==30623==    by 0x2208C4: wb_sid2gid_send (in /usr/sbin/winbindd)
==30623==    by 0x229C93: ??? (in /usr/sbin/winbindd)
==30623==    by 0x2225FA: ??? (in /usr/sbin/winbindd)
==30623==    by 0x22179E: ??? (in /usr/sbin/winbindd)
==30623==    by 0x3CDC7C: ??? (in /usr/sbin/winbindd)
==30623==    by 0x20281A: ??? (in /usr/sbin/winbindd)
==30623==    by 0x201679: ??? (in /usr/sbin/winbindd)
==30623==    by 0x23A7C1: ??? (in /usr/sbin/winbindd)
==30623==    by 0x23AFE8: ??? (in /usr/sbin/winbindd)
==30623==    by 0x6C81EA5: ??? (in /usr/lib64/libtevent.so.0.9.26)
==30623==    by 0x6C802D5: ??? (in /usr/lib64/libtevent.so.0.9.26)
==30623==    by 0x6C7BC3C: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.26)
I asked the customer to install the debuginfo packages:
# debuginfo-install samba-winbind-3.6.23-51.el6.x86_64
# debuginfo-install glibc-2.12-1.212.el6.x86_64
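
A quick way to judge whether a valgrind log is usable is to count the frames that could not be symbolized, which show up as ??? as in the excerpt above. The sketch below is only an illustration; the function name is hypothetical and the log path in the example follows the naming used earlier in this bug.

```shell
# Hypothetical helper: count stack frames valgrind could not symbolize.
count_unresolved_frames() (
    # $1: path to a valgrind log file
    # -F treats "???" as a fixed string, -c prints the number of matching lines
    grep -c -F '???' "$1"
)

# Example:
#   count_unresolved_frames winbind.valgrind.30623.log
# A large count suggests the debuginfo packages are missing or do not
# match the installed binaries.
```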

The customer also provided a coredump file.
But no usable stack trace is generated when winbind is run using "/usr/sbin/winbindd &":
# file 100-winbind.valgrind.21449.log.core.21449 
100-winbind.valgrind.21449.log.core.21449: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'winbindd'

# gdb /usr/sbin/winbindd 100-winbind.valgrind.21449.log.core.21449 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/winbindd...Reading symbols from /usr/lib/debug/usr/sbin/winbindd.debug...done.
done.
Illegal process-id: 100-winbind.valgrind.21449.log.core.21449.
[New Thread 21449]
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `'.            <<<<<<<<<<<<<<<<<
Program terminated with signal 6, Aborted.
#0  0x00000000074f3495 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00000000074f3495 in _start () from /lib64/ld-linux-x86-64.so.2
Cannot access memory at address 0x7feffe498
(gdb)

Comment 16 Andreas Schneider 2018-11-22 18:10:29 UTC

As always you need to install debuginfo packages to get useful information.

Comment 17 amitkuma 2018-12-08 05:27:44 UTC

The customer has installed debuginfo.

Now, when he runs winbind under valgrind:
# valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.log /usr/sbin/winbindd &

top shows high CPU usage for the memcheck processes.

# top - 14:51:02 up 91 days, 13:28,  7 users,  load average: 36.80, 36.93, 33.28
Tasks: 507 total,  10 running, 495 sleeping,   2 stopped,   0 zombie
Cpu(s): 17.6%us, 10.5%sy, 70.5%ni,  0.2%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Mem:  32879952k total, 17954840k used, 14925112k free,   128260k buffers
Swap: 67104760k total,   538228k used, 66566532k free, 14924180k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31141 root      20   0  470m 154m 2280 R 50.8  0.5 395:55.45 memcheck-amd64-
31208 root      20   0  464m 144m 1048 S 46.1  0.4 335:32.08 memcheck-amd64-
31213 root      20   0  464m 144m 1072 S 40.1  0.5 335:31.31 memcheck-amd64-
 6896 svc_lsfa  35  15 1306m  37m  20m S 31.5  0.1   5:32.54 sas


Update given to Customer:
valgrind tools such as memcheck or helgrind use a lot of memory for tracking various aspects of your program. 
So, it is normal that top shows a lot more memory than what your program allocates itself.

1. Is authentication hanging or delayed?
2. Did winbind crash and generate a coredump?
3. Please provide valgrind.log, renamed to valgrind_memcheck_high_CPU.log, for checking.

If authentication and id information are retrieved without much delay, please continue to monitor for a winbind crash.

Do we need to run valgrind with --stats=yes?

Comment 26 amitkuma 2019-01-03 12:31:54 UTC

Dear asn,
Thank you very much for the build.

I have asked the customer whether he can test the patchset.

But earlier on the case he answered this:
3. We provide a test package, you install it on the crashing RHEL box and provide us the findings. Again, a test package does not guarantee the fix.
Ans: We can't take the risk of installing the package on prod boxes. What is the impact of this package? What does the test package do?
We want the permanent fix, as we have waited for the last 3 months and still have not got any resolution.

My question to you:
Can we tell the customer that this patchset will not break their existing system, and that the code change is intended to fix the winbind crash?
The end result would be that either the winbind crash is resolved or, with a small chance, it is not.
Either way, the patchset should not break the production box, i.e. authentication and information retrieval via winbind.

Can we ask the customer to install this patchset on the production box?

Thanks

Comment 27 Andreas Schneider 2019-01-03 12:46:21 UTC

I've backported two patches which are in newer Samba versions. This means they should work as expected. However as Samba 3.6 is an old code base I cannot guarantee that the patch is working correctly. I think it will fix the problem but I'm not 100% sure. winbind is a complex piece of software :-)

Also you can tell the customer that it took so long because we needed the valgrind log to see where the root cause of the issue is. Once you know that you can start looking for issues or fixes.

Comment 32 amitkuma 2019-04-12 10:26:48 UTC

Dear asn,

The packages provided in Comment#28 have fixed the customer's issue.

But the most recent samba version for RHEL-6 is samba-winbind-3.6.23-51.el6.x86_64.rpm.

Is the fix also present in that version (3.6.23-51)? Can the customer install 3.6.23-51 on production boxes, considering that the memory leak issue covered in this bugzilla is fixed?

Comment 33 Andreas Schneider 2019-04-12 12:51:36 UTC

I will do a zstream release with the patches.

Comment 34 amitkuma 2019-04-12 12:58:16 UTC

Dear asn.
Thanks for info.

Comment 48 Thorsten Scherf 2019-07-02 10:33:51 UTC

Hi Romana, can you please give PM ACK for this BZ? You can find the justification in c#46. Thank you.

Comment 54 errata-xmlrpc 2019-08-13 14:59:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2472

Comment 55 Jakov Sosic 2019-08-20 23:48:24 UTC

After the upgrade from 3.6.23-51 to 3.6.23-52, AD groups stopped working. Winbind does not show any group other than the primary one.

Example:

[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users),10512(domain admins),10518(schema admins),10519(enterprise admins),..... context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[root@test-vm ~]# yum -y update samba*

[root@test-vm ~]# su - jsosic
[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023


Downgrade solves the problem immediately:

[root@test-vm ~]# yum -y downgrade samba*

[root@test-vm ~]# su - jsosic
[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users),10512(domain admins),10518(schema admins),10519(enterprise admins),..... context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
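
The comparison above can also be scripted. The sketch below assumes the group lists are captured with `id -G`, which prints the numeric group IDs separated by spaces; the function name is hypothetical and the IDs in the example are the ones from the output above.

```shell
# Hypothetical helper: print the group IDs present before an update
# but missing afterwards.
missing_gids() (
    # $1: `id -G` output before the update
    # $2: `id -G` output after the update
    for gid in $1; do
        case " $2 " in
            *" $gid "*) ;;            # still present
            *) echo "$gid" ;;         # lost after the update
        esac
    done
)

# Example:
#   before=$(id -G jsosic)
#   yum -y update 'samba*'
#   after=$(id -G jsosic)
#   missing_gids "$before" "$after"
```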



Has anyone else hit this one?

Comment 56 amitkuma 2019-08-21 01:19:24 UTC

Hello Jakov,
We have not yet heard back from the customer for whom we opened this bugzilla.

Comment 57 Hans Hielkema 2019-09-02 13:08:21 UTC

Hello Jakov,

Yes, we have the same problem. We have problems with RHEL6 and RHEL7, but not with RHEL5.

Comment 58 julian.gilbert 2019-09-04 09:29:05 UTC

Hello Jakov,

Yes, we have the same issue with RHEL6. See Bug 1743358.
