Bug 1638774
| Summary: | winbind crashes in wb_lookupsid_send | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | amitkuma |
| Component: | samba | Assignee: | Andreas Schneider <asn> |
| Status: | CLOSED ERRATA | QA Contact: | Andrej Dzilský <adzilsky> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 6.10 | CC: | amitkuma, asakure, asn, bthakur, gdeschner, gparente, hans, jarrpa, jinjli, jsosic, julian.gilbert, jvilicic, knakai, pravisha, rcadova, rmitra, toneata, tscherf |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | samba-3.6.23-52.el6_10 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-08-13 14:59:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Comment 2
amitkuma
2018-10-12 13:26:35 UTC
It reports a memory corruption. Can you run winbind under valgrind, which should detect it?

valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.%p.log /usr/sbin/winbindd -F

Yes, we need to reproduce the error under valgrind so that it tells us where something went wrong.

The valgrind log doesn't show any error. We need the issue reproduced so that valgrind can catch it.

Dear asn,
We have asked the customer to reproduce the issue, i.e. to generate a coredump while winbind is running under valgrind.
But the crash is occasional; the issue cannot be reproduced on demand.
I was also wondering: with the coredump file available, can't we work out why winbind goes wrong while creating a new async request with tevent?
As I understand it, _tevent_req_create() is expected to allocate and zero this state structure:
struct wb_lookupsid_state {
struct tevent_context *ev;
struct winbindd_domain *lookup_domain;
struct dom_sid sid;
enum lsa_SidType type;
const char *domname;
const char *name;
};
The allocation reaches __libc_malloc with a request of 869 bytes:
#12 0x00007fbb19d3faac in __libc_malloc (bytes=869) at malloc.c:3667
I am not sure whether 869 is the correct size, but it can be calculated.
If the size qualifies for a fastbin, the fastbin index is calculated and converted to an mfastbinptr*:
if ((unsigned long)(nb) <= (unsigned long)(get_max_fast ())) {
idx = fastbin_index(nb);
mfastbinptr* fb = &fastbin (av, idx);
mchunkptr pp = *fb;
do
{
victim = pp;
if (victim == NULL)
break;
}
And it fails the fastbin index check:
if (victim != 0) {
if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
{
..
malloc_printerr (check_action, errstr, chunk2mem (victim));
return NULL;
....
}
It is difficult, but can't we check the memory allocated in the dumped call stack against the sizeof wb_lookupsid_state requested in the good case, where wb_lookupsid_send() gets a valid chunk?
If you look at the git history of source3/winbindd/wb_lookupsid.c, there has been no real change or fix for years. I think the problem is that something overwrites memory, and when we call wb_lookupsid_send() we end up accessing invalid memory and crash. valgrind is normally good at finding the culprit for these things, if you're able to reproduce it :-)

I guess the customer can't move to RHEL7?

Dear asn,
The customer is not ready for RHEL7. Also, out of n servers, he is facing the issue on only one server. I have asked how this server differs from the other servers that run winbind without crashing. Is any specific user/group queried on this server? Is this server joined to some other OU in Active Directory?

Dear asn,
I have asked the customer to run winbind under valgrind in the background using this command:
# valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.log /usr/sbin/winbindd &
since the customer was not willing to run winbind in a separate terminal. I have asked the customer to provide the coredump and the valgrind log file generated at the time of the crash.

Dear asn,
winbind crashed and the customer provided the valgrind report, but the report was full of ???.
==30623==    at 0x2E5EB2: sid_copy (in /usr/sbin/winbindd)
==30623==    by 0x2208C4: wb_sid2gid_send (in /usr/sbin/winbindd)
==30623==    by 0x229C93: ??? (in /usr/sbin/winbindd)
==30623==    by 0x2225FA: ??? (in /usr/sbin/winbindd)
==30623==    by 0x22179E: ??? (in /usr/sbin/winbindd)
==30623==    by 0x3CDC7C: ??? (in /usr/sbin/winbindd)
==30623==    by 0x20281A: ??? (in /usr/sbin/winbindd)
==30623==    by 0x201679: ??? (in /usr/sbin/winbindd)
==30623==    by 0x23A7C1: ??? (in /usr/sbin/winbindd)
==30623==    by 0x23AFE8: ??? (in /usr/sbin/winbindd)
==30623==    by 0x6C81EA5: ??? (in /usr/lib64/libtevent.so.0.9.26)
==30623==    by 0x6C802D5: ??? (in /usr/lib64/libtevent.so.0.9.26)
==30623==    by 0x6C7BC3C: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.26)
I asked the customer to install the debuginfo packages:
# debuginfo-install samba-winbind-3.6.23-51.el6.x86_64
# debuginfo-install glibc-2.12-1.212.el6.x86_64
The customer also provided a coredump file, but no usable stack trace is generated when winbind is run using "/usr/sbin/winbindd &":
# file 100-winbind.valgrind.21449.log.core.21449
100-winbind.valgrind.21449.log.core.21449: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'winbindd'
# gdb /usr/sbin/winbindd 100-winbind.valgrind.21449.log.core.21449
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/winbindd...Reading symbols from /usr/lib/debug/usr/sbin/winbindd.debug...done.
done.
Illegal process-id: 100-winbind.valgrind.21449.log.core.21449.
[New Thread 21449]
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `'. <<<<<<<<<<<<<<<<<
Program terminated with signal 6, Aborted.
#0 0x00000000074f3495 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0 0x00000000074f3495 in _start () from /lib64/ld-linux-x86-64.so.2
Cannot access memory at address 0x7feffe498
(gdb)

As always, you need to install debuginfo packages to get useful information.

The customer has installed debuginfo.
Now, when he runs winbind under valgrind:
# valgrind --tool=memcheck -v --num-callers=20 --track-origins=yes --log-file=winbind.valgrind.log /usr/sbin/winbindd &
top shows high CPU usage for the memcheck processes.
# top - 14:51:02 up 91 days, 13:28, 7 users, load average: 36.80, 36.93, 33.28
Tasks: 507 total, 10 running, 495 sleeping, 2 stopped, 0 zombie
Cpu(s): 17.6%us, 10.5%sy, 70.5%ni, 0.2%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
Mem: 32879952k total, 17954840k used, 14925112k free, 128260k buffers
Swap: 67104760k total, 538228k used, 66566532k free, 14924180k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31141 root 20 0 470m 154m 2280 R 50.8 0.5 395:55.45 memcheck-amd64-
31208 root 20 0 464m 144m 1048 S 46.1 0.4 335:32.08 memcheck-amd64-
31213 root 20 0 464m 144m 1072 S 40.1 0.5 335:31.31 memcheck-amd64-
6896 svc_lsfa 35 15 1306m 37m 20m S 31.5 0.1 5:32.54 sas

Update given to the customer:
valgrind tools such as memcheck or helgrind use a lot of memory for tracking various aspects of your program. So it is normal that top shows a lot more memory than what your program allocates itself.
1. Is authentication hanging or delayed?
2. Did winbind crash and generate a coredump?
3. Provide us valgrind.log, renamed to valgrind_memcheck_high_CPU.log, for checking.
If authentication and id information are retrieved without much delay, please continue to monitor for a winbind crash.

Do we need to run valgrind with --stats=yes?

Dear asn,
Thank you very much for the build. I have asked the customer whether he can test the patchset. But earlier on the case he answered this:
3. We provide a test package, you install it on this crashing RHEL box and provide us the findings. Again, a test package does not guarantee the fix.
Ans: We can't take the risk of installing the package on prod boxes. What is the impact of this package? What does the test package do?
We want the permanent fix, as we have waited for the last 3 months but still have not got any resolution.
My question to you: can we tell the customer that this patchset will not break their existing system, and that the code change is done to fix the winbind crash issue? The end result would be either that the winbind crash is successfully resolved or, with a slim chance, that it is not. But this patchset will not break the production box, i.e. authentication and information retrieval via winbind. Can we then ask them to install this patchset on the production box? Thanks

I've backported two patches which are in newer Samba versions. This means they should work as expected. However, as Samba 3.6 is an old code base, I cannot guarantee that the patch is working correctly. I think it will fix the problem but I'm not 100% sure. winbind is a complex piece of software :-) Also, you can tell the customer that it took so long because we needed the valgrind log to see where the root cause of the issue is. Once you know that, you can start looking for issues or fixes.

Dear asn,
The packages provided in Comment#28 have fixed the customer's issue. But the most recent samba version for RHEL-6 is samba-winbind-3.6.23-51.el6.x86_64.rpm. Is the fix also present in that version (3.6.23-51)? Can the customer install 3.6.23-51 on production boxes, considering that the memory corruption issue covered in this bugzilla is fixed?

I will do a zstream release with the patches.

Dear asn,
Thanks for the info.

Hi Romana, can you please give PM ACK for this BZ? You will find the justification in c#46. Thank you.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2472

After the upgrade from 3.6.23-51 to 3.6.23-52, AD groups stopped working. Winbind is not showing any additional groups except the primary one.
Example:
[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users),10512(domain admins),10518(schema admins),10519(enterprise admins),..... context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[root@test-vm ~]# yum -y update samba*
[root@test-vm ~]# su - jsosic
[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Downgrade solves the problem immediately:
[root@test-vm ~]# yum -y downgrade samba*
[root@test-vm ~]# su - jsosic
[jsosic@test-vm ~]$ id
uid=13689(jsosic) gid=10513(domain users) groups=10513(domain users),10512(domain admins),10518(schema admins),10519(enterprise admins),..... context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Has anyone else hit this one?

Hello Jakov,
We have not yet heard from the customer for whom we opened this bugzilla.

Hello Jakov,
Yes, we have the same problem. We have problems with RHEL6 and RHEL7. Not with RHEL5.

Hello Jakov,
Yes, we have the same issue with RHEL6. See Bug 1743358. |