I've been running into crashes related to stack corruption when doing some tests with the memberOf plug-in on x86_64 hardware. I do not run into any problems on 32-bit i386 hardware. I've also found that both the IPA and DS memberOf plug-in's both trigger the problem, but the stacks are a bit different (which is not too odd considering the code is quite a bit different and the stacks are corrupt). To reproduce the issue, use the following steps: - Setup a DS instance and enable the memberOf plug-in. - Add 20 user entries (user1 - user20): dn: cn=user1,dc=example,dc=com changetype: add cn: user1 objectclass: person objectclass: inetuser sn: 1 - Add 20 group entries (group1 - group20): dn: cn=group1,dc=example,dc=com changetype: add objectClass: groupOfNames objectClass: inetUser cn: group1 - Add each user as a member to their matching group (user1 to group1, etc.): dn: cn=group1,dc=example,dc=com changetype: modify add: member member: cn=user1,dc=example,dc=com - Create a group loop across all 20 groups (g1->g2->g3....g20->g1): dn: cn=group1,dc=example,dc=com changetype: modify add: member member: cn=group2,dc=example,dc=com The ns-slapd process will crash about half-way through the group loop creation. I've attached the top portions of some stack traces for a few different scenarios. For each scenario, the stack is consistently corrupted. The stacks included are one with the IPA memberOf plug-in, one with the DS memberOf plug-in, and another with the DS plug-in with the "-fstack-protector-all" gcc option enabled at build time. This gcc option doesn't seem to detect the corruption, but I believe the stack is different due to the use of stack canaries.
Created attachment 311199 [details] Call Stacks
I'm beginning to think that the problem is that we're overflowing the stack. The crash is occurring at the end of the function prologue when it attempts to make space for the local variables (the sub assembly instruction). Displaying the instruction in the program counter register in gdb shows that the next instruction is the "mov" of the first passed in variable, meaning that the "sub" instruction must have failed. I do not have access to my test system at the moment to confirm this by increasing the stack size, but I think that the theory holds water. I built a little test program that intentionally overflows the stack in a function called from main. I see the crash happen at the same point in the function prologue as it does when ns-slapd crashes: (gdb) disassemble Dump of assembler code for function big_stuff: 0x000000000040048e <big_stuff+0>: push %rbp 0x000000000040048f <big_stuff+1>: mov %rsp,%rbp 0x0000000000400492 <big_stuff+4>: sub $0x1c9c318,%rsp 0x0000000000400499 <big_stuff+11>: mov %edi,0xfffffffffe363c7c(%rbp) 0x000000000040049f <big_stuff+17>: mov %esi,0xfffffffffe363c78(%rbp) 0x00000000004004a5 <big_stuff+23>: mov 0xfffffffffe363c78(%rbp),%eax 0x00000000004004ab <big_stuff+29>: add 0xfffffffffe363c7c(%rbp),%eax 0x00000000004004b1 <big_stuff+35>: leaveq 0x00000000004004b2 <big_stuff+36>: retq End of assembler dump. (gdb) x/i $pc 0x400499 <big_stuff+11>: mov %edi,0xfffffffffe363c7c(%rbp) I think the reason I don't see this same issue on my 32-bit test machine is that the larger 64-bit pointer size used for all local pointer variables used by this thread is enough to push us past the stack limit with the test that I'm running. The stack limit on both my 32-bit and 64-bit test machines is a default of 10MB.
This issue is indeed stack size related. I tried increasing the limit with "ulimit -s", but it didn't help. I started looking into the server code to see if we hard-code the stack size, and I noticed this define in slapi-plugin.h: /* All 64-bit builds get a bigger stack size */ #elif ( defined ( __LP64__ )) || defined (_LP64) #define SLAPD_DEFAULT_THREAD_STACKSIZE 262144L #else #define SLAPD_DEFAULT_THREAD_STACKSIZE 0 #endif We pass this as the requested stack size to PR_CreateThread(). If we pass a "0", then NSPR will use the default pthread stack size, which is the same as the current soft resource limit (10MB). This is what happens on a 32-bit Linux system. On a LP64 system, we are hard-coding a stack size of 256k. This explains why we run out of stack space on such a 20 group loop on 64-bit Linux when I've been able to create group loops of 100 groups on 32-bit Linux. Changing this to "0" for LP64 systems makes the crash go away. I think we should just get rid of the LP64 specific define and use the default define of "0" for SLAPD_DEFAULT_THREAD_STACKSIZE.
I think you've got it. Note - this is why recursion is generally bad in daemon code :P Would probably be better to rewrite to limit recursion or avoid recursion altogether, but let's go with your solution for now.
Created attachment 311304 [details] CVS Diffs This fixes the define for the default stack size. I also dealt with a compiler error with a function not returning a value when it is supposed to return an int in the memberOf code.
Checked into ldapserver (HEAD). Thanks to Rich and Noriko for their reviews! Checking in ldap/servers/plugins/memberof/memberof.c; /cvs/dirsec/ldapserver/ldap/servers/plugins/memberof/memberof.c,v <-- memberof.c new revision: 1.12; previous revision: 1.11 done Checking in ldap/servers/slapd/slapi-plugin.h; /cvs/dirsec/ldapserver/ldap/servers/slapd/slapi-plugin.h,v <-- slapi-plugin.h new revision: 1.25; previous revision: 1.24 done
Checked into Directory71RtmBranch. Checking in slapd/slapi-plugin.h; /cvs/dirsec/ldapserver/ldap/servers/slapd/slapi-plugin.h,v <-- slapi-plugin.h new revision: 1.8.2.3; previous revision: 1.8.2.2 done
Checked into Directory_Server_8_0_Branch. Checking in ldap/servers/slapd/slapi-plugin.h; /cvs/dirsec/ldapserver/ldap/servers/slapd/slapi-plugin.h,v <-- slapi-plugin.h new revision: 1.19.2.1; previous revision: 1.19 done
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0602.html