Bug 137454
Summary: | RedHat EL 3.0 Update 3 ExecShield causes application crashes | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Barry Tilton <tilton1> |
Component: | glibc | Assignee: | Jakub Jelinek <jakub> |
Status: | CLOSED WORKSFORME | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | drepper, ezannoni, jknaggs, johnsond, jparadis, riel, sdenham, tburke |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-07-27 05:06:52 UTC | Type: | --- |
Description
Barry Tilton
2004-10-28 16:28:38 UTC
Created attachment 106132 [details]
assembler instructions for _int_malloc obtained from TotalView 6.5.0-2
We have not been able to develop a test program that reproduces the crashes. The difficulty is emulating a 32 MB executable with 170+ dynamic libraries and thousands of mallocs, reallocs and C++ objects. We believe that the problem is related to the size of our application, the amount of memory "manipulation" caused by the mallocs, objects, etc., and some overflow condition related to the random relocation of dynamic libraries by ExecShield.

We have debugged the crashes using TotalView 6.5.0-2. All three cases that we have debugged crashed with a SEGV in _int_malloc in /lib/tls/libc-2.3.2.so. The file _int_malloc1.txt contains the assembler instructions for _int_malloc obtained from TotalView 6.5.0-2. The observed register settings and updates for one case with ExecShield "on" have been interspersed in the instructions. This case was run with ExecShield "on" and "off". We used the kernel boot parameter "noexec=off" to turn off ExecShield. With ExecShield "on" we get a SEGV at _int_malloc+0x16d; with ExecShield "off" we do not get a SEGV.

This is what we found. With ExecShield "on", at _int_malloc+0xe8 the address loaded into %edx from %edi+84 appears to point to a data block that is not valid or has a bad value at %edx+4. With ExecShield "on" the value at %edx+4 is zero (0), but with ExecShield "off" it is a value greater than zero and usually less than 512. This value is used by the "leal" at _int_malloc+0x147 to compute the index loaded into %edx. The value at %edx+8 is then loaded into %esi. With ExecShield "on", the address in %edx appears to point into the header for the data starting at the address stored in %ecx when the "leal" at _int_malloc+0x147 is executed. This causes the "movl" at _int_malloc+0x14b to load a zero (0) into %esi. Later, at _int_malloc+0x16d, a "movl" attempts to store %edi to %esi+12, and this causes the SEGV. For some reason, with ExecShield "on" the value at %edx+4 is not valid when the "movl" at _int_malloc+0xe8 is executed.
Other observations. The files es_on.map and es_off.map contain the /proc/xxxx/maps files from a run with ExecShield "on" and "off" respectively. When we run with ExecShield "on", /lib/tls/libc-2.3.2.so appears to be loaded at lower addresses, typically around 0x03axxxxx. When we run with ExecShield "off", /lib/tls/libc-2.3.2.so appears to be loaded at higher addresses, around 0xaexxxxxx.

See attachments _int_malloc1.txt, es_on.map and es_off.map.
Created attachment 106133 [details]
/proc/xxx/maps with ExecShield on
Created attachment 106134 [details]
/proc/xxxx/maps with ExecShield off
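To compare where libc lands in the two maps files, the relevant line can be pulled out mechanically. The following is only a minimal sketch, assuming the usual /proc/<pid>/maps line layout (address range, permissions, offset, device, inode, optional path); the struct and function names are hypothetical and are not part of any attachment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the fields of one maps line that matter here. */
struct map_entry {
    unsigned long start;   /* first address of the mapping */
    unsigned long end;     /* one past the last address */
    char perms[5];         /* e.g. "r-xp" */
    char path[256];        /* backing file, if any */
};

/*
 * Parse one line of a maps file.
 * Returns 1 if the line describes a file-backed mapping (a path is present),
 * 0 for anonymous mappings or lines that do not match the assumed layout.
 */
static int parse_maps_line(const char *line, struct map_entry *out)
{
    unsigned long offset;
    int n;

    assert(line != NULL && out != NULL);
    /* range, perms, offset, then skip device and inode, then the path */
    n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
               &out->start, &out->end, out->perms, &offset, out->path);
    return n == 5;
}

/* Returns 1 if the parsed entry looks like a libc mapping. */
static int is_libc_mapping(const struct map_entry *e)
{
    return strstr(e->path, "/libc-") != NULL;
}
```

Feeding each line of es_on.map and es_off.map through parse_maps_line and printing start for the entries where is_libc_mapping returns 1 would make the low-address versus high-address difference described above directly comparable.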
Barry: Have you tried this application with the 2.4.9.21-23 kernel (at this point, this is the RHEL 3 U4 release candidate kernel) that I provided you (and Arun) on 3 November? Since we agreed on that phone call that you will be qualifying your GeoViz application (and others?) against RHEL 3 U4, we'd very much like to see if the problem that you've reported here is also visible with the latest RHEL 3 U4 (.21-23) kernel.
Regards,
Sue Denham
978.884.8501 (mobile)

Barry: I've just sent you mail with the latest RHEL 3 U4 kernel respin, 2.4.9.21-24. We're hoping that this is the GA (final, customer version) kernel for RHEL 3 U4. Can you please test this with your application rather than the -23 kernel to see if the problem is still occurring? At least we'll know you've tried it with the very latest kernel....
Thanks,

Sue, please be more careful with your kernel versions. The latest RHEL3 U4 kernel in the RHN beta channel is 2.4.21-23.EL, and last night's respin is 2.4.21-24.EL.

In more than 95% of the cases we have seen, crashes inside malloc and/or free are application bugs: the application or the libraries it uses overflow malloced buffers, call free on memory not obtained from malloc/calloc/realloc, or contain similar memory handling bugs. Have you tried a memory allocation debugger on your program? To name a few:
1) valgrind (included in RHEL4 beta{1,2}, but should work on RHEL3 too)
2) setting MALLOC_CHECK_=3 in the environment
3) ElectricFence

Yes, we have used valgrind; that was one of the very first things we tried when these problems started. Valgrind shows no buffer overflow or other memory allocation/deallocation errors related to this problem.

Facts:
1. The exact same binaries do not crash under RHEL 3.0 Update 1 or Update 2.
2. The exact same binaries do not crash when ExecShield is turned off under RHEL 3.0 Update 3 or beta Update 4.
3. The "bad value" shown at _int_malloc+0x103 in the attachment "assembler instructions for _int_malloc obtained from TotalView 6.5.0-2" appears to be coming from the stack, not the heap. I would not expect our applications to be able to corrupt the stack.

The fact that the exact same binaries do not crash under some conditions and consistently crash under other conditions that we should have no ability to affect indicates to me that this is not a buffer overflow or other memory allocation/deallocation problem directly caused by our applications.

You can install glibc-debuginfo to get source for malloc and debugging information.
Anyway, here is what your assembly analysis looks like in the source:

0xb750daf4 <_int_malloc+212>: call 0xb750e260 <malloc_consolidate>
0xb750daf9 <_int_malloc+217>: mov 0xfffffff0(%ebp),%eax
    %eax = av
0xb750dafc <_int_malloc+220>: add $0x48,%eax
0xb750daff <_int_malloc+223>: mov %eax,0xffffffdc(%ebp)
0xb750db02 <_int_malloc+226>: mov 0xfffffff0(%ebp),%edi
    %edi = av
0xb750db05 <_int_malloc+229>: mov 0xffffffdc(%ebp),%ecx
0xb750db08 <_int_malloc+232>: mov 0x54(%edi),%edx
    %edx = av->bins[3] // == unsorted_chunks(av)->bk
0xb750db0b <_int_malloc+235>: mov %edx,0xffffffe4(%ebp)
    victim = unsorted_chunks(av)->bk
0xb750db0e <_int_malloc+238>: cmp %ecx,%edx
0xb750db10 <_int_malloc+240>: je 0xb750dba7 <_int_malloc+391>
0xb750db16 <_int_malloc+246>: lea 0x0(%esi),%esi
0xb750db19 <_int_malloc+249>: lea 0x0(%edi),%edi
0xb750db20 <_int_malloc+256>: mov 0xffffffe4(%ebp),%eax
    %eax = victim
0xb750db23 <_int_malloc+259>: mov 0x4(%eax),%ecx
    %ecx = victim->size
0xb750db26 <_int_malloc+262>: mov 0xc(%eax),%edx
    bck = victim->bk
0xb750db29 <_int_malloc+265>: and $0xfffffff8,%ecx
    size = chunksize(victim)
0xb750db2c <_int_malloc+268>: cmpl $0x1ff,0xffffffec(%ebp)
0xb750db33 <_int_malloc+275>: ja 0xb750db3e <_int_malloc+286>
    not jumping -> in_smallbin_range(nb)
0xb750db35 <_int_malloc+277>: cmp 0xffffffdc(%ebp),%edx
0xb750db38 <_int_malloc+280>: je 0xb750dede <_int_malloc+1214>
    not jumping -> bck != unsorted_chunks(av)
0xb750db3e <_int_malloc+286>: cmp 0xffffffec(%ebp),%ecx
0xb750db41 <_int_malloc+289>: mov 0xfffffff0(%ebp),%esi
    %esi = av
0xb750db44 <_int_malloc+292>: mov 0xffffffdc(%ebp),%edi
    %edi = unsorted_chunks(av)
0xb750db47 <_int_malloc+295>: mov %edx,0x54(%esi)
    unsorted_chunks(av)->bk = bck
0xb750db4a <_int_malloc+298>: mov %edi,0x8(%edx)
    bck->fd = unsorted_chunks(av)
0xb750db4d <_int_malloc+301>: je 0xb750dec0 <_int_malloc+1184>
0xb750db53 <_int_malloc+307>: cmp $0x1ff,%ecx
0xb750db59 <_int_malloc+313>: ja 0xb750de32 <_int_malloc+1042>
    not jumping size <= (unsigned long)(nb + MINSIZE)
0xb750db5f <_int_malloc+319>: mov %ecx,%edi
    %edi = size
0xb750db61 <_int_malloc+321>: mov 0xfffffff0(%ebp),%ecx
    %ecx = av
0xb750db64 <_int_malloc+324>: shr $0x3,%edi
    victim_index = smallbin_index(size)
0xb750db67 <_int_malloc+327>: lea 0x40(%ecx,%edi,8),%edx
    bck = bin_at(av, victim_index)
0xb750db6b <_int_malloc+331>: mov 0x8(%edx),%esi
    fwd = bck->fd
0xb750db6e <_int_malloc+334>: mov %edi,%ecx
0xb750db70 <_int_malloc+336>: and $0x1f,%ecx
0xb750db73 <_int_malloc+339>: mov $0x1,%eax
0xb750db78 <_int_malloc+344>: shl %cl,%eax
    %eax = idx2bit(victim_index)
0xb750db7a <_int_malloc+346>: mov 0xfffffff0(%ebp),%ecx
    %ecx = av
0xb750db7d <_int_malloc+349>: sar $0x5,%edi
    %edi = idx2block(victim_index)
0xb750db80 <_int_malloc+352>: or %eax,0x448(%ecx,%edi,4)
    mark_bin(av,victim_index)
0xb750db87 <_int_malloc+359>: mov 0xffffffe4(%ebp),%edi
    %edi = unsorted_chunks(av)->bk
0xb750db8a <_int_malloc+362>: mov %edx,0xc(%edi)
    victim->bk = bck
0xb750db8d <_int_malloc+365>: mov %edi,0xc(%esi)
    fwd->bk = victim
    If you SEGV now, it means fwd was NULL

struct malloc_chunk {
  INTERNAL_SIZE_T prev_size; /* Size of previous chunk (if free). */
  INTERNAL_SIZE_T size;      /* Size in bytes, including overhead. */
  struct malloc_chunk* fd;   /* double links -- used only if free. */
  struct malloc_chunk* bk;
};

typedef struct malloc_chunk* mbinptr;

#define unsorted_chunks(M) (bin_at(M, 1))
#define bin_at(m, i) ((mbinptr)((char*)&((m)->bins[(i)<<1]) - (SIZE_SZ<<1)))
#define in_smallbin_range(sz) \
  ((unsigned long)(sz) < (unsigned long)MIN_LARGE_SIZE)
#define set_head(p, s) ((p)->size = (s))

    if ( (victim = last(bin)) != bin) {
      if (victim == 0) /* initialization check */
        malloc_consolidate(av);
...
  /*
    Process recently freed or remaindered chunks, taking one only if
    it is exact fit, or, if this a small request, the chunk is remainder from
    the most recent non-exact fit. Place other traversed chunks in
    bins. Note that this step is the only place in any routine where
    chunks are placed in bins.

    The outer loop here is needed because we might not realize until
    near the end of malloc that we should have consolidated, so must
    do so and retry. This happens at most once, and only when we would
    otherwise need to expand memory to service a "small" request.
  */

  for(;;) {

    while ( (victim = unsorted_chunks(av)->bk) != unsorted_chunks(av)) {
      bck = victim->bk;
      size = chunksize(victim);

      /*
         If a small request, try to use last remainder if it is the
         only chunk in unsorted bin. This helps promote locality for
         runs of consecutive small requests. This is the only
         exception to best-fit, and applies only when there is
         no exact fit for a small chunk.
      */

      if (in_smallbin_range(nb) &&
          bck == unsorted_chunks(av) &&
          victim == av->last_remainder &&
          (unsigned long)(size) > (unsigned long)(nb + MINSIZE)) {
...
      }

      /* remove from unsorted list */
      unsorted_chunks(av)->bk = bck;
      bck->fd = unsorted_chunks(av);

      /* Take now instead of binning if exact fit */

      if (size == nb) {
...
      }

      /* place chunk in bin */

      if (in_smallbin_range(size)) {
        victim_index = smallbin_index(size);
        bck = bin_at(av, victim_index);
        fwd = bck->fd;
      }
      else {
...
      }

      mark_bin(av, victim_index);
      victim->bk = bck;
      victim->fd = fwd;
      fwd->bk = victim;
      bck->fd = victim;
    }

I believe the problem is already that chunksize(victim) is 0; that shouldn't happen and means a corruption of malloc data structures somewhere. The comment on bin_at even says this:

/* addressing -- note that bin_at(0) does not exist */
#define bin_at(m, i) ((mbinptr)((char*)&((m)->bins[(i)<<1]) - (SIZE_SZ<<1)))

Can you try MALLOC_CHECK_=3 to see if that finds something? This adds several consistency checks to malloc.
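The chain from a zero chunk size to a NULL fwd can be illustrated with a small toy model. This is only a sketch under stated assumptions: struct toy_arena, struct toy_chunk and all toy_* helpers are hypothetical stand-ins, not glibc's real malloc_state layout, and only the bin_at, chunksize and smallbin_index arithmetic quoted above is mirrored. Real bins are initialised to point at themselves, while the slots that would belong to bin 0 are never set up and stay zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy chunk header with the same field order as malloc_chunk above. */
struct toy_chunk {
    size_t prev_size;
    size_t size;
    struct toy_chunk *fd;
    struct toy_chunk *bk;
};

#define TOY_NBINS 8
#define TOY_SIZE_SZ sizeof(size_t)

/* Hypothetical, heavily reduced arena: only what the arithmetic needs. */
struct toy_arena {
    struct toy_chunk *top;             /* stand-ins for the fields before bins[] */
    struct toy_chunk *last_remainder;
    struct toy_chunk *bins[TOY_NBINS * 2];
};

/* Same arithmetic as bin_at(): a fake header whose fd/bk overlay bins[2i], bins[2i+1]. */
static char *toy_bin_at(struct toy_arena *av, size_t i)
{
    return (char *)&av->bins[i << 1] - (TOY_SIZE_SZ << 1);
}

/* Read the fd field of the fake header (memcpy avoids aliasing problems). */
static struct toy_chunk *toy_fd(char *bin)
{
    struct toy_chunk *fd;
    memcpy(&fd, bin + offsetof(struct toy_chunk, fd), sizeof fd);
    return fd;
}

/* Empty bins point at themselves; bin 0 "does not exist" and stays zero. */
static void toy_init(struct toy_arena *av)
{
    size_t i;
    memset(av, 0, sizeof *av);
    for (i = 1; i < TOY_NBINS; i++) {
        struct toy_chunk *self = (struct toy_chunk *)toy_bin_at(av, i);
        av->bins[i << 1] = self;        /* fd */
        av->bins[(i << 1) + 1] = self;  /* bk */
    }
}

/* fwd as computed on the small-bin path: chunksize -> smallbin_index -> bin_at -> fd. */
static struct toy_chunk *toy_fwd_for_size(struct toy_arena *av, size_t raw_size)
{
    size_t size = raw_size & ~(size_t)7;  /* chunksize(): mask the low flag bits */
    size_t idx = size >> 3;               /* smallbin_index() */
    assert(idx < TOY_NBINS);
    return toy_fd(toy_bin_at(av, idx));
}
```

In this model, after toy_init, toy_fwd_for_size(&av, 16) yields the self-pointer of bin 2, whereas toy_fwd_for_size(&av, 0) yields NULL. A store through that NULL fwd corresponds to the faulting "fwd->bk = victim" at _int_malloc+365.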
If you can reproduce that segfault, start looking for when a chunk with 0 size appeared on the unsorted chunks doubly linked list, or when a correct chunk on that list had its size overwritten with 0. I believe Exec-Shield being on or off matters here only in that it drastically changes the memory layout of the application, and either some code somewhere can't cope with that, or there has been a bug lurking all along that just doesn't show up with the traditional memory layout.

We tried MALLOC_CHECK_ (at 3 and also 1). We can consistently get "malloc: top chunk is corrupt". HOWEVER, I tried to use strace to get an idea of where/when the _corruption_ occurred, and the message timing seems to be somewhat random: we see it consistently during dynamic loading after one of the Oracle client libs is loaded, but by changing just about anything in the environment we see the message in different locations, including substantially after dynamic loading has finished and application code is running.

Paranoid about MALLOC_CHECK_ testing with randomized memory layout, I tried MALLOC_CHECK_ with mozilla & soffice, and both emit the "top chunk..." message. If I change /proc/sys/kernel/exec-shield-randomize to "0", then soffice runs without the message (I did NOT try the same experiment with mozilla).

If needed, contact Barry for the stderr output captured from one of the strace/MALLOC_CHECK_ runs.

When/if I get a chance, I will attempt to read glibc/malloc/hooks.c:top_check() and see what is happening, but for now we're looking at other ways to narrow down the original issue(s).
Created attachment 107619 [details]
/proc/<pid>/maps a while after SIGSEGV
Created attachment 107620 [details]
strace of application
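The suggestion to find out when a zero-size chunk appears on the unsorted list could be approached with a consistency walker called from a debugger or from instrumentation. The following is a minimal sketch over a toy doubly linked list, not glibc code: struct toy_chunk, enum list_status and check_unsorted_list are hypothetical names, and the list head is assumed to be a fake chunk like the bin headers in the source excerpt above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy chunk header with the same field order as malloc_chunk. */
struct toy_chunk {
    size_t prev_size;
    size_t size;
    struct toy_chunk *fd;
    struct toy_chunk *bk;
};

enum list_status {
    LIST_OK = 0,
    LIST_ZERO_SIZE,    /* a chunk whose size (flag bits masked) is 0 */
    LIST_BROKEN_LINK,  /* NULL link, or p->bk->fd does not lead back to p */
    LIST_TOO_LONG      /* more than max_nodes chunks: probably a bad cycle */
};

/*
 * Walk the list from head->bk along bk, the direction in which the
 * unsorted list is consumed in the loop quoted above.
 * On failure, *bad is the first offending chunk (NULL if the link itself
 * was NULL); on success, *bad is NULL.
 */
static enum list_status check_unsorted_list(const struct toy_chunk *head,
                                            size_t max_nodes,
                                            const struct toy_chunk **bad)
{
    const struct toy_chunk *p;
    size_t n = 0;

    assert(head != NULL && bad != NULL);
    *bad = NULL;
    for (p = head->bk; p != head; p = p->bk) {
        *bad = p;
        if (p == NULL)
            return LIST_BROKEN_LINK;
        if ((p->size & ~(size_t)7) == 0)
            return LIST_ZERO_SIZE;
        if (p->bk == NULL || p->bk->fd != p)
            return LIST_BROKEN_LINK;
        if (++n > max_nodes)
            return LIST_TOO_LONG;
    }
    *bad = NULL;
    return LIST_OK;
}
```

Running such a check at intervals (for example after each shared library finishes loading) would narrow down the window in which the list first becomes inconsistent, independent of where the eventual SEGV occurs.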
Just a note on Comment #14: we used LD_DEBUG with MALLOC_CHECK_ in order to get execution timing information for the "top chunk" error during the loading/initializing of the .so's.

See also bug 154759.

I'm closing this bug. There has been no reaction from the reporter and we see signs of memory corruption (which are in all likelihood the application's fault). Reopen in case there is news.