Bug 74271 - Random segmentation faults under IO stress
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686 Linux
Priority: medium  Severity: high
Assigned To: Larry Woodman
QA Contact: Brian Brock
Depends On:
Blocks:
Reported: 2002-09-19 05:53 EDT by Need Real Name
Modified: 2007-11-30 17:06 EST
1 user

See Also:
Fixed In Version: 2.4.9-e8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2003-06-23 13:50:57 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
A sample core dump, awk this time (not very helpful) (104.00 KB, application/octet-stream)
2002-09-19 06:09 EDT, Martin Wilck
Analysis of segfault at 0x40001ec1 (_dl_start) (7.62 KB, text/plain)
2002-09-19 15:05 EDT, Martin Wilck

Description Need Real Name 2002-09-19 05:53:09 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)

Description of problem:
Random processes catch SIGSEGV under high IO stress. These are often "small" 
processes like "basename" and "tee" (which our IO stress test uses a lot). 

Examining the core dumps of the crashed apps showed nothing - gdb would simply 
say that the process was at an invalid address.

Further inspection (through debug output printed from do_page_fault()) has shown 
that these processes get SIGSEGV because they are trying to access invalid 
memory addresses; these addresses are themselves random and not related in 
any way to the process's virtual memory regions.

In our opinion this indicates either a memory corruption or a race condition of 
some sort.


Version-Release number of selected component (if applicable):
2.4.9-e.3enterprise, 2.4.9-e.8enterprise

How reproducible:
Always

Steps to Reproduce:
1. Start high-load IO stress test on filesystems connected over FibreChannel 
(Emulex LP9000 or Qlogic 2x00)
2. Wait 3-12h. 
	
Actual Results:  Random processes catch SIGSEGV. These are often "small" 
processes like "basename" and "tee" (which our IO stress test uses a lot). 

Expected Results:  IO stress test runs cleanly.

Additional info:

Hardware info: The tests were carried out on different hardware.

Basic test configuration:
4way P4 Xeon (with and without HT), 4GB RAM, Serverworks HE,
aic7899P onboard, ATI Rage XL 8MB, either Emulex LP9002 
(driver 4.20p from Emulex) or Qlogic 2200 (driver qla2x00
4.27beta, the RedHat-supplied qla2200 driver doesn't see all LUNs
on our FC hardware). 

We have varied the physical memory between 1 and 4GB and carried out
similar tests on a 4way PIII machine, with no difference.
Only using the UP kernel apparently solved the problem (see below).

The stress tests were carried out with our "copy-compare" test suite.
This program copies a set of files of different sizes containing predefined 
patterns back and forth between 2 directories. After each copy operation, the 
file integrity is checked a) with cmp(1) and b) by checking the pattern. We 
will soon make this suite publicly available.
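Since the suite itself is not yet public, here is a minimal sketch of what one copy-compare worker does; all file names, the pattern contents, and the pass count are illustrative, not taken from the actual suite:

```shell
#!/bin/sh
# Hypothetical sketch of one copy-compare worker (names are illustrative).
SRC=dir_a
DST=dir_b
mkdir -p "$SRC" "$DST"

# Create a test file filled with a predefined pattern.
i=0
while [ $i -lt 100 ]; do echo "PATTERN-$i"; i=$((i+1)); done > "$SRC/f1"

# Copy back and forth, verifying integrity after each copy
# a) with cmp(1) and b) by checking the pattern.
for pass in 1 2 3; do
    cp "$SRC/f1" "$DST/f1"
    cmp "$SRC/f1" "$DST/f1" || { echo "MISMATCH in pass $pass"; exit 1; }
    grep -q "PATTERN-99" "$DST/f1" || { echo "BAD PATTERN in pass $pass"; exit 1; }
    cp "$DST/f1" "$SRC/f1"
done
echo "all passes clean"
```

The real test runs 8 such workers in parallel on separate directories for hours, which is what produces the memory pressure described below.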

In the test situation described, we start 8 simultaneous copy-compare tests, 
resulting in a load average of 60-100 and ~300 processes in the process table. 
Each process works on a different directory on an ext2 or ext3 file system on 
a disk accessed through a Fibre Channel controller. 

The problem appears to occur only on SMP systems (on UP, the test ran fine for 
>24h).

Although this error pattern (SIGSEGV) occurs mainly with Fibre Channel, we have 
also seen it with the dpt_i2o driver. During similar tests on drives attached 
to an aic7xxx we had kernel crashes (we have never had crashes with qla2x00 or 
lp9000 so far). If our assumption of memory corruption or a race condition is 
right, whether a panic or a SIGSEGV occurs is only a question of whether the 
problem hits kernel or user space.

Thus, the problem may be related to BUG 67608.
Comment 1 Martin Wilck 2002-09-19 06:09:32 EDT
Created attachment 76565 [details]
A sample core dump, awk this time (not very helpful)
Comment 2 Martin Wilck 2002-09-19 09:12:18 EDT
I have to correct our (my!) previous statement that the offending addresses
follow no systematic pattern.  

After reviewing my material, I found:

- ALL segmentation faults occur in ld.so code, at program initialization.

- The error occurs at one of 9 EIP values (see below). Two of these locations
  account for 60% and 20% of all application crashes, respectively, and
  occur in several independent test runs where the machine had been rebooted
  in the meantime.

- For each EIP location, the offending address that causes the SEGV is always
  the same; of course addresses differ for different EIP locations.
  The first statement is of course only significant for the locations that
  we have hit several times.

- The frequency of crashes in different apps is roughly proportional to the
  frequency with which they are started by our test script. This makes sense
  because the crashes occur at program startup.

Summary of SIGSEGV over several test runs
=========================================

Grand total: 50 (3 runs)

EIP         ADDRESS  EXPR         FUNCTION          TOTAL   RUNS            
0x40001ec1  00015a1c (%esi)       _dl_start         9       09/12,09/14     
0x400020da  00750080 (%edx)       _dl_start         2       09/13,09/14     
0x400024a6  00000000 (%edx)       dl_main           1       09/14           
0x400025d2  00000154 0x154(%eax)  dl_main           1       09/14           
0x400077b4  80001715 %es:(%edi)   _dl_map_object    1       09/14           
0x40009de5  00000094 (%eax)       _dl_setup_hash    1       09/12           
0x40009ede  0075008c 0xc(%edx)    _dl_new_object    31      09/12,09/13,09/14
0x4000db39  0000002c 0x2c(%edx)   match_symbol      1       09/14           
0x40010982  40001600 %es:(%edi)   calloc            2       09/12           

Application crash statistics:
=========================================
awk:            10
basename:       10
tee:            10
date:           6
cp:             5
cmp:            3
rm:             3
cpat32:         1
expr:           1
sleep:          1
Comment 3 Martin Wilck 2002-09-19 15:05:57 EDT
Created attachment 76644 [details]
Analysis of segfault at 0x40001ec1 (_dl_start)
Comment 4 Martin Wilck 2002-09-19 15:07:35 EDT
The attachment I just created gives a strange impression.
It appears that exactly 1 bit is wrong, and this must have occurred during
2 assembler instructions. OTOH, this fault occurred 9 times during our tests.
Comment 5 Martin Wilck 2002-09-23 10:34:45 EDT
Things are getting a bit clearer.
This is the relevant part from the function _dl_start:

    1e8b:	8d 83 c4 c5 fe ff    	lea    0xfffec5c4(%ebx),%eax
    1e91:	2b 83 74 00 00 00    	sub    0x74(%ebx),%eax
    1e97:	89 85 60 fd ff ff    	mov    %eax,0xfffffd60(%ebp)
    1e9d:	89 85 d8 fd ff ff    	mov    %eax,0xfffffdd8(%ebp)
    1ea3:	03 03                	add    (%ebx),%eax
    1ea5:	89 c6                	mov    %eax,%esi
    1ea7:	85 f6                	test   %esi,%esi
    1ea9:	89 85 e0 fd ff ff    	mov    %eax,0xfffffde0(%ebp)
    1eaf:	0f 84 eb 00 00 00    	je     1fa0 <_dl_start+0x144>
    1eb5:	8b 95 60 fd ff ff    	mov    0xfffffd60(%ebp),%edx
    1ebb:	89 95 5c fd ff ff    	mov    %edx,0xfffffd5c(%ebp)
    1ec1:	8b 06                	mov    (%esi),%eax 
**** SIGSEGV is caught at this instruction. ****
User-mode register dump:   
        eax: 00015a1c   ebx: 40015898   ecx: 00000000   edx: 00000000
        esi: 00015a1c   edi: 00000000   ebp: bffed554   esp: bffed27c

1 %ebx contains the address of the Global Offset Table (GOT) (correct)
2 The offending address in %esi comes from %eax (1ea5).
3 %eax is calculated in lines 1e8b, 1e91, and 1ea3.
4 1e8b can hardly go wrong; it puts the value 0x40001e5c into %eax.
5 1e91 subtracts the value in 0x74(%ebx), i.e. the value at 0x4001590c.
6 This is a value in the GOT that should be 0x00001e5c before bootstrapping,
  and 0x40001e5c after bootstrapping.
  Since we run before bootstrapping, %eax should now be 0x40000000.
7 The resulting value is stored in 0xfffffd60(%ebp) and 0xfffffdd8(%ebp).
  The value of 0xfffffd60(%ebp) occurs in the register dump in %edx (1eb5).
8 1ea3 adds 0x00015a1c, the value in (%ebx).
  %eax should now be 0x40015a1c.
9 The result is in %eax and %esi at SIGSEGV time, and in 0xfffffde0(%ebp)

=> The register dump shows that %eax contains 0x00015a1c instead of 0x40015a1c.
   Inspection of %edx shows that the instruction that "went wrong" must have
   been 1e91.
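The arithmetic of steps 4-8 can be checked directly, using only values quoted in this comment (the lea constant from 1e8b, the two possible GOT states of 0x74(%ebx), and the first GOT entry added at 1ea3):

```shell
# Check the arithmetic of instructions 1e8b / 1e91 / 1ea3.
lea_val=$(( 0x40001e5c ))    # 1e8b: value lea leaves in %eax
got_fresh=$(( 0x00001e5c ))  # 0x74(%ebx) in a fresh (non-bootstrapped) GOT
got_boot=$(( 0x40001e5c ))   # 0x74(%ebx) after ld.so has bootstrapped
got_first=$(( 0x00015a1c ))  # (%ebx): first GOT entry, added at 1ea3

# Fresh GOT: 1e91 yields 0x40000000, 1ea3 then produces a valid address.
printf 'fresh GOT:        0x%08x\n' $(( lea_val - got_fresh + got_first ))

# Already-bootstrapped GOT: 1e91 yields 0, and 1ea3 produces exactly the
# invalid address found in %esi at the faulting instruction 1ec1.
printf 'bootstrapped GOT: 0x%08x\n' $(( lea_val - got_boot + got_first ))
```

The first printf gives 0x40015a1c (the correct pointer), the second 0x00015a1c (the faulting address from the register dump), which is what points the finger at the state of the GOT page.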

I have dumped the procedure stack and GOT when these conditions were found, 
and indeed I see the value 0x40001e5c in 0x74(%ebx), that is, we are looking
at an "already-bootstrapped" Global Offset Table! Here is a typical excerpt
from the GOT at this stage:

[ Read lines from right to left ]
4000d000 4000f634 00000000 4000e000 4000da00 00000000 00000000 00015a1c
                                                               ^start of GOT
400109c4 40004a44 4000d61c 00000000 40010948 40008224 400089e4 40010610
40006efc 40008038 40004880 4000a10c 4000e608 40011538 4000d624 4000e5a8
40015888 40015c60 40001e5c 4000b6d8 4001099c 400116d8 4000cdc0 4000cbd0
                  ^0x74(%ebx)

I found this about 20 times when the error described here occurred.
This explains why the operation at 1e91 yields 0x00000000 instead of
0x40000000.
 
NOW it becomes interesting: there was one exception. 
In exactly one case, I found this:

00001ca2 00001c92 00001c82 00001c72 00001c62 00000000 00000000 00015a1c
                                                               ^start of GOT
00001d22 00001d12 00001d02 00001cf2 00001ce2 00001cd2 00001cc2 00001cb2
00001da2 00001d92 00001d82 00001d72 00001d62 00001d52 00001d42 00001d32
00015888 00000000 00001e5c 00001df2 00001de2 00001dd2 00001dc2 00001db2
                  ^0x74(%ebx)

This is a non-bootstrapped GOT, as it should be!
However, the register and stack dumps look exactly as in the other cases.
That is, at 1e91 the processor must have read 0x40001e5c, while at 1ec1 it is 0x00001e5c.
Comment 6 Martin Wilck 2002-09-23 13:06:52 EDT
Let me summarize the most important results:

1 The fault happens only a couple of instructions after the execve() system
  call has finished and the new process has started to run in user mode.
2 The GOT is in a memory area that is PRIVATE and writable.
  This memory area ranges from 0x40015000 to 0x40016000 (1 page only).
3 The instruction at 1e91 is the first time the process reads from this memory
  area. It is a read access, therefore the page should not have been copied,
  and should still point to the original, unmodified mapping of ld.so.
4 Instead, it seems to map to a modified page where some other process has
  already bootstrapped itself.
5 The "exceptional" situation where we find the unmodified GOT at SIGSEGV time
  seems to indicate that in this case the correct mapping was established
  somewhere between instructions 1e91 and 1ec1 (??). 

It appears that there is some race condition with multiple processes mapping
this page (and the page potentially being swapped out, since this happens
only at extreme memory pressure).

Here is a sample vmstat output at full load:
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
11 63  2  60256   8644 3301492 495632   0   0  8088 11174 1013  5018  22  38 40
 0 70  2  60256   8060 3300808 493932   0   0  5657 15569 1105  3694  16  29 54
14 60  2  60256   5112 3297952 502980   0   0  6488  6554  970  5495  25  47 29
18 54  4  60256  13596 3293924 497412   0   0  9315 16562 1361  4241  23  42 35

Comment 7 Martin Wilck 2002-09-23 13:14:26 EDT
The above results indicate that something is going wrong in the context of
do_no_page(), do_wp_page(), and filemap_nopage() (as well as the other
nopage() functions), probably only when swapping or page cache invalidation
is going on at the same time.

I found the following comment by Linus on 2.4.14.pre8:
(http://marc.theaimsgroup.com/?l=linux-kernel&m=100483865518735&w=2)
  "Special thanks to Andrea - we spent too much time tracking down a subtle
   sigsegv problem, but we got it in the end."

The patch in question modifies a lot of code related to the page cache, 
memory mapping, and paging-on-demand / copy-on-write code. 
These parts are obviously not included in the RedHat 2.4.9-e.x series.

We have strong reasons to believe that applying this patch would solve 
the SIGSEGV problems shown here.

Comment 8 Martin Wilck 2002-09-25 09:48:45 EDT
I discovered another aspect that indicates that there may be a second, unrelated
bug that causes the SIGSEGV problem to become relevant under AS2.1,
at least with the load pattern we are currently generating.

Look again at the vmstat under full load:

 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
11 63  2  60256   8644 3301492 495632   0   0  8088 11174 1013  5018  22  38 40
 0 70  2  60256   8060 3300808 493932   0   0  5657 15569 1105  3694  16  29 54
14 60  2  60256   5112 3297952 502980   0   0  6488  6554  970  5495  25  47 29
18 54  4  60256  13596 3293924 497412   0   0  9315 16562 1361  4241  23  42 35

Here is a vmstat of a RedHat 7.3 machine running a very similar test on very
similar hardware:

 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2 56  2      0 517140 282712 1492028   0   0     0 26146  524  2192  26  55 20
29 31  2      0 216328 282892 1787788   0   0     0 37807  816  2636   1   9 90
31 29  2      0 453808 283208 1483000   0   0     0 34402  525  1828  39  61  1
36 19  2      0 524104 283436 1447124   0   0     0 22470  572  2243  13  31 56

Note the difference in the "buff" values: <300MB on 7.3, >3GB on AS2.1!

It is typical for our current test pattern that the "bi" column (blocks read
from block devices) is almost constant 0. The reason for this is that files 
are copied and compared, and immediately removed when that operation is
finished. OTOH, blocks are constantly streaming out at ~10-20k/s. I am no VM
expert, but most likely under normal conditions the kernel detects that
the blocks in the buffer cache don't belong to any valid inode anymore,
and puts them back into the free memory pool. This seems not to work under 
AS2.1. The >3GB of buffers cannot be explained in any way; the total amount of
files being copied and compared in the test series in question was around
700MB. Even if these were in memory twice, that would account for no more than
1.4GB, and you'd expect most of that to be in the page cache rather than in
the buffer cache.

As I said, my knowledge about VM is insufficient to tell whether this is really
a bug, but it surely looks weird. The problem may also have to do with the
Emulex LP9000 driver, although the same driver does the work in the RH7.3
example above, too.

In any case, if the memory patterns under AS2.1 were like those under RH7.3, 
the SIGSEGV problem would not have occurred in this test series, because the
system would never have run short of memory. Other test patterns, or systems
with less memory, might generate the SIGSEGV problem nonetheless; we don't know.

If you follow vmstat output during the tests on AS2.1, you can watch the buffer
size growing slowly to those >3GB in ~3 hours, until the system starts swapping.
SIGSEGV starts another hour later. Thus, _if_ this is a bug, it 
is independent of the SIGSEGV problem itself.
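Besides following vmstat, the same growth can be tracked from the Buffers line of /proc/meminfo; a small sampling loop like the one below would do (the interval and sample count here are arbitrary, chosen so the sketch terminates):

```shell
# Sample the buffer-cache size a few times from /proc/meminfo.
# Under the AS2.1 test load described above, this value climbs
# to >3GB over ~3 hours before swapping sets in.
n=0
while [ $n -lt 3 ]; do
    awk '/^Buffers:/ {print $2, $3}' /proc/meminfo
    sleep 1
    n=$((n+1))
done
```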
Comment 9 Martin Wilck 2002-09-25 09:53:33 EDT
Short note: The RH7.3 vmstat pattern above was generated after >24 hours of
constant IO load.
Comment 10 Arjan van de Ven 2002-09-25 11:31:42 EDT
emulex hardware is unsupported
Which qlogic driver are you guys using?
Comment 11 Arjan van de Ven 2002-09-25 11:33:57 EDT
oh sorry
qlogic 4.x I see
also unsupported

please use supported drivers
Comment 12 Martin Wilck 2002-10-21 06:28:37 EDT
Can you explain to me how the driver used can possibly matter for the problem we
describe? This is obviously a problem related to memory management, memory
mapping, and (perhaps) process creation and I cannot imagine how a block device
driver could mess this stuff up.

Btw we have strong reasons to use the Qlogic driver we're using, but that's a
different issue.
Comment 13 Arjan van de Ven 2002-10-21 06:33:25 EDT
qlogic 4.x we know corrupts data/memory
so yes that matters
Comment 14 Martin Wilck 2002-10-21 07:20:02 EDT
1) This is not a memory corruption bug.

2) I hope you are not telling me that both the qlogic 4.x and Emulex drivers
   accidentally have the same bug that causes the above problem --
   under AS2.1 only. Note the same drivers work perfectly well on RH 7.3 and
   other distributions, on a broad range of hardware.

3) I don't know which 4.x driver version you're referring to. The one we are 
   using is EMC-certified and (unlike the driver provided by RedHat) has passed
   all IO stress and data integrity tests so far in our lab. 

Comment 15 Martin Wilck 2002-11-06 08:35:41 EST
Apparently the problem is solved with 2.4.9-e.9 - no segfaults anymore.
We need to test this more thoroughly, though.

Looking at the changes between e.8 and e.9, only the "2.4.18-pge" patch seems
to be a likely candidate that could have caused the improvement. Would that be
possible?

The strange behaviour with the high "buff" numbers in vmstat is still observed.
Comment 16 Martin Wilck 2002-11-18 02:56:00 EST
This bug can be closed. Fixed with 2.4.9-e.9.
