Bug 194533 - veritas storage foundation 32bit apps crash in glibc during post-process installation
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Assigned To: Tom Coughlan
QA Contact: Brian Brock
Duplicates: 196573
Depends On:
Blocks: 181409 181411 196056

Reported: 2006-06-08 15:28 EDT by Barry Marson
Modified: 2013-04-02 19:51 EDT
Fixed In Version: RHSA-2006-0575
Doc Type: Bug Fix
Last Closed: 2006-08-10 19:30:38 EDT

Attachments
fix two corruption bugs (6.13 KB, patch)
2006-06-27 12:27 EDT, Tom Coughlan
rev 2 (3.23 KB, patch)
2006-06-29 11:50 EDT, Tom Coughlan

Comment 1 Jakub Jelinek 2006-06-08 16:09:26 EDT
In glibc-2.3.4-2.22.i686 _nl_load_locale_from_archive really calls __sysconf
(_SC_PAGE_SIZE) at that relative spot within a 4K page and in that case
0x004344d0 corresponds to the very first instruction in __sysconf, which is:
0008c4d0 <__sysconf>:
   8c4d0:       55                      push   %ebp
   8c4d1:       ba b0 0d 03 00          mov    $0x30db0,%edx
   8c4d6:       89 e5                   mov    %esp,%ebp
But there really aren't many reasons why that instruction would segfault,
especially when the caller stored some words to %esp:
   1f3eb:       c7 04 24 1e 00 00 00    movl   $0x1e,(%esp,1)
   1f3f2:       e8 d9 d0 06 00          call   8c4d0 <__sysconf>
One possible reason is that it hits RLIMIT_STACK (info regs would be helpful
in that case).  It would also help to know whether that part of libc.so's .text
is still mapped at that point (cat /proc/<pid>/maps when it gets the segfault
under gdb).
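
Editorial sketch: the check suggested above (is the faulting address still inside a mapped, executable part of libc.so?) can be done mechanically once the maps text has been captured. The Python below is only an illustration. The sample maps excerpt is hypothetical: its load base 0x003a8000 is simply 0x004344d0 minus the 0x8c4d0 offset of __sysconf quoted above, and the end addresses and inode are invented.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps(text: str) -> List[Mapping]:
    """Parse the text of a /proc/<pid>/maps file into Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings


def find_mapping(mappings: List[Mapping], addr: int) -> Optional[Mapping]:
    """Return the mapping that contains addr, or None if addr is unmapped."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m
    return None


# Hypothetical excerpt, NOT taken from the affected machines.
SAMPLE_MAPS = (
    "003a8000-004cf000 r-xp 00000000 08:02 123456     /lib/tls/libc-2.3.4.so\n"
    "004cf000-004d1000 rw-p 00127000 08:02 123456     /lib/tls/libc-2.3.4.so\n"
)

if __name__ == "__main__":
    hit = find_mapping(parse_maps(SAMPLE_MAPS), 0x004344D0)
    print(hit)
```

If find_mapping returns None, or a mapping whose perms lack "x", the segfault at the first instruction of __sysconf would be explained by the mapping itself rather than by the instruction.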
Comment 4 Barry Marson 2006-06-14 10:52:23 EDT
The following are the notes taken to try and bring back all the reporting in a
single comment:

------- Additional Comments From bmarson@redhat.com  2006-06-08 23:37 EST -------
veritas3 in the same subnet was showing the same problems.  It's different in 
that it has a 05/26 tree installed and has been tested with the -38 and -39 
kernels.  We experienced intermittent failures here.  It seems if we remove 
the veritas bits and then rebuild the rpm database and reboot it works fine.

Both systems were built either with PXE install latest (from 
qafiler.boston.redhat.com) or with a boot CD from the same tree areas. Would 
it help if I rebuilt one of them fresh ?

Barry


------- Additional Comments From jakub@redhat.com  2006-06-09 02:19 EST -------
Both systems are offline now, so I can't investigate now.
I guess backing up the corrupted libc
(/usr/tmp/jbm/libc-2.3.4.so-prelinked-corrupted) to some other host (or a
partition that won't be reformatted) and reinstalling wouldn't be a bad idea.
That way we can see if it was a random corruption or if some program
intentionally overwrites part of libc.so.

------- Additional Comments From bmarson@redhat.com  2006-06-09 09:18 EST -------
I was planning on rebuilding veritas3 with the 05/06 tree and not touching
veritas2 at this time.  That way we will have a system to compare and see where
things deteriorate.  We are having lab network trouble.  Once resolved, I will
stage veritas3

Barry

------- Additional Comments From bmarson@redhat.com  2006-06-09 09:42 EST -------
That's the 06/05 tree.  Networking has been restored, so I'll be staging
veritas3 now.

------- Additional Comments From jakub@redhat.com  2006-06-09 11:04 EST -------
I just looked at veritas3.rhts and there libc.so is crippled as well.
It was not prelinked, so a simple cmp -l against the lib/tls/libc-2.3.4.so
extracted with rpm2cpio glibc-2.3.4-2.22.i686.rpm | cpio -id was possible.
It seems a big chunk of libc.so's .text section was overwritten, from file offset
0x90000 to 0x9b000.
The corrupted file part starts with (from hexdump -C):
00090000  20 00 00 32 4d 65 67 61  52 41 49 44 4c 44 20 30  | ..2MegaRAIDLD 0|
00090010  20 52 41 49 44 30 20 20  31 33 39 47 35 32 31 58  | RAID0  139G521X|
00090020  00 00 00 00 02 00 00 00  b8 74 27 a0 a8 1b 17 a0  |.........t'.....|
00090030  00 00 00 00 f8 a0 9b a0  f8 a0 9b a0 24 77 27 a0  |............$w'.|
00090040  70 74 27 a0 4c 3e 07 a0  2c 36 07 a0 0c 00 00 00  |pt'.L>..,6......|
00090050  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00090060  1e 04 01 00 00 01 00 00  00 00 00 00 00 00 00 00  |................|
00090070  00 00 00 00 00 00 80 0f  11 26 06 a0 a8 60 0f a0  |.........&...`..|
00090080  00 00 00 00 00 00 00 00  18 98 13 a0 00 00 02 00  |................|
00090090  02 00 00 00 00 e4 6e 01  00 00 00 00 cf 51 01 06  |......n......Q..|
000900a0  00 00 80 57 d4 74 27 a0  3c 4e 06 a0 ac 11 09 a0  |...W.t'.<N......|
000900b0  38 75 27 a0 f0 74 27 a0  24 da 08 a0 00 00 00 00  |8u'..t'.$.......|
000900c0  48 fd 00 00 00 00 00 00  00 00 00 02 1c 00 00 00  |H...............|
000900d0  10 75 27 a0 1c 00 00 00  00 00 00 00 00 00 00 00  |.u'.............|
000900e0  00 00 00 00 00 00 00 00  00 00 00 00 05 00 00 00  |................|
000900f0  00 00 00 00 a9 75 27 a0  0a 00 00 00 00 00 00 00  |.....u'.........|
00090100  00 00 00 00 a8 75 27 a0  4c 75 27 a0 3c 75 27 a0  |.....u'.Lu'.<u'.|
00090110  fc d9 0d a0 0c da 0d a0  00 00 00 00 78 75 27 a0  |............xu'.|
00090120  50 75 27 a0 30 17 09 a0  ec d9 0d a0 1e 71 0e a0  |Pu'.0........q..|
00090130  02 00 00 00 01 00 00 00  2c 76 27 a0 0c 00 00 00  |........,v'.....|
00090140  ec 75 27 a0 2c 76 27 a0  e8 75 27 a0 7c 75 27 a0  |.u'.,v'..u'.|u'.|
...
000908c0  00 00 00 00 00 00 00 00  fc 03 00 00 56 41 4c 49  |............VALI|
000908d0  44 41 54 49 4f 4e 3d 4e  6f 6e 65 00 56 45 52 53  |DATION=None.VERS|
000908e0  49 4f 4e 53 3d 42 54 42  4c 5f 44 2e 32 2e 31 2e  |IONS=BTBL_D.2.1.|
000908f0  39 2c 42 49 4f 53 5f 48  34 33 30 2c 43 54 4c 4d  |9,BIOS_H430,CTLM|
00090900  5f 55 38 32 37 2c 41 50  50 5f 35 32 31 58 00 46  |_U827,APP_521X.F|
00090910  6c 61 73 68 61 62 6c 65  3d 31 30 32 38 5f 30 30  |lashable=1028_00|
00090920  31 33 5f 31 30 32 38 5f  30 31 36 44 00 00 00 00  |13_1028_016D....|
00090930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...
so it might give some clue on what might have corrupted the C library.

On veritas2.rhts the corruption started at a slightly different spot, but with
apparently the same data:
0008c000  20 00 00 32 4d 65 67 61  52 41 49 44 4c 44 20 30  | ..2MegaRAIDLD 0|
0008c010  20 52 41 49 44 30 20 20  31 33 39 47 35 32 31 58  | RAID0  139G521X|
0008c020  00 00 00 00 01 00 00 00  b8 74 27 a0 64 1b 17 a0  |.........t'.d...|
0008c030  00 00 00 00 7c a0 9b a0  7c a0 9b a0 24 77 27 a0  |....|...|...$w'.|
0008c040  70 74 27 a0 4c 3e 07 a0  2c 36 07 a0 0c 00 00 00  |pt'.L>..,6......|
0008c050  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0008c060  1e 04 01 00 00 01 00 00  00 00 00 00 00 00 00 00  |................|
0008c070  00 00 00 00 00 00 80 0f  11 26 06 a0 a8 60 0f a0  |.........&...`..|
0008c080  00 00 00 00 00 00 00 00  18 95 13 a0 00 00 01 00  |................|
0008c090  01 00 00 00 00 a2 a7 7d  00 00 00 00 cf 51 01 06  |.......}.....Q..|
0008c0a0  00 00 80 57 d4 74 27 a0  3c 4e 06 a0 ac 11 09 a0  |...W.t'.<N......|
0008c0b0  38 75 27 a0 f0 74 27 a0  24 da 08 a0 00 00 00 00  |8u'..t'.$.......|
0008c0c0  48 fd 00 00 00 00 00 00  00 00 00 02 1c 00 00 00  |H...............|
0008c0d0  10 75 27 a0 1c 00 00 00  00 00 00 00 00 00 00 00  |.u'.............|
0008c0e0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
0008c0f0  00 00 00 00 a8 75 27 a0  0a 00 00 00 00 00 00 00  |.....u'.........|
0008c100  00 00 00 00 a8 75 27 a0  4c 75 27 a0 3c 75 27 a0  |.....u'.Lu'.<u'.|
0008c110  fc d9 0d a0 0c da 0d a0  00 00 00 00 78 75 27 a0  |............xu'.|
0008c120  50 75 27 a0 30 17 09 a0  ec d9 0d a0 1e 71 0e a0  |Pu'.0........q..|
0008c130  02 00 00 00 01 00 00 00  2c 76 27 a0 0c 00 00 00  |........,v'.....|
0008c140  ec 75 27 a0 2c 76 27 a0  e8 75 27 a0 7c 75 27 a0  |.u'.,v'..u'.|u'.|
0008c150  14 14 09 a0 8c 16 09 a0  00 00 00 00 a8 75 27 a0  |.............u'.|
0008c160  01 00 00 00 00 12 09 a0  00 00 00 20 cf 51 01 06  |........... .Q..|
0008c170  00 00 80 57 60 bf 1a a0  ff ff ff ff 00 00 00 00  |...W`...........|
0008c180  20 10 ab 7d 33 34 00 00  3c 76 27 c0 03 00 00 00  | ..}34..<v'.....|
0008c190  0f 00 00 00 fe 00 00 00  00 6b 17 a0 00 00 00 02  |.........k......|
0008c1a0  5c 76 27 a0 e4 75 27 a0  b8 9e 0d a0 30 26 06 a0  |\v'..u'.....0&..|
0008c1b0  a8 60 0f a0 00 00 00 00  00 04 00 00 10 e4 ff ff  |.`..............|
0008c1c0  d4 11 09 a0 4c 12 09 a0  48 76 27 a0 03 00 00 00  |....L...Hv'.....|
0008c1d0  cf 51 01 06 00 00 80 57  08 76 27 a0 00 00 00 00  |.Q.....W.v'.....|
0008c1e0  20 10 ab 7d 00 00 00 00  a0 76 27 c0 03 00 00 00  | ..}.....v'.....|
...
0008c8c0  00 00 00 00 00 00 00 00  fc 03 00 00 56 41 4c 49  |............VALI|
0008c8d0  44 41 54 49 4f 4e 3d 4e  6f 6e 65 00 56 45 52 53  |DATION=None.VERS|
0008c8e0  49 4f 4e 53 3d 42 54 42  4c 5f 44 2e 32 2e 31 2e  |IONS=BTBL_D.2.1.|
0008c8f0  39 2c 42 49 4f 53 5f 48  34 33 30 2c 43 54 4c 4d  |9,BIOS_H430,CTLM|
0008c900  5f 55 38 32 37 2c 41 50  50 5f 35 32 31 58 00 46  |_U827,APP_521X.F|
0008c910  6c 61 73 68 61 62 6c 65  3d 31 30 32 38 5f 30 30  |lashable=1028_00|
0008c920  31 33 5f 31 30 32 38 5f  30 31 36 44 00 00 00 00  |13_1028_016D....|
0008c930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

So, this doesn't sound like a glibc bug, but can be either hw related, or kernel
(lvm and such), or the veritas stuff that is installed on top of that.
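
Editorial sketch: the cmp -l comparison described above can be summarized per page, which makes the "whole pages overwritten" pattern easy to see. The Python below is a minimal illustration, assuming a 4096-byte page size; the function names are made up for this example and are not part of any tool used in this report.

```python
PAGE_SIZE = 4096  # assumed page size on these systems


def corrupted_page_ranges(good: bytes, bad: bytes, page_size: int = PAGE_SIZE):
    """Return (start, end) file-offset ranges covering runs of differing pages."""
    ranges = []
    length = max(len(good), len(bad))
    run_start = None
    for off in range(0, length, page_size):
        differs = good[off:off + page_size] != bad[off:off + page_size]
        if differs and run_start is None:
            run_start = off  # first differing page of a new run
        elif not differs and run_start is not None:
            ranges.append((run_start, off))
            run_start = None
    if run_start is not None:
        ranges.append((run_start, length))
    return ranges


def compare_files(good_path: str, bad_path: str):
    """Read both files fully and report the differing page ranges."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        return corrupted_page_ranges(g.read(), b.read())


if __name__ == "__main__":
    # Synthetic data only: two pages overwritten in a 64 KiB buffer.
    good = bytes(0x10000)
    bad = bytearray(good)
    bad[0x3000:0x5000] = b"\xff" * 0x2000
    for start, end in corrupted_page_ranges(good, bytes(bad)):
        print(f"pages differ from {start:#x} to {end:#x}")
```

Run against the pristine file from the rpm and the damaged copy, a result such as a single range from 0x90000 to 0x9b000 would match the observation above.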

------- Additional Comments From bmarson@redhat.com  2006-06-09 11:48 EST -------
What's interesting is I ran rpm -V on all the RPMS and stored the results in
/usr/tmp/bjm/rpm-v.log before I started any veritas stuff.

Then I manually installed the RPMS (3) the way they failed on veritas2.  All
worked.  So I removed those RPMS and fired off the real test, and so far it
hasn't failed either.  I'm letting the install test complete before I touch
anything.

Just a note about the hardware.  There's a Dell RAID controller being used.  2
drives that were RAID0 striped for space/performance.

Barry

------- Additional Comments From bmarson@redhat.com  2006-06-09 15:41 EST -------
Installing the 06/07 tree both from PXE install and CD failed with the segv

Next I installed the 06/07 tree with PXE, logged in and immediately ran rpm -V
on all RPMS.  Results are on veritas3 in /usr/tmp/bjm.  I noted to jturner that
/lib/tls/libc-2.3.4.so had an MD5 error bit.  He noted when he checked it, it
wasn't fine.  When the verify was rerun 5 minutes later, the MD5 error was gone.

What's going on here ?

And suspecting veritas would run if there was no error, I ran the simple test
and it passed.

I think we're dealing with something with prelink or the state of the library
during prelink work.

Barry

------- Additional Comments From jakub@redhat.com  2006-06-09 16:53 EST -------
It would be very surprising if prelink ran right after the install, normally
it runs during the first night after install and on veritas3 libc was corrupted
this morning, yet prelink hasn't been executed yet.
Do you happen to have saved copies of /lib/tls/libc-2.3.4.so when rpm -V was
reporting failure, when VRTS* scriptlets were segfaulting and when it was ok
again?  If you copied the file each time, we could analyze what was going on,
but without it it is hard to say.  Certainly I don't see how prelink could write
MegaRAID firmware data, or whatever it was, into the middle of libc.so.

------- Additional Comments From bmarson@redhat.com  2006-06-10 08:37 EST -------
I'm not going to have good access to the machine until late Sunday.  I will
reinstall and see if the last scenario was exactly repeatable.  Then I will 
test it against just glibc RPM, copying the file between each phase.  Again, 
that corruption was found before any veritas bits existed on the machine.  I'm 
also going to be commandeering another box to test with to hopefully rule out
the specific systems as being the issue.


------- Additional Comments From bmarson@redhat.com  2006-06-11 21:46 EST -------
I rebuilt veritas3 with the same 06/07 tree.  My first login consisted of

1.  cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.preverify1

2.  rpm -V glibc-2.3.4-2.22 > /usr/tmp/bjm/glibc_verify1

    RPM: glibc-2.3.4-2.22
    ........C c /etc/ld.so.conf
    .......T. c /etc/rpc
    ..5......   /lib/tls/libc-2.3.4.so
    S.5....T.   /sbin/ldconfig
    S.5....T.   /sbin/sln
    ........C   /usr/lib/gconv/gconv-modules.cache
    S.5....T.   /usr/sbin/iconvconfig
    ........C c /etc/ld.so.conf
    ........C   /usr/lib64/gconv/gconv-modules.cache


3.  cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.preverify2

4.  rpm -V glibc-2.3.4-2.22 > /usr/tmp/bjm/glibc_verify2

    RPM: glibc-2.3.4-2.22
    ........C c /etc/ld.so.conf
    .......T. c /etc/rpc
    S.5....T.   /sbin/ldconfig
    S.5....T.   /sbin/sln
    ........C   /usr/lib/gconv/gconv-modules.cache
    S.5....T.   /usr/sbin/iconvconfig
    ........C c /etc/ld.so.conf
    ........C   /usr/lib64/gconv/gconv-modules.cache

5.  cmp /usr/tmp/bjm/libc-2.3.4.so.preverify1 /usr/tmp/bjm/libc-2.3.4.so.preverify2

    libc-2.3.4.so.preverify1 libc-2.3.4.so.preverify2 differ: byte 147457, line 247

6.  I then copied over the Veritas kit and it successfully installed.

7.  cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.postVRTSob

    it is identical to /usr/tmp/bjm/libc-2.3.4.so.preverify2

8.  Rebooted machine

9.  rpm -V glibc-2.3.4-2.22 > glibc_verify_postRPMinstall_andReboot

    RPM: glibc-2.3.4-2.22
    ........C c /etc/ld.so.conf
    .......T. c /etc/rpc
    prelink: /lib/tls/libc-2.3.4.so: prelinked file was modified
    S.?......   /lib/tls/libc-2.3.4.so
    S.5....T.   /sbin/ldconfig
    S.5....T.   /sbin/sln
    ........C   /usr/lib/gconv/gconv-modules.cache
    S.5....T.   /usr/sbin/iconvconfig
    ........C c /etc/ld.so.conf
    ........C   /usr/lib64/gconv/gconv-modules.cache

I was thinking of retrying the steps again but never installing the veritas 
kit.  If the results are the same, then Veritas has nothing to do with the 
issue.  Thoughts?
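
Editorial sketch: the rpm -V result lines in the steps above can be decoded letter by letter. The Python below is a small illustration based on the verify legend in the rpm man page (S size, M mode, 5 MD5, D device, L link, U user, G group, T mtime, "?" test could not be performed, "c" config file). The meaning given for the ninth column ("C") is an assumption about this rpm version, not something stated in this report.

```python
VERIFY_FLAGS = {
    "S": "size differs",
    "M": "mode differs",
    "5": "MD5 digest differs",
    "D": "device number mismatch",
    "L": "symlink target mismatch",
    "U": "user differs",
    "G": "group differs",
    "T": "mtime differs",
    # Assumption: ninth column of this rpm version, taken to be SELinux context.
    "C": "context differs (assumed meaning)",
}


def decode_verify_line(line: str) -> dict:
    """Decode one 'flags [c] path' line of rpm -V output.

    Lines that are not verify results (for example the 'prelink: ...'
    message seen in step 9) should be filtered out before calling this.
    """
    parts = line.split()
    flags, path = parts[0], parts[-1]
    return {
        "path": path,
        "config": len(parts) == 3 and parts[1] == "c",
        "failed": [VERIFY_FLAGS.get(ch, f"unknown flag {ch!r}")
                   for ch in flags if ch not in ".?"],
        "untested": "?" in flags,  # '?' means the test could not be performed
    }


if __name__ == "__main__":
    for sample in ("..5......   /lib/tls/libc-2.3.4.so",
                   "S.?......   /lib/tls/libc-2.3.4.so",
                   "........C c /etc/ld.so.conf"):
        print(decode_verify_line(sample))
```

Read this way, step 2 reports only a digest mismatch for libc-2.3.4.so (same size, different content), which is consistent with pages being overwritten in place.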

------- Additional Comments From jakub@redhat.com  2006-06-12 05:20 EST -------
I really suspect the hw and/or kernel problem is what is happening here.
libc-2.3.4.so.preverify1 is corrupted similarly as before, with:
00024000  20 00 00 32 4d 65 67 61  52 41 49 44 4c 44 20 30  | ..2MegaRAIDLD 0|
00024010  20 52 41 49 44 30 20 20  31 33 39 47 35 32 31 58  | RAID0  139G521X|
00024020  00 00 00 00 00 00 00 00  b8 74 27 a0 20 1b 17 a0  |.........t'. ...|
00024030  00 00 00 00 00 a0 9b a0  00 a0 9b a0 24 77 27 a0  |............$w'.|
00024040  70 74 27 a0 4c 3e 07 a0  2c 36 07 a0 0c 00 00 00  |pt'.L>..,6......|
00024050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00024060  1e 04 01 00 00 01 00 00  00 00 00 00 00 00 00 00  |................|
00024070  00 00 00 00 00 00 80 0f  11 26 06 a0 a8 60 0f a0  |.........&...`..|
00024080  00 00 00 00 00 00 00 00  18 92 13 a0 00 00 00 00  |................|
00024090  00 00 00 00 00 50 8e 7d  00 00 00 00 cf 51 01 06  |.....P.}.....Q..|
000240a0  00 00 80 57 d4 74 27 a0  3c 4e 06 a0 ac 11 09 a0  |...W.t'.<N......|
000240b0  38 75 27 a0 f0 74 27 a0  24 da 08 a0 00 00 00 00  |8u'..t'.$.......|
000240c0  48 fd 00 00 00 00 00 00  00 00 00 02 1c 00 00 00  |H...............|
000240d0  10 75 27 a0 1c 00 00 00  00 00 00 00 00 00 00 00  |.u'.............|
000240e0  00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00  |................|
000240f0  00 00 00 00 a9 75 27 a0  0a 00 00 00 00 00 00 00  |.....u'.........|
00024100  00 00 00 00 a8 75 27 a0  4c 75 27 a0 3c 75 27 a0  |.....u'.Lu'.<u'.|
00024110  fc d9 0d a0 0c da 0d a0  00 00 00 00 78 75 27 a0  |............xu'.|
00024120  50 75 27 a0 30 17 09 a0  ec d9 0d a0 1e 71 0e a0  |Pu'.0........q..|
00024130  02 00 00 00 01 00 00 00  2c 76 27 a0 0c 00 00 00  |........,v'.....|
00024140  ec 75 27 a0 2c 76 27 a0  e8 75 27 a0 7c 75 27 a0  |.u'.,v'..u'.|u'.|
00024150  14 14 09 a0 8c 16 09 a0  00 00 00 00 a8 75 27 a0  |.............u'.|
00024160  01 00 00 00 00 12 09 a0  00 00 00 20 cf 51 01 06  |........... .Q..|
...
000248c0  00 00 00 00 00 00 00 00  fc 03 00 00 56 41 4c 49  |............VALI|
000248d0  44 41 54 49 4f 4e 3d 4e  6f 6e 65 00 56 45 52 53  |DATION=None.VERS|
000248e0  49 4f 4e 53 3d 42 54 42  4c 5f 44 2e 32 2e 31 2e  |IONS=BTBL_D.2.1.|
000248f0  39 2c 42 49 4f 53 5f 48  34 33 30 2c 43 54 4c 4d  |9,BIOS_H430,CTLM|
00024900  5f 55 38 32 37 2c 41 50  50 5f 35 32 31 58 00 46  |_U827,APP_521X.F|
00024910  6c 61 73 68 61 62 6c 65  3d 31 30 32 38 5f 30 30  |lashable=1028_00|
00024920  31 33 5f 31 30 32 38 5f  30 31 36 44 00 00 00 00  |13_1028_016D....|
00024930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...
being the start of the corrupted chunk (again, whole page aligned)
and the /lib/tls/libc-2.3.4.so has that garbage in a different location (that's
why prelink moans):
00109000  20 00 00 32 4d 65 67 61  52 41 49 44 4c 44 20 30  | ..2MegaRAIDLD 0|
00109010  20 52 41 49 44 30 20 20  31 33 39 47 35 32 31 58  | RAID0  139G521X|
00109020  00 00 00 00 01 00 00 00  b8 74 27 a0 64 1b 17 a0  |.........t'.d...|
00109030  00 00 00 00 7c a0 9b a0  7c a0 9b a0 24 77 27 a0  |....|...|...$w'.|
00109040  70 74 27 a0 4c 3e 07 a0  2c 36 07 a0 0c 00 00 00  |pt'.L>..,6......|
00109050  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00109060  1e 04 01 00 00 01 00 00  00 00 00 00 00 00 00 00  |................|
00109070  00 00 00 00 00 00 80 0f  11 26 06 a0 a8 60 0f a0  |.........&...`..|
00109080  00 00 00 00 00 00 00 00  18 95 13 a0 00 00 01 00  |................|
00109090  01 00 00 00 00 d2 8b 7d  00 00 00 00 cf 51 01 06  |.......}.....Q..|
001090a0  00 00 80 57 d4 74 27 a0  3c 4e 06 a0 ac 11 09 a0  |...W.t'.<N......|
001090b0  38 75 27 a0 f0 74 27 a0  24 da 08 a0 00 00 00 00  |8u'..t'.$.......|
001090c0  48 fd 00 00 00 00 00 00  00 00 00 02 1c 00 00 00  |H...............|
001090d0  10 75 27 a0 1c 00 00 00  00 00 00 00 00 00 00 00  |.u'.............|
001090e0  00 00 00 00 00 00 00 00  00 00 00 00 04 00 00 00  |................|
001090f0  00 00 00 00 a9 75 27 a0  0a 00 00 00 00 00 00 00  |.....u'.........|
00109100  00 00 00 00 a8 75 27 a0  4c 75 27 a0 3c 75 27 a0  |.....u'.Lu'.<u'.|
00109110  fc d9 0d a0 0c da 0d a0  00 00 00 00 78 75 27 a0  |............xu'.|
00109120  50 75 27 a0 30 17 09 a0  00 00 00 02 e8 75 27 a0  |Pu'.0........u'.|
00109130  70 75 27 a0 b8 9e 0d a0  30 26 06 a0 a8 60 0f a0  |pu'.....0&...`..|
00109140  00 00 00 00 00 04 00 00  00 04 00 00 7c 75 27 a0  |............|u'.|
00109150  14 14 09 a0 8c 16 09 a0  00 00 00 00 cf 51 01 06  |.............Q..|
00109160  00 00 80 57 00 12 09 a0  00 00 00 20 cf 51 01 06  |...W....... .Q..|
00109170  00 00 80 57 60 bf 1a a0  ff ff ff ff 00 00 00 00  |...W`...........|
00109180  08 d0 8a 7d 00 00 00 00  08 a0 9b c0 08 00 00 00  |...}............|
...
001098c0  00 00 00 00 00 00 00 00  fc 03 00 00 56 41 4c 49  |............VALI|
001098d0  44 41 54 49 4f 4e 3d 4e  6f 6e 65 00 56 45 52 53  |DATION=None.VERS|
001098e0  49 4f 4e 53 3d 42 54 42  4c 5f 44 2e 32 2e 31 2e  |IONS=BTBL_D.2.1.|
001098f0  39 2c 42 49 4f 53 5f 48  34 33 30 2c 43 54 4c 4d  |9,BIOS_H430,CTLM|
00109900  5f 55 38 32 37 2c 41 50  50 5f 35 32 31 58 00 46  |_U827,APP_521X.F|
00109910  6c 61 73 68 61 62 6c 65  3d 31 30 32 38 5f 30 30  |lashable=1028_00|
00109920  31 33 5f 31 30 32 38 5f  30 31 36 44 00 00 00 00  |13_1028_016D....|
00109930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Neither libc-2.3.4.so.preverify1 nor libc-2.3.4.so.preverify2 was ever
prelinked, so we can easily rule prelink out (I'd certainly wonder where prelink
could grab similar kind of garbage when it is not present in any of the
libraries that are installed).
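
Editorial sketch: the page alignment noted above can be checked with a little arithmetic. cmp numbers bytes from 1, so the "byte 147457" it reported earlier is file offset 147456, which is 0x24000, the start of the corrupted chunk in libc-2.3.4.so.preverify1. The Python below assumes a 4096-byte page; the helper name is made up for this example.

```python
PAGE_SIZE = 4096  # assumed page size


def cmp_byte_to_page(byte_number: int, page_size: int = PAGE_SIZE):
    """Convert cmp's 1-based byte number to (offset, page index, offset in page)."""
    offset = byte_number - 1
    return offset, offset // page_size, offset % page_size


if __name__ == "__main__":
    offset, page, within = cmp_byte_to_page(147457)
    print(f"offset {offset:#x}, page {page}, offset within page {within}")

    # Start offsets of the corrupted chunks quoted in this report.
    for start in (0x90000, 0x8C000, 0x24000, 0x109000):
        print(f"{start:#08x} page aligned: {start % PAGE_SIZE == 0}")
```

All four quoted start offsets are multiples of 4096, which fits corruption arriving in page-sized units rather than through ordinary byte-granular file writes.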

------- Additional Comments From bmarson@redhat.com  2006-06-12 09:32 EST -------
I'm rebuilding veritas2 with RHEL4-U3 to make sure there is no weird behavior and
to have as a reference.


------- Additional Comments From bmarson@redhat.com  2006-06-12 10:28 EST -------
I verified that libc is not showing any corruption on RHEL4-U3.  Both copies of
the libc file are identical and rpm -V does not even show libc as having any change.


------- Additional Comments From bmarson@redhat.com  2006-06-12 12:11 EST -------
I reinstalled the 06/07 tree on veritas3, this time with the intent of running
a 32bit app other than veritas.  I picked iozone, which I verified worked on
veritas2 (RHEL4-U3).  This was a binary copied over, not built locally.

Running the binary the first time segv's.  It fails in getopt()....  rpm -V shows
the library corrupt.  This time, when I reran iozone after the verify, it still
segv'ed.  Rerunning the verify reported the same results.  Both copies of the
library are identical too.

Interestingly enough, running iozone without an arg did report the helper message
from the app (i.e., enter -h for all the options).

I proceeded to reboot the box, and afterwards iozone segv'ed again.  A 3rd verify
showed the same results, BUT cmp against the older copy of the libraries
(pre-reboot) shows a one-byte difference.

Of further interest, iozone without args NOW segv's.

I'm running out of ideas of things to look for.  One thought was to try to get a
copy of libc with a program that bypasses the pagecache.

Barry

------- Additional Comments From jakub@redhat.com  2006-06-12 12:19 EST -------
Can you perhaps try RHEL4-U3 kernel with RHEL4-U4 userland?  Were there any
changes in the MegaRAID driver in U4?

------- Additional Comments From bmarson@redhat.com  2006-06-12 13:09 EST -------
I installed the U3 kernel on the existing U4 bits and we have the same
corruption.  gdb shows the failure as early as _dl_start_user .... -> sysconf

I wonder if a reinstall and a reboot to single user mode would show anything. 
Any thoughts ?

------- Additional Comments From jakub@redhat.com  2006-06-12 13:20 EST -------
If you took the already corrupted U4 bits with U3 kernel, then that only shows
that by the time the corruption happens, it is not just in memory, but also
on disk.
But we still don't know what causes the corruption.
For that, doing an install with U3 kernel plus U4 userland, or if you think this
might have anything to do with glibc (I don't), an U3 install with U4 glibc
composed into it might show something.


------- Additional Comments From bmarson@redhat.com  2006-06-12 13:48 EST -------
It appears not to be easy to create such an install environment.  I don't have
the know-how at this time.  How about a U3 install with a U4 up2date without a
kernel update?  That might shed some light.


------- Additional Comments From tao@redhat.com  2006-06-12 14:45 EST -------
Issue is being worked in Engineering. This isn't a beta blocker, but
should be resolved prior to GA.


Issue escalated to Support Engineering Group by: andriusb.
Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by andriusb 
 issue 95733



Comment 5 Barry Marson 2006-06-14 10:56:04 EDT
And now for a further update, in real time ...

We have been trying to take a U3 system and do an update with the 06/07 tree
without the kernel bits.  This is more of an effort than I ever dreamed.

In parallel, I rebuilt veritas2 with the 06/07 tree but this time I installed
the Minimal kits.  Status  ... libc is still corrupt, yet iozone runs successfully.

Barry
Comment 8 Barry Marson 2006-06-15 10:14:02 EDT
I decided to test this on the third system we use for Veritas cert.  A Dell
PE6800 16CPUx16GB with the same PERC type RAID controller.  Sure enough iozone
fails the same way with a 06/07 tree "everything" install.

I am looking into two things: building on the system without Megaraid, and taking
a U3 system and installing a U4 kernel.
Comment 9 Barry Marson 2006-06-15 11:58:33 EDT
We get the same failure with the pe6800 system.

Building the pe2850 with simple SCSI shows NO corruption.  Talking with Josh
Giles (former DELL and now Red Hat QA) there were firmware issues with PERC4
RAID controllers.  We are attempting to update our firmware now on veritas2
(pe2850 system).

Also of note, jturner now recreates the problem when he uses the PERC controller
and gets megaraid loading.

I believe it was verified that megaraid driver did not change between U3/U4.

Stay tuned.

Barry
Comment 10 Tom Coughlan 2006-06-15 12:00:41 EDT
The last time we touched the megaraid driver was 2.6.9-11.38 (U2). FWIW.
Comment 11 Tom Coughlan 2006-06-15 15:50:05 EDT
The corruption looks like SCSI Inquiry data:

[root@pe6800-01 ~]# sg_inq -h /dev/sda
standard INQUIRY:
 00     00 00 02 00 20 00 00 32  4d 65 67 61 52 41 49 44    .... ..2MegaRAID
 10     4c 44 20 30 20 52 41 49  44 30 20 20 35 35 39 47    LD 0 RAID0  559G
 20     35 32 31 53 00                                      521S.

These strings are also in /proc, FWIW:

[root@pe6800-01 ~]# more /proc/scsi/scsi
Attached devices:
...
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: MegaRAID Model: LD 0 RAID0  559G Rev: 521S
  Type:   Direct-Access                    ANSI SCSI revision: 02

Josh is checking to see if this is a known problem. 

SCSI Inquiries happen at boot time, and possibly after a bus error. We could
turn on logging to see if they happen at other times. 

It is interesting that the problem does not occur on U3 and it does on U4, where
the driver is the same, and no change to the FW. As a guess, I'd say that a
change in dm or LVM in U4 may be provoking the bug in the driver or the
firmware. We could test this theory by doing an install without dm and LVM. That
is not a fix, though, so let's wait and see if this is a known bug.
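
Editorial sketch: the sg_inq bytes above can be split into the standard INQUIRY fields (vendor at bytes 8-15, product at 16-31, revision at 32-35), and the same bytes can be used as a search pattern. Comparing with the earlier hexdumps, the corrupted pages appear to begin with this data from byte 4 onward (20 00 00 32 followed by "MegaRAID"). The Python below is an illustration only; the function names and the 12-byte needle length are choices made for this example.

```python
# Bytes copied from the sg_inq -h output above (pe6800-01).
INQUIRY_HEX = (
    "00 00 02 00 20 00 00 32 4d 65 67 61 52 41 49 44 "
    "4c 44 20 30 20 52 41 49 44 30 20 20 35 35 39 47 "
    "35 32 31 53 00"
)


def parse_standard_inquiry(data: bytes) -> dict:
    """Split standard INQUIRY data into its fixed-position ASCII fields."""
    return {
        "additional_length": data[4],
        "vendor": data[8:16].decode("ascii").strip(),
        "product": data[16:32].decode("ascii").rstrip(),
        "revision": data[32:36].decode("ascii").strip(),
    }


def find_inquiry_fragments(blob: bytes, inquiry: bytes, skip: int = 4, length: int = 12):
    """Return every offset in blob where inquiry[skip:skip+length] occurs."""
    needle = inquiry[skip:skip + length]
    offsets = []
    pos = blob.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(needle, pos + 1)
    return offsets


if __name__ == "__main__":
    inquiry = bytes.fromhex(INQUIRY_HEX)
    print(parse_standard_inquiry(inquiry))
    # Synthetic blob: one clean page followed by a page starting with the fragment.
    blob = bytes(4096) + inquiry[4:36] + bytes(4096 - 32)
    print([hex(o) for o in find_inquiry_fragments(blob, inquiry)])
```

The default needle stops after the vendor string, so it also matches the veritas systems, whose product and revision fields differ (139G, 521X).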

Comment 12 Jay Turner 2006-06-15 16:34:30 EDT
Just did an install without LVM or dm and still get the corruption on boot, so I
think we can factor that out of the picture.
Comment 14 Tom Coughlan 2006-06-21 10:01:28 EDT
The current theory is that when the o.s. issues a SCSI Inquiry command that
requests "Vital Product Data" (EVPD) from the Megaraid adapter, the adapter DMAs
the data to the wrong place in memory. This type of Inquiry is used to get more
detailed info, such as the WWID of the device. 

I added some logging to the driver and found that there is just one Inquiry
issued with EVPD set, late in the boot process (the second Inquiry below):

messagebus: messagebus startup succeeded
cups-config-daemon: cups-config-daemon startup succeeded
haldaemon: haldaemon startup succeeded
fstab-sync[5459]: removed all generated mount points
fstab-sync[5476]: added mount point /media/cdrom for /dev/hda
fstab-sync[5783]: added mount point /media/cdrom1 for /dev/hdf
fstab-sync[5792]: added mount point /media/floppy for /dev/hde
kernel: Inquiry 00 00 00 ff 00 
kernel: Inquiry 01 80 00 ff 00 
fstab-sync[5909]: added mount point /media/floppy1 for /dev/fd0
inventory.py: Reading DMI info failed
rhqa-inventory: inventory.py startup succeeded
kernel: mtrr: type mismatch for c8000000,1000000 old: write-back new: write-combining

This is presumably coming from something like haldaemon running the scsi_id
utility to get the WWID. This may be where the difference is between U3 and U4. 
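
Editorial sketch: the two logged Inquiry lines above can be decoded if one assumes that the added logging prints CDB bytes 1 through 5, that is, everything after the INQUIRY opcode 0x12. That assumption is mine and is not stated in the comment. Under it, bit 0 of the first logged byte is EVPD, the second byte is the VPD page code, and the fourth is the allocation length. The Python below is illustrative only.

```python
def decode_logged_inquiry(hex_bytes: str) -> dict:
    """Decode five logged bytes, assumed to be INQUIRY CDB bytes 1..5."""
    b = bytes.fromhex(hex_bytes)
    if len(b) != 5:
        raise ValueError("expected exactly five bytes after the opcode")
    return {
        "evpd": bool(b[0] & 0x01),   # CDB byte 1, bit 0
        "page_code": b[1],           # CDB byte 2, meaningful when EVPD is set
        "allocation_length": b[3],   # CDB byte 4
    }


if __name__ == "__main__":
    for logged in ("00 00 00 ff 00", "01 80 00 ff 00"):
        print(logged, "->", decode_logged_inquiry(logged))
```

With this reading, the second Inquiry has EVPD set and asks for page 0x80, the Unit Serial Number page, with a 255-byte allocation length, which is consistent with a utility trying to obtain a device identifier.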

Now I have a couple questions for Jakub:

1) I imagine that libc.so is mmapped, and it could certainly be the victim of a
bad DMA write from the megaraid hba. Do you know of a way that this would be
written to disk, where you found it with hexdump?

2) I could use some help finding an easy way to reproduce this, and testing the
fix. I don't know enough about glibc to know, can I just replace the corrupted
/lib/tls/libc-2.3.4.so file with a clean one from the rpm, then run the megaraid
Inquiry test to see if it gets corrupted? 

Thanks

Comment 15 Jakub Jelinek 2006-06-21 10:15:25 EDT
1) yes, libc.so.6 is mapped in all (non-statically linked) processes, but the
corruption was always in its read-only PT_LOAD segment, which is mapped with
PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE.  I'm not aware of any program
that would mmap libc.so.6 using a writable or non-private mapping, so it is
much more probable the corruption happens at the page cache level in the kernel
directly.
2) easiest is probably rpm -Uvh --force glibc-2.3.4-2.*.i686.rpm
You can replace just one file of course, but you'd need to do that atomically
(copy the non-corrupted one from e.g. rpm2cpio to /lib/tls/~libc-2.3.4.so
and then mv -f /lib/tls/{~,}libc-2.3.4.so
or something similar).
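
Editorial sketch: the atomic replacement described above (copy the clean file to a temporary name in the same directory, then rename it over the target) looks like this in Python. The function name is hypothetical and this is not the procedure used in the report; it only illustrates why the rename step matters. A rename within one filesystem swaps the directory entry in a single step, so processes that already have the old file mapped keep their copy and new processes never see a half-written library.

```python
import os
import shutil
import tempfile


def atomic_replace(clean_source: str, target: str) -> None:
    """Replace target with a copy of clean_source using write-then-rename."""
    target_dir = os.path.dirname(os.path.abspath(target))
    # The temporary file must live in the target's directory so that the
    # final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix="~")
    try:
        with os.fdopen(fd, "wb") as tmp, open(clean_source, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        shutil.copymode(clean_source, tmp_path)  # mkstemp creates mode 0600
        os.replace(tmp_path, target)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        good = os.path.join(d, "clean.so")
        live = os.path.join(d, "libdemo.so")
        with open(good, "wb") as f:
            f.write(b"clean contents")
        with open(live, "wb") as f:
            f.write(b"corrupted contents")
        atomic_replace(good, live)
        with open(live, "rb") as f:
            print(f.read())
```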
Comment 16 Tom Coughlan 2006-06-24 21:14:17 EDT
Status:

1. I modified the driver so it would reject Inquiry EVPD commands. This
solved the corruption problem, at least in the scenario where I can most
easily detect it.

2. Seokmann indicated that this may not be a complete solution. The
problem may be because certain megaraid models should not be run in 64-bit
DMA mode. He suggested that I comment out the section in the driver
that enables it to switch from 32-bit to 64-bit DMA:

pci_set_dma_mask(adapter->pdev, DMA_64BIT_MASK)

This also seemed to solve the problem. 

3. Seokmann suggested a patch similar to:

http://marc.theaimsgroup.com/?t=114805150900005&r=1&w=2

This prevents the switch to 64-bits on boards that do not support it.

This did not solve the problem because the board I am using is one of
the ones that is allowed to be set to 64-bits (PERC4E_DI_KOBUK).

4. I added mem=3GB on the kernel command line and tested with 
the stock driver. This did not solve the problem.

At this point it is not clear whether there is a firmware problem that
is specific to EVPD, or if the fw problem is with 64-bit DMA on some or all
models. If it is the latter, then the problem apparently exists on a
broader set of hardware models than LSI Logic expected. We would also
need to explain test 4 above. 

Seokmann is checking with the LSI Logic firmware team. I have also asked Dell to
provide any info they have on megaraid fw issues. 
Comment 17 Tom Coughlan 2006-06-27 11:32:21 EDT
*** Bug 196573 has been marked as a duplicate of this bug. ***
Comment 18 Tom Coughlan 2006-06-27 12:24:50 EDT
There are two separate bugs:

1) The firmware can corrupt data when processing an Inquiry command with the
EVPD bit set (this command is used to get the WWID of the device, for example).
The board incorrectly returns normal Inquiry data, which typically overflows
the input buffer. 

2) The driver incorrectly enables 64-bit DMA on some models that do not
correctly support it.

A patch to fix these issues is attached. I have built a kernel with this patch
for testing. 

http://people.redhat.com/coughlan/.2.6.9-40.ELbz194533/

It would be helpful if anyone who is having problems with megaraid systems would
test this kernel. 

You will see an error logged when the o.s. issues a request to send an EVPD Inquiry:

kernel: SCSI error : <0 1 0 0> return code = 0x40000

This is not a problem. This is what the megaraid driver normally emits when it
receives an unsupported command. 

Comment 19 Tom Coughlan 2006-06-27 12:27:14 EDT
Created attachment 131615
fix two corruption bugs
Comment 20 Barry Marson 2006-06-27 15:56:03 EDT
I have done some preliminary testing of the test kernel listed above.

Installing .2.6.9-40.ELbz194533 kernel just before reboot of a 6/7 tree yields
no known data corruption.  SCSI error ... messages are seen 3 times.  The last
time is when the system inventory RPM scans the system.  This  was a minimal
install and rpm -V shows no corruption in glibc.  iozone works too.

I want to repeat this test with a full install, so I can test veritas as well as
openoffice suite (since it's 32bit and known to fail without a fix).

Barry

Comment 22 Tom Coughlan 2006-06-29 11:50:32 EDT
Created attachment 131753
rev 2

Here is an updated version of the patch. This avoids the generation of a SCSI
error message in the log when an Inquiry EVPD command is submitted. 

A kernel with this patch applied is available at:

http://people.redhat.com/coughlan/.2.6.9-40.ELbz194533a/

Please test.
Comment 28 Jason Baron 2006-07-07 13:56:21 EDT
committed in stream U4 build 40.1. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 29 Barry Marson 2006-07-07 15:18:50 EDT
The RHEL4 U4 40.1 x86_64 kernel was installed (just before reboot of a 7/6 U4
tree) and tested on the veritas system.  While veritas software could not be
tested at this time (resource constraint), there was no sign of data corruption
in glibc.  Also the other KNOWN 32 bit app that used to fail (Ooffice) ran
successfully.

Unless someone absolutely needs veritas tested with THIS kernel,  I declare the
fix a success.

Barry
Comment 30 Barry Marson 2006-07-07 15:30:00 EDT
VERIFIED kernel-smp-2.6.9-40.1.EL.x86_64.rpm works as noted above
Comment 32 Red Hat Bugzilla 2006-08-10 19:30:38 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html
