Bug 194533
| Summary: | veritas storage foundation 32bit apps crash in glibc during post-process installation | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Barry Marson <bmarson> |
| Component: | kernel | Assignee: | Tom Coughlan <coughlan> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | urgent | Priority: | urgent |
| Version: | 4.0 | CC: | andriusb, bmarson, coughlan, ezannoni, fhirtz, jburke, jlaska, jnomura, jturner, kueda, syeghiay, tkincaid |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | RHSA-2006-0575 | Doc Type: | Bug Fix |
| Last Closed: | 2006-08-10 23:30:38 UTC | Bug Blocks: | 181409, 181411, 196056 |
Comment 1
Jakub Jelinek
2006-06-08 20:09:26 UTC
The following are the notes taken to try and bring back all the reporting in a
single comment:
------- Additional Comments From bmarson 2006-06-08 23:37 EST -------
veritas3 in the same subnet was showing the same problems. It's different in
that it has a 05/26 tree installed and has been tested with the -38 and -39
kernels. We experienced intermittent failures here. It seems if we remove
the veritas bits and then rebuild the rpm database and reboot it works fine.
Both systems were built either with PXE install latest (from
qafiler.boston.redhat.com) or with a boot CD from the same tree areas. Would
it help if I rebuilt one of them fresh ?
Barry
------- Additional Comments From jakub 2006-06-09 02:19 EST -------
Both systems are offline now, so I can't investigate now.
I guess backing up the corrupted libc
(/usr/tmp/jbm/libc-2.3.4.so-prelinked-corrupted) to some other host (or a
partition that won't be reformatted) and reinstalling wouldn't be a bad idea.
That way we can see if it was a random corruption or if some program
intentionally overwrites part of libc.so.
------- Additional Comments From bmarson 2006-06-09 09:18 EST -------
I was planning on rebuilding veritas3 with the 05/06 tree and not touching
veritas2 at this time. That way we will have a system to compare and see where
things deteriorate. We are having lab network trouble. Once resolved, I will
stage veritas3
Barry
------- Additional Comments From bmarson 2006-06-09 09:42 EST -------
That's the 06/05 tree. Networking has been restored, so I'll be staging
veritas3 now.
------- Additional Comments From jakub 2006-06-09 11:04 EST -------
I just looked at veritas3.rhts and there libc.so is crippled as well.
It was not prelinked, so simple cmp -l with rpm2cpio glibc-2.3.4-2.22.i686.rpm |
cpio -id extracted lib/tls/libc-2.3.4.so was possible.
It seems a big chunk of libc.so's .text section was overwritten, from file offset
0x90000 to 0x9b000.
The corrupted file part starts with (from hexdump -C):
00090000 20 00 00 32 4d 65 67 61 52 41 49 44 4c 44 20 30 | ..2MegaRAIDLD 0|
00090010 20 52 41 49 44 30 20 20 31 33 39 47 35 32 31 58 | RAID0 139G521X|
00090020 00 00 00 00 02 00 00 00 b8 74 27 a0 a8 1b 17 a0 |.........t'.....|
00090030 00 00 00 00 f8 a0 9b a0 f8 a0 9b a0 24 77 27 a0 |............$w'.|
00090040 70 74 27 a0 4c 3e 07 a0 2c 36 07 a0 0c 00 00 00 |pt'.L>..,6......|
00090050 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00090060 1e 04 01 00 00 01 00 00 00 00 00 00 00 00 00 00 |................|
00090070 00 00 00 00 00 00 80 0f 11 26 06 a0 a8 60 0f a0 |.........&...`..|
00090080 00 00 00 00 00 00 00 00 18 98 13 a0 00 00 02 00 |................|
00090090 02 00 00 00 00 e4 6e 01 00 00 00 00 cf 51 01 06 |......n......Q..|
000900a0 00 00 80 57 d4 74 27 a0 3c 4e 06 a0 ac 11 09 a0 |...W.t'.<N......|
000900b0 38 75 27 a0 f0 74 27 a0 24 da 08 a0 00 00 00 00 |8u'..t'.$.......|
000900c0 48 fd 00 00 00 00 00 00 00 00 00 02 1c 00 00 00 |H...............|
000900d0 10 75 27 a0 1c 00 00 00 00 00 00 00 00 00 00 00 |.u'.............|
000900e0 00 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00 |................|
000900f0 00 00 00 00 a9 75 27 a0 0a 00 00 00 00 00 00 00 |.....u'.........|
00090100 00 00 00 00 a8 75 27 a0 4c 75 27 a0 3c 75 27 a0 |.....u'.Lu'.<u'.|
00090110 fc d9 0d a0 0c da 0d a0 00 00 00 00 78 75 27 a0 |............xu'.|
00090120 50 75 27 a0 30 17 09 a0 ec d9 0d a0 1e 71 0e a0 |Pu'.0........q..|
00090130 02 00 00 00 01 00 00 00 2c 76 27 a0 0c 00 00 00 |........,v'.....|
00090140 ec 75 27 a0 2c 76 27 a0 e8 75 27 a0 7c 75 27 a0 |.u'.,v'..u'.|u'.|
...
000908c0 00 00 00 00 00 00 00 00 fc 03 00 00 56 41 4c 49 |............VALI|
000908d0 44 41 54 49 4f 4e 3d 4e 6f 6e 65 00 56 45 52 53 |DATION=None.VERS|
000908e0 49 4f 4e 53 3d 42 54 42 4c 5f 44 2e 32 2e 31 2e |IONS=BTBL_D.2.1.|
000908f0 39 2c 42 49 4f 53 5f 48 34 33 30 2c 43 54 4c 4d |9,BIOS_H430,CTLM|
00090900 5f 55 38 32 37 2c 41 50 50 5f 35 32 31 58 00 46 |_U827,APP_521X.F|
00090910 6c 61 73 68 61 62 6c 65 3d 31 30 32 38 5f 30 30 |lashable=1028_00|
00090920 31 33 5f 31 30 32 38 5f 30 31 36 44 00 00 00 00 |13_1028_016D....|
00090930 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...
so it might give some clue on what might have corrupted the C library.
On veritas2.rhts the corruption started on slightly different spot, but with
apparently the same data:
0008c000 20 00 00 32 4d 65 67 61 52 41 49 44 4c 44 20 30 | ..2MegaRAIDLD 0|
0008c010 20 52 41 49 44 30 20 20 31 33 39 47 35 32 31 58 | RAID0 139G521X|
0008c020 00 00 00 00 01 00 00 00 b8 74 27 a0 64 1b 17 a0 |.........t'.d...|
0008c030 00 00 00 00 7c a0 9b a0 7c a0 9b a0 24 77 27 a0 |....|...|...$w'.|
0008c040 70 74 27 a0 4c 3e 07 a0 2c 36 07 a0 0c 00 00 00 |pt'.L>..,6......|
0008c050 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0008c060 1e 04 01 00 00 01 00 00 00 00 00 00 00 00 00 00 |................|
0008c070 00 00 00 00 00 00 80 0f 11 26 06 a0 a8 60 0f a0 |.........&...`..|
0008c080 00 00 00 00 00 00 00 00 18 95 13 a0 00 00 01 00 |................|
0008c090 01 00 00 00 00 a2 a7 7d 00 00 00 00 cf 51 01 06 |.......}.....Q..|
0008c0a0 00 00 80 57 d4 74 27 a0 3c 4e 06 a0 ac 11 09 a0 |...W.t'.<N......|
0008c0b0 38 75 27 a0 f0 74 27 a0 24 da 08 a0 00 00 00 00 |8u'..t'.$.......|
0008c0c0 48 fd 00 00 00 00 00 00 00 00 00 02 1c 00 00 00 |H...............|
0008c0d0 10 75 27 a0 1c 00 00 00 00 00 00 00 00 00 00 00 |.u'.............|
0008c0e0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
0008c0f0 00 00 00 00 a8 75 27 a0 0a 00 00 00 00 00 00 00 |.....u'.........|
0008c100 00 00 00 00 a8 75 27 a0 4c 75 27 a0 3c 75 27 a0 |.....u'.Lu'.<u'.|
0008c110 fc d9 0d a0 0c da 0d a0 00 00 00 00 78 75 27 a0 |............xu'.|
0008c120 50 75 27 a0 30 17 09 a0 ec d9 0d a0 1e 71 0e a0 |Pu'.0........q..|
0008c130 02 00 00 00 01 00 00 00 2c 76 27 a0 0c 00 00 00 |........,v'.....|
0008c140 ec 75 27 a0 2c 76 27 a0 e8 75 27 a0 7c 75 27 a0 |.u'.,v'..u'.|u'.|
0008c150 14 14 09 a0 8c 16 09 a0 00 00 00 00 a8 75 27 a0 |.............u'.|
0008c160 01 00 00 00 00 12 09 a0 00 00 00 20 cf 51 01 06 |........... .Q..|
0008c170 00 00 80 57 60 bf 1a a0 ff ff ff ff 00 00 00 00 |...W`...........|
0008c180 20 10 ab 7d 33 34 00 00 3c 76 27 c0 03 00 00 00 | ..}34..<v'.....|
0008c190 0f 00 00 00 fe 00 00 00 00 6b 17 a0 00 00 00 02 |.........k......|
0008c1a0 5c 76 27 a0 e4 75 27 a0 b8 9e 0d a0 30 26 06 a0 |\v'..u'.....0&..|
0008c1b0 a8 60 0f a0 00 00 00 00 00 04 00 00 10 e4 ff ff |.`..............|
0008c1c0 d4 11 09 a0 4c 12 09 a0 48 76 27 a0 03 00 00 00 |....L...Hv'.....|
0008c1d0 cf 51 01 06 00 00 80 57 08 76 27 a0 00 00 00 00 |.Q.....W.v'.....|
0008c1e0 20 10 ab 7d 00 00 00 00 a0 76 27 c0 03 00 00 00 | ..}.....v'.....|
...
0008c8c0 00 00 00 00 00 00 00 00 fc 03 00 00 56 41 4c 49 |............VALI|
0008c8d0 44 41 54 49 4f 4e 3d 4e 6f 6e 65 00 56 45 52 53 |DATION=None.VERS|
0008c8e0 49 4f 4e 53 3d 42 54 42 4c 5f 44 2e 32 2e 31 2e |IONS=BTBL_D.2.1.|
0008c8f0 39 2c 42 49 4f 53 5f 48 34 33 30 2c 43 54 4c 4d |9,BIOS_H430,CTLM|
0008c900 5f 55 38 32 37 2c 41 50 50 5f 35 32 31 58 00 46 |_U827,APP_521X.F|
0008c910 6c 61 73 68 61 62 6c 65 3d 31 30 32 38 5f 30 30 |lashable=1028_00|
0008c920 31 33 5f 31 30 32 38 5f 30 31 36 44 00 00 00 00 |13_1028_016D....|
0008c930 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
So, this doesn't sound like a glibc bug, but can be either hw related, or kernel
(lvm and such), or the veritas stuff that is installed on top of that.
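The comparison above was done with cmp -l against a copy extracted from the rpm. As a minimal sketch of the same idea (not the procedure actually used here), the following Python reports which page-sized ranges differ between a clean and a suspect copy. The 4 KiB page size and the function names are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size (4 KiB is usual on x86 and x86_64)


def differing_pages(clean: bytes, suspect: bytes, page_size: int = PAGE_SIZE):
    """Return page-aligned (start, end) byte ranges where the buffers differ."""
    ranges = []
    length = max(len(clean), len(suspect))
    for start in range(0, length, page_size):
        end = start + page_size
        if clean[start:end] != suspect[start:end]:
            # Merge with the previous range when the pages are adjacent.
            if ranges and ranges[-1][1] == start:
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
    return ranges


def compare_files(clean_path: str, suspect_path: str):
    """Read both files fully and compare them page by page."""
    with open(clean_path, "rb") as a, open(suspect_path, "rb") as b:
        return differing_pages(a.read(), b.read())
```

A result made up of whole, contiguous pages (such as 0x90000 to 0x9b000) points at page-level corruption rather than at a program rewriting a few bytes.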
------- Additional Comments From bmarson 2006-06-09 11:48 EST -------
What's interesting is I ran rpm -V on all the RPMS and stored the results in
/usr/tmp/bjm/rpm-v.log before I started any veritas stuff.
Then I manually installed the RPMS (3) the way they failed on veritas2. All
worked. So I removed those RPMS and fired off the real test, and so far it
hasn't failed either. I'm letting the install test complete before I touch
anything.
Just a note about the hardware. There's a Dell RAID controller being used. 2
drives that were RAID0 striped for space/performance.
Barry
------- Additional Comments From bmarson 2006-06-09 15:41 EST -------
Installing the 06/07 tree both from PXE install and CD failed with the segv.
Next I installed the 06/07 tree with PXE, logged in and immediately ran rpm -V
on all RPMS. Results are on veritas3 in /usr/tmp/bjm. I noted to jturner that
/lib/tls/libc-2.3.4.so had an MD5 error bit. He noted when he checked it, it
wasn't fine. When I reran the verify 5 minutes later, the MD5 error was gone.
What's going on here ?
And suspecting veritas would run if there was no error, I ran the simple test
and it passed.
I think we're dealing with something with prelink or the state of the library
during prelink work.
Barry
------- Additional Comments From jakub 2006-06-09 16:53 EST -------
It would be very surprising if prelink ran right after the install, normally
it runs during the first night after install and on veritas3 libc was corrupted
this morning, yet prelink hasn't been executed yet.
Do you happen to have saved copies of /lib/tls/libc-2.3.4.so when rpm -V was
reporting failure, when VRTS* scriptlets were segfaulting and when it was ok
again? If you copied the file each time, we could analyze what was going on,
but without it it is hard to say. Certainly I don't see how prelink could write
MegaRAID firmware (or whatever it was) into the middle of libc.so.
------- Additional Comments From bmarson 2006-06-10 08:37 EST -------
I'm not going to have good access to the machine until late Sunday. I will
reinstall and see if the last scenario was exactly repeatable. Then I will
test it against just glibc RPM, copying the file between each phase. Again,
that corruption was found before any veritas bits existed on the machine. I'm
also going to be commandeering another box to test with to hopefully rule out
the specific systems as being the issue.
------- Additional Comments From bmarson 2006-06-11 21:46 EST -------
I rebuilt veritas3 with the same 06/07 tree. My first login consisted of
1. cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.preverify1
2. rpm -V glibc-2.3.4-2.22 > /usr/tmp/bjm/glibc_verify1
RPM: glibc-2.3.4-2.22
........C c /etc/ld.so.conf
.......T. c /etc/rpc
..5...... /lib/tls/libc-2.3.4.so
S.5....T. /sbin/ldconfig
S.5....T. /sbin/sln
........C /usr/lib/gconv/gconv-modules.cache
S.5....T. /usr/sbin/iconvconfig
........C c /etc/ld.so.conf
........C /usr/lib64/gconv/gconv-modules.cache
3. cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.preverify2
4. rpm -V glibc-2.3.4-2.22 > /usr/tmp/bjm/glibc_verify2
RPM: glibc-2.3.4-2.22
........C c /etc/ld.so.conf
.......T. c /etc/rpc
S.5....T. /sbin/ldconfig
S.5....T. /sbin/sln
........C /usr/lib/gconv/gconv-modules.cache
S.5....T. /usr/sbin/iconvconfig
........C c /etc/ld.so.conf
........C /usr/lib64/gconv/gconv-modules.cache
5. cmp /usr/tmp/bjm/libc-2.3.4.so.preverify1 /usr/tmp/bjm/libc-2.3.4.so.preverify2
libc-2.3.4.so.preverify1 libc-2.3.4.so.preverify2 differ: byte 147457, line 247
6. I then copied over the Veritas kit and it successfully installed.
7. cp /lib/tls/libc-2.3.4.so /usr/tmp/bjm/libc-2.3.4.so.postVRTSob
it is identical to /usr/tmp/bjm/libc-2.3.4.so.preverify2
8. Rebooted machine
9. rpm -V glibc-2.3.4-2.22 > glibc_verify_postRPMinstall_andReboot
RPM: glibc-2.3.4-2.22
........C c /etc/ld.so.conf
.......T. c /etc/rpc
prelink: /lib/tls/libc-2.3.4.so: prelinked file was modified
S.?...... /lib/tls/libc-2.3.4.so
S.5....T. /sbin/ldconfig
S.5....T. /sbin/sln
........C /usr/lib/gconv/gconv-modules.cache
S.5....T. /usr/sbin/iconvconfig
........C c /etc/ld.so.conf
........C /usr/lib64/gconv/gconv-modules.cache
I was thinking of retrying the steps again but never installing the veritas
kit. If the results are the same, then Veritas has nothing to do with the
issue. Thoughts?
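For repeating the verify step many times, it may help to pick the digest failures out of the rpm -V output automatically. The sketch below is only an illustration: it assumes the line layout seen in the outputs above (a flags field, an optional one-letter marker such as c, then the path), and relies on rpm's documented meaning of 5 (digest differs) and ? (check could not be performed) in the third column. The function name is hypothetical.

```python
import re

# Flags field, optional one-letter file marker (e.g. "c" for config), path.
VERIFY_LINE = re.compile(r"^([A-Z0-9.?]{8,9})\s+(?:([a-z])\s+)?(/\S.*)$")


def digest_failures(rpm_v_output: str):
    """Return (path, flags) for lines whose digest column is '5' or '?'."""
    failed = []
    for line in rpm_v_output.splitlines():
        match = VERIFY_LINE.match(line.strip())
        if not match:
            continue  # e.g. "RPM: ..." headers or prelink diagnostics
        flags, _marker, path = match.groups()
        if flags[2] in "5?":
            failed.append((path, flags))
    return failed
```

Comparing the list from one run with the list from the next would show /lib/tls/libc-2.3.4.so appearing and disappearing, as in steps 2 and 4 above.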
------- Additional Comments From jakub 2006-06-12 05:20 EST -------
I really suspect the hw and/or kernel problem is what is happening here.
libc-2.3.4.so.preverify1 is corrupted similarly as before, with:
00024000 20 00 00 32 4d 65 67 61 52 41 49 44 4c 44 20 30 | ..2MegaRAIDLD 0|
00024010 20 52 41 49 44 30 20 20 31 33 39 47 35 32 31 58 | RAID0 139G521X|
00024020 00 00 00 00 00 00 00 00 b8 74 27 a0 20 1b 17 a0 |.........t'. ...|
00024030 00 00 00 00 00 a0 9b a0 00 a0 9b a0 24 77 27 a0 |............$w'.|
00024040 70 74 27 a0 4c 3e 07 a0 2c 36 07 a0 0c 00 00 00 |pt'.L>..,6......|
00024050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00024060 1e 04 01 00 00 01 00 00 00 00 00 00 00 00 00 00 |................|
00024070 00 00 00 00 00 00 80 0f 11 26 06 a0 a8 60 0f a0 |.........&...`..|
00024080 00 00 00 00 00 00 00 00 18 92 13 a0 00 00 00 00 |................|
00024090 00 00 00 00 00 50 8e 7d 00 00 00 00 cf 51 01 06 |.....P.}.....Q..|
000240a0 00 00 80 57 d4 74 27 a0 3c 4e 06 a0 ac 11 09 a0 |...W.t'.<N......|
000240b0 38 75 27 a0 f0 74 27 a0 24 da 08 a0 00 00 00 00 |8u'..t'.$.......|
000240c0 48 fd 00 00 00 00 00 00 00 00 00 02 1c 00 00 00 |H...............|
000240d0 10 75 27 a0 1c 00 00 00 00 00 00 00 00 00 00 00 |.u'.............|
000240e0 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |................|
000240f0 00 00 00 00 a9 75 27 a0 0a 00 00 00 00 00 00 00 |.....u'.........|
00024100 00 00 00 00 a8 75 27 a0 4c 75 27 a0 3c 75 27 a0 |.....u'.Lu'.<u'.|
00024110 fc d9 0d a0 0c da 0d a0 00 00 00 00 78 75 27 a0 |............xu'.|
00024120 50 75 27 a0 30 17 09 a0 ec d9 0d a0 1e 71 0e a0 |Pu'.0........q..|
00024130 02 00 00 00 01 00 00 00 2c 76 27 a0 0c 00 00 00 |........,v'.....|
00024140 ec 75 27 a0 2c 76 27 a0 e8 75 27 a0 7c 75 27 a0 |.u'.,v'..u'.|u'.|
00024150 14 14 09 a0 8c 16 09 a0 00 00 00 00 a8 75 27 a0 |.............u'.|
00024160 01 00 00 00 00 12 09 a0 00 00 00 20 cf 51 01 06 |........... .Q..|
...
000248c0 00 00 00 00 00 00 00 00 fc 03 00 00 56 41 4c 49 |............VALI|
000248d0 44 41 54 49 4f 4e 3d 4e 6f 6e 65 00 56 45 52 53 |DATION=None.VERS|
000248e0 49 4f 4e 53 3d 42 54 42 4c 5f 44 2e 32 2e 31 2e |IONS=BTBL_D.2.1.|
000248f0 39 2c 42 49 4f 53 5f 48 34 33 30 2c 43 54 4c 4d |9,BIOS_H430,CTLM|
00024900 5f 55 38 32 37 2c 41 50 50 5f 35 32 31 58 00 46 |_U827,APP_521X.F|
00024910 6c 61 73 68 61 62 6c 65 3d 31 30 32 38 5f 30 30 |lashable=1028_00|
00024920 31 33 5f 31 30 32 38 5f 30 31 36 44 00 00 00 00 |13_1028_016D....|
00024930 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...
being the start of the corrupted chunk (again, whole page aligned)
and the /lib/tls/libc-2.3.4.so has that garbage in a different location (that's
why prelink moans):
00109000 20 00 00 32 4d 65 67 61 52 41 49 44 4c 44 20 30 | ..2MegaRAIDLD 0|
00109010 20 52 41 49 44 30 20 20 31 33 39 47 35 32 31 58 | RAID0 139G521X|
00109020 00 00 00 00 01 00 00 00 b8 74 27 a0 64 1b 17 a0 |.........t'.d...|
00109030 00 00 00 00 7c a0 9b a0 7c a0 9b a0 24 77 27 a0 |....|...|...$w'.|
00109040 70 74 27 a0 4c 3e 07 a0 2c 36 07 a0 0c 00 00 00 |pt'.L>..,6......|
00109050 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00109060 1e 04 01 00 00 01 00 00 00 00 00 00 00 00 00 00 |................|
00109070 00 00 00 00 00 00 80 0f 11 26 06 a0 a8 60 0f a0 |.........&...`..|
00109080 00 00 00 00 00 00 00 00 18 95 13 a0 00 00 01 00 |................|
00109090 01 00 00 00 00 d2 8b 7d 00 00 00 00 cf 51 01 06 |.......}.....Q..|
001090a0 00 00 80 57 d4 74 27 a0 3c 4e 06 a0 ac 11 09 a0 |...W.t'.<N......|
001090b0 38 75 27 a0 f0 74 27 a0 24 da 08 a0 00 00 00 00 |8u'..t'.$.......|
001090c0 48 fd 00 00 00 00 00 00 00 00 00 02 1c 00 00 00 |H...............|
001090d0 10 75 27 a0 1c 00 00 00 00 00 00 00 00 00 00 00 |.u'.............|
001090e0 00 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00 |................|
001090f0 00 00 00 00 a9 75 27 a0 0a 00 00 00 00 00 00 00 |.....u'.........|
00109100 00 00 00 00 a8 75 27 a0 4c 75 27 a0 3c 75 27 a0 |.....u'.Lu'.<u'.|
00109110 fc d9 0d a0 0c da 0d a0 00 00 00 00 78 75 27 a0 |............xu'.|
00109120 50 75 27 a0 30 17 09 a0 00 00 00 02 e8 75 27 a0 |Pu'.0........u'.|
00109130 70 75 27 a0 b8 9e 0d a0 30 26 06 a0 a8 60 0f a0 |pu'.....0&...`..|
00109140 00 00 00 00 00 04 00 00 00 04 00 00 7c 75 27 a0 |............|u'.|
00109150 14 14 09 a0 8c 16 09 a0 00 00 00 00 cf 51 01 06 |.............Q..|
00109160 00 00 80 57 00 12 09 a0 00 00 00 20 cf 51 01 06 |...W....... .Q..|
00109170 00 00 80 57 60 bf 1a a0 ff ff ff ff 00 00 00 00 |...W`...........|
00109180 08 d0 8a 7d 00 00 00 00 08 a0 9b c0 08 00 00 00 |...}............|
...
001098c0 00 00 00 00 00 00 00 00 fc 03 00 00 56 41 4c 49 |............VALI|
001098d0 44 41 54 49 4f 4e 3d 4e 6f 6e 65 00 56 45 52 53 |DATION=None.VERS|
001098e0 49 4f 4e 53 3d 42 54 42 4c 5f 44 2e 32 2e 31 2e |IONS=BTBL_D.2.1.|
001098f0 39 2c 42 49 4f 53 5f 48 34 33 30 2c 43 54 4c 4d |9,BIOS_H430,CTLM|
00109900 5f 55 38 32 37 2c 41 50 50 5f 35 32 31 58 00 46 |_U827,APP_521X.F|
00109910 6c 61 73 68 61 62 6c 65 3d 31 30 32 38 5f 30 30 |lashable=1028_00|
00109920 31 33 5f 31 30 32 38 5f 30 31 36 44 00 00 00 00 |13_1028_016D....|
00109930 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
Neither libc-2.3.4.so.preverify1 nor libc-2.3.4.so.preverify2 was ever
prelinked, so we can easily rule prelink out (I'd certainly wonder where prelink
could grab similar kind of garbage when it is not present in any of the
libraries that are installed).
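Since the garbage always contains the same MegaRAID string, a saved copy of the library can be scanned for it directly. This is a rough sketch under the assumption of 4 KiB pages; the signature constant and function name are chosen for illustration only.

```python
PAGE_SIZE = 4096          # assumed page size
SIGNATURE = b"MegaRAID"   # string seen at the start of every corrupted chunk


def signature_pages(data: bytes, signature: bytes = SIGNATURE,
                    page_size: int = PAGE_SIZE):
    """Return sorted start offsets of the pages in which the signature begins."""
    pages = set()
    pos = data.find(signature)
    while pos != -1:
        pages.add(pos - pos % page_size)
        pos = data.find(signature, pos + 1)
    return sorted(pages)
```

Run against the copies discussed above, this would be expected to report the page at 0x24000 for preverify1 and 0x109000 for the later file, since a clean libc should not contain the string at all.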
------- Additional Comments From bmarson 2006-06-12 09:32 EST -------
I'm rebuilding veritas2 with RHEL4-U3 to make sure there is no weird behavior and
to have as a reference.
------- Additional Comments From bmarson 2006-06-12 10:28 EST -------
I verified that libc is not showing any corruption on RHEL4-U3. Both copies of
the libc file are identical and rpm -V does not even show libc as having any change.
------- Additional Comments From bmarson 2006-06-12 12:11 EST -------
I reinstalled the 06/07 tree on veritas3. This time with the intent on running
a 32bit app different than veritas. I picked iozone which I verified worked on
veritas2 (RHEL4-U3). This was a binary copied over, not built locally.
Trying to run the binary first, it segv's. It fails in getopt().... rpm -V shows
the library corrupt. This time, when I tried to rerun iozone after the verify,
it still segv'ed. Rerunning the verify reported the same results. Both
copies of the library are identical too.
Interestingly enough, running iozone without an arg did report the helper message
from the app (i.e., enter -h for all the options).
I proceeded to reboot the box and iozone segv'ed again. A 3rd verify
showed the same results BUT cmp against the older copy of the libraries
(pre-reboot) shows one byte difference.
Of further interest, iozone without args NOW segv's.
I'm running out of ideas of things to look for. One thought was to try and get a
copy of libc with a program that bypasses the pagecache.
Barry
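One possible way to get such a copy is to open the file with the Linux O_DIRECT flag, which asks the kernel to read from the device rather than from cached pages. The following is a minimal sketch, not a tested diagnostic: O_DIRECT behaviour varies by filesystem, so the code falls back to an ordinary read when the flag is missing or rejected, and the chunk size and function names are assumptions.

```python
import mmap
import os

CHUNK = 1 << 20  # 1 MiB; O_DIRECT needs sizes that are block-size multiples


def _read_all(fd: int, chunk: int) -> bytes:
    # An anonymous mmap gives a page-aligned buffer, as O_DIRECT reads require.
    buf = mmap.mmap(-1, chunk)
    out = bytearray()
    try:
        while True:
            n = os.readv(fd, [buf])
            out += buf[:n]
            if n < chunk:  # short read: end of a regular file
                return bytes(out)
    finally:
        buf.close()


def read_bypassing_page_cache(path: str, chunk: int = CHUNK):
    """Return (data, used_direct); falls back to a buffered read on failure."""
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct_flag:
        try:
            fd = os.open(path, os.O_RDONLY | direct_flag)
            try:
                return _read_all(fd, chunk), True
            finally:
                os.close(fd)
        except OSError:
            pass  # the filesystem rejected O_DIRECT
    with open(path, "rb") as f:
        return f.read(), False
```

If the direct read and a normal read of /lib/tls/libc-2.3.4.so disagree, the corruption would be in the page cache only; if they agree, it has reached the disk.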
------- Additional Comments From jakub 2006-06-12 12:19 EST -------
Can you perhaps try RHEL4-U3 kernel with RHEL4-U4 userland? Were there any
changes in the MegaRAID driver in U4?
------- Additional Comments From bmarson 2006-06-12 13:09 EST -------
I installed the U3 kernel on the existing U4 bits and we have the same
corruption. gdb shows the failure as early as _dl_start_user .... -> sysconf
I wonder if a reinstall and a reboot to single user mode would show anything.
Any thoughts ?
------- Additional Comments From jakub 2006-06-12 13:20 EST -------
If you took the already corrupted U4 bits with U3 kernel, then that only shows
that by the time the corruption happens, it is not just in memory, but also
on disk.
But we still don't know what causes the corruption.
For that, doing an install with U3 kernel plus U4 userland, or if you think this
might have anything to do with glibc (I don't), an U3 install with U4 glibc
composed into it might show something.
------- Additional Comments From bmarson 2006-06-12 13:48 EST -------
It appears to not be easy to create such an install environment. I don't have
the know-how at this time. How about a U3 install with a U4 up2date without a
kernel update? That might shed some light.
------- Additional Comments From tao 2006-06-12 14:45 EST -------
Issue is being worked in Engineering. This isn't a beta blocker, but
should be resolved prior to GA.
Issue escalated to Support Engineering Group by: andriusb.
Internal Status set to 'Waiting on SEG'
This event sent from IssueTracker by andriusb
issue 95733
And now for a further update, in real time... We have been trying to take a U3 system and do an update with the 06/07 tree without the kernel bits. This is more of an effort than I ever dreamed. In parallel, I rebuilt veritas2 with the 06/07 tree, but this time I installed the Minimal kits. Status: libc is still corrupt, yet iozone runs successfully.
Barry

I decided to test this on the third system we use for Veritas cert: a Dell PE6800 16CPUx16GB with the same PERC type RAID controller. Sure enough, iozone fails the same way with a 06/07 tree "everything" install. I am looking into two things: building on the system without Megaraid, and taking a U3 system and installing a U4 kernel.

We get the same failure with the pe6800 system. Building the pe2850 with simple SCSI shows NO corruption. According to Josh Giles (formerly at Dell, now Red Hat QA), there were firmware issues with PERC4 RAID controllers. We are attempting to update our firmware now on veritas2 (pe2850 system). Also of note, jturner now recreates the problem when he uses the PERC controller and gets megaraid loading. I believe it was verified that the megaraid driver did not change between U3/U4. Stay tuned.
Barry

The last time we touched the megaraid driver was 2.6.9-11.38 (U2). FWIW.

The corruption looks like SCSI Inquiry data:
[root@pe6800-01 ~]# sg_inq -h /dev/sda
standard INQUIRY:
 00     00 00 02 00 20 00 00 32  4d 65 67 61 52 41 49 44    .... ..2MegaRAID
 10     4c 44 20 30 20 52 41 49  44 30 20 20 35 35 39 47    LD 0 RAID0  559G
 20     35 32 31 53 00                                      521S.
These strings are also in /proc, FWIW:
[root@pe6800-01 ~]# more /proc/scsi/scsi
Attached devices:
...
Host: scsi0 Channel: 02 Id: 00 Lun: 00
  Vendor: MegaRAID Model: LD 0 RAID0 559G Rev: 521S
  Type: Direct-Access ANSI SCSI revision: 02
Josh is checking to see if this is a known problem.

SCSI Inquiries happen at boot time, and possibly after a bus error. We could turn on logging to see if they happen at other times.
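To make the match with Inquiry data easier to see, here is a small sketch that decodes the identification strings from a standard INQUIRY response. The field positions (vendor in bytes 8 to 15, product in 16 to 31, revision in 32 to 35) come from the SCSI standard; the function name and the sample constant, copied from the sg_inq output above, are for illustration only.

```python
def parse_standard_inquiry(data: bytes):
    """Decode the identification strings of a standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")
    return {
        "vendor": data[8:16].decode("ascii", "replace").rstrip(),
        "product": data[16:32].decode("ascii", "replace").rstrip(),
        "revision": data[32:36].decode("ascii", "replace").rstrip(),
    }


# First 36 bytes as printed by sg_inq -h on pe6800-01.
SAMPLE = bytes.fromhex(
    "00 00 02 00 20 00 00 32 4d 65 67 61 52 41 49 44 "
    "4c 44 20 30 20 52 41 49 44 30 20 20 35 35 39 47 "
    "35 32 31 53"
)

info = parse_standard_inquiry(SAMPLE)
print(info["vendor"], "|", info["product"], "|", info["revision"])
```

Note that the corrupted pages quoted earlier begin with 20 00 00 32 followed by MegaRAID, which appears to be this same response from byte 4 onward.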
It is interesting that the problem does not occur on U3 and it does on U4, where the driver is the same and there is no change to the FW. As a guess, I'd say that a change in dm or LVM in U4 may be provoking the bug in the driver or the firmware. We could test this theory by doing an install without dm and LVM. That is not a fix, though, so let's wait and see if this is a known bug.

Just did an install without LVM or dm and still get the corruption on boot, so I think we can factor that out of the picture.

The current theory is that when the o.s. issues a SCSI Inquiry command that requests "Vital Product Data" (EVPD) from the Megaraid adapter, the adapter DMAs the data to the wrong place in memory. This type of Inquiry is used to get more detailed info, such as the WWID of the device. I added some logging to the driver and found that there is just one Inquiry issued with EVPD set, late in the boot process (the second Inquiry below):
messagebus: messagebus startup succeeded
cups-config-daemon: cups-config-daemon startup succeeded
haldaemon: haldaemon startup succeeded
fstab-sync[5459]: removed all generated mount points
fstab-sync[5476]: added mount point /media/cdrom for /dev/hda
fstab-sync[5783]: added mount point /media/cdrom1 for /dev/hdf
fstab-sync[5792]: added mount point /media/floppy for /dev/hde
kernel: Inquiry 00 00 00 ff 00
kernel: Inquiry 01 80 00 ff 00
fstab-sync[5909]: added mount point /media/floppy1 for /dev/fd0
inventory.py: Reading DMI info failed
rhqa-inventory: inventory.py startup succeeded
kernel: mtrr: type mismatch for c8000000,1000000 old: write-back new: write-combining
This is presumably coming from something like haldaemon running the scsi_id utility to get the WWID. This may be where the difference is between U3 and U4.
Now I have a couple questions for Jakub:
1) I imagine that libc.so is mmapped, and it could certainly be the victim of a bad DMA write from the megaraid hba. Do you know of a way that this would be written to disk, where you found it with hexdump?
2) I could use some help finding an easy way to reproduce this, and testing the fix. I don't know enough about glibc to know: can I just replace the corrupted /lib/tls/libc-2.3.4.so file with a clean one from the rpm, then run the megaraid Inquiry test to see if it gets corrupted?
Thanks

1) yes, libc.so.6 is mapped in all (non-statically linked) processes, but the
corruption was always in its read-only PT_LOAD segment, which is mapped with
PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE. I'm not aware of any program
that would mmap libc.so.6 using a writable or non-private mapping, so it is
much more probable the corruption happens at the page cache level in the kernel
directly.
2) easiest is probably rpm -Uvh --force glibc-2.3.4-2.*.i686.rpm
You can replace just one file of course, but you'd need to do that atomically
(copy the non-corrupted one from e.g. rpm2cpio to /lib/tls/~libc-2.3.4.so
and then mv -f /lib/tls/{~,}libc-2.3.4.so
or something similar).
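The copy-then-rename approach in 2) can be sketched in Python as follows. This is only an illustration of the technique, with hypothetical function and argument names; it relies on os.replace, which renames over the target in one step when both paths are on the same filesystem.

```python
import os
import shutil
import tempfile


def replace_atomically(clean_copy: str, target: str) -> None:
    """Copy clean_copy next to target, then rename it over target."""
    target_dir = os.path.dirname(os.path.abspath(target))
    # The temporary file must be on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix="~")
    try:
        with os.fdopen(fd, "wb") as tmp, open(clean_copy, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        shutil.copymode(clean_copy, tmp_path)
        os.replace(tmp_path, target)  # comparable to mv -f in one directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Processes that already have the old file mapped keep their mapping of the old inode, while anything started afterwards sees the complete new file, never a half-written one.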
Status:
1. I modified the driver so it would reject Inquiry EVPD commands. This solved the corruption problem, at least in the scenario where I can most easily detect it.
2. Seokmann indicated that this may not be a complete solution. The problem may be because certain megaraid models should not be run in 64-bit DMA mode. He suggested that I comment out the section in the driver that enables it to switch from 32-bit to 64-bit DMA:
pci_set_dma_mask(adapter->pdev, DMA_64BIT_MASK)
This also seemed to solve the problem.
3. Seokmann suggested a patch similar to:
http://marc.theaimsgroup.com/?t=114805150900005&r=1&w=2
This prevents the switch to 64-bits on boards that do not support it. This did not solve the problem because the board I am using is one of the ones that is allowed to be set to 64-bits (PERC4E_DI_KOBUK).
4. I added mem=3GB on the kernel command line and tested with the stock driver. This did not solve the problem.
At this point it is not clear whether there is a firmware problem that is specific to EVPD, or if the fw problem is with 64-bit DMA on some or all models. If it is the latter, then the problem apparently exists on a broader set of hardware models than LSI Logic expected. We would also need to explain test 4 above. Seokmann is checking with the LSI Logic firmware team. I have also asked Dell to provide any info they have on megaraid fw issues.

*** Bug 196573 has been marked as a duplicate of this bug. ***

There are two separate bugs:
1) The firmware can corrupt data when processing an Inquiry command with the EVPD bit set (this command is used to get the WWID of the device, for example). The board incorrectly returns normal Inquiry data, which typically overflows the input buffer.
2) The driver incorrectly enables 64-bit DMA on some models that do not correctly support it.
A patch to fix these issues is attached. I have built a kernel with this patch for testing.
http://people.redhat.com/coughlan/.2.6.9-40.ELbz194533/
It would be helpful if anyone who is having problems with megaraid systems would test this kernel. You will see an error logged when the o.s. issues a request to send an EVPD Inquiry:
kernel: SCSI error : <0 1 0 0> return code = 0x40000
This is not a problem. This is what the megaraid driver normally emits when it receives an unsupported command.

Created attachment 131615
fix two corruption bugs
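For reference, the two Inquiry lines logged earlier can be decoded to show which one has the EVPD bit set. This is an illustrative sketch, not the driver patch: it assumes the logged bytes are CDB bytes 1 to 5 (the opcode 0x12 having been omitted from the log), and the function names are hypothetical. In a 6-byte INQUIRY CDB, bit 0 of byte 1 is EVPD, byte 2 is the page code, and byte 4 is the allocation length.

```python
INQUIRY_OPCODE = 0x12  # SCSI INQUIRY operation code


def decode_inquiry_cdb(cdb: bytes):
    """Return the EVPD flag, page code and allocation length of an INQUIRY CDB."""
    if len(cdb) != 6 or cdb[0] != INQUIRY_OPCODE:
        raise ValueError("not a 6-byte INQUIRY CDB")
    return {
        "evpd": bool(cdb[1] & 0x01),
        "page_code": cdb[2],
        "allocation_length": cdb[4],
    }


def from_log_bytes(text: str):
    """Assumes the logged bytes are CDB bytes 1..5, without the opcode."""
    return decode_inquiry_cdb(bytes([INQUIRY_OPCODE]) + bytes.fromhex(text))


for logged in ("00 00 00 ff 00", "01 80 00 ff 00"):
    print(logged, "->", from_log_bytes(logged))
```

Under that assumption, the second command requests VPD page 0x80 (the unit serial number page) with a 255-byte buffer, which fits the theory that scsi_id was asking for the device's identity.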
I have done some preliminary testing of the test kernel listed above. Installing the .2.6.9-40.ELbz194533 kernel just before reboot of a 6/7 tree yields no known data corruption. "SCSI error ..." messages are seen 3 times. The last time is when the system inventory RPM scans the system. This was a minimal install and rpm -V shows no corruption in glibc. iozone works too. I want to repeat this test with a full install, so I can test veritas as well as the openoffice suite (since it's 32bit and known to fail without a fix).
Barry

Created attachment 131753
rev 2

Here is an updated version of the patch. This avoids the generation of a SCSI error message in the log when an Inquiry EVPD command is submitted. A kernel with this patch applied is available at:
http://people.redhat.com/coughlan/.2.6.9-40.ELbz194533a/
Please test.

committed in stream U4 build 40.1. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

The RHEL4 U4 40.1 x86_64 kernel was installed (just before reboot of a 7/6 U4 tree) and tested on the veritas system. While veritas software could not be tested at this time (resource constraint), there was no sign of data corruption in glibc. Also the other KNOWN 32 bit app that used to fail (Ooffice) ran successfully. Unless someone absolutely needs veritas tested with THIS kernel, I declare the fix a success.
Barry

VERIFIED kernel-smp-2.6.9-40.1.EL.x86_64.rpm works as noted above

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0575.html