Bug 458748

Summary: Repeated segfaults in a wide range of processes.
Product: [Fedora] Fedora
Reporter: Danny Yee <bookreviewer>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: medium    
Version: 9   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-08-14 01:33:09 UTC

Description Danny Yee 2008-08-12 02:57:02 UTC
Repeated segfaults in a wide range of processes.

Upgraded a mail server to Fedora 9 with kernel 2.6.25.11-97.fc9.i686 #1 SMP

Segfaults started about ten hours after the upgrade, early in the morning.  There were nearly 400 in total over four hours, affecting common processes: mostly dovecot-auth and spamd, plus a few imap-login processes.  Then there was a gap of 13 hours with no segfaults, followed in the evening by segfaults in a Networker backup process and postgrey (that was bad).  The following morning (six hours later), there were segfaults mostly in postfix processes: smtpd, local, trivial-rewrite, qmgr, etc.

Aug 11 02:27:39 mail kernel: dovecot-auth[16512]: segfault at a01d4c ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]
Aug 11 02:27:44 mail kernel: dovecot-auth[16521]: segfault at a01d4c ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]
   .
   .
Aug 11 02:30:33 mail kernel: spamd[12374]: segfault at 7f ip 00233007 sp bfe98670 error 4 in libperl.so[12f000+26a000]
Aug 11 02:31:17 mail kernel: dovecot-auth[17196]: segfault at a01d4c ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]
Aug 11 02:34:17 mail kernel: dovecot-auth[17956]: segfault at a01d4c ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]
Aug 11 02:36:33 mail kernel: spamd[18520]: segfault at 0 ip 001f7057 sp bfe998d0 error 4 in libperl.so[12f000+26a000]
  .
  .
Aug 11 06:47:02 mail kernel: dovecot-auth[4240]: segfault at 976eb6 ip 00976eb6 sp bf87bbac error 4 in dovecot-auth[8048000+3e000]
Aug 11 06:47:02 mail kernel: imap-login[7393]: segfault at fddbf4c2 ip 0044a047 sp bfb9248c error 6 in libgssapi_krb5.so.2.2.#prelink#.PpOFcj (deleted)[44a000+2d000]
  .
  .

Aug 12 02:02:45 mail kernel: smtpd[1353]: segfault at 4 ip 00460662 sp bf9ed3a0 error 4 in libcrypto.so.0.9.8g[39d000+137000]
Aug 12 02:10:58 mail kernel: smtpd[32430]: segfault at b66db78c ip b7f02c7b sp bfc24098 error 6 in smtpd[b7eb2000+73000]
Aug 12 02:32:10 mail kernel: local[6608]: segfault at 96889d5 ip 00119985 sp bfad3abc error 4 in ld-2.8.so[110000+1c000]
Aug 12 02:32:10 mail kernel: local[31800]: segfault at 96889d5 ip 00119985 sp bfad39ac error 4 in ld-2.8.so[110000+1c000]
Aug 12 03:27:48 mail kernel: trivial-rewrite[2901]: segfault at b68d063d ip b68d063d sp bfde7a1c error 4
Aug 12 07:03:31 mail kernel: smtpd[9909]: segfault at b6800dbc ip 0044eacb sp bfa68760 error 4 in libcrypto.so.0.9.8g[39d000+137000]
Aug 12 07:41:01 mail kernel: imap-login[10515]: segfault at 0 ip 00000000 sp bfa905ec error 4
Aug 12 11:58:25 mail kernel: qmgr[2276]: segfault at 8fb20fd ip 00119985 sp bfd9e49c error 4 in ld-2.8.so[110000+1c000]
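
For reading lines like the ones above, here is a minimal Python sketch that splits a kernel segfault message into its fields, decodes the "error N" value, and works out how far the faulting instruction pointer sits inside the mapped object. It assumes the message layout shown in this log and the standard x86 page-fault error-code bits (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode). The names parse_segfault and decode_error are hypothetical helpers, not part of any existing tool.

```python
import re

# Layout of the kernel segfault message as it appears in the log above.
# The trailing " in object[base+size]" part is optional because some
# lines in the log do not have it. The error code is printed in hex.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)"
    r" ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>.+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_error(code):
    """Describe the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def parse_segfault(line):
    """Return a dict of fields from one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    offset = None
    if m.group("obj") is not None:
        base = int(m.group("base"), 16)
        size = int(m.group("size"), 16)
        # Only report an offset when ip really lies inside the mapping.
        if base <= ip < base + size:
            offset = ip - base
    return {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "ip": ip,
        "error": decode_error(int(m.group("err"), 16)),
        "object": m.group("obj"),
        "offset": offset,
    }


if __name__ == "__main__":
    sample = (
        "Aug 11 02:27:39 mail kernel: dovecot-auth[16512]: segfault at a01d4c "
        "ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]"
    )
    print(parse_segfault(sample))
```

Under these assumptions, "error 4" reads as a user-mode read of a page that is not present and "error 6" as a user-mode write to one, and the first dovecot-auth line faults at offset 0x16c4a into ld-2.8.so.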

At the moment I'm still trying to stop/fix this, and I suspect it won't be reproducible once I've done that (I'm not prepared to play with a production server).

Possible clues.

The failures cluster in the early morning, when a tape dump runs, and in the evening, when a Networker backup runs, which suggests a disk access issue.
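
One way to check that clustering is to count segfault lines per hour and per process. The following is a rough Python sketch, assuming syslog lines in the format quoted above; the function name tally and the idea of feeding it lines from a file such as /var/log/messages are assumptions for illustration only.

```python
import re
from collections import Counter

# Matches the syslog timestamp and process name of a kernel segfault line.
STAMP_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d+) (?P<hour>\d{2}):\d{2}:\d{2} \S+ "
    r"kernel: (?P<proc>[^\[\s]+)\[\d+\]: segfault "
)


def tally(lines):
    """Count segfaults per (month, day, hour) and per process name."""
    by_hour = Counter()
    by_proc = Counter()
    for line in lines:
        m = STAMP_RE.match(line)
        if m is None:
            continue  # not a segfault line
        key = (m.group("mon"), int(m.group("day")), int(m.group("hour")))
        by_hour[key] += 1
        by_proc[m.group("proc")] += 1
    return by_hour, by_proc


if __name__ == "__main__":
    # A few lines copied from the log above; in practice the input
    # would be an open log file (path is an assumption).
    sample = [
        "Aug 11 02:27:39 mail kernel: dovecot-auth[16512]: segfault at a01d4c ip 00126c4a sp bf87ade0 error 4 in ld-2.8.so[110000+1c000]",
        "Aug 11 02:30:33 mail kernel: spamd[12374]: segfault at 7f ip 00233007 sp bfe98670 error 4 in libperl.so[12f000+26a000]",
        "Aug 12 02:02:45 mail kernel: smtpd[1353]: segfault at 4 ip 00460662 sp bf9ed3a0 error 4 in libcrypto.so.0.9.8g[39d000+137000]",
        "Aug 12 07:03:31 mail kernel: smtpd[9909]: segfault at b6800dbc ip 0044eacb sp bfa68760 error 4 in libcrypto.so.0.9.8g[39d000+137000]",
    ]
    by_hour, by_proc = tally(sample)
    for (mon, day, hour), count in sorted(by_hour.items()):
        print(f"{mon} {day:2d} {hour:02d}:00  {count}")
    print(by_proc.most_common())
```

Comparing the busiest hours against the backup schedule would show whether the faults really track the tape dump and Networker runs.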

The server has i2o RAID arrays, which used to have driver problems but have worked flawlessly for a couple of years now.

I upgraded from Fedora 8 to Fedora 9 using yum.  Could something critical not have been updated?

Any suggestions would be most welcome.  If I get another segfault I will probably just revert to the last "known to work" Fedora 8 kernel.

Comment 1 Dave Jones 2008-08-12 04:49:33 UTC
The first thing I'd suggest is to try running memtest86 for a while.
That it only seems to trigger under high disk activity smells like bad memory or a similar hardware problem.  Also, to the best of my knowledge, we've had no similar reports.

If memtest doesn't turn anything up, it would be interesting to know if the F8 kernel still works, as the two are quite similar (pretty much the same code, but with different config options).

Comment 2 Danny Yee 2008-08-14 01:33:09 UTC
Yes, it was bad memory - just a coincidence that it started after my upgrade!  Sorry to trouble you all with this.