Bug 55797

Summary:	(SCSI CCISS)error in shutdown - post-glibc 2.2.4 upgrade
Product:	[Retired] Red Hat Linux	Reporter:	Frank Hirtz <fhirtz>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	high
Version:	7.1
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-30 15:39:15 UTC	Type:	---

Description Frank Hirtz 2001-11-06 21:14:49 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20010917

Description of problem:
I'm now getting the following error, just after "turning off swap" during
shutdown:

VM: swap entry overflow
VM: swap entry overflow

And I've had machines hang on shutdown with "swap_dup: swap entry overflow".
The "swap entry overflow" errors occur with both the 2.4.7-10 and 2.4.9-6
kernels after the glibc-2.2.4 update, but so far only happen on machines
that use the cpqarray driver. Machines using the cciss driver aren't
affected.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Update glibc to 2.2.4
2. Shut down

Actual Results:  Machine hangs at swap shutdown with the following message:
swap_dup: swap entry overflow on 2.4.9-6 or
VM: swap entry overflow on 2.4.3.

Expected Results:  It should have cleanly shut down.

Additional info:
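
For context, a message of this kind suggests a per-slot reference count reaching its ceiling. The following is a minimal user-space model of a counter that saturates and warns instead of wrapping. It is a sketch under assumptions, not kernel source: the names (demo_slot_dup, overflow_warnings) and the ceiling value DEMO_COUNT_MAX are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_COUNT_MAX 0x7fff   /* hypothetical ceiling, chosen for illustration */

static int overflow_warnings = 0;   /* how many times the ceiling was hit */

/*
 * Take one more reference on a slot.
 * Returns 1 if the reference was recorded, 0 if the slot is unused.
 * The count saturates at DEMO_COUNT_MAX instead of wrapping around.
 */
static int demo_slot_dup(unsigned short *count)
{
    assert(count != NULL);

    if (*count == 0)
        return 0;                   /* unused slot: nothing to duplicate */

    if (*count < DEMO_COUNT_MAX - 1) {
        (*count)++;                 /* normal case: one more user */
        return 1;
    }

    /* At the ceiling: pin the count and warn rather than overflow. */
    overflow_warnings++;
    fprintf(stderr, "demo: swap entry overflow\n");
    *count = DEMO_COUNT_MAX;
    return 1;
}
```

In this model the warning can only appear once a slot's count is already near the ceiling, whether through very many legitimate references or because the stored count was damaged.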

Comment 1 Arjan van de Ven 2001-11-08 21:24:02 UTC

Which filesystems are in use?

Comment 2 Arjan van de Ven 2001-11-08 21:25:12 UTC

Also, can you give the output of lsmod?

Comment 3 Michael K. Johnson 2001-11-08 21:47:59 UTC

Also, a description of the process load on the machine would be useful.
We would also like to know if any swapfiles are in use, or if only
partitions are in use.

Comment 4 Frank Hirtz 2001-11-08 23:01:49 UTC

> Which filesystems are in use?

ext2

> The output of lsmod

On IBM x330, which hangs every time:

Module                  Size  Used by
ide-cd                 27168   0 (autoclean)
cdrom                  28736   0 (autoclean) [ide-cd]
nfs                    83648   1 (autoclean)
lockd                  53968   1 (autoclean) [nfs]
sunrpc                 69520   1 (autoclean) [nfs lockd]
autofs                 12128   1 (autoclean)
openafs               477472   2
e100                   60800   1 (autoclean)
ipchains               41856   0
ips                    39680   4
aic7xxx               115120   0 (unused)
sd_mod                 11824   4
scsi_mod              100896   3 [ips aic7xxx sd_mod]

On Compaq which just gives an error that clears:

Module                  Size  Used by
ide-cd                 27168   0 (autoclean)
cdrom                  28736   0 (autoclean) [ide-cd]
sg                     30064   0 (autoclean) (unused)
nfs                    83648   1 (autoclean)
lockd                  53968   1 (autoclean) [nfs]
sunrpc                 69520   1 (autoclean) [nfs lockd]
autofs                 12128   1 (autoclean)
openafs               477472   2
e100                   60800   1 (autoclean)
ipchains               41856   0
cpqarray               17840   4
sd_mod                 11824   0 (unused)
scsi_mod              100896   2 [sg sd_mod]


> Also, a description of the process load on the machine would be useful.
> We would also like to know if any swapfiles are in use, or if only
> partitions are in use.

The load is almost nothing - just start and reboot even without anything
interesting running. Just one 2Gb swap partition in use, with barely any
of it used at all.

Anything else?

Comment 5 Arjan van de Ven 2001-11-08 23:12:23 UTC

Since you're not doing anything extreme, I would have expected to see
this much more often in bug reports.

Two modules the systems have in common, e100 and openafs, are both
uncommon in Red Hat Linux installs.

Is it possible to 1) use eepro100 instead of e100 and 2) not use openafs
(in two separate steps) to see if either one is the culprit?

Comment 6 Frank Hirtz 2001-11-09 17:32:25 UTC

Using the eepro100 is worse - the swap error occurs and the machine hangs
and never reboots - and it happens every time.

Haven't tried taking afs out of the equation yet, but that'll be next.

Comment 7 Frank Hirtz 2001-11-12 15:06:58 UTC

Going back to our current config (ext2, initrd with ips, cpqarray and
cciss) showed that if I switch the order of cpqarray and cciss, then
machines with the corresponding raid driver for their card loading last
don't experience those errors at all.

Moving the cpqarray and cciss to the beginning of the "--with="
arguments when creating the initrd allows those cards to work, while not
generating any swap errors, regardless of what their order is. Still
need to check on the machine with the IBM raid card in it to make sure
that the resulting initrd still works for it, as well as checking on a
machine with no raid cards at all, to see if they can reboot
successfully (using the initrd we have in production right now causes
such machines to hang on reboot, with the swap error during shutdown).

Here's the way we currently create our initrd:

$ mkinitrd --with=scsi_mod --with=sd_mod --with=aic7xxx --with=ips \
    --with=cpqarray --with=cciss /boot/initrd.new.img 2.4.9-6smp

That version causes no errors for the machines that use the cciss
driver, the swap error (but successful reboot) on machines that use the
cpqarray, swap errors and hangs for machines that use the ips driver,
and swap errors and hangs for machines w/ none of those drivers needed.

If we remove the ips-related "--with=" arguments from the command above,
and use just "--with=cpqarray --with=cciss", we still get the swap
errors that didn't exist at all prior to going to glibc-2.2.4, so the
glibc update appears to be causing this. Short of going back to
glibc-2.2.2, I can't verify that for certain, though.

Comment 8 Frank Hirtz 2001-11-12 16:52:26 UTC

Tried it on a stock 7.1 box with just the kernel (2.4.9-12smp) and necessary
utils upgraded (mkinitrd, e2fsprogs, filesystem). It still fails. It looks like
glibc is not the root of this.

Comment 9 Frank Hirtz 2001-11-12 22:44:42 UTC

Keeping everything else constant and using the mkinitrd line posted above, this
error occurs with the 2.4.9-6 and 2.4.9-12 kernels, but not with kernels
2.4.2 through 2.4.7-10 (the stock 7.2 kernel).

Comment 10 Frank Hirtz 2001-11-13 20:20:00 UTC

I have found a couple more bits of info on this.

Order really does make a difference. The one order that seems to work for
all systems, whether they have a Compaq, IBM or no raid controller at all,
is if they are loaded as:

$ mkinitrd --with=cpqarray --with=cciss --with=scsi_mod --with=sd_mod \
    --with=aic7xxx \
    --with=ips /boot/initrd.new.img 2.4.9-6smp

or

$ mkinitrd --with=cciss --with=cpqarray --with=scsi_mod --with=sd_mod \
    --with=aic7xxx \
    --with=ips /boot/initrd.new.img 2.4.9-6smp

The ordering that works properly also results in no swap being used
right after boot (on machines with enough RAM to hold everything
started), whereas the initrds that cause problems show some swap in
use even though it is surely unnecessary. Something seems to get stuck in
memory while the initrd is loaded, and that causes reboots to sometimes
hang or show swap errors.

Comment 11 Frank Hirtz 2001-11-19 20:25:40 UTC

It seems that this issue only happens when the cciss driver is used.

Comment 12 Frank Hirtz 2001-11-26 18:05:11 UTC

mkinitrd --with=cciss --with=scsi_mod --with=sd_mod --with=aic7xxx \
    /boot/initrd.test.img 2.4.9-16smp

(Machine has an Adaptec card, but it seems to do this on any machine.)

Reboot. 

swapoff /dev/sdc1

Result:
swap_dup: swap entry overflow. 

Info:

#free
         total      used     free     shared     buffers     cached
Mem:    125764     50944    74820         0         6852      21316
-/+ buffers/cache: 22776   102988
Swap:   265032         8   265024

#swapon -s 
Filename           Type          Size     Used     Priority
/dev/sdc1          partition     265032   8        -1


#swapoff /dev/sdc1
swap_dup: swap entry overflow 
--Hangs console--
(continued on different console)

#free
         total      used     free     shared     buffers     cached
Mem:    125764     51676    74088         0         6852      21328
-/+ buffers/cache: 22776   102988
Swap:        0         0        0
 
#swapon -s
Filename           Type          Size     Used     Priority
/dev/sdc1          partition     265032   4        -1
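
For readers checking the listings above, the "-/+ buffers/cache" line can be recomputed from the "Mem:" row: its first figure is used minus buffers minus cached, and its second is free plus buffers plus cached. A small sketch of that arithmetic follows; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One "Mem:" row as printed by free; all values in kilobytes. */
struct mem_row {
    long total;
    long used;
    long free_kb;
    long buffers;
    long cached;
};

/* First figure of "-/+ buffers/cache": memory used, not counting buffers and cache. */
static long used_minus_cache(const struct mem_row *m)
{
    assert(m != NULL);
    return m->used - m->buffers - m->cached;
}

/* Second figure: memory that would be free if buffers and cache were reclaimed. */
static long free_plus_cache(const struct mem_row *m)
{
    assert(m != NULL);
    return m->free_kb + m->buffers + m->cached;
}
```

With the first listing (used 50944, free 74820, buffers 6852, cached 21316) this gives 22776 and 102988, matching the printed line. The same arithmetic on the second listing gives 23496 and 102268, which does not match its printed "-/+ buffers/cache" line, so that line may have been carried over from the first listing.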

Comment 13 Norm Murray 2001-12-04 22:23:07 UTC

Applying the following patch to 2.4.9-12 fixes the problem.

--- linux/drivers/block/cciss.c~	Tue Oct 30 20:01:03 2001
+++ linux/drivers/block/cciss.c	Tue Oct 30 20:02:05 2001
@@ -156,11 +156,17 @@
 
  }
 
+
+static void cleanup_cciss_module(void);
+
 EXPORT_NO_SYMBOLS;
 static int __init init_cciss_module(void)
 {
-
-	return ( cciss_init());
+	int i;
+	if (i = cciss_init() ) {
+		cleanup_cciss_module();
+	}
+	return (i);
 }
 
 static void __exit cleanup_cciss_module(void)
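
The patch makes module init call the cleanup routine whenever cciss_init() returns nonzero, presumably so that a failed probe does not leave partially registered state behind. Below is a minimal user-space sketch of that pattern. Every name in it (demo_init, demo_cleanup, demo_module_init, resources_held) is hypothetical, and nothing is taken from the cciss source.

```c
#include <assert.h>

/* Hypothetical stand-in for whatever a driver registers during init. */
static int resources_held = 0;

/*
 * Pretend init routine: grabs one resource, then fails if asked to.
 * Returns 0 on success, nonzero on failure (after partial setup).
 */
static int demo_init(int should_fail)
{
    resources_held++;           /* e.g. a registration that succeeded */
    if (should_fail)
        return -1;              /* a later step failed */
    return 0;
}

/* Releases whatever demo_init managed to grab. */
static void demo_cleanup(void)
{
    assert(resources_held >= 0);    /* sanity check: never negative */
    resources_held = 0;
}

/* Same shape as the patched init function: undo partial setup on failure. */
static int demo_module_init(int should_fail)
{
    int rc = demo_init(should_fail);

    if (rc != 0)
        demo_cleanup();
    return rc;
}
```

Calling demo_module_init(1) returns -1 and leaves resources_held at 0. Without the cleanup call the failed init would leave it at 1, which is the kind of leftover state the patch appears to guard against.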

Comment 14 Alan Cox 2003-06-07 20:34:50 UTC

Can someone confirm this fix was folded into later trees?

Comment 15 Frank Hirtz 2003-06-09 16:58:11 UTC

It's in the current RHEL 2.1 tree, but does not appear to be in the current
Taroon sources.

Comment 16 Bugzilla owner 2004-09-30 15:39:15 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/