Last week, I updated my systems with the do_brk() errata kernels. One system crashed last night. I did rebuild the kernel RPM with the patch from bugzilla #97843 added and LVM_VFS_ENHANCEMENT defined, and I _think_ that the crash may have happened when a snapshot was being made (but I'm not sure). The system is a Penguin Computing dual PIII with 1G RAM, an AMI MegaRAID adapter with multiple drives striped and mirrored (presented as one logical drive to the system). All filesystems are ext3, and all except /boot are on LVM. The system runs sendmail (with a custom cf file), OpenLDAP, and cucipop (with some local mods); it is a "sidelined spam" server (stores suspected spam for users to check if they want) that receives about 5-6G of spam a day. The volume with the spam storage and sendmail queue is not snapshotted (and not backed up), but all other volumes are backed up over the network - a script is called that creates a snapshot, mounts it (as ext2), and when the backup is done another script is called to unmount and remove the snapshot. I will attach the decoded kernel oops. Please let me know if there is anything I can do to help.
Created attachment 96423: ksymoops output
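For readers following along, the snapshot backup cycle described in the report above (create a snapshot, mount it as ext2, back up, then unmount and remove it) could be sketched roughly as below. This is not the reporter's actual script, which was not posted. The function names, the read-only mount option, and the example volume names and sizes are all assumptions for illustration; only the standard lvcreate, mount, umount, and lvremove commands are used.

```python
import subprocess


def snapshot_commands(vg, lv, snap_name, snap_size, mount_point):
    """Build the four commands of one snapshot cycle (names are hypothetical)."""
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    create = ["lvcreate", "-s", "-L", snap_size, "-n", snap_name, origin]
    # Assumption: the snapshot is mounted read-only; the report only says "as ext2".
    mount = ["mount", "-t", "ext2", "-o", "ro", snap, mount_point]
    umount = ["umount", mount_point]
    remove = ["lvremove", "-f", snap]
    return create, mount, umount, remove


def run_backup(backup_cmd, vg, lv, snap_name, snap_size, mount_point,
               runner=subprocess.check_call):
    """Run backup_cmd against a temporary snapshot, cleaning up afterwards.

    The nested try/finally blocks only undo steps that actually succeeded:
    the snapshot is removed once created, and unmounted only if mounted.
    """
    create, mount, umount, remove = snapshot_commands(
        vg, lv, snap_name, snap_size, mount_point)
    runner(create)
    try:
        runner(mount)
        try:
            runner(backup_cmd)
        finally:
            runner(umount)
    finally:
        runner(remove)
```

A call might look like run_backup(["tar", "-cf", "/backup/home.tar", "-C", "/mnt/snap", "."], "vg00", "home", "homesnap", "1G", "/mnt/snap"), where every path and name is a placeholder. Passing a different runner (for example, one that only logs each command) makes it possible to check the order of operations without touching real volumes.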
Hmm. There's definitely no footprint associated with LVM in this oops. It's a panic walking a list where, historically, we usually only see problems if there is hardware memory corruption going on. A memtest86 run might be useful; but for a corrupt list seen on a home-built kernel, I'm not sure that there's any useful debugging we can do without more information.
When I looked at the logs, it looked like this oops happened shortly after an LVM snapshot was created, but that may just have been coincidence; I didn't look at the order of events (the oops came 2 seconds _before_ an LVM snapshot was mounted, but then the system froze). At this point, I don't think it is hardware. We've got two Penguin Computing Relion servers (dual PIII 1.13GHz; one has 1G RAM and the other 2G RAM) still running RHL 8.0. Since upgrading to 2.4.20-24.8 (with the LVM patch from RH Bugzilla added), one (where this oops came from) has crashed once, and the other has crashed a couple of times (but no oops logged; we're working on getting a serial console server so we can capture any oops on the console). Both systems have run for a long time before that with no trouble (one for about 6-8 months, one for about a year and a half). If there's any more information I can provide, I'll be happy to. The only reason I'm running a home-built kernel is because the patch necessary for LVM snapshots to be useful with ext3 has not made its way into the errata kernels, and I need that patch to make good backups. These systems will eventually be migrated to RHEL ES (I thought about building a kernel from the RHEL 3 ES errata SRPM, but I don't know if that would work right on 8.0 since the kernel has NPTL patches and such), but I can't do that right this minute.
We have had the exact same problem with LVM and snapshots since Red Hat kernel 2.4.18-27. Our machine has 8GB of RAM, and the only way I could get snapshots to work was to artificially limit the system RAM with mem=6000M on the kernel command line. Judging from various Google searches, many people think this problem is related to a bug in the kernel VMM. I'm attaching the kernel BUG() output that I get when trying to use snapshots with 2.4.18-27, which points to vmalloc.c. With 2.4.20-28, I no longer get a kernel BUG(), but instead I receive:
lvcreate -- ERROR "Cannot allocate memory" creating VGDA for "/dev/vg00/lvsnap" in kernel
This is probably because __vmalloc() in mm/vmalloc.c no longer triggers BUG() as it did in 2.4.18-27, but now just returns NULL. Here are the specs for my machine:
Dell PE4600
2x2.6GHz Xeon with HT
8GB RAM
RH 7.3
Kernel 2.4.20-28
I really need 8GB of RAM, not 6000M.
Created attachment 97350: Kernel BUG when creating snapshots with 2.4.18-27
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/