From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
Running pvmove causes a kernel panic. The server has an internal 1 TB
array on a 3w_xxxx controller (/dev/sdb), and a large (> 1 TB) external
fibre-channel array on an LSI fibre-channel controller (mptscsih,
/dev/sda). Running "pvmove /dev/sda /dev/sdb" causes a kernel panic,
while running "pvmove -n LogicalVolumeXX /dev/sda /dev/sdb" does not.
This is difficult to recover from, because the pvmove starts again
immediately upon entering run-level 1 (single user); slow typists will
not be able to run "pvmove --abort" before the system panics again.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
# Start with a production server with 1 TB of data on /dev/sdb1, but no LVM.
# BTW, I've only done this twice...
vgscan
# /dev/sda is external fibre-channel storage
pvcreate /dev/sda1
vgcreate -s 64M vg01 /dev/sda1
# actually creating multiple logical volumes;
# total space used > 1 TB, edited for brevity
lvcreate -L 1200G -n data2 vg01
mkfs -t ext3 /dev/vg01/data2
mount /dev/sdb1 /data1
mount /dev/sda1 /data2
rsync -a /data1/ /data2/
# verify correct copy
umount /data1/
# /dev/sdb is the internal 3w_xxxx RAID array
pvcreate /dev/sdb1
vgextend vg01 /dev/sdb1
pvmove /dev/sda1 /dev/sdb1
# never get this far...
# vgreduce vg01 /dev/sda1

Actual Results:
From syslog...

kernel: Unable to handle kernel paging request at virtual address f89f2000
...
Process kmirrord...
Call Trace:
 [<f88bb020>] rh_state+0x4c/0x5c [dm_mirror]
 [<f88bbe46>] do_writes+0x7d/0x243 [dm_mirror]
 [<f88bc030>] do_mirror+0x7e/0x84 [dm_mirror]
...
<0>Fatal exception: panic in 5 seconds
...

Expected Results:
pvmove should complete with exit status zero.

Additional info:
I will submit full Oops information as an attachment...
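Since the single-LV form of pvmove did not panic, a possible stopgap is to move one logical volume at a time instead of the whole PV. A rough sketch, untested, using the vg01 names from the steps above (LVs with no extents on the source PV will just produce a "no data to move" error, which is harmless):

    # Move each LV in vg01 individually; "pvmove -n <LV>" did not
    # trigger the panic, while the whole-PV form did.
    for lv in $(lvs --noheadings -o lv_name vg01); do
        pvmove -n "$lv" /dev/sda1 /dev/sdb1
    done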
Correction: Where I said "mount /dev/sda1 /data2", I really meant to say "mount /dev/vg01/data2 /data2".
Created attachment 123818 [details]
log entries in /var/log/messages relating to this Oops...
The attached log entries are from my second attempt to reproduce the problem, using an older kernel (2.6.9-22.EL). My first panic happened on a custom 2.6.9-22.0.2.ELsmp that included reiserfs support (a different system). My second panic occurred on this machine with a stock 2.6.9-22.0.2.ELsmp. The third panic, on the second attempt to reproduce, occurred when I tried the older 2.6.9-22.EL "up" kernel. Sabine is registered under RHN, so you should be able to get a hardware profile.
Can you extract the relevant lvm2 metadata? Ideally by running 'vgcfgbackup' whilst the pvmove is happening, but a backup from immediately before initiating the pvmove will probably give us the same information (look for one in /etc/lvm/archive or /etc/lvm/backup). Alternatively, can you capture the output of 'dmsetup table' while the pvmove is in progress? Also, can you supply the output of 'dmsetup info -c' and 'cat /proc/mounts'? We've fixed some bugs in this area recently, but I'm not sure yet whether or not this one is different.
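For example, a one-shot capture along these lines while the pvmove is in progress would do (the output file names are just examples; the reporter's script in a later comment does the same thing in a loop):

    vgcfgbackup -f /tmp/vg01_metadata vg01
    dmsetup table > /tmp/dmsetup_table.out
    dmsetup info -c > /tmp/dmsetup_info.out
    cat /proc/mounts > /tmp/proc_mounts.out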
FYI, I will be able to isolate hardware for this starting 2/6/2006. All the servers with this type of hardware are still in production.
Sorry for the delay; I have pulled hardware from production use to deal with this bug and have verified that I can reproduce the problem. But I can't seem to get to a single-user prompt, because the panicked pvmove operation restarts as soon as the volume groups are activated and panics again before I can reach the prompt. How do I get past this?
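My current plan is to boot rescue media and abort the pvmove from there. A sketch of what I have in mind, untested, and assuming the rescue kernel's dm-mirror is not affected by this bug:

    # From a live/rescue CD:
    vgscan
    vgchange -ay vg01      # activate the volume group under the rescue kernel
    pvmove --abort         # remove the checkpointed pvmove mirror
    vgchange -an vg01
    reboot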
Here goes... I wrote a script that runs the following commands every 10 seconds while the pvmove runs:

    vgcfgbackup
    dmsetup table
    dmsetup info -c
    cat /proc/mounts

I've collected all the output into a tarball, which I will attach after I finish this comment. The names of my logical volumes have been sanitized for your protection :-)

Here's the script:

#!/bin/sh
#######
# The /boot partition is not part of LVM, so this is
# a safe place to hide files. I can still get at these
# files with Knoppix even if the machine panics on boot.
SAFE_ZONE=/boot/bugzilla-179201

# Back up the named VG's metadata into the safe zone,
# even if LVM locking is already wedged.
VGSAVE() {
    vgcfgbackup -f ${SAFE_ZONE}/$1_metadata.${TIMESTAMP} \
        -v --ignorelockingfailure $1
}

######
if [ ! -d ${SAFE_ZONE}/vgcfgbackups ]
then
    mkdir ${SAFE_ZONE}/vgcfgbackups
fi

# Preserve the pre-pvmove metadata archives from /etc/lvm.
rsync -a /etc/lvm/ ${SAFE_ZONE}/vgcfgbackups/

######
TIMESTAMP="`date +%Y%m%d%H%M%S`"
VGSAVE vg01

pvremove /dev/sdb
pvcreate /dev/sdb
vgextend --autobackup y vg01 /dev/sdb
pvscan

# Kick off the pvmove that triggers the panic.
pvmove --debug /dev/sda /dev/sdb > pvmove.log 2>&1 &

# Snapshot the device-mapper state every 10 seconds until the panic.
while [ 1 ]
do
    TIMESTAMP="`date +%Y%m%d%H%M%S`"
    echo "######"
    echo "### Timestamp: ${TIMESTAMP}"
    echo ""
    VGSAVE vg01
    echo '### Output from "dmsetup table"'
    echo ""
    dmsetup table | sed -e 's/^/ /'
    echo ""
    echo '### Output from "dmsetup info -c"'
    echo ""
    dmsetup info -c | sed -e 's/^/ /'
    echo ""
    echo '### Output from "cat /proc/mounts"'
    echo ""
    cat /proc/mounts | sed -e 's/^/ /'
    echo ""
    sleep 10
done

######
# We'll never get here due to kernel panics.
rsync -a /etc/lvm/ ${SAFE_ZONE}/vgcfgbackups/
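For reference, I ran it with the loop output redirected into the safe zone so the capture survives the panic (the log name is arbitrary):

    ./cause_panic.sh > /boot/bugzilla-179201/capture.log 2>&1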
Created attachment 125224 [details]
output from cause_panic.sh

BTW, I recovered this data using the Ubuntu 5.10 Live CD for i386. It seemed to be able to restart the pvmove operation without causing a kernel panic. Ubuntu 5.10 uses kernel 2.6.12-9.
This bug, which appeared to disappear with EL4 Update 3, is now back again in 2.6.9-42.0.3...
I verified that this same bug is still present in kernel 2.6.9-55.0.2.EL, as well as in 2.6.9-42.0.8.EL...
I have the same problem. The system is running 2.6.9-55.0.2.EL.

  /dev/dm-11  vgdev  lvm2  a-  400.00G  100.01G
  /dev/dm-12  vgdev  lvm2  a-  200.00G  200.00G
  /dev/dm-13  vgdev  lvm2  a-  200.00G  200.00G

  pvmove /dev/dm-11

The LVs are active and mounted. After 1 hour (+/- 1 min) the system crashes with:

Process kmirrord (pid: 7776, threadinfo 0000010025b58000, task 0000010082bf87f0)
Stack: ffffffffa0150e8b 000001004159ca80 0000010025b59e68 000000000002c151
       0000000000000000 0000010314c0f400 ffffffffa0151e88 0000000000000021
       3a6c697475445252 007368706172473a
Call Trace:
 <ffffffffa0150e8b>{:dm_mirror:rh_state+79}
 <ffffffffa0151e88>{:dm_mirror:do_work+2149}
 <ffffffff8030c099>{thread_return+0}
 <ffffffff8030c0f1>{thread_return+88}
 <ffffffffa0151623>{:dm_mirror:do_work+0}
 <ffffffff80147c42>{worker_thread+419}
 <ffffffff801341cc>{default_wake_function+0}
 <ffffffff801341cc>{default_wake_function+0}
 <ffffffff8014b990>{keventd_create_kthread+0}
 <ffffffff80147a9f>{worker_thread+0}
 <ffffffff8014b990>{keventd_create_kthread+0}
 <ffffffff8014b967>{kthread+200}
 <ffffffff80110f47>{child_rip+8}
 <ffffffff8014b990>{keventd_create_kthread+0}
 <ffffffff8014b89f>{kthread+0}
 <ffffffff80110f3f>{child_rip+0}

Code: 0f a3 30 19 f6 31 c0 85 f6 0f 95 c0 c3 31 c0 c3 31 c0 c3 55
RIP <ffffffffa0150917>{:dm_mirror:core_in_sync+8} RSP <0000010025b59c80>
CR2: ffffff00101dc828
<0>Kernel panic - not syncing: Oops

After a reboot (booting 2.6.9-55.0.6.EL), an attempt to mount a filesystem placed on an LV belonging to this VG crashes the system again. After the next reboot (2.6.9-55.0.6.EL), pvmove continues moving extents. After "pvmove --abort" the FS can be mounted.

Linux host.at.worklplace 2.6.9-55.0.6.ELsmp #1 SMP Thu Aug 23 11:13:21 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

HT (hyper-threading) was enabled.
There is a bug in the bio_to_region() function in the RHEL4 kernel. It is reproducible, and is a problem with volumes that have multiple segments (to trigger the bug, the pvmove mirror segment must not be the first segment in the mapping table).
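To illustrate the trigger, a hypothetical "dmsetup table" for the temporary pvmove device (the device name, sizes, and offsets below are made up). If the region index for a bio is derived straight from bio->bi_sector, without subtracting the segment's starting offset in the table, then a mirror segment beginning at a nonzero sector would index past the end of the region hash, which would be consistent with the rh_state/core_in_sync addresses in the oopses above:

    # extents already moved: plain linear mapping onto the destination PV
    vg01-pvmove0: 0 2097152 linear 8:17 384
    # the active pvmove mirror -- it starts at sector 2097152, not 0,
    # which is the condition described above
    vg01-pvmove0: 2097152 2097152 mirror core 2 1024 no_sync 2 8:1 2097536 8:17 2097536

This would also explain why "pvmove -n <LV>" in the original report did not panic: moving a single LV keeps the mirror segment first in the table.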
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 200341 has been marked as a duplicate of this bug. ***
Committed in stream U7, build 68.7. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html