Bug 179201 - pvmove causes kernel panic
Summary: pvmove causes kernel panic
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i386
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Milan Broz
QA Contact: Brian Brock
URL:
Whiteboard: GSSApproved
Duplicates: 200341
Depends On:
Blocks: 428636 428637
 
Reported: 2006-01-28 01:30 UTC by Randy Zagar
Modified: 2018-11-28 19:43 UTC
CC List: 9 users

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:11:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
log entries in /var/log/messages relating to this Oops... (33.10 KB, text/plain)
2006-01-28 01:42 UTC, Randy Zagar
output from cause_panic.sh (55.44 KB, application/x-gzip)
2006-02-25 03:52 UTC, Randy Zagar


Links
System: Red Hat Product Errata
ID: RHSA-2008:0665
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7
Last Updated: 2008-07-24 16:41:06 UTC

Description Randy Zagar 2006-01-28 01:30:39 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
Running pvmove causes a kernel panic.

The server has an internal 1 TB array on a 3w_xxxx controller (/dev/sdb) and a large (>1 TB) external fibre-channel array on an LSI fibre-channel controller (mptscsih, /dev/sda).

Running "pvmove /dev/sda /dev/sdb" causes kernel panic, while running "pvmove -n LogicalVolumeXX /dev/sda /dev/sdb" does not.

This is difficult to recover from, as the pvmove starts again immediately upon entering run-level 1 (single user).  Slow typists will not be able to run "pvmove --abort" before the system panics again.
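
Since the single-LV form of pvmove did not trigger the panic here, a possible interim workaround (a sketch only, not verified, assuming every LV in vg01 currently sits on /dev/sda1) would be to move the logical volumes one at a time:

for lv in $(lvs --noheadings -o lv_name vg01); do
    # hypothetical workaround: one LV per pvmove, mirroring the "-n" invocation above
    pvmove -n "$lv" /dev/sda1 /dev/sdb1 || break
done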

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
# Start with a production server with 1 TB of data on /dev/sdb1, but no LVM
# BTW, I've only done this twice...

vgscan
# /dev/sda is external fibre-channel storage
pvcreate /dev/sda1
vgcreate -s 64M  vg01 /dev/sda1
# actually creating multiple logical volumes
# total space used > 1T, editing for brevity
lvcreate -L 1200G -n data2 vg01
mkfs -t ext3 /dev/vg01/data2
mount /dev/sdb1 /data1
mount /dev/sda1 /data2
rsync -a /data1/ /data2/
# verify correct copy
umount /data1/
# /dev/sdb is the internal 3w_xxxx raid array
pvcreate /dev/sdb1
vgextend vg01 /dev/sdb1
pvmove /dev/sda1 /dev/sdb1
# never get this far...
# vgreduce vg01 /dev/sda1

Actual Results:  From syslog...
kernel: Unable to handle kernel paging request at virtual address f89f2000
...
Process kmirrord...
Call Trace:
 [<f88bb020>] rh_state+0x4c/0x5c [dm_mirror]
 [<f88bbe46>] do_writes+0x7d/0x243 [dm_mirror]
 [<f88bc030>] do_mirror+0x7e/0x84 [dm_mirror]
...
<0>Fatal exception: panic in 5 seconds
...



Expected Results:  pvmove should complete with exit status zero

Additional info:

I will submit full Oops information as an attachment...

Comment 1 Randy Zagar 2006-01-28 01:40:54 UTC
Correction:

Where I said "mount /dev/sda1 /data2", I really meant to say "mount
/dev/vg01/data2 /data2".

Comment 2 Randy Zagar 2006-01-28 01:42:53 UTC
Created attachment 123818 [details]
log entries in /var/log/messages relating to this Oops...

Comment 3 Randy Zagar 2006-01-28 01:53:07 UTC
Attached log entries are from my 2nd attempt to reproduce the problem using an
older kernel (2.6.9-22.EL).

My first panic happened on a custom 2.6.9-22.0.2.ELsmp that included reiserfs
support (different system).

My second panic occurred on this machine with a stock 2.6.9-22.0.2.ELsmp.

Third panic, 2nd attempt to reproduce, occurred when I tried the older
2.6.9-22.EL "up" kernel.

Sabine is registered under RHN, so you should be able to get a hardware profile.

Comment 4 Alasdair Kergon 2006-02-01 17:36:12 UTC
Can you extract the relevant lvm2 metadata?

Ideally by running 'vgcfgbackup' whilst the pvmove is happening, but a backup
from immediately before initiating the pvmove will probably give us the same
information (look for one in /etc/lvm/archive or backup).

Alternatively capture the output of 'dmsetup table' while the pvmove is in progress?

Also can you supply the output of 'dmsetup info -c' and 'cat /proc/mounts'?
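
For reference, something along these lines from a second terminal while the pvmove is running should capture all of that (the output paths under /tmp are just examples):

vgcfgbackup -f /tmp/vg01_during_pvmove vg01
dmsetup table    > /tmp/dmsetup_table.txt
dmsetup info -c  > /tmp/dmsetup_info.txt
cat /proc/mounts > /tmp/proc_mounts.txt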


We've fixed some bugs in this area recently, but I'm not sure yet whether or not
this one is different.

Comment 5 Randy Zagar 2006-02-01 18:56:11 UTC
FYI, I will be able to isolate hardware for this starting 2/6/2006.  All the
servers with this type of hardware are still in production.

Comment 6 Randy Zagar 2006-02-16 21:20:10 UTC
Sorry for the delay; I have pulled hardware from production use to deal with
this bug, and have verified that I can reproduce the problem...

But I can't get to a single-user prompt, because the interrupted pvmove
operation restarts as soon as the volume groups are activated and the system
panics again before the prompt appears.

How do I get past this?

Comment 7 Randy Zagar 2006-02-25 03:49:08 UTC
Here goes...

I wrote a script that would run the following commands

    vgcfgbackup
    dmsetup table
    dmsetup info -c
    cat /proc/mounts

every 10 seconds while the pvmove runs.  I've collected all the output into a
tarball which I will attach after I finish this comment.  The names of my
logical volumes have been sanitized for your protection :-)

Here's the script:

#!/bin/sh
#######
# The /boot partition is not part of LVM, so this is
# a safe place to hide files.  I can still get at these
# files with Knoppix even if the machine panics on boot.

SAFE_ZONE=/boot/bugzilla-179201

VGSAVE() {
    vgcfgbackup -f ${SAFE_ZONE}/$1_metadata.${TIMESTAMP} \
        -v --ignorelockingfailure $1
    }

######

if  [ ! -d ${SAFE_ZONE}/vgcfgbackups ]
then
    mkdir ${SAFE_ZONE}/vgcfgbackups
fi

rsync -a /etc/lvm/ \
    ${SAFE_ZONE}/vgcfgbackups/

######

TIMESTAMP="`date +%Y%m%d%H%M%S`"

VGSAVE vg01

pvremove /dev/sdb
pvcreate /dev/sdb
vgextend --autobackup y vg01 /dev/sdb

pvscan

pvmove --debug /dev/sda /dev/sdb > pvmove.log 2>&1 &

while [ 1 ]
do
    TIMESTAMP="`date +%Y%m%d%H%M%S`"

    echo "######"
    echo "### Timestamp: ${TIMESTAMP}"
    echo ""

    VGSAVE vg01

    echo '### Output from "dmsetup table"'
    echo ""

    dmsetup table | \
        sed -e 's/^/    /'

    echo ""

    echo '### Output from "dmsetup info -c"'
    echo ""

    dmsetup info -c | \
        sed -e 's/^/    /'

    echo ""

    echo '### Output from "cat /proc/mounts"'
    echo ""

    cat /proc/mounts | \
        sed -e 's/^/    /'

    echo ""

    sleep 10
done

######
# We'll never get here due to kernel panics.

rsync -a /etc/lvm/ \
    ${SAFE_ZONE}/vgcfgbackups/

Comment 8 Randy Zagar 2006-02-25 03:52:19 UTC
Created attachment 125224 [details]
output from cause_panic.sh

BTW, I recovered this data using the Ubuntu 5.10 Live CD for i386.  It seemed
to be able to restart the pvmove operation without causing a kernel panic. 
Ubuntu 5.10 uses kernel 2.6.12-9.
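
In case it helps anyone else, recovery from such a live environment amounts to roughly the following (a sketch only; the VG name vg01 matches this setup, adjust as needed):

vgscan
vgchange -ay vg01            # activating the VG resumes the interrupted pvmove
pvmove --abort               # or let the resumed pvmove finish under the newer kernel
vgcfgbackup -f /tmp/vg01_metadata vg01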

Comment 9 Randy Zagar 2006-11-01 08:51:15 UTC
This bug, which appeared to have disappeared with EL4 Update 3, is back again in
2.6.9-42.0.3...

Comment 10 luca villa 2007-07-16 10:05:17 UTC
I verified that this same bug is still present in kernel 2.6.9-55.0.2.EL as well
as in 2.6.9-42.0.8.EL...

Comment 11 Ivan Stoykov 2007-09-25 20:15:40 UTC
I have the same problem. 

System is running 2.6.9-55.0.2EL.
  /dev/dm-11        vgdev lvm2 a-   400.00G 100.01G
  /dev/dm-12        vgdev lvm2 a-   200.00G 200.00G
  /dev/dm-13        vgdev lvm2 a-   200.00G 200.00G

pvmove /dev/dm-11

LVs are active and mounted.

After about 1 hour (+/- 1 min) the system crashes with:

Process kmirrord (pid: 7776, threadinfo 0000010025b58000, task 0000010082bf87f0)
Stack: ffffffffa0150e8b 000001004159ca80 0000010025b59e68 000000000002c151
       0000000000000000 0000010314c0f400 ffffffffa0151e88 0000000000000021
       3a6c697475445252 007368706172473a
Call Trace:<ffffffffa0150e8b>{:dm_mirror:rh_state+79} <ffffffffa0151e88>{:dm_mirror:do_work+2149}
       <ffffffff8030c099>{thread_return+0} <ffffffff8030c0f1>{thread_return+88}
       <ffffffffa0151623>{:dm_mirror:do_work+0} <ffffffff80147c42>{worker_thread+419}
       <ffffffff801341cc>{default_wake_function+0} <ffffffff801341cc>{default_wake_function+0}
       <ffffffff8014b990>{keventd_create_kthread+0} <ffffffff80147a9f>{worker_thread+0}
       <ffffffff8014b990>{keventd_create_kthread+0} <ffffffff8014b967>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014b990>{keventd_create_kthread+0}
       <ffffffff8014b89f>{kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 0f a3 30 19 f6 31 c0 85 f6 0f 95 c0 c3 31 c0 c3 31 c0 c3 55
RIP <ffffffffa0150917>{:dm_mirror:core_in_sync+8} RSP <0000010025b59c80>
CR2: ffffff00101dc828
 <0>Kernel panic - not syncing: Oops

After a reboot (into 2.6.9-55.0.6EL), an attempt to mount a filesystem placed on an LV
belonging to this VG crashes the system again.
After the next reboot (2.6.9-55.0.6EL), "pvmove" continues moving extents. After
"pvmove --abort" the FS can be mounted.

Linux host.at.worklplace 2.6.9-55.0.6.ELsmp #1 SMP Thu Aug 23 11:13:21 EDT 2007
x86_64 x86_64 x86_64 GNU/Linux
HT (hyper-threading) was enabled.

Comment 12 Milan Broz 2008-01-11 01:31:13 UTC
There is a bug in the bio_to_region function in the RHEL4 kernel.

Reproducible; the problem occurs with volumes that have multiple segments (the pvmove
mirror segment must not be the first in the mapping table to trigger this bug).
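
A minimal layout sketch that should exercise this case (placeholder device and VG names, not taken from this report): put more than one LV on the source PV, so the whole-PV pvmove builds a temporary device whose mirror segments do not all start at offset 0 in the mapping table.

pvcreate /dev/sdX1 /dev/sdY1
vgcreate vgtest /dev/sdX1
lvcreate -L 1G -n lv1 vgtest
lvcreate -L 1G -n lv2 vgtest        # second LV => second segment on /dev/sdX1
vgextend vgtest /dev/sdY1
pvmove /dev/sdX1 /dev/sdY1          # whole-PV move hits the non-first mirror segment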

Comment 15 RHEL Program Management 2008-01-11 01:49:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 Milan Broz 2008-01-14 14:19:41 UTC
*** Bug 200341 has been marked as a duplicate of this bug. ***

Comment 25 Jason Baron 2008-01-17 15:43:12 UTC
Committed in stream U7, build 68.7. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 30 errata-xmlrpc 2008-07-24 19:11:00 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

