Bug 602516
Summary:          pvmove stuck waiting for I/O to complete
Product:          Red Hat Enterprise Linux 4
Component:        lvm2
Version:          4.6
Hardware:         All
OS:               Linux
Status:           CLOSED NEXTRELEASE
Severity:         high
Priority:         high
Reporter:         Lachlan McIlroy <lmcilroy>
Assignee:         Alasdair Kergon <agk>
QA Contact:       Cluster QE <mspqa-list>
CC:               agk, dejohnso, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, pyu, samuli.kaski, thornber, vgaikwad
Target Milestone: rc
Doc Type:         Bug Fix
Cloned as:        684083, 706036 (view as bug list)
Bug Depends On:   684083, 706036
Last Closed:      2012-03-22 12:22:56 UTC
Description
Lachlan McIlroy
2010-06-10 03:03:57 UTC
Looking at the vmcore, the pvmove process is blocked attempting to suspend the device - it hasn't run for 13 minutes:

PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
 #2 [101e05d5ce0] dm_suspend at ffffffffa00de43d
 #3 [101e05d5d80] dev_suspend at ffffffffa00e11a1
 #4 [101e05d5db0] ctl_ioctl at ffffffffa00e2383
 #5 [101e05d5f40] sys_ioctl at ffffffff801a668a
    RIP: 0000003185dc25d9  RSP: 0000007fbfffb688  RFLAGS: 00000206
    RAX: 0000000000000010  RBX: ffffffff80110a92  RCX: 000000000000000a
    RDX: 000000000066c920  RSI: 00000000c138fd06  RDI: 0000000000000008
    RBP: 0000000000000000  R8:  0000000000000005  R9:  000000000066c950
    R10: 0000003186114700  R11: 0000000000000246  R12: 000000000066ca58
    R13: 0000000000000000  R14: 00000000000000fd  R15: 000000000066c920
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

I found this oracle process waiting for I/O to complete, and according to the last_ran time it got stuck just before the pvmove process, so it may be of interest:

PID: 7240   TASK: 10222fbf650   CPU: 0   COMMAND: "oracle"
 #0 [10222fc1b38] schedule at ffffffff8035fb2c
 #1 [10222fc1bc0] io_schedule at ffffffff80360fcb
 #2 [10222fc1bd0] wait_on_page_bit at ffffffff801688ae
 #3 [10222fc1c70] wait_on_page_writeback_range at ffffffff80168935
 #4 [10222fc1d40] sync_page_range_nolock at ffffffff80168aa2
 #5 [10222fc1d70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [10222fc1da0] generic_file_aio_write at ffffffff8016b1eb
 #7 [10222fc1df0] ext3_file_write at ffffffffa010b2f9
 #8 [10222fc1e10] do_sync_write at ffffffff8018f412
 #9 [10222fc1f10] vfs_write at ffffffff8018f50d
#10 [10222fc1f40] sys_pwrite64 at ffffffff8018f6ea
#11 [10222fc1f80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffdda8  RFLAGS: 00010246
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000002ae835f7
    RDX: 0000000000000400  RSI: 000000006071ec00  RDI: 0000000000000016
    RBP: 00048753a5be767c  R8:  0000000000092abc  R9:  000000000020e0c2
    R10: 000000000238e000  R11: 0000000000000246  R12: 0000000000000200
    R13: 0000007fbfffd8f0  R14: 0000002a97d56c48  R15: 0000000000000b6f
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

And this oracle process got stuck just after the pvmove process:

PID: 7238   TASK: 10223ba4530   CPU: 0   COMMAND: "oracle"
 #0 [1022302db38] schedule at ffffffff8035fb2c
 #1 [1022302dbc0] io_schedule at ffffffff80360fcb
 #2 [1022302dbd0] wait_on_page_bit at ffffffff801688ae
 #3 [1022302dc70] wait_on_page_writeback_range at ffffffff80168935
 #4 [1022302dd40] sync_page_range_nolock at ffffffff80168aa2
 #5 [1022302dd70] generic_file_aio_write_nolock at ffffffff8016b159
 #6 [1022302dda0] generic_file_aio_write at ffffffff8016b1eb
 #7 [1022302ddf0] ext3_file_write at ffffffffa010b2f9
 #8 [1022302de10] do_sync_write at ffffffff8018f412
 #9 [1022302df10] vfs_write at ffffffff8018f50d
#10 [1022302df40] sys_pwrite64 at ffffffff8018f6ea
#11 [1022302df80] system_call at ffffffff80110a92
    RIP: 000000318660bf75  RSP: 0000007fbfffd808  RFLAGS: 00010212
    RAX: 0000000000000012  RBX: ffffffff80110a92  RCX: 000000008c663000
    RDX: 00000000000a0000  RSI: 000000008c75f000  RDI: 0000000000000014
    RBP: 00048753a5c610a1  R8:  00000000000182a1  R9:  000000000020e4a6
    R10: 0000000029108000  R11: 0000000000000246  R12: 0000000000002000
    R13: 0000007fbfffd750  R14: 0000002a975bf4b0  R15: 000000009758e4c8
    ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

The pvmove process appears to be stuck in this code in dm_suspend():

	/*
	 * Then we wait for the already mapped ios to
	 * complete.
	 */
	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (!atomic_read(&md->pending) || signal_pending(current))
			break;

		io_schedule();
	}

The suspend will only continue if pvmove receives a signal or the count of pending I/Os drops to zero.
Let's check both cases:

static inline int signal_pending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

crash> task 1020e063110
PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
struct task_struct {
  state = 1,
  thread_info = 0x101e05d4000,
  ...

crash> thread_info 0x101e05d4000
struct thread_info {
  task = 0x1020e063110,
  exec_domain = 0xffffffff8043d0c0,
  flags = 0,
  ...

The TIF_SIGPENDING flag is not set, so it hasn't received a signal.

The dm_suspend() function leaves the mapped_device pointer in %rbp, which we can see here:

crash> dis dm_suspend
0xffffffffa00de24f <dm_suspend>:        push   %r14
0xffffffffa00de251 <dm_suspend+2>:      xor    %eax,%eax
0xffffffffa00de253 <dm_suspend+4>:      mov    $0xa,%ecx
0xffffffffa00de258 <dm_suspend+9>:      cld
0xffffffffa00de259 <dm_suspend+10>:     push   %r13
0xffffffffa00de25b <dm_suspend+12>:     mov    %esi,%edx
0xffffffffa00de25d <dm_suspend+14>:     push   %r12
0xffffffffa00de25f <dm_suspend+16>:     push   %rbp
0xffffffffa00de260 <dm_suspend+17>:     mov    %rdi,%rbp
...
0xffffffffa00de41a <dm_suspend+459>:    mov    0xe8(%rbp),%eax   # md->pending
0xffffffffa00de420 <dm_suspend+465>:    test   %eax,%eax
0xffffffffa00de422 <dm_suspend+467>:    je     0xffffffffa00de43f
0xffffffffa00de424 <dm_suspend+469>:    mov    %gs:0x0,%rax
0xffffffffa00de42d <dm_suspend+478>:    mov    0x8(%rax),%rax
0xffffffffa00de431 <dm_suspend+482>:    mov    0x10(%rax),%eax
0xffffffffa00de434 <dm_suspend+485>:    test   $0x4,%al
0xffffffffa00de436 <dm_suspend+487>:    jne    0xffffffffa00de43f
0xffffffffa00de438 <dm_suspend+489>:    callq  0xffffffff80360fbc <io_schedule>

And io_schedule() pushes it onto the stack here:

crash> dis io_schedule
0xffffffff80360fbc <io_schedule>:       push   %rbp
0xffffffff80360fbd <io_schedule+1>:     mov    %rsp,%rbp
0xffffffff80360fc0 <io_schedule+4>:     incl   0x166d1e(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fc6 <io_schedule+10>:    callq  0xffffffff8035f58b <schedule>
0xffffffff80360fcb <io_schedule+15>:    decl   0x166d13(%rip)   # 0xffffffff804c7ce4 <per_cpu__runqueues+4676>
0xffffffff80360fd1 <io_schedule+21>:    leaveq
0xffffffff80360fd2 <io_schedule+22>:    retq

crash> bt -f 7373
PID: 7373   TASK: 1020e063110   CPU: 0   COMMAND: "pvmove"
 #0 [101e05d5c48] schedule at ffffffff8035fb2c
    101e05d5c50: 0000000000000006 00000101e05d5d28
    101e05d5c60: 0000007300000046 000001020e063110
    101e05d5c70: 00000000000bc4cd 000001022e9d3110
    101e05d5c80: 000001020e0633f8 000001022e2c7200
    101e05d5c90: ffffffffa00dfa35 00000101e05d5cd4
    101e05d5ca0: 000001023c09cc00 0000000000000000
    101e05d5cb0: 00000000ffffffea 0000000000000000
    101e05d5cc0: ffffffffa00e1040 00000101e05d5cd8
    101e05d5cd0: ffffffff80360fcb
 #1 [101e05d5cd0] io_schedule at ffffffff80360fcb
    101e05d5cd8: 000001023f213e00 ffffffffa00de43d
    ...
crash> px *(struct mapped_device *)0x000001023f213e00 $7 = { io_lock = { activity = 0x0, wait_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e }, wait_list = { next = 0x1023f213e38, prev = 0x1023f213e38 } }, suspend_lock = { count = { counter = 0x0 }, sleepers = 0x0, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213e80, prev = 0x1023f213e80 } } }, pushback_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0 }, map_lock = {<No data fields>}, holders = { counter = 0x3 }, open_count = { counter = 0x1 }, flags = 0x1, queue = 0x1023e379420, disk = 0x100bfc97e00, interface_ptr = 0x1023ee29200, pending = { counter = 0x1 }, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c }, task_list = { next = 0x101e05d5d40, prev = 0x101e05d5d40 } }, deferred = { head = 0x101e2a69980, tail = 0x1020289a380 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023c09cc00, io_pool = 0x1023d338b80, tio_pool = 0x1023d338c80, bs = 0x1023ee29cc0, event_nr = { counter = 0x0 }, eventq = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213fa8, prev = 0x1023f213fa8 } }, frozen_sb = 0x0, suspended_bdev = 0x1023c89a300 } And here we have md->pending == 1 so it's still waiting for a device mapped I/O to complete and call dec_pending()->end_io_acct() which will wake up the pvmove process. So we need to work out what is holding up that last I/O. 
Some analysis from Bryn:

PV table:

pv0  8  16   /dev/sdb  KccCqL-ngfW-l0s9-EJZ0-SsGa-rL0Q-VZbUKe
pv1  8  48   /dev/sdd  jIXNDc-yBd3-dRKg-8m20-qWRV-jyhr-d5DTu4
pv2  8  96   /dev/sdg  iqCYlL-Hl0j-SyCN-XwKF-dxgE-UwXx-VfSU22
pv3  8  128  /dev/sdi  RGoqRm-BZ3H-b1sP-8JJ9-ijgi-gJW2-bV0a0c

According to the dm data in the lvmdump we have two suspended devices:

$ grep '\ L.s.\ ' lvmdump-hklxdv55-2010051831514/dmsetup_info
vgtest-Ftest    253 13 L-sw 3 2 0 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG606m6TeYUkWcl1U3kfqjIh5zJEhorK53Emr
vgtest-pvmove0  253 15 L-sw 3 3 1 LVM-53JBzdDpdYHnBaKBdbaLXmiZ7VdTG6069vX6BhJ0Z7etvouBH8VZM82I3GnhRXu5

This is consistent with the state recorded in pvmove_2:

root-root@hklxdv55-DR:/$ pvmove -v /dev/sdd /dev/sdi
    Wiping cache of LVM-capable devices
    Finding volume group "vgtest"
    Archiving volume group "vgtest" metadata (seqno 17).
    Creating logical volume pvmove0
    [...]
    Suspending vgtest-Ftest (253:13) without device flush
    Suspending vgtest-pvmove0 (253:15) without device flush
    Found volume group "vgtest"
    Found volume group "vgtest"
    Suspending vgtest-Ftestdb (253:14) without device flush

We can see that the temporary vgtest-pvmove0 LV has been created:

vgtest-pvmove0: 0 1048576 mirror core 1 1024 2 8:48 2097536 8:128 384
vgtest-pvmove0: 1048576 2097152 linear 8:48 384
vgtest-pvmove0: 3145728 131072 linear 8:48 3146112

Looking at the device stack at this point:

vgtest-Ftest: 0 2097152 linear 8:96 384
vgtest-Ftest: 2097152 1048576 linear 253:15 0
vgtest-Ftestdb: 0 2097152 linear 253:15 1048576
vgtest-Ftestdb: 2097152 1966080 linear 8:96 2097536
vgtest-Ftestdb: 4063232 131072 linear 253:15 3145728

vgtest-Ftest is a two-segment volume consisting of one region on /dev/sdg and one region on the temporary mirror:

          +--------------+
          | vgtest-Ftest | [S]
          +--------------+
            /  /    \  \
     +-----+         +----------------+
     | sdg |         | vgtest-pvmove0 | [S]
     +-----+         +----------------+

vgtest-Ftestdb is a three-segment volume consisting of one region on /dev/sdg and two regions on the temporary mirror:

          +----------------+
          | vgtest-Ftestdb | [*]
          +----------------+
           /  /    |  |    \  \
   +----+      +-----+      +----------------+
   |pvm0|      | sdg |      |      pvm0      | [S]
   +----+      +-----+      +----------------+

The pvmove command running on the system is blocked attempting to suspend vgtest-Ftestdb:

kernel: pvmove S 00004aae9fdf7c94 0 25787 24814 (NOTLB)
kernel: 0000010061f4dcc8 0000000000000006 0000010061f4dd28 0000007300000046
kernel: 00000101ad8b9110 00000000000c4342 0000010230b88230 00000101ad8b93f8
kernel: 0000010068967700 ffffffffa00dfa35
kernel: Call Trace:<ffffffffa00dfa35>{:dm_mod:dm_table_unplug_all+49}
kernel: <ffffffffa00e1040>{:dm_mod:dev_suspend+0} <ffffffff80360fcb>{io_schedule+15}
kernel: <ffffffffa00de43d>{:dm_mod:dm_suspend+494} <ffffffff80183b4b>{map_vm_area+447}
kernel: <ffffffff801355dd>{default_wake_function+0} <ffffffff801355dd>{default_wake_function+0}
kernel: <ffffffffa00e11a1>{:dm_mod:dev_suspend+353} <ffffffffa00e2383>{:dm_mod:ctl_ioctl+602}
kernel: <ffffffff801a668a>{sys_ioctl+1006} <ffffffff80110a92>{system_call+126}

The pvmove has made no progress at all between the three sets of call traces (register values are unchanged between the 3 sets). This data also indicates that the first pvmove did indeed complete correctly as reported (all segments have been transferred correctly to /dev/sdg and the updated metadata reflects this).

I tried to reproduce this issue in a VM using four physical volumes (each backed by a 2GB file), combined them into one volume group, created logical volumes over the first two disks and formatted them with ext3. I ran a load on the filesystems (dd and fsstress) while pvmove'ing sda to sdc and sdb to sdd and vice versa. I could not cause pvmove to get stuck, but it did run very slowly.
--- Volume group --- VG Name vgtest System ID Format lvm2 Metadata Areas 4 Metadata Sequence No 11 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 2 Max PV 0 Cur PV 4 Act PV 4 VG Size 7.75 GB PE Size 64.00 MB Total PE 124 Alloc PE / Size 62 / 3.88 GB Free PE / Size 62 / 3.88 GB VG UUID wWsTLq-NprO-E0Wn-OdYa-77Eq-MOtH-MrxxVf --- Logical volume --- LV Name /dev/vgtest/lv0 VG Name vgtest LV UUID 2SIki7-aIKj-yJMd-4zej-YoRQ-SyBF-3R3wKr LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Name /dev/vgtest/lv1 VG Name vgtest LV UUID iFfX3c-naiq-oPfo-bOxT-G3f8-hj0X-SLgvOG LV Write Access read/write LV Status available # open 1 LV Size 1.94 GB Current LE 31 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3 --- Physical volumes --- PV Name /dev/sda PV UUID f4GWeC-f2zr-1wTm-dzlL-Q0ue-l4OY-SC2CC5 PV Status allocatable Total PE / Free PE 31 / 31 PV Name /dev/sdb PV UUID aPwCGC-X4fl-sMvF-fkiW-PXOS-zGGr-eDyc3o PV Status allocatable Total PE / Free PE 31 / 31 PV Name /dev/sdc PV UUID F1e9Yb-IcHS-Ae3u-OmLB-lR14-ytqg-fEnGV2 PV Status allocatable Total PE / Free PE 31 / 0 PV Name /dev/sdd PV UUID MhmBhu-Vk7q-g75v-mX8G-3lMP-b1RI-YthlbL PV Status allocatable Total PE / Free PE 31 / 0 Continuing the analysis: crash> px *(struct mapped_device *)0x000001023f213e00 $32 = { io_lock = { activity = 0x0, wait_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 0x12e }, wait_list = { next = 0x1023f213e38, prev = 0x1023f213e38 } }, suspend_lock = { count = { counter = 0x0 }, sleepers = 0x0, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213e80, 
prev = 0x1023f213e80 } } }, pushback_lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0x5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0x0 }, map_lock = {<No data fields>}, holders = { counter = 0x3 }, open_count = { counter = 0x1 }, flags = 0x1, queue = 0x1023e379420, disk = 0x100bfc97e00, interface_ptr = 0x1023ee29200, pending = { counter = 0x1 }, wait = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 0x6c }, task_list = { next = 0x101e05d5d40, prev = 0x101e05d5d40 } }, deferred = { head = 0x101e2a69980, tail = 0x1020289a380 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023c09cc00, io_pool = 0x1023d338b80, tio_pool = 0x1023d338c80, bs = 0x1023ee29cc0, event_nr = { counter = 0x0 }, eventq = { lock = { magic = 0x1d244b3c, lock = 0x0, babble = 0xa, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0x0 }, task_list = { next = 0x1023f213fa8, prev = 0x1023f213fa8 } }, frozen_sb = 0x0, suspended_bdev = 0x1023c89a300 } pending is 1 flags is DMF_BLOCK_IO (stalling I/Os while trying to suspend) what does the disk queue say? 
crash> gendisk 0x100bfc97e00 struct gendisk { major = 253, first_minor = 14, minors = 1, disk_name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", part = 0x0, fops = 0xffffffffa00eef00, queue = 0x1023e379420, private_data = 0x1023f213e00, capacity = 4194304, flags = 16, devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", number = 0, driverfs_dev = 0x0, kobj = { k_name = 0x100bfc97eb0 "dm-14", name = "dm-14\000\000\000\000\000\000\000\000\000\000\000\000\000\000", kref = { refcount = { counter = 3 } }, entry = { next = 0x1023c09ecc8, prev = 0x1023f213cc8 }, parent = 0xffffffff804646e0, kset = 0xffffffff804646c0, ktype = 0x0, dentry = 0x1023e05da20 }, random = 0x1023de20f00, policy = 0, sync_io = { counter = 0 }, stamp = 4296507683, stamp_idle = 4296507683, in_flight = 1, dkstats = { read_sectors = 435498, write_sectors = 5866608, reads = 12972, writes = 733326, read_merges = 0, write_merges = 0, read_ticks = 55488, write_ticks = 1125481, io_ticks = 1096527, time_in_queue = 1943931 } }

in_flight is 1, so that agrees. The device is 253:14 (dm-14). Let's look at the dm map and find out what's below this device.
crash> px *((struct mapped_device *)0x000001023f213e00)->map $34 = { holders = { counter = 0x2 }, depth = 0x1, counts = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, index = {0xffffff000003c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, num_targets = 0x3, num_allocated = 0x10, highs = 0xffffff000003c000, targets = 0xffffff000003c080, mode = 0x3, devices = { next = 0x101f9374ac0, prev = 0x10235bb3e80 }, limits = { max_sectors = 0xffff, max_phys_segments = 0x80, max_hw_segments = 0xff, hardsect_size = 0x200, max_segment_size = 0x10000, seg_boundary_mask = 0xffffffff }, event_fn = 0xffffffffa00dd404 <event_callback>, event_context = 0x1023f213e00, md = 0x1023f213e00 } crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->type->name $38 = 0xffffffffa00e53c2 "linear" crash> px ((struct mapped_device *)0x000001023f213e00)->map->targets[0]->private $39 = (void *) 0x101f93edfa0 crash> px *((struct linear_c *)0x101f93edfa0)->dev $41 = { list = { next = 0x1023c09cce8, prev = 0x101f9374ac0 }, count = { counter = 0x2 }, mode = 0x3, bdev = 0x101dfa6ca00, name = "253:15\000\000\000\000\000\000\000\000\000" } crash> block_device 0x101dfa6ca00 struct block_device { bd_dev = 265289743, bd_inode = 0x101dfa6cb10, bd_openers = 3, bd_sem = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x101dfa6ca50, prev = 0x101dfa6ca50 } } }, bd_mount_sem = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffff803758bd "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x101dfa6ca98, prev = 0x101dfa6ca98 } } }, bd_inodes = { next = 0x1022bb68970, prev = 0x1022bb68970 }, bd_holder = 0xffffffffa00e524e, bd_holders = 4, bd_contains = 0x101dfa6ca00, bd_block_size = 4096, bd_part = 
0x0, bd_part_count = 0, bd_invalidated = 0, bd_disk = 0x1023c09ec00, bd_list = { next = 0x10006be7270, prev = 0x1023c1d5570 }, bd_inode_backing_dev_info = 0x0, bd_private = 0 } crash> gendisk 0x1023c09ec00 struct gendisk { major = 253, first_minor = 15, minors = 1, disk_name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", part = 0x0, fops = 0xffffffffa00eef00, queue = 0x1023e379810, private_data = 0x1023c09ce00, capacity = 3276800, flags = 16, devfs_name = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", number = 0, driverfs_dev = 0x0, kobj = { k_name = 0x1023c09ecb0 "dm-15", name = "dm-15\000\000\000\000\000\000\000\000\000\000\000\000\000\000", kref = { refcount = { counter = 3 } }, entry = { next = 0xffffffff804646d0, prev = 0x100bfc97ec8 }, parent = 0xffffffff804646e0, kset = 0xffffffff804646c0, ktype = 0x0, dentry = 0x1020d87a908 }, random = 0x101f93edf80, policy = 0, sync_io = { counter = 0 }, stamp = 4296507683, stamp_idle = 4296507683, in_flight = 0, dkstats = { read_sectors = 40, write_sectors = 504856, reads = 5, writes = 63107, read_merges = 0, write_merges = 0, read_ticks = 79, write_ticks = 55902, io_ticks = 28089, time_in_queue = 55982 } } in_flight is 0. Device is 253:15 (dm-15) so that's another dm device. 
This dm's mapped_device is in private_data:

crash> mapped_device 0x1023c09ce00 struct mapped_device { io_lock = { activity = 0, wait_lock = { magic = 488917820, lock = 0, babble = 5, module = 0xffffffff80384ccd "lib/rwsem-spinlock.c", owner = 0xffffffff80384ccd "lib/rwsem-spinlock.c", oline = 302 }, wait_list = { next = 0x1023c09ce38, prev = 0x1023c09ce38 } }, suspend_lock = { count = { counter = 1 }, sleepers = 0, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0x0, oline = 0 }, task_list = { next = 0x1023c09ce80, prev = 0x1023c09ce80 } } }, pushback_lock = { magic = 488917820, lock = 0, babble = 5, module = 0xffffffffa00e4f96 "drivers/md/dm.c", owner = 0x0, oline = 0 }, map_lock = {<No data fields>}, holders = { counter = 4 }, open_count = { counter = 3 }, flags = 3, queue = 0x1023e379810, disk = 0x1023c09ec00, interface_ptr = 0x1022e2c7f40, pending = { counter = 0 }, wait = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037aba9 "kernel/fork.c", oline = 131 }, task_list = { next = 0x1023c09cf20, prev = 0x1023c09cf20 } }, deferred = { head = 0x1023f264980, tail = 0x1020aa44800 }, pushback = { head = 0x0, tail = 0x0 }, map = 0x1023d05d000, io_pool = 0x1022eff23c0, tio_pool = 0x1022eff22c0, bs = 0x101f36e4300, event_nr = { counter = 1 }, eventq = { lock = { magic = 488917820, lock = 0, babble = 10, module = 0xffffffffa00e5161 "include/linux/wait.h", owner = 0xffffffff8037a9ff "kernel/sched.c", oline = 2894 }, task_list = { next = 0x1023c09cfa8, prev = 0x1023c09cfa8 } }, frozen_sb = 0x0, suspended_bdev = 0x101dfa6ca00 }

pending is 0. flags is 3, so that's DMF_BLOCK_IO and DMF_SUSPENDED. Aha! The underlying device is suspended. The pvmove is stuck trying to suspend dm-14 because an I/O is blocked by the underlying device dm-15, which is suspended - dm-15 will need to be resumed before dm-14 can continue.
I've managed to reproduce this problem, but I had to mimic the customer's lvm config exactly to get pvmove to hang. Here's what I did:

- Installed a VM with RHEL4.6
- Created 4 x 2GB files on the host and attached them to the VM as SCSI disks

$ pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

Must use a 64MB extent size:

$ vgcreate -s 64M vgtest /dev/sda /dev/sdb /dev/sdc /dev/sdd

This is where it got confusing - the volume Ftest is made up of 2 segments and spans two disks, with 16 extents on the first disk and 8 on the second. The volume Ftestdb is made up of 3 segments and spans the same two disks, with 16 extents on the second disk, then 15 extents on the first, and then 1 extent on the second disk again. The only way I know to create that is with this:

$ lvcreate -n Ftest vgtest -l 16 /dev/sda
$ lvcreate -n Ftestdb vgtest -l 16 /dev/sdb
$ lvextend -l 31 /dev/vgtest/Ftestdb /dev/sda
$ lvextend -l 24 /dev/vgtest/Ftest /dev/sdb
$ lvextend -l 32 /dev/vgtest/Ftestdb /dev/sdb

This is what it all looks like after that.
vgtest-Ftest: 0 2097152 linear 8:0 384 vgtest-Ftest: 2097152 1048576 linear 8:16 384 vgtest-Ftestdb: 0 2097152 linear 8:16 1048960 vgtest-Ftestdb: 2097152 1966080 linear 8:0 2097536 vgtest-Ftestdb: 4063232 131072 linear 8:16 3146112 --- Logical volume --- LV Name /dev/vgtest/Ftest VG Name vgtest LV UUID XaxY0e-TNVq-gwzS-2scb-yA6L-lZ8u-atAnHm LV Write Access read/write LV Status available # open 0 LV Size 1.50 GB Current LE 24 Segments 2 Allocation inherit Read ahead sectors 0 Block device 253:3 --- Logical volume --- LV Name /dev/vgtest/Ftestdb VG Name vgtest LV UUID 2a5AaI-enRN-DdAk-eqCK-JMJ5-Jm4p-0lAzdg LV Write Access read/write LV Status available # open 0 LV Size 2.00 GB Current LE 32 Segments 3 Allocation inherit Read ahead sectors 0 Block device 253:4 --- Volume group --- VG Name vgtest System ID Format lvm2 Metadata Areas 4 Metadata Sequence No 72 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 0 Max PV 0 Cur PV 4 Act PV 4 VG Size 7.75 GB PE Size 64.00 MB Total PE 124 Alloc PE / Size 56 / 3.50 GB Free PE / Size 68 / 4.25 GB VG UUID 42GW3U-1l0X-Sb5I-jD5b-S3Gy-w8gc-LtAlRB --- Physical volume --- PV Name /dev/sda VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes (but full) PE Size (KByte) 65536 Total PE 31 Free PE 0 Allocated PE 31 PV UUID gCJW0L-Lqrd-snDX-aswE-rLJH-wvlc-I73604 --- Physical volume --- PV Name /dev/sdb VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 6 Allocated PE 25 PV UUID cvzwHb-vYqS-uTz0-cclI-P1Nt-N35H-R11i8x --- Physical volume --- PV Name /dev/sdc VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 31 Allocated PE 0 PV UUID yDuC5g-Pfla-TUvF-kGk3-zPpg-f4o3-AEFc3p --- Physical volume --- PV Name /dev/sdd VG Name vgtest PV Size 1.95 GB / not usable 16.00 MB Allocatable yes PE Size (KByte) 65536 Total PE 31 Free PE 31 Allocated PE 0 PV UUID DwVjpk-XvvM-Q0q3-0VPJ-lOGu-Ycze-CcDVAd Make 
filesystems and mount both volumes:

$ mkfs.ext3 /dev/vgtest/Ftest
$ mkfs.ext3 /dev/vgtest/Ftestdb
$ mount /dev/vgtest/Ftest /mnt/Ftest
$ mount /dev/vgtest/Ftestdb /mnt/Ftestdb

Add an I/O load to the Ftestdb volume:

$ fsstress -d /mnt/Ftestdb/fsstress -n 1000000 -l 0 -p 1

And fire off pvmove. This one should succeed:

$ pvmove -v /dev/sda /dev/sdc

This one should hang (if it doesn't hang, keep following the steps):

$ pvmove -v /dev/sdb /dev/sdd

This will succeed:

$ pvmove -v /dev/sdc /dev/sda

And this one should hang:

$ pvmove -v /dev/sdd /dev/sdb

The output from pvmove before it hangs shows that it is trying to suspend a device (vgtest-Ftestdb) that sits on top of another already suspended device (vgtest-pvmove0):

Suspending vgtest-Ftest (253:2) without device flush
Suspending vgtest-pvmove0 (253:4) without device flush
Found volume group "vgtest"
Found volume group "vgtest"
Suspending vgtest-Ftestdb (253:3) without device flush

Can you please also attach "lvmdump -m" from the system where you are able to reproduce it?

Created attachment 441128 [details]
lvmdump of reproducer system
Created attachment 479206 [details]
lvmdump of new customer's system experiencing problem
Some information from one customer that might prove helpful. Scenario: a new, bigger volume is added and data is moved onto it via pvmove. At some point pvmove stops in D state. dmsetup info shows suspended devices, and running dmsetup resume <map> on the suspended devices unblocks pvmove so that it continues. This only happens with pvmove.

Manually resuming the devices will clear the problem, but it may also undermine the integrity of pvmove since the devices are being resumed before it expects them to be. This problem is due to pvmove suspending a device (A) that sits on top of another device (B) that is already suspended. Any I/O sent to device B while it is suspended gets queued and won't complete until device B is resumed. For device A to be suspended it must wait for all I/O it has already issued to complete, but device A is waiting for an I/O that's in device B's queue. The fix for this problem will need to be done in pvmove. It will need to avoid the sequence "suspend B, ... suspend A" by either always suspending A first whenever suspending B, or by converting it to "suspend B, ... resume B, suspend A, suspend B".

OK - I may have found a simple way to reproduce this now, and a logic error in the code. Summary: if you are using pvmove to move more than one LV sharing the same underlying PV at once, you are vulnerable to this problem. Simple case: PV1 is used by LV1 and LV2, and you move PV1 to PV2. At the end of moving the first segment, the code suspends the devices in an incorrect order, which allows I/O to become trapped between them.

I was just facing this very same problem. It was in fact the LV used for moving (pvmove) which caused the system to hang. I manually resumed pvmove0 twice; doing so I was able to complete the operation. This was on a 2.6.32 kernel (x86_64) with LVM2 tools 2.02.39.

The same bug is reported in RHEL5, https://bugzilla.redhat.com/show_bug.cgi?id=706036, and someone is working on fixing it.
In comment 13, Alasdair provided a workaround:

In the meantime, as a workaround, try using the -n option of pvmove to move only one LV at once.

List the LVs in a VG:

lvs --noheadings -o name $vg

Move one LV:

pvmove -i0 -n $lvname

So, the patched upstream tools passed the upstream test suite for the first time last night. However, a backport to RHEL4 is looking unlikely at this stage: RHEL5 and RHEL6 are the priority, and there is an adequate workaround on RHEL4.

Upstream release 2.02.86 is included in Fedora rawhide. Please test.

RHEL4 is no longer actively developed; let's track this bug for more recent releases. AFAIK the bug is already fixed there.