Bug 623030
Summary: | Failed to resume paused domain with error "Timed out during operation: cannot acquire state change lock". | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | wangyimiao <yimwang> |
Component: | libvirt | Assignee: | Eric Blake <eblake> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 6.0 | CC: | berrange, dallan, dyuan, eblake, jclift, llim, moshiro, xen-maint, yoyzhang |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2010-11-18 18:31:40 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
wangyimiao
2010-08-11 07:14:14 UTC
> Failed to resume paused domain with error "Timed out during operation: cannot
> acquire state change lock".
This indicates that a previous monitor command is still running.
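For anyone hitting this, one way to see which monitor command libvirtd is still waiting on is to turn up libvirtd's debug logging. A rough sketch, assuming the stock config locations on RHEL 6 (the log_level/log_outputs settings come from libvirtd.conf; the exact log message text varies by libvirt version):

# In /etc/libvirt/libvirtd.conf, turn on debug logging to a file:
#   log_level = 1
#   log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
# Then restart the daemon, retry the failing operation, and look for the
# monitor command / job that never completed:
service libvirtd restart
grep -i monitor /var/log/libvirt/libvirtd.log | tail -50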
I'm having problems reliably reproducing it with:

kernel-2.6.32-66.el6.x86_64
qemu-kvm-0.12.1.2-2.112.el6.x86_64
libvirt-0.8.1-27.el6.x86_64

I've tried both raw images and qcow2 images. I did get a hang once, but right now all my save/restore attempts are either working (when I use /var/lib/libvirt/images/ as the place to store the state) or failing without locking up (when I use /root/ as the place to store state; since qemu is running as qemu:qemu, I get:

error: Failed to save domain rawhide to /root/dom.sav
error: operation failed: Migration unexpectedly failed

and the VM still continues to run).

But what I did notice when I did get the hang, compared to what is happening to me right now, is that 'virsh list --all' shows a saved domain as 'shut off', but the hung case showed as 'paused'. Maybe there's some sort of timing race where qemu and libvirtd get out of sync as to what state the VM has when qemu can't save because of permissions issues, and if the race is lost, then all subsequent attempts to get libvirtd to control the domain hang, because libvirtd is still waiting for a monitor command to qemu to complete even though qemu has nothing further to do based on the failed migration.

I have to wonder if this is related to https://bugzilla.redhat.com/show_bug.cgi?id=626544

Ouch, just got bitten by this when migrating a guest instance backwards and forwards between two RHEL 6.0 hosts. After only the 4th migration too. Should we mark this as a RHEL 6.0 blocker?

host1 (destination host, where it's hung)
*****************************************
# rpm -qa | grep libvirt
libvirt-client-0.8.1-27.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
libvirt-python-0.8.1-27.el6.x86_64
libvirt-devel-0.8.1-27.el6.x86_64
libvirt-debuginfo-0.8.1-27.el6.x86_64
# rpm -qa | grep kvm
qemu-kvm-0.12.1.2-2.113.el6.x86_64
# ps -ef|grep qemu
qemu 21899 1 0 21:16 ? 00:00:02 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Nehalem,+x2apic -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name vm1 -uuid 4ce376a6-db4a-9382-4f06-03ea4d2f6d0b -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/home/nfs_images/bz615941-vm1.img,if=none,id=drive-virtio-disk0,boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=22,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:bd:48:7e,bus=pci.0,addr=0x5 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0,tls,x509verify=/etc/pki/libvirt-vnc -k en-us -vga cirrus -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
# uname -a
Linux host1 2.6.32-70.el6.x86_64 #1 SMP Wed Aug 25 10:17:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

host2 (source host)
*******************
# rpm -qa | grep libvirt
libvirt-python-0.8.1-27.el6.x86_64
libvirt-client-0.8.1-27.el6.x86_64
libvirt-devel-0.8.1-27.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
libvirt-debuginfo-0.8.1-27.el6.x86_64
# rpm -qa | grep kvm
qemu-kvm-0.12.1.2-2.113.el6.x86_64
qemu-kvm-tools-0.12.1.2-2.113.el6.x86_64
# ps -ef|grep qemu
qemu 17136 1 5 21:16 ? 00:00:46 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Nehalem,+x2apic -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name vm1 -uuid 4ce376a6-db4a-9382-4f06-03ea4d2f6d0b -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/home/nfs_images/bz615941-vm1.img,if=none,id=drive-virtio-disk0,boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=22,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:bd:48:7e,bus=pci.0,addr=0x5 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
# uname -a
Linux host2 2.6.32-70.el6.x86_64 #1 SMP Wed Aug 25 10:17:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

As a data point, even after killing the qemu process on both servers, plus the libvirtd process on both servers, the VM would not start up on host1 (same error). No lock file or anything in /var/run/libvirt/*. There was a /var/run/libvirt/network/default.pid file though, with a valid pid, on both hosts.

# cd /var/run/libvirt/
# find .
.
./lxc
./network
./network/default.pid
./qemu
# cat network/default.pid
1720
# ps -ef|grep 1720
nobody 1720 1 0 Aug29 ? 00:00:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --listen-address 192.168.122.1 --except-interface lo --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-lease-max=253 --dhcp-no-override

And same kind of thing on the other host. What allowed things to start working again on these hosts was (on each host):

1. Kill the process listed in the pid file
2. Remove the now-stale pid file
3. Flush any iptables rules (just in case) -> iptables -F
4. Start libvirtd again

The guest was able to be started on host1 after that, and then migrated backwards and forwards between hosts. So far, so good. (Using SSH non-tunnelled migrations now, rather than the TLS ones being used when the process wedged.)

This error, "error: Timed out during operation: cannot acquire state change lock", is happening frequently here. Eric, is there any kind of useful diag stuff I can do to help figure out the cause?

If it helps, I've found a way to replicate this problem fairly easily, and to make that cause go away. (No idea of the underlying problem though.) For the two hosts here, if:

a) the libvirt daemon is not told to listen (ie --listen is not enabled in /etc/sysconfig/libvirtd), but
b) the QEMU config _has_ been told to use TLS (plus had the TLS certificates configured),

then migration between the hosts will wedge with the "cannot acquire state change lock" error within 3 migrations from one to the other (sometimes on the first go). In this situation the source migration host is still running the guest, but the destination one has it in "paused" state, and that's as far as it gets. Any operation attempted on the destination host's copy gives that error message.

Additionally, even though TLS is configured, the migrations still wedge if done using SSH. Thinking it might be related to the TLS vnc options on each host or something, which are there regardless of migration type, if TLS is configured in qemu.conf.
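For reference, a minimal sketch of the configuration combination described above; the VNC TLS option names are the standard qemu.conf ones, inferred from the "-vnc 0.0.0.0:0,tls,x509verify=/etc/pki/libvirt-vnc" option visible in the qemu command lines, so treat this as approximate rather than a copy of the actual files:

/etc/sysconfig/libvirtd (daemon NOT listening for remote connections):
#LIBVIRTD_ARGS="--listen"

/etc/libvirt/qemu.conf (but VNC TLS still enabled for every guest):
vnc_tls = 1
vnc_tls_x509_verify = 1
vnc_tls_x509_cert_dir = "/etc/pki/libvirt-vnc"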
The way to avoid the problem... disable TLS in the qemu.conf. Migrations seem to happen pretty flawlessly then.

***************************************************************************
Simple script I use to move a guest between hosts, to see if the process is
working in a stable fashion.
***************************************************************************
#!/bin/bash
# Migrate a guest back and forth between two hosts, printing progress as it goes
HOST1=host1
HOST2=host2
GUEST="vm1"
TRANSPORT="ssh"
#TRANSPORT="tls"
for i in `seq 1 10`; do
    echo "Loop ${i}: Migrating from ${HOST1} to ${HOST2}"
    virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${GUEST} --live qemu+${TRANSPORT}://root@${HOST2}/system
    echo "Loop ${i}: Migrating back from ${HOST2} to ${HOST1}"
    virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${GUEST} --live qemu+${TRANSPORT}://root@${HOST1}/system
done

> Additionally, even though TLS is configured, the migrations still wedge if done
> using SSH. Thinking it might be related to the TLS vnc options on each host or
> something, which are there regardless of migration type, if TLS is configured
> in qemu.conf.
I smell a lack of entropy on /dev/random. I bet QEMU is blocking waiting on /dev/random at startup, and won't respond on the QMP monitor until it has sufficient entropy. This could then block libvirt's startup of QEMU, waiting for the QEMU monitor to wake up.
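If lack of entropy is the culprit, it should show up directly on the destination host around the time of the hang. A quick check (this uses the standard kernel interface, nothing libvirt-specific):

# watch the kernel's available entropy pool while a migration is incoming
watch -n1 cat /proc/sys/kernel/random/entropy_avail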
That may also explain why the system it wedges on still feels a bit weird for a while afterwards. Even though the libvirt and qemu processes have been nuked, it could still be taking a period of time to build up enough useful entropy again.

Uh oh... it occurs to me that if this is the underlying cause, then hosts with large numbers of guests (ie 500+) could be looking at a pretty lengthy time to get all their guests moved off, ie the scenario where hardware has reported imminent failure and needs repair or looking at, etc.

As additional data, this problem also occurs with _guests_ using installation types other than "minimum installation". I've just now installed a guest using the "Software Development Workstation" (or similarly worded) install, and this error occurs immediately upon migration if QEMU is configured for TLS. Both host systems and the guest are running the RHEL6 nightly build from last night (20100901), so this is occurring with the current QEMU code. (Wonder if I should try an older QEMU build and see what happens?)

Migration generally works using an older version of QEMU:

qemu-kvm-0.12.1.2-2.90.el6.x86_64.rpm
qemu-img-0.12.1.2-2.90.el6.x86_64.rpm
qemu-kvm-tools-0.12.1.2-2.90.el6.x86_64.rpm
(built from SRPMs from brew)

It also kernel panics reasonably often during migration though, so not a huge win. :/

Note that the migrations take a lot longer than "normal" to occur. It seems to exhibit the same kind of problem as reported in the bug:

-> guest is started on destination in "paused" state, then waits for ages before <failing>

With this older QEMU though, instead of virsh coming back with "cannot acquire state change lock" after the long period of time, it generally finishes the migration successfully (after a "paused" state of 2+ minutes).

Is there any chance that with the old version of qemu, after some internal timeout occurs it would allow things to continue, but the newer versions have changed to abort instead, thus the error message being reported?
Timings for clarification using the downgraded QEMU:

TLS enabled migration - host1 to host2
**************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m54.508s
user 0m0.006s
sys 0m0.009s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m59.868s
user 0m0.016s
sys 0m0.033s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
error: Timed out during operation: cannot acquire state change lock
real 0m30.405s
user 0m0.014s
sys 0m0.021s

(I manually rebooted the host here with "sudo reboot")

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --persistent
real 2m30.338s
user 0m0.016s
sys 0m0.027s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --persistent
real 2m46.492s
user 0m0.021s
sys 0m0.028s

TLS enabled migration - host2 to host1
**************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live --p2p --tunnelled
real 2m54.731s
user 0m0.005s
sys 0m0.003s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
(kernel panic) :( (no long pause)
--> This looks to be the only occasion where the migration "worked properly"

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m23.507s
user 0m0.015s
sys 0m0.011s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
(kernel panic again)

TLS not enabled version - host1 to host2
****************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m25.240s
user 0m0.004s
sys 0m0.010s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m39.303s
user 0m0.007s
sys 0m0.020s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
(kernel panic)

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m20.949s
user 0m0.018s
sys 0m0.017s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m16.579s
user 0m0.016s
sys 0m0.038s

TLS not enabled version - host2 to host1
****************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m22.101s
user 0m0.010s
sys 0m0.019s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m22.075s
user 0m0.013s
sys 0m0.037s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m20.878s
user 0m0.017s
sys 0m0.011s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m20.671s
user 0m0.011s
sys 0m0.025s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m25.632s
user 0m0.013s
sys 0m0.010s

Weirdly, the only migrations here consistently "ok" are those from host2 to host1 with TLS not enabled in QEMU. Even the host1 -> host2 version is having troubles.

Hi Moritoshi,

After updating both host servers from yesterday's nightly tree, this problem doesn't seem to be occurring any more. Migration (using --live --p2p --tunnelled) over TLS is working properly and consistently, and hasn't given any indication of problems when run in the simple scripting loop. It's done about 60 migrations so far in the last few minutes, no problems at all. Notice the migrations are also consistently faster, by a few seconds, than the fastest of the times above.
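For context, the per-loop timings below look like output from a variant of the earlier loop script with each virsh call wrapped in time; the exact timed variant isn't included in this report, so the following is only a sketch of what it might look like:

#!/bin/bash
# Same back-and-forth loop as before, but timing each migration
HOST1=host1
HOST2=host2
GUEST="vm1"
TRANSPORT="tls"
date
for i in `seq 1 10`; do
    echo "Loop ${i}: Migrating from ${HOST1} to ${HOST2}"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${GUEST} --live --p2p --tunnelled qemu+${TRANSPORT}://root@${HOST2}/system
    echo "Loop ${i}: Migrating back from ${HOST2} to ${HOST1}"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${GUEST} --live --p2p --tunnelled qemu+${TRANSPORT}://root@${HOST1}/system
done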
$ ./migrate_test.sh
Wed Sep 8 01:32:06 EST 2010
Loop 1: Migrating from host1 to host2
real 0m14.065s
user 0m0.010s
sys 0m0.002s
Loop 1: Migrating back from host2 to host1
real 0m14.034s
user 0m0.008s
sys 0m0.003s
Loop 2: Migrating from host1 to host2
real 0m14.112s
user 0m0.006s
sys 0m0.005s
Loop 2: Migrating back from host2 to host1
real 0m14.175s
user 0m0.010s
sys 0m0.000s
Loop 3: Migrating from host1 to host2
real 0m14.073s
user 0m0.009s
sys 0m0.002s
Loop 3: Migrating back from host2 to host1
real 0m14.011s
user 0m0.007s
sys 0m0.004s
Loop 4: Migrating from host1 to host2
real 0m14.091s
user 0m0.008s
sys 0m0.003s
Loop 4: Migrating back from host2 to host1
real 0m14.257s
user 0m0.009s
sys 0m0.001s
Loop 5: Migrating from host1 to host2
real 0m14.117s
user 0m0.008s
sys 0m0.003s
Loop 5: Migrating back from host2 to host1
real 0m14.097s
user 0m0.008s
sys 0m0.002s
Loop 6: Migrating from host1 to host2
real 0m14.018s
user 0m0.008s
sys 0m0.004s
Loop 6: Migrating back from host2 to host1
real 0m14.106s
user 0m0.008s
sys 0m0.002s
Loop 7: Migrating from host1 to host2
real 0m14.174s
user 0m0.008s
sys 0m0.003s
Loop 7: Migrating back from host2 to host1
real 0m14.034s
user 0m0.008s
sys 0m0.002s
Loop 8: Migrating from host1 to host2
real 0m14.065s
user 0m0.010s
sys 0m0.001s
Loop 8: Migrating back from host2 to host1
real 0m14.028s
user 0m0.007s
sys 0m0.005s
<snipped>

Based on comments 17 and 18, I believe this BZ is resolved.