Description of problem:
Starting an rbd pool fails with an unknown error when one monitor is disconnected. Sometimes the virsh command hangs.

Version-Release number of selected component (if applicable):
libvirt-1.3.1-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare an rbd pool XML with 3 monitors
<pool type='rbd'>
  <name>rbd</name>
  <uuid>ebda974a-4fb7-4af2-b1a0-7a94e5cdda98</uuid>
  <capacity unit='bytes'>0</capacity>
  <allocation unit='bytes'>0</allocation>
  <available unit='bytes'>0</available>
  <source>
    <host name='10.66.110.191'/>
    <host name='10.66.111.149'/>
    <host name='10.66.110.36'/>
    <name>yy</name>
  </source>
</pool>

2. Disconnect 1 monitor
# iptables -A OUTPUT -d 10.66.110.191 -j DROP

3. Start the pool
# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown

Actual results:

Expected results:
I guess the pool should start successfully even if 1 of the 3 monitors cannot be reached. If that is incorrect, the error message should be improved.

Additional info:
libvirt info
# pwd
/root/libvirt
# git describe
v1.3.1-rc1
Can you start libvirt with DEBUG logging enabled and show the output? It should tell you which monitors it will use. The connection is handled by librados, and the "mon_host" option set via rados_conf_set() should list the monitors. librados probably times out somewhere, but that should not happen.
I just tried this by adding two non-existing monitor hosts:

<source>
  <host name='[2001:db8::100]'/>
  <host name='[2001:db8::101]'/>
  <host name='management.mbg.XXX.nl'/>
  <name>libvirt</name>
  <auth type='ceph' username='admin'>
    <secret uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
  </auth>
</source>

2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:171 : Found 3 RADOS cluster monitors in the pool configuration
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:193 : RADOS mon_host has been set to: [2001:db8::100],[2001:db8::101],management.mbg.XXXX.nl,
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:206 : Setting RADOS option client_mount_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:209 : Setting RADOS option rados_mon_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:212 : Setting RADOS option rados_osd_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:220 : Setting RADOS option rbd_default_format to 2

In my case this worked just fine and it failed over to the other monitor.
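
The debug log above shows the configured <host> entries being joined into a single comma-separated mon_host value. A minimal Python sketch of that step, for illustration only: build_mon_host is a hypothetical helper and not libvirt code, the <auth> element is left out, and the trailing comma visible in the log is not reproduced.

```python
import xml.etree.ElementTree as ET

# Pool <source> fragment copied from the comment above (auth element omitted).
SOURCE_XML = """
<source>
  <host name='[2001:db8::100]'/>
  <host name='[2001:db8::101]'/>
  <host name='management.mbg.XXX.nl'/>
  <name>libvirt</name>
</source>
"""


def build_mon_host(source_xml):
    """Hypothetical helper: join every <host name=...> into one string."""
    root = ET.fromstring(source_xml)
    hosts = [host.get("name") for host in root.findall("host")]
    return ",".join(hosts)


mon_host = build_mon_host(SOURCE_XML)
print(mon_host)
```

Comparing a string built this way with the "RADOS mon_host has been set to" line in a DEBUG log is a quick way to confirm that every monitor in the pool XML was picked up.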
Let me clarify my reproduction steps. I have a Ceph server with 3 monitors. First, I created an rbd pool using only 1 monitor host as the source. Then I created 20 volumes in the pool (I removed 1 volume through the Ceph server, but I think that does not matter), e.g.

1. # virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <name>yy</name>
  </source>
</pool>

2. # for i in {1..20}; do virsh vol-create-as rbd vol$i 100M; done

Next, I destroyed the pool, then updated the pool XML to use 3 monitor hosts as the source, e.g.

3. # virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-edit rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

4. Scenario 1: I manually blocked the connection to the monitor used in step 1
# iptables -A OUTPUT -d 10.66.7.6 -j DROP
# ping 10.66.7.6
PING 10.66.7.6 (10.66.7.6) 56(84) bytes of data.

5. Tried to start the pool
# virsh pool-start rbd
Pool rbd started

# virsh vol-list rbd
error: Failed to list volumes
error: key in virGetStorageVol must not be NULL

The pool did start. The problem is that it took 10 minutes. libvirt tried to connect for each volume, and each attempt took 30 seconds because the connection failed. If the pool contained 100 volumes, it would take 50 minutes to start. This behaviour would make users think that libvirt has hung. Another problem is that the volumes cannot be listed in the pool. See libvirtd-hang.log for details.

Now for the second scenario.
Restore the connection to the monitor blocked in step 4 and disconnect another monitor. Starting the pool then fails with an unknown error. See libvirtd-unknown-error.log for details.
# iptables -D OUTPUT -d 10.66.7.6 -j DROP
# iptables -A OUTPUT -d 10.66.7.26 -j DROP

6. Destroy the rbd pool and then start it
# virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown
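
The start-up times reported for scenario 1 follow from simple arithmetic: one 30-second wait per volume. A small Python sketch of that estimate, assuming every volume waits out one full timeout and nothing else adds delay; estimated_start_seconds is a hypothetical name used only here.

```python
def estimated_start_seconds(volume_count, timeout_seconds=30):
    """Worst case: every volume waits out one full connection timeout."""
    return volume_count * timeout_seconds


for count in (20, 100):
    minutes = estimated_start_seconds(count) / 60
    print(f"{count} volumes: about {minutes:.0f} minutes")
```

With 20 volumes this gives 10 minutes, and with 100 volumes 50 minutes, matching the figures quoted for scenario 1.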
Created attachment 1115773 [details] Took long time to start pool
Created attachment 1115774 [details] Unknown error out
I don't think libvirt is really to blame here. It just passes calls down to librados and librbd and lets them handle this. The problem is that libvirt creates a new connection to the Ceph cluster for many calls, because the backend is not persistent. It can't keep an active connection to the Ceph cluster open, so it has to create a new one every time a call is made. I don't see an easy way to 'fix' this in libvirt.
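
To make the difference concrete, here is a toy Python model that only counts connections. It is not libvirt or librados code: FakeCluster and both refresh functions are hypothetical, and it assumes one connection per volume operation, which is a simplification of "a new connection for many calls".

```python
class FakeCluster:
    """Stand-in for a Ceph cluster; it only counts connection attempts."""

    def __init__(self):
        self.connections = 0

    def connect(self):
        self.connections += 1
        return self


def refresh_per_call(cluster, volumes):
    """Non-persistent backend: open a fresh connection for each volume."""
    for _ in volumes:
        cluster.connect()
    return cluster.connections


def refresh_persistent(cluster, volumes):
    """Hypothetical alternative: connect once and reuse the handle."""
    conn = cluster.connect()
    for _ in volumes:
        pass  # each volume would be queried over the same `conn`
    return conn.connections


volumes = [f"vol{i}" for i in range(1, 21)]
per_call = refresh_per_call(FakeCluster(), volumes)
persistent = refresh_persistent(FakeCluster(), volumes)
print(per_call, persistent)
```

In this model the per-call variant connects once per volume, so any delay in connecting is paid once per volume rather than once per pool start.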
Sounds like WONTFIX according to Comment #7 but please reopen if I've misunderstood