Bug 1296856 - Unknown error out while starting rbd pool with a disconnecting monitor
Status: CLOSED WONTFIX
Product: Virtualization Tools
Classification: Community
Component: libvirt
Hardware: x86_64 Linux
Assigned To: Libvirt Maintainers
Reported: 2016-01-08 04:31 EST by yangyang
Modified: 2016-04-10 18:59 EDT (History)

Doc Type: Bug Fix
Last Closed: 2016-04-10 18:59:13 EDT
Type: Bug


Attachments
Took long time to start pool (2.48 MB, text/plain)
2016-01-18 03:32 EST, yangyang
Unknown error out (811.50 KB, text/plain)
2016-01-18 03:34 EST, yangyang

Description yangyang 2016-01-08 04:31:04 EST
Description of problem:
Starting an rbd pool fails with an unknown error when one of its monitors is unreachable.
Sometimes the virsh command hangs instead.


Version-Release number of selected component (if applicable):
libvirt-1.3.1-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare an rbd pool XML with 3 monitors
<pool type='rbd'>
  <name>rbd</name>
  <uuid>ebda974a-4fb7-4af2-b1a0-7a94e5cdda98</uuid>
  <capacity unit='bytes'>0</capacity>
  <allocation unit='bytes'>0</allocation>
  <available unit='bytes'>0</available>
  <source>
    <host name='10.66.110.191'/>
    <host name='10.66.111.149'/>
    <host name='10.66.110.36'/>
    <name>yy</name>
  </source>
</pool>

2. disconnect 1 monitor
# iptables -A OUTPUT -d 10.66.110.191 -j DROP

3. start pool
# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown

Actual results:


Expected results:
I would expect the pool to start successfully even if 1 of the 3 monitors
cannot be reached. If that expectation is wrong, the error message should be
improved.

Additional info:
Comment 1 yangyang 2016-01-12 04:05:06 EST
libvirt info

# pwd
/root/libvirt
# git describe 
v1.3.1-rc1
Comment 2 Wido den Hollander 2016-01-15 07:42:37 EST
Can you start libvirt with DEBUG enabled and show the output?

It should tell you which monitors it will use. The connection is handled by librados, and the "mon_host" option set via rados_conf_set() determines which monitors it tries.

Librados probably times out somewhere, but that should not happen.
Comment 3 Wido den Hollander 2016-01-15 08:02:50 EST
I just tried this by adding two non-existing monitor hosts:


  <source>
    <host name='[2001:db8::100]'/>
    <host name='[2001:db8::101]'/>
    <host name='management.mbg.XXX.nl'/>
    <name>libvirt</name>
    <auth type='ceph' username='admin'>
      <secret uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
    </auth>
  </source>


2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:171 : Found 3 RADOS cluster monitors in the pool configuration
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:193 : RADOS mon_host has been set to: [2001:db8::100],[2001:db8::101],management.mbg.XXXX.nl,
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:206 : Setting RADOS option client_mount_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:209 : Setting RADOS option rados_mon_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:212 : Setting RADOS option rados_osd_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:220 : Setting RADOS option rbd_default_format to 2


In my case this worked just fine and it failed over to the other monitor.
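For illustration, the mon_host value in the debug log above is the list of <host> names from the pool XML joined with commas. The Python sketch below only mimics that step; it is not libvirt's implementation (libvirt does this in C), the helper name mon_host_from_pool_xml is hypothetical, and unlike the logged value it does not append a trailing comma.

```python
import xml.etree.ElementTree as ET

# Pool XML trimmed to the <source> part shown in this comment.
POOL_XML = """
<pool type='rbd'>
  <name>libvirt</name>
  <source>
    <host name='[2001:db8::100]'/>
    <host name='[2001:db8::101]'/>
    <host name='management.mbg.XXX.nl'/>
    <name>libvirt</name>
  </source>
</pool>
""".strip()


def mon_host_from_pool_xml(pool_xml: str) -> str:
    """Join the monitor host names of an rbd pool XML with commas.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(pool_xml)
    hosts = [host.get("name") for host in root.findall("./source/host")]
    return ",".join(hosts)


print(mon_host_from_pool_xml(POOL_XML))
```

librados receives the whole list, so choosing a reachable monitor is left to librados rather than to libvirt.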
Comment 4 yangyang 2016-01-18 03:29:32 EST
Let me clarify my reproduction steps. I have a Ceph server with 3 monitors. First, I created an rbd pool using only 1 monitor host as the source. Then I created 20 volumes in the pool (I removed 1 volume through the Ceph server, but I think that does not matter), e.g.

1. # virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <name>yy</name>
  </source>
</pool>

2. # for i in {1..20}; do virsh vol-create-as rbd vol$i 100M; done

Next, I destroyed the pool and then updated the pool XML to use 3 monitor hosts as the source, e.g.

3.
# virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-edit rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

4. Scenario 1: I manually blocked the connection to the monitor used in step 1

# iptables -A OUTPUT -d 10.66.7.6 -j DROP
# ping 10.66.7.6
PING 10.66.7.6 (10.66.7.6) 56(84) bytes of data.

5. tried to start the pool

# virsh pool-start rbd
Pool rbd started

# virsh vol-list rbd
error: Failed to list volumes
error: key in virGetStorageVol must not be NULL

The pool did start, but it took about 10 minutes. Libvirt tried to connect for each volume, and each attempt took 30 seconds because the connection failed. With 100 volumes in the pool, startup would take 50 minutes, which would make users think libvirt has hung. Another problem is that the volumes cannot be listed in the pool. See libvirtd-hang.log for details.
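The arithmetic behind those numbers can be sketched as follows. This is a rough model, assuming every volume costs exactly one full 30-second timeout and nothing else; the function name worst_case_start_seconds is hypothetical.

```python
def worst_case_start_seconds(volumes: int, timeout_s: int = 30) -> int:
    """Estimate pool start time when each volume waits out one timeout.

    Assumption: one failed connection attempt of timeout_s seconds per volume.
    """
    return volumes * timeout_s


for n in (20, 100):
    minutes = worst_case_start_seconds(n) / 60
    print(f"{n} volumes: about {minutes:.0f} minutes")
```

Under this model the delay grows linearly with the number of volumes.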

Now the second scenario. I restored the connection to the monitor blocked in step 4 and blocked a different monitor instead. Starting the pool then failed with an unknown error. See libvirtd-unknown-error.log for details.

# iptables -D OUTPUT -d 10.66.7.6 -j DROP

# iptables -A OUTPUT -d 10.66.7.26 -j DROP

6. Destroy the rbd pool and then start it again
# virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown
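When reproducing scenarios like these, it can help to confirm from the client which monitors are actually reachable before starting the pool. The sketch below is a generic TCP probe, not part of libvirt or Ceph tooling. It assumes the monitors listen on port 6789 (the port is not stated in this report), and the names monitor_reachable and probe_all are hypothetical.

```python
import socket


def monitor_reachable(host: str, port: int = 6789, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Port 6789 is an assumed monitor port; adjust it for your cluster.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts (e.g. packets dropped by iptables) and refusals.
        return False


def probe_all(hosts, port: int = 6789, timeout_s: float = 3.0) -> dict:
    """Map each monitor host to its reachability."""
    return {host: monitor_reachable(host, port, timeout_s) for host in hosts}
```

Calling probe_all(["10.66.7.6", "10.66.7.26", "10.66.7.140"]) on the libvirt host would report the monitor blocked by the iptables DROP rule as unreachable once the timeout expires.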
Comment 5 yangyang 2016-01-18 03:32 EST
Created attachment 1115773 [details]
Took long time to start pool
Comment 6 yangyang 2016-01-18 03:34 EST
Created attachment 1115774 [details]
Unknown error out
Comment 7 Wido den Hollander 2016-01-27 10:38:26 EST
I don't think libvirt is really to blame here. It just passes calls down to librados and librbd and lets them handle this.

The problem is that libvirt creates a new connection to the Ceph cluster for many calls, because the storage backend is not persistent.

It can't keep an active connection to the Ceph cluster open, so it has to create a new one every time a call is made.

I don't see an easy way to 'fix' this in libvirt.
Comment 8 Cole Robinson 2016-04-10 18:59:13 EDT
Sounds like WONTFIX according to Comment #7 but please reopen if I've misunderstood
