1296856 – Unknown error out while starting rbd pool with a disconnecting monitor

Summary: Unknown error out while starting rbd pool with a disconnecting monitor

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Virtualization Tools
Classification:	Community
Component:	libvirt
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Libvirt Maintainers
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-01-08 09:31 UTC by Yang Yang
Modified:	2016-04-10 22:59 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-04-10 22:59:13 UTC
Embargoed:
Dependent Products:

Attachments
Took long time to start pool (2.48 MB, text/plain) 2016-01-18 08:32 UTC, Yang Yang
Unknown error out (811.50 KB, text/plain) 2016-01-18 08:34 UTC, Yang Yang

Description Yang Yang 2016-01-08 09:31:04 UTC

Description of problem:
Starting an rbd pool fails with an unknown error when one of its monitors is disconnected.
Sometimes the virsh command hangs instead.


Version-Release number of selected component (if applicable):
libvirt-1.3.1-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare an rbd pool XML with 3 monitors
<pool type='rbd'>
  <name>rbd</name>
  <uuid>ebda974a-4fb7-4af2-b1a0-7a94e5cdda98</uuid>
  <capacity unit='bytes'>0</capacity>
  <allocation unit='bytes'>0</allocation>
  <available unit='bytes'>0</available>
  <source>
    <host name='10.66.110.191'/>
    <host name='10.66.111.149'/>
    <host name='10.66.110.36'/>
    <name>yy</name>
  </source>
</pool>

2. Disconnect 1 monitor
# iptables -A OUTPUT -d 10.66.110.191 -j DROP

3. Start the pool
# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown

Actual results:


Expected results:
I would expect the pool to start successfully even if 1 of the 3
monitors cannot be reached. If that expectation is wrong, the error
message should at least be improved.

Additional info:

Comment 1 Yang Yang 2016-01-12 09:05:06 UTC

libvirt info

# pwd
/root/libvirt
# git describe 
v1.3.1-rc1

Comment 2 Wido den Hollander 2016-01-15 12:42:37 UTC

Can you start libvirt with DEBUG enabled and show the output?

The debug output should tell you which monitors libvirt will use. The connection is handled by librados, and the "mon_host" option set via rados_conf_set() should list them.

Librados probably times out somewhere, but that should not happen.
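
One way to get that DEBUG output is to raise the log level for the storage driver in libvirtd's configuration. The sketch below is only an illustration: the helper name enable_storage_debug and the log file path are assumptions, and it presumes the target file does not already set log_filters or log_outputs.

```shell
#!/usr/bin/env bash
# Sketch: append debug-logging settings for the storage driver to a
# libvirtd configuration file. The function name is hypothetical.
enable_storage_debug() {
    local conf="$1"
    # Default log path is an assumption; pass a second argument to change it.
    local logfile="${2:-/var/log/libvirt/libvirtd.log}"
    {
        # Level 1 is debug; "storage" matches the storage driver's log categories.
        printf 'log_filters="1:storage"\n'
        # Write everything that passes the filter to a file.
        printf 'log_outputs="1:file:%s"\n' "$logfile"
    } >> "$conf"
}

# Typical use as root, followed by a daemon restart:
#   enable_storage_debug /etc/libvirt/libvirtd.conf
#   systemctl restart libvirtd
```

After the restart, retrying virsh pool-start should leave the RBD backend's debug lines in the chosen log file.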

Comment 3 Wido den Hollander 2016-01-15 13:02:50 UTC

I just tried this by adding two non-existing monitor hosts:


  <source>
    <host name='[2001:db8::100]'/>
    <host name='[2001:db8::101]'/>
    <host name='management.mbg.XXX.nl'/>
    <name>libvirt</name>
    <auth type='ceph' username='admin'>
      <secret uuid='f94812dd-f06f-48f6-9839-1edf7ee8f8d6'/>
    </auth>
  </source>


2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:171 : Found 3 RADOS cluster monitors in the pool configuration
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:193 : RADOS mon_host has been set to: [2001:db8::100],[2001:db8::101],management.mbg.XXXX.nl,
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:206 : Setting RADOS option client_mount_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:209 : Setting RADOS option rados_mon_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:212 : Setting RADOS option rados_osd_op_timeout to 30
2016-01-15 12:57:12.499+0000: 12317: debug : virStorageBackendRBDOpenRADOSConn:220 : Setting RADOS option rbd_default_format to 2


In my case this worked just fine and it failed over to the other monitor.
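
To compare what libvirt actually passed to librados against the pool XML, the mon_host line can be pulled out of a debug log like the one above. This is a rough sketch that relies only on the message text shown in that log; the function name list_mon_hosts is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print one monitor per line from every
# "RADOS mon_host has been set to:" debug message in a libvirtd log.
list_mon_hosts() {
    # Keep only the text after the message prefix, split the
    # comma-separated list, and drop the empty entry left by the
    # trailing comma.
    sed -n 's/.*RADOS mon_host has been set to: //p' "$1" \
        | tr ',' '\n' \
        | sed '/^$/d'
}

# Example (log path is an assumption):
#   list_mon_hosts /var/log/libvirt/libvirtd.log
```

For the log in this comment it would list the two IPv6 addresses and the management host name, which matches the three <host> elements in the pool source.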

Comment 4 Yang Yang 2016-01-18 08:29:32 UTC

Let me clarify my reproduction steps. I have a Ceph server with 3 monitors. First, I created an rbd pool using only 1 monitor host as the source. Then I created 20 volumes in the pool (I removed 1 volume through the Ceph server, but I don't think that matters), e.g.

1. # virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <name>yy</name>
  </source>
</pool>

2. # for i in {1..20}; do virsh vol-create-as rbd vol$i 100M; done

Next, I destroyed the pool, then updated the pool XML to use 3 monitor hosts as the source, e.g.

3.
# virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-edit rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

4. Scenario 1: I manually disconnected from the monitor used in step 1

# iptables -A OUTPUT -d 10.66.7.6 -j DROP
# ping 10.66.7.6
PING 10.66.7.6 (10.66.7.6) 56(84) bytes of data.

5. Tried to start the pool

# virsh pool-start rbd
Pool rbd started

# virsh vol-list rbd
error: Failed to list volumes
error: key in virGetStorageVol must not be NULL

The pool did start, but it took 10 minutes. Libvirt tried to connect for each volume, and each attempt took 30 seconds because the connection failed. If the pool contained 100 volumes, starting it would take 50 minutes. To the user this looks as if libvirt has hung. Another problem is that the volumes cannot be listed in the pool. See libvirtd-hang.log for details.
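
The timing above is simple arithmetic, sketched here under the assumption that every volume costs exactly one failed connection attempt of 30 seconds (the timeout value seen in the debug log in Comment 3). The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: worst-case pool start time if each volume waits out one
# connection timeout. Arguments: volume count, timeout in seconds.
estimate_pool_start_seconds() {
    local volumes="$1"
    local timeout="${2:-30}"   # 30 s per volume, as observed in this report
    echo $(( volumes * timeout ))
}

estimate_pool_start_seconds 20    # prints 600  (10 minutes)
estimate_pool_start_seconds 100   # prints 3000 (50 minutes)
```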

Scenario 2: restore the connection to the monitor blocked in step 4, then disconnect from another monitor. Starting the pool now fails with the unknown error. See libvirtd-unknown-error.log for details.

# iptables -D OUTPUT -d 10.66.7.6 -j DROP

# iptables -A OUTPUT -d 10.66.7.26 -j DROP

6. Destroy the rbd pool and then start it again
# virsh pool-destroy rbd
Pool rbd destroyed

# virsh pool-dumpxml rbd
<pool type='rbd'>
  <name>rbd</name>
  <uuid>2adec417-a833-48f6-8dc5-259ed8d0b6dc</uuid>
  <capacity unit='bytes'>56239325184</capacity>
  <allocation unit='bytes'>304</allocation>
  <available unit='bytes'>36592431104</available>
  <source>
    <host name='10.66.7.6'/>
    <host name='10.66.7.26'/>
    <host name='10.66.7.140'/>
    <name>yy</name>
  </source>
</pool>

# virsh pool-start rbd
error: Failed to start pool rbd
error: An error occurred, but the cause is unknown

Comment 5 Yang Yang 2016-01-18 08:32:44 UTC

Created attachment 1115773
Took long time to start pool

Comment 6 Yang Yang 2016-01-18 08:34:24 UTC

Created attachment 1115774
Unknown error out

Comment 7 Wido den Hollander 2016-01-27 15:38:26 UTC

I don't think libvirt is really to blame here. It just passes calls down to librados and librbd and lets them handle this.

The problem is that libvirt creates a new connection to the Ceph cluster for many calls, because the backend is not persistent.

It can't keep an active connection to the Ceph cluster open, so it has to create a new one every time a call is made.

I don't see an easy way to 'fix' this in libvirt.

Comment 8 Cole Robinson 2016-04-10 22:59:13 UTC

Sounds like WONTFIX according to Comment #7 but please reopen if I've misunderstood
