Bug 1223731 - Parallel resize and write operations cause the VM to crash on an RBD image with the object map feature enabled.
Summary: Parallel resize and write operations cause the VM to crash on an RBD image...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 1.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 1.3.2
Assignee: Jason Dillaman
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Duplicates: 1225005
Depends On:
Blocks:
 
Reported: 2015-05-21 11:07 UTC by Tanay Ganguly
Modified: 2017-07-30 15:33 UTC
CC: 12 users

Fixed In Version: RHEL: ceph-0.94.5-2.el7cp Ubuntu: ceph_0.94.5-2redhat1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-29 14:42:05 UTC
Embargoed:


Attachments
core backtrace (2.72 KB, text/plain) - 2015-05-22 17:58 UTC, Jason Dillaman
log (8.51 KB, text/plain) - 2015-05-23 06:35 UTC, Tanay Ganguly
rbd log (5.56 MB, text/plain) - 2016-02-03 11:39 UTC, Tanay Ganguly


Links
System ID | Private | Priority | Status | Summary | Last Updated
Ceph Project Bug Tracker 11743 | 0 | None | None | None | Never
Red Hat Bugzilla 1305421 | 0 | unspecified | CLOSED | Downstream document for setting up QEMU and Libvirt for using librbd | 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2016:0313 | 0 | normal | SHIPPED_LIVE | Red Hat Ceph Storage 1.3.2 bug fix and enhancement update | 2016-02-29 19:37:43 UTC

Internal Links: 1305421

Description Tanay Ganguly 2015-05-21 11:07:25 UTC
Description of problem:
While a VM is writing data to an RBD-hosted image, a parallel resize of the same RBD image causes the VM to shut down abruptly.

Version-Release number of selected component (if applicable):
ceph version 0.94.1
librbd1-0.94.1-10.el7cp.x86_64

How reproducible:
100%

Steps to Reproduce:

1. Create an RBD image, take a snapshot, and clone it.

rbd -p Tanay-RBD --image OM_NEWClone_new1 info
rbd image 'OM_NEWClone_new1':
        size 14302 MB in 3576 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.77981515f007c
        format: 2
        features: layering, exclusive, object map
        flags: object map invalid
        parent: Tanay-RBD/OM_NEW1@OM_NEWsnap1
        overlap: 9791 MB

2. Import the same Image OM_NEWClone_new1 into a VM as a spare disk.
3. Start resizing the RBD image, which has the object map feature enabled:

#!/bin/python
# Alternately grow and shrink the image ten times to race against guest IO.
import os
import random
import time

size = 14240
i = 0
new_size = 0
sh_size = 0
while i < 10:
        # Grow the image by a random amount (1-500 MB).
        x = random.randint(1, 500)
        new_size = size + x
        cmd1 = 'rbd resize Tanay-RBD/OM_NEWClone_new1 --size %s' % new_size
        print 'cmd is %s' % cmd1
        os.system(cmd1)
        time.sleep(5)
        # Shrink it back by a random amount (1-100 MB).
        x = random.randint(1, 100)
        sh_size = new_size - x
        cmd2 = 'rbd resize Tanay-RBD/OM_NEWClone_new1 --size %s --allow-shrink' % sh_size
        print 'cmd2 is %s' % cmd2
        os.system(cmd2)
        i = i + 1

4. While the resizing is happening, start some IO on the VM (where the RBD image OM_NEWClone_new1 is added as the spare disk):

strace dd if=500-2 of=/dev/vdb bs=1M count=500

Actual results:
When the write operations start they take the lock, and while the resize is happening it sometimes acquires a lock. I think there is some deadlock which is causing the VM to crash.

When the VM write operations start, the resize operations stall. Only once the VM has abruptly shut down do the resize operations continue and complete successfully.

Expected results:
Both resize and IO operations should continue gracefully.

Additional info:

Comment 2 Josh Durgin 2015-05-22 00:50:11 UTC
Do you have the core file from the crash, or the traceback ceph prints? It should be in qemu's stderr, wherever that was redirected to.

If there's no log from qemu, you should also be able to trigger it via e.g. rbd bench-write --io-pattern rand, with the resizes happening in parallel. Then the crash will be printed to stderr of rbd bench-write, and you may be able to control ulimit to get a core dump more easily.

I'd suggest that this one not block the release, since it's a rare scenario that can't be triggered by OpenStack (resize is only allowed when the volume is not in use, and only increasing the size is possible).
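For example, roughly (an untested sketch; pool/image names are taken from the description, and the sizes are arbitrary):

  ulimit -c unlimited    # make it easier to get a core dump

  # resize loop in the background: grow, then shrink
  while true; do
      rbd resize Tanay-RBD/OM_NEWClone_new1 --size 14500
      rbd resize Tanay-RBD/OM_NEWClone_new1 --size 14300 --allow-shrink
  done &

  # random-pattern writes in the foreground; a crash backtrace lands on stderr
  rbd bench-write Tanay-RBD/OM_NEWClone_new1 --io-pattern rand
  kill %1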

Comment 3 Ken Dreyer (Red Hat) 2015-05-22 17:38:19 UTC
Thanks Josh, re-targeting to 1.3.1 based on Comment 2.

Comment 4 Jason Dillaman 2015-05-22 17:58:35 UTC
Created attachment 1028908 [details]
core backtrace

Comment 5 Tanay Ganguly 2015-05-23 06:14:07 UTC
Thanks Jason.

Josh,
Please find the VM log from QEMU.

Comment 6 Tanay Ganguly 2015-05-23 06:35:18 UTC
Created attachment 1028997 [details]
log

Somehow I was unable to attach the file.

Comment 7 Jason Dillaman 2015-05-26 13:59:33 UTC
*** Bug 1225005 has been marked as a duplicate of this bug. ***

Comment 9 Ken Dreyer (Red Hat) 2015-12-10 20:44:37 UTC
Hi Jason, what timeframe are we looking at for landing this fix?

Comment 10 Jason Dillaman 2015-12-10 20:55:38 UTC
commit d4eb7bd63a5e46b790bad7001d2873c5c238cc90
Author: Jason Dillaman <dillaman>
Date:   Tue Jun 23 14:18:20 2015 -0400

    librbd: invalidate cache outside cache callback context
    
    When shrinking an image, it's possible that the op flush callback
    will be from within the cache callback context.  This would result
    in a deadlock when attempting to re-lock the cache lock in order to
    invalidate the cache.
    
    Fixes: #11743
    Backport: hammer
    Signed-off-by: Jason Dillaman <dillaman>
    (cherry picked from commit 726d699b7790c7e371279281ab32cd3aeb8ece7b)


# git tag --contains d4eb7bd63a5e46b790bad7001d2873c5c238cc90
v0.94.4

Comment 12 Tanay Ganguly 2016-01-29 10:36:05 UTC
Jason,
The write is completing from the VM (the parallel resize is also working fine), but in the VM log I am seeing lots of error messages:

block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
block I/O error in device 'drive-virtio-disk1': Invalid argument (22)

I did a checksum match of the data I am writing to and reading from the RBD image.
The checksums match, but I just wanted to understand why I am seeing this error message in the VM log.

Followed the steps mentioned in comment #1
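For reference, the comparison was along these lines (a sketch; it assumes the source file 500-2 is exactly 500 MiB, matching the dd count):

  md5sum 500-2                              # checksum of the data written
  dd if=/dev/vdb bs=1M count=500 | md5sum   # checksum of the data read back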

Comment 13 Jason Dillaman 2016-01-29 15:38:27 UTC
Can you please attach the RBD logs from the qemu-kvm process?

Comment 14 Tanay Ganguly 2016-02-03 11:30:28 UTC
Jason,

Please find the attached log.

Comment 15 Tanay Ganguly 2016-02-03 11:39:45 UTC
Created attachment 1120720 [details]
rbd log

Comment 16 Jason Dillaman 2016-02-03 14:27:03 UTC
The attached file doesn't appear to be a librbd log; either that, or you only had messenger debug logs enabled. I need to see the librbd debug logs to determine whether librbd is throwing the EINVAL error.

Comment 17 Jason Dillaman 2016-02-03 16:03:55 UTC
With the node in SELinux enforcing mode, Ceph logs are not being generated from the qemu-kvm processes.

Added the following section to ceph.conf:

[client.libvirt]
log file = /var/log/libvirt/ceph/rbd-$pid.$cctid.log
admin socket = /var/log/libvirt/ceph/rbd-$pid.$cctid.asok

Since libvirt will use the 'libvirt' client id, this directs all libvirt VMs to use "/var/log/libvirt/ceph" as the log and asok directory.

I also ran the following:

mkdir /var/log/libvirt/ceph
chown qemu:qemu /var/log/libvirt/ceph  # qemu-kvm runs under the qemu account
chcon system_u:object_r:svirt_image_t:s0 /var/log/libvirt/ceph  # SELinux sVirt profile to allow R/W access to the log/asok files
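A quick way to confirm the label took effect (not part of the workaround itself):

  ls -dZ /var/log/libvirt/ceph   # should now show svirt_image_t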

Comment 18 Ken Dreyer (Red Hat) 2016-02-03 23:50:12 UTC
Unfortunately custom chcon operations will be wiped out during the next system relabel, so we'll need a more permanent solution.

What is the default location of the logs/asok to which libvirt would normally write? (if "client.libvirt" was not configured?) I'm guessing that location does not have the necessary svirt_image_t context?

Comment 19 Jason Dillaman 2016-02-04 04:19:26 UTC
If you don't include those entries, you don't get those files.  Upstream has some RBD documentation suggesting you set "log path = /var/log/qemu/qemu-guest-$pid.log" and "admin socket = /var/run/ceph/guests/$cluster-$type.$id.$pid.$cctid.asok" [1].  I'm not sure which docs should be covering RHCS <-> RHEL-OSP/other OpenStack.

Looking at the selinux-targeted policy, it looks like only the following directories have virt_image_t:

/var/lib/libvirt/images(/.*)?
/var/lib/imagefactory/images(/.*)?

It would take someone with more SELinux knowledge than I have to know whether other file contexts would work within system_u:object_r:qemu_exec_t.

[1] http://docs.ceph.com/docs/master/rbd/rbd-openstack/

Comment 20 Ken Dreyer (Red Hat) 2016-02-04 20:53:20 UTC
Boris / Milan, can you please let us know how to satisfy SELinux while enabling librbd's logs within libvirt?

The problem is explained in Comment #17. We can make librbd write its logfile/asok anywhere, but SELinux will only allow writes when the location has a svirt_image_t context.

Comment 21 Boris Ranto 2016-02-04 23:06:49 UTC
You should get the proper context if you store the logs in e.g.

* /var/lib/libvirt/images/ceph/

directory. That line means that everything under /var/lib/libvirt/images will be labelled that way by default (if it was created there and not just mv'd there). Note that this won't include sub-mounts; you need to re-label those manually after the first mount via restorecon:

* mount /some/dev /var/lib/libvirt/images/ceph/ && restorecon /var/lib/libvirt/images/ceph/

Alternatively, you can create persistent rules for files with 'semanage fcontext', but that is generally not recommended for general-purpose cases. In your case, it would be something like

* semanage fcontext -a -t virt_image_t "/var/lib/libvirt/ceph(/.*)?"
* restorecon -R -v /var/lib/libvirt/ceph

Comment 22 Jason Dillaman 2016-02-05 18:45:42 UTC
The image was shrunk and IO was issued to an offset outside the image size.  This error is correct and to be expected:

2016-02-06 05:05:02.512314 7f3a4dfba700 20 librbd: aio_read 0x7f3a79688900 completion 0x7f3a827d5680 [1006863581184,4096]
2016-02-06 05:05:02.512334 7f3a4dfba700 20 librbd: ictx_check 0x7f3a79688900
2016-02-06 05:05:02.512336 7f3a4dfba700 -1 librbd::AioCompletion: AioCompletion::fail() 0x7f3a827d5680: (22) Invalid argument

(In reply to Tanay Ganguly from comment #12)
> Jason,
> The write is completing from the VM (the parallel resize is also working
> fine), but in the VM log I am seeing lots of error messages:
> 
> block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
> block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
> block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
> block I/O error in device 'drive-virtio-disk1': Invalid argument (22)
> 
> I did a checksum match of the data I am writing to and reading from the
> RBD image. The checksums match, but I just wanted to understand why I am
> seeing this error message in the VM log.
> 
> Followed the steps mentioned in comment #1

Comment 23 Ken Dreyer (Red Hat) 2016-02-06 03:30:24 UTC
I think the conclusion here is that we need to make sure the docs clarify two points:

1) When debugging VMs running under libvirt, the librbd settings such as "log file" and "admin socket" should be configured in /etc/ceph/ceph.conf under a [client.libvirt] header, not a [global] header. Putting settings like "admin socket" and "log file" under [global] means that everything will try to write to the same place, which can cause permission errors, SELinux errors, and other hard-to-debug issues. (In a typical customer setup, an OSD would not also be serving as a client, but we might as well make this explicit.)

2) When SELinux is enabled, the qemu process (and librbd) can only write to areas labeled virt_image_t. The "log file" and "admin socket" paths need to point somewhere under /var/lib/libvirt/images/ceph/.

So combining Jason's and Boris's suggestions:

  mkdir /var/lib/libvirt/images/ceph/

and in ceph.conf:

  [client.libvirt]
  log file = /var/lib/libvirt/images/ceph/rbd-$pid.$cctid.log
  admin socket = /var/lib/libvirt/images/ceph/rbd-$pid.$cctid.asok
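To sanity-check the result (a sketch; the domain name is hypothetical):

  virsh start myguest                     # or restart a guest so librbd reopens the image
  ls -lZ /var/lib/libvirt/images/ceph/    # rbd-<pid>.<cctid>.log/.asok should appear and
                                          # inherit the directory's virt_image_t context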

Tanay, were you following any pre-existing documentation regarding librbd debugging? I'm wondering if there's an existing section that we need to modify, or if we should just add an entirely new doc section for this.

Comment 24 Tanay Ganguly 2016-02-08 08:51:03 UTC
Ken,

Actually, I didn't find any downstream document where logging for libvirt is mentioned.

And there is no specific document (w.r.t. your 2nd point); I think we need to add this to some document.

Documentation Team,

Please let me know if I am missing some document (please read comment #23 from Ken).
If it exists, please share the link to the document.

FYI,

I opened a new ticket https://bugzilla.redhat.com/show_bug.cgi?id=1305421
which mentions the need to create a downstream document for librbd with QEMU/Libvirt. I think in that place we can address this logging part as well.

Comment 25 Boris Ranto 2016-02-08 14:24:18 UTC
Re-reading this a bit, it is probably not the best idea to use the /var/lib/libvirt/images location for things that are not images. While the context may give you the privileges that allow you to do R/W, it is not the only context that will allow you to do that.

The qemu (libvirt) process should be able to R/W logs even if they are located in the

/var/log/libvirt/ceph/

directory with the context virt_log_t. That context is designed for virtualization log files. As for the asok files, we should probably use the

/var/run/libvirt/ceph/

or

/var/run/ceph/

directory with the context virt_var_run_t/ceph_var_run_t for these files. This depends mostly on things like who creates them, who needs to access them, and so on. We might need to update the ceph policy to allow full R/W in this case, though.
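To see what context a given path would receive by default under the current policy, something like this works (the output shown is what I would expect; it depends on the installed policy):

  matchpathcon /var/log/libvirt/ceph /var/run/ceph
  # /var/log/libvirt/ceph  system_u:object_r:virt_log_t:s0
  # /var/run/ceph          system_u:object_r:ceph_var_run_t:s0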

Comment 26 Ken Dreyer (Red Hat) 2016-02-08 21:12:20 UTC
(In reply to Boris Ranto from comment #25)
> We might need to update ceph policy to allow the full R/W in
> this case, though.

That's part of the issue; librbd clients won't have ceph-selinux installed.

Comment 27 Boris Ranto 2016-02-08 23:05:48 UTC
(In reply to Ken Dreyer (Red Hat) from comment #26)
> (In reply to Boris Ranto from comment #25)
> > We might need to update ceph policy to allow the full R/W in
> > this case, though.
> 
> That's part of the issue; librbd clients won't have ceph-selinux installed.

If we get the denials as referenced in

https://bugzilla.redhat.com/show_bug.cgi?id=1303799#c12

then it means that the daemons could not tamper with these files, not the clients, and we can change the (server) policy to accommodate that.

The .log related denials should simply disappear once we use the proper log dir (/var/log/ceph/<stg>). As for asok files, I'd need more information on what/where creates and accesses them -- i.e. if qemu accesses them then we should put them in a location where qemu is allowed to access them and update the ceph policy in a way that allows ceph daemons to access them.

Comment 28 Josh Durgin 2016-02-08 23:32:16 UTC
(In reply to Boris Ranto from comment #27)
> (In reply to Ken Dreyer (Red Hat) from comment #26)
> > (In reply to Boris Ranto from comment #25)
> > > We might need to update ceph policy to allow the full R/W in
> > > this case, though.
> > 
> > That's part of the issue; librbd clients won't have ceph-selinux installed.
> 
> If we get the denials as referenced in
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1303799#c12
> 
> then it means that the daemons could not tamper with these files, not the
> clients, and we can change the (server) policy to accommodate that.
> clients and we can change the (server) policy to accommodate that.
> 
> The .log related denials should simply disappear once we use the proper log
> dir (/var/log/ceph/<stg>). As for asok files, I'd need more information on
> what/where creates and accesses them -- i.e. if qemu accesses them then we
> should put them in a location where qemu is allowed to access them and
> update the ceph policy in a way that allows ceph daemons to access them.

On the client side, whatever is using librbd creates the asok files, and they'd only be accessed by an administrator.

Mainly we want qemu to be able to create and read/write to the asok file. It'd be nice if nova/cinder/glance (which also use librbd) could create asok files as well, but they use short-lived instances of librados/librbd that make that less useful unless something has gone very wrong.

So I'd suggest just worrying about asok files for qemu. They'd be created whenever an rbd image is opened by qemu (at boot, or at disk attach time afterwards).
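(For reference, an administrator would query such a socket roughly like this; the path follows the pattern suggested earlier in this bug, and the pid/cctid values are made up:)

  ceph --admin-daemon /var/lib/libvirt/images/ceph/rbd-12345.94824098.asok help
  ceph --admin-daemon /var/lib/libvirt/images/ceph/rbd-12345.94824098.asok perf dump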

Comment 29 Boris Ranto 2016-02-09 00:28:47 UTC
(In reply to Josh Durgin from comment #28)
.
.
.
> 
> On the client side, whatever is using librbd creates the asok files, and
> they'd only be accessed by an administrator.
> 
> Mainly we want qemu to be able to create and read/write to the asok file.
> It'd be nice if nova/cinder/glance (which also use librbd) could create asok
> files as well, but they use short-lived instances of librados/librbd that
> make that less useful unless something has gone very wrong.
> 
> So I'd suggest just worrying about asok files for qemu. They'd be created
> whenever an rbd image is opened by qemu (at boot, or at disk attach time
> afterwards).

Hmm, two solutions come to mind here. We could either try to write the .asok files to a directory that qemu/libvirt can R/W according to their policy and tune our policy so that the ceph daemons can access these files, or we could create a tunable that will allow all processes to access ceph socket files (alongside the rest of the files in /var/run/ceph). In the latter case, a user would have to run a command like

setsebool ceph_run_access_all=on

in order to allow this R/W access to an arbitrary process. The user would have to do that only if he actually needed other processes to R/W the ceph /var/run/ceph/* files. I believe the second solution is generally the preferred one in an SELinux context, but I'll need to think about it a bit more/discuss it with more core SELinux guys. I'll let you know about the results ASAP.

Comment 30 Boris Ranto 2016-02-09 15:11:34 UTC
I've given this some more thought and, if I (hopefully) understand this correctly, we do have a server/client architecture here, so we can have separate locations for the socket files and logs on each system. Hence, why don't we let the clients put their files in a location where they can r/w them? In this case we are dealing with qemu, so

/var/{log,run}/libvirt/(qemu/)

directories for the client logs/sockets and

/var/{log,run}/ceph/

directory for the server logs/sockets. Hopefully, we can achieve this if we mix the configuration in ceph.conf like this:

[global]
log file = /var/log/ceph/ceph-$pid.$cctid.log
admin socket = /var/run/ceph/ceph-$pid.$cctid.asok

[client.libvirt]
log file = /var/log/libvirt/ceph/rbd-$pid.$cctid.log
admin socket = /var/run/libvirt/ceph/rbd-$pid.$cctid.asok


Afterwards, we should probably document what to do when you use non-default locations for log files/admin sockets and how to choose the right locations for these.

Comment 31 Jason Dillaman 2016-02-09 15:40:26 UTC
I think the original issue is that the context on /var/log/libvirt/qemu/ is "system_u:object_r:virt_log_t:s0" and the RBD client running under "system_u:object_r:qemu_exec_t" isn't able to create new files in this directory.

I am assuming qemu-kvm is able to write its logs there because libvirt redirects stdout/stderr to the log file for the process.  I'd assume the same is true for /var/run/libvirt/qemu which has context "system_u:object_r:qemu_var_run_t:s0", but I'll admit I haven't tried.

Comment 32 Boris Ranto 2016-02-09 21:28:40 UTC
@Jason: The qemu process should be able to at least append to the log files there, as qemu has virt_append_log(qemu_t) in its SELinux policy. Something similar should hopefully hold for qemu_var_run_t. It would be really helpful if you could try running your test scenario with the suggested config (minor updates are welcome, ofc) in permissive mode and attach the messages from /var/log/audit/audit.log here.

Anyway, if you were hitting problems like Vasu in bz1303799#c12, then you probably had the config entries in the [global] section, making the ceph{-osd,-mon,-mds} daemons use the non-default locations.

Comment 33 Jason Dillaman 2016-02-09 21:41:39 UTC
@Boris: I only ran into this issue because qemu-kvm (librbd/librados) wasn't able to create log or asok files in /var/log/libvirt/qemu -- that is why I wrote my (poor) workaround in c17.

Semi-related to bz1223652#c19, as the host had "log file = /var/log/libvirt/qemu/rbd-$pid.$cctid.log" in [global].

Comment 34 Boris Ranto 2016-02-10 14:40:37 UTC
Did you by any chance manage to gather the audit log? It would really help guide my next steps here.

Given the limited time-frame for the downstream release, we will probably have to just modify the policy to allow access to these files. We should continue investigating this upstream, though.
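Something like the following should capture what I need (a sketch, assuming auditd is running):

  ausearch -m AVC -ts recent -c qemu-kvm     # recent denials for the qemu-kvm process
  ausearch -m AVC -c qemu-kvm | audit2allow  # summarize what policy the denials imply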

Comment 35 Jason Dillaman 2016-02-10 15:34:17 UTC
@Boris: I didn't copy the log but I did copy the audit line:

audit.log:type=AVC msg=audit(1454366026.841:150902): avc:  denied  { add_name } for  pid=73812 comm="qemu-kvm" name="rbd-73812.140599771217920.log" scontext=system_u:system_r:svirt_t:s0:c457,c714 tcontext=system_u:object_r:virt_log_t:s0 tclass=dir

Forgot that this was running under sVirt context, not the normal "system_u:object_r:qemu_exec_t" context.

I think it makes sense to track this as a new ticket upstream and downstream (bz1305421) for continued investigation. From the point of view of this particular BZ, I believe Tanay has verified that the original issue is fixed.

Comment 36 Boris Ranto 2016-02-11 12:46:14 UTC
@Jason: That audit line is quite interesting.

First, if the qemu process is started by libvirt then it probably inherits the svirt(exec_)_t context from libvirt. The qemu(_exec)_t is the context you get by default if you run qemu directly, e.g. from a shell. This can be useful as it gives a binary a way to run with a higher permission level -- e.g. it can allow you to write logs to a new location, etc. [1]

The denial says that it refused to add the name 'rbd-73812.140599771217920.log' to the directory listing (probably) for /var/log/libvirt/, which essentially means that it stopped qemu-kvm from creating the log file. It should help if you hosted the qemu log files in the /var/log/libvirt/qemu/ directory instead, which should hopefully allow you to add files there. The permission to add files to /var/log/libvirt/ is probably reserved for libvirtd and similar [2].

All in all, if you only hit that one denial then it should get fixed once the proper configuration options are used (or you tune the contexts manually) and there is no need for us to update the SELinux policy for ceph.

[1] I'm not familiar with virt-stack SELinux policies so I can't say for sure they use it there that way.

[2] Again, I'm not all that familiar with the virt-stack SELinux policies so I'm mostly guessing on its layout at the moment.

Comment 37 Ken Dreyer (Red Hat) 2016-02-12 22:39:34 UTC
Based on Tanay's Comment #12 and Jason's Comment #22, we can agree that the original issue described in this bug report is fixed.

The problem has now shifted to documenting precisely where librbd should write its files to satisfy the filesystem permissions and SELinux when running in conjunction with libvirt+qemu.

Since those docs are being handled in bz 1305421, I'm putting this bug back ON_QA for verification that the *original* issue is fixed.

Comment 38 Tanay Ganguly 2016-02-14 15:00:35 UTC
Marking this Bug as Verified.

Comment 40 errata-xmlrpc 2016-02-29 14:42:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313

