Created attachment 1059057 [details]
Rbd crash

Description of problem:
After hitting Bug 1249004, rebooting the VM produces an RBD crash in vm.log.

Version-Release number of selected component (if applicable):
ceph version 0.94.1.4

How reproducible:
100%

Steps to Reproduce:
1. Hit Bug 1249004 (steps are described in that bug).
2. Force off the VM.

Actual results:
Once the VM is powered off, the crash dump appears in vm.log:

inputs_connect: inputs channel client create
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
block I/O error in device 'drive-virtio-disk0': Invalid argument (22)
qemu: terminating on signal 15 from pid 21361
osdc/ObjectCacher.cc: In function 'ObjectCacher::~ObjectCacher()' thread 7f93b412ca40 time 2015-08-04 08:00:48.850180
osdc/ObjectCacher.cc: 551: FAILED assert(i->empty())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x160a15) [0x7f93af0dea15]
 2: (()+0x3a3dd1) [0x7f93af321dd1]
 3: (()+0x57346) [0x7f93aefd5346]
 4: (()+0x7d859) [0x7f93aeffb859]
 5: (rbd_close()+0x9) [0x7f93aefba719]
 6: (()+0xa709d) [0x7f93b420c09d]
 7: (()+0x8ace9) [0x7f93b41efce9]
 8: (()+0x8aba8) [0x7f93b41efba8]
 9: (()+0x8ad73) [0x7f93b41efd73]
 10: (()+0x8ae18) [0x7f93b41efe18]
 11: (()+0x75d0f) [0x7f93b41dad0f]
 12: (__libc_start_main()+0xf5) [0x7f93aa057af5]
 13: (()+0x7714d) [0x7f93b41dc14d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
2015-08-04 12:00:49.402+0000: shutting down

Expected results:
There should not be any crash.

Additional info:
Logs attached.
Jason, I am unable to reproduce this RBD crash with the latest build, 0.94.1.7. Has the fix for the original bug (https://bugzilla.redhat.com/show_bug.cgi?id=1249004) resolved this issue as well? If not, can you please let me know a definite set of steps to reproduce this particular bug?
This crash will only be seen when the OSDs return an error. Since the OSD error WRT updating the object map has been corrected, you will need a new mechanism to induce an OSD error to reproduce this issue. One possibility might be to enable a quota on a pool and fill it up -- forcing the OSDs to return an ENOSPC error.
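The quota approach above can be set up with the standard ceph/rbd CLI. A minimal sketch, assuming a running test cluster; the pool name 'rbd' and image name 'foo' are placeholders:

```shell
# Sketch only: assumes an existing test cluster with a pool named 'rbd'.
# Cap the pool at 3 objects so writes beyond that allocation fail.
ceph osd pool set-quota rbd max_objects 3

# Create a small test image ('foo' is a placeholder name, size in MB).
rbd create rbd/foo --size 1024

# Any write that would allocate an object past the quota should now cause
# the OSDs to return ENOSPC (28), exercising the error path in question.
```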
Jason,

Can you please clarify:
> Has the fix for the original bug (https://bugzilla.redhat.com/show_bug.cgi?id=1249004) fixed this (1250042) issue as well?

Regards,
Harish
Negative -- restating comment 3: the bug still exists but you need a new way to expose it since the OSD is no longer returning an error when attempting to update the object map from an IO write operation.
Comment #3 has instructions on how to reproduce, and Harish's e-mail says Tanay will test this. We should have an update on Monday.
As per Jason's comment 3, I tried to induce ENOSPC by setting a quota limit on the pool. The client received ENOSPC upon trying to exceed the quota.

Log snippet from /var/log/libvirt/qemu/rhel7.0.log:
=================================================
char device redirected to /dev/pts/2 (label charserial0)
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 574.700000 ms, bitrate 1310529 bps (1.249818 Mbps) LOW BANDWIDTH
red_dispatcher_set_cursor_peer:
inputs_connect: inputs channel client create
block I/O error in device 'drive-virtio-disk0': No space left on device (28)
block I/O error in device 'drive-virtio-disk0': No space left on device (28)
block I/O error in device 'drive-virtio-disk0': No space left on device (28)
block I/O error in device 'drive-virtio-disk0': No space left on device (28)
block I/O error in device 'drive-virtio-disk0': No space left on device (28)
qemu: terminating on signal 15 from pid 12775
red_channel_client_disconnect_dummy: rcc=0x7f2c086c2d40 (channel=0x7f2c082a7c70 type=5 id=0)
snd_channel_put: SndChannel=0x7f2c0867a5e0 freed
red_channel_client_disconnect_dummy: rcc=0x7f2c086a4080 (channel=0x7f2c082a7f60 type=6 id=0)
snd_channel_put: SndChannel=0x7f2c086b2870 freed
===================================================

But the previous crash did not happen on the client side. Surprisingly, I couldn't find any log entries related to the space issue on the OSDs despite having log level 20.

@Jason, could you please have a look into this or let me know other ways of reproducing the issue?
Cluster details (ceph osd tree):

-1 13.49991 root default
-2  2.69998     host magna094
 0  0.89999         osd.0   up 1.00000 1.00000
 1  0.89999         osd.1   up 1.00000 1.00000
 2  0.89999         osd.2   up 1.00000 1.00000
-3  2.69998     host magna112
 3  0.89999         osd.3   up 1.00000 1.00000
 4  0.89999         osd.4   up 1.00000 1.00000
 5  0.89999         osd.5   up 1.00000 1.00000
-4  2.69998     host magna114
 6  0.89999         osd.6   up 1.00000 1.00000
 7  0.89999         osd.7   up 1.00000 1.00000
 8  0.89999         osd.8   up 1.00000 1.00000
-5  2.69998     host magna106
 9  0.89999         osd.9   up 1.00000 1.00000
10  0.89999         osd.10  up 1.00000 1.00000
11  0.89999         osd.11  up 1.00000 1.00000
-6  2.69998     host magna034
12  0.89999         osd.12  up 1.00000 1.00000
13  0.89999         osd.13  up 1.00000 1.00000
14  0.89999         osd.14  up 1.00000 1.00000

Client details: 10.12.27.2 (root/redhat)

You can use the librbd_api.py script to induce the error quickly, or I have QEMU already set up where the installation will fail:

virsh list
 Id    Name       State
----------------------------------------------------
 8     rhel7.0    running
I reproduced it on the first attempt on my local machine by creating an image, setting the pool max-object quota to three, and running the following with "rbd cache writethrough until flush = false" in my config:

import rbd
import rados

cluster = rados.Rados(conffile='ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
image = rbd.Image(ioctx, 'foo')
data = '1' * 1024
image.write(data, 1 << 22)
image.write(data, 2 << 22)
Thanks Jason. After adding "rbd cache writethrough until flush = false" to the config file, I am able to reproduce the issue.

I have a general query now that I understand the problem; please correct me if my understanding is wrong. I still didn't see any ENOSPC error in the OSD logs. Is that because the client-side librbd cache is handling the writes and returning from there, so the OSDs are never reached and hence no ENOSPC message appears in the OSD logs? I tried strace and the OSD logs to confirm this. If the above statement is true, could you please briefly explain the layers involved in the interaction?

Log snippet:
===============================================================================
osdc/ObjectCacher.cc: In function 'ObjectCacher::~ObjectCacher()' thread 7f76b139c740 time 2015-08-25 02:03:09.727456
osdc/ObjectCacher.cc: 551: FAILED assert(i->empty())
 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x160bf5) [0x7f7688b7cbf5]
 2: (()+0x3a3fe1) [0x7f7688dbffe1]
 3: (()+0x57496) [0x7f7688a73496]
 4: (()+0x7da29) [0x7f7688a99a29]
 5: (rbd_close()+0x9) [0x7f7688a587c9]
 6: (ffi_call_unix64()+0x4c) [0x7f76a9797dac]
 7: (ffi_call()+0x1f5) [0x7f76a97976d5]
 8: (_ctypes_callproc()+0x30b) [0x7f76a99aac8b]
 9: (()+0xaa85) [0x7f76a99a4a85]
 10: (PyObject_Call()+0x43) [0x7f76b0e0f073]
 11: (PyEval_EvalFrameEx()+0x1d4c) [0x7f76b0ea334c]
 12: (PyEval_EvalFrameEx()+0x4350) [0x7f76b0ea5950]
 13: (PyEval_EvalCodeEx()+0x7ed) [0x7f76b0ea71ad]
 14: (()+0x6f098) [0x7f76b0e34098]
 15: (PyObject_Call()+0x43) [0x7f76b0e0f073]
 16: (()+0x59085) [0x7f76b0e1e085]
 17: (PyObject_Call()+0x43) [0x7f76b0e0f073]
 18: (PyEval_CallObjectWithKeywords()+0x47) [0x7f76b0ea0ff7]
 19: (()+0xa106c) [0x7f76b0e6606c]
 20: (()+0x9c0aa) [0x7f76b0e610aa]
 21: (()+0x7e9ef) [0x7f76b0e439ef]
 22: (()+0x803b0) [0x7f76b0e453b0]
 23: (_PyModule_Clear()+0x16c) [0x7f76b0e49e8c]
 24: (PyImport_Cleanup()+0x45b) [0x7f76b0eb6a7b]
 25: (Py_Finalize()+0xfe) [0x7f76b0ec233e]
 26: (Py_Main()+0x6a5) [0x7f76b0ed3545]
 27: (__libc_start_main()+0xf5) [0x7f76b0100af5]
 28: python() [0x400721]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted
===============================================================================
The OSDs do report the quota error:

2015-08-24 15:41:26.202818 7f48f478b700 1 -- 172.16.231.161:6800/41202 --> 172.16.231.161:0/1042309 -- osd_op_reply(9 rbd_data.100674b0dc51.0000000000000002 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~1024] v0'0 uv0 ondisk = -28 ((28) No space left on device)) v6 -- ?+0 0x4452000 con 0x4249b20
Fixed in upstream's v0.94.4.
Unable to reproduce the crash. Marking it as Verified.

Use cases tested with a quota set on the pool and heavy IO:
- Resize operations (increase/shrink in a loop)
- Librbd VM operations (VM creation and creating a filesystem on top of RBD)
- VM migration with IO (migration from one host to another)
- Different RBD image sizes (100 GB / 781 TB)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313