Description of problem:
KVM guest uses 100% CPU when LVM snapshot reaches 100% usage, then cannot re-activate LVM snapshot after lvextend (not sure what component to select for this report)

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Linux dirsec2-seg.lab.sjc.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
device-mapper-1.02.39-1.el5
lvm2-2.02.56-8.el5
kvm-qemu-img-83-164.el5
libvirt-0.6.3-33.el5

How reproducible:
Happened once; still cannot recover the LVM snapshot. I haven't spent the time to try to recreate the issue a second time.

Steps to Reproduce:
1. have RHEL 5.5 x86_64 KVM host
2. have LVM and too small LVM snapshot for KVM guest image
3. have KVM guest with RHEL 5 x86_64
4. KVM guest does some task, fills the LVM snapshot up to the max

Actual results:
- lost control of KVM guest when LVM snapshot reached 100% usage
- the CPU allocated to the KVM guest was running at 100%
- forced shutdown of the KVM guest (could not virsh reboot)
- cannot re-activate LVM snapshot after lvextend usage

Expected results:

Additional info:
After resizing the snapshot with lvextend from 1G to 4G, could not re-activate the LVM snapshot:

lvchange -a y /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1
  /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1: read failed after 0 of 4096 at 0: Input/output error
  Can't change snapshot logical volume "cs80el5x8664ms2cs8dash2DevelSnap1"

Both lvs and lvdisplay show the new size of 4G (the older value was 1G), but the snapshot is still 100% full, which I did not expect after the resize:

lvs
  /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1: read failed after 0 of 4096 at 0: Input/output error
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  ...
  cs80el5x8664ms2cs8dash2DevelSnap1 VolGroup00 Swi-Io 4.00G cs80el5x8664ms2cs8dash2modnssOcspHttpRHCS80devMaster 100.00

lvdisplay /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1
  /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/VolGroup00/cs80el5x8664ms2cs8dash2DevelSnap1
  VG Name                VolGroup00
  LV UUID                xEwr3d-XIFS-Ox8W-6LBe-eHcg-siJl-yHmhZ2
  LV Write Access        read/write
  LV snapshot status     INACTIVE destination for /dev/VolGroup00/cs80el5x8664ms2cs8dash2modnssOcspHttpRHCS80devMaster
  LV Status              available
  # open                 0
  LV Size                6.00 GB
  Current LE             192
  COW-table size         4.00 GB
  COW-table LE           128
  Snapshot chunk size    4.00 KB
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:46

I am not sure how I can re-activate this snapshot. I could not locate much relevant info in the KB, on the intranet, or via Google search.

Tried system-config-lvm out of curiosity in a VNC session, but got an exception:

[root@dirsec2-seg ~]# system-config-lvm
/usr/share/system-config-lvm/cylinder_items.py:1032: GtkWarning: gdk_pixbuf_scale_simple: assertion `dest_width > 0' failed
  scaled_pixbuf = self.pixbuf.scale_simple(pixmap_width, height, gtk.gdk.INTERP_BILINEAR)
Traceback (most recent call last):
  File "/usr/share/system-config-lvm/Volume_Tab_View.py", line 454, in on_tree_selection_changed
    self.on_best_fit(None)
  File "/usr/share/system-config-lvm/Volume_Tab_View.py", line 536, in on_best_fit
    self.display_view.draw()
  File "/usr/share/system-config-lvm/renderer.py", line 591, in draw
    self.display.draw(self.da, self.gc, (10, y_offset))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 920, in draw
    self.cyl_upper.draw(pixmap, gc, (x, y))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 305, in draw
    CylinderItem.draw(self, dc, gc, (x, y))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 120, in draw
    child.draw(dc, gc, (x, y))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 305, in draw
    CylinderItem.draw(self, dc, gc, (x, y))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 120, in draw
    child.draw(dc, gc, (x, y))
  File "/usr/share/system-config-lvm/cylinder_items.py", line 311, in draw
    cyl_pix = self.cyl_gen.get_cyl(dc, self.get_width(), self.height)
  File "/usr/share/system-config-lvm/cylinder_items.py", line 1039, in get_cyl
    pixmap.draw_pixbuf(gc, scaled_pixbuf, 0, 0, 0, 0, -1, -1)
TypeError: GdkDrawable.draw_pixbuf() argument 2 must be gtk.gdk.Pixbuf, not None
The program 'system-config-lvm' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadAlloc (insufficient resources for operation)'.
  (Details: serial 26941 error_code 11 request_code 53 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
[root@dirsec2-seg ~]#
I'm pretty sure that if a snapshot, or a VG containing a snapshot, reaches 100% utilization, it is effectively corrupt, because it can no longer store deltas for the ongoing changes on the master volume. This isn't a virt problem, so I am re-assigning to LVM based on the error messages.
If a snapshot reaches 100% it gets invalidated, and all IO to it returns an IO error (any 100% CPU usage is then just a consequence of the IO errors). You can only remove an invalidated snapshot; no other action is allowed. This doesn't influence the origin volume: you can still use it. You have to extend the snapshot before it gets invalidated; afterwards it is impossible (some delta data are already lost).
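Since the snapshot has to be extended before it is invalidated, one option is a periodic check on the host that warns while there is still room. The following is only a rough sketch, not a tested tool: the function names and the 80% threshold are made up for illustration, it assumes a Python 3 interpreter on the host, and it assumes that `lvs` accepts the `snap_percent` report field together with `--noheadings` and `--separator` (this should be verified against the installed lvm2 version).

```python
import subprocess


def parse_snapshot_usage(lvs_output):
    """Parse 'vg|lv|snap_percent' lines into {(vg, lv): percent}.

    Volumes that are not snapshots have an empty Snap% column and are skipped.
    """
    usage = {}
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3 or not fields[2]:
            continue  # blank line, malformed line, or not a snapshot
        try:
            usage[(fields[0], fields[1])] = float(fields[2])
        except ValueError:
            continue  # unexpected value in the Snap% column
    return usage


def snapshots_over(usage, threshold=80.0):
    """Return the sorted (vg, lv) pairs whose usage is at or above threshold."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


def read_lvs():
    """Run lvs on the host (needs root); the field names are an assumption."""
    return subprocess.check_output(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "vg_name,lv_name,snap_percent"]
    ).decode()


def main():
    # Hypothetical entry point, e.g. called from a cron job on the KVM host.
    for vg, lv in snapshots_over(parse_snapshot_usage(read_lvs())):
        print("WARNING: snapshot %s/%s is filling up, extend it now" % (vg, lv))
```

Calling `main()` from cron every few minutes would give a warning while lvextend can still help. A snapshot that already shows 100.00 is past that point and can only be removed.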
I "missed" the warnings in the KVM host's system log about the growing snapshot usage of the KVM guest, and I did not know that an invalid snapshot can "only" be removed (I failed to locate any docs on this before opening this bz). We may want to document this for either KVM/virt or LVM, because the error returned from lvchange is rather generic, and ending up with a bad KVM guest is not a good situation. Should a KVM guest shut down before "corrupting" a file system (with a snapshot)?
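Because the lvchange error is generic, the Attr column of lvs may be the quicker way to tell that a snapshot has been invalidated. The sketch below decodes an attribute string such as the "Swi-Io" shown in the report. It relies on my reading of the lvs(8) man page (first character "s" for a snapshot and "S" for an invalid snapshot; fifth character "I" for an invalid snapshot and "S" for an invalid suspended snapshot), so treat that mapping as an assumption to check against the installed version. The function name is hypothetical.

```python
def snapshot_state(lv_attr):
    """Classify an lvs Attr string as 'invalid', 'valid' or 'not-a-snapshot'.

    Assumed layout (from lvs(8)): character 1 is the volume type and
    character 5 is the state.
    """
    if len(lv_attr) < 5:
        raise ValueError("attribute string too short: %r" % lv_attr)
    vol_type, state = lv_attr[0], lv_attr[4]
    if vol_type not in ("s", "S"):
        return "not-a-snapshot"
    if vol_type == "S" or state in ("I", "S"):
        return "invalid"
    return "valid"
```

With the value from this report, `snapshot_state("Swi-Io")` classifies the snapshot as invalid, which matches the explanation that it can only be removed.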