Description of problem:
The user moved a host into maintenance mode and its VMs started to migrate to other hosts in the cluster. One of those migrations failed, and the user had to power off the VM to be able to complete the maintenance flow for the host. Upon starting that VM again, it failed, reporting that its qcow disk image had become corrupted. qemu-img info reports a larger size than the original virtual size.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
vdsm-4.19.31-1.el7ev.x86_64

How reproducible:
Uncertain

Steps to Reproduce:
1. Place the host into maintenance mode and observe the migration flows (this is not a reproducer, just a description of what the user was doing at the time we found the corruption messages)

Actual results:
In the user's environment, at least one qcow image became corrupted

Expected results:
The qcow image should not get corrupted

Additional info:
Adding logs as attachments
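For reference, the size mismatch shows up in the 'virtual size' line of qemu-img info. A minimal sketch of the check; the LV path and the output below are illustrative only, not the customer's actual path or output:

  $ qemu-img info /dev/<storage_domain_vg>/<image_lv>
  image: /dev/<storage_domain_vg>/<image_lv>
  file format: qcow2
  virtual size: 500G (...)    <-- expected ~40G for this disk
  cluster_size: 65536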
Lukas, can you take a look at this issue? You handled something similar here:
https://bugzilla.redhat.com/show_bug.cgi?id=1315332
Thanks
Sorry, wrong bug. Clearing needinfo
Kevin, can you please take a look at this issue?
(In reply to Fred Rolland from comment #8)
> Kevin, can you please take a look at this issue?

Kevin is on PTO, asking Max instead.
Hi,

I am not sure if this will help much, but one thing is that the VDSM logs seem to be generally from a different time span than the qemu logs. The time stamps in the qemu logs are in UTC+0, whereas the ones in the VDSM logs are in UTC+3, so some parts are missing, especially from the source VDSM log.

Without knowing much about the image in question, I do not think it can be fixed with qemu-img check. qemu-img check can only really repair reference count issues (e.g. where a cluster in the qcow2 image is referenced but not marked as such, or vice versa), but the corruption message implies that the L1 table has been overwritten with data. (The 64-bit value mentioned in this BZ's title clearly looks like ASCII data.) Since the L1 table is the root structure for locating data stored on the image, I think it is very unlikely that the image data can be recovered.

A more or less unrelated issue is that qemu crashes due to this corruption (a failed assertion). That should not happen, but of course it is only a result of the corruption. Without a backtrace, I cannot tell where it comes from; I suspect the qcow2 driver tries to set the "corrupt" flag while the image is still inactive due to the migration and may thus not be written to, but I am not sure.

As for why the corruption occurred in the first place, I cannot tell either. The most common issue we have had in the past was concurrent use of an image (e.g. using an image in qemu and taking internal snapshots with qemu-img at the same time). As of qemu 2.10.0 we have file locking in place to prevent such concurrent use, but considering the corruption occurred with 2.9.0, such a conflicting access might have been the issue -- although that is rather unlikely if the image has been managed through oVirt alone.

In any case, given that the logs just show the image is already corrupted, I can only guess wildly at how the corruption happened.

Max
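For completeness, a sketch of what qemu-img check can and cannot do here (the LV path is hypothetical, and this should only be run while no qemu process has the image open):

  # Report-only pass:
  $ qemu-img check /dev/<storage_domain_vg>/<image_lv>
  # The repair modes only address refcount-level problems:
  $ qemu-img check -r leaks /dev/<storage_domain_vg>/<image_lv>   # cluster leaks only
  $ qemu-img check -r all   /dev/<storage_domain_vg>/<image_lv>   # leaks + refcount errors
  # Neither mode can rebuild an L1 table that has been overwritten with data.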
Another note: it is actually very interesting to see the L1 table having been overwritten with data, because we have before-write metadata overlap checks. These check, by default, that no data is written into certain metadata structures, including the L1 table. These checks are not disabled on the command lines present in the logs.

One way to bypass these checks is if internal snapshots are used, because by default the checks do not cover the snapshots' metadata structures (each snapshot has its own L1 table). So if you take a snapshot, you can overwrite its L1 table without the overlap checks complaining (note, though, that in theory there is no way to do this), and when you revert to it, you get a corrupted L1 table. But as far as I understand, reverting will already result in the corruption message that is this BZ's title, so even that is rather unlikely.

The only other way I can think of to bypass these checks is if qemu is not aware that a certain area actually contains metadata and cluelessly writes data to it, and I would say this can only happen if two parties have opened the image at the same time (with one having moved the L1 table to an area the other one wants to write data to).

So my best guess remains that the image had been opened in two qemu processes concurrently (e.g. qemu and qemu-img), and this led to the corruption. Otherwise, the metadata overlap checks should have prevented the L1 table corruption, marked the image corrupt, and nothing really bad should have happened.

Max
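For context, the overlap checks mentioned above are controlled by a qcow2 runtime option (overlap-check); leaving it at its default keeps the checks that cover the active L1 table enabled. A hedged illustration of what disabling them might look like in a -drive option string -- this is not taken from the customer's command lines, which leave the default in place:

  # Default behaviour (overlap checks active):
  -drive file=/dev/<storage_domain_vg>/<image_lv>,format=qcow2,cache=none
  # Explicitly disabling the checks would require something like:
  -drive file=/dev/<storage_domain_vg>/<image_lv>,format=qcow2,cache=none,overlap-check=none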
RHV doesn't use internal snapshots.
Hi Max,

Thanks for the comments. If I understood correctly, with the current logs we don't have much to go on as to why the corruption happened.

The vdsm log from the source host goes from 8:01 until 11:01 on 5/23 UTC+3. Do you need the one that goes over that time frame? I doubt it will give us more ideas as to why the corruption happened, but I can add it here too.

One thing we wanted to check, though, was at what moment and why the qcow image reported a 500 GB virtual size instead of the initial virtual size of 40 GB, but I would think that this is also a result of the corruption. Am I right? Like mentioned before, the LV reports 28 GB of size, which matches 'actual_size' from the DB.

Also, I know that since the image was managed through RHV this shouldn't be the case, but we will recommend users to go to qemu-kvm-rhev 2.10.0 to be able to use the file locking mechanism and avoid concurrent usage of the image.
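For reference, this is roughly what the 2.10.0 file locking buys us. A sketch only, with hypothetical paths; the exact error wording may differ between versions:

  # With the VM running and holding a write lock on the image, a second
  # writer (e.g. an accidental qemu-img resize) is refused:
  $ qemu-img resize /dev/<storage_domain_vg>/<image_lv> 500G
  qemu-img: Failed to get "write" lock
  Is another process using the image?

  # Read-only inspection of an in-use image is still possible where the
  # --force-share/-U option is available:
  $ qemu-img info -U /dev/<storage_domain_vg>/<image_lv>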
(In reply to Yaniv Kaul from comment #12)
> RHV doesn't use internal snapshots.

And if it did, I would assume it would take them over QMP and not through a separate process.

(In reply to Javier Coscia from comment #13)
> Thanks for the comments, if I understood correctly, with the current logs,
> we don't have much to share as to why the corruption happened.

Yes, and I'm not sure there are any logs that could help us determine the original cause. From my limited experience with corrupted images, usually only looking at the image itself helps.

(Well, the logs already told us that overlap checks were not disabled, so they do give us the hint that it is unlikely that qemu itself caused the corruption. Maybe the logs give other hints, too, but as I only know qemu, I cannot speak for the rest.)

> vdsm from source goes from 8:01 until 11:01 5/23 UTC+3, do you need the one
> that goes over that time frame ?

As far as I've seen, on the destination, the first corruption event occurred at 7:51:53 UTC+0, and the last message was printed at 11:46:45 UTC+0. But most of that was covered by the logs, so I don't think we need any more there.

On the source side, however, the time span is from 11:42:55 UTC+0 to 11:47:32 UTC+0, so 14:42:55 UTC+3 to 14:47:32 UTC+3. I don't think I can pull out much more information than I have done so far, but that would be the affected range.

> I doubt that will give us more ideas as to why the corruption happened, but
> I can add it here too.

I don't have much hope either, but maybe it helps someone who has more experience with interpreting these logs.

> One thing we wanted to check though, was at what moment and why the qcow
> image reported 500gb size (virtual size) instead the initial virtual size of
> 40g, but I would think that this is also a result of the corruption. Am I
> right ?

Hm. I don't want to jump to conclusions, so I'll try to give you my reasoning as well as possible. But note that this is really just a hunch; it is one possibility that nearly fits what I'm seeing.

First of all, the virtual size doesn't just change like that. It is a value in the image header that is not simply overwritten. If it were, it would mean the image header has been overwritten, but obviously not enough so that the image would cease to be a valid qcow2 file. So that is very strange, and I've never seen such a case, though it's not impossible.

But as I've written before, the usual way to overwrite the L1 table with garbage is if the qemu process (writing data) assumes the L1 table to be somewhere else. Besides snapshotting, there is only a single operation that moves the L1 table around, and that is resizing. When you grow the image, the L1 table may need to be reallocated, i.e. moved somewhere else. So when one process (A) grows the image while it is in use by another process (B), A may move the L1 table without B noticing. Then B allocates some new data block exactly where A placed the table, and thus overwrites the L1 table, producing a corruption like we see here.

So it looks a bit like the image was resized to 500 GB while it was still in use by a qemu process, and that latter process then accidentally corrupted the image.

(Note that for a 500 GB image with the default cluster size (64 kB), the L1 table would be ~8 kB in size, i.e. still less than a cluster, so in theory it would not need to be reallocated -- but in practice, it is (which may or may not make sense, but that is how it is).)

So my best guess is that the image was resized (e.g. with qemu-img) while it was in use by a VM. It is not a strong guess at all, just a hunch, though it would explain all of the symptoms.

> Like mentioned before, the LV reports 28Gb of size which matches
> 'actual_size' from the DB
>
> Also, I know that since the image was managed through RHV this shoulnd't be
> case, but we will recommend users to go to qemu-kvm-rhev 2.10.0 to be able
> to use the file locking mechanism to avoid concurrent usage of the image.

Well, using the most up-to-date software is always a good idea. :-)

Max
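To spell out the L1 table size estimate from the note above (assuming the default 64 kB cluster size and 8-byte L1 entries):

  one L1 entry -> one L2 table -> (65536 / 8) * 65536 bytes = 512 MB of guest data
  500 GB / 512 MB                = 1000 L1 entries
  1000 entries * 8 bytes/entry   = 8000 bytes, i.e. ~8 kB (well under one 64 kB cluster)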
Thanks again, Max. I'll share your thoughts/hypothesis and will ask the customer whether they are aware of anything along the lines of running qemu-img resize on the image while the VM was up and running. This would obviously not show up in the logs if they resized the image manually.

I'll also share the information about file locking in 2.10.0, so they are aware that it will help avoid this kind of corruption in the future.

Will let you guys know the responses from the user.
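For reference, on 2.9.0 (no image locking) the hypothesized sequence would have gone through silently. A sketch only, with hypothetical paths -- do not run this against an image that is in use:

  # While the VM still has the image open:
  $ qemu-img resize /dev/<storage_domain_vg>/<image_lv> 500G
  Image resized.
  # The resize may reallocate the L1 table; the running qemu process is
  # unaware of the move and can later allocate guest data clusters on top
  # of the new L1 table, producing exactly this kind of corruption.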
Thanks Javier,

For now I'm closing the bug as insufficient data, as we cannot determine the actual cause. Please reopen if you have any new information.