Bug 1879396

Summary: red hat virtio scsi disk device 6.3.9600.18758
Product: [Community] Virtualization Tools
Component: virtio-win
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Version: unspecified
Hardware: x86_64
OS: Windows
Reporter: Evgen Puzanov <e.puzanov>
Assignee: Vadim Rozenfeld <vrozenfe>
QA Contact: menli <menli>
CC: ghammer, haoliu, jinzhao, juzhang, lijin, mdean, virt-maint, vladimir, vrozenfe, yvugenfi
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2020-12-09 10:48:58 UTC

Attachments: log file

Description Evgen Puzanov 2020-09-16 07:59:23 UTC
Hello,
 
We have a cloud environment that consists of hosts backed by KVM hypervisors. About half of the virtual machines run Windows Server operating systems (2008R2, 2012R2, 2016, 2019); there are hundreds of such instances, and almost all of them use VirtIO drivers (mostly 0.1.160).
 
Sometimes (it has occurred about 3-4 times) we encounter the following glitch: a guest operating system decides that its primary disk is larger than it actually is. For example, an instance had a virtual drive of 200 GB and had worked fine for years, but at some moment (no one knows exactly when) the primary partition (we mean drive C:, which is usually the 2nd one, as the 1st one is used by the operating system) became 210 GB just out of the blue. After that, the system event log started filling up with the following error message: `The driver detected a controller error on \Device\Harddisk0\DR`. Obviously, this happens when the operating system tries to write data to sectors that don't exist.
 
Once we expand this virtual drive to 210 GB, the error messages stop appearing. Still, after that we find some of the data corrupted (perhaps some fragments of files were stored in the non-existent sectors), so it is a real problem for us when it happens.
 
Alas, we haven't found a way to reproduce it. As we stated before, it has happened only 3-4 times, though each time the consequences were quite unpleasant.
 
Should we provide more data regarding this issue? Should we consider upgrading the driver? Perhaps we simply don't know that this is a bug that was fixed after the 0.1.160 release? Just curious: has anyone sent a similar bug report before? We tried to find one, though with no luck.
 
Thanks in advance for your feedback.

Comment 1 Vadim Rozenfeld 2020-09-16 09:22:06 UTC
Hi Evgen,

Thank you for reporting the problem.

Does it happen on one specific Windows version or on all of them?
Can you upload a couple of system log files from different VMs for
further investigation? It would also be quite useful to see the qemu
command line and to know the qemu version.

There were not too many critical changes in the vioscsi code between build 160
and the most recent officially released build 184, but updating the drivers to
the latest version is always a good idea.

Best,
Vadim.

Comment 2 Evgen Puzanov 2020-09-16 11:15:28 UTC
Vadim, at the moment we see that the latest driver version is 0.1.185, not 0.1.184. Should we upgrade to 0.1.185, or did you mean 0.1.184?

Comment 3 Evgen Puzanov 2020-09-16 11:39:32 UTC
Vadim, the logs of the operating system itself contain only entries like: The driver detected a controller error on \Device\Harddisk0\DR. Unfortunately, we learned about the problem too late, and the system log entries had already been overwritten by then. As for the libvirt logs, how should we send you the file?

Comment 4 Vadim Rozenfeld 2020-09-16 11:56:42 UTC
(In reply to Evgen Puzanov from comment #2)
> Vadim, at the moment we see that the current driver version is 0.1.185, not
> 0.1.184. Should we upgrade to 0.1.185 or is it 0.1.184?

My bad. 185 is what you need.

Comment 5 Vadim Rozenfeld 2020-09-16 11:58:21 UTC
(In reply to Evgen Puzanov from comment #3)
> Vadim, the logs of the operating system itself contain only entries like:
> The driver detected a controller error on \Device\Harddisk0\DR.
> Unfortunately, we learned about the problem too late, and the system log
> entries had already been overwritten by then. As for the libvirt logs, how
> should we send you the file?

You can add it to this bug as an attachment.

Thanks,
Vadim.

Comment 6 Evgen Puzanov 2020-09-16 12:04:51 UTC
Created attachment 1715069 [details]
log file

Comment 7 Evgen Puzanov 2020-09-16 12:05:38 UTC
(In reply to Vadim Rozenfeld from comment #5)
> (In reply to Evgen Puzanov from comment #3)
> > Vadim, the logs of the operating system itself contain only entries like:
> > The driver detected a controller error on \Device\Harddisk0\DR.
> > Unfortunately, we learned about the problem too late, and the system log
> > entries had already been overwritten by then. As for the libvirt logs, how
> > should we send you the file?
> 
> You can add it to this bug as an attachment.
> 
> Thanks,
> Vadim.

Added, please check.

Comment 8 Vadim Rozenfeld 2020-09-17 08:39:55 UTC
Any particular reason for using the rhel6.6.0 machine type? It is extremely old and missing a lot of new features.
I see that you are using a virtio-blk-pci device, so yes, please update the viostor driver.

Comment 9 Evgen Puzanov 2020-09-18 14:15:12 UTC
Hello
virtio-blk-pci is what we use to emulate the CD-ROM, but the hard disk is emulated as virtio-serial-pci.

Comment 10 Evgen Puzanov 2020-09-18 14:16:24 UTC
And the problem we had was with virtio-serial-pci.

Comment 11 Vadim Rozenfeld 2020-09-19 01:10:23 UTC
Sorry, that cannot be true.

virtio-serial-pci is a virtio device designed to establish bi-directional communication channels between 
the host and the guest. It is not a storage device and cannot be placed into the Windows storage stack.

Consider the following devices, taken from the log file you provided as the attachment to this bug (https://bugzilla.redhat.com/attachment.cgi?id=1715069):

1. This is a qcow2 image attached to a virtio-blk-pci device. I guess it is the system disk:

-drive file=/var/lib/libvirt/images/cb668a18-ec37-44d6-a8ef-ebef647afa3f,if=none,id=drive-virtio-disk0,format=qcow2,serial=cb668a18ec3744d6a8ef,cache=none 
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 

2. This is an iso image attached to the emulated IDE controller:
-drive file=/mnt/743a2a5c-d1a6-3c0b-8805-d98548a166ad/453-2-f3a4d3ba-3b91-3480-a2ac-9a39f6d8994c.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw 
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 

3. While this one, according to its name, is the qemu guest agent communication channel, created by virtio-serial-pci:
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4  -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0

Cheers,
Vadim.

Comment 12 Volodymyr Melnyk 2020-09-21 21:12:53 UTC
Hello,

I'm sorry, the fault was mine; I misled my colleague Evgen.

Our orchestration system (Apache CloudStack) uses virtio-blk-device when it starts a VM.

Should we update the driver? Was there a bug that could have led to this issue with virtio-blk-device?

Thanks.

Comment 13 Vadim Rozenfeld 2020-09-22 00:42:05 UTC
(In reply to Volodymyr Melnyk from comment #12)
> Hello,
> 
> I'm sorry, the fault was mine; I misled my colleague Evgen.
> 
> Our orchestration system (Apache CloudStack) uses virtio-blk-device when it
> starts a VM.
> 
> Should we update the driver? Was there a bug that could have led to this
> issue with virtio-blk-device?
> 
> Thanks.

Hi Volodymyr,

Never mind, I just needed to see the virtual machine configuration settings.
In any case, it is always a good idea to update the virtio-win drivers to the
most recent ones. Apart from that, according to the log file that Evgen shared
with us, it looks like you are still using the rhel6.6.0 machine type. May I ask
which qemu version that is? It might be important, since when it comes to reporting
the volume size, the driver itself fully depends on the information provided by
qemu.

Best regards,
Vadim.

Comment 14 Volodymyr Melnyk 2020-09-22 06:43:11 UTC
Hello,

There are 6 quite old hosts in our cloud; they're running qemu-kvm 0.12.1 and CentOS 6.

Of course, it might be a qemu-related issue, but there are 2 aspects that made us consider this issue driver-related:
1. There were at least 3 occurrences during the past couple of years; all the guests were running Windows Server (2008R2 and 2012R2), but it never happened to Linux guests.
2. Even if qemu-kvm reported a wrong disk size, the guest operating system could just leave the extra space unused, but in our cases the guest operating systems "thought" that the partition size was also bigger than it should be.

Taking all of the above into account, what do you think: is it more likely to be caused by the guest's driver than by the host's virtualization software?

Thanks.

Comment 15 Vadim Rozenfeld 2020-09-22 07:55:29 UTC
(In reply to Volodymyr Melnyk from comment #14)
> Hello,
> 
> There are 6 quite old hosts in our cloud; they're running qemu-kvm 0.12.1
> and CentOS 6.
> 
> Of course, it might be a qemu-related issue, but there are 2 aspects that
> made us consider this issue driver-related:
> 1. There were at least 3 occurrences during the past couple of years; all
> the guests were running Windows Server (2008R2 and 2012R2), but it never
> happened to Linux guests.
> 2. Even if qemu-kvm reported a wrong disk size, the guest operating system
> could just leave the extra space unused, but in our cases the guest
> operating systems "thought" that the partition size was also bigger than it
> should be.
> 
> Taking all of the above into account, what do you think: is it more likely
> to be caused by the guest's driver than by the host's virtualization
> software?
> 
> Thanks.

Thanks.

Can you give me the exact qemu-kvm package name installed on the CentOS system(s)
where the problem happens? Another question, just to confirm: does the problem
happen on 2008R2 and 2012R2 systems only?

We definitely need to know how to reproduce the problem; maybe you noticed a
common pattern of events before the problem happened.

As I said before, the driver itself is quite passive in determining the size of
the volume it is attached to. Basically, it reads the size of the volume on every
boot (or driver load, to be more precise) and passes it up on request. Theoretically,
if there were some "glitch" and the driver reported a wrong number of blocks, there
is a good chance that it would recover on the next load. (Honestly, I've never seen
such a problem in my life.) The driver cannot change the volume size (the qcow2 file)
by itself, with only one exception, when it issues "TRIM" requests, but that is a
different story.
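
For illustration, this is roughly where the size comes from on the guest side. It is a minimal sketch following the virtio specification (a virtio-blk device advertises its capacity, as a count of 512-byte sectors, at the start of its configuration space), not the actual viostor source; read_config_le64() is a hypothetical helper standing in for the transport-specific config-space accessor:

#include <stdint.h>

#define VIRTIO_BLK_CFG_CAPACITY 0   /* offset of the le64 capacity field */
#define SECTOR_SIZE             512

/* Hypothetical accessor for the device's virtio configuration space. */
extern uint64_t read_config_le64(void *cfg_space, unsigned offset);

/* On driver load, the guest driver simply reads whatever sector count
 * the host (qemu) advertises; it never computes the size itself. */
uint64_t virtio_blk_capacity_bytes(void *cfg_space)
{
    uint64_t sectors = read_config_le64(cfg_space, VIRTIO_BLK_CFG_CAPACITY);
    return sectors * SECTOR_SIZE;
}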

Next time the problem happens, I would suggest reading the volume size at
run time by checking the disk geometry with "info qtree" and then checking the
qcow2 file size with "qemu-img info".
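
For example (using the image path from the attached log; compare the size reported here with what the guest sees):

(qemu) info qtree
(dump the device tree in the qemu monitor and look at the geometry of the virtio-blk disk)

qemu-img info /var/lib/libvirt/images/cb668a18-ec37-44d6-a8ef-ebef647afa3f
(run on the host; check the reported virtual size against the 200 GB the drive should have)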

Best,
Vadim.

Comment 16 Vadim Rozenfeld 2020-12-09 10:48:58 UTC
Reproduced, fixed, and verified downstream:
https://bugzilla.redhat.com/show_bug.cgi?id=1890810