Bug 1021913

Summary:	Guest auto-releases the freeze but the status is still "frozen", which causes an error when using "guest-fsfreeze-thaw" to thaw it
Product:	Red Hat Enterprise Linux 6	Reporter:	Qunfang Zhang <qzhang>
Component:	virtio-win	Assignee:	Gal Hammer <ghammer>
Status:	CLOSED WONTFIX	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	6.5	CC:	acathrow, areis, bcao, bsarathy, chayang, ghammer, juzhang, michen, rhod, sluo, vrozenfe, yvugenfi
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-10-31 07:48:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	948017

Description Qunfang Zhang 2013-10-22 10:16:08 UTC

Description of problem:
With the latest Windows guest agent installer installed, the "guest-fsfreeze-freeze" command does not work properly: while the guest is frozen, we can still write data to the guest disk. This bug blocks me from verifying bug 948017.

qemu-ga-win version: qemu-ga-win-6.5-3

Version-Release number of selected component (if applicable):
kernel-2.6.32-424.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.414.el6.x86_64
virtio-serial driver: virtio-win-prewhql-72
qemu-ga-win version: qemu-ga-win-6.5-3


How reproducible:
100%

Steps to Reproduce:
1. Boot up a Windows guest.
# /usr/libexec/qemu-kvm -cpu SandyBridge -M rhel6.5.0 -enable-kvm -m 2G -smp 2,sockets=2,cores=1,threads=1 -name rhel6.4-64 -uuid 9a0e67ec-f286-d8e7-0548-0c1c9ec93009 -nodefconfig -nodefaults -monitor stdio -rtc base=utc,clock=host,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 -drive file=/home/win7-32-virtio-qzhang.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d5:51:8a,bus=pci.0,addr=0x3 -device usb-tablet,id=input0 -vnc :10 -vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -qmp tcp:0:5555,server,nowait -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0

2. Install the virtio serial driver and the Windows guest agent installer.

3. Send some commands to guest agent:
  { "execute": "guest-sync-delimited", "arguments": { "id": 123456 } }
�{"return": 123456}

==> It works.

4. Freeze guest filesystem.
   {"execute":"guest-fsfreeze-freeze"}

{"return": 1}


 {"execute":"guest-fsfreeze-status"}
{"return": "frozen"}

5. Write some data inside the guest, e.g. download or copy some files.

6. Thaw guest
 {"execute":"guest-fsfreeze-thaw"}
{"error": {"desc": "couldn't hold writes: fsfreeze is limited up to 10 seconds:  (error: 8004230f)"}}

==> Returns an error the first time.

   {"execute":"guest-fsfreeze-thaw"}
{"return": 0}

==> Succeeds the second time.

Actual results:
1. In step 5, writing data inside the guest is allowed while the guest is frozen.
2. In step 6, thawing the guest reports an error the first time.

Expected results:
Writing data inside the guest should not be allowed while the guest is frozen.
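
The exchange in steps 3 to 6 can also be scripted from the host. The following is only a minimal sketch and was not part of the original test: it assumes the agent is reachable on the /tmp/qga.sock chardev socket from the command line in step 1 and that each reply arrives as one newline-terminated JSON object. The helper names build_command and send_command are hypothetical.

```python
import json
import socket


def build_command(name, arguments=None):
    """Serialize one guest agent command as a single JSON line."""
    command = {"execute": name}
    if arguments is not None:
        command["arguments"] = arguments
    return json.dumps(command) + "\n"


def send_command(sock_path, name, arguments=None, timeout=30.0):
    """Send one command over the agent's UNIX socket and return the decoded reply.

    Assumption: the reply is a single newline-terminated JSON object.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(build_command(name, arguments).encode("utf-8"))
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    # guest-sync-delimited prefixes its reply with a 0xFF byte; drop it if present.
    return json.loads(buf.lstrip(b"\xff").decode("utf-8"))
```

For example, send_command("/tmp/qga.sock", "guest-fsfreeze-freeze") followed by the same call with "guest-fsfreeze-status" and "guest-fsfreeze-thaw" would replay steps 4 and 6.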

Additional info:

Comment 3 Gal Hammer 2013-10-23 08:13:14 UTC

(In reply to Qunfang Zhang from comment #0)

> 4. Freeze guest filesystem.
>    {"execute":"guest-fsfreeze-freeze"}
> 
> {"return": 1}
> 
> 
>  {"execute":"guest-fsfreeze-status"}
> {"return": "frozen"}
> 
> 5. Write some data inside guest, eg, download/copy some files.

I can think of two reasons why this might look like a failure:

1. The write was done more than 10 seconds after the freeze command.
2. Windows kept the write request in memory and didn't flush the data to the disk.

> 6. Thaw guest
>  {"execute":"guest-fsfreeze-thaw"}
> {"error": {"desc": "couldn't hold writes: fsfreeze is limited up to 10
> seconds:  (error: 8004230f)"}}
> 
> ==> Prompt error at the first time.

This error says that Windows auto-released the freeze after 10 seconds, so your request for a thaw is redundant.

>    {"execute":"guest-fsfreeze-thaw"}
> {"return": 0}
> 
> ==> Succeed at the second time.

This is also a failure. The return value is the number of disks that were thawed.
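
One way a client could tell apart the thaw replies seen in this report is sketched below. It is an illustration only: the function name is hypothetical, and matching on the 8004230f code inside the error description is an assumption based on the single message quoted above, not a documented interface.

```python
def classify_thaw_reply(reply):
    """Map a guest-fsfreeze-thaw reply (already decoded from JSON) to a short label."""
    if "error" in reply:
        desc = reply["error"].get("desc", "")
        # Assumption: this code in the description marks the 10-second timeout.
        if "8004230f" in desc:
            return "auto-released"
        return "error"
    # The return value is the number of disks that were thawed.
    thawed = reply.get("return", 0)
    return "thawed" if thawed > 0 else "nothing-to-thaw"
```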

Comment 4 Qunfang Zhang 2013-10-23 10:28:13 UTC

Hi, Gal

I'm running the following script and then freezing the guest. After a while (about 20~30s), I open the "C:\\time.txt" file and find that 9s (sometimes 10s) of logs are missing.

......
Wed Oct 23 18:15:11 2013
Wed Oct 23 18:15:12 2013
Wed Oct 23 18:15:13 2013
Wed Oct 23 18:15:14 2013
Wed Oct 23 18:15:23 2013 <-- The time is not continuous here
Wed Oct 23 18:15:24 2013
Wed Oct 23 18:15:25 2013
Wed Oct 23 18:15:26 2013
Wed Oct 23 18:15:27 2013
.....


# cat test.py 
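import time

# Write one timestamp per second; flush so each line reaches the file immediately.
result = open("C:\\time.txt", "w")

while True:
    # Text mode already translates "\n" to the Windows line ending.
    result.write("%s\n" % time.ctime())
    result.flush()
    time.sleep(1)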
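
The gap in the log can also be located programmatically instead of by eye. A small sketch, assuming the file contains only time.ctime() lines as written by the script above and that the default English day and month names are in use; find_gaps and max_step are hypothetical names.

```python
from datetime import datetime

CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # layout produced by time.ctime()


def find_gaps(lines, max_step=1.5):
    """Return (previous, current, seconds) for consecutive timestamps further apart than max_step."""
    gaps = []
    previous = None
    for line in lines:
        text = line.strip()  # also drops stray "\r" characters
        if not text:
            continue
        current = datetime.strptime(text, CTIME_FORMAT)
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta > max_step:
                gaps.append((previous, current, delta))
        previous = current
    return gaps
```

Inside the guest, find_gaps(open("C:\\time.txt")) would list every interval in which the writer was blocked.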


Gal,
(1) Does that mean the filesystem is frozen for about 10s and then auto-released, as you said? Have you found any clue about the problem, and is it easy to fix?
(2) Could I use this method to verify the VSS bug 948017 once this one is fixed?

Thanks,
Qunfang

Comment 5 Gal Hammer 2013-10-23 10:49:21 UTC

(In reply to Qunfang Zhang from comment #4)
> Hi, Gal
> 
> I'm running the following script and then freeze guest. After a while (about
> 20~30s), open the "C:\\time.txt" file and found there's 9s (sometimes 10s)
> logs missing. 
> 
> ......
> Wed Oct 23 18:15:11 2013
> Wed Oct 23 18:15:12 2013
> Wed Oct 23 18:15:13 2013
> Wed Oct 23 18:15:14 2013
> Wed Oct 23 18:15:23 2013 <-- The time is not continuous here
> Wed Oct 23 18:15:24 2013
> Wed Oct 23 18:15:25 2013
> Wed Oct 23 18:15:26 2013
> Wed Oct 23 18:15:27 2013
> .....
> 
> 
> # cat test.py 
> import time
> 
> result = open("C:\\time.txt", "w")
> 
> while 1:
>     result.write("%s\n\r" % time.ctime())
>     result.flush()
>     time.sleep(1)
> 
> 
> Gal,
> (1) Does that mean the filesystem is frozen for about 10s and then
> auto-release as you said?   Have you find some clue about the problem and is
> it easy to fixed?

I believe that your script proves that the file system was frozen! Congratulations :-).

This is not a problem and will not be fixed. This is the documented behaviour of the VSS system. You only have a grace period of 10 seconds to back up your data without applications trying to write to the disk.

> (2) Could I use this method to verify the VSS bug 948017 when this one gets
> fixed? 

Yes, it looks like a good method to verify it.

> Thanks,
> Qunfang

Comment 6 Qunfang Zhang 2013-10-23 11:20:21 UTC

Gal,

Thanks for the information. Cheers! There are some other things we discussed in this bz that I need to confirm with you:

(1) Even after 10 seconds have passed since the guest was frozen, the "guest-fsfreeze-status" command still returns "frozen". I don't think this result is appropriate. Could we improve it so that it does not confuse users, given that the system auto-releases the freeze after 10s?

(2) Is the following error expected?

{"error": {"desc": "couldn't hold writes: fsfreeze is limited up to 10 seconds:  (error: 8004230f)"}}

(3) And what do you mean by your last sentence in comment 3? Why is this a failure? The guest has actually auto-released the freeze, so it should return 0.

--------------------
>    {"execute":"guest-fsfreeze-thaw"}
> {"return": 0}
> 
> ==> Succeed at the second time.

This is also a failure. The return value is the number of disks that were thawed.
--------------------

Qunfang

Comment 7 Qunfang Zhang 2013-10-25 02:39:35 UTC

Hi, guys

I removed the "Testblocker" keyword because, after the discussion, this does not block us from using the virtagent function and the VSS feature. But it is a bad user experience: after the guest auto-releases the freeze, the status is still "frozen" and an error is returned if the user thaws it via the command. So it would be better to fix it in 6.5. Please correct me if I'm wrong.

Thanks,
Qunfang

Comment 8 Ronen Hod 2013-10-25 07:51:02 UTC

As we said, all of it is expected (other than the wrong return code). The default for the VSS final freeze is 10 seconds, and they do allow infinite time for preparations before the final freeze, so this should be enough.
The lost time ticks are probably due to the disk flush in the script, which has to wait (unlike the write itself).
So we have one bug with the return code, and it seems to be too late for 6.5, so I am OK with deferring it to 6.6 or a Z-stream. Let's wait a few more days.

Comment 9 Ronen Hod 2013-10-25 23:27:56 UTC

Deferring to 6.6. Once this is cleared up we will decide whether to push in a Z-stream.

Comment 10 Gal Hammer 2013-10-31 07:48:44 UTC

This behavior is by design, and the JSON schema file states it: "This (get guest fsfreeze state) may fail to properly report the current state as a result of some other guest processes having issued an fs freeze/thaw".

Comment 11 Qunfang Zhang 2013-11-05 06:44:08 UTC

(In reply to Gal Hammer from comment #10)
> This behavior is by design and the json's schema file state it: "This (get
> guest fsfreeze state) may fail to properly report the current state as a
> result of some other guest processes having issued an fs freeze/thaw".

Yes, I found it now in the JSON schema file. Do we have a plan to fix it in a future release?  Thanks.

Comment 12 Gal Hammer 2013-11-17 07:53:02 UTC

(In reply to Qunfang Zhang from comment #11)
> (In reply to Gal Hammer from comment #10)
> > This behavior is by design and the json's schema file state it: "This (get
> > guest fsfreeze state) may fail to properly report the current state as a
> > result of some other guest processes having issued an fs freeze/thaw".
> 
> Yes, I found it now in the json's schema file. And do we have a plan to fix
> it in future release?  Thanks.

I guess that will be a management call, but I don't think we should try to fix it. Returning the status of a system that can be modified by more than one user is almost always unreliable.
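
Since the reported status can go stale, a caller could keep its own deadline rather than rely on guest-fsfreeze-status. A rough sketch under stated assumptions: the class name and the injectable clock are hypothetical, and the 10-second limit is taken from the error message and the comments above, not queried from the guest.

```python
import time


class FreezeWindow:
    """Client-side record of when a freeze was requested and when to assume it is gone."""

    def __init__(self, limit_seconds=10.0, clock=time.monotonic):
        self.limit_seconds = limit_seconds
        self._clock = clock
        self._frozen_at = None

    def mark_frozen(self):
        """Call right after guest-fsfreeze-freeze returns successfully."""
        self._frozen_at = self._clock()

    def mark_thawed(self):
        """Call after guest-fsfreeze-thaw, whether it succeeded or reported the timeout."""
        self._frozen_at = None

    def remaining(self):
        """Seconds left before the freeze should be assumed auto-released (0.0 if none)."""
        if self._frozen_at is None:
            return 0.0
        elapsed = self._clock() - self._frozen_at
        return max(0.0, self.limit_seconds - elapsed)

    def probably_frozen(self):
        """True while the assumed freeze window has not yet expired."""
        return self.remaining() > 0.0
```

A backup tool could call mark_frozen() after the freeze, do its work while remaining() is positive, and treat a later thaw error as expected once the window has passed.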