Bug 1909073 - Filesystem freeze on Windows reports errors frequently [NEEDINFO]
Summary: Filesystem freeze on Windows reports errors frequently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: virtio-win
Version: 8.3
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: 8.0
Assignee: Virtualization Maintenance
QA Contact: dehanmeng
URL:
Whiteboard:
Depends On: 1915198
Blocks: 1985906 1934492
TreeView+ depends on / blocked
 
Reported: 2020-12-18 09:55 UTC by Roman Hodain
Modified: 2021-07-26 08:30 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1934492 (view as bug list)
Environment:
Last Closed: 2021-05-18 16:25:58 UTC
Type: Bug
Target Upstream Version:
ailan: needinfo? (shipatil)


Attachments (Terms of Use)
VSS_error.png (333.92 KB, image/png)
2021-03-03 01:49 UTC, dehanmeng
no flags Details

Description Roman Hodain 2020-12-18 09:55:32 UTC
Description of problem:
The following errors are observed when using FS freeze on Win system:

1) From qemu monitor
    unable to execute QEMU agent command 'guest-fsfreeze-freeze': timeout when try to receive Frozen event from VSS provider: Unspecified error

2) From VM event log

    Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface.  hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process. 


Version-Release number of selected component (if applicable):
    qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64

How reproducible:
    1) Time to time
    2) 100%

Comment 2 Roman Hodain 2020-12-18 10:17:49 UTC
@1 - It seems to be caused the IOs on the guest system, but it is not clear to me how much time we should wait.
The requester seems to wait just 10 seconds for the results and I am not sure if that is the right timeout. There is also 60 seconds timeout for the workers to prepare the data for the VSS snapshot. So I believe that we may fail earlier than the workers actually finish the preparation. We wait for the operation to finish in [1]. First, we do the loop just for VSS_TIMEOUT_FREEZE_MSEC/VSS_TIMEOUT_EVENT_MSEC which is the 10s. We may need 60 sec there.  Also, I am not sure why we do the loop and simply do not wait for an event from the operation. 

@2 - I did some investigation and the error seems to be initiated by the Network Service that receives the request for preparing data from the system worker. I can be mitigated by adding system permissions to "NETWORK SERVICE" account. Is there a way to filter out the problematic services?

[1]: https://github.com/qemu/qemu/blob/master/qga/vss-win32/requester.cpp#L440

Comment 4 dehanmeng 2020-12-21 09:49:12 UTC
(In reply to Roman Hodain from comment #0)
> Description of problem:
> The following errors are observed when using FS freeze on Win system:
> 
> 1) From qemu monitor
>     unable to execute QEMU agent command 'guest-fsfreeze-freeze': timeout
> when try to receive Frozen event from VSS provider: Unspecified error
> 
> 2) From VM event log
> 
>     Volume Shadow Copy Service error: Unexpected error querying for the
> IVssWriterCallback interface.  hr = 0x80070005, Access is denied. This is
> often caused by incorrect security settings in either the writer or
> requestor process. 
> 
Hi Roman,
Are there no other settings and operations? actually, I couldn't reproduce it with target version of qemu-kvm on windows guest. qga command "guest-fsfreeze-freeze" executed successfully without any wrong event return or timeout situation. so what about any settings for VSS in guest? can you provide more details to help to reproduce this issue? thanks in advance.

Comment 5 Yvugenfi@redhat.com 2020-12-21 14:38:48 UTC
I think this is a duplicate of BZ#1444702

Comment #2 in the original bug has an explanation for the root cause: https://bugzilla.redhat.com/show_bug.cgi?id=1444702#c2

Comment 6 Basil Salman 2020-12-21 15:23:25 UTC
According to BZ#1444702 it was reproduced on Windows 2012R2 only.

Which OS is used in the guest system?

Comment 7 Yvugenfi@redhat.com 2020-12-28 10:27:31 UTC
Based on comment #2 in BZ#1444702 and internal discussion, closing won't fix

Comment 8 Roman Hodain 2020-12-30 09:48:07 UTC
I am reopening this Bugzilla as the permissionis just part of the BZ (my bad. I should have open two BZs).

The point 1) still remains. The sytem time to time fails on timeout. I do not thisg that there are nay special settings it is rather cauased by the load on the systems and the workers runnig there. So it will not be easy to reproduce. The mor importatn is the code handlig the timeout. Is that code correct? I do not think it is, but I need somebody who can check the code. I do not have experience inthis area. 

    https://github.com/qemu/qemu/blob/master/qga/vss-win32/requester.cpp#L440

Comment 10 dehanmeng 2021-01-08 01:33:41 UTC
Tried win10/win2019/win2012 guest with qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64 on host and mingw-qemu-ga-win-7.6.2-2.el7ev on guest(asked this info via mail before) still cannot reproduce this issue. will keep reproducing it.

Comment 11 dehanmeng 2021-01-10 13:44:26 UTC
About the "2) From VM event log
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface.xxxxx.... " from comment0, it was reproduced with qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64 on host and mingw-qemu-ga-win-7.6.2-2.el7ev on guest. and The return time of the return value is slightly longer than other normal windows with current qemu-ga-win pkg. about 1 or 2 seconds, and didn't check out vss error from guest.

Comment 13 Basil Salman 2021-01-17 13:24:17 UTC
(In reply to Roman Hodain from comment #2)
> @1 - It seems to be caused the IOs on the guest system, but it is not clear
> to me how much time we should wait.
> The requester seems to wait just 10 seconds for the results and I am not
> sure if that is the right timeout. There is also 60 seconds timeout for the
> workers to prepare the data for the VSS snapshot. So I believe that we may
> fail earlier than the workers actually finish the preparation. We wait for
> the operation to finish in [1]. First, we do the loop just for
> VSS_TIMEOUT_FREEZE_MSEC/VSS_TIMEOUT_EVENT_MSEC which is the 10s. We may need
> 60 sec there.  Also, I am not sure why we do the loop and simply do not wait
> for an event from the operation. 
> [1]:
> https://github.com/qemu/qemu/blob/master/qga/vss-win32/requester.cpp#L440

I think you are right about the timeout, 10 seconds is not enough time for timeout
as it might take more.
According to [1], the default timeout for Writer Freeze is 60 seconds, I created a test build with the
timeout increased to 60 secs [2].
As for the loop, it is needed as DoSnapshotSet might fail before the provider sends 
an event that the freeze has started.

[1]: https://docs.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
[2]: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34275685

Comment 19 Michal Skrivanek 2021-02-22 12:00:53 UTC
Lukas, can we run regression test too? From the RHV flow perspective this should be safe with LiveSnapshotPerformFreezeInEngine option enabled, I'm a bit less sure when it's disabled(which is the default)

Comment 21 Basil Salman 2021-02-24 22:39:52 UTC
Scratch build with changes:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=35146591

Comment 23 dehanmeng 2021-03-03 01:49:36 UTC
Created attachment 1760313 [details]
VSS_error.png

Hit the error mentioned as description "(2)Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface.  hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process." when test the scratch build on win2019 guest. but the first issue I didn't hit the first issue with 100 attempts.
test whole lopp in progress, testing it again for safety.
packages:
qemu-kvm-5.2.0-8.module+el8.4.0+10093+e085f1eb.x86_64
kernel-4.18.0-291.el8.x86_64
qemu-ga-win-102.0.0-1.el8.noarch.rpm
seabios-1.14.0-1.module+el8.4.0+8855+a9e237a9.x86_64
RHEL-8.4.0-20210223.n.0

Comment 28 dehanmeng 2021-03-08 01:03:40 UTC
okay, thanks for your effort Basil. I'll run a test-loop for testing the whole function on win10 guest. will udpate any issue and if no further problems are found, I'll come here to verify it.

Comment 29 dehanmeng 2021-03-08 09:47:50 UTC
Verified with the qemu-ga-win-102.0.0-1.el8.noarch.rpm on guest win10/win2012R2 
step to reproduce:
1. boot up guest.
2. install the correct qemu-ga-win rpm.
3. try to run command "guest-fsfreeze-freeze" and "guest-fsfreeze-thaw" in a loop.
[root@dell-per440-03 test-results]# cat <(for i in {1..100}; do echo '{"execute":"guest-fsfreeze-freeze"}'; sleep 2; echo '{"execute":"guest-fsfreeze-thaw"}'; sleep 2; done)  | nc -U /tmp/avocado_adfijf889/serial-org.qemu.guest_agent.0-20210105-032815-BLpNoZnG > /home/test/verify_bz_rebase/log
4. check whether there are any error like comment0.

Actual result:
all {"return": {$i}}.
Expected result:
output should be same as above.

Comment 30 dehanmeng 2021-03-12 02:01:15 UTC
Hi all, 
The newest qemu-ga-win "the qemu-ga-win version is mingw-qemu-ga-win-102.0.0-2.el8" came out and it has been passed from my side.  the whole test loop and new cases got passed as well.  No further errors and regression issues were found now. Thanks everyone for the time and effort.

Cheers
Dehan

Comment 36 Roman Hodain 2021-04-13 11:26:14 UTC
Hi,

the timeout seems to be implemented, but some systems still fail to freeze properly. The error message is not clear.

From the host we see the initiation of the frrezefs call:
2021-04-13 01:00:14,913+0200 INFO  (jsonrpc/6) [virt.vm] (vmId='ae0906d5-c335-47b8-b122-51f471f75019') Freezing guest filesystems (vm:4134)

The guest agent logs:

1618268414.713998: info: guest-fsfreeze called   1618268414.713998 => Tue Apr 13 01:00:14 AM CEST 2021
...
1618268414.713998: warning: disabling logging due to filesystem freeze

The call from the host times out on the vdsm side as the timeout is set to 60s

2021-04-13 01:01:14,914+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/6 running <Task <JsonRpcTask {'jsonrpc': '2.0', 'method': 'VM.freeze', 'params': {'vmID': 'ae0906d5-c335-47b8-b122-51f471f75019'}, 'id': 'decefb1d-b22c-4014-8e58-252ef09d8caf'} at 0x7f6498080be0> timeout=60, duration=60.00 at 0x7f64980801d0> task#=132600 at 0x7f649849fcc0>, traceback:
...

The thread jsonrpc/6 returns back with error:

2021-04-13 01:01:22,840+0200 WARN  (jsonrpc/6) [virt.vm] (vmId='ae0906d5-c335-47b8-b122-51f471f75019') Unable to freeze guest filesystems: internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': timeout when try to receive Frozen event from VSS provider: Unspecified error (vm:4143)

6 sec later qemu guest agent reports that it enables the logging again

1618268488.276739: warning: logging re-enabled due to filesystem unfreeze 1618268488.276739 => Tue Apr 13 01:01:28 AM CEST 2021

vdsm calls thawfs

2021-04-13 01:01:47,728+0200 INFO  (jsonrpc/4) [virt.vm] (vmId='ae0906d5-c335-47b8-b122-51f471f75019') Thawing guest filesystems (vm:4161)
2021-04-13 01:01:47,730+0200 INFO  (jsonrpc/4) [virt.vm] (vmId='ae0906d5-c335-47b8-b122-51f471f75019') 0 guest filesystems thawed (vm:4180)

But the agent reports the call 18 sec later. 

1618268525.711117: debug: read data, count: 34, data: {"execute":"guest-fsfreeze-thaw"} => Tue Apr 13 01:02:05 AM CEST 2021


1) timeout when try to receive Frozen event from VSS provider: Unspecified error
I assume that this is due to the wait for the vss snapshot to be completed, but I do not see the error in the Win event logs.
2) The "Unspecified error" Should be replaced with something more understandable.
3) I really do not know why there is so big time difference in the thaw call between the hypervisor and the agent.


The agent was started manually with this command:
    "c:\Program Files\Qemu-ga\qemu-ga.exe"  -v -l c:\agent.log

Is there anything we can do to investigate further the reason for the failure?

Comment 45 errata-xmlrpc 2021-05-18 16:25:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virtio-win bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1959


Note You need to log in before you can comment on or make changes to this bug.