Bug 1378005

Summary: [virtio-win][svvp][ws2016]Job "Clock Interrupt Test " failed when "-rtc base=localtime,clock=host,driftfix=slew"
Product: Red Hat Enterprise Linux 7 Reporter: Yu Wang <wyu>
Component: qemu-kvm-rhevAssignee: Marcelo Tosatti <mtosatti>
Status: CLOSED DEFERRED QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.3CC: chayang, dgilbert, juzhang, knoel, lijin, mtosatti, pbonzini, phou, siliu, virt-maint, wyu, xfu, xiagao
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-07-22 20:30:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
hck_log
none
clock-none-pass-hckx
none
rhel7.6 svvp test package on intel host with q35+seabios
none
hlk_log html file
none
qemu command lines none

Description Yu Wang 2016-09-21 09:42:57 UTC
Created attachment 1203193 [details]
hck_log

Description of problem:
Job "Clock Interrupt Test " failed when "-rtc base=localtime,clock=host,driftfix=slew" on AMD host

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-25.el7.x86_64
kernel-3.10.0-500.el7.x86_64
virtio-win-1.9.0_2.el7
seabios-1.9.1-5.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew
2. run job Clock Interrupt Test 
3.

Actual results:
Failed as "Previous results in this log marked with '*' are out of tolerance. max = 14646, min = 488" and "Median time delta is out of tolerance. max = 7323, min = 3222"

Expected results:
Pass without error

Additional info:
1 it could pass with "driftfix=none"
2 svvp failed log refer to attachment.

Comment 2 Dr. David Alan Gilbert 2016-09-22 10:59:18 UTC
just a guess, but I wonder if the AMD box has the high precision timer on the host?

Comment 3 Vadim Rozenfeld 2016-09-23 02:40:55 UTC
the test fails when trying to set timer resolution to 100 ns and found that "median time delta is out of tolerance"
can you check min_timer_period_us parameter (sys/module/kvm/parameters/min_timer_period_us) and decreasing it to 100 us if it is bigger?

Thanks,
Vadim.

Comment 4 Yu Wang 2016-09-23 03:14:06 UTC
(In reply to Vadim Rozenfeld from comment #3)
> the test fails when trying to set timer resolution to 100 ns and found that
> "median time delta is out of tolerance"
> can you check min_timer_period_us parameter
> (sys/module/kvm/parameters/min_timer_period_us) and decreasing it to 100 us
> if it is bigger?
> 
> Thanks,
> Vadim.

checked the min_timer_period_us, it is 100.

Comment 5 Paolo Bonzini 2016-09-27 10:45:20 UTC
Can you include a log with driftfix=none?

Comment 6 Yu Wang 2016-09-28 02:06:28 UTC
Created attachment 1205379 [details]
clock-none-pass-hckx

clock job w/ driftfix=none

Comment 7 Vadim Rozenfeld 2016-09-28 03:04:27 UTC
(In reply to Yu Wang from comment #6)
> Created attachment 1205379 [details]
> clock-none-pass-hckx
> 
> clock job w/ driftfix=none

So it works fine now. Maybe we should put a mark somewhere not to use
driftfix=slew for SVVP/WHQL testing?

Comment 8 Yu Wang 2016-09-28 07:39:15 UTC
(In reply to Vadim Rozenfeld from comment #7)
> (In reply to Yu Wang from comment #6)
> > Created attachment 1205379 [details]
> > clock-none-pass-hckx
> > 
> > clock job w/ driftfix=none
> 
> So it works fine now. Maybe we should put a mark somewhere not to use
> driftfix=slew for SVVP/WHQL testing?

Hi Vadim,

QE have checked default commandline on rhevm, it included "base=2016-09-28T06:48:19,driftfix=slew" , so QE think this bug should be fixed w/ driftfix=slew.

Thanks
Yu Wang

Comment 10 Marcelo Tosatti 2016-10-05 14:30:42 UTC
(In reply to Yu Wang from comment #0)
> Created attachment 1203193 [details]
> hck_log
> 
> Description of problem:
> Job "Clock Interrupt Test " failed when "-rtc
> base=localtime,clock=host,driftfix=slew" on AMD host
> 
> Version-Release number of selected component (if applicable):
> qemu-kvm-rhev-2.6.0-25.el7.x86_64
> kernel-3.10.0-500.el7.x86_64
> virtio-win-1.9.0_2.el7
> seabios-1.9.1-5.el7.x86_64
> 
> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew
> 2. run job Clock Interrupt Test 
> 3.
> 
> Actual results:
> Failed as "Previous results in this log marked with '*' are out of
> tolerance. max = 14646, min = 488" and "Median time delta is out of
> tolerance. max = 7323, min = 3222"
> 
> Expected results:
> Pass without error
> 
> Additional info:
> 1 it could pass with "driftfix=none"
> 2 svvp failed log refer to attachment.

Yu Wang,

Where can i find instructions to run SVVP ? 

Which Windows version is this ?

Comment 11 Marcelo Tosatti 2016-10-07 20:33:40 UTC
(In reply to Marcelo Tosatti from comment #10)
> (In reply to Yu Wang from comment #0)
> > Created attachment 1203193 [details]
> > hck_log
> > 
> > Description of problem:
> > Job "Clock Interrupt Test " failed when "-rtc
> > base=localtime,clock=host,driftfix=slew" on AMD host
> > 
> > Version-Release number of selected component (if applicable):
> > qemu-kvm-rhev-2.6.0-25.el7.x86_64
> > kernel-3.10.0-500.el7.x86_64
> > virtio-win-1.9.0_2.el7
> > seabios-1.9.1-5.el7.x86_64
> > 
> > How reproducible:
> > 100%
> > 
> > Steps to Reproduce:
> > 1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew
> > 2. run job Clock Interrupt Test 
> > 3.
> > 
> > Actual results:
> > Failed as "Previous results in this log marked with '*' are out of
> > tolerance. max = 14646, min = 488" and "Median time delta is out of
> > tolerance. max = 7323, min = 3222"
> > 
> > Expected results:
> > Pass without error
> > 
> > Additional info:
> > 1 it could pass with "driftfix=none"
> > 2 svvp failed log refer to attachment.
> 
> Yu Wang,
> 
Q1)
> Where can i find instructions to run SVVP ? 
Q2)
> Which Windows version is this ?

Q3)

Is this a regression from RHEL-7.2? 

Q4)

Is it possible to disable driftfix=slew and still have SVVP validated?
that is what are the rules for SVVP validation?

Comment 12 Yu Wang 2016-10-09 02:41:30 UTC
Hi karen,
> > 
> Q1)
> > Where can i find instructions to run SVVP ? 

QE have a guide to setup the svvp env. https://mojo.redhat.com/docs/DOC-1069057 and how to run svvp https://mojo.redhat.com/docs/DOC-1068119, you can find the hlk tool, filter and playlist on https://sysdev.microsoft.com/en-US/Hardware/member/HlkWizard/

> Q2)
> > Which Windows version is this ?
> 
14393.0.160808-1702.RS1_Release_srvmedia_SERVER_OEMRET_X64FRE_EN-US.ISO


> Q3)
> 
> Is this a regression from RHEL-7.2? 
> 
tried on rhel7.2 qemu-kvm-rhev, it hit the same issue. Since ws2016 start at rhel7.3, there is no "Clock Interrupt Test " job on hck and ws2012R2.

> Q4)
> 
> Is it possible to disable driftfix=slew and still have SVVP validated?
> that is what are the rules for SVVP validation?

It is possible to run driftfix=none and get the validated, so it is not a test blocker.

Comment 13 Marcelo Tosatti 2016-12-27 10:17:23 UTC
Moving this bug to RHEL-7.5 while assuming that:

1) Reinjection does indeed cause "abnormal" deltas between timer interrupts
to be caused.

2) However that is not fatal to Windows timekeeping.

---

Having making those two points, have not actually investigated
what test is expecting and what its getting (ENOCAPACITY for RHEL 7.4).

Comment 14 Marcelo Tosatti 2017-03-07 23:21:58 UTC
On physical HW timer interrupts are generated as follows:


        d         d          d
   *----------*----------*----------*
   i          i          i          i


Where i is the interrupt generation event and
"d" is the length of time between consecutive
interrupts.

Vadim mentions that the test fails when d = 100ns.

So it is expected that, on a virtualized machine:


        100ns        100ns
   *----------*...*----------* ...
   i          i          i

   [    window when vcpu is scheduled out     ]

on reinjection:

        100ns
   *--*--*--*--
   i  i  i  i

Several interrupts will be injected when the vcpu
is scheduled in again.

And this is where the error must be coming from.

So its not a bug, but valid and expected behaviour
of the "virtualized HW" exposed to Windows.

Unless this causes misbehaviour of Windows timekeeping
for some reason (which i don't think it does, because
we know Windows timekeeping works properly), then this
should be marked as not a bug.

Anyone has any reason to not close this bug?

Comment 15 Yu Wang 2017-03-28 02:24:38 UTC
Hi,

(In reply to Marcelo Tosatti from comment #14)
> On physical HW timer interrupts are generated as follows:
> 
> 
>         d         d          d
>    *----------*----------*----------*
>    i          i          i          i
> 
> 
> Where i is the interrupt generation event and
> "d" is the length of time between consecutive
> interrupts.
> 
> Vadim mentions that the test fails when d = 100ns.
> 
> So it is expected that, on a virtualized machine:
> 
> 
>         100ns        100ns
>    *----------*...*----------* ...
>    i          i          i
> 
>    [    window when vcpu is scheduled out     ]
> 
> on reinjection:
> 
>         100ns
>    *--*--*--*--
>    i  i  i  i
> 
> Several interrupts will be injected when the vcpu
> is scheduled in again.
> 
> And this is where the error must be coming from.
> 
> So its not a bug, but valid and expected behaviour
> of the "virtualized HW" exposed to Windows.
> 
> Unless this causes misbehaviour of Windows timekeeping
> for some reason (which i don't think it does, because
> we know Windows timekeeping works properly), then this
> should be marked as not a bug.
> 
> Anyone has any reason to not close this bug?

Thanks for your reply, but we did not hit this issue on intel host with the same env.(include host version and guest)

Thank
Yu Wang

Comment 16 Marcelo Tosatti 2017-03-29 16:59:03 UTC
(In reply to Yu Wang from comment #15)
> Hi,
> 
> (In reply to Marcelo Tosatti from comment #14)
> > On physical HW timer interrupts are generated as follows:
> > 
> > 
> >         d         d          d
> >    *----------*----------*----------*
> >    i          i          i          i
> > 
> > 
> > Where i is the interrupt generation event and
> > "d" is the length of time between consecutive
> > interrupts.
> > 
> > Vadim mentions that the test fails when d = 100ns.
> > 
> > So it is expected that, on a virtualized machine:
> > 
> > 
> >         100ns        100ns
> >    *----------*...*----------* ...
> >    i          i          i
> > 
> >    [    window when vcpu is scheduled out     ]
> > 
> > on reinjection:
> > 
> >         100ns
> >    *--*--*--*--
> >    i  i  i  i
> > 
> > Several interrupts will be injected when the vcpu
> > is scheduled in again.
> > 
> > And this is where the error must be coming from.
> > 
> > So its not a bug, but valid and expected behaviour
> > of the "virtualized HW" exposed to Windows.
> > 
> > Unless this causes misbehaviour of Windows timekeeping
> > for some reason (which i don't think it does, because
> > we know Windows timekeeping works properly), then this
> > should be marked as not a bug.
> > 
> > Anyone has any reason to not close this bug?
> 
> Thanks for your reply, but we did not hit this issue on intel host with the
> same env.(include host version and guest)
> 
> Thank
> Yu Wang

Ok, can you give the specs of the AMD and Intel machines where this was tested? 

It could be that Intel hardware is more powerful, in that case there is
no interrupt being missed at d=100ns.

To confirm that (the reason for Intel hardware not to exhibit problem is 
performance executing instructions), could try an Intel host with lower 
performance (lower frequency, smaller cache) than the AMD host where the 
problem is reproducible.

If its not performance, then it could be an issue with AMD KVM support (either interrupt injection or something that affects it).

Comment 17 lijin 2017-04-19 07:47:59 UTC
this job passed with latest HLK filter during rhel7.4 svvp test under q35&ovmf,so I close this bug

Comment 18 Peixiu Hou 2018-07-10 09:40:36 UTC
Hit the same issue on rhel 7.6 host, with q35+seabios, both hit on AMD host and Intel host. And apply the latest HLK filter, cannot filter it pass.

The job "Clock Interrupt Test" failed with follows error message:
===============================================================================
Error 7/10/2018 3:29:18.484 PM Previous results in this log marked with '*' are out of tolerance. max = 14646, min = 488 WexTraceInfo ThreadId=6016 ProcessId=4604 TimeStamp=4702385725 LogSessionId=1  
 
File:   minkernel\hals\tools\npctest\exe\clkint.cpp Line: 1022 
Error Type:    
Error Code:   0x0 
Error Text:   Error 0x00000000 

Error 7/10/2018 3:33:27.257 PM Unable to complete testing due to repeated interference from another component. Please investigate, resolve conflict, and retry. WexTraceInfo ThreadId=6016 ProcessId=4604 TimeStamp=7190161349 LogSessionId=1  
 
File:   minkernel\hals\tools\npctest\exe\clkint.cpp Line: 880 
Error Type:    
Error Code:   0x0 
Error Text:   Error 0x00000000 
================================================================================

Tried to test with follows rtc command on AMD host, all hit this issue.
1. Tried with "-rtc base=localtime,clock=host,driftfix=slew", reproduced this issue.
2. Tried with "-rtc base=localtime,clock=host,driftfix=none", reproduced this issue.
3. Tried with "-rtc base=localtime,driftfix=slew", reproduced this issue.
4. Tried with "-rtc base=localtime,driftfix=none", reproduced this issue.
5. Tried with "-rtc base=localtime,clock=host", tested 3 times, passed 1 times.
6. Tried with "-rtc base=localtime", reproduced this issue.
7. Totally test this job 20 times on AMD host, passed 1 time.

Also Tried same scenarios on Intel host, Totally tested this job 25 times, but all were failed.

Tested with pc+seabios before, on AMD host, run this job 4 times, 1 pass. On Intel host, run this job 15 times, 1 pass.

And checked this job status on RHEL7.5 svvp test, it also report the same error, but it can be filtered pass.

Guest: windows 2016 Datacenter build 14393

Used versions:
kernel-3.10.0-916.el7.x86_64
qemu-kvm-rhev-2.12.0-5.el7.x86_64
seabios-bin-1.11.0-2.el7.noarch
virtio-win-prewhql-156

Due to it has high reproduce rate, re-open it~

Best Regards~
Peixiu

Comment 19 Peixiu Hou 2018-07-10 09:49:49 UTC
Created attachment 1457750 [details]
rhel7.6 svvp test package on intel host with q35+seabios

Comment 20 Dr. David Alan Gilbert 2018-07-10 10:04:09 UTC
Please provide details of the exact hosts used in the test.

Comment 21 Peixiu Hou 2018-07-10 10:30:54 UTC
(In reply to Dr. David Alan Gilbert from comment #20)
> Please provide details of the exact hosts used in the test.

For Intel host:
[root@ibm-x3850x5-11 svvp-intel]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8860  @ 2.27GHz
Stepping:              2
CPU MHz:               2262.000
CPU max MHz:           2262.0000
CPU min MHz:           1064.0000
BogoMIPS:              4522.14
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0-9,40-49
NUMA node1 CPU(s):     10-19,50-59
NUMA node2 CPU(s):     20-29,60-69
NUMA node3 CPU(s):     30-39,70-79
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm epb tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm ida arat spec_ctrl intel_stibp


For AMD host:
[root@hp-dl585g7-03 svvp-amd]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD Opteron(tm) Processor 6380
Stepping:              0
CPU MHz:               2493.816
BogoMIPS:              4987.63
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28
NUMA node1 CPU(s):     32,36,40,44,48,52,56,60
NUMA node2 CPU(s):     1,5,9,13,17,21,25,29
NUMA node3 CPU(s):     33,37,41,45,49,53,57,61
NUMA node4 CPU(s):     2,6,10,14,18,22,26,30
NUMA node5 CPU(s):     34,38,42,46,50,54,58,62
NUMA node6 CPU(s):     35,39,43,47,51,55,59,63
NUMA node7 CPU(s):     3,7,11,15,19,23,27,31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate retpoline_amd vmmcall bmi1 ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold ssbd

Qemu command line for AMD&Intel paste on attachment.


And considering may be you cannot open the attachment .hlkx file, here change it to html type. you can refer to hlk_logs.html file.

And I summited this job on Intel host 5 times continue, 2 times passed.

If you also need other message, please let me know~

Thanks~
Peixiu

Comment 22 Peixiu Hou 2018-07-10 10:33:34 UTC
Created attachment 1457762 [details]
hlk_log html file

Comment 23 Peixiu Hou 2018-07-10 10:35:36 UTC
Created attachment 1457764 [details]
qemu command lines

Comment 24 Peixiu Hou 2018-07-31 02:43:00 UTC
Hit this issue on rhel7.4.z qemu-kvm-1.5.3-141.el7_4.9.x86_64.
Tried run "Clock Interrupt Test" 30 times, only passed 1 time. passed 1/10 with "-rtc base=localtime,clock=host,driftfix=none" setting. 
Also  tested with "-rtc base=localtime,clock=host,driftfix=slew" and "-rtc base=localtime,clock=host" many times, all failed.

Used versions:
virtio-win-1.9.3-1.el7.noarch
qemu-kvm-1.5.3-141.el7_4.9.x86_64
kernel-3.10.0-693.el7.x86_64
seabios-bin-1.10.2-3.el7.noarch
Guest : windows server 2016 GA
HLK version:10.1.14393.4
HLK playlist version:2.1

Comment 27 xiagao 2019-06-25 05:59:54 UTC
Hit this issue on RHEL7.7(AMD host), run 50 times, only passed 1 time with "-rtc base=localtime,clock=host,driftfix=none".
Also tested with "-rtc base=localtime,clock=host,driftfix=slew" many times, all failed.


pkg:
kernel-3.10.0-1057.el7
qemu-kvm-rhev-2.12.0-33.el7
seabios-1.11.0-2.el7
virtio-win-1.9.8-2.el8.iso -viostor
virtio-win-prewhql-0.1-172.iso -netkvm