| Summary: | [virtio-win][svvp][ws2016]Job "Clock Interrupt Test " failed when "-rtc base=localtime,clock=host,driftfix=slew" | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Yu Wang <wyu> | ||||||||||||
| Component: | qemu-kvm-rhev | Assignee: | Marcelo Tosatti <mtosatti> | ||||||||||||
| Status: | CLOSED DEFERRED | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||||||||
| Severity: | high | Docs Contact: | |||||||||||||
| Priority: | high | ||||||||||||||
| Version: | 7.3 | CC: | chayang, dgilbert, juzhang, knoel, lijin, mtosatti, pbonzini, phou, siliu, virt-maint, wyu, xfu, xiagao | ||||||||||||
| Target Milestone: | rc | Keywords: | Reopened | ||||||||||||
| Target Release: | --- | ||||||||||||||
| Hardware: | Unspecified | ||||||||||||||
| OS: | Unspecified | ||||||||||||||
| Whiteboard: | |||||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||
| Clone Of: | Environment: | ||||||||||||||
| Last Closed: | 2019-07-22 20:30:03 UTC | Type: | Bug | ||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||
| Documentation: | --- | CRM: | |||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||
| Attachments: |
|
||||||||||||||
just a guess, but I wonder if the AMD box has the high precision timer on the host? the test fails when trying to set timer resolution to 100 ns and found that "median time delta is out of tolerance" can you check min_timer_period_us parameter (sys/module/kvm/parameters/min_timer_period_us) and decreasing it to 100 us if it is bigger? Thanks, Vadim. (In reply to Vadim Rozenfeld from comment #3) > the test fails when trying to set timer resolution to 100 ns and found that > "median time delta is out of tolerance" > can you check min_timer_period_us parameter > (sys/module/kvm/parameters/min_timer_period_us) and decreasing it to 100 us > if it is bigger? > > Thanks, > Vadim. checked the min_timer_period_us, it is 100. Can you include a log with driftfix=none? Created attachment 1205379 [details]
clock-none-pass-hckx
clock job w/ driftfix=none
(In reply to Yu Wang from comment #6) > Created attachment 1205379 [details] > clock-none-pass-hckx > > clock job w/ driftfix=none So it works fine now. Maybe we should put a mark somewhere not to use driftfix=slew for SVVP/WHQL testing? (In reply to Vadim Rozenfeld from comment #7) > (In reply to Yu Wang from comment #6) > > Created attachment 1205379 [details] > > clock-none-pass-hckx > > > > clock job w/ driftfix=none > > So it works fine now. Maybe we should put a mark somewhere not to use > driftfix=slew for SVVP/WHQL testing? Hi Vadim, QE have checked default commandline on rhevm, it included "base=2016-09-28T06:48:19,driftfix=slew" , so QE think this bug should be fixed w/ driftfix=slew. Thanks Yu Wang (In reply to Yu Wang from comment #0) > Created attachment 1203193 [details] > hck_log > > Description of problem: > Job "Clock Interrupt Test " failed when "-rtc > base=localtime,clock=host,driftfix=slew" on AMD host > > Version-Release number of selected component (if applicable): > qemu-kvm-rhev-2.6.0-25.el7.x86_64 > kernel-3.10.0-500.el7.x86_64 > virtio-win-1.9.0_2.el7 > seabios-1.9.1-5.el7.x86_64 > > How reproducible: > 100% > > Steps to Reproduce: > 1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew > 2. run job Clock Interrupt Test > 3. > > Actual results: > Failed as "Previous results in this log marked with '*' are out of > tolerance. max = 14646, min = 488" and "Median time delta is out of > tolerance. max = 7323, min = 3222" > > Expected results: > Pass without error > > Additional info: > 1 it could pass with "driftfix=none" > 2 svvp failed log refer to attachment. Yu Wang, Where can i find instructions to run SVVP ? Which Windows version is this ? (In reply to Marcelo Tosatti from comment #10) > (In reply to Yu Wang from comment #0) > > Created attachment 1203193 [details] > > hck_log > > > > Description of problem: > > Job "Clock Interrupt Test " failed when "-rtc > > base=localtime,clock=host,driftfix=slew" on AMD host > > > > Version-Release number of selected component (if applicable): > > qemu-kvm-rhev-2.6.0-25.el7.x86_64 > > kernel-3.10.0-500.el7.x86_64 > > virtio-win-1.9.0_2.el7 > > seabios-1.9.1-5.el7.x86_64 > > > > How reproducible: > > 100% > > > > Steps to Reproduce: > > 1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew > > 2. run job Clock Interrupt Test > > 3. > > > > Actual results: > > Failed as "Previous results in this log marked with '*' are out of > > tolerance. max = 14646, min = 488" and "Median time delta is out of > > tolerance. max = 7323, min = 3222" > > > > Expected results: > > Pass without error > > > > Additional info: > > 1 it could pass with "driftfix=none" > > 2 svvp failed log refer to attachment. > > Yu Wang, > Q1) > Where can i find instructions to run SVVP ? Q2) > Which Windows version is this ? Q3) Is this a regression from RHEL-7.2? Q4) Is it possible to disable driftfix=slew and still have SVVP validated? that is what are the rules for SVVP validation? Hi karen, > > > Q1) > > Where can i find instructions to run SVVP ? QE have a guide to setup the svvp env. https://mojo.redhat.com/docs/DOC-1069057 and how to run svvp https://mojo.redhat.com/docs/DOC-1068119, you can find the hlk tool, filter and playlist on https://sysdev.microsoft.com/en-US/Hardware/member/HlkWizard/ > Q2) > > Which Windows version is this ? > 14393.0.160808-1702.RS1_Release_srvmedia_SERVER_OEMRET_X64FRE_EN-US.ISO > Q3) > > Is this a regression from RHEL-7.2? > tried on rhel7.2 qemu-kvm-rhev, it hit the same issue. Since ws2016 start at rhel7.3, there is no "Clock Interrupt Test " job on hck and ws2012R2. > Q4) > > Is it possible to disable driftfix=slew and still have SVVP validated? > that is what are the rules for SVVP validation? It is possible to run driftfix=none and get the validated, so it is not a test blocker. Moving this bug to RHEL-7.5 while assuming that: 1) Reinjection does indeed cause "abnormal" deltas between timer interrupts to be caused. 2) However that is not fatal to Windows timekeeping. --- Having making those two points, have not actually investigated what test is expecting and what its getting (ENOCAPACITY for RHEL 7.4). On physical HW timer interrupts are generated as follows:
d d d
*----------*----------*----------*
i i i i
Where i is the interrupt generation event and
"d" is the length of time between consecutive
interrupts.
Vadim mentions that the test fails when d = 100ns.
So it is expected that, on a virtualized machine:
100ns 100ns
*----------*...*----------* ...
i i i
[ window when vcpu is scheduled out ]
on reinjection:
100ns
*--*--*--*--
i i i i
Several interrupts will be injected when the vcpu
is scheduled in again.
And this is where the error must be coming from.
So its not a bug, but valid and expected behaviour
of the "virtualized HW" exposed to Windows.
Unless this causes misbehaviour of Windows timekeeping
for some reason (which i don't think it does, because
we know Windows timekeeping works properly), then this
should be marked as not a bug.
Anyone has any reason to not close this bug?
Hi, (In reply to Marcelo Tosatti from comment #14) > On physical HW timer interrupts are generated as follows: > > > d d d > *----------*----------*----------* > i i i i > > > Where i is the interrupt generation event and > "d" is the length of time between consecutive > interrupts. > > Vadim mentions that the test fails when d = 100ns. > > So it is expected that, on a virtualized machine: > > > 100ns 100ns > *----------*...*----------* ... > i i i > > [ window when vcpu is scheduled out ] > > on reinjection: > > 100ns > *--*--*--*-- > i i i i > > Several interrupts will be injected when the vcpu > is scheduled in again. > > And this is where the error must be coming from. > > So its not a bug, but valid and expected behaviour > of the "virtualized HW" exposed to Windows. > > Unless this causes misbehaviour of Windows timekeeping > for some reason (which i don't think it does, because > we know Windows timekeeping works properly), then this > should be marked as not a bug. > > Anyone has any reason to not close this bug? Thanks for your reply, but we did not hit this issue on intel host with the same env.(include host version and guest) Thank Yu Wang (In reply to Yu Wang from comment #15) > Hi, > > (In reply to Marcelo Tosatti from comment #14) > > On physical HW timer interrupts are generated as follows: > > > > > > d d d > > *----------*----------*----------* > > i i i i > > > > > > Where i is the interrupt generation event and > > "d" is the length of time between consecutive > > interrupts. > > > > Vadim mentions that the test fails when d = 100ns. > > > > So it is expected that, on a virtualized machine: > > > > > > 100ns 100ns > > *----------*...*----------* ... > > i i i > > > > [ window when vcpu is scheduled out ] > > > > on reinjection: > > > > 100ns > > *--*--*--*-- > > i i i i > > > > Several interrupts will be injected when the vcpu > > is scheduled in again. > > > > And this is where the error must be coming from. > > > > So its not a bug, but valid and expected behaviour > > of the "virtualized HW" exposed to Windows. > > > > Unless this causes misbehaviour of Windows timekeeping > > for some reason (which i don't think it does, because > > we know Windows timekeeping works properly), then this > > should be marked as not a bug. > > > > Anyone has any reason to not close this bug? > > Thanks for your reply, but we did not hit this issue on intel host with the > same env.(include host version and guest) > > Thank > Yu Wang Ok, can you give the specs of the AMD and Intel machines where this was tested? It could be that Intel hardware is more powerful, in that case there is no interrupt being missed at d=100ns. To confirm that (the reason for Intel hardware not to exhibit problem is performance executing instructions), could try an Intel host with lower performance (lower frequency, smaller cache) than the AMD host where the problem is reproducible. If its not performance, then it could be an issue with AMD KVM support (either interrupt injection or something that affects it). this job passed with latest HLK filter during rhel7.4 svvp test under q35&ovmf,so I close this bug Hit the same issue on rhel 7.6 host, with q35+seabios, both hit on AMD host and Intel host. And apply the latest HLK filter, cannot filter it pass. The job "Clock Interrupt Test" failed with follows error message: =============================================================================== Error 7/10/2018 3:29:18.484 PM Previous results in this log marked with '*' are out of tolerance. max = 14646, min = 488 WexTraceInfo ThreadId=6016 ProcessId=4604 TimeStamp=4702385725 LogSessionId=1 File: minkernel\hals\tools\npctest\exe\clkint.cpp Line: 1022 Error Type: Error Code: 0x0 Error Text: Error 0x00000000 Error 7/10/2018 3:33:27.257 PM Unable to complete testing due to repeated interference from another component. Please investigate, resolve conflict, and retry. WexTraceInfo ThreadId=6016 ProcessId=4604 TimeStamp=7190161349 LogSessionId=1 File: minkernel\hals\tools\npctest\exe\clkint.cpp Line: 880 Error Type: Error Code: 0x0 Error Text: Error 0x00000000 ================================================================================ Tried to test with follows rtc command on AMD host, all hit this issue. 1. Tried with "-rtc base=localtime,clock=host,driftfix=slew", reproduced this issue. 2. Tried with "-rtc base=localtime,clock=host,driftfix=none", reproduced this issue. 3. Tried with "-rtc base=localtime,driftfix=slew", reproduced this issue. 4. Tried with "-rtc base=localtime,driftfix=none", reproduced this issue. 5. Tried with "-rtc base=localtime,clock=host", tested 3 times, passed 1 times. 6. Tried with "-rtc base=localtime", reproduced this issue. 7. Totally test this job 20 times on AMD host, passed 1 time. Also Tried same scenarios on Intel host, Totally tested this job 25 times, but all were failed. Tested with pc+seabios before, on AMD host, run this job 4 times, 1 pass. On Intel host, run this job 15 times, 1 pass. And checked this job status on RHEL7.5 svvp test, it also report the same error, but it can be filtered pass. Guest: windows 2016 Datacenter build 14393 Used versions: kernel-3.10.0-916.el7.x86_64 qemu-kvm-rhev-2.12.0-5.el7.x86_64 seabios-bin-1.11.0-2.el7.noarch virtio-win-prewhql-156 Due to it has high reproduce rate, re-open it~ Best Regards~ Peixiu Created attachment 1457750 [details]
rhel7.6 svvp test package on intel host with q35+seabios
Please provide details of the exact hosts used in the test. (In reply to Dr. David Alan Gilbert from comment #20) > Please provide details of the exact hosts used in the test. For Intel host: [root@ibm-x3850x5-11 svvp-intel]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 10 Socket(s): 4 NUMA node(s): 4 Vendor ID: GenuineIntel CPU family: 6 Model: 47 Model name: Intel(R) Xeon(R) CPU E7- 8860 @ 2.27GHz Stepping: 2 CPU MHz: 2262.000 CPU max MHz: 2262.0000 CPU min MHz: 1064.0000 BogoMIPS: 4522.14 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 24576K NUMA node0 CPU(s): 0-9,40-49 NUMA node1 CPU(s): 10-19,50-59 NUMA node2 CPU(s): 20-29,60-69 NUMA node3 CPU(s): 30-39,70-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm epb tpr_shadow vnmi flexpriority ept vpid ibpb ibrs stibp dtherm ida arat spec_ctrl intel_stibp For AMD host: [root@hp-dl585g7-03 svvp-amd]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 4 NUMA node(s): 8 Vendor ID: AuthenticAMD CPU family: 21 Model: 2 Model name: AMD Opteron(tm) Processor 6380 Stepping: 0 CPU MHz: 2493.816 BogoMIPS: 4987.63 Virtualization: AMD-V L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 6144K NUMA node0 CPU(s): 0,4,8,12,16,20,24,28 NUMA node1 CPU(s): 32,36,40,44,48,52,56,60 NUMA node2 CPU(s): 1,5,9,13,17,21,25,29 NUMA node3 CPU(s): 33,37,41,45,49,53,57,61 NUMA node4 CPU(s): 2,6,10,14,18,22,26,30 NUMA node5 CPU(s): 34,38,42,46,50,54,58,62 NUMA node6 CPU(s): 35,39,43,47,51,55,59,63 NUMA node7 CPU(s): 3,7,11,15,19,23,27,31 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate retpoline_amd vmmcall bmi1 ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold ssbd Qemu command line for AMD&Intel paste on attachment. And considering may be you cannot open the attachment .hlkx file, here change it to html type. you can refer to hlk_logs.html file. And I summited this job on Intel host 5 times continue, 2 times passed. If you also need other message, please let me know~ Thanks~ Peixiu Created attachment 1457762 [details]
hlk_log html file
Created attachment 1457764 [details]
qemu command lines
Hit this issue on rhel7.4.z qemu-kvm-1.5.3-141.el7_4.9.x86_64. Tried run "Clock Interrupt Test" 30 times, only passed 1 time. passed 1/10 with "-rtc base=localtime,clock=host,driftfix=none" setting. Also tested with "-rtc base=localtime,clock=host,driftfix=slew" and "-rtc base=localtime,clock=host" many times, all failed. Used versions: virtio-win-1.9.3-1.el7.noarch qemu-kvm-1.5.3-141.el7_4.9.x86_64 kernel-3.10.0-693.el7.x86_64 seabios-bin-1.10.2-3.el7.noarch Guest : windows server 2016 GA HLK version:10.1.14393.4 HLK playlist version:2.1 Hit this issue on RHEL7.7(AMD host), run 50 times, only passed 1 time with "-rtc base=localtime,clock=host,driftfix=none". Also tested with "-rtc base=localtime,clock=host,driftfix=slew" many times, all failed. pkg: kernel-3.10.0-1057.el7 qemu-kvm-rhev-2.12.0-33.el7 seabios-1.11.0-2.el7 virtio-win-1.9.8-2.el8.iso -viostor virtio-win-prewhql-0.1-172.iso -netkvm |
Created attachment 1203193 [details] hck_log Description of problem: Job "Clock Interrupt Test " failed when "-rtc base=localtime,clock=host,driftfix=slew" on AMD host Version-Release number of selected component (if applicable): qemu-kvm-rhev-2.6.0-25.el7.x86_64 kernel-3.10.0-500.el7.x86_64 virtio-win-1.9.0_2.el7 seabios-1.9.1-5.el7.x86_64 How reproducible: 100% Steps to Reproduce: 1. boot guest w/ -rtc base=localtime,clock=host,driftfix=slew 2. run job Clock Interrupt Test 3. Actual results: Failed as "Previous results in this log marked with '*' are out of tolerance. max = 14646, min = 488" and "Median time delta is out of tolerance. max = 7323, min = 3222" Expected results: Pass without error Additional info: 1 it could pass with "driftfix=none" 2 svvp failed log refer to attachment.