Bug 1269828

Summary: [windows 10] [3.6 engine/3.5 cluster] Windows 10 VMs keep restarting - SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
Product: [oVirt] ovirt-engine
Reporter: Jiri Belka <jbelka>
Component: Frontend.WebAdmin
Assignee: jniederm
Status: CLOSED CURRENTRELEASE
QA Contact: Nisim Simsolo <nsimsolo>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: bugs, ehabkost, gklein, jbelka, jdenemar, jniederm, knoel, lijin, mavital, mgoldboi, michal.skrivanek, michen, nsimsolo, sbonazzo, tjelinek
Target Milestone: ovirt-3.5.7
Keywords: ZStream
Target Release: 3.5.7
Flags: rule-engine: ovirt-3.5.z+
       ylavi: planning_ack+
       michal.skrivanek: devel_ack+
       mavital: testing_ack+
Hardware: x86_64
OS: Linux
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-17 09:06:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1277214, 1288090
Bug Blocks:
Attachments:
  Description                    Flags
  windows 10 error screenshot    none
  Reassign engine log            none

Description Jiri Belka 2015-10-08 10:11:07 UTC
Created attachment 1080958
windows 10 error screenshot

Description of problem:

I had a Windows 10 VM running in my 3.6 engine environment (RHEL 7 hosts with 3.6 vdsm). To test something, I edited the VM settings and moved it to another cluster with 3.5 cluster level. There was no warning and no error, so it seemed to be OK.

Then I started this previously working Windows 10 VM on a RHEL 6 host with 3.5 vdsm in that 3.5 cluster.

Windows keeps restarting with the error:

  SYSTEM_THREAD_EXCEPTION_NOT_HANDLED

If Windows 10 is not supported on 3.5 cluster level (which allows RHEL 6 hosts and therefore older qemu-kvm etc.), then a check is missing when a user edits the VM configuration and changes its cluster level.

I'm speculating here, but if Windows 10 is supported only on 3.6 cluster level, then:

- there should always be a check of the OS defined for the VM against whether the requested cluster level supports it
- there should be a warning, or moving specific OS types to a lower cluster level should not be allowed

Version-Release number of selected component (if applicable):

- host:
libvirt-0.10.2-54.el6.x86_64
vdsm-4.16.27-1.el6ev.x86_64
kernel-2.6.32-573.7.1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.479.el6_7.2.x86_64

- engine:
rhevm-webadmin-portal-3.6.0-0.18.el6.noarch

How reproducible:
tried once

Steps to Reproduce:
1. 3.6 engine, 3.6 cluster level, 3.6 host (RHEL7)
2. Install Windows 10
3. Have a second cluster with 3.5 cluster level and a 3.5 host (RHEL6)
4. While the Windows 10 VM is down, edit it and change its cluster to the 3.5 one
5. Start the Windows 10 VM

Actual results:
- no warning or error about moving Windows 10 to a lower cluster level
- Windows 10 keeps restarting

Expected results:
Not sure; either:
- moving OS types that require features of higher cluster levels to a lower
  cluster level should not be allowed, or there should be a warning
- or Windows 10 should run on 3.5 cluster level?

Additional info:

Comment 2 Michal Skrivanek 2015-10-09 13:21:18 UTC
Do we support Windows 10 only with certain machine types? 
RHEV 3.5 here means rhel_6.5.0 type, 3.6 would be the latest 7.2 machine type...

Comment 3 Karen Noel 2015-10-09 14:40:01 UTC
(In reply to Michal Skrivanek from comment #2)
> Do we support Windows 10 only with certain machine types? 
> RHEV 3.5 here means rhel_6.5.0 type, 3.6 would be the latest 7.2 machine
> type...

Yes, I believe Win10 is only supported with the rhel-7.2.0 machine type. Eduardo?

There is also a -cpu flag workaround. Add +fsgsbase. There is a way to do this with libvirt. Jirka?

I'm not sure if it works in this case, though. I'm curious if RHEV allows using this libvirt workaround. Thanks.

Comment 9 Jiri Belka 2015-10-26 13:06:00 UTC
Yes, the host has a Xeon 5507 (Gainestown series, based on the Nehalem microarchitecture; thanks, Wikipedia).

I'm going to search for Westmere host.

Comment 10 Jiri Belka 2015-10-26 13:48:39 UTC
So Windows 10 starts OK on a host (RHEL 6.7 with 3.5 vdsm) in a 3.5 cluster with the Westmere CPU model.

(Anyway, I think there should be a check whether a specific OS type can be put into a specific cluster.)

Comment 11 Jiri Belka 2015-10-29 15:03:38 UTC
What was fixed, and what behaviour should we expect?

Comment 12 Tomas Jelinek 2015-10-29 15:19:10 UTC
Currently you cannot run a Windows 10 VM on any of these CPUs:

conroe, penryn, nehalem, opteron_g1, opteron_g2, opteron_g3, opteron_g4, opteron_g5

Comment 13 Nisim Simsolo 2015-11-05 15:47:38 UTC
- Reassigned. The cluster level cannot be changed to 3.5; the change is rejected with the following webadmin message:
Error while executing action: Cannot decrease data center compatibility version

Relevant engine logs:
2015-11-05 15:37:58,288 WARN  [org.ovirt.engine.core.bll.UpdateVdsGroupCommand] (ajp-/127.0.0.1:8702-5) [1d402ae3] CanDoAction of action 'UpdateVdsGroup' failed for user admin@internal. Reasons: VAR__TYPE__CLUSTER,VAR__ACTION__UPDATE,ACTION_TYPE_FAILED_CANNOT_DECREASE_COMPATIBILITY_VERSION
2015-11-05 15:37:58,571 ERROR [org.ovirt.engine.core.bll.host.provider.foreman.SystemProviderFinder] (ajp-/127.0.0.1:8702-3) [] Failed to find host on any provider by host name 'dhcp163-68.scl.lab.tlv.redhat.com' 
2015-11-05 15:37:59,289 ERROR [org.ovirt.engine.core.bll.host.provider.foreman.SystemProviderFinder] (ajp-/127.0.0.1:8702-4) [] Failed to find host on any provider by host name 'dhcp163-68.scl.lab.tlv.redhat.com' 

Scenario: 
1. Change data center compatibility version to 3.5
2. Change cluster compatibility version to 3.5 and click ok.

Actual result: 
Action rejected by webadmin.

The issue affects both AMD and Intel CPU types (verified on AMD Opteron G5 and Intel Nehalem).

If https://bugzilla.redhat.com/show_bug.cgi?id=1269828#c12 describes the bug fix, then webadmin should show an appropriate error message instead of an unexplained "Cannot decrease data center compatibility version".

Comment 14 Red Hat Bugzilla Rules Engine 2015-11-05 15:47:41 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 15 Nisim Simsolo 2015-11-05 15:54:32 UTC
BTW, I verified whether a Win10 VM can be moved from a 3.6 cluster to a 3.5 cluster (with a 3.5.6 host), and it worked as expected: the Win10 VM runs on a 3.6 engine in a 3.5 cluster.
The host CPU type was Opteron G3.

Comment 16 Nisim Simsolo 2015-11-05 15:55:53 UTC
Created attachment 1090194
Reassign engine log

Comment 17 Tomas Jelinek 2015-11-06 07:42:19 UTC
Decreasing cluster version is allowed only if:
- there are no hosts in the cluster
- the new version is not smaller than the DC version

This is old behavior that this patch did not touch.
If that is your case, then the rejection is correct.
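The two conditions listed above can be written as a small predicate. This is a minimal sketch for illustration only: `cluster_decrease_allowed` and the `(major, minor)` tuple representation of versions are hypothetical, not the engine's actual `UpdateVdsGroupCommand` validation.

```python
from typing import Tuple

# Hypothetical representation of a compatibility version, e.g. (3, 5).
Version = Tuple[int, int]


def cluster_decrease_allowed(new_version: Version, dc_version: Version,
                             hosts_in_cluster: int) -> bool:
    """Sketch of the rule for lowering a cluster's compatibility version.

    Assumes the caller has already established that new_version is lower
    than the cluster's current version.
    """
    # Condition 1: the cluster must contain no hosts.
    no_hosts = hosts_in_cluster == 0
    # Condition 2: the new cluster level must not be below the DC level.
    not_below_dc = new_version >= dc_version
    return no_hosts and not_below_dc


# A host is still attached: the decrease is rejected.
print(cluster_decrease_allowed((3, 5), (3, 5), hosts_in_cluster=1))  # False
# Hosts removed first and DC already at 3.5: the decrease is allowed.
print(cluster_decrease_allowed((3, 5), (3, 5), hosts_in_cluster=0))  # True
```

Under this reading, removing all hosts before lowering the DC and cluster levels satisfies both conditions.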

Also, which exact engine version are you testing this on?
It is strange that it ran on Opteron...
Can you please double-check the cluster's CPU type?
On the 3.6 branch I see this:
os.windows_10.cpu.unsupported.value = conroe, penryn, nehalem, opteron_g1, opteron_g2, opteron_g3, opteron_g4, opteron_g5
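A minimal sketch of how a run-time check could consult that property, for illustration only. It assumes a single `os.<os_id>.cpu.unsupported.value` line and that the cluster CPU is given as a lowercase internal id such as `opteron_g3` or `westmere`; the helper names `parse_unsupported_cpus` and `cpu_supported` are hypothetical, not the engine's implementation.

```python
from typing import Dict, Set

# Property line quoted from the 3.6 branch in the comment above.
PROPERTY_LINE = ("os.windows_10.cpu.unsupported.value = conroe, penryn, nehalem, "
                 "opteron_g1, opteron_g2, opteron_g3, opteron_g4, opteron_g5")


def parse_unsupported_cpus(line: str) -> Dict[str, Set[str]]:
    """Map the OS id in one property line to its set of unsupported CPU ids."""
    key, _, value = line.partition("=")
    # Assumed key layout: os.<os_id>.cpu.unsupported.value
    os_id = key.strip().split(".")[1]
    cpus = {cpu.strip().lower() for cpu in value.split(",") if cpu.strip()}
    return {os_id: cpus}


def cpu_supported(os_id: str, cluster_cpu: str,
                  unsupported: Dict[str, Set[str]]) -> bool:
    """True unless cluster_cpu is listed as unsupported for os_id."""
    return cluster_cpu.lower() not in unsupported.get(os_id, set())


table = parse_unsupported_cpus(PROPERTY_LINE)
print(cpu_supported("windows_10", "opteron_g3", table))  # False
print(cpu_supported("windows_10", "westmere", table))    # True
```

OS ids without an entry are treated as having no CPU restriction, which matches running the same VM as "Other OS" on Opteron G3.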

Comment 18 Nisim Simsolo 2015-11-08 07:54:40 UTC
Hi.
I verified it again; this time I first removed all the hosts from the cluster, as you suggested.
Exact verification scenario: 
1. Install windows 10 on 3.6 engine, 3.6 host (DC and cluster levels are 3.6).
2. Verify VM is running.
3. Power off the VM and remove the host (verify that no other hosts are in this cluster).
4. Reduce DC and cluster to compatibility version 3.5
5. Add 3.5 host to cluster.
6. Run Windows 10 VM.

Actual result: 
VM is running properly.

Tested versions: 
engine: rhevm-3.6.0.3-0.1.el6 (3.6.0-17)
- 3.6 host: 
libvirt-client-1.2.17-5.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
- 3.5.6 host: 
vdsm-4.16.29-1.el7ev.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.10.x86_64
libvirt-client-1.2.8-16.el7_1.4.x86_64
sanlock-3.2.2-2.el7.x86_64

Cluster CPU type: AMD Opteron G3
Host CPU type: Quad-Core AMD Opteron(tm) Processor 2350
The VM OS type was set to Windows 10 x64 in webadmin during VM creation (see bug https://bugzilla.redhat.com/show_bug.cgi?id=1278442).

Please move the bug to ON_QA and I'll verify it.

Comment 19 jniederm 2015-11-09 14:42:20 UTC
Hi Nisim,

The current expected behavior is that Win10 (32-bit and 64-bit) does NOT run on Opteron G1-G5, among others.

I suggest the following test procedure:
1. Make sure DC and Cluster levels are set to 3.5
2. Make sure an Opteron G3 host is in this cluster
3. Make sure Cluster CPU is set to Opteron G3
4. Create a VM and set its guest OS to 'Other OS' (no real installation required)
5. (ASSERT) VM runs
6. Stop the VM
7. Set the host to maintenance
8. Set Cluster CPU to Haswell
9. Set the VM guest OS to Win10 (no real installation required)
10. Set Cluster CPU to Opteron G3
11. Activate the host
12. Run the VM

Expected result: the VM refuses to run with the error message:
The guest OS doesn't support the following CPUs: opteron_g5, opteron_g3, opteron_g4, opteron_g1, opteron_g2, conroe, nehalem, penryn. Its possible to change the cluster cpu or set a different one per VM

Comment 20 Red Hat Bugzilla Rules Engine 2015-11-09 14:42:24 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 21 Michal Skrivanek 2015-12-17 09:06:34 UTC
Working since ~3.5.5.

Westmere and newer CPUs work fine (except el6 hosts with SandyBridge during W10 installation).