Bug 1386507

Summary: [RFE] better logging of cluster version upgrade failures
Product: Red Hat Enterprise Virtualization Manager Reporter: Marcus West <mwest>
Component: ovirt-engine Assignee: Shmuel Melamud <smelamud>
Status: CLOSED ERRATA QA Contact: sefi litmanovich <slitmano>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.0.3 CC: gklein, lsurette, mavital, mgoldboi, michal.skrivanek, rbalakri, Rhev-m-bugs, smelamud, srevivo, tjelinek, ykaul
Target Milestone: ovirt-4.1.0-alpha Keywords: FutureFeature
Target Release: --- Flags: gklein: testing_plan_complete+
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-25 00:55:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marcus West 2016-10-19 07:00:35 UTC
## Description of problem:

An invalid time zone setting on a single VM can cause the cluster compatibility upgrade to fail. The logs do not clearly identify the problem VM.

## Version-Release number of selected component (if applicable):

rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

always

## Steps to Reproduce:
1. Create a 3.6 DC/cluster and VMs.
2. Change the time zone of one of the VMs to '' (vm_static, time_zone):

engine=# select vm_name, vm_guid, os, time_zone from vm_static where cluster_id = 'f0f30779-6e8b-46e8-8689-9fd46cea220b' order by vm_name;
  vm_name   |               vm_guid                | os |     time_zone     
------------+--------------------------------------+----+-------------------
 linux-test | 0dbf06c6-2734-4d72-84ee-30d7dc230c56 |  5 | Etc/GMT
 rhel6-test | 5d5fe9fe-3e70-4a22-9cdc-d6c8c8f9694f | 19 | Etc/GMT
 rhel7-test | 12c43e79-648a-4c7f-a0a7-54a7a6be9e7f | 24 | 
 win-test   | b57eb0c7-f933-4a23-aff8-1ed6424ff0ed | 25 | GMT Standard Time

3.

## Actual results:

From the GUI, the action fails with the error:

"Error while executing action Edit Cluster properties: Internal Engine Error"

## Expected results:

GUI (or logs) should report specifically which VM is in error.

## Additional info:

I don't have a reproducer for creating a VM with an invalid time zone. I'm not sure how the customer managed to achieve it, but we spent several hours investigating the wrong VMs in an attempt to isolate the problem.

In larger environments (with a mix of Linux and other OSes), it may be difficult to see which VMs are 'invalid'.
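
To illustrate the kind of check that would have saved time here, the following is a minimal sketch that scans rows shaped like the `vm_static` query output above and flags VMs whose `time_zone` is blank. The `VmRow` type and the function name are hypothetical, and the check covers only the empty-value case from the reproduction steps; it does not reproduce the engine's own per-OS time zone validation.

```python
from typing import Iterable, List, NamedTuple, Optional


class VmRow(NamedTuple):
    """One row of the vm_static query (hypothetical helper type)."""
    vm_name: str
    os: int
    time_zone: Optional[str]


def vms_with_blank_time_zone(rows: Iterable[VmRow]) -> List[str]:
    """Return the names of VMs whose time_zone is NULL, empty, or whitespace."""
    return [r.vm_name for r in rows if not (r.time_zone or "").strip()]


# Rows copied from the query output in the reproduction steps.
rows = [
    VmRow("linux-test", 5, "Etc/GMT"),
    VmRow("rhel6-test", 19, "Etc/GMT"),
    VmRow("rhel7-test", 24, ""),
    VmRow("win-test", 25, "GMT Standard Time"),
]

print(vms_with_blank_time_zone(rows))  # prints ['rhel7-test']
```

A non-empty but wrong value (for example, a Windows-style zone name on a Linux guest) would not be caught by this sketch.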

Comment 2 sefi litmanovich 2016-11-22 13:26:32 UTC
So far I have looked at this feature in the nightly build.
There is some more information, but I'm not sure it is enough to satisfy the request to pinpoint the problem and identify the problematic VM.
E.g. I forced an invalid string into the DB for a VM's time_zone and tried to upgrade the cluster; I get:

Error while executing action:

    Cannot edit Cluster. Invalid time zone for given OS type.
    Attribute: vmStatic

While this does add the real reason to the message, if I had 200 VMs in my environment I'd have a hard time figuring out which one caused the problem.
I can figure that out from this line in engine.log, but I don't think it's enough; I'm open to discussion about it.

2016-11-22 14:10:27,854 INFO  [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-12) [28807589] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[{faulty_vm's_id}=<VM, ACTION_TYPE_FAILED_CLUSTER_IS_BEING_UPDATED$clusterName another-clust>]'}'

As this RFE doesn't cover many cases, I'm adding one case to check that an error in a cluster upgrade doesn't produce an Internal Error message. Please review this case once I post the Polarion link here, and tell me if you think I need to add more cases.
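
As a rough illustration of the engine.log workaround described above, the sketch below pulls VM IDs out of a "Lock freed" line. It assumes that in a real log the `{faulty_vm's_id}` placeholder is a plain UUID immediately followed by `=<VM,`, which is inferred only from the single line quoted in this comment; the function name and the sample line are hypothetical.

```python
import re
from typing import List

# Assumed shape of a VM lock entry: "<uuid>=<VM, ..." (inferred from one sample).
UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
VM_LOCK = re.compile(rf"({UUID})=<VM,")


def locked_vm_ids(log_line: str) -> List[str]:
    """Return every UUID that appears as the key of a VM lock entry."""
    return VM_LOCK.findall(log_line)


# Hypothetical sample: the quoted line with the placeholder replaced by
# the rhel7-test vm_guid from the description.
sample = (
    "Lock freed to object 'EngineLock:{exclusiveLocks='null', "
    "sharedLocks='[12c43e79-648a-4c7f-a0a7-54a7a6be9e7f=<VM, "
    "ACTION_TYPE_FAILED_CLUSTER_IS_BEING_UPDATED$clusterName another-clust>]'}'"
)

print(locked_vm_ids(sample))  # prints ['12c43e79-648a-4c7f-a0a7-54a7a6be9e7f']
```

The extracted IDs can then be matched against `vm_guid` in `vm_static` to find the VM name.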

Comment 4 sefi litmanovich 2017-02-02 12:05:03 UTC
Verifying based on my comment #2 and the attached test case.
Opening a new RFE - https://bugzilla.redhat.com/show_bug.cgi?id=1418641
for more specific information in logging, as I understand that my request from comment #2 cannot easily be implemented within the scope of this RFE.