1386507 – [RFE] better logging of cluster version upgrade failures

Summary: [RFE] better logging of cluster version upgrade failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.0.3
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	ovirt-4.1.0-alpha
Target Release:	---
Assignee:	Shmuel Melamud
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-10-19 07:00 UTC by Marcus West
Modified:	2021-08-30 12:06 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-04-25 00:55:25 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:
Flags:	gklein: testing_plan_complete+

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-43199	None	None	None	2021-08-30 12:06:20 UTC
Red Hat Knowledge Base (Solution)	2715711	None	None	None	2016-10-19 07:28:02 UTC
Red Hat Product Errata	RHEA-2017:0997	normal	SHIPPED_LIVE	Red Hat Virtualization Manager (ovirt-engine) 4.1 GA	2017-04-18 20:11:26 UTC
oVirt gerrit	66205	None	MERGED	core: Propagate UpdateVm failure to UpdateClusterCommand	2021-02-12 07:25:19 UTC

Description Marcus West 2016-10-19 07:00:35 UTC

## Description of problem:

An invalid timezone setting on a single VM can cause a cluster compatibility upgrade to fail. The logs do not clearly indicate which VM is the problem.

## Version-Release number of selected component (if applicable):

rhevm-4.0.4.4-0.1.el7ev.noarch

## How reproducible:

always

## Steps to Reproduce:
1. Create a 3.6 DC/cluster and some VMs
2. Change one VM's timezone to '' (vm_static, time_zone)

engine=# select vm_name, vm_guid, os, time_zone from vm_static where cluster_id = 'f0f30779-6e8b-46e8-8689-9fd46cea220b' order by vm_name;
  vm_name   |               vm_guid                | os |     time_zone     
------------+--------------------------------------+----+-------------------
 linux-test | 0dbf06c6-2734-4d72-84ee-30d7dc230c56 |  5 | Etc/GMT
 rhel6-test | 5d5fe9fe-3e70-4a22-9cdc-d6c8c8f9694f | 19 | Etc/GMT
 rhel7-test | 12c43e79-648a-4c7f-a0a7-54a7a6be9e7f | 24 | 
 win-test   | b57eb0c7-f933-4a23-aff8-1ed6424ff0ed | 25 | GMT Standard Time

3. Attempt to upgrade the cluster compatibility version (Edit Cluster)

## Actual results:

From the GUI, the action fails with the error:

"Error while executing action Edit Cluster properties: Internal Engine Error"

## Expected results:

GUI (or logs) should report specifically which VM is in error.

## Additional info:

I don't have a reproducer for creating a VM with an invalid timezone. I'm not sure how the customer managed to achieve it, but we spent several hours messing around with the wrong VMs in an attempt to isolate the problem.

In larger environments (with a mix of Linux and other OSes), it may be difficult to see which VMs are 'invalid'.
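
As a rough aid for larger environments, the output of the query from step 2 could be scanned for VMs with a blank time_zone. The sketch below is only an illustration: the function name vms_with_blank_time_zone is made up, it assumes the default aligned psql output format shown above, and it only catches empty values, not other strings the engine may reject.

```python
# Hypothetical helper, not part of ovirt-engine: scans aligned psql output
# (as produced by the vm_static query in step 2) for blank time_zone values.

SAMPLE = """\
  vm_name   |               vm_guid                | os |     time_zone
------------+--------------------------------------+----+-------------------
 linux-test | 0dbf06c6-2734-4d72-84ee-30d7dc230c56 |  5 | Etc/GMT
 rhel6-test | 5d5fe9fe-3e70-4a22-9cdc-d6c8c8f9694f | 19 | Etc/GMT
 rhel7-test | 12c43e79-648a-4c7f-a0a7-54a7a6be9e7f | 24 |
 win-test   | b57eb0c7-f933-4a23-aff8-1ed6424ff0ed | 25 | GMT Standard Time
"""


def vms_with_blank_time_zone(psql_output):
    """Return (vm_name, vm_guid) pairs whose time_zone column is empty."""
    # Header and data rows contain '|'; the dashed separator row does not.
    rows = [line for line in psql_output.splitlines() if "|" in line]
    if not rows:
        return []

    header = [col.strip() for col in rows[0].split("|")]
    name_i = header.index("vm_name")
    guid_i = header.index("vm_guid")
    tz_i = header.index("time_zone")

    flagged = []
    for line in rows[1:]:
        cols = [col.strip() for col in line.split("|")]
        if len(cols) != len(header):
            continue  # skip anything that is not a well-formed data row
        if cols[tz_i] == "":
            flagged.append((cols[name_i], cols[guid_i]))
    return flagged


if __name__ == "__main__":
    for name, guid in vms_with_blank_time_zone(SAMPLE):
        print(name, guid)
```

Run against the sample rows above, this flags only rhel7-test. A VM whose time_zone holds a non-empty but invalid string would not be caught by this check.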

Comment 2 sefi litmanovich 2016-11-22 13:26:32 UTC

So far I have had a look at this feature in the nightly build.
There is some more information, but I'm not sure it is enough to satisfy the request to pinpoint the problem and identify the problematic VM.
E.g. when I force an invalid string into the DB for a VM's time_zone and try to upgrade the cluster, I get:

Error while executing action:

    Cannot edit Cluster. Invalid time zone for given OS type.
    Attribute: vmStatic

While this does add the real reason to the message, if I had 200 VMs in my env I'd have a hard time figuring out which one caused the problem.
I can figure that out from this line in engine.log, but I don't think that is enough; I'm open to a discussion about it.

2016-11-22 14:10:27,854 INFO  [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-12) [28807589] Lock freed to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[{faulty_vm's_id}=<VM, ACTION_TYPE_FAILED_CLUSTER_IS_BEING_UPDATED$clusterName another-clust>]'}'
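
For reference, pulling the VM ID out of such a line could be scripted roughly as below. This is a sketch under assumptions: locked_vm_ids is a made-up name, and it assumes the {faulty_vm's_id} placeholder above stands for a plain UUID followed by "=<VM," inside sharedLocks. It returns every VM ID found on matching lines, so if several VMs are listed it does not say which one failed.

```python
import re

# Assumption: VM IDs in the sharedLocks list are plain UUIDs followed by "=<VM,".
UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
LOCKED_VM = re.compile(r"(" + UUID + r")=<VM,")


def locked_vm_ids(log_text):
    """Return VM IDs named in sharedLocks on UpdateClusterCommand log lines."""
    ids = []
    for line in log_text.splitlines():
        if "UpdateClusterCommand" not in line or "sharedLocks" not in line:
            continue
        for vm_id in LOCKED_VM.findall(line):
            if vm_id not in ids:  # keep first-seen order, drop repeats
                ids.append(vm_id)
    return ids


if __name__ == "__main__":
    # Path is an assumption; adjust to wherever engine.log lives.
    with open("engine.log", encoding="utf-8", errors="replace") as fh:
        for vm_id in locked_vm_ids(fh.read()):
            print(vm_id)
```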

As this RFE doesn't cover many cases, I'm adding one case to check that an error during cluster upgrade doesn't produce an Internal Error message. Please review this case once I post the Polarion link here, and tell me if you think I need to add more cases.

Comment 4 sefi litmanovich 2017-02-02 12:05:03 UTC

Verifying based on my comment #2 and attached test case.
Opening a new RFE - https://bugzilla.redhat.com/show_bug.cgi?id=1418641
for more specific information in logging, as I understand that my request from comment #2 will not be easily implemented within the scope of this RFE.
