Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1092914

Summary: RHEV host turns unassigned for 5 minutes in UI in case NFS mount error happens during upgrade
Product: Red Hat Enterprise Virtualization Manager
Reporter: Tomas Dosek <tdosek>
Component: ovirt-engine
Assignee: Eli Mesika <emesika>
Status: CLOSED WORKSFORME
QA Contact: Pavel Stehlik <pstehlik>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.4.0
CC: acathrow, amureini, bazulay, gklein, iheim, laravot, lpeer, oourfali, pstehlik, Rhev-m-bugs, scohen, tdosek, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-07-28 15:38:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1072347    
Attachments:
engine log (flags: none)
vdsm log (flags: none)

Description Tomas Dosek 2014-04-30 09:01:18 UTC
Created attachment 891084 [details]
engine log

Description of problem:
If an NFS mount error happens during a 3.3 -> 3.4 upgrade, the host stays in the Unassigned state for more than 5 minutes until it turns Non-Operational.

Version-Release number of selected component (if applicable):
av7

How reproducible:
100 %

Steps to Reproduce:
1. Start the upgrade
2. Block NFS connection to SPM host before the upgrade finishes

Actual results:
Host stays in the Unassigned state for more than 5 minutes, and there is no way to recover it via the UI other than waiting.

Expected results:
Host should turn Non-Operational immediately.

Additional info:

engine and vdsm logs to follow.

Relevant audit log messages:
engine=# select log_time,message from audit_log order by log_time desc;
          log_time          |                               message
----------------------------+--------------------------------------------------------------------------------
 2014-04-30 11:00:01.999+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 11:00:01.976+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 11:00:00.724+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:56:16.252+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:56:16.235+02 | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:56:15.735+02 | Failed to connect Host 10.34.27.209 to Storage Servers
 2014-04-30 10:56:15.727+02 | The error message for connection 10.34.1.251:/mnt/storage2 returned by VDSM was: Problem while trying to mount target
 2014-04-30 10:56:15.72+02  | Failed to connect Host 10.34.27.209 to the Storage Domains storage2.
 2014-04-30 10:55:27.133+02 | Storage Domain export2 was detached from Data Center NFS by admin
 2014-04-30 10:54:52.493+02 | Host 10.34.27.209 configuration was updated by admin.
 2014-04-30 10:54:10.155+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 10:54:10.133+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:54:08.621+02 | Host 10.34.27.209 was activated by admin.
 2014-04-30 10:54:08.577+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:53:03.555+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:53:03.529+02 | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:52:04.558+02 | Storage Domain export2 (Data Center NFS) was deactivated by admin
 2014-04-30 10:50:02.789+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 10:50:02.762+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:50:00.739+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:48:05.086+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:48:05.04+02  | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:45:03.75+02  | State was set to Up for host 10.34.27.209.
 2014-04-30 10:45:03.721+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:45:03.163+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:45:01.172+02 | Host 10.34.27.209 was autorecovered.
 2014-04-30 10:45:01.021+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:44:33.355+02 | User admin logged in.

Comment 1 Tomas Dosek 2014-04-30 09:03:04 UTC
Created attachment 891086 [details]
vdsm log

Comment 2 Allon Mureinik 2014-04-30 16:25:57 UTC
Liron, please take a look?

Comment 3 Liron Aravot 2014-05-25 12:03:38 UTC
Tomas, can you elaborate on the issue here?
When a host is being activated, its status is changed to Unassigned; when an error is detected during the activation, the host status changes to Non-Operational.

Perhaps the UX in that scenario could be improved; this host lifecycle seems infra related.

Comment 4 Tomas Dosek 2014-05-26 07:01:39 UTC
The host lifecycle is indeed this way. What I assume is happening is:

A) Host is turned to unassigned before the upgrade takes engine down
B) Upgrade runs smoothly
C) The unassigned state is still in DB so the new RHEV-M presents it to user
D) On the next run of getVdsStats the status gets correctly fetched.

This is the only flow that sounds reasonable of all the ones I inspected in the code.
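The hypothesized flow (A-D) can be sketched as a minimal illustrative model. This is not actual ovirt-engine code; the class and function names below are invented for illustration only.

```python
# Illustrative model of the hypothesis in comment 4: the Unassigned status
# persisted in the DB survives the engine restart, and the UI shows that
# stale status until the next monitoring cycle refreshes it.

UNASSIGNED = "Unassigned"
UP = "Up"
NON_OPERATIONAL = "NonOperational"

class EngineDb:
    """Stand-in for the engine database row holding the host status."""
    def __init__(self, status):
        self.status = status

def ui_status(db):
    # The UI simply renders whatever status is currently persisted.
    return db.status

def monitoring_cycle(db, storage_reachable):
    # On the next stats run the real host state is fetched and persisted.
    db.status = UP if storage_reachable else NON_OPERATIONAL

# A) Host is set to Unassigned just before the upgrade stops the engine.
db = EngineDb(UNASSIGNED)
# B/C) The engine restarts; the stale status is still what the UI shows.
assert ui_status(db) == UNASSIGNED
# D) The next monitoring cycle corrects it (NFS is blocked here,
#    so the host lands in Non-Operational).
monitoring_cycle(db, storage_reachable=False)
assert ui_status(db) == NON_OPERATIONAL
```

The key point is step C: until the first monitoring cycle runs after the restart, the UI can only show the stale persisted status, which would explain the window where the host appears Unassigned.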

Comment 5 Allon Mureinik 2014-06-01 07:20:13 UTC
Barak,

As commented above, this BZ seems to be about host lifecycle (specifically during upgrade).
This seems like your team's area more than mine.
Can you please confirm/refute?

Comment 6 Barak 2014-07-27 12:20:57 UTC
(In reply to Allon Mureinik from comment #5)
> Barak,
> 
> As commented above, this BZ seems to be about host lifecycle (specifically
> during upgrade).
> This seems like your team's area more than mine.
> Can you please confirm/refute?

The two questions are:
- Why did the host move to Unassigned before the engine restart?
- What happens to a host in UNASSIGNED status after the engine restart?


Allon, it seems this issue is at the seam between storage and infra (it leans more towards storage).

Anyway, help me answer the above questions and we'll proceed from there.

Comment 7 Barak 2014-07-27 12:21:53 UTC
Tomas, how much time passed between the disconnect of the SPM from storage and it becoming Unassigned?

Comment 8 Allon Mureinik 2014-07-27 12:28:44 UTC
(In reply to Barak from comment #6)
> (In reply to Allon Mureinik from comment #5)
> > Barak,
> > 
> > As commented above, this BZ seems to be about host lifecycle (specifically
> > during upgrade).
> > This seems like your team's area more than mine.
> > Can you please confirm/refute?
> 
> The two questions are:
> - Why did the host move to Unassigned before the engine restart?
> - What happens to a host in UNASSIGNED status after the engine restart?
> 
> 
> Allon, it seems this issue is at the seam between storage and infra (it
> leans more towards storage).
> 
> Anyway, help me answer the above questions and we'll proceed from there.
Liron - can you answer this please?

Comment 9 Tomas Dosek 2014-07-28 07:02:21 UTC
I'm not able to tell exactly, Barak; it all happened spontaneously while the guys in the lab performed maintenance on the shared storage. I know everything was fine before I started the setup, and it looks like the relevant storage was reconnecting before the upgrade restarted the engine. It must have been a micro-failure, which is indeed very hard to reproduce.

Comment 10 Eli Mesika 2014-07-28 09:19:47 UTC
I tried to reproduce by:

1) stop the engine
2) set host status to Unassigned manually in DB
3) block connection from host to storage 
4) start the engine

Upon engine start, the host was moved from Unassigned to Non-Operational as expected.