Bug 1092914
| Summary: | RHEV host turns unassigned for 5 minutes in UI in case NFS mount error happens during upgrade | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Tomas Dosek <tdosek> |
| Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
| Status: | CLOSED WORKSFORME | QA Contact: | Pavel Stehlik <pstehlik> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.4.0 | CC: | acathrow, amureini, bazulay, gklein, iheim, laravot, lpeer, oourfali, pstehlik, Rhev-m-bugs, scohen, tdosek, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | infra | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-07-28 15:38:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1072347 | | |
| Attachments: | | | |
Created attachment 891086 [details]
vdsm log
Liron, please take a look?

Tomas, can you elaborate on the issue here? When a host is being activated, its status is changed to Unassigned; when an error is detected during the activation, the host status changes to Non-Operational. Perhaps the UX in that scenario could be improved, as this host lifecycle seems infra related.

The host lifecycle is indeed this way. What I assume is happening is:

A) The host is turned to Unassigned before the upgrade takes the engine down
B) The upgrade runs smoothly
C) The Unassigned state is still in the DB, so the new RHEV-M presents it to the user
D) On the next run of getVdsStats the status is correctly fetched

This is the only flow that sounds reasonable of all the ones I inspected in the code.

Barak,

As commented above, this BZ seems to be about host lifecycle (specifically during upgrade). This seems like your team's area more than mine. Can you please confirm/refute?

(In reply to Allon Mureinik from comment #5)
> Barak,
>
> As commented above, this BZ seems to be about host lifecycle (specifically
> during upgrade).
> This seems like your team's area more than mine.
> Can you please confirm/refute?

The two questions are:
- Why did the host move to Unassigned before restarting the engine?
- What happens to a host in UNASSIGNED status after the engine restart?

Allon, it seems this issue is in the seam between storage and infra (leans more towards storage). Anyway, help me answer the above questions and we'll proceed from there.

Tomas - how much time passed between the disconnect of the SPM from storage and it becoming Unassigned?

(In reply to Barak from comment #6)
> The two questions are:
> - Why did the host move to Unassigned before restarting the engine?
> - What happens to a host in UNASSIGNED status after the engine restart?
>
> Allon, it seems this issue is in the seam between storage and infra (leans
> more towards storage).
>
> Anyway, help me answer the above questions and we'll proceed from there.

Liron - can you answer this please?

I'm not able to tell exactly, Barak; it all happened spontaneously while the guys in the lab performed maintenance on the shared storage. I know everything was fine before I started the setup, and it looks like the relevant storage was reconnecting before the upgrade restarted the engine. It had to be a micro-failure, which indeed is very hard to reproduce.

I tried to reproduce by:
1) stopping the engine
2) setting the host status to Unassigned manually in the DB
3) blocking the connection from the host to the storage
4) starting the engine

The host was moved upon engine start from Unassigned to Non-Operational, as expected.
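The suspected A-D flow above (a stale Unassigned status surviving the engine restart in the DB until the next monitoring poll corrects it) can be sketched as a toy model. This is illustrative only; the class and function names below are hypothetical and are not actual ovirt-engine code:

```python
# Toy model of the suspected flow A-D; all names are hypothetical,
# not real ovirt-engine classes or verbs.

class HostRecord:
    """Stands in for the engine DB row, which survives an engine restart."""
    def __init__(self, status: str):
        self.status = status

def activate(host: HostRecord) -> None:
    # (A) Activation first moves the host to Unassigned.
    host.status = "Unassigned"

def engine_restart(host: HostRecord) -> str:
    # (B)+(C) The upgrade restarts the engine; the DB row is untouched,
    # so the new engine presents the stale Unassigned status to the user.
    return host.status

def monitoring_poll(host: HostRecord, storage_ok: bool) -> str:
    # (D) The next stats poll fetches the real state from the host.
    host.status = "Up" if storage_ok else "Non-Operational"
    return host.status

host = HostRecord("Up")
activate(host)
print(engine_restart(host))                       # stale "Unassigned" after restart
print(monitoring_poll(host, storage_ok=False))    # "Non-Operational" on next poll
```

Under this model, the 5-minute window in the report is simply the time until the first post-restart poll observes the storage failure.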
Created attachment 891084 [details]
engine log

Description of problem:
If an NFS mount error happens during a 3.3 -> 3.4 upgrade, the host turns to the Unassigned state for more than 5 minutes until it turns Non-Operational.

Version-Release number of selected component (if applicable):
av7

How reproducible:
100 %

Steps to Reproduce:
1. Start the upgrade
2. Block the NFS connection to the SPM host before the upgrade finishes

Actual results:
The host turns to the Unassigned state for more than 5 minutes, and there is no way of recovering the state via the UI other than waiting.

Expected results:
The host should turn Non-Operational instantly.

Additional info:
engine and vdsm logs to follow. Relevant audit log messages:

```
engine=# select log_time,message from audit_log order by log_time desc;
          log_time          |                                message
----------------------------+--------------------------------------------------------------------------------
 2014-04-30 11:00:01.999+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 11:00:01.976+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 11:00:00.724+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:56:16.252+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:56:16.235+02 | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:56:15.735+02 | Failed to connect Host 10.34.27.209 to Storage Servers
 2014-04-30 10:56:15.727+02 | The error message for connection 10.34.1.251:/mnt/storage2 returned by VDSM was: Problem while trying to mount target
 2014-04-30 10:56:15.72+02  | Failed to connect Host 10.34.27.209 to the Storage Domains storage2.
 2014-04-30 10:55:27.133+02 | Storage Domain export2 was detached from Data Center NFS by admin
 2014-04-30 10:54:52.493+02 | Host 10.34.27.209 configuration was updated by admin.
 2014-04-30 10:54:10.155+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 10:54:10.133+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:54:08.621+02 | Host 10.34.27.209 was activated by admin.
 2014-04-30 10:54:08.577+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:53:03.555+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:53:03.529+02 | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:52:04.558+02 | Storage Domain export2 (Data Center NFS) was deactivated by admin
 2014-04-30 10:50:02.789+02 | State was set to Up for host 10.34.27.209.
 2014-04-30 10:50:02.762+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:50:00.739+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:48:05.086+02 | Failed to connect Host 10.34.27.209 to Storage Pool NFS
 2014-04-30 10:48:05.04+02  | Host 10.34.27.209 cannot access one of the Storage Domains attached to the Data Center NFS. Setting Host state to Non-Operational.
 2014-04-30 10:45:03.75+02  | State was set to Up for host 10.34.27.209.
 2014-04-30 10:45:03.721+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:45:03.163+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:45:01.172+02 | Host 10.34.27.209 was autorecovered.
 2014-04-30 10:45:01.021+02 | Could not get hardware information for host 10.34.27.209
 2014-04-30 10:44:33.355+02 | User admin logged in.
```
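The fail/recover cadence the log shows can be read off directly by differencing the Non-Operational and Up transition timestamps. The sketch below does that arithmetic with timestamps copied from the audit log (timezone offsets dropped for simplicity, since all entries share +02); it is illustrative, not part of any tooling:

```python
from datetime import datetime

# State-transition timestamps copied from the audit log above
# (all share the +02 offset, so it is dropped here).
events = [
    "2014-04-30 10:48:05",  # Setting Host state to Non-Operational
    "2014-04-30 10:50:02",  # State was set to Up
    "2014-04-30 10:53:03",  # Setting Host state to Non-Operational
    "2014-04-30 10:56:16",  # Setting Host state to Non-Operational
    "2014-04-30 11:00:01",  # State was set to Up
]

fmt = "%Y-%m-%d %H:%M:%S"
times = [datetime.strptime(t, fmt) for t in events]

# Seconds between consecutive transitions.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)  # [117.0, 181.0, 193.0, 225.0]
```

Each cycle in this window is roughly 2-4 minutes, which is consistent with the report's complaint that the UI shows no usable state for several minutes at a stretch during the incident.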