Bug 1566471
| Field | Value |
|---|---|
| Summary | Retry LSM after previous failure fails with exception in vdsm |
| Product | [oVirt] ovirt-engine |
| Component | BLL.Storage |
| Version | 4.2.2 |
| Status | CLOSED WORKSFORME |
| Severity | high |
| Priority | medium |
| Reporter | Kevin Alon Goldblatt <kgoldbla> |
| Assignee | Benny Zlotnik <bzlotnik> |
| QA Contact | Shir Fishbain <sfishbai> |
| CC | aefrat, bugs, bzlotnik, frolland, tnisan |
| Target Milestone | ovirt-4.4.0 |
| Flags | pm-rhel: ovirt-4.4+, ylavi: exception+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | rhv-4.4.0-28 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| oVirt Team | Storage |
| Bug Depends On | 1697261 |
| Last Closed | 2020-05-11 10:35:19 UTC |
Description (Kevin Alon Goldblatt, 2018-04-12 11:55:20 UTC)
Just a note while I am still looking into this: if the storage was blocked during LSM, the second attempt will most likely fail because there are leftovers on the storage. For obvious reasons, we cannot perform cleanup if the storage is not available.

The exception in vdsm seems to be the same as in bug 1574631.

Might be, but I'm not sure it's the same root cause.

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

I think this has been fixed, setting MODIFIED to retest.

Half verified. The steps to reproduce are (a shell sketch of the block/restore commands follows the attached logs below):
1. Create a VM with disks and an OS, and write data to the disks.
2. Run the VM.
3. Run LSM from the gluster domain to the iscsi domain.
4. Run the iptables command: iptables -A OUTPUT -d 10.35.83.240 -j DROP (on the gluster domain).
   The LSM failed.
5. Wait (I waited half an hour) for the host to move to "Non-Operational" status - it does not happen.
6. Run the iptables command: iptables -D OUTPUT -d 10.35.83.240 -j DROP (restore the connection to the gluster storage).
7. The LSM succeeded.

My question is why the host isn't in "Non-Operational" mode after the LSM failure?

Logs attached.

Created attachment 1551415 [details]
Logs
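
A minimal sketch of the block/restore flow from the steps above, meant only to make the reproduction concrete. The gluster node IP and the iptables commands come from the report; the 30-minute wait mirrors the time the reporter waited, and running it as a script on the host carrying the VM is an assumption.

```bash
#!/bin/bash
# Sketch: block the gluster storage network during LSM, wait, then restore it.
# Run as root on the host that is running the VM.
GLUSTER_IP=10.35.83.240   # gluster node from the report; adjust for your environment

# Block outgoing traffic to the gluster domain while the LSM is in progress
iptables -A OUTPUT -d "$GLUSTER_IP" -j DROP

# Wait and watch whether the host is moved to Non-Operational
# (the report waited ~30 minutes without seeing the state change)
sleep 1800

# Restore connectivity to the gluster storage
iptables -D OUTPUT -d "$GLUSTER_IP" -j DROP
```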
In each attempt to reproduce this bug, the VM moves to one of the following two states:

1. Paused, ran on host_mixed_3. From the engine log:

2019-04-07 18:01:41,724+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_PAUSED(1,025), VM shir_vm_1 has been paused.
2019-04-07 18:01:41,742+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_PAUSED_ERROR(139), VM shir_vm_1 has been paused due to unknown storage error.

2. Not responding, ran on host_mixed_2. From the engine log:

2019-04-07 18:51:44,894+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-5159) [4d019771] Failed to migrate VM 'shir_vm_2'
2019-04-07 18:52:22,058+03 INFO [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[ebe30960-efcd-4b81-8c6f-262b311893bb=PROVIDER]', sharedLocks=''}'
2019-04-07 18:52:22,079+03 INFO [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Running command: SyncNetworkProviderCommand internal: true.
2019-04-07 18:52:22,257+03 INFO [org.ovirt.engine.core.sso.utils.AuthenticationUtils] (default task-50) [] User admin@internal successfully logged in with scopes: ovirt-app-api ovirt-ext=token-info:authz-search ovirt-ext=token-info:public-authz-search ovirt-ext=token-info:validate ovirt-ext=token:password-access
2019-04-07 18:52:22,507+03 INFO [org.ovirt.engine.core.bll.provider.network.SyncNetworkProviderCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-81) [101faad5] Lock freed to object 'EngineLock:{exclusiveLocks='[ebe30960-efcd-4b81-8c6f-262b311893bb=PROVIDER]', sharedLocks=''}'
2019-04-07 18:52:45,212+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM 'd5bf5866-2f72-4f0e-9997-60dc13d0004b'(shir_vm_2) moved from 'Paused' --> 'Up'
2019-04-07 18:52:45,346+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM shir_vm_2 has recovered from paused back to up.
2019-04-07 18:52:50,225+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] VM 'd5bf5866-2f72-4f0e-9997-60dc13d0004b'(shir_vm_2) moved from 'Up' --> 'NotResponding'
2019-04-07 18:52:50,243+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-21) [] EVENT_ID: VM_NOT_RESPONDING(126), VM shir_vm_2 is not responding.
2019-04-07 18:54:35,730+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: VM_NOT_RESPONDING(126), VM shir_vm_2 is not responding.

Steps to reproduce:
1. Create a VM with 2 gluster disks and an OS, and write to one of the disks.
2. Run LSM from the gluster domain to the iscsi domain.
3. Block the connection between the host running the VM and the gluster storage (iptables -A OUTPUT -d 10.35.83.240,241,242 (the 3 nodes of the gluster domain) -j DROP) - the host is moved to Non-Operational.
4. Restore the connection to the gluster storage - the host is moved Up again.

Actual results: The VM moved to "Paused" or "Not Responding" status.

Expected results: The VM moves back up.

Additional info: I can't finish verifying the bug, since the VM is not getting to "up" status in order to conduct LSM for the second time. A sketch of polling the host and VM status while running these steps follows the attached logs below.

Created attachment 1553285 [details]
logs_1566471
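
A minimal sketch of watching the host and VM status through the oVirt REST API while the gluster connection is blocked and restored, to capture the Paused / Not Responding transitions described above. The engine URL, credentials, polling interval, and the use of grep over the XML response are illustrative assumptions; the host and VM names come from the report.

```bash
#!/bin/bash
# Illustrative only: poll host and VM status via the oVirt REST API while
# reproducing the block/restore steps above.
ENGINE=https://engine.example.com/ovirt-engine/api   # placeholder engine URL
CREDS='admin@internal:password'                       # placeholder credentials

while true; do
    # Host status (expecting non_operational while the gluster storage is blocked)
    curl -ks -u "$CREDS" -H 'Accept: application/xml' \
        "$ENGINE/hosts?search=name%3Dhost_mixed_2" | grep -o '<status>[^<]*</status>'

    # VM status (the report saw paused / not_responding instead of up)
    curl -ks -u "$CREDS" -H 'Accept: application/xml' \
        "$ENGINE/vms?search=name%3Dshir_vm_2" | grep -o '<status>[^<]*</status>'

    sleep 30
done
```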
Shir, I think the best thing to do is open a new bug on the issues you see in comment #8 and mark the new bug as 'blocks' this bug.

(In reply to Avihai from comment #10)
> Shir, I think the best thing to do is open a new bug on the issues you see
> in comment #8 and mark the new bug as 'blocks' this bug.

Then we (Tal) will retarget this bug according to the blocker bug, and when the blocker bug is fixed we can also verify this bug. Benny, what do you think?

This bug is now defined as "blocks". I can't finish verifying the bug, since the VM is not getting to "up" status in order to conduct LSM for the second time. I have opened a new bug on this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1697261

Is this fixed in ovirt-engine-4.3.5?

(In reply to Sandro Bonazzola from comment #13)
> Is this fixed in ovirt-engine-4.3.5?

It is currently blocked on another bug.

Re-targeting to 4.3.6 since this bug is blocked by bug #1697261, which was re-targeted to 4.3.6 yesterday. If this needs to block the 4.3.5 release, please move both this and bug #1697261 back to 4.3.5.

(In reply to Sandro Bonazzola from comment #15)
> Re-targeting to 4.3.6 since this bug is blocked by bug #1697261, which was
> re-targeted to 4.3.6 yesterday.
> If this needs to block the 4.3.5 release, please move both this and bug #1697261
> back to 4.3.5.

Tal, see Sandro's comment #16: this bug is blocked by bug #1697261, which has been re-targeted to 4.3.6. Please retarget to 4.3.5.

(In reply to Avihai from comment #16)
> Tal, see Sandro's comment #16: this bug is blocked by bug #1697261, which has
> been re-targeted to 4.3.6. Please retarget to 4.3.5.

I meant retarget to 4.3.6.

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [No relevant external trackers attached] For more info please contact: infra

Tal, this bug is blocked by bug 1697261 (see comment 12), which has been re-targeted to 4.4. Please retarget this bug to 4.4 as well.

I'm closing this bug; there have been a lot of changes in the relevant flows since the report. Open a new bug on a current version if it's still relevant.