| Summary: | Snapshot remains locked when performing live-merge on HSM | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Kevin Alon Goldblatt <kgoldbla> |
| Component: | Core | Assignee: | Ala Hino <ahino> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.18.15.2 | CC: | ahino, amureini, bugs, danken, fromani, gklein, kgoldbla, ratamir, rgolan, tnisan, ylavi |
| Target Milestone: | ovirt-4.0.6 | Keywords: | Regression |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.0.z+, rule-engine: blocker+ |
| Hardware: | Unspecified | | |
| OS: | All | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-18 07:29:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Created attachment 1226330 [details]
server, vdsm, engine.log
Added Logs
Francesco,
While looking into this BZ, I found that the merge command wasn't received by vdsm at all and the engine got a network timeout. When running top on the host, I found that vdsm consumes almost 100% CPU. I double-checked and found that before the live merge vdsm consumed almost no CPU (as expected, since the host is not really doing anything), and after starting the live merge vdsm immediately jumps to ~100% CPU and keeps consuming 100% even after the engine receives the timeout.
The VM I use to test this is a really 'thin' VM with a 1 GB disk and no OS installed on it.
In the vdsm log I see the following error:
2016-12-04 11:18:57,008 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.0.131', 59896, 0, 0) at 0x3ca1710>: unexpected eof (betterAsyncore:113)
Dan mentioned that Mor Kalfon recently reported something similar when he tested OVN migration. Dan suggested consulting you, since changes were made in the virt code to accommodate post-copy migration.
I would appreciate it if you could take a look and advise.
(In reply to Ala Hino from comment #2)
> Francesco,
> [...]ore:113)
> Dan mentioned that Mor Kalfon recently reported something similar when he
> tested OVN migration. Dan suggested consulting you, since changes were made
> in the virt code to accommodate post-copy migration.

True, but those changes were made on the master branch, with no intention of backporting them, while here it seems we are talking about 4.0.z, aren't we?

> I would appreciate it if you could take a look and advise.

On it.

(In reply to Francesco Romani from comment #3)
> True, but those changes were made on the master branch, with no intention of
> backporting them, while here it seems we are talking about 4.0.z, aren't we?
>
> On it.

We need more vdsm logs to understand what happened on the vdsm side. I don't see the beginning of the operation. I can tell that somehow vdsm lost its state:

periodic/2::ERROR::2016-11-24 16:43:33,868::executor::232::Executor::(_execute_task) Unhandled exception in Task(callable=<BlockjobMonitor vm=20a5e29e-5576-4e2b-8901-4147bfba13ab at 0x29f74d0>, timeout=7.5)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 230, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 302, in __call__
    self._execute()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 373, in _execute
    self._vm.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4506, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4529, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3248, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', 'imageID': '2df48dca-c4d1-431d-81e6-08676f7a469f', 'volumeID': '1ac5ad49-3e63-454a-954b-6f074acd4ed6', 'poolID': 'f30058ff-5799-45b2-a272-22b2d198aa16'}'

But this is not the cause; it is the periodic update that fails because vdsm now has stale information. Please make sure both vdsm and Engine are updated properly and the communication between the two works fine.

Added a new set of logs:
NB! To make the scenario easier to find in the engine log, it is bracketed with markers:
echo "testone_start" >> engine.log
echo "testone_end" >> engine.log
NB! The same markers bracket the scenario in the SPM and HSM vdsm logs:
echo "testone_start" >> vdsm.log
echo "testone_end" >> vdsm.log
Verified again with the following versions:
---------------------------------------------------
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
ovirt-engine-4.0.6-0.1.el7ev.noarch
How reproducible:
Every time
Steps to Reproduce:
1. Create a VM with a disk on the SPM and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center
(A minimal Python SDK sketch of steps 6-7 follows below.)
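For reference, steps 6-7 can be driven programmatically. The following is a minimal sketch assuming the ovirt-engine-sdk4 Python bindings; the engine URL, credentials, VM name and snapshot description are placeholders, and waiting for the new snapshot to reach the OK state before deleting it is omitted for brevity.

# Sketch: create a snapshot on a running VM and then delete it, which triggers
# the live merge on the host running the VM (the HSM in this scenario).
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,  # use ca_file=... in a real setup
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=testone')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Step 6: create a new snapshot while the VM runs on the HSM host.
snapshot = snapshots_service.add(types.Snapshot(description='sns_hms_testone'))

# Step 7: delete the snapshot; in the buggy build it stays LOCKED and its
# volume link remains under /rhev/data-center on the HSM host.
snapshots_service.snapshot_service(snapshot.id).remove()

connection.close()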
Actual results:
snapshot remains LOCKED and snapshot link is NOT deleted in /rhev/data-center
Expected results:
Additional info:
HSM
Before
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
│ │ │ ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
With snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
│ │ │ ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
│ │ │ └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7
After deleting snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
│ │ │ ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
│ │ │ └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7
^^^^^^SHOULD HAVE BEEN DELETED^^^^^^^
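A quick way to confirm on the HSM host whether the merged volume's link is really gone is to list the image directory under /rhev/data-center and show where each link points. The sketch below is only illustrative: the pool, domain and image UUIDs are taken from the trees and logs in this bug, and the /rhev/data-center/<pool>/<domain>/images/<image> layout is an assumption about the host.

# Sketch: list the volume links of one image on the HSM host.
import os

POOL = 'f30058ff-5799-45b2-a272-22b2d198aa16'    # storage pool UUID (from the logs)
DOMAIN = '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e'  # storage domain UUID
IMAGE = 'adabaeb2-09c7-407c-a93d-5144f52b2a4f'   # image group shown above

image_dir = os.path.join('/rhev/data-center', POOL, DOMAIN, 'images', IMAGE)

for name in sorted(os.listdir(image_dir)):
    link = os.path.join(image_dir, name)
    target = os.readlink(link) if os.path.islink(link) else '(not a symlink)'
    print('%s -> %s' % (name, target))

# After a successful live merge the deleted snapshot volume should no longer
# be listed here; in the failing case its link is still present.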
From Engine.log
-------------------------------
ternal: true. Entities affected : ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2016-12-06 19:34:10,181 INFO [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (pool-5-thread-7) [1489e1f1] Running command: RemoveMemoryVolumesCommand internal: true.
2016-12-06 19:34:10,217 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [] Correlation ID: 5cab2dba, Job ID: 9056c6db-b868-4807-8776-ba8c48183a0b, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sns_hms_testone' deletion for VM 'testone' was initiated by admin@internal-authz.
2016-12-06 19:34:10,254 INFO [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Lock Acquired to object 'EngineLock:{exclusiveLocks='[40c3733e-b0b4-4f72-aa09-a903dacba9cd=<DISK, ACTION_TYPE_FAILED_DISK_IS_BEING_REMOVED$DiskName snapshot_memory>]', sharedLocks='null'}'
2016-12-06 19:34:10,273 INFO [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Running command: RemoveDiskCommand internal: true. Entities affected : ID: 40c3733e-b0b4-4f72-aa09-a903dacba9cd Type: DiskAction group DELETE_DISK with role type USER
Created attachment 1228663 [details]
server, vdsm, engine.log
added new logs
Turns out https://bugzilla.redhat.com/show_bug.cgi?id=1400137#c2 is the first blocker, and https://gerrit.ovirt.org/#/c/67902/ will fix it. While 67902 is surely needed, I'm not sure yet it's the only fix we need for this BZ.

Please note that 67902 is master only. The patch that broke master never landed on 4.0.z (and there are no plans for it), and this BZ refers to 4.0.5.

Ala, you wrote that the engine misses a task; that could well be explained by Bug 1401585, which specifically fixes this missing-task scenario.

*** Bug 1400707 has been marked as a duplicate of this bug. ***

This will only be included in v4.18.19 and onwards, moving back to MODIFIED.

(In reply to Roy Golan from comment #9)
> Ala, you wrote that the engine misses a task; that could well be explained
> by Bug 1401585, which specifically fixes this missing-task scenario.

Thanks to Roy for raising this, but eventually the root cause wasn't related to BZ 1401585.

Created attachment 1231283 [details]
engine and hsm logs
Still not working - the snapshot remains in a locked state. From the logs of the HSM host that is running the VM:
Thread-2783::ERROR::2016-12-13 18:03:08,878::utils::375::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 372, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5003, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 4989, in teardown_top_volume
    sd_manifest = sdc.sdCache.produce_manifest(self.drive.domainID)
AttributeError: StorageDomainCache instance has no attribute 'produce_manifest'
* attached
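The AttributeError above points at a vdsm build whose StorageDomainCache does not provide produce_manifest; the next comment confirms the host had an older vdsm installed, and an earlier comment states the fix is only included from v4.18.19 onwards. A quick sanity check on the host could look like the sketch below; the rpm query and the simple version parsing are assumptions about an RPM-based host, not part of the fix.

# Sketch: confirm the host runs a vdsm build recent enough to contain the fix
# (v4.18.19+ per the comments above). Assumes an RPM-based host.
import subprocess

out = subprocess.check_output(['rpm', '-q', '--queryformat', '%{VERSION}', 'vdsm'])
version = tuple(int(part) for part in out.decode().split('.')[:3])
print('installed vdsm version: ' + '.'.join(str(part) for part in version))
if version < (4, 18, 19):
    print('this build predates the fix; the produce_manifest AttributeError is expected')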
Seems like this host had the wrong vdsm, vdsm-4.18.18-1.el7ev.x86_64. Moving back to ON_QA.

Verified with the following versions:
----------------------------------------------
vdsm-4.18.20-1.el7ev.x86_64 SPM
vdsm-4.18.20-1.el7ev.x86_64 HSM
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
Every time

Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. Create a VM with a disk on the SPM and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM
6. Create a new snapshot
7. Delete the snapshot - the snapshot is successfully deleted and its links are deleted

Actual results:
The snapshot is deleted and its link removed in /rhev/data-center
Description of problem:
Snapshot delete on HSM fails - works fine on SPM

Version-Release number of selected component (if applicable):
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
rhevm-4.0.6-0.1.el7ev.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Create a VM with a disk on the SPM and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Actual results:
The snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Expected results:

Additional info:

SPM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036
With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ ├── 79463314-d607-4f93-82e9-372c84697c1f -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/79463314-d607-4f93-82e9-372c84697c1f
│ │ │ └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036
After deleting snapshot
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036

HSM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
│ │ │ └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036
After deleting snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
│ │ │ ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
│ │ │ └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036
SHOULD HAVE BEEN DELETED

From engine.log
-----------------------------------------------------------------
2016-11-24 16:42:40,181 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [] Correlation ID: 5e2a8d2d, Job ID: f03fe7bf-9e3a-433b-8356-8c11d6bd9cb1, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snapper1' deletion for VM 'bbbb' was initiated by admin@internal-authz.
2016-11-24 16:42:40,184 INFO [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-3) [21f25490] Running command: RemoveDiskCommand internal: true. Entities affected : ID: aab7e02f-229f-4bf3-8942-634858e873f5 Type: DiskAction group DELETE_DISK with role type USER
2016-11-24 16:42:40,191 INFO [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (pool-5-thread-3) [7ea6c4ea] Running command: RemoveImageCommand internal: true. Entities affected : ID: 38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e Type: Storage
2016-11-24 16:42:40,228 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-3) [7ea6c4ea] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='f30058ff-5799-45b2-a272-22b2d198aa16', ignoreFailoverLimit='false', storageDomainId='38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', imageGroupId='aab7e02f-229f-4bf3-8942-634858e873f5', postZeros='false', forceDelete='false'}), log id: 223961f5