Bug 1434105
Summary: | Live storage migration failed | |
---|---|---|---
Product: | [oVirt] ovirt-engine | Reporter: | SATHEESARAN <sasundar>
Component: | BLL.Storage | Assignee: | Benny Zlotnik <bzlotnik>
Status: | CLOSED WONTFIX | QA Contact: | SATHEESARAN <sasundar>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.1.1.4 | CC: | alitke, bill.james, bugs, nsoffer, sabose, sasundar, tnisan
Target Milestone: | --- | Keywords: | Reopened
Target Release: | --- | Flags: | sbonazzo: ovirt-4.3-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-07-02 08:51:29 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1312909 | |
Bug Blocks: | 1277939, 1433896 | |
Attachments: | | |
Description SATHEESARAN 2017-03-20 17:49:48 UTC
Created attachment 1264818 [details]
compressed archived vdsm.log
Created attachment 1264819 [details]
compressed archived engine.log
The VM was running on host6, and I see the following errors in vdsm.log on host6:

# xzgrep ERROR /var/log/vdsm/vdsm.log.1.xz
2017-03-20 08:54:51,213+0530 ERROR (monitor/1025388) [storage.Monitor] Error checking domain 10253882-040a-4f28-a6b6-3c2015fb7594 (monitor:426)
2017-03-20 08:54:51,250+0530 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-3c2015fb7594/dom_md/metadata (monitor:485)
2017-03-20 08:54:55,631+0530 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-3c2015fb7594/dom_md/metadata (monitor:485)
2017-03-20 08:55:01,224+0530 ERROR (monitor/1025388) [storage.Monitor] Error checking domain 10253882-040a-4f28-a6b6-3c2015fb7594 (monitor:426)
2017-03-20 18:58:52,706+0530 ERROR (jsonrpc/3) [virt.vm] (vmId='226433a3-4a34-46d6-91b6-2a679c394ec9') Replication job unfinished (drive: 'sda', srcDisk: {u'device': u'disk', u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'80c9f20c-3dd7-466b-808d-62337f6cc153', u'domainID': u'4828a460-1825-4ba9-b456-64afad078198', u'imageID': u'da3c126e-d848-4b2e-b1fe-eb9c4cf6b04d'}, job: {'end': 5059117056L, 'bandwidth': 0L, 'type': 2, 'cur': 5040308224L}) (vm:3753)
2017-03-20 18:58:53,990+0530 ERROR (jsonrpc/1) [virt.vm] (vmId='226433a3-4a34-46d6-91b6-2a679c394ec9') Replication job unfinished (drive: 'sda', srcDisk: {u'device': u'disk', u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'80c9f20c-3dd7-466b-808d-62337f6cc153', u'domainID': u'4828a460-1825-4ba9-b456-64afad078198', u'imageID': u'da3c126e-d848-4b2e-b1fe-eb9c4cf6b04d'}, job: {'end': 5062262784L, 'bandwidth': 0L, 'type': 2, 'cur': 5061869568L}) (vm:3753)
2017-03-20 18:59:00,186+0530 ERROR (jsonrpc/0) [virt.vm] (vmId='226433a3-4a34-46d6-91b6-2a679c394ec9') Live merge failed (job: 140f8b05-293f-4347-a0d1-fcde17df1e38) (vm:4798)

Created attachment 1265006 [details]
vdsm.log from host6 which ran the VM
Seeing the following exception in the log. This seems to indicate a problem on the gluster side. Can you check to make sure your gluster storage wasn't reporting errors? I am not really sure where to look for that, but something has caused the gluster mount to become disconnected.

2017-03-20 08:54:55,631+0530 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-3c2015fb7594/dom_md/metadata (monitor:485)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 368, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-3c2015fb7594/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-3c2015fb7594/dom_md/metadata\': Transport endpoint is not connected\n"))

(In reply to Adam Litke from comment #5)
> Seeing the following exception in the log. This seems to indicate a problem
> on the gluster side. Can you check to make sure your gluster storage wasn't
> reporting errors? I am not really sure where to look for that but something
> has caused the gluster mount to become disconnected.
>
> 2017-03-20 08:54:55,631+0530 ERROR (check/loop) [storage.Monitor] Error
> checking path
> /rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-a6b6-
> 3c2015fb7594/dom_md/metadata (monitor:485)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/monitor.py", line 483, in _pathChecked
>     delay = result.delay()
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 368,
>   in delay
>     raise exception.MiscFileReadException(self.path, self.rc, self.err)
> MiscFileReadException: Internal file read failure:
> (u'/rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-
> a6b6-3c2015fb7594/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to
> open
> \'/rhev/data-center/mnt/glusterSD/10.70.36.73:_engine/10253882-040a-4f28-
> a6b6-3c2015fb7594/dom_md/metadata\': Transport endpoint is not connected\n"))

This error is from the HE gluster volume (HE storage domain). Does that affect the live migration on the other storage domains?

Sahina, you are correct. Those monitoring errors are unrelated. After further analyzing the engine logs, I am concerned that there could be a problem with the engine flow for LiveMigrateDiskCommand. This command performs the following actions in sequence:

1. CreateImagePlaceholder - Clone the disk structure from the source domain to the destination domain.
2. VmReplicateDiskStart - Instruct qemu to replicate the active volume of the source disk to the active layer of the destination disk and mirror all writes to both the source and destination disks. This is a two-phase process: first, data is copied from the source to the destination; once this converges, subsequent writes are sent to both the source and destination simultaneously (see the sketch after this list).
3. SyncImageGroupData - Sync the contents of the source disk's internal volumes to the corresponding destination disk volumes.
--- wait for SyncImageGroupData to complete ---
4. VmReplicateDiskFinish - Instruct qemu to pivot to the destination disk and suspend write mirroring on the source disk.
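For illustration only, here is a minimal sketch of how the convergence of step 2 can be observed through libvirt's block-job statistics; the cur/end counters are the same ones vdsm logs in the "Replication job unfinished" errors above. The connection URI, VM name, and drive name are hypothetical, and this is not vdsm's actual implementation:

```python
# Illustrative sketch only -- not vdsm's actual code.
import libvirt

def replication_converged(dom, drive):
    """Return True once the block-copy job for `drive` has reached the mirroring phase."""
    info = dom.blockJobInfo(drive, 0)
    if not info:
        # No active block job: replication never started or was already aborted.
        return False
    # libvirt reports bytes processed ('cur') and total job length ('end');
    # the job has converged only when the two are equal.
    return info['cur'] == info['end']

conn = libvirt.open('qemu:///system')   # hypothetical connection
dom = conn.lookupByName('my-vm')        # hypothetical VM name
if replication_converged(dom, 'sda'):
    # Safe to pivot to the destination disk (step 4) and stop mirroring.
    dom.blockJobAbort('sda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
```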
This only works if step 2 has converged to the mirroring phase and step 3 is also complete. What appears to be happening in the attached logs is that, due to high I/O, step 2 is unable to converge before step 4 is executed. This is usually not a problem, since step 3 can take a relatively long time and with low I/O step 2 will converge quickly. The following log entry confirms that step 2 did not converge (cur != end):

2017-03-20 18:58:52,706+0530 ERROR (jsonrpc/3) [virt.vm] (vmId='226433a3-4a34-46d6-91b6-2a679c394ec9') Replication job unfinished (drive: 'sda', srcDisk: {u'device': u'disk', u'poolID': u'00000001-0001-0001-0001-000000000311', u'volumeID': u'80c9f20c-3dd7-466b-808d-62337f6cc153', u'domainID': u'4828a460-1825-4ba9-b456-64afad078198', u'imageID': u'da3c126e-d848-4b2e-b1fe-eb9c4cf6b04d'}, job: {'end': 5059117056L, 'bandwidth': 0L, 'type': 2, 'cur': 5040308224L}) (vm:3753)

I believe the engine needs to add an additional step before VmReplicateDiskFinish that checks that the replication has converged. Alternatively, we could wait and retry VmReplicateDiskFinish a number of times before failing the entire operation (see the sketch at the end of this report).

I just talked to Liron and this is a known issue that already has an assigned bug.

*** This bug has been marked as a duplicate of bug 1312909 ***

*** Bug 1440198 has been marked as a duplicate of this bug. ***

The whole LSM flow was changed; please reopen if the issue still exists.
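As referenced above, here is a minimal sketch of the wait-and-retry approach proposed in the analysis: poll the replication job for convergence, attempt the pivot only once it has converged, and retry a bounded number of times before failing the operation. All names are hypothetical stand-ins, not the engine's actual LSM code (which, as noted, was later reworked):

```python
# Illustrative sketch only -- not ovirt-engine's actual LiveMigrateDiskCommand flow.
import time

class ReplicationNotConverged(Exception):
    pass

def finish_replication(vm, drive, retries=10, interval=15):
    """Run VmReplicateDiskFinish only after the copy job has converged.

    `vm.replication_stats` and `vm.replicate_disk_finish` are hypothetical
    stand-ins for the verbs the engine would send to vdsm.
    """
    for _ in range(retries):
        stats = vm.replication_stats(drive)   # hypothetical: returns {'cur': ..., 'end': ...}
        if stats and stats['cur'] == stats['end']:
            # Converged: pivot to the destination disk and stop mirroring.
            return vm.replicate_disk_finish(drive)
        # Heavy guest I/O keeps 'cur' behind 'end'; wait and check again.
        time.sleep(interval)
    # Retry budget exhausted: fail instead of pivoting prematurely.
    raise ReplicationNotConverged(drive)
```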