Bug 1315306
Summary: Removing master storage from DC fails

Product: [oVirt] ovirt-engine
Component: BLL.Storage
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Version: 3.6.3.3
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Hardware: x86_64
OS: Linux
Reporter: nicolas
Assignee: Liron Aravot <laravot>
QA Contact: Aharon Canan <acanan>
CC: amureini, bugs, laravot, nicolas, sabose, v.astafiev, ylavi
Flags: amureini: ovirt-4.1?, rule-engine: planning_ack?, rule-engine: devel_ack?, rule-engine: testing_ack?
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Last Closed: 2016-08-21 12:08:55 UTC
Attachments: engine.log, vdsm.log
Created attachment 1133729 [details]
vdsm.log
Liron, haven't we seen a similar issue before?

Allon, not that I remember.

Thanks for reporting, Nicolas. In the provided logs we can see that the deactivation fails because of a failed master migration - we fail when copying the master fs (see [1]). I can see that this issue happens from time to time when the master storage domain is on Gluster (see https://bugzilla.redhat.com/show_bug.cgi?id=1248453), as in your use case. Just to verify, can you please provide the following info?

1. Have you tried to deactivate the domain again? Was the issue consistent?
2. Have you performed any operations on the source/target domain storage that could have caused the copy failure?

We need to dig further into this issue (possibly with the Gluster team).

[1]-
-9bec08957c5f/master/vms
jsonrpc.Executor/0::DEBUG::2016-03-02 13:01:50,575::fileUtils::124::Storage.fileUtils::(cleanupdir) Removing directory: /rhev/data-center/mnt/blockSD/9339780c-3667-4fef-aa13-9bec08957c5f/master/tasks
jsonrpc.Executor/0::ERROR::2016-03-02 13:01:50,713::sp::820::Storage.StoragePool::(masterMigrate) migration to new master failed
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 809, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/share/vdsm/storage/fileUtils.py", line 68, in tarCopy
    raise TarCopyFailed(tsrc.returncode, tdst.returncode, out, err)
TarCopyFailed: (1, 0, '', '')
jsonrpc.Executor/0::DEBUG::2016-03-02 13:01:50,721::persistentDict::167::Storage.PersistentDict::(transaction) Starting transaction

Sahina, can you take a look at this bug (and possibly at BZ 1248453, which includes system logs as well) and let us know if this issue or its cause is known?

thanks,
Liron.

Liron,
1. Yes, several colleagues tried the same operation and all of them failed. In total, 5-7 attempts.
2.
Not that I know of; we basically moved everything from glusterfs to iSCSI. Once we were sure there was nothing left on glusterfs, after a few days we wanted to detach it, so we tried putting it into maintenance and it crashed every time. What I don't remember is whether the SPM changed after each of these attempts, but I remember that on at least two of them it did.

(In reply to Liron Aravot from comment #4)
> Sahina, can you take a look in this bug (and possibly in BZ 1248453 which
> includes system logs as well) and let us know if this issue or its cause are
> known?
>
> thanks,
> Liron.

As per the gluster devs, for bug 1248453 the error was "TarCopyFailed: (1, 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is 23312.863365733 s in the future" and is possibly related to Bug 1312721 in glusterfs (fixed in 3.7.9).

This bug has no gluster logs, and the error in vdsm.log has no error message to indicate the cause.

Is it possible to set the below option on your gluster volume and try to migrate again?

# gluster volume set <volname> cluster.consistent-metadata on

(This option has a performance impact - but that should be ok in your case, as you're trying to detach the gluster domain.)

At some point within the next 2 weeks we're removing the gluster storage domain from our second oVirt infrastructure too, so I'll make that change and provide feedback then.

Today we removed the glusterfs-based backend. As a first approach we didn't touch the configuration, so that we could reproduce the problem first and confirm it happens on both infrastructures, but unfortunately we weren't able to. We put the storage domain backend into maintenance and it went smoothly - we're using the same versions of engine & VDSM on both infrastructures, which makes it even stranger. FWIW, although both infrastructures used glusterfs backends, each oVirt installation had different servers and volumes.

Nicolas, can you please specify what version of gluster you are using?
And possibly attach the gluster logs?

(In reply to Sahina Bose from comment #6)
> As per gluster dev, for the bug 1248453 - the error was "TarCopyFailed: (1,
> 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is
> 23312.863365733 s in the future" and is possibly related to Bug 1312721 in
> glusterfs (fixed in 3.7.9)

The bug isn't caused/solved by the gluster BZ; it's about a general copy using tar when the clocks aren't synced (let's continue the discussion about it in that BZ) - so it's not related to our bug here.

> This bug has no gluster logs and the error in vdsm.log has no error message
> to indicate the cause.
>
> Is it possible to set below option on your gluster volume and try to migrate
> again?
> # gluster volume set <volname> cluster.consistent-metadata on
>
> (This option has performance impact - but should be ok in your case as
> you're trying to detach the gluster domain)

In comment 9 I've asked Nicolas to provide the logs if possible.

thanks,
Liron.

(In reply to Sahina Bose from comment #6)
> As per gluster dev, for the bug 1248453 - the error was "TarCopyFailed: (1,
> 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is
> 23312.863365733 s in the future" and is possibly related to Bug 1312721 in
> glusterfs (fixed in 3.7.9)
>
> This bug has no gluster logs and the error in vdsm.log has no error message
> to indicate the cause.

Sahina - the tar warnings about the timestamp in bug 1248453 aren't the cause of the failure (they're just warnings on stderr), so if needed you can use the info provided there to diagnose the gluster copy failure (until the info is provided here as well). Thanks.
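For context on the TarCopyFailed traceback quoted earlier: vdsm's fileUtils.tarCopy copies the master filesystem by piping a source tar into a destination tar and raising when either side exits non-zero. The sketch below is a minimal, hypothetical reconstruction of that pattern, not the actual vdsm code; the class and function here are illustrative stand-ins:

```python
import subprocess


class TarCopyFailed(RuntimeError):
    """Illustrative stand-in for vdsm's storage.fileUtils.TarCopyFailed.

    Carries (src_rc, dst_rc, out, err), matching the shape seen in the
    traceback above: TarCopyFailed: (1, 0, '', '').
    """


def tar_copy(src, dst, exclude=()):
    """Copy a directory tree by piping 'tar -c' into 'tar -x'.

    A minimal sketch of the tar-pipe pattern; NOT the actual vdsm
    implementation.
    """
    excl = ["--exclude=%s" % e for e in exclude]
    tsrc = subprocess.Popen(
        ["tar", "-cf", "-"] + excl + ["-C", src, "."],
        stdout=subprocess.PIPE)
    tdst = subprocess.Popen(
        ["tar", "-xf", "-", "-C", dst],
        stdin=tsrc.stdout,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    tsrc.stdout.close()  # let tdst see EOF once tsrc exits
    out, err = tdst.communicate()
    tsrc.wait()
    if tsrc.returncode != 0 or tdst.returncode != 0:
        # In this bug the source tar exited 1 while the destination tar
        # exited 0, with empty out/err -- hence the uninformative log line.
        raise TarCopyFailed((tsrc.returncode, tdst.returncode, out, err))
    return out, err
```

Note how the exception carries only the exit codes and the destination tar's captured streams; when the source tar fails silently, the log ends up with exactly the unhelpful `(1, 0, '', '')` seen here.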
glusterfs versions are:
3.7.2-3 for client (oVirt nodes)
3.6.3 for server

We needed to use this version for the client because any newer version wouldn't work with this server version (we tried mounting the volume manually and it failed). Unfortunately, at this point we have no logs for the time this happened, as they have rotated.

Apologies - I haven't been able to spend time on this bug. From the attached logs in bug 1248453 I could only see tar copy errors on migration - no gluster-specific errors. Unless we have a reproducer with newer versions, can we close this bug?

Can you reproduce with the latest gluster/ovirt?

We don't use gluster anymore since it caused a lot of problems, so we don't have a gluster infrastructure on which to try to reproduce it. I think it is ok to close this BZ, since no one else seems to have complained about this issue and I cannot provide additional info at this time.
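Independent of gluster internals, the "time stamp ... in the future" symptom from bug 1248453 can be pre-checked before retrying a migration by scanning the mounted domain for files whose mtime is ahead of the local clock. This is a hypothetical diagnostic helper, not part of vdsm or gluster; the function name and tolerance are assumptions:

```python
import os
import time


def files_with_future_mtime(root, tolerance_s=1.0):
    """Scan a tree for files whose mtime is ahead of the local clock.

    A hypothetical diagnostic (not from vdsm or gluster) for the
    symptom in bug 1248453, where tar reported a time stamp thousands
    of seconds in the future -- usually a sign of unsynced clocks
    between the hosts writing to the gluster volume.
    """
    now = time.time()
    hits = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished mid-scan; ignore it
            if mtime > now + tolerance_s:
                hits.append((path, mtime - now))
    return hits
```

Running it against the domain's mount point (e.g. under /rhev/data-center/mnt/) and finding hits would point at clock skew rather than a gluster-side copy bug.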
Created attachment 1133728 [details]
engine.log

Description of problem:
We've migrated storage from glusterfs to iSCSI, so now we have 2 storage domains in our data center. As we've already finished, we want to remove the gluster storage from our data center (it is the master storage right now). We've tried to put it into maintenance, but we're getting this error:

2016-03-02 13:02:02,087 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-34) [259a3130] Command 'DisconnectStoragePoolVDSCommand(HostName = ovirt01.domain.com [1], DisconnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='c31dca1a-e5bc-43f6-940f-6397e3ddbee4', storagePoolId='fa155d43-4e68-486f-9f9d-ae3e3916cc4f', vds_spm_id='7'})' execution failed: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Operation not allowed while SPM is active: ('fa155d43-4e68-486f-9f9d-ae3e3916cc4f',), code = 656

The DC then set its status to Down; however, after a while it auto-recovered and was Up again. No harm to the VMs; however, we couldn't remove the storage domain until we put all hosts into maintenance (which in turn means that all VMs had to be shut down).

Version-Release number of selected component (if applicable):
3.6.3.4

How reproducible:
Not sure; we tried 2 times before deciding to set all hosts to maintenance.

Steps to Reproduce:
1. Click on the Data Center tab
2. Click on the Storage sub-tab
3. Select the storage domain to remove
4. Click on "Maintenance"

Actual results:
After a while, the data center is set Down. After a while it recovers; however, we noticed that the SPM was then on a different node (and after the DC was Up again, the storage domain was still active).

Expected results:
The storage domain should have been set to maintenance so we could detach it.
Additional info: engine.log + vdsm.log

As mentioned, we finally put all nodes into maintenance, and that way we were able to remove the storage domain. We have another oVirt infrastructure where we can test further if needed.