Bug 1315306

Summary: Removing master storage from DC fails
Product: [oVirt] ovirt-engine
Reporter: nicolas
Component: BLL.Storage
Assignee: Liron Aravot <laravot>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Aharon Canan <acanan>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.6.3.3
CC: amureini, bugs, laravot, nicolas, sabose, v.astafiev, ylavi
Target Milestone: ovirt-4.1.0-alpha
Flags: amureini: ovirt-4.1?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-21 12:08:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
engine.log (flags: none)
vdsm.log (flags: none)

Description nicolas 2016-03-07 12:45:48 UTC
Created attachment 1133728 [details]
engine.log

Description of problem:

We've migrated our storage from glusterfs to iSCSI, so we now have 2
storage domains in our data center. Since the migration is finished, we
want to remove the gluster storage domain from the data center (it is
currently the master storage domain).

We tried to put it into maintenance, but we get this error:

2016-03-02 13:02:02,087 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand]
(org.ovirt.thread.pool-8-thread-34) [259a3130] Command
'DisconnectStoragePoolVDSCommand(HostName = ovirt01.domain.com [1],
DisconnectStoragePoolVDSCommandParameters:{runAsync='true',
hostId='c31dca1a-e5bc-43f6-940f-6397e3ddbee4',
storagePoolId='fa155d43-4e68-486f-9f9d-ae3e3916cc4f',
vds_spm_id='7'})' execution failed: VDSGenericException:
VDSErrorException: Failed to DisconnectStoragePoolVDS, error =
Operation not allowed while SPM is active:
('fa155d43-4e68-486f-9f9d-ae3e3916cc4f',), code = 656

The DC then went to status Down; after a while it auto-recovered and was Up again. No harm was done to the VMs, but we couldn't remove the storage domain until we put all hosts into maintenance (which in turn means that all VMs had to be shut down).
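
For what it's worth, a rough sketch of how to check whether a host currently holds the SPM role for the pool mentioned in the error above; this assumes vdsClient is available on the host and that its getSpmStatus verb behaves as described (only the pool ID comes from the log):

# Rough sketch (assumption: vdsClient is installed on the host and exposes
# the getSpmStatus verb): ask the local vdsm for the SPM status of the pool.
import subprocess

POOL_ID = "fa155d43-4e68-486f-9f9d-ae3e3916cc4f"  # storagePoolId from the engine log

# '-s 0' talks to the local vdsm over TLS; the output should include the
# spmStatus (SPM or Free), spmLver and spmId fields.
print(subprocess.check_output(
    ["vdsClient", "-s", "0", "getSpmStatus", POOL_ID]).decode())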

Version-Release number of selected component (if applicable):

3.6.3.4

How reproducible:

Not sure; we tried twice before we decided to put all hosts into maintenance.

Steps to Reproduce:
1. Click on the Data Center tab
2. Click on the Storage sub-tab
3. Select the storage domain to remove
4. Click on "Maintenance"
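
For reference, the same operation can also be driven through the engine's REST API instead of the web UI; a minimal sketch, assuming the v3 API layout and placeholder IDs/credentials (only the data center ID below comes from the logs):

# Minimal sketch (assumptions: v3 REST API paths, placeholder host, storage
# domain ID and credentials): deactivate an attached storage domain, which is
# the API equivalent of clicking "Maintenance" in the UI.
import requests

ENGINE = "https://engine.example.com/ovirt-engine/api"   # placeholder
DC_ID = "fa155d43-4e68-486f-9f9d-ae3e3916cc4f"           # storage pool id from the log
SD_ID = "<gluster-storage-domain-id>"                    # placeholder

resp = requests.post(
    "{0}/datacenters/{1}/storagedomains/{2}/deactivate".format(ENGINE, DC_ID, SD_ID),
    auth=("admin@internal", "password"),                 # placeholder credentials
    headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    data="<action/>",
    verify=False,                                        # or point at the engine CA bundle
)
resp.raise_for_status()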

Actual results:

After a while, the data center is set to Down. It later recovered; however, we noticed that the SPM was now on a different node (and after the DC was Up again, the storage domain was still active).

Expected results:

The storage domain should have been set to maintenance so we could detach it.

Additional info:

engine.log + vdsm.log

As mentioned, we finally put all nodes on maintenance and that way we were able to remove the storage domain. We have another oVirt infrastructure where we can test further if needed.

Comment 1 nicolas 2016-03-07 12:46:18 UTC
Created attachment 1133729 [details]
vdsm.log

Comment 2 Allon Mureinik 2016-03-07 13:15:47 UTC
Liron, haven't we seen a similar issue before?

Comment 3 Liron Aravot 2016-03-08 12:47:46 UTC
Allon, not that I remember.


Thanks for reporting Nicolas

In the provided logs we can see that the deactivation fails because of a failed master migration - the copy of the master filesystem fails (see [1]).

I can see that this issue happens from time to time when the master storage domain is on Gluster (see https://bugzilla.redhat.com/show_bug.cgi?id=1248453), as in your use case.


Just to verify, can you please provide the following info?
1. Have you tried to deactivate the domain again? Was the issue consistent?
2. Have you performed any operations on the source/target domains' storage that could cause the copy failure?


We need to dig further into that issue (possibly with the Gluster team).


[1]-
-9bec08957c5f/master/vms
jsonrpc.Executor/0::DEBUG::2016-03-02 13:01:50,575::fileUtils::124::Storage.fileUtils::(cleanupdir) Removing directory: /rhev/data-center/mnt/blockSD/9339780c-3667-4fef-aa13
-9bec08957c5f/master/tasks
jsonrpc.Executor/0::ERROR::2016-03-02 13:01:50,713::sp::820::Storage.StoragePool::(masterMigrate) migration to new master failed
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 809, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/share/vdsm/storage/fileUtils.py", line 68, in tarCopy
    raise TarCopyFailed(tsrc.returncode, tdst.returncode, out, err)
TarCopyFailed: (1, 0, '', '')
jsonrpc.Executor/0::DEBUG::2016-03-02 13:01:50,721::persistentDict::167::Storage.PersistentDict::(transaction) Starting transaction
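
For reference, the TarCopyFailed arguments above are (source tar rc, destination tar rc, stdout, stderr), so (1, 0, '', '') means the reading tar exited non-zero with nothing useful captured on stderr. Roughly, the master migration streams the old master filesystem through a tar pipe; a simplified sketch of such a copy (an approximation for illustration, not the actual vdsm code):

# Simplified sketch of a tar-pipe copy along the lines of fileUtils.tarCopy
# (illustrative only, not the vdsm implementation).
import subprocess

def tar_copy(src, dst):
    # Reader tar archives the source directory to stdout; writer tar unpacks
    # the stream into the destination directory.
    tsrc = subprocess.Popen(["tar", "cf", "-", "-C", src, "."],
                            stdout=subprocess.PIPE)
    tdst = subprocess.Popen(["tar", "xf", "-", "-C", dst],
                            stdin=tsrc.stdout,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    tsrc.stdout.close()
    out, err = tdst.communicate()
    tsrc.wait()
    if tsrc.returncode != 0 or tdst.returncode != 0:
        # Mirrors the (src_rc, dst_rc, out, err) tuple seen in the traceback.
        raise RuntimeError((tsrc.returncode, tdst.returncode, out, err))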

Comment 4 Liron Aravot 2016-03-08 12:59:49 UTC
Sahina, can you take a look in this bug (and possibly in BZ 1248453 which includes system logs as well) and let us know if this issue or its cause are known?

thanks,
Liron.

Comment 5 nicolas 2016-03-08 13:54:33 UTC
Liron,

1. Yes, several colleagues tried the same operation and all of them failed. In total, 5-7 attempts.
2. Not that I know of. We basically moved everything from glusterfs to iSCSI; once we were sure there was nothing left on glusterfs, after a few days we wanted to detach it, so we tried putting it into maintenance and it failed every time. What I don't remember is whether the SPM changed after each of these attempts, but I remember that on at least two of them it did.

Comment 6 Sahina Bose 2016-03-09 08:05:27 UTC
(In reply to Liron Aravot from comment #4)
> Sahina, can you take a look in this bug (and possibly in BZ 1248453 which
> includes system logs as well) and let us know if this issue or its cause are
> known?
> 
> thanks,
> Liron.

As per gluster dev, for the bug 1248453 - the error was "TarCopyFailed: (1, 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is 23312.863365733 s in the future" and is possibly related to Bug 1312721 in glusterfs (fixed in 3.7.9)

This bug has no gluster logs and the error in vdsm.log has no error message to indicate the cause.

Is it possible to set the option below on your gluster volume and try to migrate again?
# gluster volume set <volname> cluster.consistent-metadata on

(This option has performance impact - but should be ok in your case as you're trying to detach the gluster domain)

Comment 7 nicolas 2016-03-09 08:19:19 UTC
At some point within the next 2 weeks we're removing the gluster storage domain from our second oVirt infrastructure too, so I'll make that change and provide feedback then.

Comment 8 nicolas 2016-03-17 08:28:52 UTC
Today we removed the glusterfs-based backend. As a first approach we didn't touch the configuration, so that we could first reproduce the problem and confirm it happens on both infrastructures, but unfortunately we weren't able to.

We put the storage domain into maintenance and it went smoothly - we're using the same versions of engine & VDSM on both infrastructures, which makes it even stranger.

FWIW, although both infrastructures used glusterfs backends, they were different servers and volumes for each oVirt installation.

Comment 9 Liron Aravot 2016-03-22 12:29:05 UTC
Nicolas, can you please specify which version of gluster you are using? And possibly attach the gluster logs?

Comment 10 Liron Aravot 2016-03-22 12:38:24 UTC
(In reply to Sahina Bose from comment #6)
> As per gluster dev, for the bug 1248453 - the error was "TarCopyFailed: (1,
> 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is
> 23312.863365733 s in the future" and is possibly related to Bug 1312721 in
> glusterfs (fixed in 3.7.9)
> 

The bug isn't caused/solved by the gluster BZ; it relates to a general problem with copying via tar when the clocks aren't synced (let's continue that discussion in the BZ) - so it's not related to our bug here.

> This bug has no gluster logs and the error in vdsm.log has no error message
> to indicate the cause.
> 
> Is it possible to set below option on your gluster volume and try to migrate
> again?
> # gluster volume set <volname> cluster.consistent-metadata on
> 
> (This option has performance impact - but should be ok in your case as
> you're trying to detach the gluster domain)

In comment 9 I've asked Nicolas to provide the logs if possible.

thanks,
Liron.

Comment 11 Liron Aravot 2016-03-23 13:53:39 UTC
(In reply to Sahina Bose from comment #6)
> (In reply to Liron Aravot from comment #4)
> > Sahina, can you take a look in this bug (and possibly in BZ 1248453 which
> > includes system logs as well) and let us know if this issue or its cause are
> > known?
> > 
> > thanks,
> > Liron.
> 
> As per gluster dev, for the bug 1248453 - the error was "TarCopyFailed: (1,
> 0, '', '/usr/bin/tar: ./vms: time stamp 2015-07-30 14:59:44 is
> 23312.863365733 s in the future" and is possibly related to Bug 1312721 in
> glusterfs (fixed in 3.7.9)
> 
> This bug has no gluster logs and the error in vdsm.log has no error message
> to indicate the cause.
> 

Sahina - the tar timestamp warnings in bug 1248453 aren't the cause of the failure (they are just warnings on stderr), so if needed you can use the info provided there to diagnose the gluster copy failure (until the info is provided here as well). Thanks.

Comment 12 nicolas 2016-03-28 08:19:55 UTC
glusterfs versions are:

3.7.2-3 for client (oVirt nodes)
3.6.3 for server

We needed to use this version for the client because any newer version wouldn't work with this server version (we tried mounting the volume manually and it failed).

Unfortunately, at this time we have no logs from when this happened, as they have already rotated.

Comment 13 Sahina Bose 2016-08-16 09:37:56 UTC
Apologies - I haven't been able to spend time on this bug. From the attached logs in bug 1248453 I could only see tar copy errors on migration - no gluster-specific errors.
Unless we have a reproducer with newer versions, can we close this bug?

Comment 14 Yaniv Lavi 2016-08-21 11:15:48 UTC
Can you reproduce with the latest gluster/oVirt?

Comment 15 nicolas 2016-08-21 11:33:05 UTC
We don't use gluster anymore since it caused a lot of problems, so we don't have a gluster infrastructure to try to reproduce it.

I think it is ok to close this BZ since no one seems to have complained about this issue and I cannot provide additional info at this time.