Red Hat Bugzilla – Bug 1304268
[HC] Activate Gluster domain fails with StorageDomainDoesNotExist
Last modified: 2016-04-20 06:50:47 EDT
Created attachment 1120683 [details]
Description of problem:
A brick was removed from, and then re-added to, a replica 3 gluster volume serving as the master data domain. The gluster volume heal was initiated and completed.
The storage domain then went to the Inactive state; trying to activate it from the engine fails with:
VDSM command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)
On the hypervisor, the gluster volume is mounted and accessible.
rhsdev-docker2 ~]# cat /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore/12a2c867-dbc6-43aa-8b62-95c0f1805ac0/dom_md/metadata
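(As an aside on the path above: vdsm appears to derive the glusterSD mountpoint name from the remote `host:/volume` spec by replacing each `/` with `_`. A minimal sketch of that assumed convention, not something stated in this report:)

```shell
# Assumed vdsm naming convention: the directory under
# /rhev/data-center/mnt/glusterSD/ is the remote spec with "/" -> "_".
remote="rhsdev-docker1.lab.eng.blr.redhat.com:/vmstore"
mnt="/rhev/data-center/mnt/glusterSD/${remote//\//_}"
echo "$mnt"
# -> /rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore
```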
gluster volume status vmstore
Status of volume: vmstore
Gluster process                                            TCP Port  RDMA Port  Online  Pid
:/rhgs/vmstore/brick1                                      49154     0          Y       45798
:/rhgs/vmstore/brick1                                      49154     0          Y       4474
vmstore/brick1                                             49152     0          Y       45782
NFS Server on localhost                                    2049      0          Y       32936
Self-heal Daemon on localhost                              N/A       N/A        Y       32991
NFS Server on rhsdev9.lab.eng.blr.redhat.com               2049      0          Y       4752
Self-heal Daemon on rhsdev9.lab.eng.blr.redhat.com         N/A       N/A        Y       4760
NFS Server on rhsdev-docker1.lab.eng.blr.redhat.com        2049      0          Y       48905
Self-heal Daemon on rhsdev-docker1.lab.eng.blr.redhat.com  N/A       N/A        Y       48914
[root@rhsdev-docker2 ~]# gluster volume heal vmstore info
Number of entries: 0
Number of entries: 0
Number of entries: 0
Gluster mount logs and vdsm logs from hypervisor attached.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. On a working RHEV + RHGS setup, remove a brick from the gluster volume used as the master data storage domain, reducing the replica count from 3 to 2. (The brick was removed in order to remove a host that had been re-installed; the engine does not allow removing hosts that contain bricks.)
2. Add a brick back to the gluster volume, increasing the replica count from 2 back to 3.
There were problems with the heal, which had to be triggered by adding a dummy file to the mount point and running "gluster volume heal vmstore full".
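The remove/add brick steps above would look roughly like the following; the host name and brick paths are placeholders, not taken from this setup:

```shell
# Hypothetical sketch of the steps above; "host3" and the brick path are
# placeholders. Run from any host in the trusted storage pool.

# 1. Drop the brick on the host being re-installed (replica 3 -> 2):
gluster volume remove-brick vmstore replica 2 \
    host3:/rhgs/vmstore/brick1 force

# 2. After re-installing the host, add its brick back (replica 2 -> 3):
gluster volume add-brick vmstore replica 3 \
    host3:/rhgs/vmstore/brick1

# 3. If heal does not start on its own, trigger a full heal and watch it:
gluster volume heal vmstore full
gluster volume heal vmstore info    # done when every brick reports
                                    # "Number of entries: 0"
```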
Created attachment 1120684 [details]
Is there any way to debug this? To check manually what vdsm is looking for in the gluster volume mount?
I have the setup as is, so I can run the checks.
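A hedged sketch of what can be checked by hand: vdsm looks for the standard storage-domain layout under the gluster mount. The mountpoint and SD UUID below are taken from this report; the exact set of checks is an assumption based on the usual oVirt file-domain layout, not an official procedure:

```shell
# Paths and UUID from this report; the checks themselves are assumptions.
MNT="/rhev/data-center/mnt/glusterSD/rhsdev-docker1.lab.eng.blr.redhat.com:_vmstore"
SD="12a2c867-dbc6-43aa-8b62-95c0f1805ac0"

ls -l "$MNT"                    # SD uuid dir and __DIRECT_IO_TEST__ should be visible
ls -l "$MNT/$SD/dom_md"         # expect metadata, ids, inbox, outbox, leases
grep ^SDUUID "$MNT/$SD/dom_md/metadata"   # should match $SD

# vdsm runs as vdsm:kvm and uses direct I/O, so exercise both:
sudo -u vdsm dd if="$MNT/$SD/dom_md/metadata" of=/dev/null iflag=direct bs=4096
```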
I restarted vdsm on all 3 nodes, but still get the same error (engine events):
VDSM rhsdev9 command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)
VDSM rhsdev-docker1 command failed: Storage domain does not exist: (u'12a2c867-dbc6-43aa-8b62-95c0f1805ac0',)
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
I will have to reproduce and look into this to better understand what's going on
Is it hosted engine?
Can you add the engine log?
Created attachment 1144284 [details]
Yes, it is hosted engine
I ran into error again, so have all the logs.
In my setup, the vmstore volume is running and all bricks are up. If I mount it on a temporary directory it is accessible, but the /rhev/data-center/mnt/... directory is not.
[root@rhsdev9 ~]# mount -t glusterfs 10.70.42.203:/vmstore tmp
[root@rhsdev9 ~]# cd tmp
[root@rhsdev9 tmp]# ll
drwxr-xr-x. 5 vdsm kvm 45 Apr 5 11:17 1499860e-9fcd-44d9-a41e-e54ad0a7ff84
-rwxr-xr-x. 1 vdsm kvm 0 Apr 6 22:10 __DIRECT_IO_TEST__
[root@rhsdev9 tmp]# ll /rhev/data-center/mnt/glusterSD/10.70.42.203\:_vmstore
ls: cannot access /rhev/data-center/mnt/glusterSD/10.70.42.203:_vmstore: Transport endpoint is not connected
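A working mount on a fresh directory plus ENOTCONN on the vdsm mountpoint usually indicates a stale FUSE mount: the glusterfs client process behind it died or was cut off. A sketch of confirming and clearing it (paths from this report; the lazy-unmount step is a generic FUSE remedy, not something vdsm prescribes):

```shell
OLD="/rhev/data-center/mnt/glusterSD/10.70.42.203:_vmstore"

# ENOTCONN from stat confirms the mountpoint is stale:
stat "$OLD" 2>&1 | grep -q 'Transport endpoint is not connected' \
    && echo "stale FUSE mount: $OLD"

# Check whether a glusterfs client process still backs this mount:
ps ax | grep '[g]lusterfs.*vmstore'

# Clear the stale mountpoint; vdsm mounts again on the next
# connectStorageServer / domain activation:
umount -l "$OLD"
```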
Created attachment 1144285 [details]
Created attachment 1144287 [details]
Krutika, do you know why situation in comment 11 could happen?
Was there a core dump this time around?
No, no core dumps in /var/log/core
A volume option was changed: the volume was stopped, SSL was enabled, and the volume was restarted. After that, the storage domain cannot be activated on one of the hosts; the other two hosts can connect to the storage domain.
Pranith and I looked into your setup. It looks like you enabled the SSL options long after the volume had been mounted. This is a problem because, from that point onwards, the server refuses to accept requests from the existing client. This is a known issue.
I will try and dig out the specific BZ where this issue was originally reported and post it here again.
Meanwhile, there are two ways to get around the issue:
1) Make sure to have already enabled ssl settings before you mount the volume.
2) If you enable gluster ssl settings AFTER the volume is mounted, we would need to unmount and mount the volume again on these old(er) clients.
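Putting the two workarounds above together, the safe ordering would be roughly the following (volume name from this report; the option names are the usual gluster SSL settings, so treat this as a sketch):

```shell
# Sketch assuming the standard gluster SSL options. Do this while no
# client holds a mount (or unmount everywhere first), per the comments
# above: clients mounted before SSL is enabled will be refused.
gluster volume stop vmstore
gluster volume set vmstore client.ssl on
gluster volume set vmstore server.ssl on
gluster volume start vmstore

# On each client that was mounted before SSL was enabled, remount:
umount /rhev/data-center/mnt/glusterSD/10.70.42.203:_vmstore
# ...then let vdsm (or mount -t glusterfs) mount the volume again.
```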
It looks like the root cause is not related to oVirt.
Can you move this report to the right product/component?
I will move component once I'm sure that we're not missing anything here. As the earlier issue (Comment 1) had nothing to do with SSL.
When trying to connect storage domain to server, does vdsm remount the Storage domain?
In general, vdsm will not remount; it will mount.
In the scenario described, I think you should deactivate and then reactivate the storage domain.
When you re-add the host, make sure no old oVirt mount is still configured.
(In reply to Krutika Dhananjay from comment #17)
> 2) If you enable gluster ssl settings AFTER the volume is mounted, we would
> need to unmount and mount the volume again on these old(er) clients.
Moving the domain/host to maintenance (which you should do anyway before changing any configurations on the storage side) would unmount the directory.
Seems like this BZ is either a misuse, or a bug on RHGS' side that it doesn't allow maintenancing a host with a brick.
What am I missing?
(In reply to Allon Mureinik from comment #22)
> (In reply to Krutika Dhananjay from comment #17)
> > 2) If you enable gluster ssl settings AFTER the volume is mounted, we would
> > need to unmount and mount the volume again on these old(er) clients.
> Moving the domain/host to maintenance (which you should do anyway before
> changing any configurations on the storage side) would unmount the directory.
> Seems like this BZ is either a misuse, or a bug on RHGS' side that it
> doesn't allow maintenancing a host with a brick.
> What am I missing?
Gluster only prevents removing a host that has a brick; there is no such check yet for moving a host to maintenance.
In this case, we tried to move the storage domain and hosts to maintenance. The domain failed to move to maintenance because it was the master domain, and its status changed back to Active after the failure message.
One of the hosts, which was running the HE VM, failed to move to maintenance, and that caused this error.
I think the wrong step was clicking "confirm the host was rebooted" without unmounting the SD.
Is there a way to move the master domain to maintenance?
You should be able to move a master domain to maintenance. Liron - can you take a look? You had some recent work in that area.
We are able to move a master domain to maintenance.
If it is the last SD, the Data Center will move to maintenance as well.
If a VM is running on this SD, you will not be able to move the SD to maintenance.
Also, you cannot move the host that is running the Hosted Engine to maintenance.
In any case, VDSM will not unmount an existing SD.
Sahina, is there another flow that we need to investigate ?
Sahina, any insights ?
Sorry, I have not had time to look into this until now. We did have issues moving the master SD to maintenance: there were no VMs running on it, but there were no other active SDs either.
I tried to reproduce this again - and I was able to move master SD to maintenance when there were other storage domains.
Closing this, as it seems more like a workflow error. Will re-open if we reproduce the problem with the correct workflow.