Description of problem:
---------------------------------------
After a host was added to the cluster managed from RHSC, and a volume was created from the gluster CLI, the volume failed to appear on the Console. The following is seen in the engine logs:
---------------------------------------
2013-02-04 19:53:09,606 ERROR [org.ovirt.engine.core.utils.ServletUtils] (ajp-/127.0.0.1:8702-29) Can't read file "/usr/share/ovirt-engine/docs/DocumentationPath.csv" for request "/docs/DocumentationPath.csv", will send a 404 error response.
2013-02-04 19:55:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 hosts
2013-02-04 19:55:00,004 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 storage domains
2013-02-04 19:55:05,605 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) START, GetGlusterVolumeAdvancedDetailsVDSCommand(HostName = rhs-client31.lab.eng.blr.redhat.com, HostId = 91ddf6d9-6347-49d9-b68a-3a2064172bc3), log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-54) XML RPC error in command GetGlusterVolumeAdvancedDetailsVDS ( HostName = rhs-client31.lab.eng.blr.redhat.com ), the error was: java.util.concurrent.ExecutionException: java.lang.reflect.InvocationTargetException, <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
2013-02-04 19:55:05,705 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) FINISH, GetGlusterVolumeAdvancedDetailsVDSCommand, log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.bll.gluster.GlusterManager] (QuartzScheduler_Worker-54) Error while refreshing brick statuses for volume fromclihost1 of cluster cluster: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.apache.xmlrpc.XmlRpcException: <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.runVdsCommand(GlusterManager.java:260) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.getVolumeAdvancedDetails(GlusterManager.java:894) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshBrickStatuses(GlusterManager.java:867) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshClusterHeavyWeightData(GlusterManager.java:852) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshHeavyWeightData(GlusterManager.java:827) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
2013-02-04 19:55:05,706 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (QuartzScheduler_Worker-56) Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 58001b50-cfc7-4c0a-89dc-16f2b808febb value: GLUSTER , sharedLocks= ]

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs

How reproducible:
Intermittent

Steps to Reproduce:
1. Add a host to the cluster managed from RHSC.
2. Create a volume on the host from the gluster CLI.

Actual results:
The Console fails to detect changes made from the gluster CLI.

Expected results:
The Console is supposed to sync volume information with glusterfs.

Additional info:
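The `<type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found` in the log above is the message Python 2 produces when `str.join` is handed a sequence containing `None`; the error then surfaces through vdsm's XML-RPC layer to the engine. A minimal sketch of that failure mode (the `join_brick_fields` helper and the brick values are hypothetical illustrations, not actual vdsm code):

```python
# Hypothetical sketch of the failure mode, not actual vdsm code: joining
# volume/brick fields where one field is None raises the same TypeError
# that surfaced through XML-RPC in the engine log.

def join_brick_fields(fields):
    """Join brick fields into a "host:path" style string."""
    return ":".join(fields)

print(join_brick_fields(["host1", "/bricks/b1"]))  # prints "host1:/bricks/b1"

try:
    # A field left unpopulated (e.g. for a volume created outside the
    # console) triggers the TypeError.
    join_brick_fields([None, "/bricks/b1"])
except TypeError as e:
    # Python 2 wording (as in the log): "sequence item 0: expected string,
    # NoneType found"; Python 3 says "expected str instance" instead.
    print("TypeError:", e)
```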
Created attachment 692786 [details] engine logs
Created attachment 692787 [details] vdsm logs
I think this is related to Bug 905904. Can you confirm whether a similar scenario happened before this problem started, i.e. you tried to remove a server and got a validation error? If that is indeed the case, simply restarting the engine should resolve the lock issue, and the volume should then get pulled in. Can you please check this and confirm?
No, I had not tried to remove hosts. I did create, start, stop and delete volumes and added new hosts quite a few times before this started happening.
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

Did you check what happens after restarting the engine?
Restarted the engine. The volume that was not being displayed earlier is now being displayed in the GUI.
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

During this whole time, did you receive any validation error on any of your actions? The issue I mentioned is not limited to "remove server" but applies to any action, i.e. if you select one or more entities in any of the tables and try to perform an action, and it fails because of "validation errors". In such cases, a lock is acquired on the cluster and never released, causing all these problems. This has already been fixed upstream.

(In reply to comment #7)
> Restarted the engine. The volume that was not being displayed earlier is now
> being displayed in the GUI.

OK. So this confirms that the issue is similar to, if not the same as, Bug 905904.
(In reply to comment #8)
> (In reply to comment #5)
> > No, I had not tried to remove hosts. I did create, start, stop and delete
> > volumes and added new hosts quite a few times before this started happening.
>
> During this whole time, had you received any validation error on any of your
> actions? The issue I mentioned is not limited to "remove server", but any
> action. i.e. If you select one or more entities in any of the tables, and
> try to perform an action, and it fails because of "validation errors". In
> such cases, a lock is getting acquired on the cluster, which is not getting
> released, causing all these problems. This has already been fixed upstream.

I tried to stop a volume, which failed because the action was failing on the storage node. Could this have caused a failure to release the lock on the cluster?
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> >
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
>
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

No, this should happen only in the case of a validation failure at the engine level, i.e. when the command was not executed on the storage node at all.
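The lock leak discussed above (and the "Failed to acquire lock" line from InMemoryLockManager in the engine log) can be sketched generically. This is not the actual engine code, just a minimal Python illustration of the pattern: a lock acquired before validation but released only on the success path leaks when validation fails, whereas releasing in a finally block does not. Since the lock manager is in-memory, only an engine restart clears a leaked lock.

```python
# Generic sketch of the lock-leak pattern described above; the class and
# function names are illustrative, not the actual engine code.
import threading

class InMemoryLockManager:
    """Minimal in-memory exclusive lock registry."""
    def __init__(self):
        self._locks = set()
        self._guard = threading.Lock()

    def acquire(self, key):
        with self._guard:
            if key in self._locks:
                return False  # held -> "Failed to acquire lock" as in the log
            self._locks.add(key)
            return True

    def release(self, key):
        with self._guard:
            self._locks.discard(key)

def run_command_buggy(mgr, cluster_id, validate):
    if not mgr.acquire(cluster_id):
        raise RuntimeError("Failed to acquire lock")
    if not validate():   # validation failure returns early...
        return False     # ...without releasing -> lock leaks
    mgr.release(cluster_id)
    return True

def run_command_fixed(mgr, cluster_id, validate):
    if not mgr.acquire(cluster_id):
        raise RuntimeError("Failed to acquire lock")
    try:
        return validate()
    finally:
        mgr.release(cluster_id)  # released on every path

mgr = InMemoryLockManager()
run_command_buggy(mgr, "cluster-1", lambda: False)   # validation fails
print(mgr.acquire("cluster-1"))  # prints False: lock leaked, sync jobs blocked
```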
Per Feb-06 bug triage meeting, targeting for 2.1.0.
Does this cluster contain only one server? If not, please attach vdsm logs from the other servers as well. One possibility I see is that the glusterfs or vdsm version on one of the servers is old. If the problem is reproducible right now, please share the setup details so that I can have a look at it.
This cluster had only one server. But there were other clusters being managed from the Console. I can provide vdsm and gluster logs for all the servers present in the system when this issue was seen, if required.
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> >
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
>
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

The log suggests that the error was *not* coming from the gluster command on the node; it was indeed a validation error coming from the engine itself:

CanDoAction of action StopGlusterVolume failed. Reasons:VAR__ACTION__STOP,VAR__TYPE__GLUSTER_VOLUME,ACTION_TYPE_FAILED_GLUSTER_VOLUME_ALREADY_STOPPED,$volumeName fromclitest

So this confirms that this bug is indeed a duplicate of Bug 905904. I'm marking it as a duplicate now.

*** This bug has been marked as a duplicate of bug 905904 ***