Bug 907492

Summary: [RHSC] Console failed to detect creation of volume from gluster CLI
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shruti Sampat <ssampat>
Component: rhsc
Assignee: Shireesh <shireesh>
Status: CLOSED DUPLICATE
QA Contact: Shruti Sampat <ssampat>
Severity: unspecified
Docs Contact:
Priority: medium
Version: 2.1
CC: dtsang, mmahoney, pprakash, rhs-bugs, shtripat
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-18 09:56:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  engine logs (flags: none)
  vdsm logs (flags: none)

Description Shruti Sampat 2013-02-04 14:41:10 UTC
Description of problem:
---------------------------------------
After a host was added to a cluster managed from RHSC and a volume was created from the gluster CLI, the volume failed to appear on the Console.

The following is seen in the engine logs - 
---------------------------------------
2013-02-04 19:53:09,606 ERROR [org.ovirt.engine.core.utils.ServletUtils] (ajp-/127.0.0.1:8702-29) Can't read file "/usr/share/ovirt-engine/docs/DocumentationPath.csv" for request "/docs/DocumentationPath.csv", will send a 404 error response.
2013-02-04 19:55:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 hosts
2013-02-04 19:55:00,004 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 storage domains
2013-02-04 19:55:05,605 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) START, GetGlusterVolumeAdvancedDetailsVDSCommand(HostName = rhs-client31.lab.eng.blr.redhat.com, HostId = 91ddf6d9-6347-49d9-b68a-3a2064172bc3), log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-54) XML RPC error in command GetGlusterVolumeAdvancedDetailsVDS ( HostName = rhs-client31.lab.eng.blr.redhat.com ), the error was: java.util.concurrent.ExecutionException: java.lang.reflect.InvocationTargetException, <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
2013-02-04 19:55:05,705 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) FINISH, GetGlusterVolumeAdvancedDetailsVDSCommand, log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.bll.gluster.GlusterManager] (QuartzScheduler_Worker-54) Error while refreshing brick statuses for volume fromclihost1 of cluster cluster: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.apache.xmlrpc.XmlRpcException: <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.runVdsCommand(GlusterManager.java:260) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.getVolumeAdvancedDetails(GlusterManager.java:894) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshBrickStatuses(GlusterManager.java:867) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshClusterHeavyWeightData(GlusterManager.java:852) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshHeavyWeightData(GlusterManager.java:827) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]

2013-02-04 19:55:05,706 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (QuartzScheduler_Worker-56) Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 58001b50-cfc7-4c0a-89dc-16f2b808febb value: GLUSTER, sharedLocks= ]
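The root cause of the XML-RPC failure above is visible in the Python error text: a `str.join`-style call on the vdsm side received `None` where a string was expected. A minimal sketch of that failure mode, with a hypothetical `format_brick` helper (the actual vdsm code path is not shown in these logs):

```python
# Minimal sketch of the TypeError in the log above. str.join() raises
# exactly this error when the sequence contains None. The log shows the
# Python 2 wording ("expected string"); Python 3 says "expected str
# instance". The helper name below is hypothetical, not actual vdsm code.

def format_brick(host, path):
    """Join a brick's host and path, as a status formatter might."""
    return ":".join([host, path])

try:
    # If a brick's host name is missing (None), the join blows up:
    format_brick(None, "/bricks/brick1")
except TypeError as e:
    # Python 3 prints: sequence item 0: expected str instance, NoneType found
    print(e)
```

Any code path that builds such a sequence from data that can be `None` (for example, an unresolved host name) surfaces this error to the XML-RPC layer, which is consistent with the `VDSNetworkException` wrapping seen in the engine log.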


Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs 

How reproducible:
Intermittent

Steps to Reproduce:
1. Add a host to the cluster managed from RHSC.
2. Create a volume on the host from the gluster CLI.
Actual results:
Console fails to detect changes made from the gluster CLI.

Expected results:
The Console is expected to sync volume information from GlusterFS.

Additional info:

Comment 1 Shruti Sampat 2013-02-04 14:43:40 UTC
Created attachment 692786 [details]
engine logs

Comment 2 Shruti Sampat 2013-02-04 14:46:13 UTC
Created attachment 692787 [details]
vdsm logs

Comment 3 Shireesh 2013-02-04 14:49:32 UTC
I think this is related to Bug 905904. Can you confirm whether a similar scenario occurred before this problem started, i.e. you tried to remove a server and got a validation error? If so, simply restarting the engine should resolve the lock issue, and the volume should then get pulled in. Can you please check and confirm?

Comment 5 Shruti Sampat 2013-02-04 15:04:26 UTC
No, I had not tried to remove hosts. I did create, start, stop and delete volumes and added new hosts quite a few times before this started happening.

Comment 6 Shireesh 2013-02-05 05:40:35 UTC
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

Did you check what happens after restarting the engine?

Comment 7 Shruti Sampat 2013-02-05 06:03:35 UTC
Restarted the engine. The volume that was not being displayed earlier is now being displayed in the GUI.

Comment 8 Shireesh 2013-02-05 06:17:18 UTC
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

During this whole time, had you received any validation error on any of your actions? The issue I mentioned is not limited to "remove server", but any action. i.e. If you select one or more entities in any of the tables, and try to perform an action, and it fails because of "validation errors". In such cases, a lock is getting acquired on the cluster, which is not getting released, causing all these problems. This has already been fixed upstream.

(In reply to comment #7)
> Restarted the engine. The volume that was not being displayed earlier is now
> being displayed in the GUI.

OK. So this confirms that the issue is similar to, if not the same as, Bug 905904.

Comment 9 Shruti Sampat 2013-02-05 07:44:33 UTC
(In reply to comment #8)
> (In reply to comment #5)
> > No, I had not tried to remove hosts. I did create, start, stop and delete
> > volumes and added new hosts quite a few times before this started happening.
> 
> During this whole time, had you received any validation error on any of your
> actions? The issue I mentioned is not limited to "remove server", but any
> action. i.e. If you select one or more entities in any of the tables, and
> try to perform an action, and it fails because of "validation errors". In
> such cases, a lock is getting acquired on the cluster, which is not getting
> released, causing all these problems. This has already been fixed upstream.

I tried to stop a volume, which failed because the action was failing on the storage node. Could this have caused a failure to release the lock on the cluster?

Comment 10 Shireesh 2013-02-05 07:49:05 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> > 
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
> 
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

No, this should happen only in case of a validation failure at the engine level, i.e. when the command was not executed on the storage node at all.
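The failure mode described in comments 8 and 10 (a cluster lock acquired before validation and never released when validation fails inside the engine) can be sketched as follows. This is an illustrative Python model, not the engine's actual Java code; all names are hypothetical, loosely modeled on the `InMemoryLockManager` seen in the engine log:

```python
import threading

class InMemoryLockManager:
    """Toy in-memory lock manager (hypothetical model of the engine's)."""
    def __init__(self):
        self._locks = set()
        self._guard = threading.Lock()

    def acquire(self, key):
        with self._guard:
            if key in self._locks:
                return False  # "Failed to acquire lock", as in the engine log
            self._locks.add(key)
            return True

    def release(self, key):
        with self._guard:
            self._locks.discard(key)

locks = InMemoryLockManager()

def stop_volume_buggy(cluster_id, volume_running):
    # Bug pattern: the cluster lock is acquired *before* validation...
    if not locks.acquire(cluster_id):
        return "lock busy"
    if not volume_running:
        # ...and this early validation-failure return skips the release,
        # leaving the cluster locked until the engine is restarted.
        return "validation failed: volume already stopped"
    locks.release(cluster_id)
    return "stopped"

def stop_volume_fixed(cluster_id, volume_running):
    if not locks.acquire(cluster_id):
        return "lock busy"
    try:
        if not volume_running:
            return "validation failed: volume already stopped"
        return "stopped"
    finally:
        locks.release(cluster_id)  # released on every exit path
```

Once `stop_volume_buggy` hits a validation failure, every subsequent action on the same cluster key, including the periodic volume-sync job, gets "lock busy", which matches the "Failed to acquire lock" line in the engine log; restarting the engine clears the in-memory lock, which is why the volume appeared after the restart in comment 7. The `try/finally` in the fixed variant releases the lock on every path.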

Comment 12 Scott Haines 2013-02-06 20:49:09 UTC
Per Feb-06 bug triage meeting, targeting for 2.1.0.

Comment 13 Shireesh 2013-02-15 06:36:37 UTC
Does this cluster contain only one server? If not, please attach vdsm logs from the other servers as well.

One possibility I see is that either the glusterfs or the vdsm version on one of the servers is an old one.

If the problem is reproducible right now, please share the setup details so that I can have a look at it.

Comment 14 Shruti Sampat 2013-02-15 07:38:40 UTC
This cluster had only one server. But there were other clusters being managed from the Console. I can provide vdsm and gluster logs for all the servers present in the system when this issue was seen, if required.

Comment 15 Shireesh 2013-02-18 09:56:35 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> > 
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
> 
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

The log suggests that the error was *not* coming from the gluster command on the node; it was indeed a validation error coming from the engine itself.

CanDoAction of action StopGlusterVolume failed. Reasons:VAR__ACTION__STOP,VAR__TYPE__GLUSTER_VOLUME,ACTION_TYPE_FAILED_GLUSTER_VOLUME_ALREADY_STOPPED,$volumeName fromclitest

So this confirms that it is indeed a duplicate of Bug 905904.
Marking it as a duplicate now.

*** This bug has been marked as a duplicate of bug 905904 ***