Description of problem:
---------------------------------------
After a host was added to the cluster managed from RHSC, and a volume was created from the gluster CLI, the volume failed to appear on the Console. The following is seen in the engine logs:
---------------------------------------
2013-02-04 19:53:09,606 ERROR [org.ovirt.engine.core.utils.ServletUtils] (ajp-/127.0.0.1:8702-29) Can't read file "/usr/share/ovirt-engine/docs/DocumentationPath.csv" for request "/docs/DocumentationPath.csv", will send a 404 error response.
2013-02-04 19:55:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 hosts
2013-02-04 19:55:00,004 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-62) Autorecovering 0 storage domains
2013-02-04 19:55:05,605 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) START, GetGlusterVolumeAdvancedDetailsVDSCommand(HostName = rhs-client31.lab.eng.blr.redhat.com, HostId = 91ddf6d9-6347-49d9-b68a-3a2064172bc3), log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-54) XML RPC error in command GetGlusterVolumeAdvancedDetailsVDS ( HostName = rhs-client31.lab.eng.blr.redhat.com ), the error was: java.util.concurrent.ExecutionException: java.lang.reflect.InvocationTargetException, <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
2013-02-04 19:55:05,705 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeAdvancedDetailsVDSCommand] (QuartzScheduler_Worker-54) FINISH, GetGlusterVolumeAdvancedDetailsVDSCommand, log id: 47836de8
2013-02-04 19:55:05,705 ERROR [org.ovirt.engine.core.bll.gluster.GlusterManager] (QuartzScheduler_Worker-54) Error while refreshing brick statuses for volume fromclihost1 of cluster cluster: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.apache.xmlrpc.XmlRpcException: <type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:169) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.runVdsCommand(GlusterManager.java:260) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.getVolumeAdvancedDetails(GlusterManager.java:894) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshBrickStatuses(GlusterManager.java:867) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshClusterHeavyWeightData(GlusterManager.java:852) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.gluster.GlusterManager.refreshHeavyWeightData(GlusterManager.java:827) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
2013-02-04 19:55:05,706 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (QuartzScheduler_Worker-56) Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 58001b50-cfc7-4c0a-89dc-16f2b808febb value: GLUSTER , sharedLocks= ]

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.0-0.qa5.el6rhs

How reproducible:
Intermittent

Steps to Reproduce:
1. Add a host to the cluster managed from RHSC.
2. Create a volume on the host from the gluster CLI.

Actual results:
The Console fails to detect changes made from the gluster CLI.

Expected results:
The Console is supposed to sync volume information with glusterfs.

Additional info:
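The `<type 'exceptions.TypeError'>:sequence item 0: expected string, NoneType found` in the log above is the message Python 2 produces when `str.join` is handed a sequence containing `None`; the error then surfaces through vdsm's XML-RPC layer to the engine. A minimal sketch of that failure mode (the `join_brick_fields` helper and the brick values are hypothetical illustrations, not actual vdsm code):

```python
# Hypothetical sketch of the failure mode, not actual vdsm code: joining
# volume/brick fields where one field is None raises the same TypeError
# that surfaced through XML-RPC in the engine log.

def join_brick_fields(fields):
    """Join brick fields into a "host:path" style string."""
    return ":".join(fields)

print(join_brick_fields(["host1", "/bricks/b1"]))  # prints "host1:/bricks/b1"

try:
    # A field left unpopulated (e.g. for a volume created outside the
    # console) triggers the TypeError.
    join_brick_fields([None, "/bricks/b1"])
except TypeError as e:
    # Python 2 wording (as in the log): "sequence item 0: expected string,
    # NoneType found"; Python 3 says "expected str instance" instead.
    print("TypeError:", e)
```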
Created attachment 692786 [details] engine logs
Created attachment 692787 [details] vdsm logs
I think this is related to Bug 905904. Can you confirm whether a similar scenario happened before this problem started, i.e. you tried to remove a server and got a validation error? If that is indeed the case, simply restarting the engine should resolve the lock issue, and the volume should then get pulled in. Can you please check this and confirm?
No, I had not tried to remove hosts. I did create, start, stop and delete volumes and added new hosts quite a few times before this started happening.
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

Did you check what happens after restarting the engine?
Restarted the engine. The volume that was not being displayed earlier is now being displayed in the GUI.
(In reply to comment #5)
> No, I had not tried to remove hosts. I did create, start, stop and delete
> volumes and added new hosts quite a few times before this started happening.

During this whole time, did you receive any validation error on any of your actions? The issue I mentioned is not limited to "remove server" but applies to any action, i.e. if you select one or more entities in any of the tables and try to perform an action, and it fails because of "validation errors". In such cases, a lock is acquired on the cluster and never released, causing all these problems. This has already been fixed upstream.

(In reply to comment #7)
> Restarted the engine. The volume that was not being displayed earlier is now
> being displayed in the GUI.

OK. So this confirms that the issue is similar to, if not the same as, Bug 905904.
(In reply to comment #8)
> (In reply to comment #5)
> > No, I had not tried to remove hosts. I did create, start, stop and delete
> > volumes and added new hosts quite a few times before this started happening.
>
> During this whole time, had you received any validation error on any of your
> actions? The issue I mentioned is not limited to "remove server", but any
> action. i.e. If you select one or more entities in any of the tables, and
> try to perform an action, and it fails because of "validation errors". In
> such cases, a lock is getting acquired on the cluster, which is not getting
> released, causing all these problems. This has already been fixed upstream.

I tried to stop a volume, which failed because the action was failing on the storage node. Could this have caused a failure to release the lock on the cluster?
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> >
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
>
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

No, this should happen only in the case of a validation failure at the engine level, i.e. when the command was not executed on the storage node at all.
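The lock leak discussed above (and the "Failed to acquire lock" line from InMemoryLockManager in the engine log) can be sketched generically. This is not the actual engine code, just a minimal Python illustration of the pattern: a lock acquired before validation but released only on the success path leaks when validation fails, whereas releasing in a finally block does not. Since the lock manager is in-memory, only an engine restart clears a leaked lock.

```python
# Generic sketch of the lock-leak pattern described above; the class and
# function names are illustrative, not the actual engine code.
import threading

class InMemoryLockManager:
    """Minimal in-memory exclusive lock registry."""
    def __init__(self):
        self._locks = set()
        self._guard = threading.Lock()

    def acquire(self, key):
        with self._guard:
            if key in self._locks:
                return False  # held -> "Failed to acquire lock" as in the log
            self._locks.add(key)
            return True

    def release(self, key):
        with self._guard:
            self._locks.discard(key)

def run_command_buggy(mgr, cluster_id, validate):
    if not mgr.acquire(cluster_id):
        raise RuntimeError("Failed to acquire lock")
    if not validate():   # validation failure returns early...
        return False     # ...without releasing -> lock leaks
    mgr.release(cluster_id)
    return True

def run_command_fixed(mgr, cluster_id, validate):
    if not mgr.acquire(cluster_id):
        raise RuntimeError("Failed to acquire lock")
    try:
        return validate()
    finally:
        mgr.release(cluster_id)  # released on every path

mgr = InMemoryLockManager()
run_command_buggy(mgr, "cluster-1", lambda: False)   # validation fails
print(mgr.acquire("cluster-1"))  # prints False: lock leaked, sync jobs blocked
```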
Per Feb-06 bug triage meeting, targeting for 2.1.0.
Does this cluster contain only one server? If not, please attach vdsm logs from the other servers as well. One possibility I see is that the glusterfs or vdsm version on one of the servers is old. If the problem is reproducible right now, please share the setup details so that I can have a look at it.
This cluster had only one server. But there were other clusters being managed from the Console. I can provide vdsm and gluster logs for all the servers present in the system when this issue was seen, if required.
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #5)
> > > No, I had not tried to remove hosts. I did create, start, stop and delete
> > > volumes and added new hosts quite a few times before this started happening.
> >
> > During this whole time, had you received any validation error on any of your
> > actions? The issue I mentioned is not limited to "remove server", but any
> > action. i.e. If you select one or more entities in any of the tables, and
> > try to perform an action, and it fails because of "validation errors". In
> > such cases, a lock is getting acquired on the cluster, which is not getting
> > released, causing all these problems. This has already been fixed upstream.
>
> I tried to stop a volume, which failed because the action was failing on the
> storage node. Could this have caused a failure to release the lock on the
> cluster?

The log suggests that the error was *not* coming from the gluster command on the node; it was indeed a validation error coming from the engine itself:

CanDoAction of action StopGlusterVolume failed. Reasons:VAR__ACTION__STOP,VAR__TYPE__GLUSTER_VOLUME,ACTION_TYPE_FAILED_GLUSTER_VOLUME_ALREADY_STOPPED,$volumeName fromclitest

So this confirms that this bug is indeed a duplicate of Bug 905904. I'm marking it as a duplicate now.

*** This bug has been marked as a duplicate of bug 905904 ***