Created attachment 709049[details]
logs
Description of problem:
we are getting java.lang.NullPointerException for SpmStatusVDS
Version-Release number of selected component (if applicable):
sf10
How reproducible:
100%
Steps to Reproduce:
1. in two hosts cluster with 1 NFS domain block connectivity to the storage from both hosts
2.
3.
Actual results:
java.lang.NullPointerException for SpmStatusVDS
Expected results:
we should not get NPE
Additional info:logs
engine:
2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-72) Failed in SpmStatusVDS method
2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-72) Error code GeneralException and error message VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error
=
2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-72) Command SpmStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SpmStatusV
DS, error =
2013-03-12 05:21:47,177 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (QuartzScheduler_Worker-72) CreateCommand failed: java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:332) [rt.jar:1.7.0_09-icedtea]
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:987) [rt.jar:1.7.0_09-icedtea]
at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:199) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.initializeVdsBroker(VdsBrokerCommand.java:51) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.<init>(VdsBrokerCommand.java:32) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand.<init>(HSMGetAllTasksStatusesVDSCommand.java:22) [engine-vdsbroker.jar:]
at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown Source) [:1.7.0_09-icedtea]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.7.0_09-icedtea]
at java.lang.reflect.Constructor.newInstance(Constructor.java:525) [rt.jar:1.7.0_09-icedtea]
at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:333) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:370) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.proceedStoragePoolStats(IrsBrokerCommand.java:231) [engine-vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData._updatingTimer_Elapsed(IrsBrokerCommand.java:183) [engine-vdsbroker.jar:]
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_09-icedtea]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
vdsm:
Thread-66::INFO::2013-03-12 05:42:52,323::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7', options=None)
Thread-66::ERROR::2013-03-12 05:42:52,324::task::833::TaskManager.Task::(_setError) Task=`1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 840, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 603, in getSpmStatus
pool = self.getPool(spUUID)
File "/usr/share/vdsm/storage/hsm.py", line 313, in getPool
raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',)
Thread-66::DEBUG::2013-03-12 05:42:52,324::task::852::TaskManager.Task::(_run) Task=`1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9`::Task._run: 1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9 ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',) {} failed - stopping t
ask
From what I can see at the log,
there is an attempt to InitializeVdsBroker with a non existing id of vds.
This happens at invocation of HSMGetAllTaskStatuses during the failover.
I also took a look at where we remove at the ResourceManager the entry from the concurrent hash map - I see it is removed by RemoveVdsCommand.
During my search in code, I also saw that RemoveVdsCommand occurs during VdsCommand.InitializeVds - which performs removeVds and addVds - so here is another thing to fix, this "refresh" is not atomic, and may cause problems.
Michael, please shed more light here
Actually a problem is:
Inside IrsBrokerCommand we have a variable mCurrentVdsId, it is shared between different threads.
This variable can be set to null from different thread for example via ResetIrsVDSCommand.
The critical section are not synchronized and never were.
So this is a race by design and this is , I suppose, an oldest race - it is our moby-dick.
The problem is: monitoring of pool not stopped during reconstruct, monitoring of pool can run during spm stop.
Why we see a problem now and did not saw before, I suppose nobody take attention to exception in logs.
Created attachment 709049 [details] logs Description of problem: we are getting java.lang.NullPointerException for SpmStatusVDS Version-Release number of selected component (if applicable): sf10 How reproducible: 100% Steps to Reproduce: 1. in two hosts cluster with 1 NFS domain block connectivity to the storage from both hosts 2. 3. Actual results: java.lang.NullPointerException for SpmStatusVDS Expected results: we should not get NPE Additional info:logs engine: 2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-72) Failed in SpmStatusVDS method 2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-72) Error code GeneralException and error message VDSGenericException: VDSErrorException: Failed to SpmStatusVDS, error = 2013-03-12 05:21:47,176 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-72) Command SpmStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SpmStatusV DS, error = 2013-03-12 05:21:47,177 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (QuartzScheduler_Worker-72) CreateCommand failed: java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:332) [rt.jar:1.7.0_09-icedtea] at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:987) [rt.jar:1.7.0_09-icedtea] at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:199) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.initializeVdsBroker(VdsBrokerCommand.java:51) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.<init>(VdsBrokerCommand.java:32) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand.<init>(HSMGetAllTasksStatusesVDSCommand.java:22) [engine-vdsbroker.jar:] at sun.reflect.GeneratedConstructorAccessor98.newInstance(Unknown Source) [:1.7.0_09-icedtea] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.7.0_09-icedtea] at java.lang.reflect.Constructor.newInstance(Constructor.java:525) [rt.jar:1.7.0_09-icedtea] at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:333) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:370) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.proceedStoragePoolStats(IrsBrokerCommand.java:231) [engine-vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData._updatingTimer_Elapsed(IrsBrokerCommand.java:183) [engine-vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_09-icedtea] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea] at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [engine-scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:] vdsm: Thread-66::INFO::2013-03-12 05:42:52,323::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7', options=None) Thread-66::ERROR::2013-03-12 05:42:52,324::task::833::TaskManager.Task::(_setError) Task=`1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 840, in _run return fn(*args, **kargs) File "/usr/share/vdsm/logUtils.py", line 38, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 603, in getSpmStatus pool = self.getPool(spUUID) File "/usr/share/vdsm/storage/hsm.py", line 313, in getPool raise se.StoragePoolUnknown(spUUID) StoragePoolUnknown: Unknown pool id, pool not connected: ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',) Thread-66::DEBUG::2013-03-12 05:42:52,324::task::852::TaskManager.Task::(_run) Task=`1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9`::Task._run: 1ea6a0d5-10dd-4cb6-961f-1fa66d719bf9 ('5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7',) {} failed - stopping t ask