Bug 1653258 - Failed to upload image with "Too many tasks" error and "Worker blocked" exceptions in VDSM
Summary: Failed to upload image with "Too many tasks" error and "Worker blocked" excep...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.3
Hardware: x86_64
OS: Unspecified
low
high vote
Target Milestone: ovirt-4.3.0
: ---
Assignee: Edward Haas
QA Contact: Roni
URL:
Whiteboard:
Depends On:
Blocks: 1666123
TreeView+ depends on / blocked
 
Reported: 2018-11-26 11:52 UTC by Yosi Ben Shimon
Modified: 2019-12-10 10:57 UTC (History)
12 users (show)

Fixed In Version: v4.30.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-13 07:43:18 UTC
oVirt Team: Network
rule-engine: ovirt-4.3+
rule-engine: ovirt-4.4+


Attachments (Terms of Use)
logs (1.96 MB, application/x-gzip)
2018-11-26 12:19 UTC, Yosi Ben Shimon
no flags Details
logs2 (1.38 MB, application/zip)
2018-12-19 15:46 UTC, Elad
no flags Details
deactivate_host (904.84 KB, application/zip)
2018-12-19 15:51 UTC, Roni
no flags Details
test_autosync_30_networks.zip (702.13 KB, application/zip)
2018-12-20 12:01 UTC, Roni
no flags Details
logs3 (1.46 MB, application/gzip)
2018-12-26 11:15 UTC, Elad
no flags Details


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 96455 master MERGED net: Avoid deadlock when using the netlink package 2018-12-27 06:59:52 UTC

Description Yosi Ben Shimon 2018-11-26 11:52:15 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yosi Ben Shimon 2018-11-26 12:18:20 UTC
* Previous description submitted by mistake.

Description of problem:
Upload image test case failed with "too many tasks" error in the engine log and "Worker blocked" exceptions in VDSM.
The test case adds 4 disks and tries to upload them Asynchronous.
3 of the disks were uploaded successfully.
The 4th disk failed with errors in the engine log:

2018-11-24 08:28:14,712+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] Command 'ExtendImageTicketVDSCommand(HostName = host_mixed_3, ExtendImageTicketVDSCommandParameters:{hostId='f5382529-b7cc-4727-83f2-cb12b9fb9b80', ticketId='ce2a124a-fac9-48ba-a429-2e0759f2061d', timeout='300'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-11-24 08:28:14,712+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] FINISH, ExtendImageTicketVDSCommand, return: , log id: 1c09a200
2018-11-24 08:28:14,713+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-12794) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] Host 'host_mixed_3' is not responding.
2018-11-24 08:28:14,713+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] Failed to extend image transfer session for ticket 'ce2a124a-fac9-48ba-a429-2e0759f2061d': {}: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
	at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendImageTransferSession(TransferDiskImageCommand.java:914) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendTicketIfNecessary(TransferDiskImageCommand.java:506) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleTransferring(TransferDiskImageCommand.java:495) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeStateHandler(TransferDiskImageCommand.java:364) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.proceedCommandExecution(TransferDiskImageCommand.java:345) [bll.jar:]
	at org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:146) [bll.jar:]
	at org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:107) [bll.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_191]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_191]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_191]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_191]
	at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent.jar:1.0.0.redhat-1]
	at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)
Caused by: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:189) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand.executeVdsBrokerCommand(ExtendImageTicketVDSCommand.java:16) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVdsCommandWithNetworkEvent(VdsBrokerCommand.java:123) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:396) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source) [:1.8.0_191]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_191]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_191]
	at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:51) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12) [common.jar:]
	at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source) [:1.8.0_191]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_191]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_191]
	at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:85) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:73) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:57) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68) [weld-core-impl.jar:2.4.7.Final-redhat-1]
	at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:]
	... 18 more


2018-11-24 08:28:14,765+02 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] Failed to renew transfer ticket for Upload disk 'disk_qcow2_v3_2408234260' (disk id: '7a81e99c-03ec-4e12-afc3-5b3f394d0e9f', image id: 'ed1b759c-2c31-4d8e-896d-bc5cdf7bd2ae')


2018-11-24 08:31:27,035+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-22) [ebbc056d-829a-4b38-9ed9-6fd8ed4a62f1] Transfer was paused by system. Upload disk 'disk_qcow2_v3_2408234260' (disk id: '7a81e99c-03ec-4e12-afc3-5b3f394d0e9f', image id: 'ed1b759c-2c31-4d8e-896d-bc5cdf7bd2ae')


In VDSM log, ~100 exception like this can be found:
2018-11-24 08:28:12,941+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'ff344c57-2d79-459c-9082-deb14d234808'} at 0x7f81fc5ffc90> timeout=60, duration=1080.00 at 0x7f81fc5fff50> task#=8653 at 0x7f821c357190>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getStats
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1407, in getStats
  multipath=True)}
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
  decStats = stats.produce(first_sample, last_sample)
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
  stats.update(get_interfaces_stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
  return net_api.network_stats()
File: "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
  return netstats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
  stats = link_stats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 34, in report
  stats[i.device] = i.statistics()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 232, in statistics
  'state': 'up' if self.is_oper_up() else 'down',
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 193, in is_oper_up
  properties = self.properties()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 165, in properties
  info = link.get_link(self._dev)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 39, in get_link
  link_info = _link_info(link)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 98, in _link_info
  info['master'] = _link_index_to_name(master_index, cache=cache)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 112, in _link_index_to_name
  with _get_link(index=link_index) as link:
File: "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
  return self.gen.next()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 139, in _get_link
  with _pool.socket() as sock:
File: "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
  return self.gen.next()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 47, in socket
  with self._semaphore:
File: "/usr/lib64/python2.7/threading.py", line 471, in acquire
  self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait
  return self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait
  return _libpthread.pthread_cond_wait(self._cond, m.mutex()) (executor:363)


Version-Release number of selected component (if applicable):
ovirt-engine-4.3.0-0.5.alpha1.el7.noarch
ovirt-imageio-proxy-1.4.6-1.el7ev.noarch
vdsm-4.30.3-1.el7ev.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64


How reproducible:
Unknown as it failed in automation and didn't manage to reproduce it again manually/autmation.

Steps to Reproduce:
1.
2.
3.

Actual results:
The upload failed

Expected results:
The upload should succeed

Additional info:
Attached engine, VDSM, art and image-proxy logs

Comment 2 Yosi Ben Shimon 2018-11-26 12:19:30 UTC
Created attachment 1508550 [details]
logs

Comment 3 Martin Perina 2018-11-27 09:20:16 UTC
Dan, could you please take a look? From above vdsm.log part it seems to me that VDSM task is blocked in networking code ...

Comment 4 Dan Kenigsberg 2018-11-27 13:46:19 UTC
I think that Eddy already has, and can share his current understanding.

Comment 5 Dominik Holler 2018-12-04 11:47:27 UTC
Since Edy is not available, can you please have a look, Dan?

Comment 6 Dan Kenigsberg 2018-12-04 13:13:47 UTC
I'm afraid I cannot. I would guess that there is a ctype bug that can lock a thread; if vdsm is regardlessly busy, we can see this on the logs.

I would not consider it an urgency unless it suddenly reproduces more often.

Comment 7 Dan Kenigsberg 2018-12-11 11:31:03 UTC
BTW, I assume that "How reproducible:" is "very rarely" or "once ever". Is that true?

Comment 8 Yosi Ben Shimon 2018-12-13 13:57:30 UTC
(In reply to Dan Kenigsberg from comment #7)
> BTW, I assume that "How reproducible:" is "very rarely" or "once ever". Is
> that true?

Yes.
As I wrote in the description, I didn't manage to reproduce it again so I'll have to say "once ever".

Comment 9 Elad 2018-12-19 15:46:59 UTC
Created attachment 1515629 [details]
logs2

Encountered again, this time during start VM:

2018-12-12 05:22:05,983+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getStats', 'id': u'79e6a07c-00f2-4dfb-9d82-e3f221e506cf'} at 0x7ff1a4342810> timeout=60, duration=960.01 at 0x7ff1a4342ad0> task#=7761 at 0x7ff1c4121410>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getStats
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1407, in getStats
  multipath=True)}
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
  decStats = stats.produce(first_sample, last_sample)
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
  stats.update(get_interfaces_stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
  return net_api.network_stats()
File: "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
  return netstats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
  stats = link_stats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 32, in report
  for iface_properties in iface.list():
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 257, in list
  for properties in itertools.chain(link.iter_links(), dpdk_links):
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 46, in iter_links
  with _pool.socket() as sock:
File: "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
  return self.gen.next()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 47, in socket
  with self._semaphore:
File: "/usr/lib64/python2.7/threading.py", line 471, in acquire
  self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait
  return self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait
  return _libpthread.pthread_cond_wait(self._cond, m.mutex()) (executor:363)

Comment 10 Roni 2018-12-19 15:51:27 UTC
Created attachment 1515631 [details]
deactivate_host

Comment 11 Roni 2018-12-19 15:54:18 UTC
The problem reproduces at 4.3 during Automation, when it tries to deactivate_host
see attached logs "deactivate_host"

Comment 12 Raz Tamir 2018-12-19 16:46:13 UTC
This bug reproduced more and more in our automation.

Raising severity. 
Please consider retarget it to 4.3

Comment 13 Roni 2018-12-20 12:00:19 UTC
At the following log we can see that it occurs due to timeout exception
Maybe this is the root cause?
See attached logs: test_autosync_30_networks.zip

2018-12-19 09:17:14,412+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f48ec73ed50> at 0x7f48ec73ed90> timeout=15, duration=15.09 at 0x7f48ec3b2790> task#=316 at 0x7f48ec71ed10>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 50, in get_stats
  decStats = stats.produce(first_sample, last_sample)
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 72, in produce
  stats.update(get_interfaces_stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/stats.py", line 154, in get_interfaces_stats
  return net_api.network_stats()
File: "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 63, in network_stats
  return netstats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netstats.py", line 32, in report
  stats = link_stats.report()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/stats.py", line 39, in report
  elif i.type() == iface.Type.BOND:
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 225, in type
  return self.properties().get('type', get_alternative_type(self._dev))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/iface.py", line 165, in properties
  info = link.get_link(self._dev)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 39, in get_link
  link_info = _link_info(link)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 98, in _link_info
  info['master'] = _link_index_to_name(master_index, cache=cache)
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 112, in _link_index_to_name
  with _get_link(index=link_index) as link:
File: "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
  return self.gen.next()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/link.py", line 139, in _get_link
  with _pool.socket() as sock:
File: "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
  return self.gen.next()
File: "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 47, in socket
  with self._semaphore:
File: "/usr/lib64/python2.7/threading.py", line 471, in acquire
  self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 127, in wait
  return self.__cond.wait()
File: "/usr/lib/python2.7/site-packages/pthread.py", line 131, in wait
  return _libpthread.pthread_cond_wait(self._cond, m.mutex()) (executor:363)

Comment 14 Roni 2018-12-20 12:01:44 UTC
Created attachment 1515843 [details]
test_autosync_30_networks.zip

Comment 15 Elad 2018-12-26 11:15:20 UTC
Created attachment 1516824 [details]
logs3

Encountered for 'method': u'Image.syncData':


2018-12-26 11:41:21,868+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'syncType': u'INTERNAL', u'dstSdUUID': u'99cd05da-d404-449d-8d2f-9330bb087c61', u'storagepoolID': u'9cc6e14a-784e-418f-84db-11fa9ff07a80', u'storagedomainID': u'19bafff8-5d66-4550-a24e-c8710768e7a9', u'imageID': u'320b2ef0-f35d-47dc-b046-
e85f2438f6c1'}, 'jsonrpc': '2.0', 'method': u'Image.syncData', 'id': u'8856d7ee-552a-4b88-b101-426fe01cfa2a'} at 0x7f4420029290> timeout=60, duration=120.00 at 0x7f4420029c90> task#=8912 at 0x7f442c0e91d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 907, in syncData
  dstSdUUID, syncType)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 72, in wrapper
  result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
  return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare
  result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
  return fn(*args, **kargs)
File: "<string>", line 2, in syncImageData
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1705, in syncImageData
  vars.task.getSharedLock(STORAGE, dom)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1383, in getSharedLock
  timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 784, in acquire
  timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1026, in acquireResource
  return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 465, in acquireResource
  request.wait(timeout)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 235, in wait
  return self._doneEvent.wait(timeout)
File: "/usr/lib64/python2.7/threading.py", line 622, in wait
  self.__cond.wait(timeout, balancing)
File: "/usr/lib/python2.7/site-packages/pthreading.py", line 125, in wait
  return self.__cond.timedwait(abstime)
File: "/usr/lib/python2.7/site-packages/pthread.py", line 136, in timedwait
  C.pointer(abstime)) (executor:363)



2018-12-26 11:41:21,883+0200 INFO  (jsonrpc/5) [vdsm.api] FINISH syncImageData error=Resource timeout: () from=::ffff:10.35.161.127,60606, flow_id=disks_syncAction_1142755c-bfd3-44e7, task_id=76e2dba2-d15b-4acc-88f0-d3961f940bc1 (api:52)
2018-12-26 11:41:21,884+0200 ERROR (jsonrpc/5) [storage.TaskManager.Task] (Task='76e2dba2-d15b-4acc-88f0-d3961f940bc1') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in syncImageData
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1705, in syncImageData
    vars.task.getSharedLock(STORAGE, dom)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1383, in getSharedLock
    timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 793, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
2018-12-26 11:41:21,884+0200 DEBUG (jsonrpc/5) [storage.TaskManager.Task] (Task='76e2dba2-d15b-4acc-88f0-d3961f940bc1') Task._run: 76e2dba2-d15b-4acc-88f0-d3961f940bc1 (u'9cc6e14a-784e-418f-84db-11fa9ff07a80', u'19bafff8-5d66-4550-a24e-c8710768e7a9', u'320b2ef0-f35d-47dc-b046-e85f2438f6c1', u'99cd05da-d404-449d-8d2f-9330bb087c61', u'INTERNAL') {} failed - stopping task (task:894)

Comment 17 Roni 2019-02-05 16:26:05 UTC
Verified d/s v4.3.0.4-0.1.el7

Comment 18 Sandro Bonazzola 2019-02-13 07:43:18 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.