.Ansible does not properly handle unresponsive tasks
Certain tasks, for example adding monitors with the same host name, cause the `ceph-ansible` utility to become unresponsive. Currently, there is no timeout set after which an unresponsive task is marked as failed.
Description of problem:
Sometimes things can happen that make the ansible-playbook command hang. For instance, adding MONs with the same hostname can lock up the process. The issue here is that if the process hangs and there is no timeout of any kind on the celery side of things, the worker never completes and the queue becomes stalled.
Version-Release number of selected component (if applicable):
How reproducible: Not highly reproducible.
Steps to Reproduce:
Actual results: The whole process seems "stuck"
Expected results: A timeout is handled, and the task is set to failed.
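A minimal sketch of the expected behavior, assuming the playbook is launched as a child process from Python. The helper name `run_with_timeout` and the shape of the returned dict are hypothetical and are not taken from ceph-installer; the point is only that a timeout turns a hang into a "failed" result instead of a worker that never completes.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command and report 'failed' if it exceeds the timeout.

    Hypothetical helper for illustration; not part of ceph-installer.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before re-raising,
        # so the caller only has to record the failure.
        return {"status": "failed", "reason": "timeout", "returncode": None}
    status = "succeeded" if completed.returncode == 0 else "failed"
    return {"status": status, "reason": None, "returncode": completed.returncode}


if __name__ == "__main__":
    # Stand-in for a hung playbook: a child that sleeps longer than the limit.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hung, timeout_seconds=1))
```

Celery also provides its own task time limits, which may be a better fit on the worker side than wrapping the subprocess call. Either way, the timeout value itself is a policy decision that this sketch does not settle.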
Tracked upstream at https://github.com/ceph/ceph-installer/issues/97
Can we ship 2.0 without this fix?
Yes, we should just ship. Andrew and I discussed this a bit and couldn't get to a reasonable agreement.
Hi Alfredo and Gregory,
This can happen for customers too. What is the plan to inform the customers/users what went wrong? Without that information they can get into the same situation again and again. I feel this issue needs to be fixed in 2.0 to make sure our customers are getting the right information.
Federico, Can you please check comment 5 and let me know PM decision on this?
(In reply to Harish NV Rao from comment #5)
> Hi Alfredo and Gregory,
> This can happen for customers too. What is the plan to inform the
> customers/users what went wrong? Without that information they can get into
> the same situation again and again. I feel this issue needs to be fixed in
> 2.0 to make sure our customers are getting the right information.
That is business logic that the storage controller could implement. There is no correct way to determine what/how/where a call to ansible is "stuck".
Sure, in the strictest sense there's no solution to the Halting problem, but practically speaking, if any individual task takes longer than 20 minutes, it's probably hung because something broke.
(In reply to Ken Dreyer (Red Hat) from comment #10)
> Sure, in the strictest sense there's no solution to the Halting problem, but
> practically speaking, if any individual task takes longer than 20 minutes,
> it's probably hung because something broke.
This is specifically why this is hard to solve. Where does the 20 minutes come from? If configuring one OSD usually takes 5 minutes, sure. What if it is configuring 100 OSDs? Or if the network is slow and a task is installing packages?
In ceph-deploy for example, timeouts had to be completely disabled for installation procedures: https://github.com/ceph/ceph-deploy/commit/2e6a480d03ef16ae09a281648617802d2d1eede0
There are other use cases where the 20 minute rule would fail as well, even when configuring one OSD: if a client makes 30 requests, those will get processed on a first-come-first-served basis, so even if request #30 is configuring one OSD that should take 5 minutes, it can potentially wait much longer than 20 minutes to complete.
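To put rough numbers on the queueing case, here is a small worked example. It assumes a single worker, all 30 requests submitted at the same moment, and every request taking exactly 5 minutes; those figures are illustrative assumptions, not measurements.

```python
def completion_minutes(durations):
    """Return when each request finishes, in minutes after submission,
    for a single worker that processes requests strictly in order."""
    finished = []
    elapsed = 0
    for duration in durations:
        elapsed += duration
        finished.append(elapsed)
    return finished


# Assumption: 30 requests, each taking 5 minutes of actual work.
durations = [5] * 30
finished = completion_minutes(durations)

# Time request #30 spends waiting in the queue before it even starts.
start_of_last = finished[-1] - durations[-1]
print(start_of_last, finished[-1])  # 145 150
```

Under these assumptions, a 20 minute limit measured from submission would fail request #30 before it starts. A limit would have to be measured from the moment a task begins executing for the rule to be meaningful at all.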
What is the alternative way to unwedge a stuck celery worker?
Clearing needinfo, as Alfredo provided it in comment 14.
> For instance, adding MONs with the same hostname
Let's reduce the scope to what is known: let's check for this error and exit. Any other issue will be filed separately.
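A minimal sketch of such a check, assuming the MON hostnames are available as a plain list before the playbook is invoked. The function names are hypothetical, and treating hostnames as case-insensitive is an assumption of this sketch rather than documented ceph-installer behavior.

```python
from collections import Counter


def find_duplicate_hostnames(hostnames):
    """Return the sorted hostnames that appear more than once.

    Hypothetical helper; names are compared case-insensitively (assumption).
    """
    counts = Counter(name.strip().lower() for name in hostnames)
    return sorted(name for name, count in counts.items() if count > 1)


def validate_mon_hostnames(hostnames):
    """Raise ValueError up front instead of letting the playbook hang."""
    duplicates = find_duplicate_hostnames(hostnames)
    if duplicates:
        raise ValueError(
            "duplicate MON hostnames: {}".format(", ".join(duplicates))
        )


if __name__ == "__main__":
    try:
        validate_mon_hostnames(["mon1", "mon2", "MON1"])
    except ValueError as error:
        print(error)  # duplicate MON hostnames: mon1
```

Failing early like this gives the caller a concrete error to report, which also addresses the earlier concern about telling users what went wrong.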