Bug 1313935 - validate input: adding two mons with same host
Status: CLOSED WONTFIX
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Installer
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 3.1
Assigned To: Gregory Meno
QA Contact: sds-qe-bugs
Docs Contact: Bara Ancincova
Depends On:
Blocks: 1322504 1383917 1412948 1494421
Reported: 2016-03-02 11:09 EST by Alfredo Deza
Modified: 2018-02-20 11:36 EST

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Ansible does not properly handle unresponsive tasks
Certain tasks, for example adding monitors with the same host name, cause the `ceph-ansible` utility to become unresponsive. Currently, there is no timeout after which an unresponsive task is marked as failed.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-20 11:36:38 EST
Type: Bug


Attachments: None
Description Alfredo Deza 2016-03-02 11:09:30 EST
Description of problem:
The ansible-playbook command can occasionally hang. For instance, adding mons with the same hostname can lock up the process. If the process hangs and there is no timeout of any kind on the Celery side, the worker never completes and the queue stalls.
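
The failure mode described above (a child process that never returns and therefore blocks the worker forever) can be illustrated with a minimal sketch. It is not the ceph-installer implementation: the function name `run_with_timeout`, the result dictionary, and the one-second limit are assumptions made for illustration, and a sleeping Python process stands in for a hung `ansible-playbook` run. It relies only on the standard library's `subprocess.run` with its `timeout` argument.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run a command; report 'failed' instead of blocking forever on a hang."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return {"status": "failed", "reason": "timeout"}
    if completed.returncode != 0:
        return {"status": "failed", "reason": "exit code %d" % completed.returncode}
    return {"status": "succeeded", "reason": None}


# Stand-in for a hung ansible-playbook run: a child process that sleeps.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
result = run_with_timeout(hung, timeout_seconds=1)
print(result)
```

With a limit in place the worker gets control back and can record the task as failed. Choosing the limit is the hard part, and this sketch does not address it.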


Version-Release number of selected component (if applicable):


How reproducible: Not highly reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results: The whole process seems "stuck"


Expected results: A timeout is handled, and the task is set to failed.


Additional info:
Comment 2 Ken Dreyer (Red Hat) 2016-03-02 13:01:50 EST
tracked upstream @ https://github.com/ceph/ceph-installer/issues/97
Comment 3 Gregory Meno 2016-04-25 17:37:34 EDT
Can we ship 2.0 without this fix?
Comment 4 Alfredo Deza 2016-04-26 07:21:05 EDT
Yes, we should just ship. Andrew and I discussed this a bit and couldn't get to a reasonable agreement.
Comment 5 Harish NV Rao 2016-04-27 02:28:05 EDT
Hi Alfredo and Gregory,

This can happen for customers too. What is the plan to inform the customers/users what went wrong? Without that information they can get into the same situation again and again. I feel this issue needs to be fixed in 2.0 to make sure our customers are getting the right information.

Harish
Comment 8 Harish NV Rao 2016-05-12 05:07:52 EDT
Federico, can you please check comment 5 and let me know the PM decision on this?
Comment 9 Alfredo Deza 2016-05-12 07:41:32 EDT
(In reply to Harish NV Rao from comment #5)
> Hi Alfredo and Gregory,
> 
> This can happen for customers too. What is the plan to inform the
> customers/users what went wrong? Without that information they can get into
> the same situation again and again. I feel this issue needs to be fixed in
> 2.0 to make sure our customers are getting the right information.
> 
> Harish

That is business logic that the storage controller could implement. There is no correct way to determine what/how/where a call to ansible is "stuck".
Comment 10 Ken Dreyer (Red Hat) 2016-05-12 10:40:53 EDT
Sure, in the strictest sense there's no solution to the Halting problem, but practically speaking, if any individual task takes longer than 20 minutes, it's probably hung because something broke.
Comment 11 Alfredo Deza 2016-05-12 10:56:18 EDT
(In reply to Ken Dreyer (Red Hat) from comment #10)
> Sure, in the strictest sense there's no solution to the Halting problem, but
> practically speaking, if any individual task takes longer than 20 minutes,
> it's probably hung because something broke.

This is specifically why this is hard to solve. Where does the 20 minutes come from? If configuring one OSD usually takes 5 minutes, sure. What if it is configuring 100 OSDs? Or if the network is slow and a task is installing packages?

In ceph-deploy for example, timeouts had to be completely disabled for installation procedures: https://github.com/ceph/ceph-deploy/commit/2e6a480d03ef16ae09a281648617802d2d1eede0

There are other use cases where the 20-minute rule would fail as well, even when configuring one OSD: if a client makes 30 requests, they are processed on a first-come, first-served basis, so even if request #30 is configuring one OSD that should take 5 minutes, it could wait far longer than 20 minutes to complete.
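
The queueing arithmetic in this comment can be worked through with a small sketch. The assumptions are illustrative and not taken from ceph-installer: a single worker serves requests strictly in order, and every one of the 30 requests needs exactly 5 minutes of work. The helper name `fifo_schedule` is hypothetical.

```python
def fifo_schedule(durations_minutes):
    """Return (wait, finish) minutes for each request served one at a time."""
    schedule = []
    clock = 0
    for duration in durations_minutes:
        wait = clock        # time spent queued before work starts
        clock += duration   # the single worker is busy for `duration`
        schedule.append((wait, clock))
    return schedule


# Assumption: 30 requests, each needing 5 minutes of actual work.
schedule = fifo_schedule([5] * 30)
wait_30, finish_30 = schedule[-1]
print("request #30 waits %d min, finishes at %d min" % (wait_30, finish_30))
# → request #30 waits 145 min, finishes at 150 min
```

Under these assumptions, a 20-minute limit measured from submission would fail request #30 even though its own work takes only 5 minutes. A limit measured from the moment the worker starts the task would not, though that still leaves open how long the task itself may legitimately run.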
Comment 12 Ken Dreyer (Red Hat) 2016-05-12 11:03:12 EDT
What is the alternative way to unwedge a stuck celery worker?
Comment 15 Gregory Meno 2016-08-29 12:56:21 EDT
Clearing needinfo, as Alfredo provided it in comment 14.
Comment 18 Federico Lucifredi 2017-06-29 09:25:37 EDT
> For instance, adding MONs with the same hostname 

Let's reduce the scope to what is known: let's check for this error and exit. Any other issue will be filed separately.
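
A check of the kind proposed here could look roughly like the following sketch. It is an assumption-laden illustration, not the ceph-installer code: the payload shape (a list of dictionaries with `host` and `address` keys) and the function names `find_duplicate_hosts` and `validate_monitors` are hypothetical. Host names are compared case-insensitively, which is a design choice of this sketch.

```python
from collections import Counter


def find_duplicate_hosts(monitors):
    """Return the sorted host names that appear more than once."""
    counts = Counter(mon["host"].strip().lower() for mon in monitors)
    return sorted(host for host, count in counts.items() if count > 1)


def validate_monitors(monitors):
    """Raise ValueError before any playbook runs if hosts are duplicated."""
    duplicates = find_duplicate_hosts(monitors)
    if duplicates:
        raise ValueError("duplicate monitor host(s): " + ", ".join(duplicates))


# Hypothetical payload shape; the field names are illustrative only.
monitors = [
    {"host": "mon1.example.com", "address": "192.0.2.10"},
    {"host": "mon2.example.com", "address": "192.0.2.11"},
    {"host": "MON1.example.com", "address": "192.0.2.12"},
]
print(find_duplicate_hosts(monitors))
```

Rejecting the request up front means the known trigger never reaches the playbook, so no timeout value has to be chosen for this case.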
