Bug 1313935 - validate input: adding two mons with same host
Summary: validate input: adding two mons with same host
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Installer
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 3.1
Assignee: Christina Meno
QA Contact: sds-qe-bugs
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Depends On:
Blocks: 1322504 1383917 1412948 1494421
Reported: 2016-03-02 16:09 UTC by Alfredo Deza
Modified: 2018-02-20 16:36 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Ansible does not properly handle unresponsive tasks
Certain tasks, for example adding monitors with the same host name, cause the `ceph-ansible` utility to become unresponsive. Currently, no timeout is set after which an unresponsive task is marked as failed.
Clone Of:
Environment:
Last Closed: 2018-02-20 16:36:38 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1334636 | 0 | urgent | CLOSED | ceph-disk waiting for file lock | 2021-02-22 00:41:40 UTC

Internal Links: 1334636

Description Alfredo Deza 2016-03-02 16:09:30 UTC
Description of problem:
Certain conditions can make the ansible-playbook command hang. For instance, adding mons with the same hostname can lock up the process. If the process hangs and there is no timeout of any kind on the Celery side, the worker never completes and the queue stalls.


Version-Release number of selected component (if applicable):


How reproducible: Not highly reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results: The whole process seems "stuck"


Expected results: A timeout is handled, and the task is set to failed.
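
A minimal sketch of the expected behaviour, assuming the worker launches the playbook as a child process. The helper name `run_with_timeout`, the shape of the returned dictionary, and the 1200-second default are hypothetical and are not taken from ceph-installer; only `subprocess.run` and its `timeout` parameter are standard Python.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds=1200):
    """Run a command and report it as failed if it exceeds the timeout.

    Hypothetical helper: the name, return shape, and default limit are
    assumptions for illustration, not ceph-installer code.
    """
    try:
        # subprocess.run kills the child process and raises
        # TimeoutExpired once timeout_seconds has elapsed.
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"succeeded": False, "timed_out": True, "exit_code": None}
    return {
        "succeeded": completed.returncode == 0,
        "timed_out": False,
        "exit_code": completed.returncode,
    }


# Stand-in for a hung playbook: a child that sleeps longer than the limit.
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"],
    timeout_seconds=1,
)
print(hung["timed_out"])  # True
```

Note that only the direct child is killed here. Processes that the child itself spawned may survive and would need separate cleanup.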


Additional info:

Comment 2 Ken Dreyer (Red Hat) 2016-03-02 18:01:50 UTC
tracked upstream @ https://github.com/ceph/ceph-installer/issues/97

Comment 3 Christina Meno 2016-04-25 21:37:34 UTC
Can we ship 2.0 without this fix?

Comment 4 Alfredo Deza 2016-04-26 11:21:05 UTC
Yes, we should just ship. Andrew and I discussed this a bit and couldn't get to a reasonable agreement.

Comment 5 Harish NV Rao 2016-04-27 06:28:05 UTC
Hi Alfredo and Gregory,

This can happen for customers too. What is the plan to inform the customers/users what went wrong? Without that information they can get into the same situation again and again. I feel this issue needs to be fixed in 2.0 to make sure our customers are getting the right information.

Harish

Comment 8 Harish NV Rao 2016-05-12 09:07:52 UTC
Federico, can you please check comment 5 and let me know the PM decision on this?

Comment 9 Alfredo Deza 2016-05-12 11:41:32 UTC
(In reply to Harish NV Rao from comment #5)
> Hi Alfredo and Gregory,
> 
> This can happen for customers too. What is the plan to inform the
> customers/users what went wrong? Without that information they can get into
> the same situation again and again. I feel this issue needs to be fixed in
> 2.0 to make sure our customers are getting the right information.
> 
> Harish

That is business logic that the storage controller could implement. There is no correct way to determine what/how/where a call to ansible is "stuck".

Comment 10 Ken Dreyer (Red Hat) 2016-05-12 14:40:53 UTC
Sure, in the strictest sense there's no solution to the Halting problem, but practically speaking, if any individual task takes longer than 20 minutes, it's probably hung because something broke.

Comment 11 Alfredo Deza 2016-05-12 14:56:18 UTC
(In reply to Ken Dreyer (Red Hat) from comment #10)
> Sure, in the strictest sense there's no solution to the Halting problem, but
> practically speaking, if any individual task takes longer than 20 minutes,
> it's probably hung because something broke.

This is specifically why this is hard to solve. Where do the 20 minutes come from? If configuring one OSD usually takes 5 minutes, sure. What if it is configuring 100 OSDs? Or if the network is slow and a task is installing packages?

In ceph-deploy for example, timeouts had to be completely disabled for installation procedures: https://github.com/ceph/ceph-deploy/commit/2e6a480d03ef16ae09a281648617802d2d1eede0

There are other use cases where the 20 minute rule would fail as well, even when configuring one OSD: if a client makes 30 requests, those will be processed on a first-come, first-served basis, so even if request #30 is configuring one OSD that should take 5 minutes, it can potentially wait far longer than 20 minutes to complete.
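
The queueing effect described above can be illustrated with a small standard-library simulation. This is a sketch only, not Celery or ceph-installer code; the function name `process_fifo` and the report keys are hypothetical. It runs requests in FIFO order and checks the same limit in two ways: measured from the moment the request was enqueued, and measured from the moment it started running.

```python
import time
from collections import deque


def process_fifo(tasks, limit_seconds):
    """Run callables in FIFO order and time each one two ways.

    tasks: list of (name, callable) pairs, all treated as enqueued at once.
    Returns {name: {...}} with the wait time, the run time, and whether
    the limit is exceeded when measured from enqueue or from start.
    """
    pending = deque(tasks)
    enqueued_at = time.monotonic()
    report = {}
    while pending:
        name, func = pending.popleft()
        started_at = time.monotonic()
        func()
        finished_at = time.monotonic()
        report[name] = {
            "waited": started_at - enqueued_at,
            "ran": finished_at - started_at,
            "late_from_enqueue": finished_at - enqueued_at > limit_seconds,
            "late_from_start": finished_at - started_at > limit_seconds,
        }
    return report


# Five requests that each take about 0.1 s, checked against a 0.3 s limit.
requests = [("request-%d" % i, lambda: time.sleep(0.1)) for i in range(1, 6)]
report = process_fifo(requests, limit_seconds=0.3)
last = report["request-5"]
print(last["waited"], last["ran"], last["late_from_enqueue"])
```

The last request runs no longer than the first, yet it exceeds the limit when the clock starts at enqueue time, because it spent most of that time waiting behind the other four. A limit measured from the start of execution avoids that particular problem, but it still does not say what the right limit is for a given task.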

Comment 12 Ken Dreyer (Red Hat) 2016-05-12 15:03:12 UTC
What is the alternative way to unwedge a stuck celery worker?

Comment 15 Christina Meno 2016-08-29 16:56:21 UTC
Clearing needinfo, as Alfredo provided it in c14.

Comment 18 Federico Lucifredi 2017-06-29 13:25:37 UTC
> For instance, adding MONs with the same hostname 

Let's reduce the scope to what is known: let's check for this error and exit. Any other issue will be filed separately.
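
A minimal sketch of such a check, assuming the monitors arrive as a list of dictionaries with a "host" key. That input shape and the function names `find_duplicate_hosts` and `validate_monitors` are hypothetical and are not taken from ceph-installer. Host names are compared case-insensitively here, which is a design choice of this sketch.

```python
from collections import Counter


def find_duplicate_hosts(monitors):
    """Return the sorted host names that appear more than once.

    `monitors` is assumed to be a list of dicts with a "host" key
    (hypothetical input shape). Names are normalised by stripping
    whitespace and lower-casing before they are compared.
    """
    counts = Counter(mon["host"].strip().lower() for mon in monitors)
    return sorted(host for host, count in counts.items() if count > 1)


def validate_monitors(monitors):
    """Raise ValueError before any work is queued if hosts repeat."""
    duplicates = find_duplicate_hosts(monitors)
    if duplicates:
        raise ValueError("duplicate monitor hosts: " + ", ".join(duplicates))


monitors = [{"host": "mon1"}, {"host": "mon2"}, {"host": "MON1"}]
print(find_duplicate_hosts(monitors))  # ['mon1']
```

Rejecting the request up front means the duplicate never reaches the playbook, so this particular hang cannot occur. It does nothing for other causes of a stuck task.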

