Bug 1313935
| Summary: | validate input: adding two mons with same host | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Alfredo Deza <adeza> |
| Component: | Ceph-Installer | Assignee: | Christina Meno <gmeno> |
| Status: | CLOSED WONTFIX | QA Contact: | sds-qe-bugs |
| Severity: | medium | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | high | | |
| Version: | 3.0 | CC: | adeza, aschoen, ceph-eng-bugs, flucifre, gmeno, hnallurv, kdreyer, nthomas, racpatel, sankarshan, wusui |
| Target Milestone: | rc | | |
| Target Release: | 3.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | **Ansible does not properly handle unresponsive tasks.** Certain tasks, for example adding monitors with the same host name, cause the `ceph-ansible` utility to become unresponsive. Currently, there is no timeout after which an unresponsive task is marked as failed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-20 16:36:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1322504, 1383917, 1412948, 1494421 | | |
Description
Alfredo Deza
2016-03-02 16:09:30 UTC
tracked upstream @ https://github.com/ceph/ceph-installer/issues/97

Can we ship 2.0 without this fix?

Yes, we should just ship. Andrew and I discussed this a bit and couldn't get to a reasonable agreement.

Hi Alfredo and Gregory,

This can happen for customers too. What is the plan to inform the customers/users what went wrong? Without that information they can get into the same situation again and again. I feel this issue needs to be fixed in 2.0 to make sure our customers are getting the right information.

Harish

Federico, can you please check comment 5 and let me know the PM decision on this?

(In reply to Harish NV Rao from comment #5)
> Hi Alfredo and Gregory,
>
> This can happen for customers too. What is the plan to inform the
> customers/users what went wrong? Without that information they can get into
> the same situation again and again. I feel this issue needs to be fixed in
> 2.0 to make sure our customers are getting the right information.
>
> Harish

That is business logic that the storage controller could implement. There is no correct way to determine what/how/where a call to Ansible is "stuck".

Sure, in the strictest sense there's no solution to the Halting problem, but practically speaking, if any individual task takes longer than 20 minutes, it's probably hung because something broke.

(In reply to Ken Dreyer (Red Hat) from comment #10)
> Sure, in the strictest sense there's no solution to the Halting problem, but
> practically speaking, if any individual task takes longer than 20 minutes,
> it's probably hung because something broke.

This is specifically why this is hard to solve. Where does the 20-minute figure come from? If configuring one OSD usually takes 5 minutes, sure. What if it is configuring 100 OSDs? Or if the network is slow and a task is installing packages? In ceph-deploy, for example, timeouts had to be completely disabled for installation procedures: https://github.com/ceph/ceph-deploy/commit/2e6a480d03ef16ae09a281648617802d2d1eede0

There are other use cases where the 20-minute rule would fail as well, even when configuring a single OSD: if a client makes 30 requests, those are processed on a first-come, first-served basis, so even if request #30 is configuring one OSD that should take 5 minutes, it can potentially wait far longer than 20 minutes to complete.

What is the alternative way to unwedge a stuck Celery worker?

Clearing needinfo as Alfredo provided it in c14.

> For instance, adding MONs with the same hostname

Let's reduce the scope to what is known: let's check for this error and exit. Any other issue will be filed separately.
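
To make that narrowed scope concrete, here is a minimal sketch of what such an up-front check might look like: reject the request when two monitor entries resolve to the same hostname instead of letting the playbook hang. This is an illustration only; `validate_unique_mon_hosts`, the field names, and the input format are hypothetical and not taken from ceph-installer or ceph-ansible.

```python
# Illustrative sketch only: fail fast on duplicate monitor hostnames
# instead of letting the install hang. Names and data shape are hypothetical.
def validate_unique_mon_hosts(monitors):
    """Raise ValueError if any two monitor entries share a hostname."""
    seen = {}
    for mon in monitors:
        host = mon["host"].strip().lower()
        if host in seen:
            raise ValueError(
                f"monitor '{mon['name']}' reuses hostname '{host}' "
                f"already claimed by '{seen[host]}'"
            )
        seen[host] = mon["name"]


if __name__ == "__main__":
    try:
        # Two MONs on the same host: this should be rejected immediately.
        validate_unique_mon_hosts([
            {"name": "mon.a", "host": "ceph-node1"},
            {"name": "mon.b", "host": "ceph-node1"},
        ])
    except ValueError as err:
        print(f"refusing to proceed: {err}")
```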
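
Separately, on the question raised above about unwedging a stuck Celery worker: Celery does provide per-task soft and hard time limits, which is one way a hang could be surfaced as a failure, although the discussion above explains why any fixed cutoff is hard to choose. The sketch below is an assumption-laden illustration, not the ceph-installer implementation; the task name, broker URL, and 20-minute figure are placeholders, while `soft_time_limit`, `time_limit`, and `SoftTimeLimitExceeded` are standard Celery features.

```python
# Illustrative sketch only -- not the actual ceph-installer task definition.
# The 20-minute limit is the figure debated above; the broker URL and task
# body are placeholder assumptions.
import subprocess

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("installer_example", broker="redis://localhost:6379/0")


@app.task(soft_time_limit=1200, time_limit=1260)  # 20 min soft, 21 min hard
def run_playbook(playbook, inventory):
    """Run a playbook and fail the task instead of hanging indefinitely."""
    try:
        result = subprocess.run(
            ["ansible-playbook", "-i", inventory, playbook],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    except SoftTimeLimitExceeded:
        # Celery raises this inside the task when the soft limit expires,
        # so the failure is recorded rather than the worker staying wedged.
        raise RuntimeError("ansible-playbook exceeded the configured time limit")
```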