Bug 1341525

Summary:

cluster creation task has failed but cluster seems installed properly

Product:

[Red Hat Storage] Red Hat Storage Console

Reporter:

Martin Kudlej <mkudlej>

Component:

core

Assignee:

Nishanth Thomas <nthomas>

core sub component:

provisioning

QA Contact:

Martin Bukatovic <mbukatov>

Status:

CLOSED CURRENTRELEASE

Severity:

high

Priority:

unspecified

CC:

mbukatov, nthomas

Hardware:

Unspecified

OS:

Unspecified

Fixed In Version:

rhscon-core-0.0.26-1.el7scon.x86_64 rhscon-ceph-0.0.26-1.el7scon.x86_64

Last Closed:

2018-11-19 05:34:49 UTC

Type:

Bug

Bug Blocks:

1344192

Attachments:

Description	Flags
failure and success	none
log from monitor2	none

Description Martin Kudlej 2016-06-01 09:02:55 UTC

Description of problem:
I tried to create a cluster and the cluster creation task timed out. But even though the cluster creation task failed, the Ceph cluster seems to be installed properly. I think this is not an acceptable state. There should be consistency between the Console and Ceph clusters. In this case the user cannot be sure that the Ceph cluster was installed according to the inputs given in the cluster creation wizard in the Console.

I suspect that the cluster creation task timed out but the calamari task continued, so the cluster got installed. Either way, this is not correct Console behavior.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-15.el7scon.noarch
ceph-installer-1.0.11-1.el7scon.noarch
rhscon-ceph-0.0.20-1.el7scon.x86_64
rhscon-core-0.0.21-1.el7scon.x86_64
rhscon-ui-0.0.34-1.el7scon.noarch

How reproducible:
100%

Steps to Reproduce:
1. Try to create a cluster; the creation task should time out.
2. Check whether the Ceph cluster has been created.

Actual results:
The cluster creation task in the Console is inconsistent with the Ceph cluster status. The user cannot be sure whether the Ceph cluster was created according to the inputs given in the cluster creation wizard in the Console.

Expected results:
The cluster creation task in the Console is consistent with the Ceph cluster status.

Comment 3 Jeff Applewhite 2016-06-02 03:59:04 UTC

It sounds like 1) the root cause needs to be addressed here, to determine why the console loses contact with the installation process, and 2) there needs to be some way for a cluster creation definition to be "recovered" if it is lost. In other words, the console should go out and verify the state of the nodes it was attempting to install on and correct errors. It is my understanding that ceph-ansible (and ansible in general) is idempotent - meaning it can be safely run again without doing harm. If it finds items already completed, it should not matter. So my suggestion is to look into ways to simply repeat ansible operations on a node or nodes if we lose contact. This may or may not result in a resolution, but it should be tried first, and then we should fail with info if a second try times out. It may be that we are simply hitting timeouts but the operations are succeeding after the timeout.
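The "repeat once, then fail with info" suggestion could be sketched roughly as follows. This is only an illustration, assuming the operation is idempotent as described above; `OperationTimedOut`, `run_with_one_retry`, and the `operation(node)` callable are hypothetical names, not part of the Console or ceph-ansible.

```python
class OperationTimedOut(Exception):
    """Hypothetical error raised when an operation does not answer in time."""


def run_with_one_retry(operation, node):
    """Run an idempotent operation on a node, repeating it once after a timeout.

    Raises RuntimeError carrying the details of both attempts if the
    second try also times out.
    """
    attempts = []
    for attempt in (1, 2):
        try:
            return operation(node)
        except OperationTimedOut as exc:
            # Safe to repeat only because the operation is assumed idempotent.
            attempts.append(f"attempt {attempt}: {exc}")
    raise RuntimeError(f"{node}: gave up after 2 attempts ({'; '.join(attempts)})")
```

The retry is limited to a single repeat so that a node that is really unreachable is reported as failed instead of being retried forever.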

Comment 4 Nishanth Thomas 2016-06-02 15:07:39 UTC

1. There are situations where the southbound API requests hang for various reasons. Currently USM doesn't handle this effectively. USM should detect these and time out requests that have been hung for a long time.

2. Detecting the current state and repairing nodes might not be a good option at this point. For example, the disks might already have been partitioned and configurations might already have been applied. Cleaning all of this up might not be the job of USM. If the southbound API can provide this functionality, USM can implement the business logic around it. This probably requires more work and may not be feasible for 2.0.

3. Re-attempting the task is another option. USM should cache the request and replay it when requested by the user. I assume that, as Jeff mentioned, ceph-ansible is idempotent. This also requires a good amount of work, with changes in both the UI and the core.

So is it a good solution to implement option 1 at this point and think about the others post 2.0?
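For reference, the "cache the request and replay it" idea in option 3 could look something like the sketch below. Everything here is an assumption made for illustration: `RequestCache`, the `task_id` key, and the `submit` callable are hypothetical and do not describe how USM stores or sends requests.

```python
import copy


class RequestCache:
    """Hypothetical store that keeps the original request of each task."""

    def __init__(self):
        self._requests = {}

    def remember(self, task_id, payload):
        # Keep a private copy so later changes by the caller do not leak in.
        self._requests[task_id] = copy.deepcopy(payload)

    def replay(self, task_id, submit):
        """Send the cached request again through the given submit callable."""
        if task_id not in self._requests:
            raise KeyError(f"no cached request for task {task_id}")
        # Hand out a copy so the cached request stays replayable.
        return submit(copy.deepcopy(self._requests[task_id]))
```

Replaying the unmodified original request only makes sense if the southbound operation is idempotent, which is the assumption stated in option 3.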

Comment 5 Jeff Applewhite 2016-06-13 13:19:38 UTC

Yes let's proceed with 1) as suggested.

Comment 6 Darshan 2016-06-20 12:07:57 UTC

As a fix for this issue, we have introduced a timeout for ceph-installer calls: any call that does not respond within 15 minutes is timed out. That particular call is treated as a failure and we proceed with the next requests.
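A minimal sketch of that behavior, not the actual fix: each call gets a deadline, a call that misses it is recorded as failed, and the remaining calls still run. The 15 minute value comes from this comment; `CALL_TIMEOUT_SECONDS`, `run_calls_with_timeout`, and the `(name, callable)` input shape are assumptions for illustration.

```python
import concurrent.futures
import time

# 15 minutes, as stated in the comment above; the constant name is hypothetical.
CALL_TIMEOUT_SECONDS = 15 * 60


def run_calls_with_timeout(calls, timeout=CALL_TIMEOUT_SECONDS):
    """Run (name, callable) pairs in order and return a name -> status dict.

    A call that does not finish within `timeout` seconds is recorded as
    "timed out" and the next call is started.
    """
    results = {}
    for name, call in calls:
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call)
        try:
            future.result(timeout=timeout)
            results[name] = "success"
        except concurrent.futures.TimeoutError:
            results[name] = "timed out"
        except Exception as exc:
            results[name] = f"failed: {exc}"
        finally:
            # Stop waiting, but do not block on a call that is still running.
            pool.shutdown(wait=False)
    return results
```

Note that this sketch only stops waiting for the call. The abandoned call keeps running in the background, so the work may still complete after it has been reported as timed out.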

Comment 7 Martin Kudlej 2016-07-27 14:08:14 UTC

I still see this, as shown in the screenshot. How can I tell whether this is case 2) or 3), which will not be fixed in 2.0?

Comment 8 Martin Kudlej 2016-07-27 14:09:32 UTC

Created attachment 1184668 [details]
failure and success

Comment 10 Martin Kudlej 2016-07-27 14:19:01 UTC

Created attachment 1184673 [details]
log from monitor2

Comment 11 Martin Bukatovic 2016-07-29 13:39:15 UTC

The QE team will check whether this issue happens again during the whole
testing phase. If the issue does not appear, we are going to close it.

We are going to treat option 1 from comment 4 as the fix.

Comment 12 Martin Bukatovic 2016-08-08 13:48:42 UTC

Based on comment 11, I'm moving this BZ into VERIFIED state, since the QE team
hasn't seen this issue again during our testing phase.

The fact that the QE team hasn't reproduced this issue again most likely means
that the changes made by the dev team made this particular problem either
impossible or less likely to happen. That said, the general root cause is not
fixed, as noted in comment 4 and as demonstrated by recent BZ 1364547, which
means that a similar issue could still happen.