Description of problem:
I tried to create a cluster and the cluster creation task timed out. However, even though the cluster creation task failed, the Ceph cluster appears to have been installed properly. I think this is not an acceptable state: there should be consistency between the Console and the Ceph cluster. As it stands, the user cannot be sure that the Ceph cluster is installed according to the inputs given in the cluster creation wizard in the Console. My guess is that the cluster creation task in the Console timed out while the calamari task continued, so the cluster got installed anyway. Either way, this is not correct behavior of the Console.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-15.el7scon.noarch
ceph-installer-1.0.11-1.el7scon.noarch
rhscon-ceph-0.0.20-1.el7scon.x86_64
rhscon-core-0.0.21-1.el7scon.x86_64
rhscon-ui-0.0.34-1.el7scon.noarch

How reproducible:
100%

Steps to Reproduce:
1. Try to create a cluster so that the creation task times out.
2. Check whether the Ceph cluster has been created anyway.

Actual results:
The cluster creation task status in the Console is inconsistent with the actual Ceph cluster status. The user cannot be sure whether the Ceph cluster was created according to the inputs given in the cluster creation wizard in the Console.

Expected results:
The cluster creation task status in the Console is consistent with the actual Ceph cluster status.
It sounds like 1) the root cause needs to be addressed here to determine why the Console loses contact with the installation process, and 2) there needs to be some way for a cluster creation definition to be "recovered" if it is lost. In other words, the Console should go out, verify the state of the nodes it was attempting to install on, and correct any errors. It is my understanding that ceph-ansible (and ansible in general) is idempotent, meaning it can be safely run again without doing harm: if it finds items already completed, they do not matter. So my suggestion is to look into ways to simply repeat the ansible operations on a node or nodes if we lose contact (see the sketch after this comment). This may or may not result in a resolution, but it should be tried first, and then fail with information if a second try also times out. It may be that we are simply hitting timeouts while the operations are succeeding after the timeout.
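For illustration only, here is a minimal sketch (not the actual Console/ceph-installer code; the helper name and arguments are hypothetical) of what "repeat the ansible operations if we lose contact" could look like, relying on ansible's idempotency so that a second run skips already-completed work:

    import subprocess

    def run_playbook_with_retry(playbook, inventory, timeout_s=900, retries=1):
        """Run ansible-playbook; retry once if the first attempt times out."""
        for attempt in range(retries + 1):
            try:
                subprocess.run(
                    ["ansible-playbook", "-i", inventory, playbook],
                    check=True,
                    timeout=timeout_s,
                )
                return True
            except subprocess.TimeoutExpired:
                # The run timed out; since ceph-ansible is idempotent it is
                # safe to simply run the same playbook again.
                continue
            except subprocess.CalledProcessError:
                # The playbook itself failed; retrying will not help here.
                return False
        # All attempts timed out: fail with info so the user can investigate.
        return False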
1. There are situations where southbound API requests hang for various reasons. Currently USM doesn't handle this effectively; USM should detect these cases and time out requests that have been hanging for a long time.
2. Detecting the current state and repairing the nodes might not be a good option at this point. For example, the disks might already have been partitioned and the configuration might already have been applied, and cleaning all of this up is arguably not USM's job. If the southbound API can provide this functionality, USM can implement the business logic around it, but that probably requires more work and may not be feasible for 2.0.
3. Re-attempting the task is another option: USM would cache the request and replay it when requested by the user. I assume that, as Jeff mentioned, ceph-ansible is idempotent. This also requires a good amount of work, with changes both in the UI and in the core.
So is it a good solution to implement option 1 at this point and think about the others post 2.0?
Yes, let's proceed with 1) as suggested.
As a fix for this issue, we have introduced a timeout for ceph-installer calls: any call that does not respond within 15 minutes is timed out. That particular call is then treated as a failure and we proceed with the next requests.
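To make the behavior concrete, here is a minimal sketch of the approach (in Python for brevity; this is not the actual Console code, and the endpoint and helper name are hypothetical): an HTTP call to the ceph-installer API that is aborted and treated as a failed task if no response arrives within 15 minutes:

    import requests

    CEPH_INSTALLER_TIMEOUT = 15 * 60  # seconds; matches the 15-minute limit above

    def call_ceph_installer(url, payload):
        """POST to a ceph-installer endpoint, treating a hang as a failure."""
        try:
            # requests' timeout covers connect and read inactivity, which is
            # enough to catch a call that hangs without ever responding.
            resp = requests.post(url, json=payload, timeout=CEPH_INSTALLER_TIMEOUT)
            resp.raise_for_status()
            return True
        except requests.RequestException:
            # Timed out or failed: mark this call as failed and move on
            # to the next request.
            return False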
I still see this issue, as you can see in the attached screenshot. How can I tell whether this is case 2) or 3) from comment 4, which will not be fixed in 2.0?
Created attachment 1184668 [details] failure and success
Created attachment 1184673 [details] log from monitor2
The QE team will be checking whether this issue happens again during the whole testing phase. If the issue does not appear, we are going to close it, assuming the fix in the sense of option 1 from comment 4.
Based on comment 11, I'm moving this BZ into VERIFIED state, since the QE team hasn't seen this issue again during our testing phase. The fact that the QE team hasn't reproduced this issue again most likely means that the changes made by the dev team made this particular problem either impossible or less likely to happen. That said, the general root cause is not fixed, as noted in comment 4 and as demonstrated by the recent BZ 1364547, which means that a similar issue could still happen.