Bug 1356016 - Host initialization task strange fail
Summary: Host initialization task strange fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat
Component: unclassified
Version: 2
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: gowtham
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks: Console-2-GA
 
Reported: 2016-07-13 09:36 UTC by Lubos Trilety
Modified: 2016-08-23 19:56 UTC (History)
6 users

Fixed In Version: rhscon-core-0.0.37-1.el7scon
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:56:46 UTC
Target Upstream Version:


Attachments (Terms of Use)
screenshot 1: View and Accept Hosts page (112.34 KB, image/png)
2016-07-25 15:32 UTC, Martin Bukatovic
screenshot 2: Initialize Node task details (124.04 KB, image/png)
2016-07-25 15:33 UTC, Martin Bukatovic
screenshot 3: Hosts list page (78.64 KB, image/png)
2016-07-25 15:33 UTC, Martin Bukatovic


Links
System ID Priority Status Summary Last Updated
Gerrithub.io 285267 None None None 2016-07-28 01:49:37 UTC
Red Hat Bugzilla 1341525 None None None Never
Red Hat Bugzilla 1364547 None None None Never
Red Hat Product Errata RHEA-2016:1754 normal SHIPPED_LIVE New packages: Red Hat Storage Console 2.0 2017-04-18 19:09:06 UTC

Internal Links: 1341525 1364547

Description Lubos Trilety 2016-07-13 09:36:10 UTC
Description of problem:
Sometimes the initialization of a host looks like it failed. The initialization task ends with a failure:
Force Stop. Task: 6ad61af9-c2ab-4939-aeeb-b1cc08e4356c explicitly stopped.
However, the host looks fine: it is in OK state and can be used during cluster creation.
As a result, re-initialize cannot be run from the Hosts list at all. If the user tries to run it from the page where all hosts are accepted, it never starts, because the host is not in a failed state.
# grep <host_name> /var/log/skyring/skyring.log
...
ERROR    nodes.go:1541 POST_Actions] admin:0d9902b0-f83d-4d1a-b6a4-5103bc818156-Node <host_name> is not in failed state
...
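The "is not in failed state" error above suggests a simple state guard on the re-initialize action. The following Go sketch is a hypothetical illustration of that guard; the type and function names are invented for this example and are not the actual skyring source:

```go
// Hypothetical sketch of the guard behind the "is not in failed state"
// error; names are illustrative, not taken from the skyring codebase.
package main

import "fmt"

type NodeState int

const (
	NodeStateOK NodeState = iota
	NodeStateFailed
)

type Node struct {
	Name  string
	State NodeState
}

// reinitializeNode refuses to re-run initialization unless the node is
// recorded as failed.
func reinitializeNode(n Node) error {
	if n.State != NodeStateFailed {
		return fmt.Errorf("Node %s is not in failed state", n.Name)
	}
	// ... trigger initialization again here ...
	return nil
}

func main() {
	// The failed "Initialize Node" task leaves the node in OK state,
	// so the guard rejects the re-initialize request.
	n := Node{Name: "host1.example.com", State: NodeStateOK}
	if err := reinitializeNode(n); err != nil {
		fmt.Println("ERROR", err)
	}
}
```

This would explain the observed dead end: the task is reported as failed, but the node's recorded state is OK, so the re-initialize guard rejects the request.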

Version-Release number of selected component (if applicable):
rhscon-ui-0.0.46-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-ceph-0.0.31-1.el7scon.x86_64

How reproducible:
70%

Steps to Reproduce:
1. Have a larger number of hosts to be accepted
2. Try to accept all of them together
3.

Actual results:
The initialization task reports failure for some of the hosts; however, they can be used during cluster creation and work as they should.

Expected results:
The initialization task reports failure only when it actually does not succeed.

Additional info:

Comment 1 Nishanth Thomas 2016-07-13 19:07:55 UTC
Need the information below:

1. Is there a time-out? How much time does it take to complete the init of all the 20 nodes you mentioned?

2. Does it get all the required information from the node, like disks, network interfaces, etc.?

Comment 2 gowtham 2016-07-15 06:25:11 UTC
I have tried with 20 nodes, but acceptance and initialization completed perfectly, and the tasks ended in the success state. Please give me more information to reproduce the bug.

Comment 3 Lubos Trilety 2016-07-15 09:02:18 UTC
(In reply to Nishanth Thomas from comment #1)
> Need the below information:
> 
> 1. Is there a time-out? How much time does it take to complete the init of
> all the 20 nodes you mentioned?
>
 
It was not 20 nodes, just 16. Anyway, it was less than 10 minutes; I don't know the exact time, as I didn't measure it.

> 2. Does it get all the required information from the node, like disks,
> network interfaces, etc.?

Yes, everything seems to be correct. The hosts could be used for creating a cluster. And the cluster is working.

I'll try to reproduce it again on my setup.

Comment 4 Lubos Trilety 2016-07-15 14:36:17 UTC
Retested on:
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch

I was not able to reproduce it again. So I am closing this BZ, if it happens again I will reopen it.

Comment 5 Martin Bukatovic 2016-07-25 15:31:43 UTC
I just noticed this issue again with the latest build.

Based on Lubos's note from comment 4, I'm reopening the BZ.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch

On Ceph storage machines:

rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

Cluster configuration
=====================

I hit the issue on an OS1 (OpenStack) cluster, which has:

* 1 RHSC 2.0 server machine
* 3 ceph monitor machines
* 4 ceph osd machines

Since the cluster had just 7 machines, it seems that a large number of hosts
is not needed to hit this issue.

Steps to Reproduce
==================

I'm not sure about the reproducer here.

All I did was to deploy the cluster and run "Accept All".

Details
=======

It seems as if 3 initialization tasks failed. On the "View and Accept Hosts" page,
RHSC 2.0 reports this failure for 3 machines.

E.g., see the line for one of the affected machines:

~~~
mbukatov-usm1-mon2.os1.phx2.redhat.com
e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15
Initialization Failed: Jul 25, 2016 3:38:18 PM
~~~

And the Task details for its "Initialize Node" task:

~~~
1 	started the task for InitializeNode: c876523c-9b08-4d57-9cd7-6642412a5374 	Jul 25 2016, 03:36:42 PM
2 	Force Stop. Task: c876523c-9b08-4d57-9cd7-6642412a5374 explicitly stopped. 	Jul 25 2016, 03:38:18 PM
~~~

This looks as if the task timed out after 2 minutes.

One can compare that with the successful "Initialize Node" task:

~~~
1 	started the task for InitializeNode: b48124f3-8f77-431b-ad5b-ae9f45aa94d9 	Jul 25 2016, 03:36:43 PM
2 	Success 	Jul 25 2016, 03:38:50 PM
~~~

and note the odd fact that the successful task took a bit longer than
the failed one.

Anyway, when I check the Hosts page, all machines are reported as "ok" in the
same way, regardless of whether the "Initialize Node" task failed for the node.

Moreover, checking the salt status on the RHSC 2.0 machine shows that all
machines are properly accepted by the salt master:

~~~
# salt-key --finger-all
Local Keys:
master.pem:  42:98:07:52:ac:21:26:62:5f:c3:e9:6c:98:33:5a:10
master.pub:  cb:d3:aa:c5:04:b3:de:ad:db:f0:c2:10:e0:f0:db:94
Accepted Keys:
mbukatov-usm1-mon1.os1.phx2.redhat.com:  d0:5f:07:92:cd:c2:c2:e2:59:0b:00:89:39:ec:43:50
mbukatov-usm1-node3.os1.phx2.redhat.com:  2d:08:29:fd:b5:00:fc:cf:06:89:db:ad:64:8b:4b:02
mbukatov-usm1-mon2.os1.phx2.redhat.com:  e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15
mbukatov-usm1-node1.os1.phx2.redhat.com:  f4:30:52:0b:3f:18:43:95:b6:2f:06:b5:57:6f:4c:15
mbukatov-usm1-mon3.os1.phx2.redhat.com:  29:56:9a:50:bd:b4:21:cf:f9:4c:74:97:14:3d:fd:9b
mbukatov-usm1-node4.os1.phx2.redhat.com:  8a:b2:e6:8c:a8:56:fb:a9:bd:07:4d:51:f0:c6:a9:fe
mbukatov-usm1-node2.os1.phx2.redhat.com:  67:b1:76:47:fb:b3:47:71:d8:92:3d:8d:6d:0a:26:21
~~~

And indeed, when I hit the "Re-Initialize" button on one of the affected nodes,
nothing happens and I can find the following line in the logs:

~~~
2016-07-25T16:42:54.521+02:00 ERROR    nodes.go:1553 POST_Actions] admin:94575190-d0ce-46c0-9173-38205553b2c2-Node mbukatov-usm1-mon2.os1.phx2.redhat.com is not in failed state
~~~

Based on this evidence, I think it's safe to assume that I just hit the same
BZ.

Comment 6 Martin Bukatovic 2016-07-25 15:32:29 UTC
Created attachment 1183849 [details]
screenshot 1: View and Accept Hosts page

Comment 7 Martin Bukatovic 2016-07-25 15:33:08 UTC
Created attachment 1183850 [details]
screenshot 2: Initialize Node task details

Comment 8 Martin Bukatovic 2016-07-25 15:33:31 UTC
Created attachment 1183851 [details]
screenshot 3: Hosts list page

Comment 9 Martin Bukatovic 2016-07-25 15:35:13 UTC
The summary of the issue seems to be that for some unknown reason, an
"Initialize Node" task can sometimes fail even though the action actually
completed.

Comment 11 Martin Bukatovic 2016-07-25 15:46:19 UTC
(In reply to Nishanth Thomas from comment #1)
> Need the below information:
> 
> 1. Is there a time-out?

While it looks like a timeout, I'm not sure, because this is not clear from the
RHSC 2.0 error messages as reported by the web UI. How would I tell
that for sure?

> How much time does it take to complete the init of all
> the 20 nodes you mentioned?

Based on the Task details, the Initialize Node task took about 2 minutes.

> 2. Does it get all the required information from the node, like disks,
> network interfaces, etc.?

Check the tarball with full config and log dump (see comment 10).

Comment 13 Martin Bukatovic 2016-07-29 13:38:41 UTC
The QE team will be checking whether this issue happens again during the whole
testing phase. If the issue does not appear, we are going to close it.

Comment 14 Martin Bukatovic 2016-08-08 13:50:36 UTC
Based on comment 13, I'm moving this BZ into VERIFIED state, since the QE team
hasn't seen this issue again during our testing phase.

The fact that the QE team hasn't reproduced this issue again most likely means
that the changes made by the dev team made this particular problem either
impossible or less likely to happen. That said, the general root cause is not
fixed, as noted in BZ 1341525 comment 4 and as demonstrated by the recent
BZ 1364547, which means that similar issues could still happen.

Comment 16 errata-xmlrpc 2016-08-23 19:56:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754

