Bug 1356016

Summary: Host initialization task strange fail
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Lubos Trilety <ltrilety>
Component: unclassified
Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA
QA Contact: Martin Bukatovic <mbukatov>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 2
CC: ltrilety, mbukatov, mkudlej, nthomas, sankarshan, vsarmila
Target Milestone: ---
Keywords: Reopened
Target Release: 2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhscon-core-0.0.37-1.el7scon
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 19:56:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1353450

Attachments:
screenshot 1: View and Accept Hosts page (flags: none)
screenshot 2: Initialize Node task details (flags: none)
screenshot 3: Hosts list page (flags: none)

Description Lubos Trilety 2016-07-13 09:36:10 UTC
Description of problem:
Sometimes the initialization of a host looks like it failed. The initialization task ends with a failure:
Force Stop. Task: 6ad61af9-c2ab-4939-aeeb-b1cc08e4356c explicitly stopped.
However, the host looks fine: it is in the OK state and can be used during cluster creation.
Because of that, re-initialization cannot be run from the Hosts list at all. If the user tries to run it from the page where all hosts are accepted, it never starts, because the host is not in a failed state:
# grep <host_name> /var/log/skyring/skyring.log
...
ERROR    nodes.go:1541 POST_Actions] admin:0d9902b0-f83d-4d1a-b6a4-5103bc818156-Node <host_name> is not in failed state
...
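
Despite the failure report, one can verify that the host actually got initialized, e.g. by checking that its salt key was accepted on the RHSC server (standard salt-key usage, offered as a suggested check rather than a step from this report):

~~~
# on the RHSC 2.0 server: an initialized host shows up among the accepted keys
# even though its InitializeNode task was force stopped
salt-key -l accepted | grep <host_name>
~~~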

Version-Release number of selected component (if applicable):
rhscon-ui-0.0.46-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-ceph-0.0.31-1.el7scon.x86_64

How reproducible:
70%

Steps to Reproduce:
1. Have a larger number of hosts to be accepted
2. Try to accept all of them together

Actual results:
The initialization task reports failure for some of the hosts; however, they can be used during cluster creation and they work as they should.

Expected results:
The initialization task reports failure only when it really does not succeed.

Additional info:

Comment 1 Nishanth Thomas 2016-07-13 19:07:55 UTC
I need the following information:

1. Is there a time-out? How much time does it take to complete the init of all 20 nodes, as you mentioned?

2. Does it get all the required information from the node, like disks, network interfaces, etc.?

Comment 2 gowtham 2016-07-15 06:25:11 UTC
I have tried with 20 nodes, but acceptance and initialization completed perfectly; the tasks were all in success state. Please give me more information so I can reproduce the bug.

Comment 3 Lubos Trilety 2016-07-15 09:02:18 UTC
(In reply to Nishanth Thomas from comment #1)
> I need the following information:
> 
> 1. Is there a time-out? How much time does it take to complete the init of
> all 20 nodes, as you mentioned?
>
 
It was not 20 nodes, just 16. Anyway, it took less than 10 minutes, but I don't know the exact time, as I didn't measure it.

> 2. Does it get all the required information from the node, like disks,
> network interfaces, etc.?

Yes, everything seems to be correct. The hosts could be used for creating a cluster, and the cluster is working.

I'll try to reproduce it again on my setup.

Comment 4 Lubos Trilety 2016-07-15 14:36:17 UTC
Retested on:
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch

I was not able to reproduce it again, so I am closing this BZ; if it happens again, I will reopen it.

Comment 5 Martin Bukatovic 2016-07-25 15:31:43 UTC
I just noticed this issue again with the latest build.

Based on Lubos's note from comment 4, I'm reopening the BZ.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch

On Ceph storage machines:

rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

Cluster configuration
=====================

I hit the issue on an OS1 (OpenStack) cluster, which has:

* 1 RHSC 2.0 server machine
* 3 ceph monitor machines
* 4 ceph osd machines

Since the cluster had just 7 machines, it seems that a large number of hosts
is not needed to hit this issue.

Steps to Reproduce
==================

I'm not sure about the reproducer here.

All I did was deploy the cluster and run "Accept All".

Details
=======

It seems as if 3 initialization tasks failed. On the "View and Accept Hosts"
page, RHSC 2.0 reports this failure for 3 machines.

E.g., see the line for one of the affected machines:

~~~
mbukatov-usm1-mon2.os1.phx2.redhat.com
e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15
Initialization Failed: Jul 25, 2016 3:38:18 PM
~~~

And the task details for its "Initialize Node" task:

~~~
1 	started the task for InitializeNode: c876523c-9b08-4d57-9cd7-6642412a5374 	Jul 25 2016, 03:36:42 PM
2 	Force Stop. Task: c876523c-9b08-4d57-9cd7-6642412a5374 explicitly stopped. 	Jul 25 2016, 03:38:18 PM
~~~

This looks as if the task timed out, roughly 1.5 minutes (96 seconds) after it started.

One can compare that with the successful "Initialize Node" task:

~~~
1 	started the task for InitializeNode: b48124f3-8f77-431b-ad5b-ae9f45aa94d9 	Jul 25 2016, 03:36:43 PM
2 	Success 	Jul 25 2016, 03:38:50 PM
~~~

and notice the odd fact that the successful task took a bit longer than the
failed one (the durations can be verified from the timestamps, as sketched below).
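
Plain GNU date arithmetic over the logged timestamps (a convenience sketch, assuming GNU coreutils date):

~~~
# failed task:     03:36:42 PM -> 03:38:18 PM
echo $(( $(date -d '15:38:18' +%s) - $(date -d '15:36:42' +%s) ))  # 96 seconds
# successful task: 03:36:43 PM -> 03:38:50 PM
echo $(( $(date -d '15:38:50' +%s) - $(date -d '15:36:43' +%s) ))  # 127 seconds
~~~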

Anyway, when I check the Hosts page, all machines are reported as "ok" in the
same way, no matter whether the "Initialize Node" task failed for the node.

Moreover, checking the salt status from the RHSC 2.0 machine, it seems that all
machines are properly accepted by the salt master:

~~~
# salt-key --finger-all
Local Keys:
master.pem:  42:98:07:52:ac:21:26:62:5f:c3:e9:6c:98:33:5a:10
master.pub:  cb:d3:aa:c5:04:b3:de:ad:db:f0:c2:10:e0:f0:db:94
Accepted Keys:
mbukatov-usm1-mon1.os1.phx2.redhat.com:  d0:5f:07:92:cd:c2:c2:e2:59:0b:00:89:39:ec:43:50
mbukatov-usm1-node3.os1.phx2.redhat.com:  2d:08:29:fd:b5:00:fc:cf:06:89:db:ad:64:8b:4b:02
mbukatov-usm1-mon2.os1.phx2.redhat.com:  e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15
mbukatov-usm1-node1.os1.phx2.redhat.com:  f4:30:52:0b:3f:18:43:95:b6:2f:06:b5:57:6f:4c:15
mbukatov-usm1-mon3.os1.phx2.redhat.com:  29:56:9a:50:bd:b4:21:cf:f9:4c:74:97:14:3d:fd:9b
mbukatov-usm1-node4.os1.phx2.redhat.com:  8a:b2:e6:8c:a8:56:fb:a9:bd:07:4d:51:f0:c6:a9:fe
mbukatov-usm1-node2.os1.phx2.redhat.com:  67:b1:76:47:fb:b3:47:71:d8:92:3d:8d:6d:0a:26:21
~~~
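
To double-check that the allegedly failed nodes are functional minions, one could also ping them via salt (a standard salt command, offered as a suggestion; it is not a step from this report):

~~~
# an initialized node should answer with True
salt 'mbukatov-usm1-mon2.os1.phx2.redhat.com' test.ping
~~~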

And indeed, when I hit the "Re-Initialize" button on one of the affected nodes,
nothing happens, and I can find the following line in the logs:

~~~
2016-07-25T16:42:54.521+02:00 ERROR    nodes.go:1553 POST_Actions] admin:94575190-d0ce-46c0-9173-38205553b2c2-Node mbukatov-usm1-mon2.os1.phx2.redhat.com is not in failed state
~~~

Based on this evidence, I think it's safe to assume that I just hit the same
BZ.

Comment 6 Martin Bukatovic 2016-07-25 15:32:29 UTC
Created attachment 1183849 [details]
screenshot 1: View and Accept Hosts page

Comment 7 Martin Bukatovic 2016-07-25 15:33:08 UTC
Created attachment 1183850 [details]
screenshot 2: Initialize Node task details

Comment 8 Martin Bukatovic 2016-07-25 15:33:31 UTC
Created attachment 1183851 [details]
screenshot 3: Hosts list page

Comment 9 Martin Bukatovic 2016-07-25 15:35:13 UTC
The summary of the issue seems to be that, for some unknown reason, the
"Initialize Node" task can sometimes fail even though the action actually
completed.

Comment 11 Martin Bukatovic 2016-07-25 15:46:19 UTC
(In reply to Nishanth Thomas from comment #1)
> I need the following information:
> 
> 1. Is there a time-out?

While it looks like a timeout, I'm not sure, because this is not clear from the
RHSC 2.0 error messages as reported by the web UI. How would I tell that for
sure?
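
One generic way to look for timeout evidence would be a plain grep over the skyring log (just standard grep, not a documented skyring diagnostic):

~~~
# look for timeout or force-stop messages around the time of the task failure
grep -iE 'timeout|force stop' /var/log/skyring/skyring.log
~~~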

> How much time does it take to complete the init of
> all 20 nodes, as you mentioned?

Based on the task details, the "Initialize Node" task took about 2 minutes.

> 2. Does it get all the required information from the node, like disks,
> network interfaces, etc.?

Check the tarball with full config and log dump (see comment 10).

Comment 13 Martin Bukatovic 2016-07-29 13:38:41 UTC
The QE team will be checking whether this issue ever happens again during the
whole testing phase. If the issue does not appear, we are going to close it.

Comment 14 Martin Bukatovic 2016-08-08 13:50:36 UTC
Based on comment 13, I'm moving this BZ into the VERIFIED state, since the QE
team hasn't seen this issue again during our testing phase.

The fact that the QE team hasn't reproduced this issue again most likely means
that the changes made by the dev team rendered this particular problem either
impossible or less likely to happen. That said, the general root cause is not
fixed, as noted in BZ 1341525 comment 4 and as demonstrated by the recent
BZ 1364547, which means that similar issues could still happen.

Comment 16 errata-xmlrpc 2016-08-23 19:56:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754