Bug 1356016
Summary:          Host initialization task strange fail
Product:          [Red Hat Storage] Red Hat Storage Console
Component:        unclassified
Version:          2
Status:           CLOSED ERRATA
Severity:         unspecified
Priority:         unspecified
Reporter:         Lubos Trilety <ltrilety>
Assignee:         gowtham <gshanmug>
QA Contact:       Martin Bukatovic <mbukatov>
CC:               ltrilety, mbukatov, mkudlej, nthomas, sankarshan, vsarmila
Keywords:         Reopened
Target Milestone: ---
Target Release:   2
Hardware:         Unspecified
OS:               Unspecified
Fixed In Version: rhscon-core-0.0.37-1.el7scon
Doc Type:         If docs needed, set a value
Type:             Bug
Last Closed:      2016-08-23 19:56:46 UTC
Bug Blocks:       1353450
Description
Lubos Trilety 2016-07-13 09:36:10 UTC

Comment from Nishanth Thomas (comment #1):

Need the below information:

1. Is there a time-out? How much time does it take to complete the init of all the 20 nodes you mentioned?
2. Does it get all the required information from the node, like disks, network interfaces, etc.?

I have tried with 20 nodes, but acceptance and initialization are done perfectly. The tasks are also in success state. So give me more information to reproduce the bug.

Reply from Lubos Trilety:

(In reply to Nishanth Thomas from comment #1)
> 1. Is there a time-out? how much time it takes to complete the init of all
> the 20 nodes as you mentioned

It was not 20 nodes, just 16. Anyway, it was less than 10 minutes, but I don't know the exact time, as I didn't measure it.

> 2. Does it get all the required information from the Node like discs,
> network if etc?

Yes, everything seems to be correct. The hosts could be used for creating a cluster, and the cluster is working. I'll try to reproduce it again on my setup.

Retested on:

rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch

I was not able to reproduce it again, so I am closing this BZ. If it happens again, I will reopen it.

I just noticed this issue again with the latest build.

Based on Lubos's note from comment 4, I'm reopening the BZ.

Version-Release
===============

On the RHSC 2.0 server machine:

rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch

On the Ceph storage machines:

rhscon-core-selinux-0.0.36-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

Cluster configuration
=====================

I hit the issue on an OS1 (OpenStack) cluster, which has:

* 1 RHSC 2.0 server machine
* 3 Ceph monitor machines
* 4 Ceph OSD machines

Since the cluster had just 7 machines, it seems that a large number of hosts is not needed to hit this issue.
Steps to Reproduce
==================

I'm not sure about the reproducer here. All I did was deploy the cluster and run "Accept All".

Details
=======

It seems as if 3 Initialization tasks failed. On the "View and Accept Hosts" page, RHSC 2.0 reports this failure for 3 machines. E.g. see the line for one of the affected machines:

~~~
mbukatov-usm1-mon2.os1.phx2.redhat.com e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15 Initialization Failed: Jul 25, 2016 3:38:18 PM
~~~

And the task details for its "Initialize Node" task:

~~~
1 started the task for InitializeNode: c876523c-9b08-4d57-9cd7-6642412a5374 Jul 25 2016, 03:36:42 PM
2 Force Stop. Task: c876523c-9b08-4d57-9cd7-6642412a5374 explicitly stopped. Jul 25 2016, 03:38:18 PM
~~~

This looks as if the task timed out after about 2 minutes. One can compare that with a successful "Initialize Node" task:

~~~
1 started the task for InitializeNode: b48124f3-8f77-431b-ad5b-ae9f45aa94d9 Jul 25 2016, 03:36:43 PM
2 Success Jul 25 2016, 03:38:50 PM
~~~

and notice the odd fact that the successful task took a bit longer than the failed one. Anyway, when I check the Hosts page, all machines are reported as "ok" in the same way, regardless of whether the "Initialize Node" task failed for the node.
Moreover, checking the Salt status from the RHSC 2.0 machine, it seems that all machines are properly accepted by the salt master:

~~~
# salt-key --finger-all
Local Keys:
master.pem: 42:98:07:52:ac:21:26:62:5f:c3:e9:6c:98:33:5a:10
master.pub: cb:d3:aa:c5:04:b3:de:ad:db:f0:c2:10:e0:f0:db:94
Accepted Keys:
mbukatov-usm1-mon1.os1.phx2.redhat.com: d0:5f:07:92:cd:c2:c2:e2:59:0b:00:89:39:ec:43:50
mbukatov-usm1-node3.os1.phx2.redhat.com: 2d:08:29:fd:b5:00:fc:cf:06:89:db:ad:64:8b:4b:02
mbukatov-usm1-mon2.os1.phx2.redhat.com: e5:2e:a3:5c:79:c1:d6:98:7b:4f:90:e4:ff:1d:b5:15
mbukatov-usm1-node1.os1.phx2.redhat.com: f4:30:52:0b:3f:18:43:95:b6:2f:06:b5:57:6f:4c:15
mbukatov-usm1-mon3.os1.phx2.redhat.com: 29:56:9a:50:bd:b4:21:cf:f9:4c:74:97:14:3d:fd:9b
mbukatov-usm1-node4.os1.phx2.redhat.com: 8a:b2:e6:8c:a8:56:fb:a9:bd:07:4d:51:f0:c6:a9:fe
mbukatov-usm1-node2.os1.phx2.redhat.com: 67:b1:76:47:fb:b3:47:71:d8:92:3d:8d:6d:0a:26:21
~~~

And indeed, when I hit the "Re-Initialize" button on one of the affected nodes, nothing happens, and I can find the following line in the logs:

~~~
2016-07-25T16:42:54.521+02:00 ERROR nodes.go:1553 POST_Actions] admin:94575190-d0ce-46c0-9173-38205553b2c2-Node mbukatov-usm1-mon2.os1.phx2.redhat.com is not in failed state
~~~

Based on this evidence, I think it's safe to assume that I just hit the same BZ.

Created attachment 1183849 [details]
screenshot 1: View and Accept Hosts page
Created attachment 1183850 [details]
screenshot 2: Initialize Node task details
Created attachment 1183851 [details]
screenshot 3: Hosts list page
The summary of the issue seems to be that, for some unknown reason, an "Initialize Node" task can sometimes fail even though the action actually completed.

(In reply to Nishanth Thomas from comment #1)
> 1. Is there a time-out?

While it looks like a timeout, I'm not sure, because this is not clear from the RHSC 2.0 error messages as reported by the web UI. How would I tell that for sure?

> how much time it takes to complete the init of all
> the 20 nodes as you mentioned

Based on the task details, the Initialize Node task took about 2 minutes.

> 2. Does it get all the required information from the Node like discs,
> network if etc?

Check the tarball with the full config and log dump (see comment 10).

The QE team will be checking whether this issue happens again during the whole testing phase. If the issue does not appear, we are going to close it.

Based on comment 13, I'm moving this BZ into VERIFIED state, since the QE team has not seen this issue again during our testing phase. The fact that the QE team has not reproduced this issue again most likely means that changes made by the dev team rendered this particular problem either impossible or less likely to happen. That said, the general root cause is not fixed, as noted in BZ 1341525 comment 4 and as demonstrated by the recent BZ 1364547, which means that similar issues could still happen.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754