Bug 1460197 - Missing nodes in create cluster form [NEEDINFO]
Status: MODIFIED
Product: Red Hat Storage Console
Classification: Red Hat
Component: Dashboard
Version: 3
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: alpha
Target Release: 3-alpha
Assigned To: Ankush Behl
QA Contact: sds-qe-bugs
Depends On:
Blocks: 1457278
Reported: 2017-06-09 07:01 EDT by Filip Balák
Modified: 2017-06-28 05:44 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mbukatov: needinfo? (gshanmug)


Attachments
Response from GetNodeList API call (240.81 KB, text/plain)
2017-06-09 07:01 EDT, Filip Balák
/var/log/messages from one of the missing nodes (503.42 KB, text/plain)
2017-06-09 07:03 EDT, Filip Balák
how it looks in UI (96.53 KB, image/png)
2017-06-09 07:05 EDT, Filip Balák
Empty list (updated) (49.29 KB, image/png)
2017-06-12 10:05 EDT, Filip Balák
Response from GetNodeList API call (updated) (228.11 KB, text/plain)
2017-06-12 10:06 EDT, Filip Balák
Described responses from /nodes etcd key part1 (4.96 MB, text/plain)
2017-06-16 06:33 EDT, Filip Balák
Described responses from /nodes etcd key part2 (2.00 MB, text/plain)
2017-06-16 06:34 EDT, Filip Balák

Description Filip Balák 2017-06-09 07:01:36 EDT
Created attachment 1286375 [details]
Response from GetNodeList API call

Description of problem:
I have prepared 11 nodes from which I can create clusters. In the web UI, when I try to create a cluster, 3 of them are not shown, although they are listed in the response from GetNodeList (attached):
usm1-gl4.usmqe.lab.eng.brq.redhat.com
usm1-mon2.usmqe.lab.eng.brq.redhat.com
usm1-mon3.usmqe.lab.eng.brq.redhat.com

Node-agent is running on these machines
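
One way to pin down the mismatch is to compare the FQDNs in the GetNodeList response with those visible in the form. A minimal sketch, assuming the names have already been extracted into two plain lists; the function name and the `usm1-gl1` sample host are hypothetical, and the layout of the real response is not assumed here:

```python
def missing_from_ui(api_nodes, ui_nodes):
    """Return FQDNs present in the API response but absent from the UI, sorted."""
    return sorted(set(api_nodes) - set(ui_nodes))


# Sample input: the three hosts reported missing plus one hypothetical visible host.
api_nodes = [
    "usm1-gl4.usmqe.lab.eng.brq.redhat.com",
    "usm1-mon2.usmqe.lab.eng.brq.redhat.com",
    "usm1-mon3.usmqe.lab.eng.brq.redhat.com",
    "usm1-gl1.usmqe.lab.eng.brq.redhat.com",  # hypothetical host shown in the UI
]
ui_nodes = ["usm1-gl1.usmqe.lab.eng.brq.redhat.com"]

print(missing_from_ui(api_nodes, ui_nodes))
```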

Version-Release number of selected component (if applicable):
tendrl-alerting-3.0-alpha.3.el7scon.noarch
tendrl-api-3.0-alpha.4.el7scon.noarch
tendrl-api-doc-3.0-alpha.4.el7scon.noarch
tendrl-api-httpd-3.0-alpha.4.el7scon.noarch
tendrl-commons-3.0-alpha.8.el7scon.noarch
tendrl-dashboard-3.0-alpha.4.el7scon.noarch
tendrl-node-agent-3.0-alpha.8.el7scon.noarch
tendrl-node-monitoring-3.0-alpha.4.el7scon.noarch
tendrl-performance-monitoring-3.0-alpha.6.el7scon.noarch


How reproducible:
50%

Steps to Reproduce:
1. Prepare Tendrl with 11 nodes without SDS.
2. In the dashboard, open the create cluster form.
3. Check the list of nodes.

Actual results:
Some nodes are missing.

Expected results:
All nodes should be listed.

Additional info:
Comment 3 Filip Balák 2017-06-09 07:03 EDT
Created attachment 1286377 [details]
/var/log/messages from one of the missing nodes
Comment 4 Filip Balák 2017-06-09 07:05 EDT
Created attachment 1286378 [details]
how it looks in UI
Comment 5 Filip Balák 2017-06-12 10:03:15 EDT
Now it got worse and I can't see any hosts in the list. It happened to me with 2 installations. Right after installation I restarted all hosts. Attaching a new screenshot and the response from GetNodeList.

Tested with:
tendrl-alerting-3.0-alpha.3.el7scon.noarch
tendrl-api-3.0-alpha.4.el7scon.noarch
tendrl-api-doc-3.0-alpha.4.el7scon.noarch
tendrl-api-httpd-3.0-alpha.4.el7scon.noarch
tendrl-commons-3.0-alpha.9.el7scon.noarch
tendrl-dashboard-3.0-alpha.4.el7scon.noarch
tendrl-node-agent-3.0-alpha.9.el7scon.noarch
tendrl-node-monitoring-3.0-alpha.5.el7scon.noarch
tendrl-performance-monitoring-3.0-alpha.7.el7scon.noarch
Comment 6 Filip Balák 2017-06-12 10:05 EDT
Created attachment 1287057 [details]
Empty list (updated)
Comment 7 Filip Balák 2017-06-12 10:06 EDT
Created attachment 1287058 [details]
Response from GetNodeList API call (updated)
Comment 8 Nishanth Thomas 2017-06-13 01:23:07 EDT
So if I understand correctly, you are getting all the nodes in the response from the `GetNodeList` API, but in the UI some or all of them are missing?
Comment 10 Filip Balák 2017-06-13 02:33:53 EDT
Yes, the attachments contain the responses from the `GetNodeList` API and screenshots of how it looks.
I will provide the setup. I have it in a snapshot, but the machines are now busy and in a different state. I will probably provide it later today.
Comment 11 gowtham 2017-06-13 09:10:12 EDT
I have set a TTL for the disk and network details in etcd. If node-agent is stopped for 5 or 10 minutes on all or a few machines, the disk and network details are deleted from the node. When you run node-agent again, all details are repopulated within a few minutes.

Did you stop node-agent on all or a few nodes before seeing this problem?
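
The TTL behaviour described above can be illustrated with a small in-memory model. This is only a sketch: the `TTLStore` class, the key name, and the 300-second TTL are assumptions for illustration, not Tendrl's or etcd's actual code. A key that is not refreshed before its TTL elapses disappears, which is what would happen to the disk and network details while node-agent is down.

```python
class TTLStore:
    """Toy key-value store with per-key expiry (hypothetical, loosely models etcd TTLs)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp in seconds)

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            # Expired or never written: drop it and report it as absent.
            self._data.pop(key, None)
            return None
        return entry[0]


store = TTLStore()
key = "/nodes/n1/Networks"  # hypothetical key name

# node-agent writes network details at t=0 with an assumed 300 s TTL
store.set(key, {"eth0": "up"}, ttl=300, now=0)
print(store.get(key, now=100))  # agent stopped, TTL not yet elapsed
print(store.get(key, now=301))  # TTL elapsed without a refresh

# node-agent is started again at t=400 and rewrites the details
store.set(key, {"eth0": "up"}, ttl=300, now=400)
print(store.get(key, now=450))
```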
Comment 13 Filip Balák 2017-06-13 09:40:51 EDT
I don't think I stopped node-agent, but I restarted the machines. I will try to reproduce it again. If I am able to, I will let you know; otherwise I will update the BZ.
Comment 14 Nishanth Thomas 2017-06-14 00:54:21 EDT
Please re-test this and let us know whether the issue is reproducible.
Comment 15 Ju Lim 2017-06-14 07:44:54 EDT
It has something to do with restarting the host.
Comment 16 Nishanth Thomas 2017-06-15 03:51:08 EDT
As per Filip, this is how he reproduced the issue:

`I installed a fresh Tendrl instance with 11 nodes without a cluster (prepared for cluster creation), shut down the machines, created a snapshot and started the machines. Right after the start the issue appears. Sometimes it starts working after a few minutes and sometimes after a few hours.`

Also, my understanding is that this occurs only once, on first boot. Subsequent restarts of node-agent will not reproduce this issue.
Comment 19 gowtham 2017-06-15 09:48:47 EDT
So the problem is that the UI builds the node list from the JSON returned by the API.
For example, if the first node in the JSON has network details and the second node does not, only the first node is listed. If the first three nodes have network details, only those three are listed. The list is sometimes empty because the three nodes that have network details can appear somewhere else in the JSON the next time: if they come last, nothing is displayed, because the first node in the JSON now has no network details.

Node-agent takes time to update disk and network details, so some nodes are populated slowly and others quickly. If you check after 5 minutes, all nodes are displayed in the UI.

This is the problem.
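
The listing behaviour described in this comment can be sketched as follows. This is an illustration only, not the dashboard's actual code: the function names and the `fqdn` and `networks` fields are hypothetical. The first function stops at the first node without network details, so the result depends on the order of nodes in the JSON; the second skips incomplete nodes instead and does not depend on where they appear.

```python
def list_until_first_incomplete(nodes):
    """Behaviour described above: stop at the first node lacking network details."""
    shown = []
    for node in nodes:
        if not node.get("networks"):
            break
        shown.append(node["fqdn"])
    return shown


def list_complete_nodes(nodes):
    """Alternative: skip incomplete nodes rather than stopping at the first one."""
    return [node["fqdn"] for node in nodes if node.get("networks")]


# Hypothetical sample: node-b has not been populated yet.
node_a = {"fqdn": "node-a", "networks": ["eth0"]}
node_b = {"fqdn": "node-b", "networks": []}
node_c = {"fqdn": "node-c", "networks": ["eth0"]}

for ordering in ([node_a, node_b, node_c], [node_b, node_a, node_c]):
    print(list_until_first_incomplete(ordering), list_complete_nodes(ordering))
```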
Comment 20 gowtham 2017-06-15 09:49:26 EDT
Filip, please check whether you still see this issue after 5 minutes.
Comment 21 Filip Balák 2017-06-15 09:59:43 EDT
Nodes are now displayed correctly, but I do not see a difference in the Networks part of each node in the GetNodeList response.
Comment 22 Filip Balák 2017-06-16 06:33 EDT
Created attachment 1288309 [details]
Described responses from /nodes etcd key part1
Comment 23 Filip Balák 2017-06-16 06:34 EDT
Created attachment 1288310 [details]
Described responses from /nodes etcd key part2
Comment 24 gowtham 2017-06-20 02:24:50 EDT
This problem was caused by a node sync problem. I think Rohan has solved this.
Comment 25 Neha Gupta 2017-06-20 05:00:13 EDT
@gshanmug@redhat.com Is it solved?
Comment 26 Daniel Horák 2017-06-21 06:27:20 EDT
I'm facing the same issue with the latest packages:
  tendrl-alerting-3.0-alpha.4.el7scon.noarch
  tendrl-api-3.0-alpha.5.el7scon.noarch
  tendrl-api-httpd-3.0-alpha.5.el7scon.noarch
  tendrl-commons-3.0-alpha.10.el7scon.noarch
  tendrl-dashboard-3.0-alpha.5.el7scon.noarch
  tendrl-node-agent-3.0-alpha.10.el7scon.noarch
  tendrl-performance-monitoring-3.0-alpha.8.el7scon.noarch

I was able to create a Ceph cluster from 7 nodes, but now I don't see any (or just one) of the remaining 4 nodes prepared for the Gluster cluster (while the GetNodeList response contains all 12 nodes: 1 Tendrl server, 4+3 Ceph nodes and 4 Gluster nodes).
Comment 28 gowtham 2017-06-21 09:04:41 EDT
@negupta@redhat.com
If some nodes do not have network details, why are the remaining nodes displayed?
I think this has to be fixed in the UI.
Comment 29 gowtham 2017-06-21 09:14:01 EDT
(In reply to Daniel Horák from comment #26)
> I'm facing the same issue with the latest packages:
>   tendrl-alerting-3.0-alpha.4.el7scon.noarch
>   tendrl-api-3.0-alpha.5.el7scon.noarch
>   tendrl-api-httpd-3.0-alpha.5.el7scon.noarch
>   tendrl-commons-3.0-alpha.10.el7scon.noarch
>   tendrl-dashboard-3.0-alpha.5.el7scon.noarch
>   tendrl-node-agent-3.0-alpha.10.el7scon.noarch
>   tendrl-performance-monitoring-3.0-alpha.8.el7scon.noarch
> 
> I was able to create Ceph cluster from 7 nodes, but now I don't see any (or
> just one) of the rest 4 nodes prepared for Gluster cluster (while
> GetNodeList response contains all 12 nodes - 1 Tendrl server, 4+3 Ceph nodes
> and 4 Gluster nodes).

Please wait 5 minutes and then check whether the nodes appear. If you still see the same bug after 5 minutes, it could be a different issue. As per the current UI implementation, all nodes are displayed once all nodes have disk and network details in etcd.
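
The "wait 5 minutes, then check" step could be automated along these lines. A rough sketch under stated assumptions: `wait_for_node_details`, the `fetch_nodes` callable, and the `fqdn`, `disks` and `networks` field names are all hypothetical, and the 300-second default simply mirrors the 5 minutes mentioned above. The function polls until every node has both kinds of details or the timeout passes, and returns the nodes that are still incomplete.

```python
import time


def wait_for_node_details(fetch_nodes, timeout=300, interval=10,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_nodes() until all nodes have disk and network details.

    Returns an empty list when every node is complete, or the FQDNs that
    are still incomplete once the timeout has elapsed.
    """
    deadline = clock() + timeout
    while True:
        nodes = fetch_nodes()
        pending = [n["fqdn"] for n in nodes
                   if not (n.get("disks") and n.get("networks"))]
        if not pending:
            return []
        if clock() >= deadline:
            return pending
        sleep(interval)


# Stub standing in for a real API call; both nodes are already populated.
def fetch_stub():
    return [
        {"fqdn": "node-a", "disks": ["vda"], "networks": ["eth0"]},
        {"fqdn": "node-b", "disks": ["vda"], "networks": ["eth0"]},
    ]


print(wait_for_node_details(fetch_stub))
```

A non-empty result after the timeout would point at a backend problem rather than slow population.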
Comment 30 Daniel Horák 2017-06-21 09:22:43 EDT
I saw the issue after more than one hour. I'll try to reproduce it and let you know.
Comment 31 Martin Kudlej 2017-06-21 13:54:04 EDT
(In reply to gowtham from comment #29)
If there is a need to wait 5 minutes after some action, it would be great if the user knew about this. How can the user differentiate between "there is a problem" and "the user should wait 5 minutes"?
Comment 32 gowtham 2017-06-22 00:24:21 EDT
(In reply to Martin Kudlej from comment #31)
> (In reply to gowtham from comment #29)
> If there is need to wait 5 minutes after some action it will be great if
> user knows about this. How can user differentiate between "there is problem"
> and "user should wait 5 minutes"?

No, the reason I am saying this is to check whether there is a problem in the backend. It is a kind of debugging. If you do not see this problem after 5 minutes, then it needs a UI fix.
Comment 35 gowtham 2017-06-27 07:46:56 EDT
I have found the cause of this issue: it happens because of an etcd timeout error:
https://github.com/Tendrl/node-agent/issues/529

The etcdctl cluster-health command does not give any result about etcd on the QE machines. Could anyone from QE explain why?
Comment 38 gowtham 2017-06-27 08:47:19 EDT
Pull request for this issue: https://github.com/Tendrl/commons/pull/632
Comment 41 Martin Kudlej 2017-06-27 09:35:21 EDT
(In reply to gowtham from comment #35)
Try (where ${HOSTNAME} is the node with the etcd instance):
$ etcdctl --endpoints="http://${HOSTNAME}:2379" cluster-health
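
The same check can be scripted across several machines. A minimal sketch, assuming `etcdctl` is on the PATH and etcd listens on port 2379 as in the command above; the helper names and the example hostname are hypothetical. The timeout guards against the command hanging without output.

```python
import subprocess


def cluster_health_cmd(hostname, port=2379):
    """Build the argv for the etcdctl command shown above."""
    return ["etcdctl", f"--endpoints=http://{hostname}:{port}", "cluster-health"]


def run_cluster_health(hostname, timeout=10):
    """Run the command; return (returncode, output), or (None, reason) on failure to run."""
    try:
        proc = subprocess.run(cluster_health_cmd(hostname),
                              capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return None, "etcdctl not found"
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return proc.returncode, proc.stdout + proc.stderr


# Only print the command here; call run_cluster_health() on a host that has etcdctl.
print(cluster_health_cmd("etcd.example.com"))
```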
