Created attachment 1428673 [details]
Screenshot of RHGSWA page which shows bricks from only two nodes

Description of problem:
A 3-node gluster cluster was created and a plain distributed gluster volume was created on it. Each node in the cluster had two interfaces: a 1 GbE interface for management and a 10 GbE interface for data I/O. The gluster peer probe was done over the 10 GbE interface, and the volume was created using the 10 GbE addresses as well.

Here is the output of gluster volume info:

Volume Name: vol1
Type: Distribute
Volume ID: d2b11ebc-956b-4ceb-8fe2-7eea06f1940e
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 172.17.40.14:/bricks/b01/g
Brick2: 172.17.40.15:/bricks/b01/g
Brick3: 172.17.40.16:/bricks/b01/g
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on

This cluster was then imported into RHGSWA. The expectation was that RHGSWA would import the cluster and show 3 hosts (172.17.40.14, 172.17.40.15 and 172.17.40.16), each with one brick. Bricks were visible for 2 of the 3 hosts, but the 3rd node was detected via its 1 GbE interface instead of the 10 GbE one, so no bricks were shown for it. The 3rd node should also have been detected via the 10 GbE interface.

Just FYI, the tendrl server connects to the tendrl nodes over the management (1 GbE) interface, and I suspect the host from which the volume information is being fetched is detected via 1 GbE (gprfs015.sbu.lab.eng.bos.redhat.com), which is why no bricks are shown for it. See the screenshot attached to the bug for a better understanding.
Snippet of inventory file:

[gluster_servers]
gprfs014.sbu.lab.eng.bos.redhat.com
gprfs015.sbu.lab.eng.bos.redhat.com
gprfs016.sbu.lab.eng.bos.redhat.com

[tendrl_server]
dhcp159-16.sbu.lab.eng.bos.redhat.com

[all:vars]
etcd_ip_address=10.16.159.16
etcd_fqdn=dhcp159-16.sbu.lab.eng.bos.redhat.com
graphite_fqdn=dhcp159-16.sbu.lab.eng.bos.redhat.com

Version-Release number of selected component (if applicable):

On Tendrl Server
----------------
rpm -qa | grep tendrl
tendrl-commons-1.6.3-3.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch

On Storage Nodes
----------------
rpm -qa | grep tendrl
tendrl-node-agent-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-3.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
Always
I can't do a peer probe using one interface and then create a volume with bricks on another interface's IPs; gluster gives an error for this.

storage node1: eth0: 10.70.43.186  eth1: 10.70.42.230
storage node2: eth0: 10.70.43.151  eth1: 10.70.42.231
storage node3: eth0: 10.70.43.153  eth1: 10.70.43.14

From 10.70.43.186 (all eth0):

gluster peer probe 10.70.43.151
gluster peer probe 10.70.43.153

gluster peer status:
Number of Peers: 2

Hostname: 10.70.43.151
Uuid: 3f088d2b-105a-4a3d-817f-88cc2ce9cc10
State: Peer in Cluster (Connected)

Hostname: 10.70.43.153
Uuid: 7e81cdd5-5dad-458c-9bdc-db8abe574e7e
State: Peer in Cluster (Connected)

Volume create using the eth1 IPs:

gluster volume create V1 10.70.42.230:/root/glusters/b1 10.70.42.231:/root/glusters/b1 10.70.43.14:/root/glusters/b1 force
volume create: V1: failed: Host 10.70.42.231 is not in 'Peer in Cluster' state

But 10.70.42.231 is there on eth1:

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1a:4a:f7:23:20 brd ff:ff:ff:ff:ff:ff
    inet 10.70.43.151/22 brd 10.70.43.255 scope global dynamic eth0
       valid_lft 81298sec preferred_lft 81298sec
    inet6 2620:52:0:4628:21a:4aff:fef7:2320/64 scope global noprefixroute dynamic
       valid_lft 2591732sec preferred_lft 604532sec
    inet6 fe80::21a:4aff:fef7:2320/64 scope link
       valid_lft forever preferred_lft forever
eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1a:4a:f7:23:24 brd ff:ff:ff:ff:ff:ff
    inet 10.70.42.231/22 brd 10.70.43.255 scope global dynamic eth1
       valid_lft 81301sec preferred_lft 81301sec
    inet6 2620:52:0:4628:21a:4aff:fef7:2324/64 scope global noprefixroute dynamic
       valid_lft 2591732sec preferred_lft 604532sec
    inet6 fe80::21a:4aff:fef7:2324/64 scope link
       valid_lft forever preferred_lft forever
I have talked with Shekhar Berry about this bug. The actual problem is that during the ansible installation he gave the eth0 IPs for all storage nodes, while during the peer probe he used the eth1 IPs.

However, multiple interfaces are possible if the peer probes are done as follows:

- peer probe node B from node A using eth1
- now peer probe node C from node B

If you then look at peer status, the hostname of node B appears in the "Other names" field:

Hostname: 10.70.42.231
Uuid: 42422510-bb1c-42f8-b324-00658e2371ca
State: Peer in Cluster (Connected)
Other names:
dhcp43-151.lab.eng.blr.redhat.com

Hostname: 10.70.43.186
Uuid: 568a2fbe-7f8c-4d38-a01c-b1cca1879d36
State: Peer in Cluster (Connected)

So now we can create bricks using eth1 even though the peer probe was done over eth0. But if we use socket.gethostbyname("10.70.42.231"), it always gives the IP 10.70.43.151, so the peer-probe hostname won't match the brick hostname, and no brick is displayed for that node. The problem here is the handling of the "Other names" field.
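To illustrate the matching problem described above, here is a minimal Python sketch (this is NOT the actual Tendrl code; the function name find_peer_for_brick and the dict layout are hypothetical). If brick-to-peer matching only compares against the peer's primary Hostname, a brick identified via a secondary interface or its FQDN never matches; consulting the "Other names" entries as well makes the match succeed:

```python
def find_peer_for_brick(brick_host, peers):
    """Return the peer that owns brick_host, checking the primary
    hostname first and then any entries under "Other names"."""
    for peer in peers:
        if brick_host == peer["hostname"]:
            return peer
        # Without this extra check, a brick addressed via a secondary
        # name/interface is never associated with its peer, so the UI
        # shows no bricks for that node.
        if brick_host in peer.get("other_names", []):
            return peer
    return None


# Peers as reported by the gluster peer status output above.
peers = [
    {"uuid": "42422510-bb1c-42f8-b324-00658e2371ca",
     "hostname": "10.70.42.231",
     "other_names": ["dhcp43-151.lab.eng.blr.redhat.com"]},
    {"uuid": "568a2fbe-7f8c-4d38-a01c-b1cca1879d36",
     "hostname": "10.70.43.186",
     "other_names": []},
]

# Matches via "Other names", not via the primary hostname:
peer = find_peer_for_brick("dhcp43-151.lab.eng.blr.redhat.com", peers)
```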
The PR for this issue is under review: https://github.com/Tendrl/node-agent/pull/815
Nishanth, I feel all it needs is verification, and given the latest discussions around multiple-network support, I feel it's worth doing the verification again.