Bug 1572151
Summary: | A storage node which is peer probed with IP always shows deleted bricks in the UI
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | gowtham <gshanmug>
Component: | web-admin-tendrl-node-agent | Assignee: | gowtham <gshanmug>
Status: | CLOSED ERRATA | QA Contact: | Daniel Horák <dahorak>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified
Version: | rhgs-3.4 | CC: | dahorak, gshanmug, mbukatov, nthomas, rhs-bugs, sankarshan
Target Milestone: | ---
Target Release: | RHGS 3.4.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: | tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-09-04 07:05:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1503137
Attachments: |
Description
gowtham
2018-04-26 09:39:05 UTC
Please include the full version of the affected package.

tendrl-node-agent-1.6.3-3.el7rhgs.noarch

During testing of this bug, I've hit a similar/related issue also on the new packages.

Steps to Reproduce:
1. Prepare a Gluster cluster with one volume, using the following gdeploy configuration file (the hosts are defined by IP, not FQDN):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[peer]
action=probe
ignore_peer_errors=no

[backend-setup]
devices=vdb,vdc
vgs=vg_alpha_distrep_{1,2}
pools=pool_alpha_distrep_{1,2}
lvs=lv_alpha_distrep_{1,2}
mountpoints=/mnt/brick_alpha_distrep_{1,2}
brick_dirs=/mnt/brick_alpha_distrep_1/1,/mnt/brick_alpha_distrep_2/2

[volume]
volname=volume_alpha_distrep_6x2
action=create
brick_dirs=/mnt/brick_alpha_distrep_1/1,/mnt/brick_alpha_distrep_2/2
transport=tcp
replica=yes
replica_count=2

[hosts]
10.37.169.136
10.37.169.137
10.37.169.138
10.37.169.139
10.37.169.127
10.37.169.142
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The result is that the first Gluster server is identified by hostname (this is the server from which the peer probe was initiated) and all the other servers are identified by IP:

# gluster pool list
UUID                                    Hostname                State
a53caa1a-4397-4b22-82a6-fe5c6674e1b7    dahorak-usm1-gl1..com   Connected
2208df91-9d53-4aa3-92d1-52e354bb9fce    10.37.169.127           Connected
dde08d62-4fad-4de3-b692-19b736d36380    10.37.169.137           Connected
a98ecca7-cc2f-46ca-99c1-ac36f9270531    10.37.169.138           Connected
1ee762b2-972a-45fa-8151-aab02431262c    10.37.169.139           Connected
70277974-f887-45cb-9268-db9240f2df9c    localhost               Connected

# gluster volume info

Volume Name: volume_alpha_distrep_6x2
Type: Distributed-Replicate
Volume ID: 7621f2c4-4614-4f05-895e-8134dcec3d51
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.37.169.136:/mnt/brick_alpha_distrep_1/1
Brick2: 10.37.169.137:/mnt/brick_alpha_distrep_1/1
Brick3: 10.37.169.138:/mnt/brick_alpha_distrep_1/1
Brick4: 10.37.169.139:/mnt/brick_alpha_distrep_1/1
Brick5: 10.37.169.127:/mnt/brick_alpha_distrep_1/1
Brick6: 10.37.169.142:/mnt/brick_alpha_distrep_1/1
Brick7: 10.37.169.136:/mnt/brick_alpha_distrep_2/2
Brick8: 10.37.169.137:/mnt/brick_alpha_distrep_2/2
Brick9: 10.37.169.138:/mnt/brick_alpha_distrep_2/2
Brick10: 10.37.169.139:/mnt/brick_alpha_distrep_2/2
Brick11: 10.37.169.127:/mnt/brick_alpha_distrep_2/2
Brick12: 10.37.169.142:/mnt/brick_alpha_distrep_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

2. Install and configure RHGS WA.

3. Stop tendrl-node-agent on all nodes (including the RHGS WA Server):
# systemctl stop tendrl-node-agent

4. Clean up the content of the etcd database.
I've simply removed all the "directories":
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 ls
/nodes
/notifications
/indexes
/clusters
/queue
/networks
/alerting
/messages
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /nodes
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /notifications
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /indexes
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /clusters
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /queue
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /networks
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /alerting
# etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /messages

5. Start tendrl-node-agent on the second node (which was peer probed by IP):
# systemctl start tendrl-node-agent

6. After a few seconds, start tendrl-node-agent on all other storage nodes and all tendrl-* services on the RHGS WA Server:
# systemctl start tendrl-node-agent
# systemctl start tendrl-node-agent tendrl-api tendrl-monitoring-integration tendrl-notifier

7. Import the cluster into RHGS WA.

8. Check the Hosts page.

9. Check the Volumes -> <volume> -> Bricks Details page.

10. Remove two bricks (because of the replica count); one of the bricks has to be from the second node (where tendrl-node-agent was started first in step 5):
# gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 start
# gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 status
# gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 commit

11. Check the Volumes -> <volume> -> Bricks Details page again.

Actual results:
7. The first and second nodes are identified by hostname.
8. The bricks count is missing for the second node.
9. All the bricks seem to be correctly displayed.
11. The brick details were not correctly updated after the remove-brick.
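Since the dashboards only render what is stored in etcd, a quick way to tell whether the stale bricks above are a UI-only problem or leftover objects in the store is to search the etcd tree directly. The following is only a sketch using python-etcd (which is installed on the RHGS WA server); the server name, the TLS paths, and the idea of loosely grepping keys under /clusters are assumptions, not a documented Tendrl interface:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# check_stale_bricks.py -- hypothetical helper, not part of Tendrl
import etcd  # python-etcd, present on the RHGS WA server

# Connection details are assumptions; adjust them to the local etcd TLS setup.
client = etcd.Client(
    host="tendrl-server.example.com",
    port=2379,
    protocol="https",
    cert=("/etc/pki/tls/certs/etcd.crt", "/etc/pki/tls/private/etcd.key"),
    ca_cert="/etc/pki/tls/certs/ca.crt",
)

# Directory name of the bricks removed in step 10 above. The key layout
# under /clusters is Tendrl-internal and may identify hosts by FQDN or
# by IP (which is exactly what this bug is about), so the match is kept
# deliberately loose and the hits are printed for manual inspection.
brick_dir = "brick_alpha_distrep_1"

for leaf in client.read("/clusters", recursive=True).leaves:
    if brick_dir in leaf.key:
        print(leaf.key)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If keys for the removed bricks are still present after the remove-brick commit, the stale entries seen in the UI are expected, and the problem lies in the sync rather than in the UI layer.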
Version-Release number of selected component:

RHGS WA Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch

Gluster Storage Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

Created attachment 1441461 [details]
Screenshot for Comment 8

I've tried to reproduce it on the older packages, to fully understand the issue, but I didn't hit the described problem. I've tried it via the following steps, as we discussed over chat:

1. I've created a Gluster trusted storage pool via peer probe (using hostnames). The result is this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# gluster pool list
UUID                                    Hostname                State
fef54a90-384e-4161-b890-81f43118cf2a    gl2.example.com         Connected
f79486c8-14ae-4dea-a435-e340bf9d9c28    gl3.example.com         Connected
46c824c6-7cda-490f-b158-750b0128d7bc    gl4.example.com         Connected
c3dc9128-bb54-491d-803e-ed209c360105    gl5.example.com         Connected
a1befdff-13f0-4e0f-bb4a-aa14116f807f    gl6.example.com         Connected
b0310ad4-829a-45c1-937d-89f1e4c5ef77    localhost               Connected
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. I've created a Distributed-Replicated volume, using IPs for the bricks:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# gluster volume info volume_alpha_distrep_6x2

Volume Name: volume_alpha_distrep_6x2
Type: Distributed-Replicate
Volume ID: 184ad3ba-65e1-4593-a22c-58dcf711bcc2
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.37.169.67:/mnt/brick_alpha_distrep_1/1
Brick2: 10.37.169.80:/mnt/brick_alpha_distrep_1/1
Brick3: 10.37.169.93:/mnt/brick_alpha_distrep_1/1
Brick4: 10.37.169.102:/mnt/brick_alpha_distrep_1/1
Brick5: 10.37.169.112:/mnt/brick_alpha_distrep_1/1
Brick6: 10.37.169.120:/mnt/brick_alpha_distrep_1/1
Brick7: 10.37.169.67:/mnt/brick_alpha_distrep_2/2
Brick8: 10.37.169.80:/mnt/brick_alpha_distrep_2/2
Brick9: 10.37.169.93:/mnt/brick_alpha_distrep_2/2
Brick10: 10.37.169.102:/mnt/brick_alpha_distrep_2/2
Brick11: 10.37.169.112:/mnt/brick_alpha_distrep_2/2
Brick12: 10.37.169.120:/mnt/brick_alpha_distrep_2/2
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Then I've installed RHGS WA using tendrl-ansible, stopped tendrl-node-agent on all hosts, and cleaned up the etcd database (same as in comment 8, steps 2-4).

4. Start tendrl-node-agent on the gl2 node (comment 8, step 5).

5. After a few seconds, start tendrl-node-agent on all other storage nodes and all tendrl-* services on the RHGS WA Server (comment 8, step 6).

6. Import the cluster into RHGS WA and wait some time.

7. Remove the first two bricks (the second one is on the gl2 node).

8. Wait for some time and check the Hosts -> <host> -> Bricks Details page.

9. Check the Volumes -> <volume> -> Brick Details page.

The list of bricks is correct on both pages.

Tried with packages from puddle repo 2018-04-27.1:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-2.el7rhgs.noarch
tendrl-api-httpd-1.6.3-2.el7rhgs.noarch
tendrl-commons-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-3.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

@gowtham, could you please check this scenario and guide me as to what I'm doing wrong, since I'm not able to reproduce the issue? Thanks

The issue mentioned by me and by Daniel is actually the same. I sent a PR upstream for it on April 26 (https://github.com/Tendrl/node-agent/commit/2c8643ed8deef5e508a16b1a5774d43190520627#diff-3dfeadf899a1fe4e98d3f57feef0b4da), but during the downstream build of tendrl-node-agent-1.6.3-4.el7rhgs this patch was missed in the cherry-pick. I thought the change was already downstream, which is why I misunderstood Daniel's comment, and that caused a lot of confusion. The patch should be taken into the next build. (A short illustration of the host-matching problem the patch deals with is sketched below, after the verification notes.)

Based on comment 12, the steps from comment 8 were identified as the reproduction scenario for this bug.

Tested and verified with the same steps as described in comment 8. Besides that, tested with a few variations (volumes created using IPs, FQDNs, and short names).
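For context, the host-matching problem that the node-agent patch above deals with can be illustrated in a few lines of plain Python. This is a hypothetical sketch, not the actual Tendrl code, and the FQDN in it is made up: gluster reports a brick's host as an IP when the peer was probed by IP, while RHGS WA may know the same node by its FQDN, so a naive string comparison never matches; normalizing both identifiers (here by resolving them to an IP) is one way to make the match work.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# host_match_demo.py -- hypothetical illustration, not Tendrl code
import socket


def normalize(host):
    """Resolve a hostname or IP string to an IP address for comparison."""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        # The name does not resolve in this environment; keep the raw string.
        return host


# Brick host as reported by `gluster volume info` when the peer was
# probed by IP (value taken from the reproduction steps in comment 8).
brick_host = "10.37.169.137"

# The same node as RHGS WA might know it -- a made-up FQDN for the example.
node_name = "storage-node-2.example.com"

# A naive string comparison fails even though both refer to the same node.
print(brick_host == node_name)

# Comparing the resolved addresses succeeds, provided the FQDN actually
# resolves to that IP in the environment where this runs.
print(normalize(brick_host) == normalize(node_name))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The referenced upstream commit remains the authoritative fix; the sketch only shows why brick-to-node attribution breaks when the two identifiers are compared literally.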
Version-Release number of selected component:

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Gluster Storage Server:
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
python-debtcollector-1.8.0-1.el7ost.noarch
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616