Bug 1572151

Summary: A storage node which is peer probed with IP always shows deleted bricks in the UI
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: gowtham <gshanmug>
Component: web-admin-tendrl-node-agent Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA QA Contact: Daniel Horák <dahorak>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: dahorak, gshanmug, mbukatov, nthomas, rhs-bugs, sankarshan
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 07:05:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503137    
Attachments:
Description Flags
Screenshot for Comment 8 none

Description gowtham 2018-04-26 09:39:05 UTC
Description of problem:

As per the new fqdn/ip change, tendrl uses the peer hostname as the node fqdn in the node_context object. However, if a node was peer probed by IP and its node-agent is started first, it records and uses the hostname as the fqdn instead of the IP. As a result, processing of gluster events such as brick-delete and peer-detach runs into problems because the fqdn does not match.
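The failure mode can be sketched in a few lines of Python (a hypothetical model for illustration only; `bricks_by_host`, the event shape, and the host strings are assumptions, not the actual tendrl node_context schema):

```python
# Minimal sketch of the fqdn/IP mismatch. Hypothetical data model,
# not the real tendrl schema.

def apply_brick_delete(bricks_by_host, event_host, brick_path):
    """Remove a brick keyed by the host string carried in the event."""
    paths = bricks_by_host.get(event_host)
    if paths is None:
        # Host key mismatch: event uses IP, store uses hostname.
        return False
    paths.discard(brick_path)
    return True

# node-agent started first on this node, so it recorded its hostname...
bricks_by_host = {"gl1.example.com": {"/mnt/brick_1/1", "/mnt/brick_2/2"}}

# ...but the peer was probed by IP, so the gluster event carries the IP.
deleted = apply_brick_delete(bricks_by_host, "10.37.169.136", "/mnt/brick_1/1")
print(deleted)                             # False: the delete is silently dropped
print(bricks_by_host["gl1.example.com"])   # stale brick still listed
```

Normalizing both sides to one canonical identifier (resolving the probed IP to the same fqdn the node-agent records, or vice versa) is what would make the lookup hit.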

Version-Release number of selected component (if applicable):


How reproducible:

Peer probe a node with IP and start its node-agent first; in the UI the node is displayed with its hostname, not the IP. Import the cluster, then delete some brick. On the bricks page the deleted brick is always still present.

Steps to Reproduce:
1. Peer probe any one node with IP.
2. Before starting node-agent on the other nodes, start node-agent on the storage node that was peer probed with IP.
3. Import the cluster.
4. After brick sync, delete some brick via the CLI.
5. The deleted brick remains on the bricks page; it is never removed.

Actual results:
When a brick is deleted, the deleted brick is still displayed in the tendrl UI.

Expected results:
When a brick is deleted, tendrl should no longer show that brick.

Additional info:

Comment 2 Martin Bukatovic 2018-05-04 17:06:37 UTC
Please include full version of affected package.

Comment 3 Nishanth Thomas 2018-05-04 18:27:47 UTC
tendrl-node-agent-1.6.3-3.el7rhgs.noarch

Comment 8 Daniel Horák 2018-05-25 09:10:34 UTC
During testing of this bug, I've hit a similar/related issue on the
new packages as well:

Steps to Reproduce:
1. Prepare Gluster cluster with one volume, using following gdeploy
  configuration file (the hosts are defined by IP, not fqdn):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    [peer]
    action=probe
    ignore_peer_errors=no

    [backend-setup]
    devices=vdb,vdc
    vgs=vg_alpha_distrep_{1,2}
    pools=pool_alpha_distrep_{1,2}
    lvs=lv_alpha_distrep_{1,2}
    mountpoints=/mnt/brick_alpha_distrep_{1,2}
    brick_dirs=/mnt/brick_alpha_distrep_1/1,/mnt/brick_alpha_distrep_2/2

    [volume]
    volname=volume_alpha_distrep_6x2
    action=create
    brick_dirs=/mnt/brick_alpha_distrep_1/1,/mnt/brick_alpha_distrep_2/2
    transport=tcp
    replica=yes
    replica_count=2

    [hosts]
    10.37.169.136
    10.37.169.137
    10.37.169.138
    10.37.169.139
    10.37.169.127
    10.37.169.142
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The result is that the first Gluster server is identified by hostname
  (this is the server from which the peer probe was initiated) and all the
  other servers are identified by IP.

  # gluster pool list
    UUID					Hostname                                     	State
    a53caa1a-4397-4b22-82a6-fe5c6674e1b7	dahorak-usm1-gl1..com	Connected 
    2208df91-9d53-4aa3-92d1-52e354bb9fce	10.37.169.127        	Connected 
    dde08d62-4fad-4de3-b692-19b736d36380	10.37.169.137        	Connected 
    a98ecca7-cc2f-46ca-99c1-ac36f9270531	10.37.169.138        	Connected 
    1ee762b2-972a-45fa-8151-aab02431262c	10.37.169.139        	Connected 
    70277974-f887-45cb-9268-db9240f2df9c	localhost            	Connected 

  # gluster volume info
    Volume Name: volume_alpha_distrep_6x2
    Type: Distributed-Replicate
    Volume ID: 7621f2c4-4614-4f05-895e-8134dcec3d51
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 6 x 2 = 12
    Transport-type: tcp
    Bricks:
    Brick1: 10.37.169.136:/mnt/brick_alpha_distrep_1/1
    Brick2: 10.37.169.137:/mnt/brick_alpha_distrep_1/1
    Brick3: 10.37.169.138:/mnt/brick_alpha_distrep_1/1
    Brick4: 10.37.169.139:/mnt/brick_alpha_distrep_1/1
    Brick5: 10.37.169.127:/mnt/brick_alpha_distrep_1/1
    Brick6: 10.37.169.142:/mnt/brick_alpha_distrep_1/1
    Brick7: 10.37.169.136:/mnt/brick_alpha_distrep_2/2
    Brick8: 10.37.169.137:/mnt/brick_alpha_distrep_2/2
    Brick9: 10.37.169.138:/mnt/brick_alpha_distrep_2/2
    Brick10: 10.37.169.139:/mnt/brick_alpha_distrep_2/2
    Brick11: 10.37.169.127:/mnt/brick_alpha_distrep_2/2
    Brick12: 10.37.169.142:/mnt/brick_alpha_distrep_2/2
    Options Reconfigured:
    diagnostics.count-fop-hits: on
    diagnostics.latency-measurement: on
    transport.address-family: inet
    nfs.disable: on
    performance.client-io-threads: off

2. Install and configure RHGS WA.

3. Stop tendrl-node-agent on all nodes (including RHGS WA Server)
  # systemctl stop tendrl-node-agent

4. Clean up the content of the etcd database.
  I simply removed all the top-level "directories":
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 ls
    /nodes
    /notifications
    /indexes
    /clusters
    /queue
    /networks
    /alerting
    /messages
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /nodes
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /notifications
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /indexes
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /clusters
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /queue
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /networks
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /alerting
  # etcdctl ${ETCD_SSL_ARGS} --endpoints https://${HOSTNAME}:2379 rm -r /messages
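  The eight removals above can be collapsed into one loop. A sketch with
  `etcdctl` stubbed out so it runs standalone (remove the stub, and keep
  ETCD_SSL_ARGS set from your environment, to run it against a real etcd):

```shell
# Stubbed so the sketch is self-contained; delete this function to run
# against a real etcd endpoint.
etcdctl() { echo "etcdctl $*"; }

ETCD_SSL_ARGS=""
HOSTNAME=${HOSTNAME:-localhost}

# Same directories as listed by `etcdctl ... ls` above.
for d in /nodes /notifications /indexes /clusters /queue /networks /alerting /messages; do
    etcdctl ${ETCD_SSL_ARGS} --endpoints "https://${HOSTNAME}:2379" rm -r "$d"
done
```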

5. Start tendrl-node-agent on the second node (which was peer probed by IP).
  # systemctl start tendrl-node-agent

6. After a few seconds, start tendrl-node-agent on all other storage nodes
  and all tendrl-* services on the RHGS WA Server.
  # systemctl start tendrl-node-agent    (on the storage nodes)
  # systemctl start tendrl-node-agent tendrl-api tendrl-monitoring-integration tendrl-notifier    (on the RHGS WA Server)

7. Import the cluster into RHGS WA.

8. Check Hosts page.

9. Check Volumes -> <volume> -> Bricks Details page.

10. Remove two bricks (because of the replica count); one of the bricks has to be
  on the second node (where tendrl-node-agent was started first in step 5).
  # gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 start
  # gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 status
  # gluster volume remove-brick volume_alpha_distrep_6x2 10.37.169.136:/mnt/brick_alpha_distrep_1/1 10.37.169.137:/mnt/brick_alpha_distrep_1/1 commit

11. Check Volumes -> <volume> -> Bricks Details page.

Actual results:
  7. First and second nodes are identified by hostname.
  8. The bricks count is missing for the second node.
  9. All the bricks seem to be correctly displayed.
  11. The brick details weren't correctly updated.


Version-Release number of selected component:
RHGS WA Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  etcd-3.2.7-1.el7.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  rubygem-etcd-0.3.0-2.el7rhgs.noarch
  tendrl-ansible-1.6.3-4.el7rhgs.noarch
  tendrl-api-1.6.3-3.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
  tendrl-commons-1.6.3-5.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
  tendrl-node-agent-1.6.3-5.el7rhgs.noarch
  tendrl-notifier-1.6.3-3.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-2.el7rhgs.noarch

Gluster Storage Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-12.el7rhgs.x86_64
  glusterfs-api-3.12.2-12.el7rhgs.x86_64
  glusterfs-cli-3.12.2-12.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
  glusterfs-events-3.12.2-12.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
  glusterfs-libs-3.12.2-12.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
  glusterfs-server-3.12.2-12.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-5.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-3.el7rhgs.noarch
  tendrl-node-agent-1.6.3-5.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch

Comment 9 Daniel Horák 2018-05-25 09:17:52 UTC
Created attachment 1441461 [details]
Screenshot for Comment 8

Comment 11 Daniel Horák 2018-05-29 11:46:05 UTC
I've tried to reproduce it on the older packages, to fully understand the issue,
but I didn't hit the described issue.

I tried the following steps, as we discussed over chat:

1. I've created Gluster Trusted storage pool via peer probe (using hostnames).
  The result is this:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # gluster pool list
  UUID					Hostname                                 	State
  fef54a90-384e-4161-b890-81f43118cf2a	gl2.example.com 	Connected 
  f79486c8-14ae-4dea-a435-e340bf9d9c28	gl3.example.com 	Connected 
  46c824c6-7cda-490f-b158-750b0128d7bc	gl4.example.com 	Connected 
  c3dc9128-bb54-491d-803e-ed209c360105	gl5.example.com 	Connected 
  a1befdff-13f0-4e0f-bb4a-aa14116f807f	gl6.example.com 	Connected 
  b0310ad4-829a-45c1-937d-89f1e4c5ef77	localhost       	Connected 
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. I've created Distributed-Replicated volume, using IPs for bricks:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # gluster volume info volume_alpha_distrep_6x2
 
  Volume Name: volume_alpha_distrep_6x2
  Type: Distributed-Replicate
  Volume ID: 184ad3ba-65e1-4593-a22c-58dcf711bcc2
  Status: Started
  Snapshot Count: 0
  Number of Bricks: 5 x 2 = 10
  Transport-type: tcp
  Bricks:
  Brick5: 10.37.169.67:/mnt/brick_alpha_distrep_1/1
  Brick6: 10.37.169.80:/mnt/brick_alpha_distrep_1/1
  Brick1: 10.37.169.93:/mnt/brick_alpha_distrep_1/1
  Brick2: 10.37.169.102:/mnt/brick_alpha_distrep_1/1
  Brick3: 10.37.169.112:/mnt/brick_alpha_distrep_1/1
  Brick4: 10.37.169.120:/mnt/brick_alpha_distrep_1/1
  Brick5: 10.37.169.67:/mnt/brick_alpha_distrep_2/2
  Brick6: 10.37.169.80:/mnt/brick_alpha_distrep_2/2
  Brick7: 10.37.169.93:/mnt/brick_alpha_distrep_2/2
  Brick8: 10.37.169.102:/mnt/brick_alpha_distrep_2/2
  Brick9: 10.37.169.112:/mnt/brick_alpha_distrep_2/2
  Brick10: 10.37.169.120:/mnt/brick_alpha_distrep_2/2
  Options Reconfigured:
  transport.address-family: inet
  nfs.disable: on
  performance.client-io-threads: off
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Then I installed RHGS WA using tendrl-ansible, stopped tendrl-node-agent
  on all hosts and cleaned up the etcd database (same as in comment 8 steps 2-4).

4. Start tendrl-node-agent on the gl2 node (Comment 8 step 5.)

5. After a few seconds, start tendrl-node-agent on all other storage nodes
  and all tendrl-* services on the RHGS WA Server (Comment 8 step 6.)

6. Import the cluster into RHGS WA and wait some time.

7. Remove first two bricks (the second one is on the gl2 node).

8. Wait for some time and check Hosts -> <host> -> Bricks Details page.

9. Check Volumes -> <volume> -> Brick Details page.

The list of bricks is correct on both pages.

Tried with packages from puddle repo from 2018-04-27.1:
  tendrl-ansible-1.6.3-3.el7rhgs.noarch
  tendrl-api-1.6.3-2.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-2.el7rhgs.noarch
  tendrl-commons-1.6.3-3.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
  tendrl-node-agent-1.6.3-3.el7rhgs.noarch
  tendrl-notifier-1.6.3-2.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-1.el7rhgs.noarch

@gowtham, could you please check this scenario and point out what I'm doing
wrong that prevents me from reproducing the issue?
Thanks

Comment 12 gowtham 2018-05-30 12:29:46 UTC
The issue I mentioned and the one Daniel describes are actually the same. I sent a PR upstream for it on April 26: https://github.com/Tendrl/node-agent/commit/2c8643ed8deef5e508a16b1a5774d43190520627#diff-3dfeadf899a1fe4e98d3f57feef0b4da. However, during the downstream build of tendrl-node-agent-1.6.3-4.el7rhgs this patch was missed in the cherry-pick, while I assumed the change was already downstream; that is why I misunderstood Daniel's comment, and a lot of confusion resulted. The patch should be included in the next build.
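A missed cherry-pick like this can be caught mechanically before a build by checking that the fix commit is an ancestor of the branch being built. A throwaway-repo sketch (the commit message is invented; in practice you would run the ancestry check in the real node-agent tree against the upstream commit hash above):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
# Stand-in for the upstream fix commit.
git -c user.email=qa@example.com -c user.name=qa \
    commit -q --allow-empty -m "use peer hostname as node fqdn"
sha=$(git rev-parse HEAD)

# The actual gate: is the fix contained in the branch being built?
if git merge-base --is-ancestor "$sha" HEAD; then
    echo "fix present"
else
    echo "fix missing"
fi
```

Wiring this check into the downstream build would have flagged tendrl-node-agent-1.6.3-4.el7rhgs as missing the patch.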

Comment 13 Daniel Horák 2018-06-04 15:08:00 UTC
Based on comment 12, steps from comment 8 were identified as reproduction scenario for this bug.

Tested and Verified with the same steps as described in comment 8.
Besides that, tested with a few variations (volumes created using IPs,
fqdn, and short names).

Version-Release number of selected component:
RHGS WA Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  etcd-3.2.7-1.el7.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libcollection-0.7.0-29.el7.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  rubygem-etcd-0.3.0-2.el7rhgs.noarch
  tendrl-ansible-1.6.3-4.el7rhgs.noarch
  tendrl-api-1.6.3-3.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
  tendrl-commons-1.6.3-6.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
  tendrl-node-agent-1.6.3-6.el7rhgs.noarch
  tendrl-notifier-1.6.3-3.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-3.el7rhgs.noarch

Gluster Storage Server:
  Red Hat Gluster Storage Server 3.4.0
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-12.el7rhgs.x86_64
  glusterfs-api-3.12.2-12.el7rhgs.x86_64
  glusterfs-cli-3.12.2-12.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
  glusterfs-events-3.12.2-12.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
  glusterfs-libs-3.12.2-12.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
  glusterfs-server-3.12.2-12.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libcollection-0.7.0-29.el7.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
  python2-gluster-3.12.2-12.el7rhgs.x86_64
  python-debtcollector-1.8.0-1.el7ost.noarch
  python-etcd-0.4.5-2.el7rhgs.noarch
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-6.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-4.el7rhgs.noarch
  tendrl-node-agent-1.6.3-6.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED

Comment 15 errata-xmlrpc 2018-09-04 07:05:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616