Description of problem:
If a setup keeps running without interruption, the unmanage and import flows work fine. If node-agent is stopped, or a node or the server is down for some time, the import and unmanage flows fail.

Version-Release number of selected component (if applicable):

How reproducible:
Reproducible using the following steps.

Steps to Reproduce:
1. Stop node-agent on the server and on a few storage nodes for 200 seconds. After 200 seconds the status watcher field in the node_context object is deleted. The nodes are not marked as down: only the server monitors the nodes and marks them as down, and in this case the server is down as well. So the status watcher field for the node_context object is not created in etcd.
2. Start node-agent on the server and on the other storage nodes. The node-agent sync then fails with this exception:

Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: Traceback (most recent call last):
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: self.run()
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/node_agent/node_sync/__init__.py", line 52, in run
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: NS.node_context.save(ttl=_sync_ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/node_context/__init__.py", line 98, in save
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: etcd_utils.refresh(status, ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/etcd_utils.py", line 82, in refresh
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: raise ex
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: EtcdKeyNotFound: Key not found : /nodes/3d389d10-2e02-43f1-9e40-9104825798a4/NodeContext/status

3. The status watcher field no longer exists, but node-agent still tries to refresh it, so an EtcdKeyNotFound exception is raised.
4. If an import has already been done and unmanage is then started, it fails: the job is not picked up by the nodes, so it times out.
5. If import has not been done yet and import is started, it fails for the same reason.
6. node-agent is down, so all flows fail.

Actual results:
After the server and some storage nodes have been stopped for some time and then started again, the import and unmanage flows fail.

Expected results:
Import and unmanage should work when the server and the nodes are up.

Additional info:
The upstream PR for this has been merged: https://github.com/Tendrl/commons/pull/984
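To illustrate the failure mode in the traceback from comment #0: refreshing the TTL of a key only works while the key still exists, so a save after a long outage needs a fallback that writes the key again. The following is a minimal sketch of that idea, not the code from the merged PR. `FakeStore`, `KeyNotFound`, `save_status` and the `node-1` key are hypothetical names made up for this example; the store is a small in-memory stand-in and does not talk to etcd.

```python
import time


class KeyNotFound(Exception):
    """Stand-in for the EtcdKeyNotFound error seen in the traceback."""


class FakeStore(object):
    """Tiny in-memory key/value store with TTLs (hypothetical, not etcd)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def _expire(self, key):
        # Drop the key if its TTL has run out.
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]

    def write(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def read(self, key):
        self._expire(key)
        if key not in self._data:
            raise KeyNotFound(key)
        return self._data[key][0]

    def refresh(self, key, ttl):
        # Only extends the TTL of an existing key; fails if it expired.
        value = self.read(key)
        self.write(key, value, ttl)


def save_status(store, key, value, ttl):
    """Refresh the TTL if the key exists, otherwise write the key again."""
    try:
        store.refresh(key, ttl)
        return "refreshed"
    except KeyNotFound:
        # Assumption: re-creating the key is the desired recovery here.
        store.write(key, value, ttl)
        return "recreated"
```

With this shape, a sync loop that calls `save_status(store, "/nodes/node-1/NodeContext/status", "UP", 200)` keeps working after the key has expired, instead of dying on the first refresh.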
(In reply to gowtham from comment #0)
> Version-Release number of selected component (if applicable):

Could you provide the versions of the affected builds by running the following on one machine from the cluster:

```
# rpm -qa | grep tendrl | sort
```
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
I tested the reproducer from this BZ and from BZ 1570048 a few times and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616