Bug 1383901

Summary: Re-running config.yml fails when metrics deployment is enabled
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Installer
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3.0
CC: aos-bugs, dgoodwin, jokerman, mmccomas, sdodson, tdawson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the installer re-ran the metrics deployment steps whenever the configuration playbook was re-run. The playbooks have been updated to run the metrics deployment tasks only once. If a previous metrics installation failed, the administrator must either resolve the issue manually or remove the metrics deployment and re-run the configuration playbook. See the following documentation for cleanup instructions: https://docs.openshift.com/container-platform/3.3/install_config/cluster_metrics.html#metrics-cleanup
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 12:42:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Anping Li 2016-10-12 06:11:39 UTC
Description of problem:
When metrics deployment is enabled in the inventory file, re-running config.yml fails at TASK [openshift_metrics : Wait for image pull and deployer pod].

The root cause is that the metrics-deployer pod fails because a resource it tries to create already exists.
The installer should set mode=redeploy when redeploying metrics.
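The suggestion above can be sketched as a small shell helper that picks the deployer mode from whether the metrics secrets are already present. This is a hypothetical illustration, not the installer's actual logic or the eventual fix: the function name `choose_deployer_mode` is invented here, and it takes the existence check as an argument so it runs without a cluster. The mode values `deploy` and `redeploy` come from this report (the deployer log below shows `mode=deploy`).

```shell
# Hypothetical helper (not part of openshift-ansible): choose the metrics
# deployer mode. $1 is "yes" if the hawkular-metrics-secrets secret already
# exists in openshift-infra, "no" otherwise.
choose_deployer_mode() {
    case "$1" in
        yes) echo redeploy ;;  # an earlier run left resources behind
        no)  echo deploy ;;    # nothing there yet: fresh deployment
        *)   echo "usage: choose_deployer_mode yes|no" >&2
             return 2 ;;
    esac
}

# Example on a live cluster (assumes a logged-in `oc`); `oc get` exits
# non-zero when the named secret does not exist:
#   if oc get secret hawkular-metrics-secrets -n openshift-infra >/dev/null 2>&1
#   then exists=yes; else exists=no; fi
#   mode=$(choose_deployer_mode "$exists")
```

Passing the flag in rather than calling `oc` inside the function keeps the decision easy to test; how the chosen mode is handed to the deployer is left open here.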


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.2.28-1

How reproducible:
always

Steps to Reproduce:
1. Enable metrics in the inventory and install OpenShift:

   openshift_hosted_metrics_deploy=True
   openshift_hosted_metrics_write_access=True
   
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
   
2. Re-run config.yml:
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
3. Check the deployer pod logs:
   oc logs metrics-deployer-kdv02 -n openshift-infra

Actual results:
2.  ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
TASK [openshift_metrics : Wait for image pull and deployer pod] ****************
FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (60 retries left).
<--snip-->
<--snip-->
FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (3 retries left).
FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (2 retries left).
FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (1 retries left).
fatal: [openshift-111.lab.eng.nay.redhat.com]: FAILED! => {"changed": true, "cmd": "oc get pods -n openshift-infra | grep metrics-deployer.*Completed", "delta": "0:00:00.402017", "end": "2016-10-11 22:56:58.051938", "failed": true, "rc": 1, "start": "2016-10-11 22:56:57.649921", "stderr": "", "stdout": "", "stdout_lines": [], "warnings": []}

NO MORE HOSTS LEFT *************************************************************
    to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost                  : ok=13   changed=7    unreachable=0    failed=0   
openshift-111.lab.eng.nay.redhat.com : ok=436  changed=28   unreachable=0    failed=1   
openshift-112.lab.eng.nay.redhat.com : ok=129  changed=6    unreachable=0    failed=0   

3. [root@openshift-111 ~]# oc logs metrics-deployer-kdv02 -n openshift-infra
+ image_prefix=registry.access.redhat.com/openshift3/
+ image_version=3.2.1
+ master_url=https://kubernetes.default.svc:443
+ redeploy=false
+ mode=deploy
+ cassandra_nodes=1
+ use_persistent_storage=false
+ cassandra_pv_size=10Gi
+ metric_duration=7
+ heapster_node_id=nodename
+ metric_resolution=10s
+ project=openshift-infra
+ master_ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+ token_file=/var/run/secrets/kubernetes.io/serviceaccount/token
+ dir=/etc/deploy/_output
+ hawkular_metrics_hostname=hawkular-metrics.1008-lzo.qe.rhcloud.com
+ hawkular_metrics_alias=hawkular-metrics
+ hawkular_cassandra_alias=hawkular-cassandra
+ rm -rf /etc/deploy/_output
<--snip-->
<--snip-->
<--snip-->
Creating the Cassandra Certificate Secrets configuration json file
++ echo
++ echo 'Creating the Cassandra Certificate Secrets configuration json file'
++ cat
+++ base64 -w 0 /etc/deploy/_output/hawkular-cassandra.cert
+++ base64 -w 0 /etc/deploy/_output/hawkular-cassandra-ca.cert
Creating Hawkular Metrics & Cassandra Secrets
++ echo 'Creating Hawkular Metrics & Cassandra Secrets'
++ oc create -f /etc/deploy/_output/hawkular-metrics-secrets.json
Error from server: error when creating "/etc/deploy/_output/hawkular-metrics-secrets.json": secrets "hawkular-metrics-secrets" already exists
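The last log line above is what separates this failure from other deployer problems. A minimal sketch follows, using the hypothetical function name `existing_resources`: it scans deployer output on stdin (for example from `oc logs metrics-deployer-kdv02 -n openshift-infra`) and prints each resource the server reported as already existing. The matched wording is copied from this log; other oc versions may phrase the error differently, so treat the pattern as an assumption.

```shell
# Hypothetical helper: list "already exists" conflicts in metrics-deployer logs.
# Reads log text on stdin and prints one "kind name" pair per conflict.
# Returns 0 if at least one conflict was found, 1 otherwise.
existing_resources() {
    # Capture the resource kind and quoted name that follow the last ": "
    # in an "Error from server" line ending in "already exists".
    sed -n 's/.*Error from server: .*: \([a-z]*\) "\([^"]*\)" already exists.*/\1 \2/p' |
        grep .   # grep exits 1 when sed printed nothing
}
```

A non-empty result points at what the earlier run left behind, which is the material the cleanup documentation referenced in the Doc Text covers.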


Expected results:
config.yml can be re-run without error.


Additional info:

Comment 1 Devan Goodwin 2016-11-01 17:59:47 UTC
The playbook run no longer seems to fail, although the deployer error is still present:

[root@ip-172-18-9-38 ec2-user]# oc get pods
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-jm0xg   1/1       Running   0          2h
hawkular-metrics-e1wm9       1/1       Running   0          2h
heapster-nfkv6               1/1       Running   0          2h
metrics-deployer-sy409       0/1       Error     0          1h

Creating the Cassandra Certificate Secrets configuration json file
++ base64 -w 0 /etc/deploy/_output/hawkular-cassandra.cert
++ base64 -w 0 /etc/deploy/_output/hawkular-cassandra-ca.cert
Creating Hawkular Metrics & Cassandra Secrets
+ echo 'Creating Hawkular Metrics & Cassandra Secrets'
+ oc create -f /etc/deploy/_output/hawkular-metrics-secrets.json
Error from server: error when creating "/etc/deploy/_output/hawkular-metrics-secrets.json": secrets "hawkular-metrics-secrets" already exists


However, in Ansible:

FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (2 retries left).
FAILED - RETRYING: TASK: openshift_metrics : Wait for image pull and deployer pod (1 retries left).
changed: [ec2-54-242-151-226.compute-1.amazonaws.com]


It appears to be handled gracefully. Scott, do you think anything needs to be done here?
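The wait task polls `oc get pods -n openshift-infra | grep metrics-deployer.*Completed` and only stops early on success; in the original report it retried until its retries ran out. A hedged sketch of a check that tells the outcomes apart follows, assuming the default `oc get pods` columns (NAME READY STATUS RESTARTS AGE) seen in the listing above. `deployer_state` is a hypothetical name, and it reads the listing on stdin so it can be tried without a cluster. Quoting the awk program also sidesteps the unquoted `.*` in the original grep pattern, which the shell could treat as a glob.

```shell
# Hypothetical helper: summarize metrics-deployer pod state from `oc get pods`
# output on stdin. Prints "completed", "error", or "pending".
deployer_state() {
    awk '
        NR == 1 && $1 == "NAME" { next }       # skip the header row
        $1 ~ /^metrics-deployer-/ {            # only deployer pods matter
            if ($3 == "Completed") ok = 1      # column 3 is STATUS
            else if ($3 == "Error") failed = 1
        }
        END {
            if (ok) print "completed"          # any Completed pod wins
            else if (failed) print "error"
            else print "pending"
        }
    '
}

# Example on a live cluster (assumes a logged-in `oc`):
#   state=$(oc get pods -n openshift-infra | deployer_state)
```

Because any Completed deployer pod counts, a Completed pod left over from an earlier run would still satisfy this check, just as it satisfies the original grep.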

Comment 2 Scott Dodson 2016-11-02 13:31:21 UTC
Anping,

The metrics work in the 3.2 installer was a community contribution and doesn't include all of the fixes we made during the 3.3 development cycle, so metrics deployment is only supported with the 3.3 installer and newer. If we can't reproduce the issue there, and Devan's testing so far shows we can't, then we should close this bug.

Comment 3 Anping Li 2016-11-08 10:01:06 UTC
No such issue with 3.4, so moving to VERIFIED.

Comment 5 errata-xmlrpc 2017-01-18 12:42:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066