Bug 1415268

Summary: OCP 3.4 - Metrics deployment fails
Product: OpenShift Container Platform
Reporter: Veer Muchandi <veer>
Component: Hawkular
Assignee: Matt Wringe <mwringe>
Status: CLOSED NOTABUG
QA Contact: Peng Li <penli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: aos-bugs, jcantril, jforrest, mwringe, pweil, veer, yinzhou
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-15 15:39:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Veer Muchandi 2017-01-20 17:15:54 UTC
Description of problem:
I am trying to upgrade my system from 3.3 to 3.4 using the advanced installation method. The cluster upgrade completed. I then tried to upgrade metrics using the procedure described here: https://docs.openshift.com/container-platform/3.4/install_config/upgrading/automated_upgrades.html#automated-upgrading-cluster-metrics
The metrics deployment failed.
Next, I cleaned up the openshift-infra project and tried to install metrics afresh as per the docs here:
https://docs.openshift.com/container-platform/3.4/install_config/cluster_metrics.html#install-config-cluster-metrics

The deployment process failed with the following error

Deploying Hawkular Metrics & Cassandra Components
scripts/hawkular.sh: line 200: STARTUP_TIMEOUT: unbound variable

This issue seems to have been fixed in origin in a different context

https://bugzilla.redhat.com/show_bug.cgi?id=1395267


https://github.com/openshift/origin-metrics/commit/643515aa28919240c01e5ee70d1d692116f104f6


However, I am not sure whether the fix made it into the current version.

The metrics deployer template appears to be dated September, although the folder was downloaded yesterday.

# ls -l /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml
-rw-r--r--. 1 root root 5375 Sep  4 22:59 /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml

# ls -ld /usr/share/openshift/examples/infrastructure-templates/enterprise
drwxr-xr-x. 2 root root 90 Jan 19 22:11 /usr/share/openshift/examples/infrastructure-templates/enterprise




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Clean up openshift-infra project on a 3.4 cluster
2. Deploy metrics following the documentation here
https://docs.openshift.com/container-platform/3.4/install_config/cluster_metrics.html#install-config-cluster-metrics
In my case I used the following command to start the deployer:
oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer \
    -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME=hawkular.apps.testv3.osecloud.com \
    -p USE_PERSISTENT_STORAGE=false \
    -p MASTER_URL=https://master.testv3.osecloud.com:8443 \
    -p IMAGE_PREFIX=openshift3/ \
    -p IMAGE_VERSION=v3.4

The deployer starts but fails with the following error:

Deploying Hawkular Metrics & Cassandra Components
scripts/hawkular.sh: line 200: STARTUP_TIMEOUT: unbound variable



Actual results:
Metrics deployer fails

Expected results:
Metrics deployer should complete successfully

Additional info:

Here are the deployer logs

# oc logs -f metrics-deployer-1uasp
++ parse_bool false CONTINUE_ON_ERROR
++ local v=false
++ '[' false '!=' true -a false '!=' false ']'
++ echo false
+ continue_on_error=false
+ '[' false == false ']'
+ set -eu
+ deployer_mode=deploy
+ image_prefix=openshift3/
+ image_version=v3.4
+ master_url=https://master.testv3.osecloud.com:8443
+ [[ 3 == \/ ]]
++ parse_bool false REDEPLOY
++ local v=false
++ '[' false '!=' true -a false '!=' false ']'
++ echo false
+ redeploy=false
+ '[' false == true ']'
+ mode=deploy
+ '[' deploy = redeploy ']'
++ parse_bool false IGNORE_PREFLIGHT
++ local v=false
++ '[' false '!=' true -a false '!=' false ']'
++ echo false
+ ignore_preflight=false
+ cassandra_nodes=1
++ parse_bool false USE_PERSISTENT_STORAGE
++ local v=false
++ '[' false '!=' true -a false '!=' false ']'
++ echo false
+ use_persistent_storage=false
++ parse_bool false DYNAMICALLY_PROVISION_STORAGE
++ local v=false
++ '[' false '!=' true -a false '!=' false ']'
++ echo false
+ dynamically_provision_storage=false
+ cassandra_pv_size=10Gi
+ metric_duration=7
+ user_write_access=false
+ heapster_node_id=nodename
+ metric_resolution=15s
+ project=openshift-infra
+ master_ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
+ token_file=/var/run/secrets/kubernetes.io/serviceaccount/token
+ dir=/etc/deploy/_output
+ secret_dir=/secret
+ rm -rf /etc/deploy/_output
+ mkdir -p /etc/deploy/_output
+ chmod 700 /etc/deploy/_output
+ mkdir -p /secret
+ chmod 700 /secret
chmod: changing permissions of '/secret': Read-only file system
+ :
+ hawkular_metrics_hostname=hawkular.apps.testv3.osecloud.com
+ hawkular_metrics_alias=hawkular-metrics
+ hawkular_cassandra_alias=hawkular-cassandra
++ date +%s
+ openshift admin ca create-signer-cert --key=/etc/deploy/_output/ca.key --cert=/etc/deploy/_output/ca.crt --serial=/etc/deploy/_output/ca.serial.txt --name=metrics-signer@1484886470
+ '[' -n 1 ']'
+ oc config set-cluster master --api-version=v1 --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --server=https://master.testv3.osecloud.com:8443
cluster "master" set.
++ cat /var/run/secrets/kubernetes.io/serviceaccount/token
+ oc config set-credentials account --token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtaW5mcmEiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlY3JldC5uYW1lIjoibWV0cmljcy1kZXBsb3llci10b2tlbi1scnZ4NiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJtZXRyaWNzLWRlcGxveWVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiOTlkZTI2YzQtZGVjOC0xMWU2LWFlZTktZmExNjNlNjlhMTk3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1pbmZyYTptZXRyaWNzLWRlcGxveWVyIn0.svwTz9KivfoHFibn3dLKa6OdVwXNmIKUSELO-fNNK288rTbeuCBloUANtwEF2TtZdrjELtVmnCNgHeGz_rgHfIQPsUIT_JAhHGMBHMO4qDbCKSORou8cgJb8AUTGiADg_A9sGo7RqBuGNgj79TdjDOZpMlE6m4NYjY4tarq1d3mlLzbQb1xOetsDdDL8K9QHu4nL6H44pOkE6c2MIXLrB71ZFBZx4j2B4hkPVPq_5-aKA-0dM8yXYJ0PiFz895ntAT8lOUixahzwar4OKxeS-mN-4AfWTpnx-vrHQsk8D5QeqK-O8M2Pfow90mvTFBz-YTcsQl6k6R7otM6pZGw7QA
user "account" set.
+ oc config set-context current --cluster=master --user=account --namespace=openshift-infra
context "current" set.
+ oc config use-context current
switched to context "current".
+ old_kc=/etc/deploy/.kubeconfig
+ KUBECONFIG=/etc/deploy/_output/kube.conf
+ '[' -z 1 ']'
+ oc config set-cluster deployer-master --api-version=v1 --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --server=https://master.testv3.osecloud.com:8443
cluster "deployer-master" set.
++ cat /var/run/secrets/kubernetes.io/serviceaccount/token
+ oc config set-credentials deployer-account --token=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtaW5mcmEiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlY3JldC5uYW1lIjoibWV0cmljcy1kZXBsb3llci10b2tlbi1scnZ4NiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJtZXRyaWNzLWRlcGxveWVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiOTlkZTI2YzQtZGVjOC0xMWU2LWFlZTktZmExNjNlNjlhMTk3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1pbmZyYTptZXRyaWNzLWRlcGxveWVyIn0.svwTz9KivfoHFibn3dLKa6OdVwXNmIKUSELO-fNNK288rTbeuCBloUANtwEF2TtZdrjELtVmnCNgHeGz_rgHfIQPsUIT_JAhHGMBHMO4qDbCKSORou8cgJb8AUTGiADg_A9sGo7RqBuGNgj79TdjDOZpMlE6m4NYjY4tarq1d3mlLzbQb1xOetsDdDL8K9QHu4nL6H44pOkE6c2MIXLrB71ZFBZx4j2B4hkPVPq_5-aKA-0dM8yXYJ0PiFz895ntAT8lOUixahzwar4OKxeS-mN-4AfWTpnx-vrHQsk8D5QeqK-O8M2Pfow90mvTFBz-YTcsQl6k6R7otM6pZGw7QA
user "deployer-account" set.
+ oc config set-context deployer-context --cluster=deployer-master --user=deployer-account --namespace=openshift-infra
context "deployer-context" set.
+ '[' -n 1 ']'
+ oc config use-context deployer-context
switched to context "deployer-context".
+ case $deployer_mode in
+ '[' false '!=' true ']'
+ validate_preflight
+ set +x

PREFLIGHT CHECK SUCCEEDED
validate_master_accessible: ok
validate_hostname: The HAWKULAR_METRICS_HOSTNAME value is deemed acceptable.
validate_deployer_secret: ok
Generating randomized passwords for the Hawkular Metrics and Cassandra keystores and truststores
Creating the Hawkular Metrics keystore from the PEM file
Entry for alias hawkular-metrics successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
[Storing /etc/deploy/_output/hawkular-metrics.keystore]
Creating the Hawkular Cassandra keystore from the PEM file
Entry for alias hawkular-cassandra successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
[Storing /etc/deploy/_output/hawkular-cassandra.keystore]
Creating the Hawkular Metrics Certificate
Certificate stored in file </etc/deploy/_output/hawkular-metrics.cert>
Creating the Hawkular Cassandra Certificate
Certificate stored in file </etc/deploy/_output/hawkular-cassandra.cert>
Importing the Hawkular Metrics Certificate into the Cassandra Truststore
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-cassandra.truststore]
Importing the Hawkular Cassandra Certificate into the Hawkular Metrics Truststore
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-metrics.truststore]
Importing the Hawkular Cassandra Certificate into the Cassandra Truststore
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-cassandra.truststore]
Importing the CA Certificate into the Cassandra Truststore
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-cassandra.truststore]
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-cassandra.truststore]
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-cassandra.truststore]
Importing the CA Certificate into the Hawkular Metrics Truststore
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-metrics.truststore]
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-metrics.truststore]
Certificate was added to keystore
[Storing /etc/deploy/_output/hawkular-metrics.truststore]
Adding password for user hawkular
Generating the JGroups Keystore

Creating the Hawkular Metrics Secrets configuration json file

Creating the Hawkular Metrics Certificate Secrets configuration json file

Creating the Hawkular Metrics User Account Secrets

Creating the Cassandra Secrets configuration file

Creating the Cassandra Certificate Secrets configuration json file
Creating Hawkular Metrics & Cassandra Secrets
secret "hawkular-metrics-secrets" created
secret "hawkular-metrics-certificate" created
secret "hawkular-metrics-account" created
secret "hawkular-cassandra-secrets" created
secret "hawkular-cassandra-certificate" created
Creating Hawkular Metrics & Cassandra Templates
template "hawkular-metrics" created
template "hawkular-cassandra-services" created
template "hawkular-cassandra-node-pv" created
template "hawkular-cassandra-node-dynamic-pv" created
template "hawkular-cassandra-node-emptydir" created
template "hawkular-support" created
Deploying Hawkular Metrics & Cassandra Components
scripts/hawkular.sh: line 200: STARTUP_TIMEOUT: unbound variable
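
The trace above shows the deployer running under `set -eu` before it reaches scripts/hawkular.sh. With `-u`, bash treats the expansion of an unset variable as a fatal error, which matches the "unbound variable" message. The following is a minimal sketch of that behaviour, not the actual contents of hawkular.sh; the helper names and the fallback value 500 are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: reproduces the class of failure seen in the log.
# It is not taken from scripts/hawkular.sh.

# Expand STARTUP_TIMEOUT in a child shell that uses `set -eu`, with the
# variable guaranteed to be absent (env -u removes it from the environment).
read_timeout_strict() {
  env -u STARTUP_TIMEOUT bash -c 'set -eu; echo "timeout=$STARTUP_TIMEOUT"' 2>/dev/null \
    || echo "aborted: unbound variable"
}

# Same expansion, but with a fallback so `set -u` has nothing to reject.
# 500 is an arbitrary illustrative value.
read_timeout_with_default() {
  env -u STARTUP_TIMEOUT bash -c 'set -eu; echo "timeout=${STARTUP_TIMEOUT:-500}"'
}

read_timeout_strict
read_timeout_with_default
```

In the first helper the child shell exits before `echo` runs, so only the fallback message is printed. In the second, the `${VAR:-default}` expansion supplies a value and the script continues.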

Comment 1 Veer Muchandi 2017-01-20 21:04:11 UTC
It seems the latest version of metrics-deployer is not downloaded with atomic-openshift-utils(?).

I downloaded this from GitHub:

wget https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_hosted_templates/files/v1.4/enterprise/metrics-deployer.yaml


I then tried this version of metrics-deployer, and metrics deployed successfully.
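
One way to tell an old template from a newer one before running the deployer is to check whether it mentions the variable the deployer complained about. This is a rough sketch that assumes the newer template declares a parameter named STARTUP_TIMEOUT; the thread implies this but never shows the template contents. The function name and the ./metrics-deployer.yaml download location are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: does a deployer template mention a given parameter name?
# Assumption: a template that works with the 3.4 deployer mentions STARTUP_TIMEOUT.
template_mentions() {
  local template=$1 param=$2
  # -w matches the name as a whole word; -q suppresses output.
  grep -qw -- "$param" "$template"
}

# Installed copy (path from the report) and a freshly downloaded copy
# (assumed to have been saved in the current directory).
installed=/usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml
downloaded=./metrics-deployer.yaml

for template in "$installed" "$downloaded"; do
  if [ ! -f "$template" ]; then
    echo "$template: not found"
  elif template_mentions "$template" STARTUP_TIMEOUT; then
    echo "$template: mentions STARTUP_TIMEOUT"
  else
    echo "$template: does not mention STARTUP_TIMEOUT"
  fi
done
```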

Comment 2 Matt Wringe 2017-01-31 20:13:46 UTC
Is this just a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1416213?

Or did the v1.4 metrics deployer not get brought into the system?

Comment 3 Matt Wringe 2017-01-31 20:14:41 UTC
Also, setting this back to deployments, as it's not directly related to the metrics images but is an installation problem.

Comment 4 Veer Muchandi 2017-01-31 20:50:47 UTC
Yes, this is an installation problem. The version of metrics-deployer.yaml is incorrect.

Comment 5 Michal Fojtik 2017-02-01 09:47:04 UTC
(In reply to Matt Wringe from comment #3)
> Also, setting this back to deployments, as its not directly related to the
> metric images but an installation problem.

I don't think the PM team know how to fix:

scripts/hawkular.sh: line 200: STARTUP_TIMEOUT: unbound variable

Comment 7 Jeff Cantrill 2017-02-01 15:44:33 UTC
@Veer,

Can you provide information on where the template was sourced from? Did it come from installing an rpm? Can you provide additional details and version information?
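
A hedged sketch of how the provenance of a template file could be gathered on an RPM-based host: it reports the file's modification time and asks the RPM database which package, if any, owns it. The function name is hypothetical, and `stat -c` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: report where a template file came from.
# Nothing here is OpenShift-specific; pass any file path.
template_provenance() {
  local template=$1 owner
  [ -f "$template" ] || { echo "missing: $template"; return 1; }
  # Modification time of the file itself (GNU stat syntax).
  echo "mtime: $(stat -c %y "$template")"
  # Ask the RPM database which package owns the file, if rpm is installed.
  if ! command -v rpm >/dev/null 2>&1; then
    echo "package: rpm not available on this host"
  elif owner=$(rpm -qf "$template" 2>/dev/null); then
    echo "package: $owner"
  else
    echo "package: not owned by any package"
  fi
}

# Example call with the path from the report; prints "missing: ..." elsewhere.
template_provenance /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml || true
```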

Comment 8 Matt Wringe 2017-02-01 16:35:03 UTC
"scripts/hawkular.sh: line 200: STARTUP_TIMEOUT: unbound variable" is because you are using the wrong template.

Which is why I was asking if this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1416213 in https://bugzilla.redhat.com/show_bug.cgi?id=1415268#c2

This is not an issue with the metrics images themselves, but with the template that is being used. Either with the docs, or with how ansible is being used here.

Comment 11 Veer Muchandi 2017-02-02 22:52:28 UTC
@Matt, the template that gets downloaded to the box with 3.4 (I think when you install atomic-openshift-utils) is incorrect:
/usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-deployer.yaml

That is exactly what I showed in my description above. It is dated Sep 4, whereas the defect was fixed later.

Comment 13 Peng Li 2017-02-03 08:11:31 UTC
Adding https://bugzilla.redhat.com/show_bug.cgi?id=1418700; we may have a way to update these files in the installer and a way to document this.

Comment 14 Matt Wringe 2017-02-09 20:11:10 UTC
(In reply to Veer Muchandi from comment #11)
> @Matt, The template that gets downloaded the box with 3.4 (I think when you
> install atomic-openshift-utils) is incorrect.
> /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-
> deployer.yaml
> 
> That is exactly what I showed in my description above. It is dated Sep 4,
> whereas the defect was fixed later.

So is this already fixed by updating the version of ansible you are using? Or is the file on your system still bad?

The fix is already in https://github.com/openshift/openshift-ansible/blob/release-1.4/roles/openshift_hosted_templates/files/v1.4/origin/metrics-deployer.yaml

I am not sure what else the metric team needs to do with this, or if it can be closed as being fixed already.

Comment 15 Veer Muchandi 2017-02-09 22:10:07 UTC
(In reply to Matt Wringe from comment #14)
> (In reply to Veer Muchandi from comment #11)
> > @Matt, The template that gets downloaded the box with 3.4 (I think when you
> > install atomic-openshift-utils) is incorrect.
> > /usr/share/openshift/examples/infrastructure-templates/enterprise/metrics-
> > deployer.yaml
> > 
> > That is exactly what I showed in my description above. It is dated Sep 4,
> > whereas the defect was fixed later.
> 
> So is this already fixed then by updating the version of ansible you are
> using? or is the file on your system still bad?

When I opened this defect, metrics-deployer.yaml was incorrect. I was able to get the right version and use it.
If this file has been changed and yum install atomic-openshift-utils now downloads the right version (I guess that is how it is copied to the box), then we should be good.

A good way to test is to run the installer and see that the metrics are deployed.

> 
> The fix is already in
> https://github.com/openshift/openshift-ansible/blob/release-1.4/roles/
> openshift_hosted_templates/files/v1.4/origin/metrics-deployer.yaml
> 
> I am not sure what else the metric team needs to do with this, or if it can
> be closed as being fixed already.