Description of problem:
This bug comes from Bug 1607984. The default timeoutSeconds for the hawkular-cassandra readiness check is 1 second, but if the readiness check takes more than 1 second to get a response, the metrics pods cannot start up.

# oc get pod -n openshift-infra
NAME                            READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-njcbq      0/1       Running   0          1h
hawkular-metrics-642hp          0/1       Running   8          1h
hawkular-metrics-schema-4k4hj   1/1       Running   0          1h
heapster-lmc8m                  0/1       Running   9          1h

# oc rsh hawkular-cassandra-1-njcbq
sh-4.2$ time nodetool status
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.131.0.64  103.11 KB  256     100.0%            df669d60-a338-4057-a4c2-00cf92b6291b  rack1

real    0m1.499s
user    0m2.417s
sys     0m0.187s

sh-4.2$ time nodetool help
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
<--snip-->
See 'nodetool help <command>' for more information on a specific command.

real    0m1.626s
user    0m1.807s
sys     0m0.133s

Both commands take longer than the 1-second default timeout. After adding timeoutSeconds: 10 to the readiness probe in roles/openshift_metrics/templates/hawkular_cassandra_rc.j2, metrics works well.

readinessProbe:
  exec:
    command:
    - "/opt/apache-cassandra/bin/cassandra-docker-ready.sh"
  timeoutSeconds: 10

Version-Release number of selected component (if applicable):
# rpm -qa | grep openshift-ansible
openshift-ansible-roles-3.10.14-1.git.273.a64b86b.el7.noarch
openshift-ansible-playbooks-3.10.14-1.git.273.a64b86b.el7.noarch
openshift-ansible-3.10.14-1.git.273.a64b86b.el7.noarch
openshift-ansible-docs-3.10.14-1.git.273.a64b86b.el7.noarch

openshift3-metrics-cassandra-v3.10.14-7
metrics-hawkular-metrics-v3.10.14-7
metrics-schema-installer-v3.10.14-7
metrics-heapster-v3.10.14-8

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics

Actual results:
The metrics pods cannot start up.

Expected results:
The metrics pods start up.

Additional info:
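To check whether a given command would finish within a candidate probe timeout, something like the following rough sketch could be run inside the Cassandra pod. The function name probe_would_pass is made up for illustration, and the sketch assumes the coreutils `timeout` utility is available in the image, which has not been verified here. It only approximates the readiness check: the command must exit 0 before the limit expires to count as a pass.

```shell
# Hypothetical helper: report whether a command exits successfully
# within a given number of seconds.
# Usage: probe_would_pass TIMEOUT_SECONDS COMMAND [ARGS...]
probe_would_pass() {
  limit="$1"
  shift
  # `timeout` kills the command and returns non-zero if the limit is exceeded;
  # a non-zero exit from the command itself also counts as a failure.
  if timeout "$limit" "$@" >/dev/null 2>&1; then
    echo "pass"
  else
    echo "fail"
  fi
}
```

For example, `probe_would_pass 1 /opt/apache-cassandra/bin/cassandra-docker-ready.sh` and the same call with a limit of 10 could be compared. Given the nodetool timings above (about 1.5 seconds), a 1-second limit would be expected to be too tight on this host, but that depends on what the readiness script actually runs.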
I've sent a PR, https://github.com/openshift/openshift-ansible/pull/9417, which is already merged. I'll move this to MODIFIED.
The issue is fixed; timeoutSeconds for the hawkular-cassandra readiness check is now 10s.

# rpm -qa | grep ansible
ansible-2.6.3-1.el7ae.noarch
openshift-ansible-playbooks-3.11.0-0.20.0.git.0.ec6d8caNone.noarch
openshift-ansible-roles-3.11.0-0.20.0.git.0.ec6d8caNone.noarch
openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch
openshift-ansible-docs-3.11.0-0.20.0.git.0.ec6d8caNone.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652