Description of problem:
On an OCP 3.11 and OCS 3.11.1 deployment, metrics were deployed using the openshift-metrics/config.yml playbook. The playbook passes successfully, but the Hawkular and Heapster pods never reach the Running (1/1) state.
Version-Release number of selected component (if applicable): OCP 3.11 and OCS 3.11.1
How reproducible: Always
Steps to Reproduce:
1. On a setup of OCP 3.11 and OCS 3.11.1, deploy metrics using the openshift-metrics/config.yml playbook.
2. The playbook passes, but the Hawkular-metrics and Heapster pods do not come up.
3. The hawkular-metrics-schema pod is created and reaches the Running 1/1 state, but the pod disappears after some time.
4. I'll attach the ansible logs and the inventory file.
Actual results: The Hawkular-metrics and Heapster pods are not in the Running 1/1 state, and the hawkular-metrics-schema pod has disappeared.
Expected results: The Hawkular-metrics and Heapster pods should be in the Running 1/1 state, and the hawkular-metrics-schema pod should be in the Completed state.
This issue happens with gluster-block storage; there is no such issue when using AWS dynamic PVs.
# oc -n openshift-infra get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-v5q64   1/1       Running   0          1h
hawkular-metrics-ft8gn       0/1       Running   17         1h
heapster-dgpk2               0/1       Running   12         1h
# oc -n openshift-infra get job
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            1h
The hawkular-metrics-schema pod disappeared after a few minutes, and the hawkular-metrics-schema job is not SUCCESSFUL. The same issue occurs if we re-run playbooks/openshift-metrics/schema.yml.
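For reference, re-running the schema playbook looks roughly like this (the inventory path is illustrative; run from your openshift-ansible checkout and substitute the inventory file actually used for the deployment):

```shell
# Re-run only the metrics schema installation (sketch; adjust paths).
ansible-playbook -i /path/to/inventory playbooks/openshift-metrics/schema.yml
```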
This looks to me like there is probably a problem with writing to the PVC, so the schema job fails while the schema is being created.
Could you attach a log of the failing hawkular-metrics-schema pod? Could you also verify that the metrics-cassandra-1 PVC is writable, e.g. that any other pod is able to successfully write into it?
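A sketch of how both checks could be done, using the pod and job names from the output above (the /cassandra_data mount path is an assumption based on the Cassandra image's usual data volume, not confirmed in this report):

```shell
# Capture the schema job's pod log before the pod is cleaned up.
oc -n openshift-infra logs job/hawkular-metrics-schema > hawkular-metrics-schema.log

# Crude writability check: the Cassandra pod mounts the metrics-cassandra-1 PVC,
# so try creating and removing a file on it from inside that pod. Verify the
# actual mount path first with: oc describe pod hawkular-cassandra-1-v5q64
oc -n openshift-infra exec hawkular-cassandra-1-v5q64 -- touch /cassandra_data/.write-test
oc -n openshift-infra exec hawkular-cassandra-1-v5q64 -- rm /cassandra_data/.write-test
```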
Yes, it's writable.
OK, then I need to take a look at the log of the failing hawkular-metrics-schema pod; I suppose the root cause should be visible there.
Created attachment 1507866 [details]
metrics logs with gluster-block storage
(In reply to Jan Martiska from comment #10)
> Ok then I need to take a look at the log of the failing
> hawkular-metrics-schema pod, I suppose the root cause should be visible there
see the attached file "metrics logs with gluster-block storage"
Ok, I see this in the schema installer log:
WARN 2018-11-20 07:19:16,217 [main] org.hawkular.metrics.schema.Installer:run:102 - Installation failed
com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.31.56.116:9042] Timed out waiting for server response
The installation seems to proceed correctly through some parts, but somewhere in the middle we get a timeout. The failing query appears to be this one:
DROP TABLE tenants_by_time
.. but it could be that there were multiple attempts, each failing at a different stage of the installation.
The Cassandra log also suggests that all install operations are taking an unexpectedly long time, e.g. CREATE TABLE leases taking ~70 seconds.
This suggests that the storage is simply slow and/or unstable, triggering timeouts for operations that fail to complete fast enough. I doubt this is an issue in the Hawkular Metrics code; the difference in storage types should be transparent to the process.
Perhaps try increasing the CASSANDRA_CONNECTION_MAX_DELAY, CASSANDRA_CONNECTION_MAX_RETRIES, VERSION_UPDATE_DELAY, and VERSION_UPDATE_MAX_RETRIES environment variables for the schema installer (see https://access.redhat.com/containers/?tab=tech-details#/registry.access.redhat.com/openshift3/ose-metrics-schema-installer for details), setting them to large values, to see if the problem goes away?
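As a sketch: these variables live in the schema installer job's pod template, and a job's template cannot be edited in place, so one approach is to export the job with `oc -n openshift-infra get job hawkular-metrics-schema -o yaml`, raise the values, then delete and recreate the job. The values below are placeholders, not tuned recommendations; consult the linked tech-details page for each variable's exact units and semantics:

```yaml
# Illustrative env section for the schema installer container
# (spec.template.spec.containers[0].env in the exported job).
env:
  - name: CASSANDRA_CONNECTION_MAX_DELAY
    value: "60000"
  - name: CASSANDRA_CONNECTION_MAX_RETRIES
    value: "120"
  - name: VERSION_UPDATE_DELAY
    value: "60000"
  - name: VERSION_UPDATE_MAX_RETRIES
    value: "120"
```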