Description of problem:
On an OCP 3.11 and OCS 3.11.1 deployment, metrics were deployed using the openshift-metrics/config.yml playbook. The playbook passes successfully, but the Hawkular and Heapster pods never reach the Running (1/1) state.
Version-Release number of selected component (if applicable): OCP 3.11 and OCS 3.11.1
How reproducible: Always
Steps to Reproduce:
1. On a setup of OCP 3.11 and OCS 3.11.1, deploy metrics using the openshift-metrics/config.yml playbook.
2. The playbook passes, but the Hawkular-metrics and Heapster pods do not come up.
3. The hawkular-metrics-schema pod is created and reaches the Running 1/1 state, but the pod disappears after some time.
4. I'll attach the ansible logs and the inventory file.
Actual results: The Hawkular-metrics and Heapster pods are not in the Running 1/1 state, and the hawkular-metrics-schema pod has disappeared.
Expected results: The Hawkular-metrics and Heapster pods should be in the Running 1/1 state, and the hawkular-metrics-schema pod should be in the Completed state.
This issue happens with gluster-block storage; there is no such issue when using AWS dynamic PVs.
# oc -n openshift-infra get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-v5q64   1/1       Running   0          1h
hawkular-metrics-ft8gn       0/1       Running   17         1h
heapster-dgpk2               0/1       Running   12         1h
# oc -n openshift-infra get job
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            1h
The hawkular-metrics-schema pod disappeared after a few minutes, and the hawkular-metrics-schema job is not SUCCESSFUL. The same issue occurs if we re-run playbooks/openshift-metrics/schema.yml.
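For reference, re-running the schema playbook looks roughly like this (the inventory path is illustrative; run from your openshift-ansible checkout and substitute the inventory file actually used for the deployment):

```shell
# Re-run only the metrics schema installation (sketch; adjust paths).
ansible-playbook -i /path/to/inventory playbooks/openshift-metrics/schema.yml
```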
This looks to me like there is probably a problem with writing to the PVC, so the schema job fails while the schema is being created.
Could you attach a log of the failing hawkular-metrics-schema pod? Could you also verify that the metrics-cassandra-1 PVC is writable, e.g. that any other pod is able to successfully write into it?
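A sketch of how both checks could be done, using the pod and job names from the output above (the /cassandra_data mount path is an assumption based on the Cassandra image's usual data volume, not confirmed in this report):

```shell
# Capture the schema job's pod log before the pod is cleaned up.
oc -n openshift-infra logs job/hawkular-metrics-schema > hawkular-metrics-schema.log

# Crude writability check: the Cassandra pod mounts the metrics-cassandra-1 PVC,
# so try creating and removing a file on it from inside that pod. Verify the
# actual mount path first with: oc describe pod hawkular-cassandra-1-v5q64
oc -n openshift-infra exec hawkular-cassandra-1-v5q64 -- touch /cassandra_data/.write-test
oc -n openshift-infra exec hawkular-cassandra-1-v5q64 -- rm /cassandra_data/.write-test
```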
Yes, it's writable.
OK, then I need to take a look at the log of the failing hawkular-metrics-schema pod; I suppose the root cause should be visible there.
Created attachment 1507866 [details]
metrics logs with gluster-block storage
(In reply to Jan Martiska from comment #10)
> Ok then I need to take a look at the log of the failing
> hawkular-metrics-schema pod, I suppose the root cause should be visible there
see the attached file "metrics logs with gluster-block storage"
Ok, I see this in the schema installer log:
WARN 2018-11-20 07:19:16,217 [main] org.hawkular.metrics.schema.Installer:run:102 - Installation failed
com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.31.56.116:9042] Timed out waiting for server response
The installation seems to proceed correctly through some parts, but somewhere in the middle we get a timeout. The failing query appears to be this one:
DROP TABLE tenants_by_time
.. but it could be that there were multiple attempts, each failing at a different stage of the installation.
The Cassandra log also suggests that all install operations are taking an unexpectedly long time, e.g. CREATE TABLE leases taking ~70 seconds.
This suggests that the storage is simply slow and/or unstable, triggering timeouts for operations that fail to complete fast enough. I doubt this is an issue in the Hawkular Metrics code; the difference in storage types should be transparent to the process.
Perhaps try increasing the CASSANDRA_CONNECTION_MAX_DELAY, CASSANDRA_CONNECTION_MAX_RETRIES, VERSION_UPDATE_DELAY, and VERSION_UPDATE_MAX_RETRIES environment variables for the schema installer (see https://access.redhat.com/containers/?tab=tech-details#/registry.access.redhat.com/openshift3/ose-metrics-schema-installer for details), setting them to large values, to see if the problem goes away?
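As a sketch: these variables live in the schema installer job's pod template, and a job's template cannot be edited in place, so one approach is to export the job with `oc -n openshift-infra get job hawkular-metrics-schema -o yaml`, raise the values, then delete and recreate the job. The values below are placeholders, not tuned recommendations; consult the linked tech-details page for each variable's exact units and semantics:

```yaml
# Illustrative env section for the schema installer container
# (spec.template.spec.containers[0].env in the exported job).
env:
  - name: CASSANDRA_CONNECTION_MAX_DELAY
    value: "60000"
  - name: CASSANDRA_CONNECTION_MAX_RETRIES
    value: "120"
  - name: VERSION_UPDATE_DELAY
    value: "60000"
  - name: VERSION_UPDATE_MAX_RETRIES
    value: "120"
```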