Bug 1651455 - ansible-metrics deployment fails in OCP 3.11 and OCS 3.11.1
Summary: ansible-metrics deployment fails in OCP 3.11 and OCS 3.11.1
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Jan Martiska
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1651483 1651485
Reported: 2018-11-20 06:58 UTC by Ashmitha Ambastha
Modified: 2019-11-21 06:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1651483 1651485
Environment:
Last Closed:
Target Upstream Version:
jmartisk: needinfo+


Attachments
metrics logs with gluster-block storage (22.43 KB, application/x-gzip)
2018-11-22 01:10 UTC, Junqi Zhao

Description Ashmitha Ambastha 2018-11-20 06:58:28 UTC
Description of problem:
On OCP 3.11 with OCS 3.11.1, deploying metrics with the openshift-metrics/config.yml playbook completes successfully, but the Hawkular and Heapster pods do not reach the Running (1/1) state.

Version-Release number of selected component (if applicable): OCP 3.11 and OCS 3.11.1

How reproducible: Always

Steps to Reproduce:
1. On a setup with OCP 3.11 and OCS 3.11.1, deploy metrics using the openshift-metrics/config.yml playbook.
2. The playbook passes, but the hawkular-metrics and heapster pods do not come up.
3. The hawkular-metrics-schema pod is created and reaches the Running 1/1 state, but the pod disappears after some time.
4. I'll be attaching the Ansible logs and the inventory file.

Actual results: The hawkular-metrics and heapster pods are not in the Running 1/1 state, and the hawkular-metrics-schema pod has disappeared.

Expected results: The Hawkular-metrics and heapster pods should be in Running 1/1 state and the hawkular-metrics-schema pod should be in completed state.

Comment 5 Junqi Zhao 2018-11-20 08:24:22 UTC
This issue happens with gluster-block storage; there is no such issue when using AWS dynamic PVs.

metrics-cassandra-v3.11.16-4
metrics-hawkular-metrics-v3.11.16-4
metrics-schema-installer-v3.11.16-4
metrics-heapster-v3.11.16-2

Comment 6 Junqi Zhao 2018-11-20 08:28:20 UTC
# oc -n openshift-infra get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-v5q64   1/1       Running   0          1h
hawkular-metrics-ft8gn       0/1       Running   17         1h
heapster-dgpk2               0/1       Running   12         1h

# oc -n openshift-infra get job
NAME                      DESIRED   SUCCESSFUL   AGE
hawkular-metrics-schema   1         0            1h


The hawkular-metrics-schema pod disappeared after a few minutes, and the hawkular-metrics-schema job is not SUCCESSFUL. The same issue occurs if we re-run playbooks/openshift-metrics/schema.yml.
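
To pick out the pods that are not fully ready from a listing like the one above, a small filter can help. This is only a sketch: `not_ready` is a hypothetical helper name, and it assumes the default column layout of `oc get pod` shown in this comment (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: read `oc get pod` output on stdin and print the pods
# whose READY column is not n/n, together with their restart count.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")            # "0/1" -> ready[1]="0", ready[2]="1"
    if (ready[1] != ready[2])
      print $1, $2, "restarts=" $4
  }'
}
```

It could be used as `oc -n openshift-infra get pod | not_ready`. For the listing above it would print the hawkular-metrics and heapster pods only.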

Comment 7 Jan Martiska 2018-11-20 09:11:40 UTC
This looks to me like a problem with writing to the PVC, so the schema job fails while the schema is being created.
Could you attach a log of the failing hawkular-metrics-schema pod? Could you also verify that the metrics-cassandra-1 PVC is writable, e.g. that any other pod is able to write to it successfully?
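
One possible way to check writability from inside a pod that mounts the PVC is a small write probe. This is a rough sketch, not a supported tool: `probe_write` is a hypothetical helper, the 1 MiB size is arbitrary, and it assumes the container image ships a `dd` that supports `conv=fsync`.

```shell
# Hypothetical helper: write 1 MiB to the given directory, force it to disk,
# and report whether the write worked and roughly how long it took.
probe_write() {
  dir="$1"
  probe="$dir/.write-probe.$$"       # temporary file, removed afterwards
  start=$(date +%s)
  if dd if=/dev/zero of="$probe" bs=1024 count=1024 conv=fsync 2>/dev/null; then
    end=$(date +%s)
    rm -f "$probe"
    echo "writable elapsed_seconds=$((end - start))"
  else
    rm -f "$probe"
    echo "not-writable"
    return 1
  fi
}
```

After opening a shell in the pod (for example with `oc -n openshift-infra rsh <pod>`), the function can be pasted in and run against the directory where the PVC is mounted. A write that succeeds but takes many seconds would point at slow storage rather than at a read-only volume.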

Comment 8 Ashmitha Ambastha 2018-11-21 05:23:07 UTC
Jan, 

Yes, it's writable.

Comment 10 Jan Martiska 2018-11-21 09:55:19 UTC
OK, then I need to take a look at the log of the failing hawkular-metrics-schema pod. I suppose the root cause should be visible there.

Comment 11 Junqi Zhao 2018-11-22 01:10:12 UTC
Created attachment 1507866
metrics logs with gluster-block storage

Comment 12 Junqi Zhao 2018-11-22 01:10:55 UTC
(In reply to Jan Martiska from comment #10)
> OK, then I need to take a look at the log of the failing
> hawkular-metrics-schema pod. I suppose the root cause should be visible there.

See the attached file "metrics logs with gluster-block storage".

Comment 13 Jan Martiska 2018-11-22 14:45:47 UTC
Ok, I see this in the schema installer log:

WARN  2018-11-20 07:19:16,217 [main] org.hawkular.metrics.schema.Installer:run:102 - Installation failed
com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.31.56.116:9042] Timed out waiting for server response

The installation seems to proceed correctly through some parts, but somewhere in the middle we get a timeout. The failed query seems to be this one:
DROP TABLE tenants_by_time
However, there may have been multiple attempts, each failing at a different stage of the installation.
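
To see how many attempts failed and when, the schema installer log can be summarized with a short script. This is a sketch only: `summarize_installer_log` is a hypothetical helper, and it assumes that every failed attempt logs a line in the same format as the WARN line quoted above.

```shell
# Hypothetical helper: count failed installation attempts and driver timeouts
# in a schema installer log, then list the timestamp of each failure.
summarize_installer_log() {
  log="$1"
  failures=$(grep -c 'Installation failed' "$log")
  timeouts=$(grep -c 'OperationTimedOutException' "$log")
  echo "failed_attempts=$failures timeouts=$timeouts"
  # Log lines start with: LEVEL DATE TIME [thread] ...
  grep 'Installation failed' "$log" | awk '{print $2, $3}'
}
```

More than one failure timestamp would support the idea that the installer was retried and failed at different stages.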

The Cassandra log also suggests that all install operations are taking an unexpectedly long time, e.g. CREATE TABLE LEASES taking ~70 seconds.
This suggests that the storage is just slow and/or unstable, which triggers timeouts for operations that fail to complete fast enough. I doubt that this is an issue in the Hawkular Metrics code; differences between storage types should be transparent to the process.

Perhaps try increasing the CASSANDRA_CONNECTION_MAX_DELAY, CASSANDRA_CONNECTION_MAX_RETRIES, VERSION_UPDATE_DELAY and VERSION_UPDATE_MAX_RETRIES env properties for the schema installer (see https://access.redhat.com/containers/?tab=tech-details#/registry.access.redhat.com/openshift3/ose-metrics-schema-installer for details) to some large values to see if the problem goes away?
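
As an illustration of what setting these properties might look like, the sketch below renders NAME=VALUE pairs as entries for a container `env:` list in a Kubernetes manifest. `render_env` is a hypothetical helper, and the numbers are placeholders only: this report does not say which values or units are appropriate, so check the image documentation linked above before using any.

```shell
# Hypothetical helper: turn NAME=VALUE arguments into entries for the
# `env:` list of a container spec.
render_env() {
  for pair in "$@"; do
    name=${pair%%=*}                 # text before the first "="
    value=${pair#*=}                 # text after the first "="
    printf '%s name: %s\n  value: "%s"\n' '-' "$name" "$value"
  done
}

# Placeholder values for illustration, not recommendations.
render_env \
  CASSANDRA_CONNECTION_MAX_DELAY=300 \
  CASSANDRA_CONNECTION_MAX_RETRIES=100 \
  VERSION_UPDATE_DELAY=60 \
  VERSION_UPDATE_MAX_RETRIES=100
```

The rendered entries would go into the container spec of the hawkular-metrics-schema job. Since a Job's pod template cannot be changed after creation, the job would have to be deleted and recreated with the extra entries. How the playbook exposes these settings is not covered in this report.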
