Bug 1447066

Summary: 3.4.0 deploy of metrics with no persistent storage fails: cassandra restarting constantly
Product: OpenShift Container Platform Reporter: Eric Jones <erjones>
Component: Hawkular Assignee: Matt Wringe <mwringe>
Status: CLOSED DUPLICATE QA Contact: Junqi Zhao <juzhao>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.4.0 CC: aos-bugs, erjones, lizhou, mwringe, pweil
Target Milestone: --- Keywords: Reopened, Unconfirmed
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-07-18 13:30:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eric Jones 2017-05-01 14:30:45 UTC
Description of problem:
Customer is deploying new metrics components using [0] and Cassandra is failing consistently. Logs from the pod that is currently trying to run, as well as from a previous pod (oc logs <POD> -p), will be attached shortly.

[0] oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer     -f metrics-deployer.yaml     -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.management-ocp-nonprod-int.uscis.dhs.gov     -p USE_PERSISTENT_STORAGE=false --loglevel=8


Version-Release number of selected component (if applicable):
3.4.0

Comment 4 Matt Wringe 2017-05-01 16:40:46 UTC
Please have them update the version of metrics being installed to 3.4.1. The 3.4.1 images have the bug fixes required for the 3.4 version of OpenShift.

This can be done by setting the IMAGE_VERSION property or modifying the default value in the deployer's template.
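For reference, a minimal sketch of the first option, assuming the same deployer invocation as in the description with IMAGE_VERSION added as an extra parameter. The hostname is a placeholder rather than the reporter's value, and the `command -v` guard is only there so the snippet does nothing harmful on a machine without `oc`.

```shell
# Assumption: same template and service account as in the description.
# METRICS_HOSTNAME is a placeholder; substitute the real route hostname.
IMAGE_VERSION=3.4.1
METRICS_HOSTNAME=hawkular-metrics.example.com

if command -v oc >/dev/null 2>&1; then
  # Override the template's default image version at deploy time.
  oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer \
    -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME="$METRICS_HOSTNAME" \
    -p USE_PERSISTENT_STORAGE=false \
    -p IMAGE_VERSION="$IMAGE_VERSION"
else
  echo "oc not found; skipping deploy (IMAGE_VERSION=$IMAGE_VERSION)"
fi
```

The second option, editing the default value of IMAGE_VERSION in metrics-deployer.yaml, has the same effect without the extra -p flag.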

Comment 5 Matt Wringe 2017-05-01 16:45:20 UTC
Can you also attach the output of 'oc get pods -o yaml -n openshift-infra'?

And the metrics-deployer.yaml file that they are using? If they are using 3.4, it should look something like https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_hosted_templates/files/v1.4/enterprise/metrics-deployer.yaml

Comment 7 Eric Jones 2017-05-01 17:24:33 UTC
Comparing the attached (private) metrics-deployer.yaml file against the GitHub version provided in comment 5, I see only a few differences at first glance:

First is the IMAGE_VERSION. The attached file specifically points to 3.4.0, but the GitHub version points to v3.4.

Second is the METRIC_RESOLUTION. The attached file specifies 15s, but the GitHub version has 30s set.



Looking at the pods' YAML, we can see that Cassandra has restarted 98 times since the deploy, which I believe occurred on Friday (2017/04/28).

Comment 8 Matt Wringe 2017-05-01 22:01:03 UTC
The output of 'oc get pods -o yaml -n openshift-infra' is correct, which makes this even more confusing.

The pod is being restarted because the post start script is being killed with an exit status of 126:

"Killing container with docker id XXXXXXXXXXXXXXXX: PostStart handler: Error executing in Docker Container: 126"

An exit status of 126 usually indicates that the file was found but could not be executed, for example because of a permission problem.
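To illustrate what exit status 126 means in general (this is not the Cassandra post start script itself, just a throwaway file), a minimal sketch: create a script without the execute bit and try to run it directly.

```shell
# Create a script that exists and is readable but has no execute bit.
script=$(mktemp)
printf '#!/bin/sh\necho hello\n' > "$script"
chmod 0644 "$script"

# Try to run it directly; the shell finds the file but cannot execute it.
status=0
"$script" 2>/dev/null || status=$?
echo "exit status: $status"

# Clean up the temporary file.
rm -f "$script"
```

By contrast, a command that cannot be found at all yields 127, so a 126 from the PostStart handler suggests the script is present in the image but something prevents it from being executed.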

I have tried to reproduce this problem locally and it works fine for me.

@lizhou: can you see if this is reproducible for you as well?

Is there anything else about this cluster that may help us determine what is going on? Are they using the default security context, or has it been changed?

Comment 9 Junqi Zhao 2017-05-02 08:21:02 UTC
@Matt,

The issue was not reproduced with Metrics 3.4.0/3.4.1 in my environment:

# docker images | grep metrics
openshift3/metrics-hawkular-metrics   3.4.1               7d4bd715f9ad        6 days ago          1.261 GB
openshift3/metrics-heapster           3.4.1               0d79f78e5371        9 days ago          318 MB
openshift3/metrics-cassandra          3.4.1               08fdb9958866        9 days ago          539.5 MB
openshift3/metrics-deployer           3.4.1               8ed9217d0c54        9 days ago          864.2 MB



openshift3/metrics-deployer           3.4.0               57baaa1c797f        10 weeks ago        862.9 MB
openshift3/metrics-heapster           3.4.0               04c115b270b4        3 months ago        317.8 MB
openshift3/metrics-cassandra          3.4.0               b5d700281ef7        4 months ago        649.2 MB
openshift3/metrics-hawkular-metrics   3.4.0               ef113cd9dc4a        4 months ago        1.508 GB

Comment 10 Matt Wringe 2017-05-02 15:14:59 UTC
The only thing I can think of is something unusual, such as a security context that prevents the scripts from being executed because of a permission issue.

A workaround for this would be to remove the poststart hook from the RC for Cassandra. That should allow the pod to start up properly. Since they are not using persistent data, they don't need the poststart hook to update any existing data.
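A rough sketch of how the hook could be removed with a JSON patch. The RC name hawkular-cassandra-1 and the container index 0 are assumptions not confirmed in this report; check them with 'oc get rc -n openshift-infra' and adjust before running anything.

```shell
# Assumptions: RC name and container index are illustrative, not taken from this bug.
RC_NAME=hawkular-cassandra-1
NAMESPACE=openshift-infra
# JSON patch that drops the postStart handler from the first container.
PATCH='[{"op": "remove", "path": "/spec/template/spec/containers/0/lifecycle/postStart"}]'

if command -v oc >/dev/null 2>&1; then
  oc patch rc "$RC_NAME" -n "$NAMESPACE" --type=json -p "$PATCH"
else
  echo "oc not found; would patch rc/$RC_NAME in $NAMESPACE with: $PATCH"
fi
```

A replication controller does not recreate pods that are already running when its template changes, so the existing Cassandra pod would need to be deleted for the change to take effect. Running 'oc edit rc' and deleting the postStart block by hand is an equivalent route.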

Comment 11 Matt Wringe 2017-05-02 15:16:06 UTC
@ejones: do you know if they are using the default security context with their installation? If they remove the poststart script, does Cassandra start up properly?

Comment 12 Eric Jones 2017-05-02 15:24:25 UTC
@Matt, it appears that we are no longer seeing this issue after upgrading the cluster to 3.4.1 per your earlier recommendation.

We now seem to be having issues with certs, routes, and such, but I believe we can no longer test this bz in the customer's cluster, so if you are unable to reproduce it, we should likely close this bug.

Comment 13 Matt Wringe 2017-05-02 15:57:50 UTC
OK, as we cannot reproduce this issue and an update to the latest bug fix version fixed it for the user, we are closing this issue as 'INSUFFICIENT_DATA'.

Comment 17 Matt Wringe 2017-07-18 13:30:20 UTC
I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1471251

*** This bug has been marked as a duplicate of bug 1471251 ***