Bug 1447066
Summary: | 3.4.0 deploy of metrics with no persistent storage fails: cassandra restarting constantly | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Eric Jones <erjones>
Component: | Hawkular | Assignee: | Matt Wringe <mwringe>
Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 3.4.0 | CC: | aos-bugs, erjones, lizhou, mwringe, pweil
Target Milestone: | --- | Keywords: | Reopened, Unconfirmed
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-07-18 13:30:20 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Eric Jones, 2017-05-01 14:30:45 UTC)
Please have them update the version of metrics to be installed to 3.4.1. The 3.4.1 images have the bug fixes required for the 3.4 version of OpenShift. This can be done by setting the IMAGE_VERSION parameter or by modifying its default value in the deployer's template (a sketch of passing this parameter appears after this thread).

Can you also attach the output of 'oc get pods -o yaml -n openshift-infra' and the metrics-deployer.yaml file that they are using? If they are using 3.4, it should look something like https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_hosted_templates/files/v1.4/enterprise/metrics-deployer.yaml

Comparing the metrics-deployer.yaml file attached (privately) with the GitHub version provided in comment 5, I only see a few differences off the bat. First is the IMAGE_VERSION: the attached file specifically points to 3.4.0, while the GitHub version points to v3.4. Second is the METRIC_RESOLUTION: the attached file specifies 15s, while the GitHub version has 30s set. Looking into the pods YAML, we can see that Cassandra has restarted 98 times since the deploy (which I believe occurred on Friday, 2017/04/28).

The output of 'oc get pods -o yaml -n openshift-infra' is correct, which makes this even more confusing. The reason the pod is getting restarted is that the post-start script is being killed with an exit status of 126:

"Killing container with docker id XXXXXXXXXXXXXXXX: PostStart handler: Error executing in Docker Container: 126"

A 126 error code usually indicates that the file exists but the binary cannot be executed (a shell demonstration appears after this thread). I have tried to reproduce this problem locally and it works fine for me.

@lizhou: can you see if this is reproducible for you as well?

Is there anything else about this cluster which may help us determine what is going on? Are they using the default security contexts, or have those been changed?

@Matt, the issue was not reproduced with Metrics 3.4.0/3.4.1 in my environment:

```
# docker images | grep metrics
openshift3/metrics-hawkular-metrics   3.4.1   7d4bd715f9ad   6 days ago     1.261 GB
openshift3/metrics-heapster           3.4.1   0d79f78e5371   9 days ago     318 MB
openshift3/metrics-cassandra          3.4.1   08fdb9958866   9 days ago     539.5 MB
openshift3/metrics-deployer           3.4.1   8ed9217d0c54   9 days ago     864.2 MB
openshift3/metrics-deployer           3.4.0   57baaa1c797f   10 weeks ago   862.9 MB
openshift3/metrics-heapster           3.4.0   04c115b270b4   3 months ago   317.8 MB
openshift3/metrics-cassandra          3.4.0   b5d700281ef7   4 months ago   649.2 MB
openshift3/metrics-hawkular-metrics   3.4.0   ef113cd9dc4a   4 months ago   1.508 GB
```

The only thing I can think of would be something strange like a non-default security context that prevents the scripts from being executed due to a permission issue. A workaround would be to remove the postStart hook from the RC for Cassandra; that should allow the pod to start up properly (a sketch of such a patch appears after this thread). Since they are not using persistent data, they don't need the postStart hook to update any existing data.

@ejones: do you know if they are using the default security context with their installation? If they remove the postStart script, does Cassandra start up properly?

@Matt, it appears that we are no longer seeing this issue after upgrading the cluster to 3.4.1 per your earlier recommendation. It seems we are now having issues with certs and routes and such, but I believe we do not have the ability to test this BZ any further in the customer's cluster, so if you are unable to reproduce, we should likely close this bug.
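A minimal sketch of the version-pinning suggestion above, assuming the standard 3.4 deployer-template workflow: the image tag is passed as a template parameter when instantiating the deployer. The hostname below is a placeholder, not a value from this bug, and MODE=redeploy applies only when metrics components already exist:

```sh
# Sketch: instantiate the metrics deployer with the image tag pinned to 3.4.1.
# hawkular-metrics.example.com is a placeholder hostname.
oc new-app -f metrics-deployer.yaml -n openshift-infra \
  -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
  -p IMAGE_VERSION=3.4.1 \
  -p MODE=redeploy
```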
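The 126 exit status mentioned above is the shell's generic "found but cannot execute" code, typically a permission problem, which is consistent with the security-context suspicion. A small generic demonstration, unrelated to the actual PostStart script:

```sh
# Create a script, remove its execute bit, and try to run it directly.
printf '#!/bin/sh\necho hello\n' > /tmp/demo.sh
chmod -x /tmp/demo.sh
/tmp/demo.sh            # shell reports "Permission denied"
echo $?                 # prints 126: the file exists but cannot be executed
```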
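A sketch of the postStart-removal workaround described above, assuming the default object names the 3.4 metrics deployer creates (hawkular-cassandra-1 for both the RC and its container); verify the actual names and labels with 'oc get rc,pods --show-labels -n openshift-infra' before applying:

```sh
# Strategic merge patch: setting "lifecycle" to null drops the postStart hook
# from the Cassandra container spec in the replication controller.
oc patch rc hawkular-cassandra-1 -n openshift-infra \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"hawkular-cassandra-1","lifecycle":null}]}}}}'

# Delete the running pod so the RC recreates it without the hook.
# The label selector here is an assumption; confirm it with --show-labels.
oc delete pod -n openshift-infra -l name=hawkular-cassandra-1
```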
Ok, as we do not have the ability to reproduce this issue, and an update to the latest bug fix version fixed the issue for the user, we are closing this as 'INSUFFICIENT_DATA'.

I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1471251

*** This bug has been marked as a duplicate of bug 1471251 ***