| Summary: | Failure to deploy router as part of OSE | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pavel Zagalsky <pzagalsk> |
| Component: | Installer | Assignee: | Scott Dodson <sdodson> |
| Status: | CLOSED NOTABUG | QA Contact: | Johnny Liu <jialiu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.2.0 | CC: | aos-bugs, bbennett, bleanhar, dron, fche, jhenner, jokerman, jsanda, mmccomas, ncredi, pzagalsk |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | container | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-31 12:39:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: |
|
||||||||
Can you give a bit more information about your setup? How were metrics installed? What secrets did you use, what deployment options were used, etc.? The message in Hawkular Metrics is about not being able to perform a write to Cassandra. Can you please attach the Cassandra logs to this BZ?

I used this MOJO doc to install metrics: https://mojo.redhat.com/docs/DOC-1060820
How can I get the Cassandra logs? Do you have a command?

It's just 'oc logs ${POD_NAME}'
Or if you want a copy and paste command:
oc logs $(oc get pods | grep -i hawkular-cassandra | awk '{print $1}')
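
A slightly more defensive variant of the one-liner above, in case more than one pod matches or the pod has restarted. This is only a sketch: `pods_matching` and `collect_logs` are hypothetical helper names, not part of `oc`, and it assumes the plain `oc get pods` layout where the pod name is the first column (with `--all-namespaces` the name moves to the second column).

```shell
# Print the NAME column of every pod whose name contains the given pattern
# (case-insensitive). Reads `oc get pods` output on stdin and skips the
# header row. Hypothetical helper, not an oc subcommand.
pods_matching() {
  awk -v pat="$1" '$1 != "NAME" && index(tolower($1), tolower(pat)) > 0 { print $1 }'
}

# Save the current and previous container logs for every matching pod.
# Needs a logged-in `oc` client; only defined here, not called.
collect_logs() {
  oc get pods | pods_matching "$1" | while read -r pod; do
    oc logs "$pod" > "$pod.log"
    # -p fetches the previous container's logs; it fails if the pod never restarted.
    oc logs -p "$pod" > "$pod.previous.log" 2>/dev/null || rm -f "$pod.previous.log"
  done
}
```

Usage would be `collect_logs hawkular-cassandra`, which leaves one `.log` file per pod (plus a `.previous.log` where a previous container exists) ready to attach to the BZ.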
Created attachment 1191886 [details]
CassandraLog
Log attached ^

Did the crash happen multiple times?

Hard to tell. Didn't see the dates in the log file, just the hours:minutes

'oc get pods' should indicate the number of times the pod has been restarted

hawkular-cassandra-1-nj2w6   1/1   Running   5   75d
hawkular-metrics-32epb       1/1   Running   6   50d
heapster-i9wfy               1/1   Running   5   37d

From the Hawkular Metrics logs attached, these are not the logs of a component which has crashed at startup. Those logs cover multiple hours. Can you please attach the logs for the container which has crashed? "oc logs -p $POD_NAME" will return the logs for the pod which has just crashed (-p returns the logs of the previous container)

Also, from https://bugzilla.redhat.com/show_bug.cgi?id=1368095#c9 your Hawkular Metrics (and all metric components) are in the fully running and ready state, which is not something which could occur if the pod was constantly crashing at startup. Can you please explain what the exact issue is that you are seeing? The original description of the issue does not match the logs or information you have since presented.

Lowering the priority since https://bugzilla.redhat.com/show_bug.cgi?id=1368095#c9 shows that the pods are up and running and are not constantly crashing.

We see it in the automation environments as well, on both OSE 3.2 and 3.3. The problem is not the metrics but the router-1-deploy pod, which stays in "ContainerCreating" forever. Once you try to deploy metrics in this state, router-1-deploy switches to "Error", and its logs say that it didn't succeed within 600 secs. The metrics will fail to deploy in this case (obviously), but the root cause is the router-1 deployment.

I suspect that this is due to the certificate generation that the service is requesting. If OpenShift is not configured to generate certs then you get this error when deploying a router.
This comment has information on what needs to be present in the config: https://bugzilla.redhat.com/show_bug.cgi?id=1349144#c19

The ansible installer should be enabling the feature:
https://github.com/openshift/openshift-ansible/issues/2345
https://github.com/openshift/openshift-ansible/pull/2358

So, please make sure that the ansible installer you used had the fix. Or, if you did not use the ansible installer, can you post how you did the install?

I checked the /etc/origin/master/ path and there are no service-signer.crt and service-signer.key files there. How do I obtain them?

For the automation it seems that this fix is working, but now I am failing on the metrics deployment:

[cloud-user@ose3-master ~]$ oc get pods --all-namespaces
NAMESPACE         NAME                      READY   STATUS             RESTARTS   AGE
default           docker-registry-1-yqxxa   1/1     Running            0          12m
default           router-1-bpnys            1/1     Running            0          12m
openshift-infra   metrics-deployer-7mpmf    0/1     ImagePullBackOff   0          12m

This is on 3.3 (didn't check 3.2 yet) and working from puddles.

For 3.2 it doesn't look like the router issue was fixed. Was this patch merged into the 3.2 branch as well?

As far as I know automatic certificate creation was never proposed for backporting. We'd probably need someone from Ben's team to comment on the feasibility. I know they did a lot of work to make the upgrade go smoothly so I think the solution is for the customer to upgrade.

This is blocking metrics for us. Added appropriate keywords and flags.

Adding more info: my 3.2 is stuck as I described in comment #13, working with registry brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888 and using image openshift3/ose-deployer:v3.2.1.15. For the metrics I'll open a different bug.

Hi Dafna/Nelly, There's a lot of confusion in this bugzilla. Can you please state the versions of the ansible playbooks that you are using? From what I can tell the issue is resolved on OCP 3.3. It's not clear to me which versions of metrics, OCP and openshift-ansible are in use.
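
For anyone checking whether a master has the service signer files mentioned earlier in this thread, a minimal shell sketch follows. The file names and the /etc/origin/master default come from the comments above; `check_service_signer` is a hypothetical helper name, and the sketch only tests that the files exist and are non-empty. It does not validate the certificate or inspect the master config.

```shell
# Report which service-signer files are missing (or empty) in a master
# config directory. Prints one line per missing file and returns non-zero
# if any is missing. Hypothetical helper; the default path is the one
# quoted in this bug.
check_service_signer() {
  css_dir="${1:-/etc/origin/master}"
  css_missing=0
  for css_file in service-signer.crt service-signer.key; do
    if [ ! -s "$css_dir/$css_file" ]; then
      echo "missing: $css_dir/$css_file"
      css_missing=1
    fi
  done
  return "$css_missing"
}
```

Running `check_service_signer` on the master with no argument checks the default directory; a non-zero exit status suggests the installer run did not set up service certificate signing.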
Brenton, I used the playbooks that can be found in this Mojo: https://mojo.redhat.com/docs/DOC-1060820

I opened a new bug for the metrics issue, https://bugzilla.redhat.com/show_bug.cgi?id=1371578, so let's leave this aside at this point. The 3.3 router is deployed; the 3.2 one is not. I am seeing the same 'ContainerCreating' state, working with registry brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888 and using image openshift3/ose-deployer:v3.2.1.15

(In reply to Nelly Credi from comment #24)
> openshift3/ose-deployer:v3.2.1.15
This image was just built today, so that's probably the reason it's failing. It should be there now.

Thanks Scott, it's working in the automation. From my side this issue is resolved.

Sorry, missed the question since the issue was resolved. We would not be able to backport easily. The way we implemented the default certs relies upon features only present in 3.3. |
Created attachment 1191859 [details]
HawkularCrash

Description of problem:
Hawkular metrics crashes upon start

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Hawkular crashes immediately upon start

Please check in my Master: 10.35.161.112

Actual results:
Hawkular should be running and gathering metrics

Expected results:

Additional info:
Crash log attached