Description of problem:
The service catalog install fails with "Catalog install failed."

Version-Release number of the following components:

rpm -q openshift-ansible
$ git log --oneline -1
a00706056 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #8402 from vrutkovs/force-hardlinks
Tested on ocp-3.10.0-0.46.0

rpm -q ansible
ansible --version

How reproducible:
Always

Steps to Reproduce:
1. ansible-playbook -i /tmp/2.file openshift-ansible/playbooks/openshift-service-catalog/config.yml

Actual results:

TASK [openshift_service_catalog : Report errors] *****************************************************************************
task path: /home/fedora/openshift-ansible/roles/openshift_service_catalog/tasks/start.yml:52
fatal: [ec2-54-202-138-140.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": false,
    "msg": "Catalog install failed."
}
        to retry, use: --limit @/home/fedora/openshift-ansible/playbooks/openshift-service-catalog/config.retry

PLAY RECAP *******************************************************************************************************************
ec2-34-217-68-146.us-west-2.compute.amazonaws.com  : ok=0   changed=0   unreachable=0   failed=0
ec2-34-219-137-223.us-west-2.compute.amazonaws.com : ok=0   changed=0   unreachable=0   failed=0
ec2-54-202-138-140.us-west-2.compute.amazonaws.com : ok=98  changed=27  unreachable=0   failed=1
localhost                                          : ok=13  changed=0   unreachable=0   failed=0

INSTALLER STATUS *************************************************************************************************************
Initialization          : Complete (0:00:15)
Service Catalog Install : In Progress (0:11:04)

Expected results:
The service catalog install completes successfully.

Additional info:
Logs from ansible-playbook with the -vvv flag will be attached.
Does this eventually get to a state where the service catalog is running successfully? We're aware of some issues where the control plane nodes are being restarted; we're trying to address those, and then we'll ask that this be reproduced.
Will do it again tomorrow to answer the question.
Tested with:

# yum list installed | grep openshift
atomic-openshift.x86_64   3.10.0-0.47.0.git.0.2fffa04.el7
$ git log --oneline -1
0ffb616c0 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #8461 from mgugino-upstream-stage/containerized-cleanup

It still failed at TASK [openshift_service_catalog : Verify that the catalog api server is running].
"oc get pod" showed the pod is ready; however, the curl command used in that task failed.

=====================
During the installation:

root@ip-172-31-11-200: ~ # oc get pod --all-namespaces | grep service
kube-service-catalog   apiserver-r96mp            1/1   Running   0   1m
kube-service-catalog   controller-manager-6p78r   1/1   Running   0   1m
root@ip-172-31-11-200: ~ # curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed

After the installation:

# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed
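The `[-]etcd failed` line in that healthz output is what the verification task's curl is tripping on. As a quick triage aid, a small shell helper along these lines (a hypothetical sketch, not part of the playbook; `check_healthz` is a made-up name) can pull the failing check names out of a /healthz response:

```shell
#!/bin/sh
# check_healthz reads apiserver /healthz text on stdin, prints the names of
# failing checks (lines beginning with "[-]"), and returns non-zero if any
# check failed.
check_healthz() {
    # Extract the check name following each "[-]" prefix.
    failed=$(sed -n 's/^\[-\]\([^ :]*\).*/\1/p')
    if [ -n "$failed" ]; then
        echo "failing checks: $failed"
        return 1
    fi
    echo "all checks ok"
    return 0
}
```

Usage against the cluster would look like `curl -sk https://apiserver.kube-service-catalog.svc/healthz | check_healthz`.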
Met the same issue on Azure with openshift v3.10.0-0.50.0 (RPM install, HA).
Hongkai, Weihua,
Can you check the logs of the apiserver pods in the kube-service-catalog namespace?

Jeff, Jay,
Any ideas why the health check is failing?
I've tried reproducing this, but haven't been able to (both by running an install from scratch and by running the catalog playbook afterwards, as shown in comment 1).

From the provided logs it looks like etcd isn't healthy. What does this show:

# source /etc/etcd/etcd.conf
# ETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint health

If etcd is indeed having problems, I'll need access to the problematic cluster, as I don't really know what details to ask for.

The command in comment 1 was executed on a working cluster, correct? I don't think the catalog playbook will set up everything properly by itself.
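For scripting that check, the output of `etcdctl endpoint health` can be gated on mechanically. The helper below is a sketch (`etcd_all_healthy` is a made-up name, and it assumes the v3 output format shown in this thread, one line per endpoint); it returns non-zero when any endpoint line is not reported healthy:

```shell
#!/bin/sh
# etcd_all_healthy reads `etcdctl endpoint health` output on stdin and
# succeeds only if every endpoint line reports " is healthy: ".
etcd_all_healthy() {
    while IFS= read -r line; do
        case "$line" in
            *" is healthy: "*) ;;   # this endpoint is OK, keep scanning
            *) return 1 ;;          # unhealthy endpoint or unexpected line
        esac
    done
    return 0
}
```

It would be combined with the command above as `... endpoint health | etcd_all_healthy && echo "etcd OK"`.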
Jeff,
I will provide the service catalog apiserver pod log and the etcdctl results tomorrow.

I did it in 2 steps just to show it is an installation problem of the service catalog:
1. Install the OCP cluster with openshift_enable_service_catalog=false.
2. Run the catalog playbook with the same inventory file and openshift_enable_service_catalog=true.

So it is also a brand-new cluster. Other functions, e.g., creating projects and pods, look fine without the service catalog, so etcd is probably OK. Anyway, we will know more tomorrow. Thanks.
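The two-step procedure above amounts to toggling one inventory variable between runs. A minimal inventory fragment as an illustration (the `[OSEv3:vars]` group is the usual openshift-ansible convention; the exact surrounding inventory is assumed here):

```ini
# Step 1: deploy the cluster with the catalog disabled.
[OSEv3:vars]
openshift_enable_service_catalog=false

# Step 2: flip the flag to true in this same inventory, then run only
# openshift-ansible/playbooks/openshift-service-catalog/config.yml
# against it.
```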
Tried several combinations.

With 1 step (install the service catalog together with the cluster):
Succeeded on the ocp-310-50 build (openshift_enable_service_catalog=true): https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/39232/consoleFull
Failed on the ocp-310-47 build (openshift_enable_service_catalog=true): https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/39231/consoleFull

With 2 steps (install the cluster without the service catalog, then install it with the separate playbook):
Succeeded on the ocp-310-50 build (openshift_enable_service_catalog=true):
Failed on the ocp-310-47 build (openshift_enable_service_catalog=false):

Weihua, can you test it again with the 50 build (maybe on AWS this time)?

So it looks like the problem is with 3.10.0-0.47.0.git.0.2fffa04.el7 and it is fixed in 3.10.0-0.50.0.git.0.db6dfd6.el7.

When the failure happened, this was the command result:

# oc get pod -n kube-system
NAME                                                            READY   STATUS    RESTARTS   AGE
master-api-ip-172-31-8-100.us-west-2.compute.internal           1/1     Running   1          28m
master-controllers-ip-172-31-8-100.us-west-2.compute.internal   1/1     Running   0          29m
master-etcd-ip-172-31-8-100.us-west-2.compute.internal          1/1     Running   0          28m

root@ip-172-31-8-100: ~ # oc rsh -n kube-system master-etcd-ip-172-31-8-100.us-west-2.compute.internal
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# ETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint health
https://172.31.8.100:2379 is healthy: successfully committed proposal: took = 832.005µs

These are the pod logs: http://file.rdu.redhat.com/~hongkliu/test_result/bz1579421/
They look quite different.
I just re-installed, this time disabling the catalog at the beginning and only installing it afterwards. No problems here.

However, the first apiserver pod log doesn't look like all is well. Are those logs from the successful install?
Both logs are from the failure cases with 3.10.0-0.47.0.git.0.2fffa04.el7:
apiserver.pod.1.step.log.txt: only 1 playbook
apiserver.pod.2.step.log.txt: 2 playbooks

Jeff, which build of OCP are you testing with?
Ah ok, well then it sounds like this bug is ok to be closed. I'm testing with 3.10.0-0.52.0 (I use the master branch of openshift-ansible directly).
Jeff, yes. I retested with 3.10.0-0.50.0.git.0.db6dfd6.el7 and everything looks fine to me. I'm fine with you closing the bz; let's move on.

Weihua, feel free to reopen if you still hit this with the latest version. Thanks.