Description of problem:
The service catalog install fails with "Catalog install failed."

Version-Release number of the following components:

rpm -q openshift-ansible
$ git log --oneline -1
a00706056 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #8402 from vrutkovs/force-hardlinks
Tested on ocp-3.10.0-0.46.0

rpm -q ansible
ansible --version

How reproducible:
Always

Steps to Reproduce:
1. ansible-playbook -i /tmp/2.file openshift-ansible/playbooks/openshift-service-catalog/config.yml

Actual results:

TASK [openshift_service_catalog : Report errors] *****************************************************************************
task path: /home/fedora/openshift-ansible/roles/openshift_service_catalog/tasks/start.yml:52
fatal: [ec2-54-202-138-140.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": false,
    "msg": "Catalog install failed."
}
        to retry, use: --limit @/home/fedora/openshift-ansible/playbooks/openshift-service-catalog/config.retry

PLAY RECAP *******************************************************************************************************************
ec2-34-217-68-146.us-west-2.compute.amazonaws.com  : ok=0   changed=0   unreachable=0   failed=0
ec2-34-219-137-223.us-west-2.compute.amazonaws.com : ok=0   changed=0   unreachable=0   failed=0
ec2-54-202-138-140.us-west-2.compute.amazonaws.com : ok=98  changed=27  unreachable=0   failed=1
localhost                                          : ok=13  changed=0   unreachable=0   failed=0

INSTALLER STATUS *************************************************************************************************************
Initialization          : Complete (0:00:15)
Service Catalog Install : In Progress (0:11:04)

Expected results:
The service catalog install completes successfully.

Additional info:
Logs from ansible-playbook with the -vvv flag will be attached.
Does this eventually get to a state where the service catalog is running successfully? We're aware of some issues where the control plane nodes are being restarted; we're trying to address those, and then we'll ask that this be reproduced.
Will do it again tomorrow to answer the question.
Tested with:

# yum list installed | grep openshift
atomic-openshift.x86_64   3.10.0-0.47.0.git.0.2fffa04.el7
$ git log --oneline -1
0ffb616c0 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #8461 from mgugino-upstream-stage/containerized-cleanup

It still failed at TASK [openshift_service_catalog : Verify that the catalog api server is running].
"oc get pod" showed the pod is ready; however, the curl command used in that task failed.

=====================
During the installation:

root@ip-172-31-11-200: ~ # oc get pod --all-namespaces | grep service
kube-service-catalog   apiserver-r96mp            1/1   Running   0   1m
kube-service-catalog   controller-manager-6p78r   1/1   Running   0   1m
root@ip-172-31-11-200: ~ # curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed

After the installation:

# curl -k https://apiserver.kube-service-catalog.svc/healthz
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed
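The `[-]etcd failed` line in that healthz output is what the verification task's curl is tripping on. As a quick triage aid, a small shell helper along these lines (a hypothetical sketch, not part of the playbook; `check_healthz` is a made-up name) can pull the failing check names out of a /healthz response:

```shell
#!/bin/sh
# check_healthz reads apiserver /healthz text on stdin, prints the names of
# failing checks (lines beginning with "[-]"), and returns non-zero if any
# check failed.
check_healthz() {
    # Extract the check name following each "[-]" prefix.
    failed=$(sed -n 's/^\[-\]\([^ :]*\).*/\1/p')
    if [ -n "$failed" ]; then
        echo "failing checks: $failed"
        return 1
    fi
    echo "all checks ok"
    return 0
}
```

Usage against the cluster would look like `curl -sk https://apiserver.kube-service-catalog.svc/healthz | check_healthz`.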
Met the same issue on Azure with openshift v3.10.0-0.50.0 (RPM install, HA).
Hongkai, Weihua,
Can you check the logs of the apiserver pods in the kube-service-catalog namespace?

Jeff, Jay,
Any ideas why the health check is failing?
I've tried reproducing this, but haven't been able to (both by running an install from scratch and by running the catalog playbook afterwards, as shown in comment 1).

From the provided logs it looks like etcd isn't healthy. What does this show:

# source /etc/etcd/etcd.conf
# ETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint health

If etcd is indeed having problems, I'll need access to the problematic cluster, as I don't really know what details to ask for.

The command in comment 1 was executed on a working cluster, correct? I don't think the catalog playbook will set up everything properly by itself.
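For scripting that check, the output of `etcdctl endpoint health` can be gated on mechanically. The helper below is a sketch (`etcd_all_healthy` is a made-up name, and it assumes the v3 output format shown in this thread, one line per endpoint); it returns non-zero when any endpoint line is not reported healthy:

```shell
#!/bin/sh
# etcd_all_healthy reads `etcdctl endpoint health` output on stdin and
# succeeds only if every endpoint line reports " is healthy: ".
etcd_all_healthy() {
    while IFS= read -r line; do
        case "$line" in
            *" is healthy: "*) ;;   # this endpoint is OK, keep scanning
            *) return 1 ;;          # unhealthy endpoint or unexpected line
        esac
    done
    return 0
}
```

It would be combined with the command above as `... endpoint health | etcd_all_healthy && echo "etcd OK"`.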
Jeff,
I will provide the service catalog apiserver pod log and the etcdctl results tomorrow.

I did it in 2 steps just to show it is an installation problem of the service catalog:
1. Install the OCP cluster with openshift_enable_service_catalog=false.
2. Run the catalog playbook with the same inventory file and openshift_enable_service_catalog=true.

So it is also a brand-new cluster. Other functions, e.g., creating projects and pods, look fine without the service catalog, so etcd is probably OK. Anyway, we will know more tomorrow. Thanks.
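The two-step procedure above amounts to toggling one inventory variable between runs. A minimal inventory fragment as an illustration (the `[OSEv3:vars]` group is the usual openshift-ansible convention; the exact surrounding inventory is assumed here):

```ini
# Step 1: deploy the cluster with the catalog disabled.
[OSEv3:vars]
openshift_enable_service_catalog=false

# Step 2: flip the flag to true in this same inventory, then run only
# openshift-ansible/playbooks/openshift-service-catalog/config.yml
# against it.
```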
Tried several combinations.

With 1 step (install the service catalog together with the cluster):
Succeeded on the ocp-310-50 build (openshift_enable_service_catalog=true): https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/39232/consoleFull
Failed on the ocp-310-47 build (openshift_enable_service_catalog=true): https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/39231/consoleFull

With 2 steps (install the cluster without the service catalog, then install it with the separate playbook):
Succeeded on the ocp-310-50 build (openshift_enable_service_catalog=true):
Failed on the ocp-310-47 build (openshift_enable_service_catalog=false):

Weihua, can you test it again with the 50 build (maybe on AWS this time)?

So it looks like the problem is with 3.10.0-0.47.0.git.0.2fffa04.el7 and it is fixed in 3.10.0-0.50.0.git.0.db6dfd6.el7.

When the failure happened, this was the command result:

# oc get pod -n kube-system
NAME                                                            READY   STATUS    RESTARTS   AGE
master-api-ip-172-31-8-100.us-west-2.compute.internal           1/1     Running   1          28m
master-controllers-ip-172-31-8-100.us-west-2.compute.internal   1/1     Running   0          29m
master-etcd-ip-172-31-8-100.us-west-2.compute.internal          1/1     Running   0          28m

root@ip-172-31-8-100: ~ # oc rsh -n kube-system master-etcd-ip-172-31-8-100.us-west-2.compute.internal
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# ETCDCTL_API=3 etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint health
https://172.31.8.100:2379 is healthy: successfully committed proposal: took = 832.005µs

These are the pod logs: http://file.rdu.redhat.com/~hongkliu/test_result/bz1579421/
They look quite different.
I just re-installed, this time disabling the catalog at the beginning and only installing it afterwards. No problems here.

However, the first apiserver pod log doesn't look like all is well. Are those logs from the successful install?
Both logs are from the failure cases with 3.10.0-0.47.0.git.0.2fffa04.el7:
apiserver.pod.1.step.log.txt: only 1 playbook
apiserver.pod.2.step.log.txt: 2 playbooks

Jeff, which build of OCP are you testing with?
Ah ok, well then it sounds like this bug is ok to be closed. I'm testing with 3.10.0-0.52.0 (I use the master branch of openshift-ansible directly).
Jeff, yes. I retested with 3.10.0-0.50.0.git.0.db6dfd6.el7 and everything looks fine to me. I'm fine with you closing the bz; let's move on.

Weihua, feel free to reopen if you still hit this with the latest version. Thanks.