The apiserver pod and controller pods are crashing due to failed health checks. We ran the same health check manually and see:
# oc rsh apiserver-btqh7
sh-4.2# curl -kv https://localhost:6443/healthz
* About to connect() to localhost port 6443 (#0)
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=apiserver.kube-service-catalog
* start date: Dec 28 21:45:00 2018 GMT
* expire date: Dec 27 21:45:01 2020 GMT
* common name: apiserver.kube-service-catalog
* issuer: CN=service-catalog-signer
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sat, 29 Dec 2018 23:47:25 GMT
< Content-Length: 180
[-]etcd failed: reason withheld
healthz check failed
* Connection #0 to host localhost left intact
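The reason for the etcd failure is withheld by default. Assuming this apiserver build exposes the standard Kubernetes verbose healthz endpoint, the individual subchecks can be listed, and the etcd subcheck probed on its own, from inside the pod:

```shell
# List each healthz subcheck individually instead of the withheld summary.
curl -k "https://localhost:6443/healthz?verbose"

# Probe only the etcd subcheck.
curl -k "https://localhost:6443/healthz/etcd"
```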
etcd itself is healthy, with no errors in its logs:
# oc rsh master-etcd-cbsccqaocpm1
sh-4.2# export ETCDCTL_API=3
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
b4fc1645b35ef7b9, started, cbsccqaocpm1.qa.cibc.local, https://172.16.40.11:2380, https://172.16.40.11:2379
2018-12-30 00:09:17.598223 I | mvcc: finished scheduled compaction at 1558955 (took 1.229722ms)
2018-12-30 00:14:17.599308 I | mvcc: store.index: compact 1559452
2018-12-30 00:14:17.601238 I | mvcc: finished scheduled compaction at 1559452 (took 1.28863ms)
2018-12-30 00:19:17.604818 I | mvcc: store.index: compact 1559967
2018-12-30 00:19:17.606688 I | mvcc: finished scheduled compaction at 1559967 (took 1.237636ms)
2018-12-30 00:24:17.608219 I | mvcc: store.index: compact 1560465
2018-12-30 00:24:17.610150 I | mvcc: finished scheduled compaction at 1560465 (took 1.343512ms)
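As a further cross-check, `etcdctl endpoint health` queries each endpoint directly, reusing the same certificate variables sourced above from /etc/etcd/etcd.conf:

```shell
# Inside the etcd pod; same environment as the member-list command above.
export ETCDCTL_API=3
source /etc/etcd/etcd.conf
etcdctl --cert=$ETCD_PEER_CERT_FILE \
        --key=$ETCD_PEER_KEY_FILE \
        --cacert=$ETCD_TRUSTED_CA_FILE \
        --endpoints=$ETCD_LISTEN_CLIENT_URLS \
        endpoint health
```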
A similar problem was reported here, but on a much older version than ours: https://access.redhat.com/solutions/3322311
Another similar issue was hit here, again on an earlier version: https://bugzilla.redhat.com/show_bug.cgi?id=1579421
And here, but with a potential workaround: https://github.com/openshift/openshift-ansible/issues/8076
However, the last one was caused by a wrong etcd URL in the daemonset, whereas in our case the etcd URL is set correctly in the daemonset YAML:
The pods were crashing because the liveness probe kept failing, so we removed the probes to keep them running.
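For reference, the probes can be removed with `oc set probe` rather than editing the daemonset by hand. This is a sketch; the object kind and name (`daemonset/apiserver` in the `kube-service-catalog` namespace) are assumptions and may differ on your cluster:

```shell
# Strip the liveness and readiness probes from the service-catalog apiserver.
# Object name assumed; adjust to match your cluster.
oc -n kube-service-catalog set probe daemonset/apiserver --remove --liveness --readiness
```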
Could you provide us the exact version of the service catalog, as well as the cluster version?
You can run the following command to get the service catalog version:
$ oc exec <pod-name> -- service-catalog --version
Is this after an upgrade or a new deployment? Please provide version details.
Closing this bug; this is not an issue with the service catalog.
The issue is that a third-party SDN network plugin is being used, and because of that kube-proxy is never started.
The service catalog is failing because it makes calls to the Kubernetes service IP, which should get NATed to the master IP. Those NAT rules never get set up because kube-proxy is not running on this host.
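One way to confirm this on the affected host: when kube-proxy is running, the nat table contains a KUBE-SERVICES chain with a rule for the kubernetes service cluster IP. A sketch of the check, assuming the default OpenShift 3.x service IP of 172.30.0.1 (substitute your cluster's value):

```shell
# If kube-proxy is active, the KUBE-SERVICES chain exists and holds a rule
# for the kubernetes service cluster IP (172.30.0.1 assumed here).
iptables -t nat -L KUBE-SERVICES -n | grep 172.30.0.1 \
  || echo "no NAT rule for the kubernetes service IP - kube-proxy likely not running"
```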