The apiserver pod and controller pods are crashing due to failed health checks. We ran the same health check manually and see:
# oc rsh apiserver-btqh7
sh-4.2# curl -kv https://localhost:6443/healthz
* About to connect() to localhost port 6443 (#0)
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=apiserver.kube-service-catalog
* start date: Dec 28 21:45:00 2018 GMT
* expire date: Dec 27 21:45:01 2020 GMT
* common name: apiserver.kube-service-catalog
* issuer: CN=service-catalog-signer
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sat, 29 Dec 2018 23:47:25 GMT
< Content-Length: 180
[-]etcd failed: reason withheld
healthz check failed
* Connection #0 to host localhost left intact
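The reason for the etcd failure is withheld by default. Assuming this apiserver build exposes the standard Kubernetes verbose healthz endpoint, the individual subchecks can be listed, and the etcd subcheck probed on its own, from inside the pod:

```shell
# List each healthz subcheck individually instead of the withheld summary.
curl -k "https://localhost:6443/healthz?verbose"

# Probe only the etcd subcheck.
curl -k "https://localhost:6443/healthz/etcd"
```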
etcd itself is healthy, with no errors in its logs:
# oc rsh master-etcd-cbsccqaocpm1
sh-4.2# export ETCDCTL_API=3
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
b4fc1645b35ef7b9, started, cbsccqaocpm1.qa.cibc.local, https://172.16.40.11:2380, https://172.16.40.11:2379
2018-12-30 00:09:17.598223 I | mvcc: finished scheduled compaction at 1558955 (took 1.229722ms)
2018-12-30 00:14:17.599308 I | mvcc: store.index: compact 1559452
2018-12-30 00:14:17.601238 I | mvcc: finished scheduled compaction at 1559452 (took 1.28863ms)
2018-12-30 00:19:17.604818 I | mvcc: store.index: compact 1559967
2018-12-30 00:19:17.606688 I | mvcc: finished scheduled compaction at 1559967 (took 1.237636ms)
2018-12-30 00:24:17.608219 I | mvcc: store.index: compact 1560465
2018-12-30 00:24:17.610150 I | mvcc: finished scheduled compaction at 1560465 (took 1.343512ms)
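As a further cross-check, `etcdctl endpoint health` queries each endpoint directly, reusing the same certificate variables sourced above from /etc/etcd/etcd.conf:

```shell
# Inside the etcd pod; same environment as the member-list command above.
export ETCDCTL_API=3
source /etc/etcd/etcd.conf
etcdctl --cert=$ETCD_PEER_CERT_FILE \
        --key=$ETCD_PEER_KEY_FILE \
        --cacert=$ETCD_TRUSTED_CA_FILE \
        --endpoints=$ETCD_LISTEN_CLIENT_URLS \
        endpoint health
```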
A similar problem was reported here, but on a much older version than ours: https://access.redhat.com/solutions/3322311
Another similar issue was hit here, again on an earlier version: https://bugzilla.redhat.com/show_bug.cgi?id=1579421
And here, but with a potential workaround: https://github.com/openshift/openshift-ansible/issues/8076
However, the last one was caused by a wrong etcd URL in the daemonset, whereas in our case the etcd URL is set correctly in the daemonset YAML:
The pods were crashing because the liveness probe kept failing, so we removed the probes to keep them running.
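For reference, the probes can be removed with `oc set probe` rather than editing the daemonset by hand. This is a sketch; the object kind and name (`daemonset/apiserver` in the `kube-service-catalog` namespace) are assumptions and may differ on your cluster:

```shell
# Strip the liveness and readiness probes from the service-catalog apiserver.
# Object name assumed; adjust to match your cluster.
oc -n kube-service-catalog set probe daemonset/apiserver --remove --liveness --readiness
```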
Could you provide us the exact version of the service catalog, as well as the cluster version?
You can run the following command to get the service catalog version:
$ oc exec <pod-name> -- service-catalog --version
Is this after an upgrade or a new deployment? Please provide version details.
Closing this bug; this is not an issue with the service catalog.
The issue is that a third-party SDN network plugin is being used, and because of that kube-proxy is never started.
The service catalog is failing because it makes calls to the Kubernetes service IP, which should get NATed to the master IP. Those NAT rules never get set up because kube-proxy is not running on this host.
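One way to confirm this on the affected host: when kube-proxy is running, the nat table contains a KUBE-SERVICES chain with a rule for the kubernetes service cluster IP. A sketch of the check, assuming the default OpenShift 3.x service IP of 172.30.0.1 (substitute your cluster's value):

```shell
# If kube-proxy is active, the KUBE-SERVICES chain exists and holds a rule
# for the kubernetes service cluster IP (172.30.0.1 assumed here).
iptables -t nat -L KUBE-SERVICES -n | grep 172.30.0.1 \
  || echo "no NAT rule for the kubernetes service IP - kube-proxy likely not running"
```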