Bug 1662560 - service catalog apiserver fails with etcd failed reason withheld [NEEDINFO]
Summary: service catalog apiserver fails with etcd failed reason withheld
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Catalog
Version: 1.0.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Jay Boyd
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-30 00:33 UTC by Steven Walter
Modified: 2019-01-02 17:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-02 17:45:39 UTC
Target Upstream Version:
jaboyd: needinfo? (stwalter)



Description Steven Walter 2018-12-30 00:33:18 UTC
The service catalog apiserver pod and controllers crash because their health checks fail. Running the same health check manually, we see:

# oc rsh apiserver-btqh7
sh-4.2# curl -kv https://localhost:6443/healthz
* About to connect() to localhost port 6443 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=apiserver.kube-service-catalog
*       start date: Dec 28 21:45:00 2018 GMT
*       expire date: Dec 27 21:45:01 2020 GMT
*       common name: apiserver.kube-service-catalog
*       issuer: CN=service-catalog-signer
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sat, 29 Dec 2018 23:47:25 GMT
< Content-Length: 180
<
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed
* Connection #0 to host localhost left intact
sh-4.2#
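
The failing check can be picked out of a healthz body like the one above mechanically. A minimal sketch, assuming the response body is piped in on stdin; the function name `failed_healthz_checks` is hypothetical and not part of any OpenShift or service catalog tooling:

```shell
# Hypothetical helper: read a healthz response body on stdin and print
# the name of every check reported as failed (lines starting with "[-]").
failed_healthz_checks() {
  sed -n 's/^\[-]\([^ ]*\) failed.*/\1/p'
}
```

Run inside the pod, it could be fed with `curl -ks https://localhost:6443/healthz | failed_healthz_checks`. It only names the failing check; the body says "reason withheld", so the underlying reason has to be found elsewhere.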

etcd is healthy, with no errors in logs:
# oc rsh master-etcd-cbsccqaocpm1
sh-4.2# export ETCDCTL_API=3
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
b4fc1645b35ef7b9, started, cbsccqaocpm1.qa.cibc.local, https://172.16.40.11:2380, https://172.16.40.11:2379
sh-4.2#


2018-12-30 00:09:17.598223 I | mvcc: finished scheduled compaction at 1558955 (took 1.229722ms)
2018-12-30 00:14:17.599308 I | mvcc: store.index: compact 1559452
2018-12-30 00:14:17.601238 I | mvcc: finished scheduled compaction at 1559452 (took 1.28863ms)
2018-12-30 00:19:17.604818 I | mvcc: store.index: compact 1559967
2018-12-30 00:19:17.606688 I | mvcc: finished scheduled compaction at 1559967 (took 1.237636ms)
2018-12-30 00:24:17.608219 I | mvcc: store.index: compact 1560465
2018-12-30 00:24:17.610150 I | mvcc: finished scheduled compaction at 1560465 (took 1.343512ms)



A similar problem was reported here, but on a much older version than this one: https://access.redhat.com/solutions/3322311

Another similar issue hit here, again on a previous version: https://bugzilla.redhat.com/show_bug.cgi?id=1579421
And here, but with a potential workaround: https://github.com/openshift/openshift-ansible/issues/8076

The last one was caused by a wrong etcd URL in the daemonset. Here, however, the etcd URL is set correctly in the daemonset YAML:


    spec:
      containers:
      - args:
        - apiserver
        - --storage-type
        - etcd
        - --secure-port
        - "6443"
        - --etcd-servers
        - https://master1.redacted.local:2379
        - --etcd-cafile
        - /etc/origin/master/master.etcd-ca.crt
        - --etcd-certfile
        - /etc/origin/master/master.etcd-client.crt
        - --etcd-keyfile
        - /etc/origin/master/master.etcd-client.key

Comment 1 Steven Walter 2018-12-30 00:34:10 UTC
The pods were crashing because the liveness probe was failing; we removed the probes so that the pods would stop being restarted.

Comment 3 Jian Zhang 2019-01-02 02:49:53 UTC
Steven,

Could you provide the exact version of the service catalog, as well as the cluster version?
You can run the following command to get the service catalog version:
$ oc exec <pod name> -- service-catalog --version

Comment 4 Jay Boyd 2019-01-02 15:16:53 UTC
Is this after an upgrade or a new deployment?  Please provide version details.

Comment 5 Ryan Howe 2019-01-02 17:45:39 UTC
Closing this bug; this is not an issue with the service catalog.

The issue is that a third-party SDN network plugin is being used, and because of that kube-proxy is never started.

The service catalog is failing because it makes calls to the kubernetes service IP, which should get NATed to the master IP. These NAT rules never get set up because kube-proxy is not running on this host.
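
One way to confirm a missing NAT rule on a host is to look for the service IP in the NAT table. A minimal sketch, assuming kube-proxy would normally run in iptables mode and program its rules in the `KUBE-SERVICES` chain; the function name `has_service_nat_rule` is hypothetical, and a third-party SDN plugin may handle service IPs in a different way:

```shell
# Hypothetical helper: succeed if the NAT table dump on stdin contains a
# KUBE-SERVICES rule whose destination is the given service IP.
# Assumes kube-proxy iptables mode; other proxy modes use other rules.
has_service_nat_rule() {
  service_ip=$1
  grep -F -- "-A KUBE-SERVICES" | grep -qF -- "-d ${service_ip}/32"
}
```

On the affected node it could be used as `iptables-save -t nat | has_service_nat_rule "$(oc get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}')"`; a non-zero exit status would be consistent with kube-proxy never having set up the rules.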

