Bug 1662560

Summary: service catalog apiserver fails with etcd failed reason withheld
Product: OpenShift Container Platform
Reporter: Steven Walter <stwalter>
Component: Service Catalog
Assignee: Jay Boyd <jaboyd>
Status: CLOSED NOTABUG
QA Contact: Jian Zhang <jiazha>
Severity: high
Priority: unspecified
Version: 1.0.0
CC: rhowe, schoudha, stwalter
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-01-02 17:45:39 UTC
Type: Bug

Description Steven Walter 2018-12-30 00:33:18 UTC
The apiserver pod and the controller pods crash due to failed health checks. Running the same health check manually from inside the apiserver pod shows:

# oc rsh apiserver-btqh7
sh-4.2# curl -kv https://localhost:6443/healthz
* About to connect() to localhost port 6443 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=apiserver.kube-service-catalog
*       start date: Dec 28 21:45:00 2018 GMT
*       expire date: Dec 27 21:45:01 2020 GMT
*       common name: apiserver.kube-service-catalog
*       issuer: CN=service-catalog-signer
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sat, 29 Dec 2018 23:47:25 GMT
< Content-Length: 180
<
[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-service-catalog-apiserver-informers ok
[-]etcd failed: reason withheld
healthz check failed
* Connection #0 to host localhost left intact
sh-4.2#
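
The aggregated /healthz response withholds the failure reason. Assuming this build of the generic apiserver also registers the per-check paths (an assumption, not verified here), the individual etcd check may return the underlying error instead of "reason withheld":

sh-4.2# curl -k https://localhost:6443/healthz/etcd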

etcd is healthy, with no errors in logs:
# oc rsh master-etcd-cbsccqaocpm1
sh-4.2# export ETCDCTL_API=3
sh-4.2# source /etc/etcd/etcd.conf
sh-4.2# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
b4fc1645b35ef7b9, started, cbsccqaocpm1.qa.cibc.local, https://172.16.40.11:2380, https://172.16.40.11:2379
sh-4.2#


2018-12-30 00:09:17.598223 I | mvcc: finished scheduled compaction at 1558955 (took 1.229722ms)
2018-12-30 00:14:17.599308 I | mvcc: store.index: compact 1559452
2018-12-30 00:14:17.601238 I | mvcc: finished scheduled compaction at 1559452 (took 1.28863ms)
2018-12-30 00:19:17.604818 I | mvcc: store.index: compact 1559967
2018-12-30 00:19:17.606688 I | mvcc: finished scheduled compaction at 1559967 (took 1.237636ms)
2018-12-30 00:24:17.608219 I | mvcc: store.index: compact 1560465
2018-12-30 00:24:17.610150 I | mvcc: finished scheduled compaction at 1560465 (took 1.343512ms)
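
For completeness, endpoint health can also be checked directly from the etcd pod; a sketch reusing the same variables sourced from /etc/etcd/etcd.conf above:

sh-4.2# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS endpoint health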



A similar problem was reported here, but on a much older version: https://access.redhat.com/solutions/3322311

Another similar issue was hit here, again on a previous version: https://bugzilla.redhat.com/show_bug.cgi?id=1579421
And here, but with a potential workaround: https://github.com/openshift/openshift-ansible/issues/8076

The last one was attributed to a wrong etcd URL in the daemonset, but in our case the etcd URL is set correctly in the daemonset YAML:


    spec:
      containers:
      - args:
        - apiserver
        - --storage-type
        - etcd
        - --secure-port
        - "6443"
        - --etcd-servers
        - https://master1.redacted.local:2379
        - --etcd-cafile
        - /etc/origin/master/master.etcd-ca.crt
        - --etcd-certfile
        - /etc/origin/master/master.etcd-client.crt
        - --etcd-keyfile
        - /etc/origin/master/master.etcd-client.key
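
One way to confirm that the apiserver pod can actually reach the configured etcd endpoint is to hit etcd's /health URL with the same certificates the daemonset passes as arguments; a sketch, assuming those paths are mounted into the pod at the same locations:

# oc rsh apiserver-btqh7
sh-4.2# curl --cacert /etc/origin/master/master.etcd-ca.crt \
    --cert /etc/origin/master/master.etcd-client.crt \
    --key /etc/origin/master/master.etcd-client.key \
    https://master1.redacted.local:2379/health

A healthy etcd should answer with {"health": "true"}.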

Comment 1 Steven Walter 2018-12-30 00:34:10 UTC
The pods were crashing because the liveness probe was failing, so we removed the probes to stop the pods from crashing.
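
For reference, a probe can be removed with a JSON patch along these lines (hypothetical object names; assuming the daemonset is called apiserver in the kube-service-catalog project):

# oc patch daemonset apiserver -n kube-service-catalog --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'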

Comment 3 Jian Zhang 2019-01-02 02:49:53 UTC
Steven,

Could you provide the exact version of the service catalog, and the cluster version?
You can get the service catalog version with the following command:
$oc exec xxx(pod name) -- service-catalog --version
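
For the cluster version, for example:
$oc version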

Comment 4 Jay Boyd 2019-01-02 15:16:53 UTC
Is this after an upgrade or a new deployment?  Please provide version details.

Comment 5 Ryan Howe 2019-01-02 17:45:39 UTC
Closing this bug; this is not an issue with the service catalog.

The issue is that a third-party network SDN plugin is being used, and because of that kube-proxy is never started.

The service catalog is failing because it makes calls to the kubernetes service IP, which should get NATed to the master IP. Those NAT rules are never set up because kube-proxy is not running on this host.
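
One way to confirm this (a sketch, not from the original report) is to compare the kubernetes service cluster IP with the NAT rules that kube-proxy would normally program on the node:

# oc get svc kubernetes -n default
# iptables-save -t nat | grep 'default/kubernetes'

With kube-proxy running there should be KUBE-SERVICES/KUBE-SEP entries tagged with default/kubernetes for that cluster IP; without kube-proxy no such entries exist, so traffic to the service IP is never translated to the master IP.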

Comment 6 Red Hat Bugzilla 2023-09-15 00:14:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days