Bug 1631717
| Summary: | new node scale up failed at "Approve node certificates when bootstrapping" step |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Master |
| Reporter: | Johnny Liu <jialiu> |
| Assignee: | Michal Fojtik <mfojtik> |
| QA Contact: | Xingxing Xia <xxia> |
| Status: | CLOSED NOTABUG |
| Severity: | high |
| Priority: | high |
| Version: | 3.11.0 |
| Target Release: | 3.11.0 |
| CC: | aos-bugs, avagarwa, dcaldwel, deads, fshaikh, jialiu, jokerman, mmccomas |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Last Closed: | 2018-09-25 17:10:43 UTC |
Description (Johnny Liu, 2018-09-21 11:22:13 UTC)
Created attachment 1485470 [details]
scaleup_log_with_inventory_embeded
CSRs seem to have been approved: server CSRs were approved, but raw API queries never worked for the new host. Logging into the cluster, I see the node is Ready and pods are running on it, but I am unable to read logs from the sdn pod or query the raw API endpoint:

```
[root@qe-jialiu-master-etcd-1 ~]# oc logs -n openshift-sdn ovs-xzf6r
Error from server: Get https://qe-jialiu2-node-new-1:10250/containerLogs/openshift-sdn/ovs-xzf6r/openvswitch: Forbidden
[root@qe-jialiu-master-etcd-1 ~]# oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz
Error from server (ServiceUnavailable): the server is currently unable to handle the request
```

Restarting the node service does not help. Nothing obvious in the journalctl output explains why the node's API endpoint is not responding. I can see that the node is successfully bound to port 10250 and responding to curl requests, but otherwise it returns no data. As far as I can tell, we installed everything correctly and approved the CSRs, yet the node is broken. The API server generates a 503 when trying to proxy to this host, but not to others.
```
# oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz --loglevel=8 --server https://`hostname`
I0921 16:01:53.928555   31991 loader.go:359] Config loaded from file /root/.kube/config
I0921 16:01:53.930504   31991 round_trippers.go:383] GET https://qe-jialiu-master-etcd-1/api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz
I0921 16:01:53.930565   31991 round_trippers.go:390] Request Headers:
I0921 16:01:53.930600   31991 round_trippers.go:393]     Accept: application/json, */*
I0921 16:01:53.930633   31991 round_trippers.go:393]     User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0921 16:01:54.159595   31991 round_trippers.go:408] Response Status: 503 Service Unavailable in 228 milliseconds
I0921 16:01:54.159763   31991 round_trippers.go:411] Response Headers:
I0921 16:01:54.159825   31991 round_trippers.go:414]     Cache-Control: no-store
I0921 16:01:54.159845   31991 round_trippers.go:414]     Content-Type: text/plain; charset=utf-8
I0921 16:01:54.159873   31991 round_trippers.go:414]     Content-Length: 81
I0921 16:01:54.159919   31991 round_trippers.go:414]     Date: Fri, 21 Sep 2018 20:01:54 GMT
I0921 16:01:54.160027   31991 request.go:897] Response Body: Error: 'Forbidden'
Trying to reach: 'https://qe-jialiu2-node-new-1:10250/healthz'
I0921 16:01:54.160249   31991 helpers.go:201] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request",
  "reason": "ServiceUnavailable",
  "details": {
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error: 'Forbidden'\nTrying to reach: 'https://qe-jialiu2-node-new-1:10250/healthz'"
      }
    ]
  },
  "code": 503
}]
```

If I use the certificates referenced by kubeletClientInfo in master-config.yaml, things are fine for the host in question and for others.
```yaml
kubeletClientInfo:
  ca: ca-bundle.crt
  certFile: master.kubelet-client.crt
  keyFile: master.kubelet-client.key
  port: 10250
```

```
# curl --cacert /etc/origin/master/ca.crt --cert /etc/origin/master/master.kubelet-client.crt --key /etc/origin/master/master.kubelet-client.key -v -k https://qe-jialiu2-node-new-1:10250/healthz
* About to connect() to qe-jialiu2-node-new-1 port 10250 (#0)
*   Trying 192.168.100.9...
* Connected to qe-jialiu2-node-new-1 (192.168.100.9) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate from file
*   subject: CN=system:openshift-node-admin,O=system:node-admins
*   start date: Sep 21 02:20:18 2018 GMT
*   expire date: Sep 20 02:20:19 2020 GMT
*   common name: system:openshift-node-admin
*   issuer: CN=openshift-signer@1537496417
* SSL connection using TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
* Server certificate:
*   subject: CN=system:node:qe-jialiu2-node-new-1,O=system:nodes
*   start date: Sep 21 11:05:00 2018 GMT
*   expire date: Sep 21 11:05:00 2019 GMT
*   common name: system:node:qe-jialiu2-node-new-1
*   issuer: CN=openshift-signer@1537496417
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: qe-jialiu2-node-new-1:10250
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 21 Sep 2018 20:03:25 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host qe-jialiu2-node-new-1 left intact
```

Attaching a log with traceback.

Created attachment 1485792 [details]
api server logs w/ traceback
Seth and I looked at this; we can find no difference between a working node and a failing node. Further, when we watch tcpdump for activity between the local API server and port 10250 on the affected node, we see no traffic when we attempt to proxy via the API server. If we use curl directly, we do see traffic. I've gathered the contents of /etc/origin from the master, qe-jialiu-node-1 (working node), and qe-jialiu2-node-new-1 (failing node).

Created attachment 1485819 [details]
etc origin contents
The /etc/origin/master/master.env NO_PROXY value included every node except qe-jialiu2-node-new-1. Adding that value corrects the problem.

Running the following on that machine now works:

```
oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz --loglevel=8 --server https://`hostname`
```

(In reply to David Eads from comment #16)

> The /etc/origin/master/master.env NO_PROXY value included every node except qe-jialiu2-node-new-1. Adding that value corrects the problem.
>
> Running oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz --loglevel=8 --server https://`hostname` on that machine now works.

Cool, thanks for your debugging. This issue reminds me of another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1378840. They should be the same issue. In my cluster, although the subdomain for the whole cluster is openshift-snvl2.internal, adding ".openshift-snvl2.internal" to the NO_PROXY list does not help, because the node name is "qe-jialiu2-node-new-1", not "qe-jialiu2-node-new-1.openshift-snvl2.internal".
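The failure mode described above (a ".domain" NO_PROXY entry not covering a short, unqualified node name) follows the usual NO_PROXY suffix-matching convention. As a rough illustration only, the sketch below uses Python's urllib implementation of that convention, not OpenShift's Go proxy code; the hostnames are the ones from this bug:

```python
# Illustration of typical NO_PROXY matching semantics using Python's
# urllib. A ".domain" entry matches hosts *ending in* that suffix, so a
# bare short hostname never matches it and still goes through the proxy.
from urllib.request import proxy_bypass_environment

# NO_PROXY containing only the cluster subdomain as a suffix entry.
proxies = {"no": ".openshift-snvl2.internal"}

# The fully qualified node name matches the suffix: proxy is bypassed.
print(proxy_bypass_environment(
    "qe-jialiu2-node-new-1.openshift-snvl2.internal", proxies))  # True

# The short node name does NOT match, so requests to it still go
# through the proxy -- the same situation as the 503 in this bug.
print(proxy_bypass_environment("qe-jialiu2-node-new-1", proxies))  # False

# Listing the short name explicitly fixes it, which is what adding the
# per-node entry to master.env NO_PROXY accomplished here.
proxies = {"no": ".openshift-snvl2.internal,qe-jialiu2-node-new-1"}
print(proxy_bypass_environment("qe-jialiu2-node-new-1", proxies))  # True
```

Go's proxy handling (used by the API server) applies comparable suffix rules, which is why the cluster subdomain entry alone did not cover nodes registered under their short names.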