Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1631717

Summary: new node scale up failed at "Approve node certificates when bootstrapping" step
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED NOTABUG
QA Contact: Xingxing Xia <xxia>
Severity: high
Priority: high
Version: 3.11.0
CC: aos-bugs, avagarwa, dcaldwel, deads, fshaikh, jialiu, jokerman, mmccomas
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-25 17:10:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- scaleup_log_with_inventory_embeded
- api server logs w/ traceback

Description Johnny Liu 2018-09-21 11:22:13 UTC
Description of problem:

Version-Release number of the following components:
openshift-ansible-3.11.12-1.git.0.0c64f7a.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster
2. Scale up a new node
3.

Actual results:
The "Approve node certificates when bootstrapping" step keeps "RETRYING" over and over until it finally fails.

While the task is still "RETRYING", logging in to the master and checking the CSRs and nodes shows that the new node is already registered. Not sure what the installer is expecting.

[root@qe-jialiu-master-etcd-1 ~]# oc get csr -a
Flag --show-all has been deprecated, will be removed in an upcoming release
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-4srsn                                              5h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-5v9dj                                              7h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-7lmk7                                              6h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-8v5wv                                              6h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-bb6gx                                              6h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-c625v                                              5h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-ckvbn                                              4h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-clgtq                                              3h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-czvkg                                              5h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-dlfhq                                              5h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-j5z5s                                              5h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-p266x                                              4h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-q6jz6                                              4h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-qgc79                                              7h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-rzkgk                                              32m       system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-tkxzx                                              6h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-vls5d                                              4h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
csr-x8796                                              2m        system:node:qe-jialiu2-node-new-1                         Approved,Issued
csr-xjhdp                                              6h        system:node:qe-jialiu1-node-new-1                         Approved,Issued
node-csr-_QKmZLrcEmxROrS91umNpqNa-N_XbZNTtvwB8JOwWN0   32m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
node-csr-_vloY87FfBS34GsfxghlJmoIjODTSsEopfggPgvi208   2m        system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued


[root@qe-jialiu-master-etcd-1 ~]# oc get node
NAME                               STATUS    ROLES     AGE       VERSION
qe-jialiu-master-etcd-1            Ready     master    8h        v1.11.0+d4cacc0
qe-jialiu-master-etcd-2            Ready     master    8h        v1.11.0+d4cacc0
qe-jialiu-master-etcd-3            Ready     master    8h        v1.11.0+d4cacc0
qe-jialiu-node-1                   Ready     compute   8h        v1.11.0+d4cacc0
qe-jialiu-node-registry-router-1   Ready     <none>    8h        v1.11.0+d4cacc0
qe-jialiu2-node-new-1              Ready     compute   2m        v1.11.0+d4cacc0


Expected results:
New node scale-up succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Johnny Liu 2018-09-21 11:23:24 UTC
Created attachment 1485470 [details]
scaleup_log_with_inventory_embeded

Comment 4 Michael Gugino 2018-09-21 19:10:42 UTC
CSRs seem to have been approved; the server CSRs were approved, but the raw API query never worked for the new host. Logging into the cluster, I see the node is Ready and pods are running on it, but I am unable to read logs from the SDN pod or query the raw API endpoint.

[root@qe-jialiu-master-etcd-1 ~]# oc logs -n openshift-sdn ovs-xzf6r
Error from server: Get https://qe-jialiu2-node-new-1:10250/containerLogs/openshift-sdn/ovs-xzf6r/openvswitch: Forbidden
[root@qe-jialiu-master-etcd-1 ~]# oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz
Error from server (ServiceUnavailable): the server is currently unable to handle the request

Comment 5 Michael Gugino 2018-09-21 19:36:32 UTC
Restarting the node service does not work, and there is nothing obvious in the journalctl output explaining why the node's API endpoint is not responding. The node is successfully bound to port 10250 and responds to curl requests, but otherwise returns no data.

As far as I can tell, we installed everything correctly and approved the CSRs, but the node is broken.

Comment 6 Scott Dodson 2018-09-21 20:07:20 UTC
The API server generates a 503 when trying to proxy to the referenced host, but not to others.

# oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz --loglevel=8 --server https://`hostname`                                                                          
I0921 16:01:53.928555   31991 loader.go:359] Config loaded from file /root/.kube/config
I0921 16:01:53.930504   31991 round_trippers.go:383] GET https://qe-jialiu-master-etcd-1/api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz
I0921 16:01:53.930565   31991 round_trippers.go:390] Request Headers:
I0921 16:01:53.930600   31991 round_trippers.go:393]     Accept: application/json, */*
I0921 16:01:53.930633   31991 round_trippers.go:393]     User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0921 16:01:54.159595   31991 round_trippers.go:408] Response Status: 503 Service Unavailable in 228 milliseconds
I0921 16:01:54.159763   31991 round_trippers.go:411] Response Headers:
I0921 16:01:54.159825   31991 round_trippers.go:414]     Cache-Control: no-store
I0921 16:01:54.159845   31991 round_trippers.go:414]     Content-Type: text/plain; charset=utf-8
I0921 16:01:54.159873   31991 round_trippers.go:414]     Content-Length: 81
I0921 16:01:54.159919   31991 round_trippers.go:414]     Date: Fri, 21 Sep 2018 20:01:54 GMT
I0921 16:01:54.160027   31991 request.go:897] Response Body: Error: 'Forbidden'
Trying to reach: 'https://qe-jialiu2-node-new-1:10250/healthz'
I0921 16:01:54.160249   31991 helpers.go:201] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request",
  "reason": "ServiceUnavailable",
  "details": {
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "Error: 'Forbidden'\nTrying to reach: 'https://qe-jialiu2-node-new-1:10250/healthz'"
      }
    ]
  },
  "code": 503
}]

If I use the certificates referenced by kubeletClientInfo in master-config.yaml, things are fine both for the host in question and for others.

kubeletClientInfo:
  ca: ca-bundle.crt
  certFile: master.kubelet-client.crt
  keyFile: master.kubelet-client.key
  port: 10250

# curl --cacert /etc/origin/master/ca.crt --cert /etc/origin/master/master.kubelet-client.crt --key /etc/origin/master/master.kubelet-client.key -v -k https://qe-jialiu2-node-new-1:10250/healthz
* About to connect() to qe-jialiu2-node-new-1 port 10250 (#0)
*   Trying 192.168.100.9...
* Connected to qe-jialiu2-node-new-1 (192.168.100.9) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate from file
*       subject: CN=system:openshift-node-admin,O=system:node-admins
*       start date: Sep 21 02:20:18 2018 GMT
*       expire date: Sep 20 02:20:19 2020 GMT
*       common name: system:openshift-node-admin
*       issuer: CN=openshift-signer@1537496417
* SSL connection using TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
* Server certificate:
*       subject: CN=system:node:qe-jialiu2-node-new-1,O=system:nodes
*       start date: Sep 21 11:05:00 2018 GMT
*       expire date: Sep 21 11:05:00 2019 GMT
*       common name: system:node:qe-jialiu2-node-new-1
*       issuer: CN=openshift-signer@1537496417
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: qe-jialiu2-node-new-1:10250
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 21 Sep 2018 20:03:25 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host qe-jialiu2-node-new-1 left intact


Attaching a log with traceback.

Comment 7 Scott Dodson 2018-09-21 20:08:09 UTC
Created attachment 1485792 [details]
api server logs w/ traceback

Comment 8 Scott Dodson 2018-09-21 21:19:00 UTC
Seth and I looked at this; we can find no difference between a working node and the failing node.

Further, when we look at tcpdump for activity between the local API server and port 10250 on the affected node we see no traffic when we attempt to proxy via the API server. If we use curl directly we see traffic.

I've gathered the contents of /etc/origin from the master, qe-jialiu-node-1 (working node), and qe-jialiu2-node-new-1 (failing node).

Comment 9 Scott Dodson 2018-09-21 21:20:43 UTC
Created attachment 1485819 [details]
etc origin contents

Comment 16 David Eads 2018-09-25 17:10:43 UTC
The /etc/origin/master/master.env NO_PROXY value included every node except qe-jialiu2-node-new-1.  Adding that value corrects the problem.

Running `oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz --loglevel=8 --server https://`hostname`` on that machine now works.
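For reference, the fix described above can be sketched as a one-line edit of the NO_PROXY value. This is a hypothetical sketch run against a scratch copy so it is safe to execute anywhere; on a real cluster the file is /etc/origin/master/master.env and the API server must be restarted afterwards.

```shell
# Sketch only: build a scratch master.env with a NO_PROXY list that is
# missing the new node, as in this bug.
echo 'NO_PROXY=qe-jialiu-node-1,qe-jialiu-master-etcd-1' > /tmp/master.env.sketch

# Append the missing node name to the NO_PROXY list.
sed -i 's/^NO_PROXY=.*/&,qe-jialiu2-node-new-1/' /tmp/master.env.sketch

grep '^NO_PROXY=' /tmp/master.env.sketch
```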

Comment 17 Johnny Liu 2018-09-25 18:42:48 UTC
(In reply to David Eads from comment #16)
> The /etc/origin/master/master.env NO_PROXY value included every node except
> qe-jialiu2-node-new-1.  Adding that value corrects the problem.
> 
> Running `oc get --raw /api/v1/nodes/qe-jialiu2-node-new-1/proxy/healthz
> --loglevel=8 --server https://`hostname`` on that machine now works.

Cool, thanks for your debugging.

This issue makes me recall another bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1378840

They should be the same issue.

In my cluster, although the subdomain for the whole cluster is openshift-snvl2.internal, adding ".openshift-snvl2.internal" to the NO_PROXY list does not help, because the node name is "qe-jialiu2-node-new-1", not "qe-jialiu2-node-new-1.openshift-snvl2.internal".
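The point above can be illustrated with typical NO_PROXY suffix-matching semantics (an assumption about how proxy libraries commonly behave, not something taken from this bug): an entry beginning with a dot matches subdomains only, so it never matches a bare short hostname like the node name here.

```shell
# Rough sketch of common NO_PROXY matching (exact behavior varies by
# library): a leading-dot entry is a subdomain suffix match only.
matches_no_proxy() {
  host=$1; entry=$2
  case "$entry" in
    .*) case "$host" in *"$entry") return 0;; esac ;;        # suffix match
    *)  [ "$host" = "$entry" ] && return 0                    # exact match
        case "$host" in *".$entry") return 0;; esac ;;        # subdomain
  esac
  return 1
}

matches_no_proxy qe-jialiu2-node-new-1 .openshift-snvl2.internal \
  && echo short-name-matched || echo short-name-missed
matches_no_proxy qe-jialiu2-node-new-1.openshift-snvl2.internal \
  .openshift-snvl2.internal \
  && echo fqdn-matched || echo fqdn-missed
```

This is why a short node name has to be listed in NO_PROXY explicitly: only the fully qualified form is caught by the dotted-suffix entry.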