Bug 1625817

Summary: [3.10] Installation stuck at TASK [Approve node certificates when bootstrapping]
Product: OpenShift Container Platform
Reporter: Wei Sun <wsun>
Component: Installer
Assignee: Michael Gugino <mgugino>
Status: CLOSED CURRENTRELEASE
QA Contact: Weihua Meng <wmeng>
Severity: high
Docs Contact:
Priority: high
Version: 3.10.0
CC: aos-bugs, fabio.martinelli, jialiu, jokerman, juzhao, kumarmn, mgoldman, mgugino, mmccomas, nils.ketelsen, roxenham, scortopa, wabouham, wmeng, wsun
Target Milestone: ---
Keywords: Regression
Target Release: 3.10.z
Flags: sdodson: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The node CSR approval process has been refactored to address several process deficiencies. This process now approves certificates for relevant nodes and waits for the certificate to be verifiable via the API. In the event that this new process fails, the logs will include relevant debugging information required by support to diagnose any remaining issues. Please make sure you capture these logs and provide them to support in the event of a failure.
Story Points: ---
Clone Of: 1622945
Environment:
Last Closed: 2019-01-03 17:34:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1622945
Bug Blocks: 1479956, 1565405, 1623204, 1623248

Comment 4 Michael Gugino 2018-09-18 17:29:26 UTC
Need output of `oc get nodes` and `oc get csr -o yaml`

Need ansible-playbook -vvv output (that's 3 v's)
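
When triaging the requested CSR output, it can help to list which certificate signing requests are still undecided. The following is a minimal sketch, not part of the installer: it assumes the CSR list has been captured as JSON (for example with `oc get csr -o json`) and treats a CSR as pending when its status carries neither an Approved nor a Denied condition. The function name `pending_csr_names` and the sample data are hypothetical and exist only for illustration.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return the names of CSRs with no Approved or Denied condition."""
    doc = json.loads(csr_list_json)
    pending = []
    for item in doc.get("items", []):
        # A pending CSR typically has an empty status, so guard every lookup.
        conditions = (item.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending


# Hypothetical sample data; real input would come from `oc get csr -o json`.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "csr-approved"},
         "status": {"conditions": [{"type": "Approved"}]}},
        {"metadata": {"name": "csr-pending"}, "status": {}},
        {"metadata": {"name": "csr-denied"},
         "status": {"conditions": [{"type": "Denied"}]}},
    ]
})

print(pending_csr_names(SAMPLE))
```

Any names it reports can then be inspected individually or, if appropriate, approved by hand with `oc adm certificate approve <name>`.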

Comment 5 Manoj Kumar 2018-09-21 18:15:52 UTC
I can reproduce this every time on a Power 8 bare-metal node with OCP 3.10.

Comment 6 Manoj Kumar 2018-09-21 18:17:53 UTC
[root@rhel-ocpapp2 openshift-ansible]# oc project openshift-sdn
Now using project "openshift-sdn" on server "https://rhel-ocpapp2:8443".
[root@rhel-ocpapp2 openshift-ansible]# oc get all
NAME            READY     STATUS             RESTARTS   AGE
pod/ovs-j25wz   1/1       Running            0          5m
pod/sdn-h9c8k   0/1       CrashLoopBackOff   6          5m

NAME                 DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ovs   1         1         1         1            1           <none>          5m
daemonset.apps/sdn   1         1         0         1            0           <none>          5m

NAME                                  DOCKER REPO                                           TAGS      UPDATED
imagestream.image.openshift.io/node   docker-registry.default.svc:5000/openshift-sdn/node   v3.10     5 minutes ago
[root@rhel-ocpapp2 openshift-ansible]# oc logs -f pod/sdn-h9c8k
Error from server: Get https://rhel-ocpapp2:10250/containerLogs/openshift-sdn/sdn-h9c8k/sdn?follow=true: remote error: tls: internal error

Comment 7 Scott Dodson 2018-09-27 13:39:12 UTC
A number of CSR approval changes have been backported from 3.11 to 3.10 and may have addressed this. Can we please test with the latest 3.10 code?

Comment 8 Manoj Kumar 2018-09-27 13:51:47 UTC
I'm willing to test it on Power if you can send me the changes.

Comment 9 Weihua Meng 2018-10-01 05:06:29 UTC
I tried the following test matrix and did not hit this issue:
openshift-ansible-3.10.51-1.git.0.44a646c.el7.noarch
x86
EC2, GCP, OpenStack
docker, cri-o
HA, non-HA
with/without proxy
with/without system-container