Bug 1810302 - Worker node loops at Certificate signed by unknown authority [NEEDINFO]
Summary: Worker node loops at Certificate signed by unknown authority
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Stephen Benjamin
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-04 22:17 UTC by DirectedSoul
Modified: 2020-04-23 16:10 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-23 16:10:09 UTC
Target Upstream Version:
stbenjam: needinfo? (shegde)


Attachments
A screenshot showing CA error (277.17 KB, image/png)
2020-03-04 22:17 UTC, DirectedSoul

Description DirectedSoul 2020-03-04 22:17:09 UTC
Created attachment 1667591 [details]
A screenshot showing CA error

Description of problem: IPv6 disconnected installation.
After the cluster is up and running (masters are in the Ready state), the worker nodes never show up and remain in an error/inspecting state.

```
oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
master-0.kni7.cloud.lab.eng.bos.redhat.com   Ready    master   24h   v1.16.2
master-1.kni7.cloud.lab.eng.bos.redhat.com   Ready    master   24h   v1.16.2
master-2.kni7.cloud.lab.eng.bos.redhat.com   Ready    master   24h   v1.16.2
```
Version:

```
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          24h     Unable to apply 4.3.0-0.nightly-2020-03-02-070732: some cluster operators have not yet rolled out 
```
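The STATUS column says some cluster operators have not rolled out. Listing the operators that are not yet Available narrows down which ones are blocking. A minimal sketch, parsing a sample `oc get clusteroperators` listing (the sample rows below are illustrative, not taken from this cluster):

```shell
# Sample `oc get clusteroperators` excerpt (assumption: illustrative rows,
# not from this cluster). On a live cluster, pipe `oc get clusteroperators`
# into the awk filter instead.
co_out='NAME              VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication    4.3.0     True        False         False
machine-config    4.3.0     False       True          True'

# Print every operator whose AVAILABLE column is not True.
echo "$co_out" | awk 'NR > 1 && $3 != "True" {print $1}'
```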
How reproducible:
 
Bring up a bare-metal IPI cluster using the above build. After the cluster is up and running, the workers never show up, and there is no sign of a CSR approval request either:
```
oc get csr
NAME        AGE   REQUESTOR                                                CONDITION
csr-2vqvz   84m   system:node:master-1.kni7.cloud.lab.eng.bos.redhat.com   Approved,Issued
```
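Workers looping on an unknown-authority certificate often never get as far as submitting a CSR, but it is still worth checking for Pending CSRs that simply were not approved. A minimal sketch, assuming the standard `oc get csr` column layout (the Pending row and its name `csr-x7k2p` are hypothetical):

```shell
# Hypothetical `oc get csr` output; on a live cluster, pipe `oc get csr`
# into the awk filter instead. The Pending row is an invented example.
csr_list='NAME        AGE   REQUESTOR                                                CONDITION
csr-2vqvz   84m   system:node:master-1.kni7.cloud.lab.eng.bos.redhat.com   Approved,Issued
csr-x7k2p   5m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending'

# Print the names of CSRs whose last column is Pending; each can then be
# approved on a live cluster with: oc adm certificate approve <name>
echo "$csr_list" | awk '$NF == "Pending" {print $1}'
```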
At this point, jump into iDRAC and observe the worker nodes: each is in a continuous loop stating "x509: certificate signed by unknown authority".

Steps to Reproduce:
1. Bring up the cluster
2. Wait for the master nodes to come up
3. Observe the worker nodes using iDRAC to see they are in loop state (a screenshot is attached)


Actual results:
After the masters are up, the worker nodes never reach the Ready state; instead they error on certificate approval (screenshot attached).

Expected results:
Master and worker nodes both should be in Ready state

Additional info:

1. The BareMetalHost resources show the workers stuck in inspection:

```
oc get bmh -n openshift-machine-api
NAME       STATUS   PROVISIONING STATUS      CONSUMER              BMC                                            HARDWARE PROFILE   ONLINE   ERROR
master-0   OK       externally provisioned   kni7-master-0         ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:100]                      true
master-1   OK       externally provisioned   kni7-master-1         ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:101]                      true
master-2   error    registering              kni7-master-2         ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:102]                      true     Failed to get power state for node c1989dcd-3e04-41ca-93a1-9311050b6f38. Error: IPMI call failed: power status.
worker-0   error    inspecting               kni7-worker-0-87smw   ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:104]                      true     Introspection timeout
worker-1   error    inspecting                                     ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:105]                      true     Introspection timeout
worker-2   error    inspecting               kni7-worker-0-fqvst   ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:106]                      true     Introspection timeout
```
2. A strange scenario: even after the control plane is up, the bootstrap VM is still running.

```
sudo virsh list
 Id    Name                           State
----------------------------------------------------
 1     kni7-nfkp5-bootstrap           running

```
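The bootstrap VM is normally torn down once bootstrapping completes, so a lingering running domain is itself suspicious. A minimal sketch that flags a still-running bootstrap domain in `virsh list` output (the sample reproduces the listing above; `openshift-install destroy bootstrap` is the usual manual cleanup):

```shell
# Sample `sudo virsh list` output (matches the listing above); on the
# provisioning host, pipe `sudo virsh list` into the awk filter instead.
virsh_out=' Id    Name                           State
----------------------------------------------------
 1     kni7-nfkp5-bootstrap           running'

# Print the name of any domain that contains "bootstrap" and is running.
echo "$virsh_out" | awk '/bootstrap/ && /running/ {print $2}'

# If a name is printed, the usual cleanup on the provisioning host is:
#   openshift-install destroy bootstrap --dir <install-dir>
```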

3. Inspection of the logs clearly shows the control plane has been created:

```
Mar 03 22:28:58 localhost bootkube.sh[8730]: Tearing down temporary bootstrap control plane...
Mar 03 22:28:59 localhost podman[10575]: 2020-03-03 22:28:59.007795759 +0000 UTC m=+979.882708987 container died d798e3b444c911d55eb73aaa18063b50f8e7afbdda48dbe7fef8ad0b977736cd (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72e8f7e1d55891a5643f9c50dd34d9352b87cb95788af1ae631b866b9ea5955e, name=youthful_clarke)
Mar 03 22:28:59 localhost podman[10575]: 2020-03-03 22:28:59.573262095 +0000 UTC m=+980.448175325 container remove d798e3b444c911d55eb73aaa18063b50f8e7afbdda48dbe7fef8ad0b977736cd (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72e8f7e1d55891a5643f9c50dd34d9352b87cb95788af1ae631b866b9ea5955e, name=youthful_clarke)
Mar 03 22:28:59 localhost bootkube.sh[8730]: bootkube.service complete
```

4. All the containers are still up in the bootstrap VM:

```
[core@localhost ~]$ sudo podman ps
CONTAINER ID  IMAGE                                                                                                                   COMMAND  CREATED       STATUS           PORTS  NAMES
60bafcc2bdf7  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0232366f9b47f5f75c4bd502a60c41a5b0dc5bac8ca50e3676ebc6036d0f539b           25 hours ago  Up 25 hours ago         ironic-api
0efee46dac66  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad4b2438b1a8336c640d42ee7d5be21cff5242a00ec47326d9d12dbcb8250f54           25 hours ago  Up 25 hours ago         ironic-inspector
2c8a8cc0d791  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0232366f9b47f5f75c4bd502a60c41a5b0dc5bac8ca50e3676ebc6036d0f539b           25 hours ago  Up 25 hours ago         ironic-conductor
a959cd89efea  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0232366f9b47f5f75c4bd502a60c41a5b0dc5bac8ca50e3676ebc6036d0f539b           25 hours ago  Up 25 hours ago         httpd
b5c37b42b729  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0232366f9b47f5f75c4bd502a60c41a5b0dc5bac8ca50e3676ebc6036d0f539b           25 hours ago  Up 25 hours ago         dnsmasq
69588ce43687  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0232366f9b47f5f75c4bd502a60c41a5b0dc5bac8ca50e3676ebc6036d0f539b           25 hours ago  Up 25 hours ago         mariadb

```

Comment 1 Stephen Benjamin 2020-03-19 17:50:56 UTC
It looks like the control plane is up. Could you get a must-gather and upload it somewhere?

Also, the workers all show inspection timeouts, so screenshots of the consoles while this is happening would be helpful. Is IPA reporting a problem talking back to Ironic? Were the workers perhaps already powered on from a previous install attempt?

Not strictly related, but one of your masters also seems to have a BMC problem that prevents us from fetching its power state:
  Failed to get power state for node c1989dcd-3e04-41ca-93a1-9311050b6f38. Error: IPMI call failed: power status.
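The BMC address in the bmh listing is a bracketed IPv6 URL, while ipmitool expects a bare address, so the failing power check can be reproduced by hand. A minimal sketch (the lanplus interface choice and the credentials are placeholders, not from this report):

```shell
# BMC URL copied from the bmh listing for master-2.
bmc_url='ipmi://[fd35:919d:4042:2:c7ed:9a9f:a9ec:102]'

# Strip the ipmi:// scheme and the IPv6 brackets to get the bare address
# that ipmitool expects.
bmc_host=$(echo "$bmc_url" | sed -e 's|^ipmi://||' -e 's/\[//' -e 's/\]//')
echo "$bmc_host"

# On the provisioning host (placeholder credentials, lanplus assumed):
#   ipmitool -I lanplus -H "$bmc_host" -U <user> -P <pass> power status
```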

Comment 2 Beth White 2020-04-23 16:10:09 UTC
No response to the needinfo request for further details in over a month. Closing due to insufficient data. Please reopen with further details if you are still experiencing this issue.

