Bug 1651120

Summary: Could not find the master api/ip after destroying the bootstrap during create cluster
Product: OpenShift Container Platform Reporter: Jian Zhang <jiazha>
Component: Installer Assignee: Alex Crawford <crawford>
Installer sub component: openshift-installer QA Contact: Jian Zhang <jiazha>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: high    
Priority: high CC: chezhang, crawford, dyan, jfan, jiazha, wking, zitang
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-01-23 18:19:09 UTC Type: Bug
Attachments:
Description Flags
The logs of destroying the bootstrap none

Description Jian Zhang 2018-11-19 09:06:53 UTC
Created attachment 1507144 [details]
The logs of destroying the bootstrap

Description of problem:
The master IP and route are deleted after destroying the bootstrap.


Version-Release number of the following components:
[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-243-g3d0ba6a0b0ef539970b5d8ae5542411b0bcb34b8
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

How reproducible:
always

Steps to Reproduce:
1. Create an OCP 4.0 cluster by following this doc: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md
2. Set the environment variables:
[jzhang@dhcp-140-18 installer]$ cat env.sh 
export OPENSHIFT_INSTALL_PLATFORM=libvirt
export OPENSHIFT_INSTALL_BASE_DOMAIN=tt.testing
export OPENSHIFT_INSTALL_CLUSTER_NAME=jian
export OPENSHIFT_INSTALL_PULL_SECRET_PATH=`pwd`/config.json
export OPENSHIFT_INSTALL_LIBVIRT_URI=qemu+tcp://192.168.122.1/system
export OPENSHIFT_INSTALL_EMAIL_ADDRESS=jiazha@redhat.com
export OPENSHIFT_INSTALL_PASSWORD=redhat
export OPENSHIFT_INSTALL_SSH_PUB_KEY_PATH=/home/jzhang/.ssh/id_rsa.pub

3. [jzhang@dhcp-140-18 installer]$ openshift-install create cluster --dir=jian --log-level=debug
 

Actual results:
[jzhang@dhcp-140-18 installer]$ sudo virsh list --all
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 19    jian-master-0                  running
 20    jian-worker-0-fqmrd            running
[jzhang@dhcp-140-18 installer]$ oc get ns
Unable to connect to the server: dial tcp 192.168.126.11:6443: connect: no route to host
[jzhang@dhcp-140-18 installer]$ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" domifaddr "${OPENSHIFT_INSTALL_CLUSTER_NAME}-master-0"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------

Expected results:
The cluster can be accessed successfully after the bootstrap node is destroyed.

Additional info:
[jzhang@dhcp-140-18 installer]$ sudo cat /var/lib/libvirt/dnsmasq/*
##WARNING:  THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
##OVERWRITTEN AND LOST.  Changes to this configuration should be made using:
##    virsh net-edit default
## or other application using the libvirt API.
##
## dnsmasq conf file created by libvirt
strict-order
pid-file=/var/run/libvirt/network/default.pid
except-interface=lo
bind-dynamic
interface=virbr0
dhcp-range=192.168.122.2,192.168.122.254
dhcp-no-override
dhcp-authoritative
dhcp-lease-max=253
dhcp-hostsfile=/var/lib/libvirt/dnsmasq/default.hostsfile
addn-hosts=/var/lib/libvirt/dnsmasq/default.addnhosts
192.168.126.11	jian-api	jian-etcd-0	
192.168.126.50	jian	
##WARNING:  THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
##OVERWRITTEN AND LOST.  Changes to this configuration should be made using:
##    virsh net-edit jian
## or other application using the libvirt API.
##
## dnsmasq conf file created by libvirt
strict-order
local=/tt.testing/
domain=tt.testing
expand-hosts
pid-file=/var/run/libvirt/network/jian.pid
except-interface=lo
bind-dynamic
interface=tt0
srv-host=_etcd-server-ssl._tcp.jian.tt.testing,jian-etcd-0.tt.testing,2380,0,10
addn-hosts=/var/lib/libvirt/dnsmasq/jian.addnhosts
[

]
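
The `cat` dump above concatenates every file in the directory, so it is not obvious which file each line came from or which files are empty. A minimal sketch that prints a header before each file; `dump_with_names` is a hypothetical helper name, not part of libvirt or the installer:

```shell
# Print each file preceded by a header line with its path,
# so that empty files remain visible in the combined output.
dump_with_names() {
  for f in "$@"; do
    printf '==> %s <==\n' "$f"
    cat "$f"
  done
}

# Usage (run as a user that can read the files):
#   dump_with_names /var/lib/libvirt/dnsmasq/*
```

Running `tail -n +1` on several files gives similar per-file headers without a helper function.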

Comment 1 Alex Crawford 2018-11-20 00:52:57 UTC
You need to update your libvirt provider to the latest version - specifically, you need https://github.com/dmacvicar/terraform-provider-libvirt/pull/469.

Comment 2 Jian Zhang 2018-11-20 04:32:56 UTC
Alex,

Thanks, I updated it.
The old version:
[jzhang@dhcp-140-18 plugins]$ ./terraform-provider-libvirt -version
./terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 3.7.0
Using library: libvirt 4.1.0
Running hypervisor: QEMU 2.11.2
Running against daemon: 4.1.0

The new version:
[jzhang@dhcp-140-18 plugins]$ ./terraform-provider-libvirt -version
./terraform-provider-libvirt was not built correctly
Compiled against library: libvirt 4.1.0
Using library: libvirt 4.1.0
Running hypervisor: QEMU 2.11.2
Running against daemon: 4.1.0
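
The two outputs above differ only in the "Compiled against library" line (3.7.0 versus 4.1.0). A minimal sketch for comparing that line with the "Using library" line, assuming the `-version` output keeps the format shown here; `libvirt_build_check` is a hypothetical helper name:

```shell
# Read `terraform-provider-libvirt -version` output on stdin and compare
# the "Compiled against library" and "Using library" versions.
libvirt_build_check() {
  awk -F': libvirt ' '
    /^Compiled against library:/ { compiled = $2 }
    /^Using library:/            { running = $2 }
    END {
      # Appending "" forces a string comparison instead of a numeric one.
      if (compiled != "" && (compiled "") == (running ""))
        print "ok " compiled
      else
        print "mismatch compiled=" compiled " using=" running
    }
  '
}

# Usage (add 2>&1 if the provider prints its version to stderr):
#   ./terraform-provider-libvirt -version | libvirt_build_check
```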

I also updated openshift-install, but now hit a different error: the worker node did not come up.

[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-250-g30bb25ac57d7c7d3dae519186cbfca9af8aeaca2
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

[jzhang@dhcp-140-18 installer]$ sudo virsh list --all
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 21    demo2-bootstrap                running
 22    demo2-master-0                 running

[jzhang@dhcp-140-18 installer]$ oc get pods --all-namespaces
NAMESPACE                   NAME                                       READY     STATUS    RESTARTS   AGE
openshift-cluster-version   cluster-version-operator-8bb6cff75-492km   0/1       Pending   0          45m
[jzhang@dhcp-140-18 installer]$ oc get nodes
No resources found.
[jzhang@dhcp-140-18 installer]$ oc get all -n openshift-cluster-api
No resources found.

I can get the master's IP address successfully, but still cannot resolve its hostname.
[jzhang@dhcp-140-18 installer]$  virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" domifaddr "${OPENSHIFT_INSTALL_CLUSTER_NAME}-master-0"
setlocale: No such file or directory
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 vnet1      32:b0:b7:03:99:af    ipv4         192.168.126.11/24
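
A minimal sketch for scripting this check, assuming the `virsh domifaddr` table layout shown above; `domifaddr_ipv4` is a hypothetical helper name. It prints nothing when the table is empty, as in the original description:

```shell
# Read `virsh domifaddr` output on stdin and print each IPv4 address
# without its prefix length (192.168.126.11/24 becomes 192.168.126.11).
domifaddr_ipv4() {
  awk '$3 == "ipv4" { sub(/\/.*/, "", $4); print $4 }'
}

# Usage:
#   virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" domifaddr \
#     "${OPENSHIFT_INSTALL_CLUSTER_NAME}-master-0" | domifaddr_ipv4
```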

[jzhang@dhcp-140-18 installer]$ ssh core@demo2-master-0-tt.testing
ssh: Could not resolve hostname demo2-master-0-tt.testing: Name or service not known

After accessing the master and checking the kubelet log, I got the errors below. What do you suggest?

[jzhang@dhcp-140-18 installer]$ ssh core@192.168.126.11
[core@demo2-master-0 ~]$ journalctl -f -u kubelet
...
Nov 20 04:27:43 demo2-master-0 hyperkube[5132]: I1120 04:27:43.395217    5132 kubelet_node_status.go:79] Attempting to register node demo2-master-0
Nov 20 04:27:43 demo2-master-0 hyperkube[5132]: E1120 04:27:43.396444    5132 kubelet_node_status.go:103] Unable to register node "demo2-master-0" with API server: nodes is forbidden: User "system:anonymous" cannot create nodes at the cluster scope: no RBAC policy matched

Comment 3 Alex Crawford 2018-11-20 22:08:58 UTC
The DNS issue can be resolved by following the libvirt setup guide in the installer. Can you also try following the troubleshooting guide? That should help highlight which component is failing.

Comment 4 Jian Zhang 2018-11-21 02:44:03 UTC
Sorry, I didn't find the solution in the troubleshooting section. The issue I hit is that the worker node did NOT come up. Is it a DNS issue? Can you tell me how to debug it? I checked the log of the kubelet running on the master node and got these errors:
Nov 20 04:27:43 demo2-master-0 hyperkube[5132]: I1120 04:27:43.395217    5132 kubelet_node_status.go:79] Attempting to register node demo2-master-0
Nov 20 04:27:43 demo2-master-0 hyperkube[5132]: E1120 04:27:43.396444    5132 kubelet_node_status.go:103] Unable to register node "demo2-master-0" with API server: nodes is forbidden: User "system:anonymous" cannot create nodes at the cluster scope: no RBAC policy matched

Comment 5 W. Trevor King 2018-11-26 06:28:08 UTC
> Sorry, I didn't find the solution in the troubleshooting section. The issue I hit is that the worker node did NOT come up.

The section for that is [1].  It suggests looking at the logs for the clusterapi-manager-controllers pod.  What did you see when you looked at those logs?

[1]: https://github.com/openshift/installer/blob/v0.4.0/docs/user/troubleshooting.md#no-worker-nodes-created
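
A minimal sketch for locating that pod before reading its logs, assuming it runs in the `openshift-cluster-api` namespace queried earlier in this report; `pods_with_prefix` is a hypothetical helper name, and the full pod name carries a generated suffix that has to be looked up:

```shell
# Read `oc get pods` output on stdin (header on the first line) and print
# the names of pods that start with the given prefix.
pods_with_prefix() {
  awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 { print $1 }'
}

# Usage (the namespace is an assumption; a -c <container> argument may be
# needed if the pod has more than one container):
#   oc get pods -n openshift-cluster-api | pods_with_prefix clusterapi-manager-controllers
#   oc logs -n openshift-cluster-api <pod-name>
```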