Bug 1647660

Summary: How to consume ose-installer image
Product: OpenShift Container Platform Reporter: ge liu <geliu>
Component: Release Assignee: Tim Bielawa <tbielawa>
Status: CLOSED CURRENTRELEASE QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: aos-bugs, crawford, eparis, jialiu, jiazha, jokerman, mmccomas, smunilla, vlaad, wking
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-20 17:54:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description ge liu 2018-11-08 04:15:01 UTC
Description of problem:

We tried to install OCP 4.0 Next Gen, but we keep hitting all kinds of problems that block the installation. We need a stable NextG installation to move testing forward. The major problems are listed below:

1. Following the doc, the installation was difficult but eventually succeeded. The next day, however, the env no longer worked, perhaps because some images had been updated.
2. Installing again hit other problems, and progress was still difficult.

Install doc: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md

QE will add the detailed problem items in the comments on this bug.
Such as: https://github.com/openshift/installer/issues/610
https://github.com/openshift/installer/issues/570
https://github.com/openshift/installer/issues/620

How reproducible:
Always

Steps to Reproduce:


Actual results:
NextG installer is not stable for testing

Expected results:
NextG installer is stable for testing

Comment 1 ge liu 2018-11-08 07:29:25 UTC
We need a NextG puddle like this one: http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/4.0/v4.0.0-0.49.0_2018-11-06.1/x86_64/os/Packages, along with images whose versions match it.

Comment 2 Jian Zhang 2018-11-08 09:46:28 UTC
I also cannot create an OCP 4.0 cluster by following the doc. Am I missing something?

[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-161-ge64a43d293f594c1d51a317822aa6b4783295ecb
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

$ rpm -qa |grep libvirt
libvirt-4.1.0-5.fc28.x86_64

Steps to Reproduce:
1. Follow the above guide to create the OCP 4.0
2. [jzhang@dhcp-140-18 installer]$ openshift-install create cluster --dir=1108 --log-level=debug|tee ./1108/install-2.log

Actual results:
INFO Waiting for bootstrap completion...          
DEBUG API not up yet: Get https://demo-api.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: no route to host 

Additional info:
I also destroyed the cluster and reconfigured it with a new cluster name, demo2.
[jzhang@dhcp-140-18 installer]$ openshift-install destroy cluster --dir=1108 --log-level=debug

But I still got the same errors.
Strangely, it still requests "Get https://demo-api.tt.testing:6443"; it should be "Get https://demo2-api.tt.testing:6443".
Environment Variable:
[jzhang@dhcp-140-18 installer]$ env | grep -i cluster_name
OPENSHIFT_INSTALL_CLUSTER_NAME=demo2

Comment 3 Alex Crawford 2018-11-08 16:30:54 UTC
> Actual results:
> INFO Waiting for bootstrap completion...          
> DEBUG API not up yet: Get https://demo-api.tt.testing:6443/version?timeout=32s: dial tcp 192.168.126.11:6443: connect: no route to host 

"No route to host" indicates that you have a problem with your local network. Your kernel isn't able to route traffic from the installer to the nodes in the cluster.

Can you try the following command:

    ip route get 192.168.126.10

On my system, I see the following:

$ ip r get 192.168.126.10
192.168.126.10 dev tt0 src 192.168.126.1 uid 1000
    cache

Can you also check that you have a bridge named "tt0" on your system? That should have been created by the installer for your cluster.
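
The route check can also be scripted. The following is a minimal sketch, not part of the installer: `route_dev` is a hypothetical helper name, and the sample line is copied from the `ip route get` output earlier in this comment. It only confirms that the route leaves through the expected bridge.

```shell
# Print the device name found on the first line of `ip route get` output
# read from stdin. `route_dev` is a hypothetical helper, not an installer command.
route_dev() {
    awk 'NR == 1 { for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Sample line copied from the output above. On a live host, pipe in
# `ip route get 192.168.126.10` instead of the sample.
sample='192.168.126.10 dev tt0 src 192.168.126.1 uid 1000'
dev=$(printf '%s\n' "$sample" | route_dev)

if [ "$dev" = "tt0" ]; then
    echo "route uses bridge $dev"
else
    echo "unexpected route device: ${dev:-none}"
fi
```

If the device is anything other than tt0, or empty, the kernel is not routing cluster traffic through the bridge created for the cluster.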

Comment 4 Jian Zhang 2018-11-09 02:39:11 UTC
Sure, as below:
[jzhang@dhcp-140-18 installer]$ ip route get 192.168.126.10
192.168.126.10 dev tt0 src 192.168.126.1 uid 1000 
    cache
[jzhang@dhcp-140-18 installer]$ brctl show
bridge name	bridge id		STP enabled	interfaces
tt0		8000.52540038d527	yes		tt0-nic
							vnet0
							vnet2
virbr0		8000.5254008e224a	yes		virbr0-nic

Comment 5 Jian Zhang 2018-11-09 05:44:13 UTC
FYI,

I rebuilt it with the latest master, and it seems to work well.
[jzhang@dhcp-140-18 installer]$ openshift-install version
openshift-install v0.3.0-172-gfc8e1ec41927fdfe0a1878aa3bb41e916ceda0b1
Terraform v0.11.8

Your version of Terraform is out of date! The latest version
is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

But I still got the errors below:
DEBUG added openshift-master-controllers.156556714825a16c: controller-manager-lmtfr became leader 
WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2519 
ERROR waiting for bootstrap-complete: timed out waiting for the condition 
INFO Install complete! Run 'export KUBECONFIG=/home/jzhang/goproject/src/github.com/openshift/installer/demo2/auth/kubeconfig' to manage your cluster. 
INFO After exporting your kubeconfig, run 'oc -h' for a list of OpenShift client commands. 

And, one more thing, KUBECONFIG wasn't configured correctly, although the logs above show the correct path.
[jzhang@dhcp-140-18 installer]$ echo $KUBECONFIG
/home/jzhang/goproject/src/github.com/openshift/installer/auth/kubeconfig

And, one question, I also encountered the issue "ImagePullError" after running the cluster for a while. How can I change the images of one component?

Comment 6 W. Trevor King 2018-11-09 07:39:11 UTC
> WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2519

This just means the watcher dropped.  On libvirt it can usually reconnect without trouble.  On AWS it usually hangs [1].

> ERROR waiting for bootstrap-complete: timed out waiting for the condition 

This means either the watcher or the cluster boot itself hung, and that the bootstrap resources won't be automatically removed.  You can still remove those resources yourself with:

  $ openshift-install --dir=whatever-you-used-for-create-cluster --log-level=debug destroy bootstrap

> INFO Install complete! Run 'export KUBECONFIG=/home/jzhang/goproject/src/github.com/openshift/installer/demo2/auth/kubeconfig' to manage your cluster. 
> ...
> And, one more thing, KUBECONFIG wasn't configured correctly, although the logs above show the correct path.
> [jzhang@dhcp-140-18 installer]$ echo $KUBECONFIG
> /home/jzhang/goproject/src/github.com/openshift/installer/auth/kubeconfig

The installer cannot change environment variables in the shell from which you invoked it.  It just logs the `export KUBECONFIG=...` suggestion to stdout.  If you want to use that kubeconfig, you need to copy/paste the export line into your terminal.
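
As a minimal illustration of why this happens (a sketch only; `DEMO_KUBECONFIG` and the `/tmp/demo` path are hypothetical and unrelated to the installer), a variable exported by a child process never reaches the parent shell:

```shell
# A child process (standing in for the installer) exports a variable.
# The export dies with the child; the parent shell never sees it.
unset DEMO_KUBECONFIG
sh -c 'export DEMO_KUBECONFIG=/tmp/demo/auth/kubeconfig'
echo "after child process: ${DEMO_KUBECONFIG:-unset}"

# Running the export in the current shell is what actually takes effect.
export DEMO_KUBECONFIG=/tmp/demo/auth/kubeconfig
echo "after export in this shell: $DEMO_KUBECONFIG"
```

The same applies to any program started from a shell, which is why the logged export line has to be run by hand.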

> And, one question, I also encountered...

Separate bugs?  It's going to get confusing if multiple issues get lumped into the same bug.

[1]: https://github.com/openshift/installer/pull/606

Comment 7 Jian Zhang 2018-11-09 10:04:39 UTC
Thanks for your clarification! 
> Separate bugs?  It's going to get confusing if multiple issues get lumped into the same bug.

OK, I created bug 1648270 to track this problem.

Comment 9 ge liu 2018-11-22 01:53:31 UTC
Yes,
We need a 4.0 NextGen puddle and matching images for testing. As I understand it, the release team owns puddle promotion, right? Please correct me if I have misunderstood. Thanks in advance!

Currently, we still follow the doc to install (such as: https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md), and there is no exact OCP version. We need puddles that cover different cloud clusters (such as AWS, libvirt, etc.) and versions. Thanks again.

Comment 10 ge liu 2018-11-28 06:43:57 UTC
Closing this bug and filing a new bug (Bug 1654137) to track the current issue.

Comment 11 Alex Crawford 2018-11-29 00:03:39 UTC
*** Bug 1654137 has been marked as a duplicate of this bug. ***