Bug 1766066 - [proxy] report-progress.sh on bootstrap need external api server to create bootstrap configmap resource
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.z
Assignee: Jeremiah Stuever
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On: 1762618
Blocks:
Reported: 2019-10-28 06:56 UTC by Johnny Liu
Modified: 2019-11-19 13:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1762618
Environment:
Last Closed: 2019-11-19 13:49:01 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 388 0 'None' closed Bug 1766066: Revert "Merge pull request #334 from danehans/bz_1758656" 2020-10-07 01:53:24 UTC
Github openshift installer pull 2640 0 'None' closed Bug 1766066: Revert "Merge pull request #2471 from openshift-cherrypick-robot/cher… 2020-10-07 01:53:16 UTC
Red Hat Product Errata RHBA-2019:3869 0 None None None 2019-11-19 13:49:12 UTC

Description Johnny Liu 2019-10-28 06:56:25 UTC
This bug is also reproduced in 4.2.0-0.nightly-2019-10-26-055649; it was found while fixing BZ#1758663.

+++ This bug was initially created as a clone of Bug #1762618 +++

Description of problem:

Version-Release number of the following components:
4.3.0-0.nightly-2019-10-16-010826

How reproducible:
Always

Steps to Reproduce:
1. Drop the internet gateway for the private subnets in the VPC to create a disconnected environment.
2. Set up a proxy in the public subnets; the proxy can reach both the external and the internal network.
3. On the proxy, use a whitelist to control which traffic can get through, and do NOT add the API URL to the list. For example:
acl whitelist dstdomain ec2.us-east-2.amazonaws.com iam.amazonaws.com .s3.us-east-2.amazonaws.com .apps.jialiu-42dis8.qe.devcluster.openshift.com ec2-18-191-189-164.us-east-2.compute.amazonaws.com .github.com .rubygems.org 
http_access allow whitelist
4. Enable the proxy settings in install-config.yaml.
5. Trigger a UPI install on AWS.

Actual results:
$ ./openshift-install wait-for bootstrap-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443..."
level=info msg="API v1.16.0-beta.2+453eff1 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Use the following commands to gather logs from the cluster"
level=info msg="openshift-install gather bootstrap --help"
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

Expected results:
Installation succeeds.

Additional info:
Logging in to the bootstrap node shows that the bootkube service completed successfully.
$ journalctl -b -f -u bootkube.service
-- Logs begin at Wed 2019-10-16 09:26:30 UTC. --
Oct 16 09:38:35 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-control-plane-client-signer.yaml" secrets.v1./kube-control-plane-signer -n openshift-kube-apiserver-operator as it already exists
Oct 16 09:38:35 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-csr-signer-signer.yaml" secrets.v1./csr-signer-signer -n openshift-kube-controller-manager-operator as it already exists
Oct 16 09:38:36 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-initial-kube-controller-manager-service-account-private-key.yaml" secrets.v1./initial-service-account-private-key -n openshift-config as it already exists
Oct 16 09:38:36 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-kube-apiserver-to-kubelet-signer.yaml" secrets.v1./kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator as it already exists
Oct 16 09:38:37 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-loadbalancer-serving-signer.yaml" secrets.v1./loadbalancer-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 16 09:38:37 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-localhost-serving-signer.yaml" secrets.v1./localhost-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 16 09:38:37 ip-10-0-61-231 bootkube.sh[1610]: Skipped "secret-service-network-serving-signer.yaml" secrets.v1./service-network-serving-signer -n openshift-kube-apiserver-operator as it already exists
Oct 16 09:38:38 ip-10-0-61-231 bootkube.sh[1610]: Skipped "user-ca-bundle-config.yaml" configmaps.v1./user-ca-bundle -n openshift-config as it already exists
Oct 16 09:38:38 ip-10-0-61-231 bootkube.sh[1610]: Tearing down temporary bootstrap control plane...
Oct 16 09:38:38 ip-10-0-61-231 bootkube.sh[1610]: bootkube.service complete

But report-progress.sh keeps reporting the following error:
Oct 16 11:37:12 ip-10-0-61-231 report-progress.sh[1611]: error: unable to recognize "STDIN": Get https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443/api?timeout=32s: Forbidden
Oct 16 11:37:18 ip-10-0-61-231 report-progress.sh[1611]: error: unable to recognize "STDIN": Get https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443/api?timeout=32s: Forbidden
Oct 16 11:37:23 ip-10-0-61-231 report-progress.sh[1611]: error: unable to recognize "STDIN": Get https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443/api?timeout=32s: Forbidden
Oct 16 11:37:28 ip-10-0-61-231 report-progress.sh[1611]: error: unable to recognize "STDIN": Get https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443/api?timeout=32s: Forbidden


# env|grep -i proxy
HTTP_PROXY=http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu-42dis8.qe.devcluster.openshift.com,etcd-0.jialiu-42dis8.qe.devcluster.openshift.com,etcd-1.jialiu-42dis8.qe.devcluster.openshift.com,etcd-2.jialiu-42dis8.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128

Check report-progress.sh code:
# cat /usr/local/bin/report-progress.sh
#!/usr/bin/env bash

KUBECONFIG="${1}"

wait_for_existance() {
	while [ ! -e "${1}" ]
	do
		sleep 5
	done
}

echo "Waiting for bootstrap to complete..."
wait_for_existance /opt/openshift/.bootkube.done

echo "Reporting install progress..."
while ! oc --config="$KUBECONFIG" create -f - <<-EOF
	apiVersion: v1
	kind: ConfigMap
	metadata:
	  name: bootstrap
	  namespace: kube-system
	data:
	  status: complete
EOF
do
	sleep 5
done


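The script runs the oc command against the external API server to create the bootstrap ConfigMap, but the external API server hostname is not in the NO_PROXY list. The request therefore goes through the proxy, which does not have the API URL in its whitelist and rejects it (Forbidden).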
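
To illustrate why the api-int name is reached directly while the api name is sent to the proxy, here is a minimal sketch of NO_PROXY matching. The function host_in_no_proxy is a hypothetical helper, not part of the installer, and the matching is simplified: exact hostname, or domain suffix for entries starting with a dot. Real clients (oc, curl, and others) each implement their own rules, and CIDR entries are not evaluated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the installer): returns 0 if HOST is
# covered by a comma-separated NO_PROXY-style list, 1 otherwise.
# Simplified matching: exact hostname, or domain suffix for entries that
# start with a dot. CIDR entries are only compared as literal strings.
host_in_no_proxy() {
	local host="$1" list="$2" entry
	local -a entries
	IFS=',' read -r -a entries <<< "$list"
	for entry in "${entries[@]}"; do
		[ -z "$entry" ] && continue
		if [ "$host" = "$entry" ]; then
			return 0
		fi
		# A leading dot means "any host under this domain".
		if [ "${entry:0:1}" = "." ] && [[ "$host" == *"$entry" ]]; then
			return 0
		fi
	done
	return 1
}

# Shortened version of the NO_PROXY value shown on the bootstrap node.
no_proxy_list=".cluster.local,.svc,api-int.jialiu-42dis8.qe.devcluster.openshift.com,localhost"
for h in api-int.jialiu-42dis8.qe.devcluster.openshift.com api.jialiu-42dis8.qe.devcluster.openshift.com; do
	if host_in_no_proxy "$h" "$no_proxy_list"; then
		echo "$h: direct"
	else
		echo "$h: via proxy"
	fi
done
```

With the list above, only the api-int name is reported as direct; the external api name falls through to the proxy, which matches the Forbidden errors in the log.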


This is a regression caused by https://github.com/openshift/installer/pull/2425

--- Additional comment from Scott Dodson on 2019-10-21 18:56:29 UTC ---

That PR was merged in order to resolve https://bugzilla.redhat.com/show_bug.cgi?id=1762618

My opinion is that customer proxy configuration should include external api in its whitelist.

--- Additional comment from Daneyon Hansen on 2019-10-21 19:10:46 UTC ---

What is the reason for report-progress.sh to use the api-server's external name instead of internal name?

--- Additional comment from Johnny Liu on 2019-10-22 01:07:08 UTC ---

> My opinion is that customer proxy configuration should include external api in its whitelist.
I think including the external API in the whitelist is only a workaround.
A more reasonable fix would be to run the oc command against the internal API from within the cluster itself instead of the external API.

> What is the reason for report-progress.sh to use the api-server's external
> name instead of internal name?

The default kubeconfig used by the oc command points at the API server's external name.
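
As a rough sketch of the "use the internal API" idea, the external URL can be rewritten to its api-int counterpart before calling oc. The function internal_api_url is a hypothetical helper and not the fix that was eventually applied; it assumes the api. / api-int. naming seen in the NO_PROXY output in this bug.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an external API URL
# (https://api.<cluster-domain>:6443) to its internal counterpart
# (https://api-int.<cluster-domain>:6443).
# URLs that do not start with https://api. are returned unchanged.
internal_api_url() {
	local url="$1"
	local prefix="https://api."
	if [[ "$url" == "$prefix"* ]]; then
		echo "https://api-int.${url#"$prefix"}"
	else
		echo "$url"
	fi
}

internal_api_url "https://api.jialiu-42dis8.qe.devcluster.openshift.com:6443"
# prints https://api-int.jialiu-42dis8.qe.devcluster.openshift.com:6443
```

The result could then be passed to oc with its --server option, for example oc --config="$KUBECONFIG" --server="$(internal_api_url ...)" create -f -. This assumes the api-int name resolves on the bootstrap node and is covered by the API server certificate; neither was verified in this bug.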

Comment 1 Eric Paris 2019-11-07 16:49:07 UTC
We've decided to revert:
https://github.com/openshift/cluster-network-operator/pull/334 in https://github.com/openshift/cluster-network-operator/pull/388
https://github.com/openshift/installer/pull/2471 in https://github.com/openshift/installer/pull/2640

This should put us back in the 4.2.0 - 4.2.2 state.

ewolinet is going to harden CI so we will detect the problem we are trying to avoid.

This BZ should be used to fix report-progress.sh and re-apply those 2 PRs.

Comment 2 Scott Dodson 2019-11-08 14:23:46 UTC
Fixed by reverting the change. We'll clone this for a fix that re-introduces the change and fixes report-progress.sh.

Comment 4 Johnny Liu 2019-11-11 08:08:16 UTC
Verified this bug with 4.2.4: PASS.

The default NO_PROXY list includes the API URL again.

# env|grep -i api
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu42.qe.devcluster.openshift.com,api.jialiu42.qe.devcluster.openshift.com,etcd-0.jialiu42.qe.devcluster.openshift.com,etcd-1.jialiu42.qe.devcluster.openshift.com,etcd-2.jialiu42.qe.devcluster.openshift.com,localhost,test.no-proxy.com
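
For anyone reproducing this state by hand, the following is a minimal sketch of adding a host to a NO_PROXY-style list without duplicating it. The function add_to_no_proxy is a hypothetical helper; it is not how the installer or the network operator builds the list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print LIST with HOST appended, unless HOST is
# already one of the comma-separated entries.
add_to_no_proxy() {
	local host="$1" list="$2"
	case ",$list," in
		*",$host,"*) echo "$list" ;;
		# ${list:+$list,} expands to "LIST," only when LIST is non-empty.
		*) echo "${list:+$list,}$host" ;;
	esac
}

add_to_no_proxy "api.jialiu42.qe.devcluster.openshift.com" "localhost,test.no-proxy.com"
# prints localhost,test.no-proxy.com,api.jialiu42.qe.devcluster.openshift.com
```

The match is on whole entries only, so a host that is already covered by a leading-dot domain entry would still be appended.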

Comment 6 errata-xmlrpc 2019-11-19 13:49:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3869

