Bug 1831760 - Invalid bootstrap APIServer certificates - Azure
Summary: Invalid bootstrap APIServer certificates - Azure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.z
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: Etienne Simard
URL:
Whiteboard:
Depends On:
Blocks: 1828382 1832137
Reported: 2020-05-05 15:13 UTC by Mangirdas Judeikis
Modified: 2020-07-13 17:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:35:00 UTC
Target Upstream Version:

Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3551 0 None closed Bug 1831760: Fix bootstrap certificate generation 2020-11-17 15:05:06 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:35:20 UTC

Description Mangirdas Judeikis 2020-05-05 15:13:09 UTC
Description of problem:

The Azure Load Balancer does not accept the API server certificates for HTTPS probes.

The load balancer considers the certificate invalid.
Error from the Azure side (not visible to normal users):
WINHTTP_CALLBACK_STATUS_FLAG_INVALID_CERT


Based on https://tools.ietf.org/html/rfc3280#section-4.2.1.1 and https://tools.ietf.org/html/rfc3280#section-4.2.1.2, we are not using SubjectKeyId and AuthorityKeyId as specified.

Current certificate configuration:

CA:
```
...
CA:TRUE
 X509v3 Subject Key Identifier:  
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8 
...
```

Certificate:
```
...
X509v3 Subject Key Identifier:  
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
X509v3 Authority Key Identifier:  
 keyid:81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
...
```

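These two fields should not be equal in a certificate signed by a CA: the Subject Key Identifier should identify the certificate's own public key, while the Authority Key Identifier should identify the key of the issuing CA. Here the signed certificate reuses the CA's Subject Key Identifier, which is considered an invalid configuration.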

How to reproduce:

1. Create two Azure VMs, an internal load balancer, and a vnet. Two VMs are needed because Azure does not allow "same leg" recursive calls via the load balancer.
2. Add an HTTPS probe and load-balancing rules for port 8443.
3. SSH into VM1 and run script: https://gist.github.com/mjudeikis/4c0fc47552897bf13e82414b7d8a9f28 
4. SSH into VM2 and try reaching VM1 via NODE IP (curl https://ip:8443/readyz -Ik). This should work.
5. Try reaching VM1 via Loadbalancer IP - This should fail.
If you run ssldump on VM1:
   ssldump -i eth0 port 8443 

you will see that the load balancer terminates the connection and never sends the ClientKeyExchange message.

6. Run the same script, but change the code behaviour for signed certificates (search for GOODCONFIG in the gist).

Now calls via the LB should work, as the certificate is considered valid.

Comment 5 Etienne Simard 2020-05-12 02:09:48 UTC
Verified with:

./installer_https_fix/openshift-install version
./installer_https_fix/openshift-install unreleased-master-3026-ge476c483ed99c9cf2982529178e668dbcaf3ed5e-dirty
built from commit e476c483ed99c9cf2982529178e668dbcaf3ed5e
release image registry.svc.ci.openshift.org/origin/release:4.5

I've downloaded the installer source code and changed both azurerm_lb_probe templates (https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138 and https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164) to the following configurations:

~~~
resource "azurerm_lb_probe" "internal_lb_probe_api_internal" {
  name                = "api-internal-probe"
  resource_group_name = var.resource_group_name
  interval_in_seconds = 5
  number_of_probes    = 2
  loadbalancer_id     = azurerm_lb.internal.id
  port                = 6443
  protocol            = "HTTPS"
  request_path        = "/readyz"

}
~~~
~~~
resource "azurerm_lb_probe" "public_lb_probe_api_internal" {
  count = var.private ? 0 : 1

  name                = "api-internal-probe"
  resource_group_name = var.resource_group_name
  interval_in_seconds = 5
  number_of_probes    = 2
  loadbalancer_id     = azurerm_lb.public.id
  port                = 6443
  protocol            = "HTTPS"
  request_path        = "/readyz"
}

~~~

I've compiled the `openshift-install` binary after those changes by running `hack/build.sh`:

./installer_https_fix/openshift-install version
./installer_https_fix/openshift-install unreleased-master-3026-ge476c483ed99c9cf2982529178e668dbcaf3ed5e-dirty
built from commit e476c483ed99c9cf2982529178e668dbcaf3ed5e
release image registry.svc.ci.openshift.org/origin/release:4.5

After exporting the release image override, I was able to install the cluster (with the https health check).

$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="registry.svc.ci.openshift.org/ocp/release:4.5"
$ ./installer_https_fix/openshift-install create cluster --dir httpsfix2
? SSH Public Key /home/qe/.ssh/id_rsa.pub
? Platform azure
INFO Credentials loaded from file "/home/qe/.azure/osServicePrincipal.json" 
? Region centralus
? Base Domain qe.cluster.openshift.com
? Cluster Name esshttps02
? Pull Secret [? for help] ********
WARNING Found override for release image. Please be warned, this is not advised 
INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s for the Kubernetes API at https://api.qe.cluster.openshift.com:6443... 
INFO API v1.18.2 up                               
INFO Waiting up to 40m0s for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 30m0s for the cluster at https://api.qe.cluster.openshift.com:6443 to initialize... 
INFO Waiting up to 10m0s for the openshift-console route to be created... 
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/qe/TESTS/bugzilla/1831760-certificate/httpsfix2/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.qe.cluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password:
INFO Time elapsed: 42m29s                     


My understanding is that this should be enough to confirm that the HTTPS certificate now works on the bootstrap node. If you need anything else or have more details to add, please let me know.

Comment 6 Etienne Simard 2020-05-12 16:32:05 UTC
Cluster creation fails when using the same Azure LB HTTPS probes without https://github.com/openshift/installer/pull/3551 :

~~~
INFO Waiting up to 20m0s for the Kubernetes API at https://api.qe.cluster.openshift.com:6443... 
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.qe.cluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.89.117.146:6443: i/o timeout 
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "/home/qe/TESTS/bugzilla/1831760-certificate/httpstofail2/log-bundle-20200512114414.tar.gz" 
FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded
~~~

I also noticed a lot of TLS handshake errors like the one below in the kube-apiserver bootstrap logs: 

~~~
bootstrap/containers/kube-apiserver-b768f854cf03134666fdf5a3b6abb48f3143863e3a33c02cbdf97b660b56cc92.log:I0512 15:22:39.201052       1 log.go:172] http: TLS handshake error from 168.63.129.16:57049: EOF
~~~

Comment 7 errata-xmlrpc 2020-07-13 17:35:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
