Bug 1831760

Summary: Invalid bootstrap APIServer certificates - Azure
Product: OpenShift Container Platform
Reporter: Mangirdas Judeikis <mjudeiki>
Component: Installer
Assignee: Abhinav Dahiya <adahiya>
Installer sub component: openshift-installer
QA Contact: Etienne Simard <esimard>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: jminter, mgahagan, wking
Version: 4.3.z
Target Milestone: ---
Target Release: 4.5.0
Hardware: All
OS: All
Last Closed: 2020-07-13 17:35:00 UTC Type: Bug
Bug Blocks: 1828382, 1832137    

Description Mangirdas Judeikis 2020-05-05 15:13:09 UTC
Description of problem:

The Azure load balancer does not accept the API server certificates for HTTPS probes.

The load balancer considers the certificate invalid.
Error from the Azure side (not visible to normal users):
WINHTTP_CALLBACK_STATUS_FLAG_INVALID_CERT


According to RFC 3280, sections 4.2.1.1 (https://tools.ietf.org/html/rfc3280#section-4.2.1.1) and 4.2.1.2 (https://tools.ietf.org/html/rfc3280#section-4.2.1.2), we are not setting the Subject Key Identifier and Authority Key Identifier as the specification requires.
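To illustrate why the two identifiers normally differ, here is a minimal sketch of method (1) from RFC 3280 section 4.2.1.2, which derives a key identifier as the SHA-1 hash of the subjectPublicKey bits. The function name and the placeholder key bytes are assumptions made for this example; this is not the installer's code, and the RFC allows other derivation methods.

```python
import hashlib


def key_identifier(subject_public_key: bytes) -> str:
    """Return an RFC 3280 method (1) key identifier as colon-separated hex.

    `subject_public_key` is assumed to be the raw value of the
    subjectPublicKey BIT STRING (no tag, length, or unused-bits octet).
    """
    digest = hashlib.sha1(subject_public_key).digest()  # 160-bit SHA-1
    return ":".join(f"{byte:02X}" for byte in digest)


# Placeholder byte strings standing in for two distinct public keys
# (hypothetical values, not real key material).
ca_key = b"hypothetical CA public key bytes"
leaf_key = b"hypothetical leaf public key bytes"

ca_ski = key_identifier(ca_key)      # CA: Subject Key Identifier
leaf_ski = key_identifier(leaf_key)  # leaf: Subject Key Identifier
leaf_aki = ca_ski                    # leaf: Authority Key Identifier points at the CA key
```

With distinct keys, `leaf_ski` differs from `ca_ski`, while `leaf_aki` equals `ca_ski`.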

Current certificate configuration:

CA:
```
...
CA:TRUE
 X509v3 Subject Key Identifier:  
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8 
...
```

Certificate:
```
...
X509v3 Subject Key Identifier:  
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
X509v3 Authority Key Identifier:  
 keyid:81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
...
```

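In a certificate signed by a CA, these two fields should not be the same: the Authority Key Identifier should match the CA's Subject Key Identifier, while the certificate's own Subject Key Identifier should identify its own public key. Here the signed certificate reuses the CA's Subject Key Identifier, which is considered an invalid configuration.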
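The check described above can be sketched as a small script that compares the identifiers in two text dumps like the ones shown. This is only an illustration: the function names are hypothetical, the regular expressions assume the dump layout shown in this report, and the "fixed" leaf identifier is a made-up value.

```python
import re

HEX_ID = r"((?:[0-9A-Fa-f]{2}:)+[0-9A-Fa-f]{2})"


def extract_ids(text: str) -> dict:
    """Pull the subject and authority key identifiers out of a text dump."""
    ski = re.search(r"Subject Key Identifier:\s*" + HEX_ID, text)
    aki = re.search(r"Authority Key Identifier:\s*(?:keyid:)?" + HEX_ID, text)
    return {
        "ski": ski.group(1).upper() if ski else None,
        "aki": aki.group(1).upper() if aki else None,
    }


def leaf_problems(ca_text: str, leaf_text: str) -> list:
    """List identifier problems for a leaf certificate signed by the given CA."""
    ca, leaf = extract_ids(ca_text), extract_ids(leaf_text)
    problems = []
    if leaf["aki"] != ca["ski"]:
        problems.append("leaf AKI does not match the CA's SKI")
    if leaf["ski"] is not None and leaf["ski"] == ca["ski"]:
        problems.append("leaf SKI equals the CA's SKI")
    return problems


# Values copied from the dumps in this report.
CA_DUMP = """X509v3 Subject Key Identifier:
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
"""
LEAF_DUMP = """X509v3 Subject Key Identifier:
   81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
X509v3 Authority Key Identifier:
 keyid:81:11:91:F6:17:0F:F7:1E:B0:E3:CB:72:22:FC:17:03:FD:C7:82:C8
"""
# Hypothetical corrected leaf: its own SKI differs, its AKI still points at the CA.
FIXED_LEAF_DUMP = LEAF_DUMP.replace("   81:11", "   AA:11", 1)

reported = leaf_problems(CA_DUMP, LEAF_DUMP)
fixed = leaf_problems(CA_DUMP, FIXED_LEAF_DUMP)
```

For the dumps in this report, `reported` contains the "leaf SKI equals the CA's SKI" finding, and `fixed` is empty.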

How to reproduce:

1. Create two Azure VMs, an internal load balancer, and a vnet. Two VMs are needed because Azure does not allow "same leg" recursive calls via the load balancer.
2. Add an HTTPS probe and load-balancing rules for port 8443.
3. SSH into VM1 and run the script: https://gist.github.com/mjudeikis/4c0fc47552897bf13e82414b7d8a9f28
4. SSH into VM2 and try reaching VM1 via the node IP (curl https://ip:8443/readyz -Ik). This should work.
5. Try reaching VM1 via the load balancer IP. This should fail.
If you run ssldump on VM1:
   ssldump -i eth0 port 8443

you will see that the load balancer terminates the connection and never sends the ClientKeyExchange message.

6. Run the same script, but change the code behaviour for signed certificates (search for GOODCONFIG in the gist).

Now calls via the LB should work, as the certificate is considered valid.

Comment 5 Etienne Simard 2020-05-12 02:09:48 UTC
Verified with:

./installer_https_fix/openshift-install version
./installer_https_fix/openshift-install unreleased-master-3026-ge476c483ed99c9cf2982529178e668dbcaf3ed5e-dirty
built from commit e476c483ed99c9cf2982529178e668dbcaf3ed5e
release image registry.svc.ci.openshift.org/origin/release:4.5

I downloaded the installer source code and changed both azurerm_lb_probe templates (https://github.com/openshift/installer/blob/master/data/data/azure/vnet/internal-lb.tf#L138 and
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/public-lb.tf#L164)
to the following configurations:

~~~
resource "azurerm_lb_probe" "internal_lb_probe_api_internal" {
  name                = "api-internal-probe"
  resource_group_name = var.resource_group_name
  interval_in_seconds = 5
  number_of_probes    = 2
  loadbalancer_id     = azurerm_lb.internal.id
  port                = 6443
  protocol            = "HTTPS"
  request_path        = "/readyz"

}
~~~
~~~
resource "azurerm_lb_probe" "public_lb_probe_api_internal" {
  count = var.private ? 0 : 1

  name                = "api-internal-probe"
  resource_group_name = var.resource_group_name
  interval_in_seconds = 5
  number_of_probes    = 2
  loadbalancer_id     = azurerm_lb.public.id
  port                = 6443
  protocol            = "HTTPS"
  request_path        = "/readyz"
}

~~~

After those changes, I compiled the `openshift-install` binary by running `hack/build.sh`.

After exporting the release image override, I was able to install the cluster with the HTTPS health check in place.

$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="registry.svc.ci.openshift.org/ocp/release:4.5"
$ ./installer_https_fix/openshift-install create cluster --dir httpsfix2
? SSH Public Key /home/qe/.ssh/id_rsa.pub
? Platform azure
INFO Credentials loaded from file "/home/qe/.azure/osServicePrincipal.json" 
? Region centralus
? Base Domain qe.cluster.openshift.com
? Cluster Name esshttps02
? Pull Secret [? for help] ********
WARNING Found override for release image. Please be warned, this is not advised
INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s for the Kubernetes API at https://api.qe.cluster.openshift.com:6443... 
INFO API v1.18.2 up                               
INFO Waiting up to 40m0s for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 30m0s for the cluster at https://api.qe.cluster.openshift.com:6443 to initialize... 
INFO Waiting up to 10m0s for the openshift-console route to be created... 
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/qe/TESTS/bugzilla/1831760-certificate/httpsfix2/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.qe.cluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password:
INFO Time elapsed: 42m29s                     


My understanding is that this should be enough to confirm that the HTTPS certificate now works on the bootstrap machine. If you need anything else or have more details to add, please let me know.

Comment 6 Etienne Simard 2020-05-12 16:32:05 UTC
Cluster creation fails when using the same Azure LB HTTPS probes without the change from https://github.com/openshift/installer/pull/3551:

~~~
INFO Waiting up to 20m0s for the Kubernetes API at https://api.qe.cluster.openshift.com:6443... 
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.qe.cluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.89.117.146:6443: i/o timeout 
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "/home/qe/TESTS/bugzilla/1831760-certificate/httpstofail2/log-bundle-20200512114414.tar.gz" 
FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded
~~~

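I also noticed many TLS handshake errors like the one below in the bootstrap kube-apiserver logs. The source address, 168.63.129.16, is the address that Azure load balancer health probes originate from: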

~~~
bootstrap/containers/kube-apiserver-b768f854cf03134666fdf5a3b6abb48f3143863e3a33c02cbdf97b660b56cc92.log:I0512 15:22:39.201052       1 log.go:172] http: TLS handshake error from 168.63.129.16:57049: EOF
~~~
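To gauge how many of these errors a log bundle contains, the log lines can be grouped by source address. This is a rough sketch: the function name is hypothetical, the pattern assumes the line format shown in the excerpt above, and the second sample line is made up to show a non-matching entry.

```python
import re
from collections import Counter

# Matches the "TLS handshake error" lines as they appear in the excerpt above.
HANDSHAKE_ERROR = re.compile(
    r"http: TLS handshake error from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\d+: (?P<reason>.*)$"
)


def handshake_errors_by_source(lines) -> Counter:
    """Count TLS handshake errors per client IP address."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_ERROR.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


SAMPLE_LINES = [
    # Taken from the kube-apiserver bootstrap log quoted in this comment.
    "I0512 15:22:39.201052       1 log.go:172] http: TLS handshake error from 168.63.129.16:57049: EOF",
    # Hypothetical unrelated line, included only to show it is ignored.
    "I0512 15:22:40.000000       1 log.go:172] some unrelated message",
]

counts = handshake_errors_by_source(SAMPLE_LINES)
```

In practice the function would be fed the lines of the kube-apiserver log file from the gathered bundle, for example via `open(path)`.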

Comment 7 errata-xmlrpc 2020-07-13 17:35:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409