Bug 1968701

Summary: Bare metal IPI installation fails due to worker inspection failure
Product: OpenShift Container Platform
Reporter: Hiroyuki Yasuhara (Fujitsu) <hyasuhar>
Component: Bare Metal Hardware Provisioning
Sub component: cluster-baremetal-operator
Assignee: Jacob Anders <janders>
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aos-bugs, dtantsur, fj-lsoft-rh-cnt, hshiina, hyasuhar, janders, jniu, rbartal, rpittau, tsedovic
Version: 4.8
Keywords: OtherQA, Triaged
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Doc Type: No Doc Update
Last Closed: 2021-07-27 23:11:53 UTC
Type: Bug

Description Hiroyuki Yasuhara (Fujitsu) 2021-06-07 20:32:57 UTC
Description of problem:

  When using an iRMC server for an OCP bare metal IPI deployment, the
  deployment fails because inspection of the worker node fails.
  ----------------------------------------
  $ oc get bmh -A
  NAMESPACE               NAME       STATE                    CONSUMER                   ONLINE   ERROR
  openshift-machine-api   master-0   externally provisioned   openshift-bwd8n-master-0   true
  openshift-machine-api   master-1   externally provisioned   openshift-bwd8n-master-1   true
  openshift-machine-api   master-2   externally provisioned   openshift-bwd8n-master-2   true
  openshift-machine-api   worker-0   inspecting                                          true     inspection error
  ----------------------------------------

  The following ironic-inspector log output appears to show the reason
  the inspection failed:

  ----------------------------------------
  2021-06-04 04:01:05.449 1 INFO eventlet.wsgi.server [req-8cf4f186-e733-4af1-ae76-e1aa617d36f6 - - - - -] ::1 "GET /v1 HTTP/1.1" status: 200  len: 507 time: 0.0038972
  2021-06-04 04:01:05.539 1 INFO eventlet.wsgi.server [req-7b1a73dc-c372-4457-a5f0-23c28c46f163 - - - - -] ::1 "GET /v1/introspection/23931413-6909-4808-9324-f544653a8580 HTTP/1.1" status: 200  len: 488 time: 0.0061669
  2021-06-04 04:01:07.442 1 DEBUG eventlet.wsgi.server [-] (1) accepted ('::ffff:192.168.20.157', 43920, 0, 0) server /usr/lib/python3.6/site-packages/eventlet/wsgi.py:985
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
      timer()
    File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/lib/python3.6/site-packages/eventlet/greenthread.py", line 221, in main
      result = function(*args, **kwargs)
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request
      proto.__init__(conn_state, self)
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 357, in __init__
      self.handle()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 390, in handle
      self.handle_one_request()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 419, in handle_one_request
      self.raw_requestline = self._read_request_line()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 402, in _read_request_line
      return self.rfile.readline(self.server.url_length_limit)
    File "/usr/lib64/python3.6/socket.py", line 586, in readinto
      return self._sock.recv_into(b)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 241, in recv_into
      return self._base_recv(nbytes, flags, into=True, buffer_=buffer)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 256, in _base_recv
      read = self.read(nbytes, buffer_)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 176, in read
      super(GreenSSLSocket, self).read, *args, **kwargs)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 150, in _call_trampolining
      return func(*a, **kw)
    File "/usr/lib64/python3.6/ssl.py", line 833, in read
      return self._sslobj.read(len, buffer)
    File "/usr/lib64/python3.6/ssl.py", line 590, in read
      v = self._sslobj.read(len, buffer)
  ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:2354)
  ----------------------------------------

  The error first appears in 4.8.0-0.nightly-2021-03-19-184028 and is
  caused by the following commit:
    https://github.com/openshift/cluster-baremetal-operator/commit/671e334d95ed2a17d0e8eef5c6d8357431512a45

  This commit adds TLS support for Ironic and the inspector. The
  problem appears to be on the OCP side, as follows.

  Whether Ironic uses TLS is determined by the existence of the
  certificate files. When the Ironic-related containers start, they
  first check whether the certs exist and then generate ironic.conf
  accordingly: if no cert files are present, ironic.conf is written
  with TLS disabled; if the certs exist, TLS is enabled.
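
  For illustration, here is a minimal Go sketch of that
  cert-existence check (the real logic lives in the Ironic container's
  startup scripts, and the cert path below is an assumption):

  ----------------------------------------
  // Illustrative sketch only: enable TLS in ironic.conf only when the
  // certificate file exists on disk.
  package main

  import (
          "fmt"
          "os"
  )

  // certPath is an assumed location for the mounted TLS certificate.
  const certPath = "/certs/ironic/tls.crt"

  func main() {
          if _, err := os.Stat(certPath); err == nil {
                  fmt.Println("writing ironic.conf with TLS enabled")
          } else {
                  fmt.Println("writing ironic.conf with TLS disabled (plain HTTP)")
          }
  }
  ----------------------------------------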

  When bootstrapping the masters, no such certs exist, and the
  bootstrap VM does not create them. The VM just starts Ironic:
    https://github.com/openshift/installer/blob/master/data/data/bootstrap/baremetal/files/usr/local/bin/startironic.sh.template
  So during master deployment, Ironic does not use TLS.

  During worker deployment, however, those cert files are created by
  the Cluster Baremetal Operator (CBO). CBO runs on a master node
  after master deployment completes and is responsible for creating
  BMO, Ironic, and the certs they need. CBO creates the certs as a
  Kubernetes secret called `metal3-ironic-tls` and then creates the
  metal3 deployment, mounting this secret into each BMO and Ironic
  container via a VolumeMount:
    https://github.com/openshift/cluster-baremetal-operator/blob/master/provisioning/baremetal_pod.go
  As a result, during worker deployment, the Ironic on the master
  uses TLS.
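
  The following Go sketch shows how such a secret can be wired into a
  container with a Volume and VolumeMount (the volume name and mount
  path here are assumptions, not the exact values in
  baremetal_pod.go):

  ----------------------------------------
  package provisioning

  import corev1 "k8s.io/api/core/v1"

  // Sketch only: expose the metal3-ironic-tls secret to a container.
  var ironicTlsVolume = corev1.Volume{
          Name: "cert-ironic", // assumed volume name
          VolumeSource: corev1.VolumeSource{
                  Secret: &corev1.SecretVolumeSource{
                          SecretName: "metal3-ironic-tls",
                  },
          },
  }

  // Containers that reference this mount see the certs at MountPath.
  var ironicTlsMount = corev1.VolumeMount{
          Name:      "cert-ironic",
          MountPath: "/certs/ironic", // assumed mount path
  }
  ----------------------------------------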

  But the IPA image used for masters and workers is the same one,
  configured to send plain HTTP requests, so worker deployment fails:
  in this case, the worker's IPA is required to send HTTPS requests.

Version-Release number of selected component (if applicable):

  openshift-baremetal-install 4.8.0-0.nightly-2021-04-15-152737
  built from commit d0462d8b5074448e1917da7f0a5d7a904bd60359
  release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:70fe4f1a828dcbe493dce6d199eb5d9e76300d053c477f0f4b4577ef7b7d2934

How reproducible:

  Always.

Steps to Reproduce:

  1. openshift-install --dir ~/clusterconfigs create manifests
  2. cp ~/ipi/99_router-replicas.yaml ~/clusterconfigs/openshift/
  3. openshift-install --dir ~/clusterconfigs --log-level debug create cluster

Actual results:

  Inspection of worker nodes fails, and the bare metal IPI deployment fails.

Expected results:

  Inspection of worker nodes succeeds, and the bare metal IPI deployment completes.

Additional info:

  Upstream issue in the OCP community:
  https://github.com/openshift/cluster-baremetal-operator/issues/152

Comment 1 Jacob Anders 2021-06-07 21:34:42 UTC
Setting blocker- as this is a continuation of https://bugzilla.redhat.com/show_bug.cgi?id=1965168 which was triaged as blocker-

Comment 2 Fujitsu container team 2021-06-08 09:31:53 UTC
Hi Jacob,

I think the following may be the root cause:
https://github.com/openshift/cluster-baremetal-operator/blob/75e3ab4524c200f0c57befd03883afcca13bbd98/provisioning/baremetal_pod.go#L503
The certs are not mounted into the httpd container.

I confirmed that this SSL error does not occur with the following modification.
(I'm currently testing again, including other fixes that haven't been incorporated into OCP yet.)

------------------------------------------
diff --git a/provisioning/baremetal_pod.go b/provisioning/baremetal_pod.go
index 366af31..daed36e 100644
--- a/provisioning/baremetal_pod.go
+++ b/provisioning/baremetal_pod.go
@@ -503,6 +503,10 @@ func createContainerMetal3Httpd(images *Images, config *metal3iov1alpha1.Provisi
                VolumeMounts: []corev1.VolumeMount{
                        sharedVolumeMount,
                        imageVolumeMount,
+                       inspectorCredentialsMount,
+                       rpcCredentialsMount,
+                       ironicTlsMount,
+                       inspectorTlsMount,
                },
                Env: []corev1.EnvVar{
                        buildEnvVar(httpPort, config),
------------------------------------------

Regarding this modification, I'm not sure all of these mounts are needed,
so I would like Red Hat to clarify which ones are required and to check
whether this modification is reasonable.

Best regards,
Yasuhiro Futakawa

Comment 3 Dmitry Tantsur 2021-06-08 11:28:02 UTC
Good catch, I think ironicTlsMount and inspectorTlsMount should be added there (credentials are probably not needed).
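
For reference, applying that suggestion would trim the Comment 2 patch to just the two TLS mounts. A sketch of the trimmed hunk (identifiers taken from the Comment 2 diff; this is not necessarily the merged fix):

------------------------------------------
@@ -503,6 +503,8 @@ func createContainerMetal3Httpd(images *Images, config *metal3iov1alpha1.Provisi
                VolumeMounts: []corev1.VolumeMount{
                        sharedVolumeMount,
                        imageVolumeMount,
+                       ironicTlsMount,
+                       inspectorTlsMount,
                },
------------------------------------------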

Comment 5 Lubov 2021-06-14 08:24:30 UTC
Cannot verify due to the lack of an iRMC setup. The problem does not happen on HPE and Dell setups. Closing as OtherQA.

If the problem is reproduced on iRMC, please open/reopen.

Comment 8 errata-xmlrpc 2021-07-27 23:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438