Bug 1968701 - Bare metal IPI installation fails due to worker inspection failure
Summary: Bare metal IPI installation fails due to worker inspection failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jacob Anders
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-07 20:32 UTC by Hiroyuki Yasuhara (Fujitsu)
Modified: 2021-07-27 23:12 UTC
CC: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:11:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-baremetal-operator issues 152 0 None open ssl.SSLError: [SSL: HTTP_REQUEST] http request 2021-06-08 14:29:05 UTC
Github openshift cluster-baremetal-operator pull 156 0 None closed Bug 1968701: Add ironic/inspector TlsMounts to baremetal pod 2021-06-08 14:29:03 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:12:11 UTC

Description Hiroyuki Yasuhara (Fujitsu) 2021-06-07 20:32:57 UTC
Description of problem:

  When using an iRMC server for OCP bare metal IPI deployment,
  the deployment fails because inspection of the worker node fails.
  ----------------------------------------
  $ oc get bmh -A
  NAMESPACE               NAME       STATE                    CONSUMER                   ONLINE   ERROR
  openshift-machine-api   master-0   externally provisioned   openshift-bwd8n-master-0   true
  openshift-machine-api   master-1   externally provisioned   openshift-bwd8n-master-1   true
  openshift-machine-api   master-2   externally provisioned   openshift-bwd8n-master-2   true
  openshift-machine-api   worker-0   inspecting                                          true     inspection error
  ----------------------------------------

  The following ironic-inspector log output is likely the reason the
  inspection fails:

  ----------------------------------------
  2021-06-04 04:01:05.449 1 INFO eventlet.wsgi.server [req-8cf4f186-e733-4af1-ae76-e1aa617d36f6 - - - - -] ::1 "GET /v1 HTTP/1.1" status: 200  len: 507 time: 0.0038972
  2021-06-04 04:01:05.539 1 INFO eventlet.wsgi.server [req-7b1a73dc-c372-4457-a5f0-23c28c46f163 - - - - -] ::1 "GET /v1/introspection/23931413-6909-4808-9324-f544653a8580 HTTP/1.1" status: 200  len: 488 time: 0.0061669
  2021-06-04 04:01:07.442 1 DEBUG eventlet.wsgi.server [-] (1) accepted ('::ffff:192.168.20.157', 43920, 0, 0) server /usr/lib/python3.6/site-packages/eventlet/wsgi.py:985
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
      timer()
    File "/usr/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/lib/python3.6/site-packages/eventlet/greenthread.py", line 221, in main
      result = function(*args, **kwargs)
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 818, in process_request
      proto.__init__(conn_state, self)
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 357, in __init__
      self.handle()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 390, in handle
      self.handle_one_request()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 419, in handle_one_request
      self.raw_requestline = self._read_request_line()
    File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 402, in _read_request_line
      return self.rfile.readline(self.server.url_length_limit)
    File "/usr/lib64/python3.6/socket.py", line 586, in readinto
      return self._sock.recv_into(b)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 241, in recv_into
      return self._base_recv(nbytes, flags, into=True, buffer_=buffer)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 256, in _base_recv
      read = self.read(nbytes, buffer_)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 176, in read
      super(GreenSSLSocket, self).read, *args, **kwargs)
    File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 150, in _call_trampolining
      return func(*a, **kw)
    File "/usr/lib64/python3.6/ssl.py", line 833, in read
      return self._sslobj.read(len, buffer)
    File "/usr/lib64/python3.6/ssl.py", line 590, in read
      v = self._sslobj.read(len, buffer)
  ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:2354)
  ----------------------------------------

  The error has occurred since 4.8.0-0.nightly-2021-03-19-184028,
  and it is caused by the following commit:
    https://github.com/openshift/cluster-baremetal-operator/commit/671e334d95ed2a17d0e8eef5c6d8357431512a45

  This commit adds TLS support for Ironic and Inspector.
  This appears to be a problem on the OCP side, as follows.

  Whether Ironic uses TLS is determined by the presence of the cert files.
  When the Ironic-related containers start, they first check whether the
  certs exist and then generate ironic.conf accordingly.
  If no cert files are present, ironic.conf is written with TLS disabled;
  if the certs exist, TLS is enabled.
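
  As a rough sketch of that check (the real logic lives in the Ironic image's
  startup/configuration scripts, not in Go; the file paths below are only
  assumptions for illustration):
  ----------------------------------------
  package main

  import (
      "fmt"
      "os"
  )

  // fileExists reports whether a certificate or key file is present.
  func fileExists(path string) bool {
      _, err := os.Stat(path)
      return err == nil
  }

  func main() {
      // Hypothetical mount paths, used here only for illustration.
      certFile := "/certs/ironic/tls.crt"
      keyFile := "/certs/ironic/tls.key"

      if fileExists(certFile) && fileExists(keyFile) {
          // Certs are present: the generated ironic.conf enables TLS.
          fmt.Printf("[ssl]\ncert_file = %s\nkey_file = %s\n", certFile, keyFile)
      } else {
          // No certs (e.g. on the bootstrap VM): ironic.conf is written
          // without TLS and the services listen on plain HTTP.
          fmt.Println("# TLS disabled: no certificate files found")
      }
  }
  ----------------------------------------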

  When bootstrapping the masters, no such certs exist, and the bootstrap VM
  does not create them.
  The VM just starts Ironic:
    https://github.com/openshift/installer/blob/master/data/data/bootstrap/baremetal/files/usr/local/bin/startironic.sh.template
  So during master deployment, Ironic does not use TLS.

  But during worker deployment, those cert files are created by the
  Cluster Baremetal Operator (CBO).
  CBO runs on a master node after master deployment is completed.
  It is responsible for creating BMO, Ironic, and the certs they need.
  CBO creates the certs as a k8s secret called `metal3-ironic-tls`
  and then creates the metal3 deployment, mounting this secret into
  each BMO and Ironic container using a VolumeMount:
    https://github.com/openshift/cluster-baremetal-operator/blob/master/provisioning/baremetal_pod.go
  As a result, during worker deployment, the Ironic on the master uses TLS.
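
  A simplified sketch of that mounting mechanism (this is not the actual
  baremetal_pod.go code; the volume name and mount path are illustrative):
  ----------------------------------------
  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
  )

  func main() {
      // Volume backed by the TLS secret that CBO creates.
      ironicTlsVolume := corev1.Volume{
          Name: "cert-ironic",
          VolumeSource: corev1.VolumeSource{
              Secret: &corev1.SecretVolumeSource{
                  SecretName: "metal3-ironic-tls",
              },
          },
      }

      // Mount that volume into a container of the metal3 deployment, so the
      // Ironic entrypoint finds the certs and enables TLS.
      ironicTlsMount := corev1.VolumeMount{
          Name:      ironicTlsVolume.Name,
          MountPath: "/certs/ironic", // assumed path checked at container start
      }

      container := corev1.Container{
          Name:         "metal3-ironic",
          VolumeMounts: []corev1.VolumeMount{ironicTlsMount},
      }

      fmt.Printf("container %q mounts secret %q at %s\n",
          container.Name, ironicTlsVolume.VolumeSource.Secret.SecretName, ironicTlsMount.MountPath)
  }
  ----------------------------------------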

  But the IPA ramdisk used for masters and workers is the same one, and it
  is configured to send plain HTTP requests, so worker deployment fails.
  In this case, HTTPS requests are required from the worker's IPA.
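
  The mismatch can be shown in isolation. A minimal, self-contained Go sketch
  (standing in for the agent and the TLS-enabled service, not their actual
  code):
  ----------------------------------------
  package main

  import (
      "crypto/tls"
      "fmt"
      "net"
  )

  func main() {
      // An in-memory connection: one end plays a TLS-only service (Ironic or
      // ironic-inspector after CBO enables TLS), the other plays the IPA
      // ramdisk, which is still configured to speak plain HTTP.
      clientConn, serverConn := net.Pipe()

      errCh := make(chan error, 1)
      go func() {
          // The server expects a TLS handshake; no certificate is needed here
          // because the handshake fails before certificate selection.
          errCh <- tls.Server(serverConn, &tls.Config{}).Handshake()
      }()

      // The "agent" sends a plaintext HTTP request line to the TLS port.
      fmt.Fprintf(clientConn, "GET /v1/continue HTTP/1.1\r\nHost: inspector\r\n\r\n")

      // Typically prints "tls: first record does not look like a TLS handshake",
      // the Go analogue of the "ssl.SSLError: [SSL: HTTP_REQUEST]" seen above.
      fmt.Println("server-side handshake error:", <-errCh)
      clientConn.Close()
  }
  ----------------------------------------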

Version-Release number of selected component (if applicable):

  openshift-baremetal-install 4.8.0-0.nightly-2021-04-15-152737
  built from commit d0462d8b5074448e1917da7f0a5d7a904bd60359
  release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:70fe4f1a828dcbe493dce6d199eb5d9e76300d053c477f0f4b4577ef7b7d2934

How reproducible:

  Always.

Steps to Reproduce:

  1. openshift-install --dir ~/clusterconfigs create manifests
  2. cp ~/ipi/99_router-replicas.yaml ~/clusterconfigs/openshift/
  3. openshift-install --dir ~/clusterconfigs --log-level debug create cluster

Actual results:

  Inspecting worker nodes fails, and the bare metal IPI deployment fails.

Expected results:

  Inspecting worker nodes succeeds, and the bare metal IPI deployment completes.

Additional info:

  Upstream issue in the OCP community:
  https://github.com/openshift/cluster-baremetal-operator/issues/152

Comment 1 Jacob Anders 2021-06-07 21:34:42 UTC
Setting blocker- as this is a continuation of https://bugzilla.redhat.com/show_bug.cgi?id=1965168 which was triaged as blocker-

Comment 2 Fujitsu container team 2021-06-08 09:31:53 UTC
Hi Jacob,

I think the following may be the root cause:
https://github.com/openshift/cluster-baremetal-operator/blob/75e3ab4524c200f0c57befd03883afcca13bbd98/provisioning/baremetal_pod.go#L503
The certs are not mounted into the httpd container.

I confirmed that this SSL error does not occur with the following modification.
(I am currently testing again, including other fixes that have not yet been incorporated into OCP.)

------------------------------------------
diff --git a/provisioning/baremetal_pod.go b/provisioning/baremetal_pod.go
index 366af31..daed36e 100644
--- a/provisioning/baremetal_pod.go
+++ b/provisioning/baremetal_pod.go
@@ -503,6 +503,10 @@ func createContainerMetal3Httpd(images *Images, config *metal3iov1alpha1.Provisi
                VolumeMounts: []corev1.VolumeMount{
                        sharedVolumeMount,
                        imageVolumeMount,
+                       inspectorCredentialsMount,
+                       rpcCredentialsMount,
+                       ironicTlsMount,
+                       inspectorTlsMount,
                },
                Env: []corev1.EnvVar{
                        buildEnvVar(httpPort, config),
------------------------------------------

Regarding this modification, I'm not sure all of these mounts are needed.
I would like Red Hat to clarify which ones are required and to check whether this modification is reasonable.

Best regards,
Yasuhiro Futakawa

Comment 3 Dmitry Tantsur 2021-06-08 11:28:02 UTC
Good catch, I think ironicTlsMount and inspectorTlsMount should be added there (credentials are probably not needed).
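
For reference, a compilable sketch of what that narrower change would look like (the variable names mirror the diff above, but the values here are placeholders rather than the ones in baremetal_pod.go):
------------------------------------------
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    // Placeholder mounts; in baremetal_pod.go these come from the deployment code.
    sharedVolumeMount := corev1.VolumeMount{Name: "metal3-shared", MountPath: "/shared"}
    imageVolumeMount := corev1.VolumeMount{Name: "metal3-images", MountPath: "/shared/html/images"}
    ironicTlsMount := corev1.VolumeMount{Name: "cert-ironic", MountPath: "/certs/ironic"}
    inspectorTlsMount := corev1.VolumeMount{Name: "cert-ironic-inspector", MountPath: "/certs/ironic-inspector"}

    httpd := corev1.Container{
        Name: "metal3-httpd",
        VolumeMounts: []corev1.VolumeMount{
            sharedVolumeMount,
            imageVolumeMount,
            // Only the TLS mounts are added; the rpc/inspector credential
            // mounts from the earlier diff are left out.
            ironicTlsMount,
            inspectorTlsMount,
        },
    }

    fmt.Printf("%s has %d volume mounts\n", httpd.Name, len(httpd.VolumeMounts))
}
------------------------------------------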

Comment 5 Lubov 2021-06-14 08:24:30 UTC
Cannot verify due to the lack of an iRMC setup. The problem does not happen on HPE and Dell setups. Closing as OtherQA.

If the problem reproduces on iRMC, please reopen.

Comment 8 errata-xmlrpc 2021-07-27 23:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

