Bug 2071046 - Frequent install failures with Ironic 500: "Cannot use 'none' RPC to connect to remote conductor"
Keywords:
Status: CLOSED DUPLICATE of bug 2068246
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Riccardo Pittau
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-01 16:42 UTC by Stephen Benjamin
Modified: 2022-04-04 08:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-04 08:04:02 UTC
Target Upstream Version:
Embargoed:



Description Stephen Benjamin 2022-04-01 16:42:32 UTC
During installation, Terraform exits, reporting that Ironic returned a 500 (Internal Server Error):


level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [5m0s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [5m0s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [5m10s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [5m10s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[2]: Still creating... [5m10s elapsed]
level=error
level=error msg=Error: Internal Server Error
level=error
level=error msg=  with ironic_node_v1.openshift-master-host[1],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" {
level=error
level=error
level=error msg=Error: Internal Server Error
level=error
level=error msg=  with ironic_node_v1.openshift-master-host[0],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" { 


Digging into the installer log bundle's Ironic logs, I do see errors like this:

ironic.common.exception.ServiceUnavailable: Cannot use 'none' RPC to connect to remote conductor 172.22.0.2
: ironic.common.exception.ServiceUnavailable: Cannot use 'none' RPC to connect to remote conductor 172.22.0.2
2022-04-01 15:53:47.486 1 INFO eventlet.wsgi.server [req-5f326ae0-e490-4751-a6e3-93aa088d8ac3 - - - - -] ::ffff:192.168.111.1 "POST /v1/nodes HTTP/1.1" status: 500  len: 476 time: 0.0718205
2022-04-01 15:53:47.490 1 ERROR ironic.api.method [req-fa9859dd-4f5c-4e18-8ce8-afe73948699a - - - - -] Server-side error: "Cannot use 'none' RPC to connect to remote conductor 172.22.0.2". Detail: 
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/ironic/api/method.py", line 42, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/api/method.py", line 109, in inner_body
    return function(*args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/common/args.py", line 379, in inner_check_args
    return function(*args, **kwargs_next)

  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 2493, in post
    new_node, topic)

  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 314, in create_node
    cctxt = self._prepare_call(topic=topic, version='1.36')

  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 213, in _prepare_call
    % host)



Here's an example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-serial-ipv4/1509917381101621248


Looks like it's common enough to be worth investigating:
https://search.ci.openshift.org/?search=msg%3DError%3A+Internal+Server+Error&maxAge=48h&context=1&type=build-log&name=metal-ipi&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Zane Bitter 2022-04-01 18:39:58 UTC
This looks like a timing issue at startup. The API comes up before set_global_manager() is called, and if something hits the API in the meantime then we see this failure. The RPC service is launched before the WSGI one, but they both start in separate greenthreads so it's a race.
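
To illustrate the ordering problem described above, here is a minimal sketch in plain Python. It is not Ironic code and not the fix tracked in bug 2068246: FakeIronicProcess, its ready event, and the wait_timeout parameter are hypothetical names invented for this example, and standard-library threads stand in for eventlet greenthreads. Only set_global_manager() and the error message come from the report.

```python
import threading


class ServiceUnavailable(Exception):
    """Stand-in for ironic.common.exception.ServiceUnavailable."""


class FakeIronicProcess:
    """Toy model of one process hosting both the API and the conductor."""

    def __init__(self):
        self.manager = None             # filled in by set_global_manager()
        self.ready = threading.Event()  # hypothetical readiness signal

    def set_global_manager(self, manager):
        # Called by the RPC service once the conductor manager exists.
        self.manager = manager
        self.ready.set()

    def create_node(self, wait_timeout=None):
        # Models POST /v1/nodes. With wait_timeout=None the handler does not
        # wait, so it fails if it runs before set_global_manager().
        if wait_timeout is not None:
            self.ready.wait(wait_timeout)
        if self.manager is None:
            raise ServiceUnavailable(
                "Cannot use 'none' RPC to connect to remote conductor")
        return "node created via " + self.manager


def demo():
    # Racy ordering: the request arrives before the manager is registered.
    racy = FakeIronicProcess()
    try:
        racy.create_node()
        racy_result = "ok"
    except ServiceUnavailable as exc:
        racy_result = "500: " + str(exc)

    # Guarded ordering: the handler waits for the readiness signal, which a
    # second thread sets shortly afterwards.
    guarded = FakeIronicProcess()
    starter = threading.Timer(
        0.05, guarded.set_global_manager, args=("local-conductor",))
    starter.start()
    guarded_result = guarded.create_node(wait_timeout=5)
    starter.join()
    return racy_result, guarded_result


if __name__ == "__main__":
    print(demo())
```

The first call fails the same way the installer's POST /v1/nodes does, while the second succeeds because the handler blocks until the manager is registered. Whether the real fix gates the API on readiness, reorders startup, or does something else entirely is not stated in this report.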

Happy Monday @dtantsur

Comment 2 Riccardo Pittau 2022-04-04 08:04:02 UTC

*** This bug has been marked as a duplicate of bug 2068246 ***

