Bug 2071046

Summary: Frequent install failures with Ironic 500: "Cannot use 'none' RPC to connect to remote conductor"
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: ironic
Version: 4.11
Target Release: 4.11.0
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Reporter: Stephen Benjamin <stbenjam>
Assignee: Riccardo Pittau <rpittau>
QA Contact: Amit Ugol <augol>
CC: rpittau, zbitter
Type: Bug
Last Closed: 2022-04-04 08:04:02 UTC

Description Stephen Benjamin 2022-04-01 16:42:32 UTC
During installation, Terraform exits with an error after Ironic returns HTTP 500 (Internal Server Error):


level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [5m0s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [5m0s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [5m10s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [5m10s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[2]: Still creating... [5m10s elapsed]
level=error
level=error msg=Error: Internal Server Error
level=error
level=error msg=  with ironic_node_v1.openshift-master-host[1],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" {
level=error
level=error
level=error msg=Error: Internal Server Error
level=error
level=error msg=  with ironic_node_v1.openshift-master-host[0],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" { 


Digging into the installer log bundle's Ironic logs, I do see errors like this:

ironic.common.exception.ServiceUnavailable: Cannot use 'none' RPC to connect to remote conductor 172.22.0.2
: ironic.common.exception.ServiceUnavailable: Cannot use 'none' RPC to connect to remote conductor 172.22.0.2
2022-04-01 15:53:47.486 1 INFO eventlet.wsgi.server [req-5f326ae0-e490-4751-a6e3-93aa088d8ac3 - - - - -] ::ffff:192.168.111.1 "POST /v1/nodes HTTP/1.1" status: 500  len: 476 time: 0.0718205
2022-04-01 15:53:47.490 1 ERROR ironic.api.method [req-fa9859dd-4f5c-4e18-8ce8-afe73948699a - - - - -] Server-side error: "Cannot use 'none' RPC to connect to remote conductor 172.22.0.2". Detail: 
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/ironic/api/method.py", line 42, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/api/method.py", line 109, in inner_body
    return function(*args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/common/args.py", line 379, in inner_check_args
    return function(*args, **kwargs_next)

  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 2493, in post
    new_node, topic)

  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 314, in create_node
    cctxt = self._prepare_call(topic=topic, version='1.36')

  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 213, in _prepare_call
    % host)



Here's an example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-serial-ipv4/1509917381101621248


Looks like it's common enough to be worth investigating:
https://search.ci.openshift.org/?search=msg%3DError%3A+Internal+Server+Error&maxAge=48h&context=1&type=build-log&name=metal-ipi&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Zane Bitter 2022-04-01 18:39:58 UTC
This looks like a timing issue at startup. The API comes up before set_global_manager() is called, and if something hits the API in the meantime then we see this failure. The RPC service is launched before the WSGI one, but they both start in separate greenthreads so it's a race.
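
To illustrate the kind of race described here, below is a minimal toy sketch in Python. It is not Ironic's code: `ToyService`, `demo`, and the `wait_timeout` parameter are hypothetical names made up for the example, plain `threading` stands in for eventlet greenthreads, and only the name `set_global_manager` and the error text are borrowed from this report. It shows a request failing when it arrives before the manager is registered, and one possible way to avoid that by waiting on a readiness signal. The actual fix tracked in the duplicate bug may look quite different.

```python
import threading


class ServiceUnavailable(Exception):
    """Stand-in for ironic.common.exception.ServiceUnavailable."""


class ToyService:
    """Hypothetical model of an API and a conductor sharing one process."""

    def __init__(self):
        self._manager = None             # set once the "RPC service" is up
        self._ready = threading.Event()  # signals that the manager is set

    def set_global_manager(self, manager):
        # Called by the "RPC service" once it has finished starting.
        self._manager = manager
        self._ready.set()

    def create_node(self, name, wait_timeout=None):
        # Called by the "API" when it handles a node creation request.
        # wait_timeout=None reproduces the racy behaviour: no waiting at all.
        if wait_timeout is not None:
            self._ready.wait(wait_timeout)
        if self._manager is None:
            raise ServiceUnavailable(
                "Cannot use 'none' RPC to connect to remote conductor")
        return self._manager(name)


def demo():
    svc = ToyService()

    # A request that arrives before the manager is registered fails.
    try:
        svc.create_node("master-0")
        early = "ok"
    except ServiceUnavailable:
        early = "500"

    # The manager is registered from another thread a moment later;
    # a request that waits for the readiness signal succeeds.
    timer = threading.Timer(
        0.1, svc.set_global_manager, args=(lambda n: f"created {n}",))
    timer.start()
    late = svc.create_node("master-0", wait_timeout=5)
    timer.join()
    return early, late


if __name__ == "__main__":
    print(demo())
```

In this toy model the first call fails because nothing orders the API request after the manager registration, which is the same shape as the failure in the traceback above. Waiting on an event is only one option; starting the API strictly after the manager is set would remove the window entirely.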

Happy Monday @dtantsur

Comment 2 Riccardo Pittau 2022-04-04 08:04:02 UTC

*** This bug has been marked as a duplicate of bug 2068246 ***