Bug 1883569 - BMH node provisioning takes too long ~ 30 minutes: ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400
Summary: BMH node provisioning takes too long ~ 30 minutes: ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Bob Fournier
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1883678
Blocks:
 
Reported: 2020-09-29 15:30 UTC by Marius Cornea
Modified: 2020-10-27 16:46 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
ironic-python-agent tokens were not being generated properly by Ironic until the node was rebooted. With this fix the agent tokens are properly generated.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:20 UTC
Target Upstream Version:


Attachments
ironic provisioning errors screenshot (358.42 KB, image/png)
2020-09-29 15:30 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 739964 0 None MERGED Wipe agent token during reboot or power off 2021-02-13 20:27:06 UTC
OpenStack gerrit 751813 0 None MERGED Also wipe agent token on manual power off or reboot 2021-02-13 20:27:06 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:46:39 UTC

Description Marius Cornea 2020-09-29 15:30:23 UTC
Created attachment 1717573 [details]
ironic provisioning errors screenshot

Description of problem:

BMH node provisioning takes ~30 minutes, which is too long compared to other nodes, which take ~5 minutes. During provisioning the console shows errors like: "ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400". After ~20 minutes of looping through this error, the node reboots and the provisioning completes successfully.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-25-085318

How reproducible:
100%

Steps to Reproduce:
1. Create BareMetalHost objects

[kni@ocp-edge12 ~]$ cat bmh-node-deploy/openshift-worker-6-bmh.yaml 
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-6
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.41/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-6-bmc-secret"
    disableCertificateVerification: true

  bootMACAddress: "BC:97:E1:69:DB:60"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api

---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-7
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.42/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-7-bmc-secret"
    disableCertificateVerification: true

  bootMACAddress: "BC:97:E1:69:A1:A0"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
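
Note: each BareMetalHost above references a credentialsName secret (openshift-worker-6-bmc-secret / openshift-worker-7-bmc-secret) that must exist before the BMH is created. A minimal sketch of creating one and applying the manifest; the BMC username/password values are placeholders, not taken from this report:

oc -n openshift-machine-api create secret generic openshift-worker-6-bmc-secret \
  --from-literal=username=<bmc-username> \
  --from-literal=password=<bmc-password>
oc apply -f bmh-node-deploy/openshift-worker-6-bmh.yaml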

2. Wait for BMH nodes to get into Ready state

3. Scale the machineset
oc -n openshift-machine-api scale --replicas=2 machineset worker-lb

4. Watch BMH provisioning status and consoles
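
(For reference, one way to follow the BMH provisioning state and the resulting machines while the scale-up runs; these commands are a sketch and not part of the original report:)

oc -n openshift-machine-api get bmh -w
oc -n openshift-machine-api get machines -w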

Actual results:

Provisioning takes too long as ironic python agent reports errors.

Expected results:

Provisioning occurs without errors in shorter time.

Additional info:

Attaching a screenshot of the consoles and a must-gather.

Comment 2 Dmitry Tantsur 2020-09-29 16:07:17 UTC
> BMH node provisioning takes ~ 30 minutes which is too long compared to other nodes which takes ~ 5 minutes

Quick question: is it always the same node or a random one?

Comment 3 Dmitry Tantsur 2020-09-29 16:45:55 UTC
It feels like that's the case of the issue fixed in https://opendev.org/openstack/ironic/commit/2d4d375358c8eb923a0a3bcee211ac53b991e54b and https://opendev.org/openstack/ironic/commit/bc04a42a96672f70e9327e805586b48c825bb5a0. We need to check if we have both patches downstream.

Comment 4 Bob Fournier 2020-09-29 16:59:49 UTC
For the ironic-image, the last tagged ironic-conductor pkg openstack-ironic-conductor-1:15.1.1-0.20200724075308.3e92fd0.el8ost (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1266093) is from July 24, 2020.  So neither of those patches is included.

Comment 5 Bob Fournier 2020-09-29 17:13:56 UTC
Regarding this being a token issue, we can see in the ironic-api logs that a 400 error is being returned to the IPA because of a token issue:
metal3-ironic-api/metal3-ironic-api/logs/current.log:2020-09-29T15:14:59.850395695Z 2020-09-29 15:14:59.850 41 DEBUG ironic.common.json_rpc.client [req-941ca0fc-826f-4af8-971b-a788720c648b - - - - -] RPC heartbeat returned {"jsonrpc":"2.0","id":"req-941ca0fc-826f-4af8-971b-a788720c648b","error":{"code":400,"message":"Invalid or missing agent token received.","data":{"class":"ironic.common.exception.InvalidParameterValue"}}} _request /usr/lib/python3.6/site-packages/ironic/common/json_rpc/client.py:173

We also see these errors in the ironic-conductor logs:
metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T15:13:53.891450375Z 2020-09-29 15:13:53.891 1 DEBUG ironic.common.json_rpc.server [req-14ed9582-4404-4bb3-9872-3af5e5ded67f - - - - -] RPC error InvalidParameterValue: Invalid or missing agent token received. _handle_error /usr/lib/python3.6/site-packages/ironic/common/json_rpc/server.py:149

metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T14:58:35.268519323Z 2020-09-29 14:58:35.268 1 WARNING ironic.conductor.manager [req-47908a15-93c4-4310-950c-274d501cb0ca - - - - -] An agent token generation request is being refused as one is already present for node 4294fb3a-e7ba-4065-b6f1-bea6f591f4d9

As Dmitry has theorized, after some time Ironic will reboot the node.  At that point the token issue resolves and deployment completes, which is why the delay is noticed even though provisioning does eventually succeed.

We'll need to pick up the ironic pkg that includes the fixes Dmitry noted.
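
(For anyone retracing this in a must-gather, a rough sketch of pulling the token errors out of the metal3 pod logs, assuming the directory layout quoted above:)

grep -r "Invalid or missing agent token" metal3-ironic-api/ metal3-ironic-conductor/
grep -r "agent token generation request is being refused" metal3-ironic-conductor/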

Comment 6 Bob Fournier 2020-09-29 20:57:34 UTC
Actually the first patch https://review.opendev.org/#/c/739964/ is in this ironic image, but the 2nd patch https://review.opendev.org/751813 is not.  We're in the process of tagging the latest ironic packages along with dependencies to pick up this fix.

Comment 7 Tomas Sedovic 2020-09-30 10:20:12 UTC
Adding the cross-tagging request for the fixes as a dependency to this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1883678

Comment 8 Bob Fournier 2020-09-30 16:41:29 UTC
The ironic fix for agent token [1] has been tagged for inclusion in the ironic-image; it will be in the next ironic-image build.  Note that we've seen this fix resolve the same "agent token generation request is being refused" issue as reported here (see https://storyboard.openstack.org/#!/story/2008097).  That being said, we haven't been able to reproduce this particular problem in dev-scripts.

We recommend retesting with the updated ironic-image.

[1] https://review.opendev.org/751813

Comment 9 Derek Higgins 2020-09-30 16:52:53 UTC
Marius, we're having trouble reproducing this. I've successfully added additional BMHs to a cluster and scaled up onto them without any delay (using 4.6.0-0.nightly-2020-09-30-091659).

I did have some luck provoking the same 400 error by powering nodes on and off before scaling up.

Did you do anything after adding the BMHs and before attempting to scale up? Also, can you retry with the ironic fix Bob has just tagged?
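
(For completeness, one way to power a BMH off and on without touching the BMC directly is to toggle spec.online; a sketch using the host name from the reproducer, not necessarily how the 400 error was provoked here:)

oc -n openshift-machine-api patch bmh openshift-worker-6 --type merge -p '{"spec":{"online":false}}'
oc -n openshift-machine-api patch bmh openshift-worker-6 --type merge -p '{"spec":{"online":true}}'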

Comment 10 Bob Fournier 2020-09-30 18:53:35 UTC
The ironic-container-v4.6.0-202009301654.p0[1] has been built with the openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch pkg that has the token fix.

Marius - please retry with this fix and let us know. Thanks.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

Comment 12 Marius Cornea 2020-09-30 19:03:37 UTC
(In reply to Derek Higgins from comment #9)
> Marius, we're having trouble reproducing this, I've successfully added
> additional bmh's to a cluster and scaled up onto them without any delay
> (using 4.6.0-0.nightly-2020-09-30-091659)
> 
> I did have some luck provoking the same 400 error by powering nodes on and
> off before scaling up.
> 
> Did you do anything after adding the bmh's and before attempting to scale
> up? Also can you retry with the ironic fix Bob has just tagged.

One thing that could be specific to this environment is that the nodes are used to scale a custom machineset:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","metadata":{"annotations":{},"creationTimestamp":"2020-09-28T11:06:43Z","generation":1,"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb"},"managedFields":[{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{}}},"f:spec":{".":{},"f:replicas":{},"f:selector":{".":{},"f:matchLabels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:template":{".":{},"f:metadata":{".":{},"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:spec":{".":{},"f:metadata":{},"f:providerSpec":{".":{},"f:value":{".":{},"f:apiVersion":{},"f:hostSelector":{},"f:image":{},"f:kind":{},"f:metadata":{},"f:userData":{}}}}}},"f:status":{}},"manager":"cluster-bootstrap","operation":"Update","time":"2020-09-28T11:06:43Z"},{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:availableReplicas":{},"f:fullyLabeledReplicas":{},"f:observedGeneration":{},"f:readyReplicas":{},"f:replicas":{}}},"manager":"machineset-controller","operation":"Update","time":"2020-09-28T12:15:39Z"}],"name":"ocp-edge1-clmn8-worker-lb-0","namespace":"openshift-machine-api","resourceVersion":"57809","uid":"a2a8e78f-2642-418f-a05a-ad166e91acb5"},"spec":{"replicas":0,"selector":{"matchLabels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"template":{"metadata":{"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"spec":{"metadata":{"labels":{"node-role.kubernetes.io/worker-lb":""}},"providerSpec":{"value":{"apiVersion":"baremetal.cluster.k8s.io/v1alpha1","hostSelector":{},"image":{"checksum":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum","url":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2"},"kind":"BareMetalMachineProviderSpec","metadata":{"creationTimestamp":null},"userData":{"name":"worker-user-data-managed"}}}}}}}
  creationTimestamp: "2020-09-29T14:43:38Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
    machine.openshift.io/cluster-api-machine-role: worker-lb
    machine.openshift.io/cluster-api-machine-type: worker-lb
  managedFields:
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:machine.openshift.io/cluster-api-cluster: {}
          f:machine.openshift.io/cluster-api-machine-role: {}
          f:machine.openshift.io/cluster-api-machine-type: {}
      f:spec:
        .: {}
        f:replicas: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:machine.openshift.io/cluster-api-cluster: {}
            f:machine.openshift.io/cluster-api-machineset: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:machine.openshift.io/cluster-api-cluster: {}
              f:machine.openshift.io/cluster-api-machine-role: {}
              f:machine.openshift.io/cluster-api-machine-type: {}
              f:machine.openshift.io/cluster-api-machineset: {}
          f:spec:
            .: {}
            f:metadata:
              .: {}
              f:labels:
                .: {}
                f:node-role.kubernetes.io/worker-lb: {}
            f:providerSpec:
              .: {}
              f:value:
                .: {}
                f:apiVersion: {}
                f:hostSelector: {}
                f:image: {}
                f:kind: {}
                f:metadata: {}
                f:userData: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2020-09-29T14:43:38Z"
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:availableReplicas: {}
        f:fullyLabeledReplicas: {}
        f:observedGeneration: {}
        f:readyReplicas: {}
        f:replicas: {}
    manager: machineset-controller
    operation: Update
    time: "2020-09-29T15:37:46Z"
  name: ocp-edge1-clmn8-worker-lb-0
  namespace: openshift-machine-api
  resourceVersion: "1537659"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/ocp-edge1-clmn8-worker-lb-0
  uid: 0f77eec9-777c-42c5-a916-da8f24fe26b3
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
      machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
        machine.openshift.io/cluster-api-machine-role: worker-lb
        machine.openshift.io/cluster-api-machine-type: worker-lb
        machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/worker-lb: ""
      providerSpec:
        value:
          apiVersion: baremetal.cluster.k8s.io/v1alpha1
          hostSelector: {}
          image:
            checksum: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum
            url: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2
          kind: BareMetalMachineProviderSpec
          metadata:
            creationTimestamp: null
          userData:
            name: worker-user-data-managed
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 2
  readyReplicas: 2
  replicas: 2

Comment 13 Marius Cornea 2020-09-30 19:06:39 UTC
(In reply to Bob Fournier from comment #10)
> The ironic-container-v4.6.0-202009301654.p0[1] has been built with the
> openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch
> pkg that has the token fix.
> 
> Marius - please retry with this fix and let us know. Thanks.
> 
> [1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

How do I find which release payload includes this image?

Comment 14 Bob Fournier 2020-09-30 19:15:14 UTC
> How do I find which release payload includes this image?

It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.

Comment 15 Bob Fournier 2020-09-30 19:28:29 UTC
> It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.

Sorry, 4.6.0-0.nightly-2020-09-30-191346 and later.
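
(As a general answer to the question in comment 13: one way to check which ironic image a given release payload carries is oc adm release info; the pullspec below is illustrative and assumes the payload tags the container as "ironic":)

oc adm release info <pullspec-of-4.6.0-0.nightly-2020-09-30-191346> --image-for=ironic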

Comment 19 errata-xmlrpc 2020-10-27 16:46:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

