Bug 1883569 - BMH node provisioning takes too long ~ 30 minutes: ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400
Summary: BMH node provisioning takes too long ~ 30 minutes: ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Bob Fournier
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1883678
Blocks:
 
Reported: 2020-09-29 15:30 UTC by Marius Cornea
Modified: 2020-10-27 16:46 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
ironic-python-agent tokens were not being generated properly by Ironic until the node was rebooted. With this fix the agent tokens are properly generated.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:20 UTC
Target Upstream Version:


Attachments
ironic provisioning errors screenshot (358.42 KB, image/png)
2020-09-29 15:30 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 739964 0 None MERGED Wipe agent token during reboot or power off 2021-02-13 20:27:06 UTC
OpenStack gerrit 751813 0 None MERGED Also wipe agent token on manual power off or reboot 2021-02-13 20:27:06 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:46:39 UTC

Description Marius Cornea 2020-09-29 15:30:23 UTC
Created attachment 1717573 [details]
ironic provisioning errors screenshot

Description of problem:

BMH node provisioning takes ~30 minutes, which is too long compared to other nodes, which take ~5 minutes. During provisioning the console shows errors like: "ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400". After ~20 minutes of looping through this error, the node reboots and the provisioning completes successfully.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-25-085318

How reproducible:
100%

Steps to Reproduce:
1. Create BareMetalHost objects

[kni@ocp-edge12 ~]$ cat bmh-node-deploy/openshift-worker-6-bmh.yaml 
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-6
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.41/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-6-bmc-secret"
    disableCertificateVerification: true

  bootMACAddress: "BC:97:E1:69:DB:60"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api

---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-7
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.42/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-7-bmc-secret"
    disableCertificateVerification: true

  bootMACAddress: "BC:97:E1:69:A1:A0"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
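
Note: each BareMetalHost above references a credentialsName secret (openshift-worker-6-bmc-secret / openshift-worker-7-bmc-secret) that must exist before the BMH is created. A minimal sketch of creating one and applying the manifest; the BMC username/password values are placeholders, not taken from this report:

oc -n openshift-machine-api create secret generic openshift-worker-6-bmc-secret \
  --from-literal=username=<bmc-username> \
  --from-literal=password=<bmc-password>
oc apply -f bmh-node-deploy/openshift-worker-6-bmh.yaml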

2. Wait for BMH nodes to get into Ready state

3. Scale the machineset
oc -n openshift-machine-api scale --replicas=2 machineset worker-lb

4. Watch BMH provisioning status and consoles
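
(For reference, one way to follow the BMH provisioning state and the resulting machines while the scale-up runs; these commands are a sketch and not part of the original report:)

oc -n openshift-machine-api get bmh -w
oc -n openshift-machine-api get machines -w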

Actual results:

Provisioning takes too long as ironic python agent reports errors.

Expected results:

Provisioning occurs without errors in shorter time.

Additional info:

Attaching a screenshot of the consoles and a must-gather.

Comment 2 Dmitry Tantsur 2020-09-29 16:07:17 UTC
> BMH node provisioning takes ~ 30 minutes which is too long compared to other nodes which takes ~ 5 minutes

Quick question: is it always the same node or a random one?

Comment 3 Dmitry Tantsur 2020-09-29 16:45:55 UTC
It feels like that's the case of the issue fixed in https://opendev.org/openstack/ironic/commit/2d4d375358c8eb923a0a3bcee211ac53b991e54b and https://opendev.org/openstack/ironic/commit/bc04a42a96672f70e9327e805586b48c825bb5a0. We need to check if we have both patches downstream.

Comment 4 Bob Fournier 2020-09-29 16:59:49 UTC
For the ironic-image, the last tagged ironic-conductor pkg openstack-ironic-conductor-1:15.1.1-0.20200724075308.3e92fd0.el8ost (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1266093) is from July 24, 2020.  So neither of those patches is included.

Comment 5 Bob Fournier 2020-09-29 17:13:56 UTC
Regarding this being a token issue, we can see in the ironic-api logs that a 400 error is being returned to the IPA because of a token issue:
metal3-ironic-api/metal3-ironic-api/logs/current.log:2020-09-29T15:14:59.850395695Z 2020-09-29 15:14:59.850 41 DEBUG ironic.common.json_rpc.client [req-941ca0fc-826f-4af8-971b-a788720c648b - - - - -] RPC heartbeat returned {"jsonrpc":"2.0","id":"req-941ca0fc-826f-4af8-971b-a788720c648b","error":{"code":400,"message":"Invalid or missing agent token received.","data":{"class":"ironic.common.exception.InvalidParameterValue"}}} _request /usr/lib/python3.6/site-packages/ironic/common/json_rpc/client.py:173

We also see these errors in the ironic-conductor logs:
metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T15:13:53.891450375Z 2020-09-29 15:13:53.891 1 DEBUG ironic.common.json_rpc.server [req-14ed9582-4404-4bb3-9872-3af5e5ded67f - - - - -] RPC error InvalidParameterValue: Invalid or missing agent token received. _handle_error /usr/lib/python3.6/site-packages/ironic/common/json_rpc/server.py:149

metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T14:58:35.268519323Z 2020-09-29 14:58:35.268 1 WARNING ironic.conductor.manager [req-47908a15-93c4-4310-950c-274d501cb0ca - - - - -] An agent token generation request is being refused as one is already present for node 4294fb3a-e7ba-4065-b6f1-bea6f591f4d9

As Dmitry has theorized, after some time Ironic will reboot the node.  At that point the token issue resolves and deployment completes, which is why the delay is noticed even though provisioning does eventually succeed.

We'll need to pick up the ironic pkg that includes the fixes Dmitry noted.
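
(For anyone retracing this in a must-gather, a rough sketch of pulling the token errors out of the metal3 pod logs, assuming the directory layout quoted above:)

grep -r "Invalid or missing agent token" metal3-ironic-api/ metal3-ironic-conductor/
grep -r "agent token generation request is being refused" metal3-ironic-conductor/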

Comment 6 Bob Fournier 2020-09-29 20:57:34 UTC
Actually the first patch https://review.opendev.org/#/c/739964/ is in this ironic image, but the 2nd patch https://review.opendev.org/751813 is not.  We're in the process of tagging the latest ironic packages along with dependencies to pick up this fix.

Comment 7 Tomas Sedovic 2020-09-30 10:20:12 UTC
Adding the cross-tagging request for the fixes as a dependency to this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1883678

Comment 8 Bob Fournier 2020-09-30 16:41:29 UTC
The ironic fix for agent token [1] has been tagged for inclusion in the ironic-image; it will be in the next ironic-image build.  Note that we've seen this fix resolve the same "agent token generation request is being refused" issue as reported here (see https://storyboard.openstack.org/#!/story/2008097).  That being said, we haven't been able to reproduce this particular problem in dev-scripts.

We recommend retesting with the updated ironic-image.

[1] https://review.opendev.org/751813

Comment 9 Derek Higgins 2020-09-30 16:52:53 UTC
Marius, we're having trouble reproducing this. I've successfully added additional BMHs to a cluster and scaled up onto them without any delay (using 4.6.0-0.nightly-2020-09-30-091659).

I did have some luck provoking the same 400 error by powering nodes on and off before scaling up.

Did you do anything after adding the BMHs and before attempting to scale up? Also, can you retry with the ironic fix Bob has just tagged?
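
(For completeness, one way to power a BMH off and on without touching the BMC directly is to toggle spec.online; a sketch using the host name from the reproducer, not necessarily how the 400 error was provoked here:)

oc -n openshift-machine-api patch bmh openshift-worker-6 --type merge -p '{"spec":{"online":false}}'
oc -n openshift-machine-api patch bmh openshift-worker-6 --type merge -p '{"spec":{"online":true}}'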

Comment 10 Bob Fournier 2020-09-30 18:53:35 UTC
The ironic-container-v4.6.0-202009301654.p0[1] has been built with the openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch pkg that has the token fix.

Marius - please retry with this fix and let us know. Thanks.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

Comment 12 Marius Cornea 2020-09-30 19:03:37 UTC
(In reply to Derek Higgins from comment #9)
> Marius, we're having trouble reproducing this, I've successfully added
> additional bmh's to a cluster and scaled up onto them without any delay
> (using 4.6.0-0.nightly-2020-09-30-091659)
> 
> I did have some luck provoking the same 400 error by powering nodes on and
> off before scaling up.
> 
> Did you do anything after adding the bmh's and before attempting to scale
> up? Also can you retry with the ironic fix Bob has just tagged.

One thing that could be specific to this environment is that the nodes are used to scale a custom machineset:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","metadata":{"annotations":{},"creationTimestamp":"2020-09-28T11:06:43Z","generation":1,"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb"},"managedFields":[{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{}}},"f:spec":{".":{},"f:replicas":{},"f:selector":{".":{},"f:matchLabels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:template":{".":{},"f:metadata":{".":{},"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:spec":{".":{},"f:metadata":{},"f:providerSpec":{".":{},"f:value":{".":{},"f:apiVersion":{},"f:hostSelector":{},"f:image":{},"f:kind":{},"f:metadata":{},"f:userData":{}}}}}},"f:status":{}},"manager":"cluster-bootstrap","operation":"Update","time":"2020-09-28T11:06:43Z"},{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:availableReplicas":{},"f:fullyLabeledReplicas":{},"f:observedGeneration":{},"f:readyReplicas":{},"f:replicas":{}}},"manager":"machineset-controller","operation":"Update","time":"2020-09-28T12:15:39Z"}],"name":"ocp-edge1-clmn8-worker-lb-0","namespace":"openshift-machine-api","resourceVersion":"57809","uid":"a2a8e78f-2642-418f-a05a-ad166e91acb5"},"spec":{"replicas":0,"selector":{"matchLabels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"template":{"metadata":{"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"spec":{"metadata":{"labels":{"node-role.kubernetes.io/worker-lb":""}},"providerSpec":{"value":{"apiVersion":"baremetal.cluster.k8s.io/v1alpha1","hostSelector":{},"image":{"checksum":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum","url":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2"},"kind":"BareMetalMachineProviderSpec","metadata":{"creationTimestamp":null},"userData":{"name":"worker-user-data-managed"}}}}}}}
  creationTimestamp: "2020-09-29T14:43:38Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
    machine.openshift.io/cluster-api-machine-role: worker-lb
    machine.openshift.io/cluster-api-machine-type: worker-lb
  managedFields:
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:machine.openshift.io/cluster-api-cluster: {}
          f:machine.openshift.io/cluster-api-machine-role: {}
          f:machine.openshift.io/cluster-api-machine-type: {}
      f:spec:
        .: {}
        f:replicas: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:machine.openshift.io/cluster-api-cluster: {}
            f:machine.openshift.io/cluster-api-machineset: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:machine.openshift.io/cluster-api-cluster: {}
              f:machine.openshift.io/cluster-api-machine-role: {}
              f:machine.openshift.io/cluster-api-machine-type: {}
              f:machine.openshift.io/cluster-api-machineset: {}
          f:spec:
            .: {}
            f:metadata:
              .: {}
              f:labels:
                .: {}
                f:node-role.kubernetes.io/worker-lb: {}
            f:providerSpec:
              .: {}
              f:value:
                .: {}
                f:apiVersion: {}
                f:hostSelector: {}
                f:image: {}
                f:kind: {}
                f:metadata: {}
                f:userData: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2020-09-29T14:43:38Z"
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:availableReplicas: {}
        f:fullyLabeledReplicas: {}
        f:observedGeneration: {}
        f:readyReplicas: {}
        f:replicas: {}
    manager: machineset-controller
    operation: Update
    time: "2020-09-29T15:37:46Z"
  name: ocp-edge1-clmn8-worker-lb-0
  namespace: openshift-machine-api
  resourceVersion: "1537659"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/ocp-edge1-clmn8-worker-lb-0
  uid: 0f77eec9-777c-42c5-a916-da8f24fe26b3
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
      machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
        machine.openshift.io/cluster-api-machine-role: worker-lb
        machine.openshift.io/cluster-api-machine-type: worker-lb
        machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/worker-lb: ""
      providerSpec:
        value:
          apiVersion: baremetal.cluster.k8s.io/v1alpha1
          hostSelector: {}
          image:
            checksum: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum
            url: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2
          kind: BareMetalMachineProviderSpec
          metadata:
            creationTimestamp: null
          userData:
            name: worker-user-data-managed
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 2
  readyReplicas: 2
  replicas: 2

Comment 13 Marius Cornea 2020-09-30 19:06:39 UTC
(In reply to Bob Fournier from comment #10)
> The ironic-container-v4.6.0-202009301654.p0[1] has been built with the
> openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch
> pkg that has the token fix.
> 
> Marius - please retry with this fix and let us know. Thanks.
> 
> [1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

How do I find which release payload includes this image?

Comment 14 Bob Fournier 2020-09-30 19:15:14 UTC
> How do I find which release payload includes this image?

It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.

Comment 15 Bob Fournier 2020-09-30 19:28:29 UTC
> It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.

Sorry, 4.6.0-0.nightly-2020-09-30-191346 and later.
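
(As a general answer to the question in comment 13: one way to check which ironic image a given release payload carries is oc adm release info; the pullspec below is illustrative and assumes the payload tags the container as "ironic":)

oc adm release info <pullspec-of-4.6.0-0.nightly-2020-09-30-191346> --image-for=ironic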

Comment 19 errata-xmlrpc 2020-10-27 16:46:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

