Bug 1883569
| Field | Value |
|---|---|
| Summary | BMH node provisioning takes too long (~30 minutes): ironic-python-agent reports errors: "Error heartbeating to agent API: Invalid status code 400" |
| Product | OpenShift Container Platform |
| Component | Bare Metal Hardware Provisioning |
| Sub component | ironic |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Marius Cornea <mcornea> |
| Assignee | Bob Fournier <bfournie> |
| QA Contact | Amit Ugol <augol> |
| CC | bfournie, derekh |
| Keywords | TestBlocker, Triaged |
| Type | Bug |
| Doc Type | Bug Fix |
| Doc Text | ironic-python-agent tokens were not being generated properly by Ironic until the node was rebooted. With this fix, the agent tokens are properly generated. |
| Bug Depends On | 1883678 |
| Last Closed | 2020-10-27 16:46:20 UTC |
| Attachments | ironic provisioning errors screenshot (attachment 1717573) |
> BMH node provisioning takes ~30 minutes, which is too long compared to other nodes, which take ~5 minutes
Quick question: is it always the same node or a random one?
It feels like this is a case of the issue fixed in https://opendev.org/openstack/ironic/commit/2d4d375358c8eb923a0a3bcee211ac53b991e54b and https://opendev.org/openstack/ironic/commit/bc04a42a96672f70e9327e805586b48c825bb5a0. We need to check whether we have both patches downstream.

For the ironic-image, the last tagged ironic-conductor pkg, openstack-ironic-conductor-1:15.1.1-0.20200724075308.3e92fd0.el8ost (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1266093), is from July 24, 2020, so neither of those patches is included.

Regarding this being a token issue: we can see in the ironic-api logs that a 400 error is being returned to the IPA because of an invalid token:
```
metal3-ironic-api/metal3-ironic-api/logs/current.log:2020-09-29T15:14:59.850395695Z 2020-09-29 15:14:59.850 41 DEBUG ironic.common.json_rpc.client [req-941ca0fc-826f-4af8-971b-a788720c648b - - - - -] RPC heartbeat returned {"jsonrpc":"2.0","id":"req-941ca0fc-826f-4af8-971b-a788720c648b","error":{"code":400,"message":"Invalid or missing agent token received.","data":{"class":"ironic.common.exception.InvalidParameterValue"}}} _request /usr/lib/python3.6/site-packages/ironic/common/json_rpc/client.py:173
```
We also see these errors in the ironic-conductor logs:
```
metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T15:13:53.891450375Z 2020-09-29 15:13:53.891 1 DEBUG ironic.common.json_rpc.server [req-14ed9582-4404-4bb3-9872-3af5e5ded67f - - - - -] RPC error InvalidParameterValue: Invalid or missing agent token received. _handle_error /usr/lib/python3.6/site-packages/ironic/common/json_rpc/server.py:149
metal3-ironic-conductor/metal3-ironic-conductor/logs/current.log:2020-09-29T14:58:35.268519323Z 2020-09-29 14:58:35.268 1 WARNING ironic.conductor.manager [req-47908a15-93c4-4310-950c-274d501cb0ca - - - - -] An agent token generation request is being refused as one is already present for node 4294fb3a-e7ba-4065-b6f1-bea6f591f4d9
```
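(For anyone triaging a similar report: the errors above can be located in a must-gather with a quick grep. A minimal sketch; the extraction path is an assumption, and the subdirectories mirror the log paths quoted in this bug.)

```sh
# Sketch: grep a must-gather for the agent-token errors quoted above.
# MUST_GATHER is assumed to be the directory the must-gather was extracted to.
MUST_GATHER=./must-gather
grep -r "Invalid or missing agent token" \
  "$MUST_GATHER/metal3-ironic-api" "$MUST_GATHER/metal3-ironic-conductor"
grep -r "agent token generation request is being refused" \
  "$MUST_GATHER/metal3-ironic-conductor"
```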
As Dmitry theorized, after some time Ironic reboots the node; at that point the token issue resolves and deployment completes. That is why the delay is noticed even though provisioning does eventually complete.
We'll need to pick up the ironic pkg that includes the fixes Dmitry noted.
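(As a quick check, one way to see which ironic-conductor package a running cluster's ironic image actually contains is to query the RPMs inside the conductor container. A sketch only; the metal3 pod lookup and the metal3-ironic-conductor container name are assumptions based on the log paths above and may differ by release.)

```sh
# Sketch: confirm the openstack-ironic-conductor version in the running image.
# Pod and container names are assumptions taken from the log paths in this bug.
POD=$(oc -n openshift-machine-api get pods -o name | grep metal3 | head -1)
oc -n openshift-machine-api exec "$POD" -c metal3-ironic-conductor -- \
  rpm -q openstack-ironic-conductor
```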
Actually the patch https://review.opendev.org/#/c/739964/ is in this ironic image, but the second patch https://review.opendev.org/751813 is not. In the process of tagging the latest ironic packages along with dependencies to pick up this fix.

Adding the cross-tagging request for the fixes as a dependency to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1883678

The ironic fix for the agent token [1] has been tagged for inclusion in the ironic-image; it will be in the next ironic-image build. Note that we've seen this fix resolve the same "agent token generation request is being refused" issue as reported here (see https://storyboard.openstack.org/#!/story/2008097). That said, we haven't been able to reproduce this particular problem in dev-scripts. We recommend retesting with the updated ironic-image.

[1] https://review.opendev.org/751813

Marius, we're having trouble reproducing this. I've successfully added additional bmh's to a cluster and scaled up onto them without any delay (using 4.6.0-0.nightly-2020-09-30-091659).

I did have some luck provoking the same 400 error by powering nodes on and off before scaling up.

Did you do anything after adding the bmh's and before attempting to scale up? Also can you retry with the ironic fix Bob has just tagged?

The ironic-container-v4.6.0-202009301654.p0 [1] has been built with the openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch pkg that has the token fix.

Marius - please retry with this fix and let us know. Thanks.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

(In reply to Derek Higgins from comment #9)
> Marius, we're having trouble reproducing this. I've successfully added
> additional bmh's to a cluster and scaled up onto them without any delay
> (using 4.6.0-0.nightly-2020-09-30-091659).
>
> I did have some luck provoking the same 400 error by powering nodes on and
> off before scaling up.
>
> Did you do anything after adding the bmh's and before attempting to scale
> up? Also can you retry with the ironic fix Bob has just tagged?
One thing which could be specific to this environment is that the nodes are used to scale a custom machineset:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"machine.openshift.io/v1beta1","kind":"MachineSet","metadata":{"annotations":{},"creationTimestamp":"2020-09-28T11:06:43Z","generation":1,"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb"},"managedFields":[{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{}}},"f:spec":{".":{},"f:replicas":{},"f:selector":{".":{},"f:matchLabels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:template":{".":{},"f:metadata":{".":{},"f:labels":{".":{},"f:machine.openshift.io/cluster-api-cluster":{},"f:machine.openshift.io/cluster-api-machine-role":{},"f:machine.openshift.io/cluster-api-machine-type":{},"f:machine.openshift.io/cluster-api-machineset":{}}},"f:spec":{".":{},"f:metadata":{},"f:providerSpec":{".":{},"f:value":{".":{},"f:apiVersion":{},"f:hostSelector":{},"f:image":{},"f:kind":{},"f:metadata":{},"f:userData":{}}}}}},"f:status":{}},"manager":"cluster-bootstrap","operation":"Update","time":"2020-09-28T11:06:43Z"},{"apiVersion":"machine.openshift.io/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:availableReplicas":{},"f:fullyLabeledReplicas":{},"f:observedGeneration":{},"f:readyReplicas":{},"f:replicas":{}}},"manager":"machineset-controller","operation":"Update","time":"2020-09-28T12:15:39Z"}],"name":"ocp-edge1-clmn8-worker-lb-0","namespace":"openshift-machine-api","resourceVersion":"57809","uid":"a2a8e78f-2642-418f-a05a-ad166e91acb5"},"spec":{"replicas":0,"selector":{"matchLabels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"template":{"metadata":{"labels":{"machine.openshift.io/cluster-api-cluster":"ocp-edge1-clmn8","machine.openshift.io/cluster-api-machine-role":"worker-lb","machine.openshift.io/cluster-api-machine-type":"worker-lb","machine.openshift.io/cluster-api-machineset":"ocp-edge1-clmn8-worker-lb-0"}},"spec":{"metadata":{"labels":{"node-role.kubernetes.io/worker-lb":""}},"providerSpec":{"value":{"apiVersion":"baremetal.cluster.k8s.io/v1alpha1","hostSelector":{},"image":{"checksum":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum","url":"http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2"},"kind":"BareMetalMachineProviderSpec","metadata":{"creationTimestamp":null},"userData":{"name":"worker-user-data-managed"}}}}}}}
  creationTimestamp: "2020-09-29T14:43:38Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
    machine.openshift.io/cluster-api-machine-role: worker-lb
    machine.openshift.io/cluster-api-machine-type: worker-lb
  managedFields:
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:machine.openshift.io/cluster-api-cluster: {}
          f:machine.openshift.io/cluster-api-machine-role: {}
          f:machine.openshift.io/cluster-api-machine-type: {}
      f:spec:
        .: {}
        f:replicas: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:machine.openshift.io/cluster-api-cluster: {}
            f:machine.openshift.io/cluster-api-machineset: {}
        f:template:
          .: {}
          f:metadata:
            .: {}
            f:labels:
              .: {}
              f:machine.openshift.io/cluster-api-cluster: {}
              f:machine.openshift.io/cluster-api-machine-role: {}
              f:machine.openshift.io/cluster-api-machine-type: {}
              f:machine.openshift.io/cluster-api-machineset: {}
          f:spec:
            .: {}
            f:metadata:
              .: {}
              f:labels:
                .: {}
                f:node-role.kubernetes.io/worker-lb: {}
            f:providerSpec:
              .: {}
              f:value:
                .: {}
                f:apiVersion: {}
                f:hostSelector: {}
                f:image: {}
                f:kind: {}
                f:metadata: {}
                f:userData: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2020-09-29T14:43:38Z"
  - apiVersion: machine.openshift.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:availableReplicas: {}
        f:fullyLabeledReplicas: {}
        f:observedGeneration: {}
        f:readyReplicas: {}
        f:replicas: {}
    manager: machineset-controller
    operation: Update
    time: "2020-09-29T15:37:46Z"
  name: ocp-edge1-clmn8-worker-lb-0
  namespace: openshift-machine-api
  resourceVersion: "1537659"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/ocp-edge1-clmn8-worker-lb-0
  uid: 0f77eec9-777c-42c5-a916-da8f24fe26b3
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
      machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ocp-edge1-clmn8
        machine.openshift.io/cluster-api-machine-role: worker-lb
        machine.openshift.io/cluster-api-machine-type: worker-lb
        machine.openshift.io/cluster-api-machineset: ocp-edge1-clmn8-worker-lb-0
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/worker-lb: ""
      providerSpec:
        value:
          apiVersion: baremetal.cluster.k8s.io/v1alpha1
          hostSelector: {}
          image:
            checksum: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2.md5sum
            url: http://[fd00:1101::3]:6180/images/rhcos-46.82.202009222340-0-openstack.x86_64.qcow2/rhcos-46.82.202009222340-0-compressed.x86_64.qcow2
          kind: BareMetalMachineProviderSpec
          metadata:
            creationTimestamp: null
          userData:
            name: worker-user-data-managed
status:
  availableReplicas: 2
  fullyLabeledReplicas: 2
  observedGeneration: 2
  readyReplicas: 2
  replicas: 2
```

(In reply to Bob Fournier from comment #10)
> The ironic-container-v4.6.0-202009301654.p0 [1] has been built with the
> openstack-ironic-conductor-1:15.2.1-0.20200929104811.fb90ed4.el8ost.noarch
> pkg that has the token fix.
>
> Marius - please retry with this fix and let us know. Thanks.
>
> [1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1336575

How do I find which release payload includes this image?

> How do I find which release payload includes this image?
It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.
> It should be in 4.6.0-0.nightly-2020-09-30-164414 and later.
Sorry, 4.6.0-0.nightly-2020-09-30-191346 and later.
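(For reference, `oc adm release info` can answer the question above from the CLI. A sketch under stated assumptions: the payload pullspec is a placeholder, and the `ironic` image name should be verified against the payload's full image list.)

```sh
# Sketch: check which ironic image a given release payload carries.
# RELEASE_PULLSPEC is a placeholder for the nightly's pullspec; the
# "ironic" image name is an assumption -- run without --image-for to
# list every image in the payload.
RELEASE_PULLSPEC="<registry>/<repo>:4.6.0-0.nightly-2020-09-30-191346"
oc adm release info "$RELEASE_PULLSPEC" --image-for=ironic
```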
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
Created attachment 1717573 [details]
ironic provisioning errors screenshot

Description of problem:
BMH node provisioning takes ~30 minutes, which is too long compared to other nodes, which take ~5 minutes. During provisioning the console shows errors like: "ironic python agent reports errors: Error heartbeating to agent API: Invalid status code 400". After ~20 minutes of looping through this error, the node reboots and provisioning completes successfully.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-25-085318

How reproducible:
100%

Steps to Reproduce:
1. Create BareMetalHost objects:

```
[kni@ocp-edge12 ~]$ cat bmh-node-deploy/openshift-worker-6-bmh.yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-6
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.41/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-6-bmc-secret"
    disableCertificateVerification: true
  bootMACAddress: "BC:97:E1:69:DB:60"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: null
  name: openshift-worker-7
  namespace: openshift-machine-api
spec:
  bmc:
    address: "redfish://10.46.3.42/redfish/v1/Systems/System.Embedded.1"
    credentialsName: "openshift-worker-7-bmc-secret"
    disableCertificateVerification: true
  bootMACAddress: "BC:97:E1:69:A1:A0"
  online: false
  userData:
    name: worker-user-data
    namespace: openshift-machine-api
```

2. Wait for the BMH nodes to get into the Ready state.
3. Scale the machineset:

```
oc -n openshift-machine-api scale --replicas=2 machineset worker-lb
```

4. Watch the BMH provisioning status and consoles.

Actual results:
Provisioning takes too long as ironic-python-agent reports errors.

Expected results:
Provisioning completes without errors in a shorter time.

Additional info:
Attaching consoles screenshot and must-gather.
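(A hedged convenience for step 4 above: BareMetalHost state transitions can also be followed from the CLI instead of the consoles; `bmh` is the short name for the BareMetalHost resource.)

```sh
# Sketch: watch BareMetalHost provisioning states while the machineset
# scales up (step 4 above), then check the resulting machines and nodes.
oc -n openshift-machine-api get bmh -w
oc -n openshift-machine-api get machines
oc get nodes
```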