Bug 1991568 - Introspection fails with Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://localhost:6181/images/ironic-python-agent.kernel failed
Keywords:
Status: CLOSED DUPLICATE of bug 1972753
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Tomas Sedovic
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-09 12:59 UTC by Andreas Karis
Modified: 2021-08-10 18:57 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-10 18:57:57 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Andreas Karis 2021-08-09 12:59:36 UTC
Description of problem:

Introspection fails with: Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://localhost:6181/images/ironic-python-agent.kernel failed

This exact same environment, with the exact same deployment steps, works well in OCP 4.6 and OCP 4.7.

The only moving part is the OCP 4.8 version.

~~~
[root@openshift-jumpserver-0 ~]# oc get machines -A
NAMESPACE               NAME                               PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ipi-cluster-kmmz4-master-0         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-master-1         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-master-2         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-worker-0-4wfbd   Provisioning                          81m
openshift-machine-api   ipi-cluster-kmmz4-worker-0-klzsj   Provisioning                          81m
[root@openshift-jumpserver-0 ~]# oc get bmj
error: the server doesn't have a resource type "bmj"
[root@openshift-jumpserver-0 ~]# oc get bmh
No resources found in default namespace.
[root@openshift-jumpserver-0 ~]# oc get bmh -A
NAMESPACE               NAME                 STATE                    CONSUMER                     ONLINE   ERROR
openshift-machine-api   openshift-master-0   externally provisioned   ipi-cluster-kmmz4-master-0   true     
openshift-machine-api   openshift-master-1   externally provisioned   ipi-cluster-kmmz4-master-1   true     
openshift-machine-api   openshift-master-2   externally provisioned   ipi-cluster-kmmz4-master-2   true     
openshift-machine-api   openshift-worker-0   inspecting                                            true     inspection error
openshift-machine-api   openshift-worker-1   inspecting                                            true     inspection error
openshift-machine-api   openshift-worker-2   inspecting                                            true     inspection error
[root@openshift-jumpserver-0 ~]# oc get bmh -n openshift-machine-api   openshift-worker-0  -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2021-08-09T11:14:12Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "18325"
  uid: 09cd8ab4-584a-4d87-88a2-0de2d729f4bf
spec:
  automatedCleaningMode: metadata
  bmc:
    address: ipmi://dell-r430-30-rc.mgmt.cee.ral3.lab.eng.rdu2.redhat.com
    credentialsName: openshift-worker-0-bmc-secret
  bootMACAddress: 18:66:da:9f:b1:0d
  hardwareProfile: unknown
  online: true
status:
  errorCount: 1
  errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection: Validation
    of image href http://localhost:6181/images/ironic-python-agent.kernel failed,
    reason: HTTPConnectionPool(host=''localhost'', port=6181): Max retries exceeded
    with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError(''<urllib3.connection.HTTPConnection
    object at 0x7fd12a844320>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'',))'
  errorType: inspection error
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "17473"
  hardwareProfile: ""
  lastUpdated: "2021-08-09T11:39:30Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: "2021-08-09T11:39:04Z"
    provision:
      end: null
      start: null
    register:
      end: "2021-08-09T11:39:04Z"
      start: "2021-08-09T11:37:06Z"
  operationalStatus: error
  poweredOn: false
  provisioning:
    ID: 5ae5af65-e429-41ed-835b-f1808b4100c7
    bootMode: UEFI
    image:
      url: ""
    state: inspecting
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "17473"
~~~
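For what it's worth, the validation failure is just a refused TCP connection, so it can be spot-checked directly. A hypothetical check (not from the original report): on any machine where nothing listens on port 6181, plain curl reproduces the same failure mode that Ironic's image validation reports above.

```shell
# Hypothetical spot check of the failing image href (not part of the original
# report). When nothing answers on port 6181, curl exits non-zero with
# "Connection refused", the same failure Ironic's validation reports above.
url="http://localhost:6181/images/ironic-python-agent.kernel"
curl -sS -o /dev/null "$url" 2>/dev/null
echo "curl exit: $?"   # non-zero when nothing answers on the port
```

Inside the cluster, the equivalent would be something like `oc -n openshift-machine-api exec <metal3 pod> -c metal3-httpd -- curl -sf "$url"`, since Ironic resolves `localhost` in the metal3 pod's network namespace (the `metal3-httpd` container name is taken from the pod's container list quoted in comment 4).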





Comment 3 Bob Fournier 2021-08-09 14:35:11 UTC
Looks like a general issue connecting to 'localhost' when 'localhost' is used in the deploy_kernel path:
'driver_info': {'deploy_kernel': 'http://localhost:6181/images/ironic-python-agent.kernel', 'deploy_ramdisk': 'http://localhost:6181/images/ironic-python-agent.initramfs'}

reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd12a7cd0b8>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Comment 4 Andreas Karis 2021-08-09 18:30:42 UTC
I reinstalled my cluster just to see if this happens again. On the iLO, I see a connection timeout to 192.168.123.26:6180 on image fetch.

It turns out that the provisioning IP is gone. That IP is only set once, during init of the metal3 pod (by the metal3-static-ip-set init container):
~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP                NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-558c76fc6-lfprm   2/2     Running   4          95m   172.25.0.14       openshift-master-0   <none>           <none>
cluster-baremetal-operator-c4db8ddc6-v8btz    2/2     Running   2          95m   172.24.0.9        openshift-master-2   <none>           <none>
machine-api-controllers-f8995d59f-jb2dj       7/7     Running   3          78m   172.25.0.25       openshift-master-0   <none>           <none>
machine-api-operator-5b9bbcb7d7-q7jkm         2/2     Running   3          95m   172.25.0.7        openshift-master-0   <none>           <none>
metal3-6f7cc79d4d-8zc4d                       8/8     Running   0          77m   192.168.123.201   openshift-master-1   <none>           <none>
metal3-image-cache-jktgr                      1/1     Running   0          77m   192.168.123.200   openshift-master-0   <none>           <none>
metal3-image-cache-w2gw8                      1/1     Running   0          77m   192.168.123.202   openshift-master-2   <none>           <none>
metal3-image-cache-xnxzv                      1/1     Running   0          77m   192.168.123.201   openshift-master-1   <none>           <none>
[root@openshift-jumpserver-0 ~]# oc logs metal3-6f7cc79d4d-8zc4d -c  metal3-static-ip-set
+ '[' -z 192.168.123.26/24 ']'
+ '[' -z '' ']'
++ echo 192.168.123.26/24
++ cut -d/ -f1
+ IP_ONLY=192.168.123.26
++ ip -j addr
++ jq -r -c '.[].addr_info[] | select(.local == "192.168.123.26") | .label'
+ PROVISIONING_INTERFACE=
+ '[' -z '' ']'
++ ip -j route get 192.168.123.26
++ jq -r '.[] | select(.dev != "lo") | .dev'
+ PROVISIONING_INTERFACE=br-ex
+ '[' -z br-ex ']'
+ /usr/sbin/ip addr add 192.168.123.26/24 dev br-ex valid_lft 300 preferred_lft 300
~~~

For some reason (late network setup? a network restart?), the IP is gone afterwards:
~~~
[root@openshift-jumpserver-0 ~]# ssh core@openshift-master-1 sudo ip a | grep 192.168.123
    inet 192.168.123.201/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
[root@openshift-jumpserver-0 ~]# 
~~~


In OCP 4.7, we had that same init container, metal3-static-ip-set, which ran `/set-static-ip`; but there was also a regular container, metal3-static-ip-manager, which ran `/refresh-static-ip` continuously and kept fixing up the IP addresses. In OCP 4.8, that container is gone.

OCP 4.7:
~~~
[root@openshift-jumpserver-0 ~]# oc logs  metal3-64d647dff6-zcklt 
error: a container name must be specified for pod metal3-64d647dff6-zcklt, choose one of: [metal3-baremetal-operator metal3-mariadb metal3-httpd metal3-ironic-conductor ironic-inspector-ramdisk-logs metal3-ironic-api ironic-deploy-ramdisk-logs metal3-ironic-inspector metal3-static-ip-manager] or one of the init containers: [metal3-ipa-downloader metal3-machine-os-downloader metal3-static-ip-set]
~~~

OCP 4.8:
~~~
[root@openshift-jumpserver-0 ~]# oc logs metal3-6f7cc79d4d-8zc4d 
error: a container name must be specified for pod metal3-6f7cc79d4d-8zc4d, choose one of: [metal3-baremetal-operator metal3-mariadb metal3-httpd metal3-ironic-conductor ironic-inspector-ramdisk-logs metal3-ironic-api ironic-deploy-ramdisk-logs metal3-ironic-inspector] or one of the init containers: [metal3-ipa-downloader metal3-machine-os-downloader metal3-static-ip-set]
~~~

In OCP 4.7:
~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o yaml | grep static-ip-manager -C10
(...)
   - command:
      - /refresh-static-ip
      env:
      - name: PROVISIONING_IP
        value: 192.168.123.26/24
      - name: PROVISIONING_INTERFACE
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e14dffc18d80f4fe247209105c46a24a3db1319cf056f847d705e53eccf3a57
      imagePullPolicy: IfNotPresent
      name: metal3-static-ip-manager
      resources: {}
      securityContext:
        privileged: true
(...)
[root@openshift-jumpserver-0 ~]# oc logs -c metal3-static-ip-manager metal3-64d647dff6-zcklt | head -n 30
+ '[' -z 192.168.123.26/24 ']'
+ '[' -z '' ']'
++ echo 192.168.123.26/24
++ cut -d/ -f1
+ IP_ONLY=192.168.123.26
++ ip -j addr
++ jq -r -c '.[].addr_info[] | select(.local == "192.168.123.26") | .label'
+ PROVISIONING_INTERFACE=br-ex
+ '[' -z br-ex ']'
+ '[' -z br-ex ']'
+ /usr/sbin/ip addr add 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
RTNETLINK answers: File exists
+ true
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
~~~
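Pieced together from the trace above, the 4.7 refresh loop amounts to something like the following. This is a reconstructed sketch, not the actual `/refresh-static-ip` script; the privileged ip(8) calls are echoed via a `run` helper so the sketch runs without root:

```shell
# Reconstructed from the metal3-static-ip-manager trace above; `run` echoes
# instead of executing, so this is safe to run without root or br-ex.
PROVISIONING_IP="192.168.123.26/24"
DEV="br-ex"
run() { echo "+ $*"; }   # dry-run stand-in for the privileged ip(8) calls

# Initial add with a short lifetime; the kernel drops the address when the
# lifetime expires, so the loop below must keep renewing it.
run /usr/sbin/ip addr add "$PROVISIONING_IP" dev "$DEV" valid_lft 10 preferred_lft 10

for i in 1 2 3; do   # the real script loops forever with `sleep 5`
    # `addr change` renews the lifetime (or re-adds the address if it expired)
    run /usr/sbin/ip addr change "$PROVISIONING_IP" dev "$DEV" valid_lft 10 preferred_lft 10
done
```

Note the lifetimes: the one-shot metal3-static-ip-set init container adds the address with `valid_lft 300` (see its trace above), so without this renewal loop the kernel would silently drop the provisioning IP about five minutes after pod init, which would explain the "IP is gone" symptom.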


So in OCP 4.7, when I deleted the IP, it was re-added right away:
~~~
[root@openshift-master-2 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:f1:e7:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.202/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 76444sec preferred_lft 76444sec
    inet 192.168.123.20/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fc00::b10:4666:152:932d/64 scope global dynamic noprefixroute 
       valid_lft 86387sec preferred_lft 14387sec
    inet6 fe80::f830:cc0:ac03:9aa4/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@openshift-master-2 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:f1:e7:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.202/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 76441sec preferred_lft 76441sec
    inet 192.168.123.20/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet 192.168.123.26/24 scope global secondary dynamic br-ex
       valid_lft 10sec preferred_lft 10sec
    inet6 fc00::b10:4666:152:932d/64 scope global dynamic noprefixroute 
       valid_lft 86384sec preferred_lft 14384sec
    inet6 fe80::f830:cc0:ac03:9aa4/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@openshift-master-2 ~]# 
~~~

Comment 5 Andreas Karis 2021-08-09 18:32:09 UTC
So on the second deploy attempt, I had to manually fix this with:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP                NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-558c76fc6-lfprm   2/2     Running   4          110m   172.25.0.14       openshift-master-0   <none>           <none>
cluster-baremetal-operator-c4db8ddc6-v8btz    2/2     Running   2          110m   172.24.0.9        openshift-master-2   <none>           <none>
machine-api-controllers-f8995d59f-jb2dj       7/7     Running   3          93m    172.25.0.25       openshift-master-0   <none>           <none>
machine-api-operator-5b9bbcb7d7-q7jkm         2/2     Running   3          110m   172.25.0.7        openshift-master-0   <none>           <none>
metal3-6f7cc79d4d-8zc4d                       8/8     Running   0          92m    192.168.123.201   openshift-master-1   <none>           <none>
metal3-image-cache-jktgr                      1/1     Running   0          92m    192.168.123.200   openshift-master-0   <none>           <none>
metal3-image-cache-w2gw8                      1/1     Running   0          92m    192.168.123.202   openshift-master-2   <none>           <none>
metal3-image-cache-xnxzv                      1/1     Running   0          92m    192.168.123.201   openshift-master-1   <none>           <none>
[root@openshift-jumpserver-0 ~]# ssh core@openshift-master-1
Red Hat Enterprise Linux CoreOS 48.84.202107271439-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
Last login: Mon Aug  9 17:56:46 2021 from 192.168.123.1
[core@openshift-master-1 ~]$ sudo -i
[root@openshift-master-1 ~]# ip a a dev br-ex 192.168.123.26/24
[root@openshift-master-1 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:91:0a:d4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.201/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 80454sec preferred_lft 80454sec
    inet 192.168.123.26/24 scope global secondary br-ex
       valid_lft forever preferred_lft forever
    inet6 fc00::6141:56f0:30a8:1576/64 scope global dynamic noprefixroute 
       valid_lft 86376sec preferred_lft 14376sec
    inet6 fe80::15d:2faf:87fc:86b5/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@openshift-master-1 ~]#
~~~

Comment 6 Andreas Karis 2021-08-10 11:07:22 UTC
Comment 0 and the issue starting in comment 4 are likely two different issues, hit on the same cluster in two different installation runs.

After I fixed the second run as described in comment #5, I had to extract the bmh objects and secrets (`oc get -o yaml bmh`, `oc get -o yaml secret` for each worker), delete the bmh objects, and then recreate the bmh objects and secrets to trigger a new introspection and deployment.
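That save/delete/recreate cycle, sketched as commands (worker and secret names assumed from this cluster's transcripts; the `run` helper echoes instead of executing, so the sketch runs without a cluster):

```shell
# Sketch of the recreate cycle described above. Names are taken from this
# cluster's transcripts; `run` echoes instead of executing `oc`.
NS=openshift-machine-api
run() { echo "+ $*"; }

for w in openshift-worker-0 openshift-worker-1 openshift-worker-2; do
  # 1. Save the BareMetalHost and its BMC secret.
  run oc -n "$NS" get bmh "$w" -o yaml       # > "$w-bmh.yaml"
  run oc -n "$NS" get secret "$w-bmc-secret" -o yaml  # > "$w-secret.yaml"
  # 2. Delete the stuck BareMetalHost.
  run oc -n "$NS" delete bmh "$w"
done
# 3. Recreate secrets and bmh objects to trigger a fresh introspection
#    (assumption: status/resourceVersion stripped from the saved YAML first).
run oc -n "$NS" apply -f .
```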

I will start a 3rd run soon and will report back.

Comment 7 Andreas Karis 2021-08-10 14:53:51 UTC
I redeployed my cluster and got back into the same state as in comment #0.

I'll attach file bmh-4-8-install-issue.txt with the log output of my install script and my troubleshooting.

I then tried to delete one of the failed bmh objects. This did not work; the object was stuck in deleting until I deleted the metal3 pod and the cluster-baremetal-operator pod (the cluster-baremetal-operator pod had issues reaching the control plane / API).

I'll attach file bmh-4-8-cannot-delete-bmh.txt

Before deleting the 2 pods, I gathered an inspect of the namespace and all container logs: stage-1-logs.tar.gz

Now, the node introspection started, but I got into the issue that I describe in comment #4. So I had to manually add the provisioning IP.

I'll attach file bmh-4-8-no-ip-on-br-ex.txt

Comment 13 Derek Higgins 2021-08-10 16:14:08 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1972753
Can you check your install-config.yaml?

If you don't need it, I think unsetting "provisioningHostIP" in your install-config should allow your workers to deploy (the external IP will be used by Ironic).
Can you confirm whether this works?

Comment 14 Andreas Karis 2021-08-10 16:50:25 UTC
Hi Derek,

This indeed does look like a duplicate. Alongside other issues that I still have with my lab, I'll have a look at the other BZ. But we can very likely close this as a dup. Thanks for the answer, I'll report back soon.
 
~~~
platform:
  baremetal:
    apiVIP: 192.168.123.20
    ingressVIP: 192.168.123.21
    provisioningNetwork: "Disabled"
    bootstrapProvisioningIP: 192.168.123.25
    clusterProvisioningIP: 192.168.123.26
~~~
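If comment 13 pans out, the fix would be dropping the static provisioning IPs from that section. A sketch of the resulting platform block (same field names as above; nothing else changed):

```yaml
platform:
  baremetal:
    apiVIP: 192.168.123.20
    ingressVIP: 192.168.123.21
    provisioningNetwork: "Disabled"
    # bootstrapProvisioningIP / clusterProvisioningIP left unset, per
    # comment 13; Ironic then uses the external IP instead.
```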

Comment 15 Andreas Karis 2021-08-10 18:57:57 UTC
Nailed it. Thanks!

*** This bug has been marked as a duplicate of bug 1972753 ***

