Description of problem:

Introspection fails with: Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://localhost:6181/images/ironic-python-agent.kernel failed

This exact same environment, with the exact same deployment steps, works well in OCP 4.6 and OCP 4.7. The only moving part is the OCP 4.8 version.

~~~
[root@openshift-jumpserver-0 ~]# oc get machines -A
NAMESPACE               NAME                               PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ipi-cluster-kmmz4-master-0         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-master-1         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-master-2         Running                               97m
openshift-machine-api   ipi-cluster-kmmz4-worker-0-4wfbd   Provisioning                          81m
openshift-machine-api   ipi-cluster-kmmz4-worker-0-klzsj   Provisioning                          81m
[root@openshift-jumpserver-0 ~]# oc get bmj
error: the server doesn't have a resource type "bmj"
[root@openshift-jumpserver-0 ~]# oc get bmh
No resources found in default namespace.
[root@openshift-jumpserver-0 ~]# oc get bmh -A
NAMESPACE               NAME                 STATE                    CONSUMER                     ONLINE   ERROR
openshift-machine-api   openshift-master-0   externally provisioned   ipi-cluster-kmmz4-master-0   true
openshift-machine-api   openshift-master-1   externally provisioned   ipi-cluster-kmmz4-master-1   true
openshift-machine-api   openshift-master-2   externally provisioned   ipi-cluster-kmmz4-master-2   true
openshift-machine-api   openshift-worker-0   inspecting                                            true     inspection error
openshift-machine-api   openshift-worker-1   inspecting                                            true     inspection error
openshift-machine-api   openshift-worker-2   inspecting                                            true     inspection error
[root@openshift-jumpserver-0 ~]# oc get bmh -n openshift-machine-api openshift-worker-0 -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  creationTimestamp: "2021-08-09T11:14:12Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  name: openshift-worker-0
  namespace: openshift-machine-api
  resourceVersion: "18325"
  uid: 09cd8ab4-584a-4d87-88a2-0de2d729f4bf
spec:
  automatedCleaningMode: metadata
  bmc:
    address: ipmi://dell-r430-30-rc.mgmt.cee.ral3.lab.eng.rdu2.redhat.com
    credentialsName: openshift-worker-0-bmc-secret
  bootMACAddress: 18:66:da:9f:b1:0d
  hardwareProfile: unknown
  online: true
status:
  errorCount: 1
  errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection: Validation
    of image href http://localhost:6181/images/ironic-python-agent.kernel failed,
    reason: HTTPConnectionPool(host=''localhost'', port=6181): Max retries exceeded
    with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError(''<urllib3.connection.HTTPConnection
    object at 0x7fd12a844320>: Failed to establish a new connection: [Errno 111]
    ECONNREFUSED'',))'
  errorType: inspection error
  goodCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "17473"
  hardwareProfile: ""
  lastUpdated: "2021-08-09T11:39:30Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: "2021-08-09T11:39:04Z"
    provision:
      end: null
      start: null
    register:
      end: "2021-08-09T11:39:04Z"
      start: "2021-08-09T11:37:06Z"
  operationalStatus: error
  poweredOn: false
  provisioning:
    ID: 5ae5af65-e429-41ed-835b-f1808b4100c7
    bootMode: UEFI
    image:
      url: ""
    state: inspecting
  triedCredentials:
    credentials:
      name: openshift-worker-0-bmc-secret
      namespace: openshift-machine-api
    credentialsVersion: "17473"
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
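For anyone triaging the same symptom: the inspection error can be pulled straight out of the BMH status without dumping the whole object. A minimal sketch, using the field names visible in the YAML above:

~~~
# Print only the error type and message for one host; the jsonpath fields
# match the BMH status shown above.
oc get bmh -n openshift-machine-api openshift-worker-0 \
    -o jsonpath='{.status.errorType}{": "}{.status.errorMessage}{"\n"}'
~~~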
Looks like a general issue connecting to 'localhost' when 'localhost' is used in the deploy-kernel path:

~~~
driver_info': {'deploy_kernel': 'http://localhost:6181/images/ironic-python-agent.kernel', 'deploy_ramdisk': 'http://localhost:6181/images/ironic-python-agent.initramfs'

reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd12a7cd0b8>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
~~~
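To confirm the refusal comes from inside the metal3 pod itself, something along these lines should reproduce it. This is a sketch: the pod and container names are the ones from this cluster (shown in the next comment), and it assumes curl is present in the ironic image:

~~~
# From the ironic conductor container, request the same href that inspection
# validates; ECONNREFUSED here would match the BMH errorMessage.
oc -n openshift-machine-api exec metal3-6f7cc79d4d-8zc4d \
    -c metal3-ironic-conductor -- \
    curl -sI http://localhost:6181/images/ironic-python-agent.kernel
~~~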
I reinstalled my cluster just to see if this happens again.

On the ILO, I see a connection timeout to 192.168.123.26:6180 on image fetch. It turns out that the provisioning IP is gone. That IP is only set on init of the metal3 container:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP                NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-558c76fc6-lfprm   2/2     Running   4          95m   172.25.0.14       openshift-master-0   <none>           <none>
cluster-baremetal-operator-c4db8ddc6-v8btz    2/2     Running   2          95m   172.24.0.9        openshift-master-2   <none>           <none>
machine-api-controllers-f8995d59f-jb2dj       7/7     Running   3          78m   172.25.0.25       openshift-master-0   <none>           <none>
machine-api-operator-5b9bbcb7d7-q7jkm         2/2     Running   3          95m   172.25.0.7        openshift-master-0   <none>           <none>
metal3-6f7cc79d4d-8zc4d                       8/8     Running   0          77m   192.168.123.201   openshift-master-1   <none>           <none>
metal3-image-cache-jktgr                      1/1     Running   0          77m   192.168.123.200   openshift-master-0   <none>           <none>
metal3-image-cache-w2gw8                      1/1     Running   0          77m   192.168.123.202   openshift-master-2   <none>           <none>
metal3-image-cache-xnxzv                      1/1     Running   0          77m   192.168.123.201   openshift-master-1   <none>           <none>
[root@openshift-jumpserver-0 ~]# oc logs metal3-6f7cc79d4d-8zc4d -c metal3-static-ip-set
+ '[' -z 192.168.123.26/24 ']'
+ '[' -z '' ']'
++ echo 192.168.123.26/24
++ cut -d/ -f1
+ IP_ONLY=192.168.123.26
++ ip -j addr
++ jq -r -c '.[].addr_info[] | select(.local == "192.168.123.26") | .label'
+ PROVISIONING_INTERFACE=
+ '[' -z '' ']'
++ ip -j route get 192.168.123.26
++ jq -r '.[] | select(.dev != "lo") | .dev'
+ PROVISIONING_INTERFACE=br-ex
+ '[' -z br-ex ']'
+ /usr/sbin/ip addr add 192.168.123.26/24 dev br-ex valid_lft 300 preferred_lft 300
~~~

For some reason (late network setup, network restart???) the IP is gone though:

~~~
[root@openshift-jumpserver-0 ~]# ssh core@openshift-master-1 sudo ip a | grep 192.168.123
    inet 192.168.123.201/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
[root@openshift-jumpserver-0 ~]#
~~~

In OCP 4.7, we had that same init container metal3-static-ip-set which ran `/set-static-ip`, but there was also a regular container metal3-static-ip-manager which ran `/refresh-static-ip` constantly and fixed the IP addresses. In OCP 4.8, that container is gone.

OCP 4.7:
~~~
[root@openshift-jumpserver-0 ~]# oc logs metal3-64d647dff6-zcklt
error: a container name must be specified for pod metal3-64d647dff6-zcklt, choose one of: [metal3-baremetal-operator metal3-mariadb metal3-httpd metal3-ironic-conductor ironic-inspector-ramdisk-logs metal3-ironic-api ironic-deploy-ramdisk-logs metal3-ironic-inspector metal3-static-ip-manager] or one of the init containers: [metal3-ipa-downloader metal3-machine-os-downloader metal3-static-ip-set]
~~~

OCP 4.8:
~~~
[root@openshift-jumpserver-0 ~]# oc logs metal3-6f7cc79d4d-8zc4d
error: a container name must be specified for pod metal3-6f7cc79d4d-8zc4d, choose one of: [metal3-baremetal-operator metal3-mariadb metal3-httpd metal3-ironic-conductor ironic-inspector-ramdisk-logs metal3-ironic-api ironic-deploy-ramdisk-logs metal3-ironic-inspector] or one of the init containers: [metal3-ipa-downloader metal3-machine-os-downloader metal3-static-ip-set]
~~~

In OCP 4.7:
~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o yaml | grep static-ip-manager -C10
(...)
    - command:
      - /refresh-static-ip
      env:
      - name: PROVISIONING_IP
        value: 192.168.123.26/24
      - name: PROVISIONING_INTERFACE
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e14dffc18d80f4fe247209105c46a24a3db1319cf056f847d705e53eccf3a57
      imagePullPolicy: IfNotPresent
      name: metal3-static-ip-manager
      resources: {}
      securityContext:
        privileged: true
(...)
[root@openshift-jumpserver-0 ~]# oc logs -c metal3-static-ip-manager metal3-64d647dff6-zcklt | head -n 30
+ '[' -z 192.168.123.26/24 ']'
+ '[' -z '' ']'
++ echo 192.168.123.26/24
++ cut -d/ -f1
+ IP_ONLY=192.168.123.26
++ ip -j addr
++ jq -r -c '.[].addr_info[] | select(.local == "192.168.123.26") | .label'
+ PROVISIONING_INTERFACE=br-ex
+ '[' -z br-ex ']'
+ '[' -z br-ex ']'
+ /usr/sbin/ip addr add 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
RTNETLINK answers: File exists
+ true
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
+ grep -q ' fe80::'
+ /usr/sbin/ip addr change 192.168.123.26/24 dev br-ex valid_lft 10 preferred_lft 10
+ sleep 5
+ true
+ ip -o addr show dev br-ex scope link
~~~

so that when I deleted the IP, it was actually added back right away:

~~~
[root@openshift-master-2 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:f1:e7:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.202/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 76444sec preferred_lft 76444sec
    inet 192.168.123.20/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fc00::b10:4666:152:932d/64 scope global dynamic noprefixroute
       valid_lft 86387sec preferred_lft 14387sec
    inet6 fe80::f830:cc0:ac03:9aa4/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@openshift-master-2 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:f1:e7:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.202/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 76441sec preferred_lft 76441sec
    inet 192.168.123.20/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet 192.168.123.26/24 scope global secondary dynamic br-ex
       valid_lft 10sec preferred_lft 10sec
    inet6 fc00::b10:4666:152:932d/64 scope global dynamic noprefixroute
       valid_lft 86384sec preferred_lft 14384sec
    inet6 fe80::f830:cc0:ac03:9aa4/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@openshift-master-2 ~]#
~~~
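Judging from the xtrace above, the 4.7 manager loop reduces to roughly the following. This is a reconstruction from the log output, not the actual /refresh-static-ip source; the fe80:: link-local check visible in the trace is elided:

~~~
# Reconstructed from the xtrace above: keep re-asserting the provisioning IP
# with a short lifetime, so the address ages out if the manager ever stops.
IP="192.168.123.26/24"
DEV="br-ex"
/usr/sbin/ip addr add "$IP" dev "$DEV" valid_lft 10 preferred_lft 10 || true
while true; do
    # "change" refreshes valid_lft/preferred_lft on the existing address
    /usr/sbin/ip addr change "$IP" dev "$DEV" valid_lft 10 preferred_lft 10
    sleep 5
done
~~~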
So on the second deploy attempt, I had to manually fix this with:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP                NODE                 NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-558c76fc6-lfprm   2/2     Running   4          110m   172.25.0.14       openshift-master-0   <none>           <none>
cluster-baremetal-operator-c4db8ddc6-v8btz    2/2     Running   2          110m   172.24.0.9        openshift-master-2   <none>           <none>
machine-api-controllers-f8995d59f-jb2dj       7/7     Running   3          93m    172.25.0.25       openshift-master-0   <none>           <none>
machine-api-operator-5b9bbcb7d7-q7jkm         2/2     Running   3          110m   172.25.0.7        openshift-master-0   <none>           <none>
metal3-6f7cc79d4d-8zc4d                       8/8     Running   0          92m    192.168.123.201   openshift-master-1   <none>           <none>
metal3-image-cache-jktgr                      1/1     Running   0          92m    192.168.123.200   openshift-master-0   <none>           <none>
metal3-image-cache-w2gw8                      1/1     Running   0          92m    192.168.123.202   openshift-master-2   <none>           <none>
metal3-image-cache-xnxzv                      1/1     Running   0          92m    192.168.123.201   openshift-master-1   <none>           <none>
[root@openshift-jumpserver-0 ~]# ssh core@openshift-master-1
Red Hat Enterprise Linux CoreOS 48.84.202107271439-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
Last login: Mon Aug  9 17:56:46 2021 from 192.168.123.1
[core@openshift-master-1 ~]$ sudo -i
[root@openshift-master-1 ~]# ip a a dev br-ex 192.168.123.26/24
[root@openshift-master-1 ~]#
[root@openshift-master-1 ~]#
[root@openshift-master-1 ~]# ip a ls dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:91:0a:d4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.201/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 80454sec preferred_lft 80454sec
    inet 192.168.123.26/24 scope global secondary br-ex
       valid_lft forever preferred_lft forever
    inet6 fc00::6141:56f0:30a8:1576/64 scope global dynamic noprefixroute
       valid_lft 86376sec preferred_lft 14376sec
    inet6 fe80::15d:2faf:87fc:86b5/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@openshift-master-1 ~]#
~~~
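If this bites anyone else: the manual fix boils down to finding the node that hosts the main metal3 pod and re-adding the provisioning IP there. A sketch under the assumptions of my cluster (IP 192.168.123.26/24 is my clusterProvisioningIP, the bridge is br-ex, and the awk pattern just parses the pod listing above):

~~~
# Find the node running the main metal3 pod (skip the image-cache pods),
# then re-add the provisioning IP on br-ex there.
NODE=$(oc -n openshift-machine-api get pods -o wide \
       | awk '/^metal3-/ && !/image-cache/ {print $7}')
ssh "core@${NODE}" sudo ip addr add 192.168.123.26/24 dev br-ex
~~~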
Comment 0 and the issue starting in comment 4 are likely 2 different issues, on the same cluster, in 2 different installation runs.

After I fixed the second run with the workaround from comment #5, I had to extract the bmh objects and secrets (oc get -o yaml bmh, oc get -o yaml secret for each worker), delete the bmh objects, and then recreate the bmh objects and secrets to trigger a new introspection and deployment (a sketch of that procedure is below). I will start a 3rd run soon and will report back.
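A sketch of what I did per worker; the file names are just placeholders, and you may need to strip status/resourceVersion fields from the exports before re-applying:

~~~
# Per worker: save the BMH and its BMC secret, delete the BMH, then recreate
# both to trigger a fresh introspection.
oc -n openshift-machine-api get bmh openshift-worker-0 -o yaml > worker-0-bmh.yaml
oc -n openshift-machine-api get secret openshift-worker-0-bmc-secret -o yaml > worker-0-secret.yaml
oc -n openshift-machine-api delete bmh openshift-worker-0
oc apply -f worker-0-secret.yaml -f worker-0-bmh.yaml
~~~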
I redeployed my cluster and got back into the same state as in comment #0. I'll attach file bmh-4-8-install-issue.txt with the log output of my install script and my troubleshooting.

I then tried to delete one of the failed bmh objects --- this did not work; the object was stuck in deleting until I deleted the metal3 pod and the cluster-baremetal-operator pod (the cluster-baremetal-operator pod had issues reaching the control plane / API). I'll attach file bmh-4-8-cannot-delete-bmh.txt

Before deleting the 2 pods, I gathered an inspect of the namespace and all container logs: stage-1-logs.tar.gz

Now the node introspection started, but I ran into the issue that I describe in comment #4, so I had to manually add the provisioning IP. I'll attach file bmh-4-8-no-ip-on-br-ex.txt
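For reference, unsticking the bmh deletion came down to bouncing these two pods (names from my cluster; both are managed by deployments, so replacements come back automatically):

~~~
# Deleting these pods let the stuck BMH finalizer complete.
oc -n openshift-machine-api delete pod metal3-6f7cc79d4d-8zc4d
oc -n openshift-machine-api delete pod cluster-baremetal-operator-c4db8ddc6-v8btz
~~~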
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1972753

Can you check your install-config.yaml? If you don't need it, I think unsetting "provisioningHostIP" in your install-config should allow your workers to deploy (the external IP will be used by ironic).

Can you confirm if this works?
Hi Derek,

This indeed does look like a duplicate. Alongside other issues that I still have with my lab, I'll have a look at the other BZ. But we can very likely close this as a dup. Thanks for the answer; I'll report back soon.

~~~
platform:
  baremetal:
    apiVIP: 192.168.123.20
    ingressVIP: 192.168.123.21
    provisioningNetwork: "Disabled"
    bootstrapProvisioningIP: 192.168.123.25
    clusterProvisioningIP: 192.168.123.26
~~~
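For completeness, a quick way to check which of these fields are set before regenerating manifests (field names as in my install-config above):

~~~
# Show the provisioning-related keys currently set; with
# provisioningNetwork: "Disabled" the *ProvisioningIP keys can be dropped.
grep -nE 'provisioningNetwork|ProvisioningIP' install-config.yaml
~~~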
Nailed it. Thanks!

*** This bug has been marked as a duplicate of bug 1972753 ***