Version:
OCP 4.10.0-rc.1
ACM: 2.4.2
BM hardware: Dell R640 iDRAC9
BIOS Version 1.6.13
iDRAC Firmware Version 5.00.20.10

SNO spoke deployment on real BM gets stuck with:
"The installation is in progress: Cluster has hosts pending user action".

The server actually reboots into the same ISO, which doesn't disconnect automatically.

The workaround is to manually disconnect the ISO through the "virtual media" tab.
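FWIW, the same workaround can also be applied out-of-band instead of through the web UI; a sketch with remote racadm (the iDRAC address and credentials below are placeholders):

  # Disconnect the currently attached remote image (virtual media):
  racadm -r <idrac-ip> -u root -p <password> remoteimage -d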
Thank you for the bug report. Commenting on this as I've been working on resolving https://bugzilla.redhat.com/show_bug.cgi?id=2054361, which you hit earlier. Does this happen only on subsequent reinstalls, or also on "fresh" installs? I am asking because this sounds like it may be related to the Lifecycle Controller job queue already having vMedia entries. In OCP 4.10 we attempt to automatically clear the Lifecycle Controller job queue on node enrollment (however, this happens only on the first install after enrollment, not repeatedly). A useful troubleshooting step could be to remove the node from the cluster as well as the node's definition from metal3 and then enroll it again (unless this already happens in your workflow). Let me know your thoughts on this.
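For reference, the Lifecycle Controller job queue can also be inspected and cleared manually with remote racadm; a sketch (address and credentials are placeholders; JID_CLEARALL is Dell's documented "clear everything" job ID):

  # List pending Lifecycle Controller jobs:
  racadm -r <idrac-ip> -u root -p <password> jobqueue view
  # Clear the entire job queue:
  racadm -r <idrac-ip> -u root -p <password> jobqueue delete -i JID_CLEARALL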
The machine was never deployed from that particular hub, but it had an SNO running on it before. Previous attempts to deploy on this machine from that hub failed because the created agent was never approved (it had a wrong MAC address).
Thank you for the response.

I recommend also updating the BIOS to the latest version (the iDRAC looks good) and re-testing. There is a possibility that either the BIOS itself, or the mix of old BIOS and new iDRAC, is causing issues. Running a very old BIOS with a very new iDRAC is not recommended - such a configuration isn't supportable; in my experience, if we asked Dell for assistance they wouldn't even talk to us until we upgraded the BIOS.

Also - is this the same machine we worked with in https://bugzilla.redhat.com/show_bug.cgi?id=2054361?

What state is the machine in prior to the installation (booted into the ISO, not booted, booted into the on-disk image)? It may be good to chat on Slack while you have it in this state. I wonder if some component is not trying to detach vMedia at all, or trying and failing. I can have a look when you have time. I can't comment on the higher-level components, but I'm happy to look at the iDRAC/Ironic level with you like we did in the previous bug.

Thank you,
Jacob
Reproduced upon a new hub/spoke deployment.
(In reply to Jacob Anders from comment #3)
> Also - is this the same machine we worked with in
> https://bugzilla.redhat.com/show_bug.cgi?id=2054361?

Yes. Same machine as in https://bugzilla.redhat.com/show_bug.cgi?id=2054361
Observation: after reproducing the reported issue, I updated the BIOS to 2.13.3 (iDRAC firmware: 5.00.20.10), removed the BMH, InfraEnv, ClusterDeployment, etc., and re-created them all. The issue still persists.
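For completeness, the removal/re-creation was along these lines (a sketch; the resource names are assumptions based on the qe1 namespace and master-1-0 BMH used later in this bug):

  oc delete bmh master-1-0 -n qe1            # BareMetalHost
  oc delete infraenv qe1 -n qe1
  oc delete clusterdeployment qe1 -n qe1
  # then re-apply the original manifests to re-create all three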
Thank you for the update, Sasha. My understanding is that the vMedia image is removed while still attached to the server, and this causes the iDRAC to end up in a state where it can't attach or detach vMedia anymore and needs to be reset. Given we reproduced this with the latest firmware, I discussed the iDRAC problem with Dell; they confirmed they can reproduce it as well and consider it a firmware issue. There is a plan to make the iDRAC firmware more resilient here. However, I think it would be good to investigate further on our side and see if we can avoid triggering the firmware bug - at the end of the day, this happens when we remove the vMedia image while it is mounted and being accessed, which doesn't sound like the right thing to do. Do you know which component is responsible for removing the image within the provisioning flow you are using (I am less familiar with the ACM/ZTP approach)? Let me know; I can also ask around.
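In the meantime, one way to narrow down whether attach or detach is failing is to inspect and eject the vMedia directly over Redfish; a sketch assuming iDRAC9's standard resource paths (address/credentials are placeholders):

  # Check what (if anything) is currently inserted:
  curl -sk -u root:<password> \
    https://<idrac-ip>/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD | jq '{Inserted, Image}'
  # Attempt an eject, roughly as Ironic would:
  curl -sk -u root:<password> -X POST -H "Content-Type: application/json" -d '{}' \
    https://<idrac-ip>/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia

If the eject call errors out while Inserted is true, that would point at the iDRAC state issue rather than a component not attempting the detach.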
(In reply to Alexander Chuzhoy from comment #0)
> Version:
> OCP 4.10.0-rc.1
> ACM: 2.4.2
> BM hardware: Dell R640 iDRAC9
> BIOS Version 1.6.13
> iDRAC Firmware Version 5.00.20.10
>
> Spoke SNO deployment on real BM deployment gets stuck with:
> "The installation is in progress: Cluster has hosts pending user action".
>
> The server actually reboots with the same ISO that doesn't disconnect
> automatically.
>
> The workaround is to manually disconnect the ISO through the "virtual media"
> tab.

Could you please upload the logs from the agent and assisted installer? Something tells me there may be an issue when setting the boot order. What you are seeing is expected, since Ironic doesn't remove the virtual media from the BMC after deployment. Therefore the machine will keep booting into the ISO if the boot order is wrong, hence the deployment being stuck in "pending user action".
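If the host is reachable while it is still booted into the ISO, the boot-order theory is easy to check from the host itself; a sketch (the entry numbers are illustrative, matching the efibootmgr log attached later in this bug):

  efibootmgr
  # BootCurrent: 0002                 <- the entry actually booted (the Virtual CD/DVD here)
  # BootOrder: 0006,0002,0005,0004    <- 0006 (the on-disk RHEL entry) should come first

If BootOrder already lists the disk entry first but the machine still boots the ISO, something else (firmware behavior, duplicate entries) is overriding the order.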
Thank you for your insights, Flavio. This makes a lot of sense. My thinking was stuck in the rut of the previous problem we were looking at on the same server. Given the machine keeps booting from the same ISO over and over, this is clearly not an issue of vMedia attachment not working - that was a red herring, please disregard it.
Reproduced with:
ACM: quay.io/acm-d/acm-custom-registry:v2.4.2-RC5
OCP: 4.10.0-rc.6

oc describe agent 84b7944a-00af-c659-da70-3f624ba4aff5

Name:         84b7944a-00af-c659-da70-3f624ba4aff5
Namespace:    qe1
Labels:       agent-install.openshift.io/bmh=master-1-0
              infraenvs.agent-install.openshift.io=qe1
Annotations:  <none>
API Version:  agent-install.openshift.io/v1beta1
Kind:         Agent
Metadata:
  Creation Timestamp:  2022-03-04T21:46:54Z
  Finalizers:
    agent.agent-install.openshift.io/ai-deprovision
  Generation:  2
  Managed Fields:
    API Version:  agent-install.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:debugInfo:
          f:state:
          f:stateInfo:
        f:inventory:
          f:disks:
        f:progress:
          f:currentStage:
          f:stageStartTime:
          f:stageUpdateTime:
    Manager:      assisted-service
    Operation:    Update
    Subresource:  status
    Time:         2022-03-04T21:54:23Z
  Resource Version:  622991
  UID:               f58b90ce-6550-4293-8984-f3d6179bc21a
Spec:
  Approved:  true
  Cluster Deployment Name:
    Name:       qe1
    Namespace:  qe1
  Hostname:              master-1-0
  installation_disk_id:  /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
  Role:                  master
Status:
  Bootstrap:  true
  Conditions:
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The Spec has been successfully applied
    Reason:                SyncOK
    Status:                True
    Type:                  SpecSynced
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The agent's connection to the installation service is unimpaired
    Reason:                AgentIsConnected
    Status:                True
    Type:                  Connected
    Last Transition Time:  2022-03-04T21:48:05Z
    Message:               Installation already started and is in progress
    Reason:                AgentAlreadyInstalling
    Status:                True
    Type:                  RequirementsMet
    Last Transition Time:  2022-03-04T21:48:05Z
    Message:               The agent's validations are passing
    Reason:                ValidationsPassing
    Status:                True
    Type:                  Validated
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The installation is in progress: Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk PERC_H330_Mini 64cd98f04fde0e00246884800d3f8b52 (sda, /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52)
    Reason:                InstallationInProgress
    Status:                False
    Type:                  Installed
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The agent is bound to a cluster deployment
    Reason:                Bound
    Status:                True
    Type:                  Bound
  Debug Info:
    Events URL:  https://assisted-service-rhacm.apps.rhos-qe.e2e.bos.redhat.com/api/assisted-install/v1/clusters/2eed5655-9860-4c9b-8b45-ad975cfabf85/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1NTYxNGMzZS1hN2ExLTQ1MjktYWYxZi1mZGIyYjNmNjA4ZjAifQ.ud3i-fooKRDYGfHtQa-l7n539OMiSF7p3Az5l4lXps2jEZSBV8goEYhRJXOsRnhtq_vZdHy_LOJmhZ9AvHahEQ&host_id=84b7944a-00af-c659-da70-3f624ba4aff5
    State:       installing-pending-user-action
    State Info:  Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk PERC_H330_Mini 64cd98f04fde0e00246884800d3f8b52 (sda, /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52)
  Inventory:
    Bmc Address:   10.19.133.14
    bmcV6Address:  2620:52:0:1381:4ed9:8fff:fe2f:30aa
    Boot:
      Current Boot Mode:  uefi
    Cpu:
      Architecture:     x86_64
      Clock Megahertz:  1000
      Count:            64
      Flags:            fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
      Model Name:       Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
    Disks:
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e48288375051
      By Path:     /dev/disk/by-path/pci-0000:86:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e48288375051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme0n1
      Path:        /dev/nvme0n1
      Serial:      BTLN903303DD1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme0n1"],"exit_status":0},"device":{"name":"/dev/nvme0n1","info_name":"/dev/nvme0n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN903303DD1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":560631054592}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar 4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":30,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":88059,"data_units_written":117579,"host_reads":2495327,"host_writes":1673837,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20092,"unsafe_shutdowns":889,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":30},"power_cycle_count":1135,"power_on_time":{"hours":20092}}
      Wwn:         eui.01000000010000005cd2e48288375051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e4db2b305051
      By Path:     /dev/disk/by-path/pci-0000:87:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e4db2b305051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme1n1
      Path:        /dev/nvme1n1
      Serial:      BTLN852300W71P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme1n1"],"exit_status":0},"device":{"name":"/dev/nvme1n1","info_name":"/dev/nvme1n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN852300W71P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":941322404096}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar 4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":32,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":75605,"data_units_written":26254,"host_reads":2263511,"host_writes":584940,"controller_busy_time":0,"power_cycles":1136,"power_on_hours":20093,"unsafe_shutdowns":53,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":32},"power_cycle_count":1136,"power_on_time":{"hours":20093}}
      Wwn:         eui.01000000010000005cd2e4db2b305051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e41ea0265051
      By Path:     /dev/disk/by-path/pci-0000:88:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e41ea0265051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme2n1
      Path:        /dev/nvme2n1
      Serial:      BTLN8500042S1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme2n1"],"exit_status":0},"device":{"name":"/dev/nvme2n1","info_name":"/dev/nvme2n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN8500042S1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":131535864064}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar 4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":30,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":76057,"data_units_written":34034,"host_reads":2270667,"host_writes":835654,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20093,"unsafe_shutdowns":16,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":30},"power_cycle_count":1135,"power_on_time":{"hours":20093}}
      Wwn:         eui.01000000010000005cd2e41ea0265051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e4b42b305051
      By Path:     /dev/disk/by-path/pci-0000:89:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e4b42b305051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme3n1
      Path:        /dev/nvme3n1
      Serial:      BTLN852300VV1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme3n1"],"exit_status":0},"device":{"name":"/dev/nvme3n1","info_name":"/dev/nvme3n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN852300VV1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":773818679552}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar 4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":31,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":76182,"data_units_written":63509,"host_reads":2277724,"host_writes":1249048,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20092,"unsafe_shutdowns":27,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":31},"power_cycle_count":1135,"power_on_time":{"hours":20092}}
      Wwn:         eui.01000000010000005cd2e4b42b305051
      Bootable:    true
      By ID:       /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
      By Path:     /dev/disk/by-path/pci-0000:18:00.0-scsi-0:2:0:0
      Drive Type:  HDD
      Hctl:        1:2:0:0
      Id:          /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       PERC_H330_Mini
      Name:        sda
      Path:        /dev/sda
      Serial:      64cd98f04fde0e00246884800d3f8b52
      Size Bytes:  479559942144
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/sda"],"messages":[{"string":"Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'","severity":"error"}],"exit_status":2}}
      Vendor:      DELL
      Wwn:         0x64cd98f04fde0e00246884800d3f8b52
      By Path:     /dev/disk/by-path/pci-0000:00:14.0-usb-0:14.4.2:1.0-scsi-0:0:0:0
      Drive Type:  ODD
      Hctl:        0:0:0:0
      Id:          /dev/disk/by-path/pci-0000:00:14.0-usb-0:14.4.2:1.0-scsi-0:0:0:0
      Installation Eligibility:
        Not Eligible Reasons:
          Disk is removable
          Disk is too small (disk only has 107 MB, but 120 GB are required)
          Drive type is ODD, it must be one of HDD, SSD.
      Io Perf:
      Model:       Virtual_CD/DVD
      Name:        sr0
      Path:        /dev/sr0
      Serial:      1028_123456
      Size Bytes:  106727424
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/sr0"],"exit_status":4},"device":{"name":"/dev/sr0","info_name":"/dev/sr0","type":"scsi","protocol":"SCSI"},"vendor":"Linux","product":"Virtual CD/DVD","model_name":"Linux Virtual CD/DVD","revision":"0001","user_capacity":{"blocks":52113,"bytes":106727424},"logical_block_size":2048,"device_type":{"scsi_value":5,"name":"CD/DVD"},"local_time":{"time_t":1646430755,"asctime":"Fri Mar 4 21:52:35 2022 UTC"},"temperature":{"current":0,"drive_trip":0}}
      Vendor:      Linux
    Hostname:  api.qe1.kni.lab.eng.bos.redhat.com
    Interfaces:
      Bios Dev Name:  em1
      Flags:          up broadcast multicast
      Has Carrier:    true
      ipV4Addresses:
      ipV6Addresses:
      Mac Address:    98:03:9b:61:7c:80
      Mtu:            1500
      Name:           eno1
      Product:        0x1015
      Speed Mbps:     25000
      Vendor:         0x15b3
      Bios Dev Name:  em2
      Flags:          up broadcast multicast
      Has Carrier:    true
      ipV4Addresses:
        10.19.134.13/25
        10.19.134.15/25
      ipV6Addresses:
      Mac Address:    98:03:9b:61:7c:81
      Mtu:            1500
      Name:           eno2
      Product:        0x1015
      Speed Mbps:     25000
      Vendor:         0x15b3
    Memory:
      Physical Bytes:  206158430208
      Usable Bytes:    201228070912
    System Vendor:
      Manufacturer:   Dell Inc.
      Product Name:   PowerEdge R640
      Serial Number:  176P2W2
    Ntp Sources:
      Source Name:   ntp.xtom.com
      Source State:  unreachable
      Source Name:   38.229.54.9
      Source State:  unreachable
      Source Name:   t1.time.bf1.yahoo.com
      Source State:  unreachable
      Source Name:   104.171.113.34
      Source State:  unreachable
      Source Name:   gopher.fart.website
      Source State:  unreachable
      Source Name:   li1.forfun.net
      Source State:  unreachable
      Source Name:   2601:603:b7f:fec0:fec0:b7f:603:2601
      Source State:  unreachable
      Source Name:   mci.clearnet.pw
      Source State:  unreachable
  Progress:
    Current Stage:      Rebooting
    Stage Start Time:   2022-03-04T21:54:23Z
    Stage Update Time:  2022-03-04T21:54:23Z
  Role:  master
Events:  <none>
Without the agent and assisted installer logs we can't really tell what went wrong here.
@sasha I see https://bugzilla.redhat.com/show_bug.cgi?id=1975848#c22. Since this seems to reproduce, can you get these logs from the host prior to the reboot?

This can be done by:
1. SSH to the host during the installation.
2. When prompted with the shutdown message (e.g. "Installation complete, this host will shutdown in..."), cancel the shutdown by typing: shutdown -c
3. Get the logs:
   a. sudo journalctl TAG=agent
   b. sudo journalctl -u agent.service
   c. sudo journalctl TAG=installer

You can resume the installation after collecting the logs by typing:
shutdown -r +1 "Done collecting logs ;-), server is going to reboot."
assisted-installer log setting EFI boot:

Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Setting efibootmgr to boot from disk"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Using EFI file 'shimx64.efi' for GOARCH 'amd64'"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="efibootmgr: ** Warning ** : Boot0005 has same label Red Hat Enterprise Linux\n"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="BootCurrent: 0002\nBootOrder: 0006,0002,0005,0004\nBoot0000* FlexBoot v3.5.504 (PCI 19:00.0)\tBBS(128,FlexBoot v3.5.504 (PCI 19:00.0),0x0)................`...........a.........................................................A..................\u007f...F.l.e.x.B.o.o.t. .v.3...5...5.0.4. .(.P.C.I. .1.9.:.0.0...0.)...\nBoot0001* FlexBoot v3.5.504 (PCI 19:00.1)\tBBS(128,FlexBoot v3.5.504 (PCI 19:00.1),0x0)................`...........a.........................................................A..................\u007f...F.l.e.x.B.o.o.t. .v.3...5...5.0.4. .(.P.C.I. .1.9.:.0.0...1.)...\n"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Boot0002* Virtual CD/DVD\tPciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(3,0)/USB(1,0)\nBoot0003* Hard drive C:\tVenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)............................f.........................................................A..................\u007f...P.E.R.C. .H.3.3.0. .M.i.n.i.(.b.u.s. .1.8. .d.e.v. .0.0.)...\nBoot0004* Integrated NIC 1 Port 1 Partition 1\tVenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)\nBoot0005* Red Hat Enterprise Linux\tHD(2,GPT,ed26d305-052e-4148-9b44-05357053742a,0x1000,0x3f800)/File(\\EFI\\redhat\\shimx64.efi)\nBoot0006* Red Hat Enterprise Linux\tHD(2,GPT,687cb1a3-b974-438a-ab6a-0eae099cfcd2,0x1000,0x3f800)/File(\\EFI\\redhat\\shimx64.efi)\n"
There is a warning in the efibootmgr log about another device with the same label as our boot disk: https://github.com/rhboot/efibootmgr/blob/103aa22ece98f09fe3ea2a0c83988f0ee2d0e5a8/src/efibootmgr.c#L228. Perhaps this duplication causes a conflict at boot time, resulting in a boot from the wrong device. fpercoco, yshnaidm, otuchfel - any thoughts?
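For anyone checking for this condition on their own host, the duplicate labels are easy to spot from the live system (the entries below match the log above):

  efibootmgr -v | grep "Red Hat Enterprise Linux"
  # Boot0005* Red Hat Enterprise Linux  HD(2,GPT,ed26d305-052e-4148-9b44-05357053742a,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
  # Boot0006* Red Hat Enterprise Linux  HD(2,GPT,687cb1a3-b974-438a-ab6a-0eae099cfcd2,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)

Two entries with the same label but different partition GUIDs means a stale entry from a previous installation is still present.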
Changing severity to high because this is encountered on multiple servers with multiple releases and requires manual intervention during ZTP.
I'm not sure why there are two active boot entries with the "Red Hat Enterprise Linux" label on my server. These servers are test machines that we switch between different OCP releases on a regular basis. So far I have seen this issue when installing 4.9, 4.10 and 4.11.

Boot0014* Red Hat Enterprise Linux HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0018* Red Hat Enterprise Linux HD(2,GPT,ed26d305-052e-4148-9b44-05357053742a,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
I removed the additional boot entries with the "Red Hat Enterprise Linux" label that are not "current", redeployed my cluster, and the issue was not encountered. I will keep an eye on other clusters to confirm whether it is caused by the multiple active "Red Hat Enterprise Linux" boot entries.
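For the record, this is roughly how stale entries like these can be removed from the live system (a sketch; the bootnum value is from the listing above and will differ per machine):

  # Show current state; note BootCurrent so you don't delete the active entry:
  efibootmgr
  # Delete a stale entry by its bootnum (here 0014):
  efibootmgr -b 0014 -B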
My second server also gets the correct boot order (HD first, CD second) on a fresh ZTP deployment, after I removed all the extra boot entries AND disabled the "Hard-disk Drive Placeholder" option in the iDRAC BIOS settings.
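In case it helps others reproduce: the same BIOS toggle can presumably be flipped out-of-band with racadm. A sketch assuming the BIOS.BiosBootSettings.HddPlaceholder attribute name (please verify the attribute on your hardware generation; address/credentials are placeholders):

  racadm -r <idrac-ip> -u root -p <password> set BIOS.BiosBootSettings.HddPlaceholder Disabled
  # BIOS attribute changes are staged; create a config job and power-cycle to apply:
  racadm -r <idrac-ip> -u root -p <password> jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW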
I reproduced this with:
OCP version: 4.10.18
multicluster-engine.v2.0.0
iDRAC9 Firmware Version 5.10.10.00

Moving back to ASSIGNED.
I'm actually seeing that cleaning the boot entries doesn't help. I opened a new bug against Ironic with the original title of this bug ("The virtualmedia doesn't disconnect the ISO during spoke deployment after writing the image to the disk"): https://bugzilla.redhat.com/show_bug.cgi?id=2100501
Tested with:
HUB: 4.11.0-0.nightly-2022-08-24-091058
multicluster-engine.v2.1.0
SPOKE: 4.11.2

Successfully deployed SNO spoke on real BM:

oc get agentclusterinstall qe1 -o json | jq ".status.conditions[-3].message" -r
The installation has completed: Cluster is installed

BM hardware: Dell R640 iDRAC9
BIOS Version 1.6.13
Firmware: 5.10.10.00
Although the deployment passed, I noticed that the virtual media is still mounted on the BM node. Is that expected?
Yes, it is expected in the regular ZTP flow. In the converged flow it will be unmounted, but that is not the default right now.
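For anyone wanting to confirm the mount state on their own nodes, it is quick to check out-of-band; a sketch with remote racadm (address/credentials are placeholders):

  # Show the remote image (virtual media) attach status:
  racadm -r <idrac-ip> -u root -p <password> remoteimage -s
  # If an image is still attached and you want it gone, 'remoteimage -d' detaches it.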
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days