Bug 2055978 - host fail to boot from installation disk in case of multiple active "Red Hat Enterprise Linux" boot entries
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: rhacm-2.6
Assignee: Igal Tsoiref
QA Contact:
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-18 05:19 UTC by Alexander Chuzhoy
Modified: 2023-09-18 04:32 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-03 20:20:49 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-installer pull 477 0 None open Bug 2055978: host fail to boot from installation disk in case of multiple active "Red Hat Enterprise Linux" boot entries 2022-06-01 09:29:09 UTC
Github stolostron backlog issues 20070 0 None None None 2022-02-18 07:25:05 UTC
Red Hat Issue Tracker MGMTBUGSM-101 0 None None None 2022-02-18 05:46:02 UTC

Description Alexander Chuzhoy 2022-02-18 05:19:20 UTC
Version:
OCP 4.10.0-rc.1
ACM: 2.4.2
BM hardware: Dell R640 iDRAC9 
BIOS Version	1.6.13
iDRAC Firmware Version	5.00.20.10


Spoke SNO deployment on real BM hardware gets stuck with:
  "The installation is in progress: Cluster has hosts pending user action".


The server reboots into the same ISO, which does not get disconnected automatically.


The workaround is to manually disconnect the ISO through the "virtual media" tab.
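The same workaround can be scripted against the iDRAC's Redfish API instead of the web UI. Below is a minimal sketch; the manager ID `iDRAC.Embedded.1` and device name `CD` are assumptions based on typical iDRAC9 layouts - verify them under `/redfish/v1/Managers` on your own BMC. (The BMC address used here is the one reported in this bug's inventory.)

```python
import json

def eject_media_request(bmc, manager="iDRAC.Embedded.1", device="CD"):
    """Build the Redfish VirtualMedia.EjectMedia action URL and payload.

    manager/device are assumed resource names; check /redfish/v1/Managers
    on the BMC to confirm them. EjectMedia takes an empty JSON body.
    """
    url = (f"https://{bmc}/redfish/v1/Managers/{manager}"
           f"/VirtualMedia/{device}/Actions/VirtualMedia.EjectMedia")
    return url, json.dumps({}).encode()

# BMC address from the agent inventory in this bug.
url, payload = eject_media_request("10.19.133.14")
print(url)
```

POSTing `payload` to `url` with the BMC credentials (e.g. `curl -k -u user:pass -X POST -H 'Content-Type: application/json' -d '{}' <url>`) should detach the ISO without touching the web UI.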

Comment 1 Jacob Anders 2022-02-18 10:26:54 UTC
Thank you for the bug report.

Commenting on this as I've been working on resolving https://bugzilla.redhat.com/show_bug.cgi?id=2054361 which you hit earlier.

Does this happen only on subsequent reinstalls, or also on "fresh" installs?

I am asking because this sounds like it may be related to Lifecycle Controller job queue already having vMedia entries.

In OCP 4.10 we attempt to automatically clear Lifecycle Controller queue on node enroll.

(However, this does not happen repeatedly - only on the first install after enroll.)

A useful troubleshooting step could be to remove the node from the cluster as well as the node's definition from metal3, and then enroll it again (unless this already happens in your workflow).

Let me know your thoughts on this.

Comment 2 Alexander Chuzhoy 2022-02-18 19:57:18 UTC
The machine was never deployed on that particular HUB, but it had an SNO running on it before.
Previous attempts to deploy on this machine from that hub failed as the created agent was never approved (had a wrong mac address).

Comment 3 Jacob Anders 2022-02-21 06:05:24 UTC
Thank you for the response.

I recommend also updating the BIOS to the latest version (the iDRAC looks good) and re-testing. It's possible that either the BIOS itself or the mix of an old BIOS with a new iDRAC is causing issues. Running a very old BIOS with a very new iDRAC is not recommended - such a configuration isn't supportable. In my experience, if we asked Dell for assistance they wouldn't even talk to us until we upgraded the BIOS.

Also - is this the same machine we worked with in https://bugzilla.redhat.com/show_bug.cgi?id=2054361?

What state is the machine in prior to the installation (booted into ISO, not booted, booted into on-disk image)? It may be good to chat on slack while you have it in this state. I wonder if some component is not trying to detach vMedia at all, or trying and failing. I can have a look when you have time. I can't comment on the higher-level components but happy to look at the iDRAC/Ironic level with you like we did in the previous bug.

Thank you,
Jacob

Comment 4 Alexander Chuzhoy 2022-02-22 20:30:42 UTC
Reproduced upon a new hub/spoke deployment.

Comment 5 Alexander Chuzhoy 2022-02-22 20:32:14 UTC
(In reply to Jacob Anders from comment #3)
> Thank you for the response.
> 
> I recommend also updating the BIOS to latest (iDRAC looks good) and
> re-testing . There is a possibility  that either the BIOS itself or a mix of
> old BIOS and new iDRAC are causing some issues. Running very old BIOS and
> very new iDRAC is not recommended - such configuration isn't supportable, in
> my experience if we asked for assistance from Dell they wouldn't even talk
> to us till we upgrade the BIOS.
> 
> Also - is this the same machine we worked with in
> https://bugzilla.redhat.com/show_bug.cgi?id=2054361?
> 
> What state is the machine in prior to the installation (booted into ISO, not
> booted, booted into on-disk image)? It may be good to chat on slack while
> you have it in this state. I wonder if some component is not trying to
> detach vMedia at all, or trying and failing. I can have a look when you have
> time. I can't comment on the higher-level components but happy to look at
> the iDRAC/Ironic level with you like we did in the previous bug.
> 
> Thank you,
> Jacob

Yes. Same machine as in https://bugzilla.redhat.com/show_bug.cgi?id=2054361

Comment 6 Alexander Chuzhoy 2022-02-22 21:21:00 UTC
Observation:


After reproducing the reported issue, I updated the BIOS to 2.13.3
Firmware: 5.00.20.10


Removed BMH, infraenv, clusterdeployment ... and re-created all.

The issue still persists.

Comment 7 Jacob Anders 2022-03-02 06:52:24 UTC
Thank you for the update Sasha.

My understanding is that the vMedia image is removed while still attached to the server, which leaves the iDRAC in a state where it can no longer attach or detach vMedia and needs to be reset.

Given we reproduced this with the latest firmware, I discussed the iDRAC problem with Dell; they confirmed they can reproduce it as well and consider it a firmware issue. There is a plan to enhance the iDRAC firmware so that it is more resilient.

However, I think it would be good to investigate further on our side and see whether we can avoid triggering the firmware bug - at the end of the day, this happens when we remove the vMedia image while it's mounted and being accessed, which doesn't sound like the right thing to do.

Do you know which component is responsible for removing the image within the provisioning flow you are using (I am less familiar with the ACM/ZTP approach)? Let me know - I can also ask around.

Comment 8 Flavio Percoco 2022-03-02 07:57:24 UTC
@

Comment 9 Flavio Percoco 2022-03-02 07:59:53 UTC
(In reply to Alexander Chuzhoy from comment #0)
> Version:
> OCP 4.10.0-rc.1
> ACM: 2.4.2
> BM hardware: Dell R640 iDRAC9 
> BIOS Version	1.6.13
> iDRAC Firmware Version	5.00.20.10
> 
> 
> Spoke SNO deployment on real BM deployment gets stuck with:
>   "The installation is in progress: Cluster has hosts pending user action".
> 
> 
> The server actually reboots with the same ISO that doesn't disconnect
> automatically.
> 
> 
> The workaround is to manually disconnect the ISO through the "virtual media"
> tab.


Could you please upload the logs from the agent and assisted installer? 

Something tells me there may be an issue when setting the boot order.

What you are seeing is expected, since Ironic doesn't remove the virtual media from the BMC after deployment. The machine will therefore keep booting into the ISO if the boot order is wrong, hence the deployment being stuck in "pending user action".
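The failure mode described here can be illustrated with a small sketch (illustrative only, not product code): given the BootOrder list and the entry labels that efibootmgr prints, the host ends up back in the ISO whenever the virtual CD entry outranks the disk entry. The entry IDs and labels below are taken from the efibootmgr log captured later in this bug.

```python
def boots_into_iso(boot_order, entries, iso_label="Virtual CD/DVD"):
    """Return True if the first BootOrder entry points at the virtual CD.

    boot_order: list of entry IDs, highest priority first (BootOrder line).
    entries: mapping of entry ID -> label (the Boot000X* lines).
    """
    return entries.get(boot_order[0]) == iso_label

entries = {
    "0002": "Virtual CD/DVD",
    "0005": "Red Hat Enterprise Linux",
    "0006": "Red Hat Enterprise Linux",
}

# BootOrder from the log: 0006,0002,0005,0004 -> a disk entry is first.
print(boots_into_iso(["0006", "0002", "0005", "0004"], entries))  # False
# If the CD entry wins instead, the host re-boots the installation ISO.
print(boots_into_iso(["0002", "0006", "0005", "0004"], entries))  # True
```

Note that two of the entries above carry the same "Red Hat Enterprise Linux" label, which is exactly the ambiguity this bug ended up being about.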

Comment 10 Jacob Anders 2022-03-02 08:43:47 UTC
Thank you for your insights, Flavio. This makes a lot of sense. My thinking was stuck within the boundaries of the previous problem we looked at on the same server. Given the machine keeps booting from the same ISO over and over, this is clearly not an issue of vMedia attachment not working - that was a red herring, please disregard it.

Comment 11 Alexander Chuzhoy 2022-03-04 22:14:08 UTC
Reproduced with: 
ACM: quay.io/acm-d/acm-custom-registry:v2.4.2-RC5
OCP: 4.10.0-rc.6




oc describe agent 84b7944a-00af-c659-da70-3f624ba4aff5 
Name:         84b7944a-00af-c659-da70-3f624ba4aff5
Namespace:    qe1
Labels:       agent-install.openshift.io/bmh=master-1-0
              infraenvs.agent-install.openshift.io=qe1
Annotations:  <none>
API Version:  agent-install.openshift.io/v1beta1
Kind:         Agent
Metadata:
  Creation Timestamp:  2022-03-04T21:46:54Z
  Finalizers:
    agent.agent-install.openshift.io/ai-deprovision
  Generation:  2
  Managed Fields:
    API Version:  agent-install.openshift.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:debugInfo:
          f:state:
          f:stateInfo:
        f:inventory:
          f:disks:
        f:progress:
          f:currentStage:
          f:stageStartTime:
          f:stageUpdateTime:
    Manager:         assisted-service
    Operation:       Update
    Subresource:     status
    Time:            2022-03-04T21:54:23Z
  Resource Version:  622991
  UID:               f58b90ce-6550-4293-8984-f3d6179bc21a
Spec:
  Approved:  true
  Cluster Deployment Name:
    Name:                qe1
    Namespace:           qe1
  Hostname:              master-1-0
  installation_disk_id:  /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
  Role:                  master
Status:
  Bootstrap:  true
  Conditions:
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The Spec has been successfully applied
    Reason:                SyncOK
    Status:                True
    Type:                  SpecSynced
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The agent's connection to the installation service is unimpaired
    Reason:                AgentIsConnected
    Status:                True
    Type:                  Connected
    Last Transition Time:  2022-03-04T21:48:05Z
    Message:               Installation already started and is in progress
    Reason:                AgentAlreadyInstalling
    Status:                True
    Type:                  RequirementsMet
    Last Transition Time:  2022-03-04T21:48:05Z
    Message:               The agent's validations are passing
    Reason:                ValidationsPassing
    Status:                True
    Type:                  Validated
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The installation is in progress: Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk PERC_H330_Mini 64cd98f04fde0e00246884800d3f8b52 (sda, /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52)
    Reason:                InstallationInProgress
    Status:                False
    Type:                  Installed
    Last Transition Time:  2022-03-04T21:46:54Z
    Message:               The agent is bound to a cluster deployment
    Reason:                Bound
    Status:                True
    Type:                  Bound
  Debug Info:
    Events URL:  https://assisted-service-rhacm.apps.rhos-qe.e2e.bos.redhat.com/api/assisted-install/v1/clusters/2eed5655-9860-4c9b-8b45-ad975cfabf85/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1NTYxNGMzZS1hN2ExLTQ1MjktYWYxZi1mZGIyYjNmNjA4ZjAifQ.ud3i-fooKRDYGfHtQa-l7n539OMiSF7p3Az5l4lXps2jEZSBV8goEYhRJXOsRnhtq_vZdHy_LOJmhZ9AvHahEQ&host_id=84b7944a-00af-c659-da70-3f624ba4aff5
    State:       installing-pending-user-action
    State Info:  Expected the host to boot from disk, but it booted the installation image - please reboot and fix boot order to boot from disk PERC_H330_Mini 64cd98f04fde0e00246884800d3f8b52 (sda, /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52)
  Inventory:
    Bmc Address:   10.19.133.14
    bmcV6Address:  2620:52:0:1381:4ed9:8fff:fe2f:30aa
    Boot:
      Current Boot Mode:  uefi
    Cpu:
      Architecture:     x86_64
      Clock Megahertz:  1000
      Count:            64
      Flags:
        fpu
        vme
        de
        pse
        tsc
        msr
        pae
        mce
        cx8
        apic
        sep
        mtrr
        pge
        mca
        cmov
        pat
        pse36
        clflush
        dts
        acpi
        mmx
        fxsr
        sse
        sse2
        ss
        ht
        tm
        pbe
        syscall
        nx
        pdpe1gb
        rdtscp
        lm
        constant_tsc
        art
        arch_perfmon
        pebs
        bts
        rep_good
        nopl
        xtopology
        nonstop_tsc
        cpuid
        aperfmperf
        pni
        pclmulqdq
        dtes64
        monitor
        ds_cpl
        vmx
        smx
        est
        tm2
        ssse3
        sdbg
        fma
        cx16
        xtpr
        pdcm
        pcid
        dca
        sse4_1
        sse4_2
        x2apic
        movbe
        popcnt
        tsc_deadline_timer
        aes
        xsave
        avx
        f16c
        rdrand
        lahf_lm
        abm
        3dnowprefetch
        cpuid_fault
        epb
        cat_l3
        cdp_l3
        invpcid_single
        pti
        intel_ppin
        ssbd
        mba
        ibrs
        ibpb
        stibp
        tpr_shadow
        vnmi
        flexpriority
        ept
        vpid
        ept_ad
        fsgsbase
        tsc_adjust
        bmi1
        hle
        avx2
        smep
        bmi2
        erms
        invpcid
        rtm
        cqm
        mpx
        rdt_a
        avx512f
        avx512dq
        rdseed
        adx
        smap
        clflushopt
        clwb
        intel_pt
        avx512cd
        avx512bw
        avx512vl
        xsaveopt
        xsavec
        xgetbv1
        xsaves
        cqm_llc
        cqm_occup_llc
        cqm_mbm_total
        cqm_mbm_local
        dtherm
        ida
        arat
        pln
        pts
        pku
        ospke
        md_clear
        flush_l1d
        arch_capabilities
      Model Name:  Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
    Disks:
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e48288375051
      By Path:     /dev/disk/by-path/pci-0000:86:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e48288375051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme0n1
      Path:        /dev/nvme0n1
      Serial:      BTLN903303DD1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme0n1"],"exit_status":0},"device":{"name":"/dev/nvme0n1","info_name":"/dev/nvme0n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN903303DD1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":560631054592}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar  4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":30,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":88059,"data_units_written":117579,"host_reads":2495327,"host_writes":1673837,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20092,"unsafe_shutdowns":889,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":30},"power_cycle_count":1135,"power_on_time":{"hours":20092}}
      Wwn:         eui.01000000010000005cd2e48288375051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e4db2b305051
      By Path:     /dev/disk/by-path/pci-0000:87:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e4db2b305051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme1n1
      Path:        /dev/nvme1n1
      Serial:      BTLN852300W71P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme1n1"],"exit_status":0},"device":{"name":"/dev/nvme1n1","info_name":"/dev/nvme1n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN852300W71P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":941322404096}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar  4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":32,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":75605,"data_units_written":26254,"host_reads":2263511,"host_writes":584940,"controller_busy_time":0,"power_cycles":1136,"power_on_hours":20093,"unsafe_shutdowns":53,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":32},"power_cycle_count":1136,"power_on_time":{"hours":20093}}
      Wwn:         eui.01000000010000005cd2e4db2b305051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e41ea0265051
      By Path:     /dev/disk/by-path/pci-0000:88:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e41ea0265051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme2n1
      Path:        /dev/nvme2n1
      Serial:      BTLN8500042S1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme2n1"],"exit_status":0},"device":{"name":"/dev/nvme2n1","info_name":"/dev/nvme2n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN8500042S1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":131535864064}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar  4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":30,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":76057,"data_units_written":34034,"host_reads":2270667,"host_writes":835654,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20093,"unsafe_shutdowns":16,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":30},"power_cycle_count":1135,"power_on_time":{"hours":20093}}
      Wwn:         eui.01000000010000005cd2e41ea0265051
      By ID:       /dev/disk/by-id/nvme-eui.01000000010000005cd2e4b42b305051
      By Path:     /dev/disk/by-path/pci-0000:89:00.0-nvme-1
      Drive Type:  SSD
      Id:          /dev/disk/by-id/nvme-eui.01000000010000005cd2e4b42b305051
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       Dell Express Flash NVMe P4610 1.6TB SFF
      Name:        nvme3n1
      Path:        /dev/nvme3n1
      Serial:      BTLN852300VV1P6AGN
      Size Bytes:  1600000000000
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/nvme3n1"],"exit_status":0},"device":{"name":"/dev/nvme3n1","info_name":"/dev/nvme3n1","type":"nvme","protocol":"NVMe"},"model_name":"Dell Express Flash NVMe P4610 1.6TB SFF","serial_number":"BTLN852300VV1P6AGN","firmware_version":"VDV1DP21","nvme_pci_vendor":{"id":32902,"subsystem_id":4136},"nvme_ieee_oui_identifier":6083300,"nvme_total_capacity":1600000000000,"nvme_unallocated_capacity":0,"nvme_controller_id":0,"nvme_number_of_namespaces":1,"nvme_namespaces":[{"id":1,"size":{"blocks":3125000000,"bytes":1600000000000},"capacity":{"blocks":3125000000,"bytes":1600000000000},"utilization":{"blocks":3125000000,"bytes":1600000000000},"formatted_lba_size":512,"eui64":{"oui":6083300,"ext_id":773818679552}}],"user_capacity":{"blocks":3125000000,"bytes":1600000000000},"logical_block_size":512,"local_time":{"time_t":1646430754,"asctime":"Fri Mar  4 21:52:34 2022 UTC"},"smart_status":{"passed":true,"nvme":{"value":0}},"nvme_smart_health_information_log":{"critical_warning":0,"temperature":31,"available_spare":100,"available_spare_threshold":10,"percentage_used":0,"data_units_read":76182,"data_units_written":63509,"host_reads":2277724,"host_writes":1249048,"controller_busy_time":0,"power_cycles":1135,"power_on_hours":20092,"unsafe_shutdowns":27,"media_errors":0,"num_err_log_entries":0,"warning_temp_time":0,"critical_comp_time":0},"temperature":{"current":31},"power_cycle_count":1135,"power_on_time":{"hours":20092}}
      Wwn:         eui.01000000010000005cd2e4b42b305051
      Bootable:    true
      By ID:       /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
      By Path:     /dev/disk/by-path/pci-0000:18:00.0-scsi-0:2:0:0
      Drive Type:  HDD
      Hctl:        1:2:0:0
      Id:          /dev/disk/by-id/wwn-0x64cd98f04fde0e00246884800d3f8b52
      Installation Eligibility:
        Eligible:  true
        Not Eligible Reasons:
      Io Perf:
      Model:       PERC_H330_Mini
      Name:        sda
      Path:        /dev/sda
      Serial:      64cd98f04fde0e00246884800d3f8b52
      Size Bytes:  479559942144
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/sda"],"messages":[{"string":"Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'","severity":"error"}],"exit_status":2}}
      Vendor:      DELL
      Wwn:         0x64cd98f04fde0e00246884800d3f8b52
      By Path:     /dev/disk/by-path/pci-0000:00:14.0-usb-0:14.4.2:1.0-scsi-0:0:0:0
      Drive Type:  ODD
      Hctl:        0:0:0:0
      Id:          /dev/disk/by-path/pci-0000:00:14.0-usb-0:14.4.2:1.0-scsi-0:0:0:0
      Installation Eligibility:
        Not Eligible Reasons:
          Disk is removable
          Disk is too small (disk only has 107 MB, but 120 GB are required)
          Drive type is ODD, it must be one of HDD, SSD.
      Io Perf:
      Model:       Virtual_CD/DVD
      Name:        sr0
      Path:        /dev/sr0
      Serial:      1028_123456
      Size Bytes:  106727424
      Smart:       {"json_format_version":[1,0],"smartctl":{"version":[7,1],"svn_revision":"5049","platform_info":"x86_64-linux-4.18.0-305.34.2.el8_4.x86_64","build_info":"(local build)","argv":["smartctl","--xall","--json=c","/dev/sr0"],"exit_status":4},"device":{"name":"/dev/sr0","info_name":"/dev/sr0","type":"scsi","protocol":"SCSI"},"vendor":"Linux","product":"Virtual CD/DVD","model_name":"Linux Virtual CD/DVD","revision":"0001","user_capacity":{"blocks":52113,"bytes":106727424},"logical_block_size":2048,"device_type":{"scsi_value":5,"name":"CD/DVD"},"local_time":{"time_t":1646430755,"asctime":"Fri Mar  4 21:52:35 2022 UTC"},"temperature":{"current":0,"drive_trip":0}}
      Vendor:      Linux
    Hostname:      api.qe1.kni.lab.eng.bos.redhat.com
    Interfaces:
      Bios Dev Name:  em1
      Flags:
        up
        broadcast
        multicast
      Has Carrier:  true
      ipV4Addresses:
      ipV6Addresses:
      Mac Address:    98:03:9b:61:7c:80
      Mtu:            1500
      Name:           eno1
      Product:        0x1015
      Speed Mbps:     25000
      Vendor:         0x15b3
      Bios Dev Name:  em2
      Flags:
        up
        broadcast
        multicast
      Has Carrier:  true
      ipV4Addresses:
        10.19.134.13/25
        10.19.134.15/25
      ipV6Addresses:
      Mac Address:  98:03:9b:61:7c:81
      Mtu:          1500
      Name:         eno2
      Product:      0x1015
      Speed Mbps:   25000
      Vendor:       0x15b3
    Memory:
      Physical Bytes:  206158430208
      Usable Bytes:    201228070912
    System Vendor:
      Manufacturer:   Dell Inc.
      Product Name:   PowerEdge R640
      Serial Number:  176P2W2
  Ntp Sources:
    Source Name:   ntp.xtom.com
    Source State:  unreachable
    Source Name:   38.229.54.9
    Source State:  unreachable
    Source Name:   t1.time.bf1.yahoo.com
    Source State:  unreachable
    Source Name:   104.171.113.34
    Source State:  unreachable
    Source Name:   gopher.fart.website
    Source State:  unreachable
    Source Name:   li1.forfun.net
    Source State:  unreachable
    Source Name:   2601:603:b7f:fec0:fec0:b7f:603:2601
    Source State:  unreachable
    Source Name:   mci.clearnet.pw
    Source State:  unreachable
  Progress:
    Current Stage:      Rebooting
    Stage Start Time:   2022-03-04T21:54:23Z
    Stage Update Time:  2022-03-04T21:54:23Z
  Role:                 master
Events:                 <none>

Comment 14 Eran Cohen 2022-03-15 08:11:24 UTC
Without the agent and assisted installer logs we can't really tell what went wrong here.
@

Comment 15 Eran Cohen 2022-03-15 08:35:24 UTC
@sasha I see https://bugzilla.redhat.com/show_bug.cgi?id=1975848#c22.
Since this seems to reproduce can you get these logs from the host prior to the reboot?
* This can be done by:
 1. ssh to the host during the installation.
 2. When prompted with the shutdown message (e.g. "Installation complete, this host will shutdown in...") cancel the shutdown by typing: shutdown -c
 3. Get the logs:
  a. sudo journalctl TAG=agent
  b. sudo journalctl -u agent.service
  c. sudo journalctl TAG=installer

You can resume the installation after collecting the logs by typing: shutdown -r +1 "Done collecting logs ;-), server is going to reboot."

Comment 20 Eran Cohen 2022-04-25 12:41:32 UTC
assisted-installer log setting efi boot:

Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Setting efibootmgr to boot from disk"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Using EFI file 'shimx64.efi' for GOARCH 'amd64'"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="efibootmgr: ** Warning ** : Boot0005 has same label Red Hat Enterprise Linux\n"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="BootCurrent: 0002\nBootOrder: 0006,0002,0005,0004\nBoot0000* FlexBoot v3.5.504 (PCI 19:00.0)\tBBS(128,FlexBoot v3.5.504 (PCI 19:00.0),0x0)................`...........a.........................................................A..................\u007f...F.l.e.x.B.o.o.t. .v.3...5...5.0.4. .(.P.C.I. .1.9.:.0.0...0.)...\nBoot0001* FlexBoot v3.5.504 (PCI 19:00.1)\tBBS(128,FlexBoot v3.5.504 (PCI 19:00.1),0x0)................`...........a.........................................................A..................\u007f...F.l.e.x.B.o.o.t. .v.3...5...5.0.4. .(.P.C.I. .1.9.:.0.0...1.)...\n"
Apr 07 20:14:48 api.qe1.kni.lab.eng.bos.redhat.com installer[124716]: time="2022-04-07T20:14:48Z" level=info msg="Boot0002* Virtual CD/DVD\tPciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(3,0)/USB(1,0)\nBoot0003* Hard drive C:\tVenHw(d6c0639f-c705-4eb9-aa4f-5802d8823de6)............................f.........................................................A..................\u007f...P.E.R.C. .H.3.3.0. .M.i.n.i.(.b.u.s. .1.8. .d.e.v. .0.0.)...\nBoot0004* Integrated NIC 1 Port 1 Partition 1\tVenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)\nBoot0005* Red Hat Enterprise Linux\tHD(2,GPT,ed26d305-052e-4148-9b44-05357053742a,0x1000,0x3f800)/File(\\EFI\\redhat\\shimx64.efi)\nBoot0006* Red Hat Enterprise Linux\tHD(2,GPT,687cb1a3-b974-438a-ab6a-0eae099cfcd2,0x1000,0x3f800)/File(\\EFI\\redhat\\shimx64.efi)\n"

Comment 21 Eran Cohen 2022-04-25 17:22:19 UTC
There is a warning in the efibootmgr log about another device with the same label as our boot disk
https://github.com/rhboot/efibootmgr/blob/103aa22ece98f09fe3ea2a0c83988f0ee2d0e5a8/src/efibootmgr.c#L228
Perhaps this duplication causes a conflict at boot, resulting in booting from the wrong device.

fpercoco yshnaidm otuchfel ?

Comment 23 yliu1 2022-05-26 17:35:52 UTC
Changing severity to high because it is encountered on multiple servers with multiple releases and requires manual intervention during ZTP.

Comment 24 yliu1 2022-05-26 17:44:33 UTC
I'm not sure why there are two active boot entries with the "Red Hat Enterprise Linux" label on my server. The servers are test machines that we switch between different OCP releases on a regular basis. So far I have seen this issue when installing 4.9, 4.10 and 4.11.

Boot0014* Red Hat Enterprise Linux	HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0018* Red Hat Enterprise Linux	HD(2,GPT,ed26d305-052e-4148-9b44-05357053742a,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)

Comment 25 yliu1 2022-05-26 20:56:27 UTC
I removed the additional boot entries with the "Red Hat Enterprise Linux" label that are not "current", redeployed my cluster, and the issue was not encountered. I will keep an eye on other clusters to confirm whether it is caused by the multiple active "Red Hat Enterprise Linux" boot entries.
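This cleanup can be scripted: parse the `efibootmgr -v` output, collect the active entries sharing the duplicated label, and emit `efibootmgr -b XXXX -B` delete commands for everything except the entry you want to keep. This is an illustrative sketch assuming the output format shown in the previous comment; review the printed commands before running them.

```python
import re

def duplicate_removal_cmds(efibootmgr_output, label, keep):
    """Return efibootmgr delete commands for active duplicate-label entries.

    keep: the entry ID to preserve (e.g. the one the host currently boots).
    Active entries look like: Boot0014* Red Hat Enterprise Linux<TAB>HD(...)
    """
    pattern = re.compile(r"^Boot([0-9A-Fa-f]{4})\* (.+?)\t")
    dupes = [m.group(1)
             for line in efibootmgr_output.splitlines()
             if (m := pattern.match(line)) and m.group(2) == label]
    return [f"efibootmgr -b {eid} -B" for eid in dupes if eid != keep]

sample = (
    "Boot0014* Red Hat Enterprise Linux\tHD(2,GPT,1e8869d4,...)\n"
    "Boot0018* Red Hat Enterprise Linux\tHD(2,GPT,ed26d305,...)\n"
)
print(duplicate_removal_cmds(sample, "Red Hat Enterprise Linux", keep="0018"))
# -> ['efibootmgr -b 0014 -B']
```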

Comment 26 yliu1 2022-05-27 00:30:45 UTC
My second server also has the correct boot order (HD first, CD second) in a fresh ZTP deployment, after I removed all the extra boot entries AND disabled the "Hard-disk Drive Placeholder" option in the iDRAC BIOS settings.

Comment 27 Alexander Chuzhoy 2022-06-23 04:01:04 UTC
I reproduced this with:
OCP version: 4.10.18
multicluster-engine.v2.0.0

iDRAC9 Firmware Version	5.10.10.00

Moving back to assigned.

Comment 28 Alexander Chuzhoy 2022-06-23 14:35:12 UTC
I'm actually seeing that cleaning the boot entries doesn't help.
I opened a new bug against Ironic with the original title of this bug (the virtual media doesn't disconnect the ISO during spoke deployment after writing the image to the disk):

https://bugzilla.redhat.com/show_bug.cgi?id=2100501

Comment 29 Alexander Chuzhoy 2022-08-26 17:41:59 UTC
Tested with:
HUB: 4.11.0-0.nightly-2022-08-24-091058
     multicluster-engine.v2.1.0
SPOKE: 4.11.2

Successfully deployed SNO spoke on real BM:
oc get agentclusterinstall qe1 -o json|jq ".status.conditions[-3].message" -r
The installation has completed: Cluster is installed




BM hardware: 
Dell R640 iDRAC9 
BIOS Version	1.6.13
Firmware: 5.10.10.00

Comment 30 Alexander Chuzhoy 2022-08-29 14:43:08 UTC
Although the deployment passed, I noticed that the virtual media is still mounted on the BM node.
Is that expected?

Comment 31 Igal Tsoiref 2022-08-31 09:01:54 UTC
Yes, it is expected in the regular ZTP flow.
In the converged flow it will be unmounted, but that is not the default right now.

Comment 32 Red Hat Bugzilla 2023-09-18 04:32:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

