Bug 1970567
| Field | Value |
|---|---|
| Summary: | [master] [Assisted-4.8] AI pods throw errors "cannot access the media (ISO) - media was likely disconnected" |
| Product: | OpenShift Container Platform |
| Component: | assisted-installer |
| Sub component: | assisted-service |
| Status: | CLOSED WONTFIX |
| Severity: | high |
| Priority: | urgent |
| Version: | 4.8 |
| Keywords: | Triaged |
| Reporter: | Yuri Obshansky <yobshans> |
| Assignee: | Michael Levy <milevy> |
| QA Contact: | Yuri Obshansky <yobshans> |
| CC: | aos-bugs, mfilanov, milevy |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | AI-Team-Core |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Clones: | 1974735 (view as bug list) |
| Bug Blocks: | 1974735 |
| Type: | Bug |
| Last Closed: | 2021-12-26 13:22:19 UTC |
Description
Yuri Obshansky, 2021-06-10 17:39:23 UTC
@milevy: Need to understand whether this is a bug, or whether the virtual media was indeed disconnected in these cases.

@yobshans: The only way to verify that is from the host log.

Can you please find the logs, or guide me on how to find them?

There is no way to verify from the service logs that the device was disconnected, but it should be logged in the host logs.

I'm not sure that the logs are available at this stage...

We found the problematic cluster "my-ai-cluster" (https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/2f232558-f9b6-41bb-abc3-ceb9fd5bf17a), which is the source of many errors in the pod logs. Unfortunately, we cannot find the owner of the cluster/host for deeper investigation. The host is f05-h29-000-1029p.rdu2.scalelab.redhat.com. This is an SNO cluster with OpenShift Virtualization and base domain assistedinstaller.sysdeseng.com (Route 53).

Host details:

| Field | Value |
|---|---|
| UUID | 02b62875-2f3f-7d23-2923-7db07d78f1ae |
| Manufacturer | Supermicro |
| Product | SYS-1029P-WTR |
| Serial number | S263530X7921112 |
| Memory capacity | 251.55 GiB |
| CPU model name | Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz |
| CPU cores and clock speed | 64 cores (hyper-threaded) at 3,700 MHz |
| CPU architecture | x86_64 |
| Hardware type | Bare metal |
| BMC address | 10.1.41.1 |
| Boot mode | bios |
| NTP status | |

3 disks:

| Name | Role | Limitations | Drive type | Size | Serial | Model | WWN |
|---|---|---|---|---|---|---|---|
| nvme0n1 | | | SSD | 512.11 GB | S316NX0J406601 | SAMSUNG MZVKW512HMJP-00000 | eui.002538c471b01a56 |
| sda | Installation disk | | SSD | 480.10 GB | BTDV724407L9480BGN | INTEL_SSDSC2BB48 | 0x55cd2e414dd3ef67 |
| sr0 (bootable) | | | ODD | 1.03 GB | ATEN_Virtual_CDROM-0:0 | Virtual_CDROM | |

6 NICs:

| Name | MAC address | IPv4 address | IPv6 address | Speed |
|---|---|---|---|---|
| eno1 | 0c:c4:7a:fa:19:4c | 10.1.40.192/23 | 2620:52:0:128:1aa9:1377:246d:cac8/64 | 1000 Mbps |
| eno2 | 0c:c4:7a:fa:19:4d | N/A | | |
| ens2f0 | ac:1f:6b:2d:19:04 | | | 10000 Mbps |
| ens2f1 | ac:1f:6b:2d:19:05 | | | 10000 Mbps |
| ens2f2 | ac:1f:6b:2d:19:06 | | | 10000 Mbps |
| ens2f3 | ac:1f:6b:2d:19:07 | | 2620:52:0:134:49b7:aacf:9b16:b002/64 | 10000 Mbps |

Cluster events:

- Deleted image from backend because it expired. It may be generated again at any time.
- 6/14/2021, 9:10:12 PM: Updated status of cluster my-ai-cluster to ready
- 6/14/2021, 9:10:12 PM: Cluster validation 'all-hosts-are-ready-to-install' is now fixed
- 6/14/2021, 9:10:07 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "discovering" to "known" (Host is ready to be installed)
- 6/14/2021, 9:10:07 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'has-inventory' is now fixed
- 6/14/2021, 9:09:47 PM (warning): Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: validation 'has-inventory' that used to succeed is now failing
- 6/14/2021, 9:09:47 PM: Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: validation 'connected' is now fixed
- 6/14/2021, 9:09:44 PM: Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: registered to cluster
- 6/14/2021, 9:09:44 PM: Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: updated status from "resetting-pending-user-action" to "discovering" (Waiting for host to send hardware details)
- 6/14/2021, 7:59:08 PM (warning): Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'connected' that used to succeed is now failing
- 6/14/2021, 7:57:02 PM (warning): Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing
- 6/14/2021, 7:56:56 PM: Installation reset for host f05-h29-000-1029p.rdu2.scalelab.redhat.com
- 6/14/2021, 7:56:56 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "error" to "resetting-pending-user-action" (Host requires booting into the discovery image to complete resetting the installation)
- 6/14/2021, 7:56:56 PM: Reset cluster installation
- 6/14/2021, 7:55:31 PM: Updated status of cluster my-ai-cluster to error
- 6/14/2021, 7:55:22 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Failed: failed executing nsenter [-t 1 -m -i -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecf386d37752299dd66ebf9f0963a7d8a29fbc134b7cbd8cdb51b026e60080c0 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "... or reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1 error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)"
- 6/14/2021, 7:55:22 PM (error): Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "installing-in-progress" to "error" (Failed - same nsenter/rpm-ostree timeout error as the event above)
- 6/14/2021, 7:53:01 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Installing: bootstrap
- 6/14/2021, 7:53:00 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Starting installation: bootstrap
- 6/14/2021, 7:53:00 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "installing" to "installing-in-progress" (Starting installation)
- 6/14/2021, 7:53:00 PM: f05-h29-000-1029p.rdu2.scalelab.redhat.com: Performing quick format of disk sda (/dev/disk/by-id/wwn-0x55cd2e414dd3ef67)
- 6/14/2021, 7:53:00 PM: f05-h29-000-1029p.rdu2.scalelab.redhat.com: Performing quick format of disk nvme0n1 (/dev/disk/by-id/nvme-eui.002538c471b01a56)
- 6/14/2021, 7:52:27 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "preparing-successful" to "installing" (Installation is in progress)
- 6/14/2021, 7:52:22 PM: Updated status of cluster my-ai-cluster to installing
- 6/14/2021, 7:52:19 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "preparing-for-installation" to "preparing-successful" (Host finished successfully to prepare for installation)
- 6/14/2021, 7:52:13 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:99578016c9c56ee258124c4b4253a101da00cfbaae5b4b44cf8d049b6466b24d. result: success. time: 2.289669 seconds; size: 406456118.000000 bytes; download rate: 177.517431 MBps
- 6/14/2021, 7:52:13 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecf386d37752299dd66ebf9f0963a7d8a29fbc134b7cbd8cdb51b026e60080c0. result: success. time: 6.268845 seconds; size: 453151831.000000 bytes; download rate: 72.286334 MBps
- 6/14/2021, 7:52:13 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-rhel8:v1.0.0-54. result: success. time: 5.416992 seconds; size: 310005440.000000 bytes; download rate: 57.228341 MBps
- 6/14/2021, 7:51:47 PM: Cluster was prepared successfully for installation
- 6/14/2021, 7:51:31 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "known" to "preparing-for-installation" (Host is preparing for installation)
- 6/14/2021, 7:51:24 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: set as bootstrap
- 6/14/2021, 7:50:29 PM: Updated status of cluster my-ai-cluster to ready
- 6/14/2021, 7:50:29 PM: Cluster validation 'all-hosts-are-ready-to-install' is now fixed
- 6/14/2021, 7:50:29 PM: Cluster validation 'machine-cidr-defined' is now fixed
- 6/14/2021, 7:50:29 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "insufficient" to "known" (Host is ready to be installed)
- 6/14/2021, 7:20:37 PM (warning): Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "discovering" to "insufficient" (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server)
- 6/14/2021, 7:20:37 PM: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'has-inventory' is now fixed
- 6/14/2021, 7:20:01 PM: Cluster validation 'sufficient-masters-count' is now fixed
- 6/14/2021, 7:20:01 PM (warning): Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing
- 6/14/2021, 7:19:51 PM: Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: registered to cluster
- 6/14/2021, 4:56:19 PM: Generated image (Image type is "full-iso", SSH public key is set)
- 6/14/2021, 4:52:31 PM: Updated status of cluster my-ai-cluster to pending-for-input
- 6/14/2021, 4:52:22 PM: Successfully registered cluster my-ai-cluster with id 2f232558-f9b6-41bb-abc3-ceb9fd5bf17a
- 6/14/2021, 4:52:22 PM: Registered cluster "my-ai-cluster"

Created attachment 1792091 [details]
pod logs
The main problem is that when the virtual media is disconnected, then as long as the host is not rebooted, we keep spamming the logs with this message. We need to understand why this host did not go into an error state, and if it does, make sure that no more messages are sent to the log.

PR https://github.com/openshift/assisted-service/pull/1867 should resolve this issue, because it changes the host's status to Failed and stops the installation. It has not been merged to staging yet, so we should verify this issue after that PR merges.

@liat PR https://github.com/openshift/assisted-service/pull/1867 changed the status of the host to error, but even in the error state the assisted-service keeps sending commands to execute. Created a new PR: https://github.com/openshift/assisted-service/pull/2061

*** Bug 1977292 has been marked as a duplicate of this bug. ***

Unfortunately, it was reproduced again on another machine:

10:03 time="2021-06-21T13:15:59Z" level=error msg="Received step reply <execute-a39ece51> from cluster <ebef33a6-fed7-451c-a135-53e5c0c89b10> host <dc38e705-4a24-2b8a-dded-a3c33f3b7373> exit-code <256> stderr <cannot access the media (ISO) - media was likely disconnected: open /dev/sr0: no medium found> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:3123" cluster_id=ebef33a6-fed7-451c-a135-53e5c0c89b10 go-id=1264934 host_id=dc38e705-4a24-2b8a-dded-a3c33f3b7373 pkg=Inventory request_id=c5513ab6-b501-47db-a03d-08d37e1ae419

10:03 https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/ebef33a6-fed7-451c-a135-53e5c0c89b10

10:04 Cluster deployment failed, but the agent continues to throw errors.

Not sure that it's connected to this issue... https://coreos.slack.com/archives/CUPJTHQ5P/p1626795956241600

@Vitaliy Emporopulo, can you please elaborate?
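The "open /dev/sr0: no medium found" stderr in the step reply suggests the agent's check fails with Linux's ENOMEDIUM errno when the BMC drops the virtual CD-ROM. A minimal sketch of such a check (`media_disconnected` is a hypothetical helper for illustration, not the actual agent code):

```python
import errno
import os

# Linux errno 123 ("No medium found"); not defined in errno on every platform.
ENOMEDIUM = getattr(errno, "ENOMEDIUM", 123)

def media_disconnected(device_path: str) -> bool:
    """Return True when opening the device fails with ENOMEDIUM, i.e. the
    drive node exists but the (virtual) disc behind it is gone - the same
    condition the agent reports as "open /dev/sr0: no medium found"."""
    try:
        fd = os.open(device_path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno == ENOMEDIUM
    os.close(fd)
    return False
```

Note that this condition is distinct from the device node disappearing entirely (ENOENT), which is why only a reboot, or an explicit status change by the service, gets the host out of the spamming loop.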
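The follow-up fix tracked in PR 2061 is, in essence, about not dispatching further steps to a host that already reached an error state. A hedged sketch of that idea (the status names are modeled on the ones in the events above; the real assisted-service is written in Go and structured differently):

```python
# Hypothetical terminal host statuses; once a host lands in one of these,
# the service should stop handing the agent new instructions.
TERMINAL_STATUSES = {"error", "cancelled", "installed"}

def next_steps(host_status: str, pending_steps: list[str]) -> list[str]:
    """Return the instructions to send to an agent on its next poll.

    Hosts in a terminal state get an empty list, so an agent on a host
    whose virtual media vanished stops being asked to run steps it can
    only fail, and stops flooding the service log with step replies."""
    if host_status in TERMINAL_STATUSES:
        return []
    return pending_steps
```

With a gate like this, moving the host to "error" (as PR 1867 does) would also silence the repeated "cannot access the media" replies, since no further execute steps reach the agent.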