Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1970567

Summary: [master] [Assisted-4.8] AI pods throw errors: "cannot access the media (ISO) - media was likely disconnected"
Product: OpenShift Container Platform Reporter: Yuri Obshansky <yobshans>
Component: assisted-installer Assignee: Michael Levy <milevy>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED WONTFIX Docs Contact:
Severity: high    
Priority: urgent CC: aos-bugs, mfilanov, milevy
Version: 4.8 Keywords: Triaged
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AI-Team-Core
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1974735 (view as bug list) Environment:
Last Closed: 2021-12-26 13:22:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1974735    
Attachments: pod logs (flags: none)

Description Yuri Obshansky 2021-06-10 17:39:23 UTC
Description of problem:
Assisted Service pods running on the Staging environment throw many errors:
"cannot access the media (ISO) - media was likely disconnected"

Examples:

time="2021-06-10T15:45:12Z" level=error msg="Received step reply <inventory-576b0f3c> from cluster <c3a6b27c-0ff3-49e3-a57b-5a8412e3df15> host <12eca4f4-5704-aad3-f84f-63600412f417> exit-code <256> stderr <cannot access the media (ISO) - media was likely disconnected: open /dev/sr0: no medium found> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:3047" cluster_id=c3a6b27c-0ff3-49e3-a57b-5a8412e3df15 go-id=3746360 host_id=12eca4f4-5704-aad3-f84f-63600412f417 pkg=Inventory request_id=2dd27f1c-547f-491a-93eb-4e88622245bb


time="2021-06-10T15:45:12Z" level=error msg="Received step reply <ntp-synchronizer-4dfa46a5> from cluster <c3a6b27c-0ff3-49e3-a57b-5a8412e3df15> host <12eca4f4-5704-aad3-f84f-63600412f417> exit-code <256> stderr <cannot access the media (ISO) - media was likely disconnected: open /dev/sr0: no medium found> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:3047" cluster_id=c3a6b27c-0ff3-49e3-a57b-5a8412e3df15 go-id=3747829 host_id=12eca4f4-5704-aad3-f84f-63600412f417 pkg=Inventory request_id=fef7a9de-12fb-43d2-8cbf-51a623112be9

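For context, the quoted stderr matches what Linux returns when a virtual CD-ROM is detached: opening /dev/sr0 with no medium present fails with ENOMEDIUM ("no medium found"). The following is a minimal Go sketch (a hypothetical helper, not the actual agent code) that reproduces the wrapped error seen in the logs:

package main

import (
	"fmt"
	"os"
)

// checkMedia attempts to open the CD-ROM device. With the virtual media
// detached, os.Open fails on Linux with "open /dev/sr0: no medium found",
// which is then wrapped into the message quoted in the service logs.
func checkMedia(device string) error {
	f, err := os.Open(device)
	if err != nil {
		return fmt.Errorf("cannot access the media (ISO) - media was likely disconnected: %w", err)
	}
	defer f.Close()
	return nil
}

func main() {
	if err := checkMedia("/dev/sr0"); err != nil {
		fmt.Println(err)
	}
}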

Version-Release number of selected component (if applicable):
v1.0.21.3

How reproducible:
https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/k8s/ns/assisted-installer-stage/pods/assisted-service-69dbd58656-krm4z/logs

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ronnie Lazar 2021-06-13 09:16:34 UTC
@milevy Need to understand whether this is a bug, or whether the virtual media was indeed disconnected in these cases.

Comment 2 Michael Levy 2021-06-16 12:01:37 UTC
@yobshans The only way to verify that is from the host log.
Can you please find the logs, or guide me on how to find them?

Comment 3 Michael Levy 2021-06-16 12:07:01 UTC
There is no way to verify from the service logs that the device was disconnected, but it should be logged in the host logs.
I'm not sure that the logs are available at this stage...

Comment 4 Yuri Obshansky 2021-06-16 15:27:23 UTC
We found the problematic cluster "my-ai-cluster"
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/2f232558-f9b6-41bb-abc3-ceb9fd5bf17a which is the cause of many of the errors in the pod logs.
Unfortunately, we cannot find the owner of the cluster/host to investigate further.
Host: f05-h29-000-1029p.rdu2.scalelab.redhat.com
This is an SNO cluster with OpenShift Virtualization
and base domain assistedinstaller.sysdeseng.com (Route 53).

Host Details

UUID    02b62875-2f3f-7d23-2923-7db07d78f1ae
Manufacturer     Supermicro
Product    SYS-1029P-WTR
Serial number    S263530X7921112

Memory capacity    251.55 GiB
CPU model name    Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
CPU cores and clock speed    64 cores (hyper-threaded) at 3,700 MHz
CPU architecture    x86_64

Hardware type    Bare metal
BMC address    10.1.41.1
Boot mode    bios
NTP status
3 Disks
Name	Role	Limitations	Drive type	Size	Serial	Model	WWN
nvme0n1			SSD	512.11 GB	S316NX0J406601	SAMSUNG MZVKW512HMJP-00000	eui.002538c471b01a56
sda	Installation disk		SSD	480.10 GB	BTDV724407L9480BGN	INTEL_SSDSC2BB48	0x55cd2e414dd3ef67
sr0 (bootable)			ODD	1.03 GB	ATEN_Virtual_CDROM-0:0	Virtual_CDROM	
6 NICs
Name	MAC address	IPv4 address	IPv6 address	Speed
eno1	0c:c4:7a:fa:19:4c	10.1.40.192/23	2620:52:0:128:1aa9:1377:246d:cac8/64	1000 Mbps
eno2	0c:c4:7a:fa:19:4d			N/A
ens2f0	ac:1f:6b:2d:19:04			10000 Mbps
ens2f1	ac:1f:6b:2d:19:05			10000 Mbps
ens2f2	ac:1f:6b:2d:19:06			10000 Mbps
ens2f3	ac:1f:6b:2d:19:07		2620:52:0:134:49b7:aacf:9b16:b002/64	10000 Mbps

Comment 5 Yuri Obshansky 2021-06-16 15:27:50 UTC
Cluster events:
Deleted image from backend because it expired. It may be generated again at any time.
6/14/2021, 9:10:12 PM	Updated status of cluster my-ai-cluster to ready
6/14/2021, 9:10:12 PM	Cluster validation 'all-hosts-are-ready-to-install' is now fixed
6/14/2021, 9:10:07 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "discovering" to "known" (Host is ready to be installed)
6/14/2021, 9:10:07 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'has-inventory' is now fixed
6/14/2021, 9:09:47 PM	warning: Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: validation 'has-inventory' that used to succeed is now failing
6/14/2021, 9:09:47 PM	Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: validation 'connected' is now fixed
6/14/2021, 9:09:44 PM	Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: registered to cluster
6/14/2021, 9:09:44 PM	Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: updated status from "resetting-pending-user-action" to "discovering" (Waiting for host to send hardware details)
6/14/2021, 7:59:08 PM	warning: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'connected' that used to succeed is now failing
6/14/2021, 7:57:02 PM	warning: Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing
6/14/2021, 7:56:56 PM	Installation reset for host f05-h29-000-1029p.rdu2.scalelab.redhat.com
6/14/2021, 7:56:56 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "error" to "resetting-pending-user-action" (Host requires booting into the discovery image to complete resetting the installation)
6/14/2021, 7:56:56 PM	Reset cluster installation
6/14/2021, 7:55:31 PM	Updated status of cluster my-ai-cluster to error
6/14/2021, 7:55:22 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Failed: failed executing nsenter [-t 1 -m -i -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecf386d37752299dd66ebf9f0963a7d8a29fbc134b7cbd8cdb51b026e60080c0 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "... or reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1 error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)"
6/14/2021, 7:55:22 PM	error: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "installing-in-progress" to "error" (Failed - failed executing nsenter [-t 1 -m -i -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecf386d37752299dd66ebf9f0963a7d8a29fbc134b7cbd8cdb51b026e60080c0 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "... or reading osImageURL from rpm-ostree: error running rpm-ostree status --json: exit status 1 error: Failed to activate service 'org.projectatomic.rpmostree1': timed out (service_start_timeout=25000ms)")
6/14/2021, 7:53:01 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Installing: bootstrap
6/14/2021, 7:53:00 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: reached installation stage Starting installation: bootstrap
6/14/2021, 7:53:00 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "installing" to "installing-in-progress" (Starting installation)
6/14/2021, 7:53:00 PM	f05-h29-000-1029p.rdu2.scalelab.redhat.com: Performing quick format of disk sda(/dev/disk/by-id/wwn-0x55cd2e414dd3ef67)
6/14/2021, 7:53:00 PM	f05-h29-000-1029p.rdu2.scalelab.redhat.com: Performing quick format of disk nvme0n1(/dev/disk/by-id/nvme-eui.002538c471b01a56)
6/14/2021, 7:52:27 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "preparing-successful" to "installing" (Installation is in progress)
6/14/2021, 7:52:22 PM	Updated status of cluster my-ai-cluster to installing
6/14/2021, 7:52:19 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "preparing-for-installation" to "preparing-successful" (Host finished successfully to prepare for installation)
6/14/2021, 7:52:13 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:99578016c9c56ee258124c4b4253a101da00cfbaae5b4b44cf8d049b6466b24d. result: success. time: 2.289669 seconds; size: 406456118.000000 bytes; download rate: 177.517431 MBps
6/14/2021, 7:52:13 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ecf386d37752299dd66ebf9f0963a7d8a29fbc134b7cbd8cdb51b026e60080c0. result: success. time: 6.268845 seconds; size: 453151831.000000 bytes; download rate: 72.286334 MBps
6/14/2021, 7:52:13 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: New image status registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-rhel8:v1.0.0-54. result: success. time: 5.416992 seconds; size: 310005440.000000 bytes; download rate: 57.228341 MBps
6/14/2021, 7:51:47 PM	Cluster was prepared successfully for installation
6/14/2021, 7:51:31 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "known" to "preparing-for-installation" (Host is preparing for installation)
6/14/2021, 7:51:24 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: set as bootstrap
6/14/2021, 7:50:29 PM	Updated status of cluster my-ai-cluster to ready
6/14/2021, 7:50:29 PM	Cluster validation 'all-hosts-are-ready-to-install' is now fixed
6/14/2021, 7:50:29 PM	Cluster validation 'machine-cidr-defined' is now fixed
6/14/2021, 7:50:29 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "insufficient" to "known" (Host is ready to be installed)
6/14/2021, 7:20:37 PM	warning: Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: updated status from "discovering" to "insufficient" (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server)
6/14/2021, 7:20:37 PM	Host f05-h29-000-1029p.rdu2.scalelab.redhat.com: validation 'has-inventory' is now fixed
6/14/2021, 7:20:01 PM	Cluster validation 'sufficient-masters-count' is now fixed
6/14/2021, 7:20:01 PM	warning: Cluster validation 'all-hosts-are-ready-to-install' that used to succeed is now failing
6/14/2021, 7:19:51 PM	Host 02b62875-2f3f-7d23-2923-7db07d78f1ae: registered to cluster
6/14/2021, 4:56:19 PM	Generated image (Image type is "full-iso", SSH public key is set)
6/14/2021, 4:52:31 PM	Updated status of cluster my-ai-cluster to pending-for-input
6/14/2021, 4:52:22 PM	Successfully registered cluster my-ai-cluster with id 2f232558-f9b6-41bb-abc3-ceb9fd5bf17a
6/14/2021, 4:52:22 PM	Registered cluster "my-ai-cluster"

Comment 6 Yuri Obshansky 2021-06-18 15:47:49 UTC
Created attachment 1792091 [details]
pod logs

Comment 7 Ronnie Lazar 2021-06-20 07:46:16 UTC
The main problem is that when the virtual media is disconnected, then as long as the host is not rebooted, we keep spamming the logs with this message.
We need to understand why this host did not go into an error state, and if it does, make sure that no more messages are sent to the log.
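A minimal sketch, assuming a simplified host model (the names here are illustrative, not taken from the assisted-service code base), of the behavior requested above: on the first media-disconnection reply, move the host to an error state and drop subsequent identical replies instead of logging each one:

package main

import "log"

type host struct {
	id     string
	status string
}

// onMediaDisconnected moves the host to "error" on the first disconnection
// report and silently ignores repeats, so the message is logged once.
func onMediaDisconnected(h *host) {
	if h.status == "error" {
		return // already failed for this reason; do not spam the log
	}
	log.Printf("host %s: media disconnected, moving to error state", h.id)
	h.status = "error"
}

func main() {
	h := &host{id: "12eca4f4-5704-aad3-f84f-63600412f417", status: "installing-in-progress"}
	onMediaDisconnected(h) // logged once
	onMediaDisconnected(h) // suppressed
}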

Comment 8 Michael Levy 2021-06-21 12:17:57 UTC
PR https://github.com/openshift/assisted-service/pull/1867 should resolve this issue because it changes the host's status to Failed and stops the installation.
It hasn't been merged to Staging yet, so we should verify this issue after the PR is merged. @liat

Comment 9 Michael Levy 2021-06-24 07:33:54 UTC
PR https://github.com/openshift/assisted-service/pull/1867 changed the status of the host to error, but even in the error state the assisted-service keeps sending commands to execute.
Created a new PR: https://github.com/openshift/assisted-service/pull/2061
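The guard described here could look roughly like the following sketch (illustrative names, not the actual code of PR #2061): when the agent polls for its next instructions, a host already in the error state receives an empty step list, so no further commands are executed and no further failures are reported:

package main

import "fmt"

type step struct{ name string }

// nextSteps returns the commands the agent should run next. Hosts in a
// terminal error state receive none, stopping the error spam at the source.
func nextSteps(hostStatus string) []step {
	if hostStatus == "error" {
		return nil
	}
	return []step{{name: "inventory"}, {name: "ntp-synchronizer"}}
}

func main() {
	fmt.Println(nextSteps("known")) // [{inventory} {ntp-synchronizer}]
	fmt.Println(nextSteps("error")) // []
}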

Comment 10 Ronnie Lazar 2021-06-29 15:44:07 UTC
*** Bug 1977292 has been marked as a duplicate of this bug. ***

Comment 11 Yuri Obshansky 2021-07-20 16:10:23 UTC
Unfortunately, it is reproduced again with another machine:
time="2021-06-21T13:15:59Z" level=error msg="Received step reply <execute-a39ece51> from cluster <ebef33a6-fed7-451c-a135-53e5c0c89b10> host <dc38e705-4a24-2b8a-dded-a3c33f3b7373> exit-code <256> stderr <cannot access the media (ISO) - media was likely disconnected: open /dev/sr0: no medium found> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:3123" cluster_id=ebef33a6-fed7-451c-a135-53e5c0c89b10 go-id=1264934 host_id=dc38e705-4a24-2b8a-dded-a3c33f3b7373 pkg=Inventory request_id=c5513ab6-b501-47db-a03d-08d37e1ae419
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/ebef33a6-fed7-451c-a135-53e5c0c89b10
Cluster deployment failed, but the agent continues to throw errors.

Comment 12 Michael Levy 2021-07-29 15:07:04 UTC
Not sure that it's connected to this issue...
https://coreos.slack.com/archives/CUPJTHQ5P/p1626795956241600
@Vitaliy Emporopulo, can you please elaborate?