Description of problem:

Version-Release number of the following components:
4.2.0-0.nightly-2019-09-02-162943
podman-1.4.2-3.el8.x86_64
cri-o-1.14.10-0.7.dev.rhaos4.2.git7e863dd.el8.x86_64

# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a9022e9f1c7b760efaf608e63dc1f41e5f97113d467daf584ba453bc7cdde368
              CustomOrigin: Image generated via coreos-assembler
                   Version: 42.80.20190809.0 (2019-08-09T13:37:09Z)

How reproducible:
Always

Steps to Reproduce:
1. Mirror the release payload image to the internal registry
2. Trigger a disconnected install
3.

Actual results:
The bootkube service on the bootstrap node never started.

# journalctl -f -u bootkube
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
^C

Expected results:
The disconnected install should complete successfully.

Additional info:
# systemctl cat bootkube
# /etc/systemd/system/bootkube.service
[Unit]
Description=Bootstrap a Kubernetes cluster
Requires=release-image.service
Wants=kubelet.service
After=kubelet.service release-image.service
ConditionPathExists=!/opt/openshift/.bootkube.done

[Service]
WorkingDirectory=/opt/openshift
ExecStart=/usr/local/bin/bootkube.sh
Restart=on-failure
RestartSec=5s

From the bootkube unit, it depends on release-image.service and kubelet.service. From the kubelet service log, it seems the crio service was never started.

# journalctl -f -u kubelet.service
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151073    2627 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/crio/crio.sock 0  <nil>}]
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151128    2627 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: W0904 01:34:43.151162    2627 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/run/crio/crio.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory". Reconnecting...
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151180    2627 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000260090, TRANSIENT_FAILURE
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: E0904 01:34:43.151199    2627 remote_runtime.go:85] Version from runtime service failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: E0904 01:34:43.151291    2627 kuberuntime_manager.go:196] Get runtime version failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: F0904 01:34:43.151313    2627 server.go:267] failed to run Kubelet: failed to create kubelet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Failed to start Kubernetes Kubelet

As I understand it, the crio service should be started by the bootkube service, but the bootkube service never starts. Continuing with release-image.service:

# journalctl -f -u release-image.service
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 release-image-download.sh[1482]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
Sep 04 01:22:20 qe-jialiu-osp-9mr28-bootstrap-0 podman[1490]: 2019-09-04 01:22:20.006542259 +0000 UTC m=+0.385063491 system refresh
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 release-image-download.sh[1052]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...

From the log, the service is stuck pulling the image. Checking the container config file, the mirror registry configuration did land there:

# cat /etc/containers/registries.conf
[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  insecure = false
  mirror-by-digest-only = true

  [[registry.mirror]]
    location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
    insecure = false

[[registry]]
  location = "registry.svc.ci.openshift.org/ocp/release"
  insecure = false
  mirror-by-digest-only = true

  [[registry.mirror]]
    location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
    insecure = false

It seems the mirror setting never took effect for podman.

If I remember correctly, release-image.service was introduced recently; I am not sure which PR introduced it. Before it was introduced, disconnected installs completed successfully. So this is a regression, and it totally blocks disconnected install testing.

I also tried switching to a newer RHCOS, 42.80.20190828.2, and hit the same issue.
> Continue checking release-image.service.
> # journalctl -f -u release-image.service
> -- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
> Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> Download the OpenShift Release Image...
> Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0
> release-image-download.sh[1482]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
> Sep 04 01:22:20 qe-jialiu-osp-9mr28-bootstrap-0 podman[1490]: 2019-09-04
> 01:22:20.006542259 +0000 UTC m=+0.385063491 system refresh
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]:
> release-image.service: Main process exited, code=killed, status=15/TERM
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]:
> release-image.service: Failed with result 'signal'.
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Stopped Download
> the OpenShift Release Image.
> -- Reboot --
> Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> Download the OpenShift Release Image...
> Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0
> release-image-download.sh[1052]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...

The release-image.service seems to be stuck trying to pull the image, and based on https://bugzilla.redhat.com/attachment.cgi?id=1611301 (where the release image cannot be pulled) and https://bugzilla.redhat.com/show_bug.cgi?id=1748635, the service is correctly set up to retry pulling images.

> From the log, the service is pending on pulling image. Check container
> config file, the mirror registry configuration is landed there.
> # cat /etc/containers/registries.conf
> [[registry]]
>   location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
>   insecure = false
>   mirror-by-digest-only = true
>
>   [[registry.mirror]]
>     location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
>     insecure = false
>
> [[registry]]
>   location = "registry.svc.ci.openshift.org/ocp/release"
>   insecure = false
>   mirror-by-digest-only = true
>
>   [[registry.mirror]]
>     location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
>     insecure = false
>
> Seem like the mirror setting never take effected for podman.

This makes me lean towards a podman bug, as the setup of the bootstrap host seems to be working as expected.

> If I remember right, release-image.service is something newly introduced
> recently, I am not sure which PR introduce it. Before it is introduced,
> disconnected install could be completed successfully. So this is regression
> bug, and this toally is blocking disconnected install testing.
>
> I also tried to switch to newer rhcos - 42.80.20190828.2, also hit the same
> issue.
Also, attaching the `gather bootstrap` output would be very helpful in debugging the issue.

And since this is UPI, please provide the version details of the bootstrap-host binaries:
- `cat /etc/os-release`
- versions of podman, crictl, and crio
(In reply to Abhinav Dahiya from comment #2)
> Also attaching the `gather bootstrap` would be very helpful in debugging the
> issue.

Because no crio service is running on the bootstrap node, there are no containers, and `gather bootstrap` seems to get stuck.

$ ./openshift-install gather bootstrap --dir install-dir --bootstrap 10.0.148.102 --master 10.0.151.213 --key ~/libra-new.pem --log-level debug
DEBUG OpenShift Installer v4.2.0-201908282219-dirty
DEBUG Built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b
INFO Pulling debug logs from the bootstrap machine
DEBUG Gathering bootstrap journals ...
DEBUG Gathering bootstrap containers ...
DEBUG time="2019-09-04T05:33:26Z" level=fatal msg="failed to connect: failed to connect: context deadline exceeded"
DEBUG cp: cannot stat '/var/log/bootstrap-control-plane/': No such file or directory
^C

Then I logged into the bootstrap node, gathered the journal logs manually, and attached them.

> And since this is UPI, please provide the version details of the
> bootstrap-host binaries..
> - `cat /etc/os-release`

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190809.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190809.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20190809.0

> - and versions of the podman, crictl, criod

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# podman version
Version:            1.4.2-stable1
RemoteAPI Version:  1
Go Version:         go1.12.6
OS/Arch:            linux/amd64

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# crictl version
NAME:
   crictl version - Display runtime version information

USAGE:
   crictl version [arguments...]
FATA[0002] failed to connect: failed to connect: context deadline exceeded

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# rpm -qa|grep cri
criu-3.10-7.el8.x86_64
cri-tools-1.14.0-1.rhaos4.2.el8.x86_64
subscription-manager-rhsm-certificates-1.23.8-35.el8.x86_64
cri-o-1.14.10-0.7.dev.rhaos4.2.git7e863dd.el8.x86_64
Created attachment 1611354 [details]
bootstrap journals log
I think this whole thing is due to cri-o not starting on the bootstrap node, IIUC. However, the crio logs in the latest attachment are empty.
Are _full_ logs from that machine available (full contents of the journal, or ideally all of /var/log)? Logs from the installation? I can’t find any indication of a problem in the above archive.

(In reply to Abhinav Dahiya from comment #1)
> > Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> > Download the OpenShift Release Image...
> > Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0
> > release-image-download.sh[1052]: Pulling
> > registry.svc.ci.openshift.org/ocp/release@sha256:
> > 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
>
> The release-image.service seems to be stuck trying to pull the image..

The kubelet logs go up to 5:49. Was that single (podman pull) really stuck for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive, so there actually was a connection.

Is it possible that the pull has actually succeeded (release-image-download.sh does not log on success)? Or podman is really hung on some kind of lock, but that’s not going to be possible to diagnose from this. A backtrace would almost certainly be necessary, and a `podman --log-level=debug` log from the stuck command would be helpful.
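On the backtrace point: podman is a Go binary, so a goroutine dump can usually be forced out of a hung process with no special tooling. The Go runtime's default behaviour on SIGQUIT is to print all goroutine stacks to stderr and exit (this will not apply if the program installs its own SIGQUIT handler). A hedged sketch; `dump_goroutines` is a hypothetical helper, not an existing command:

```shell
#!/bin/sh
# Hypothetical helper: send SIGQUIT to the first process whose command
# line matches a pattern, so the Go runtime dumps all goroutine stacks.
# The dump lands wherever the process's stderr goes -- the journal, for
# a pull started by release-image.service.
dump_goroutines() {
    pattern=$1
    pid=$(pgrep -f "$pattern" | head -n1 || true)
    if [ -n "$pid" ]; then
        kill -QUIT "$pid" && echo "sent SIGQUIT to $pid"
    else
        echo "no process matching: $pattern"
    fi
}

# Usage on the affected host: dump_goroutines 'podman pull'
```

The resulting stack trace would show exactly which lock or syscall the hung pull is blocked on.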
Additional needinfo:

Did this just start failing? If so, what was the last known good release?

Also, crio can fail to start if there is no default route (the internet does not have to be reachable). Are the nodes configured with default routes?

I am trying to recreate.
I was not able to recreate on my disconnected cluster.

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190904.1"

$ ogcv
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-09-04-140125   True        False         1s      Cluster version is 4.2.0-0.ci-2019-09-04-140125
> The kubelet logs go up to 5:49. Was that single (podman pull) really stuck
> for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive,
> so there actually was a connection.
>
> Is it possible that the pull has actually succeeded
> (release-image-download.sh does not log on success)? Or podman is really
> hung on some kind of lock, but that’s not going to be possible to diagnose
> from this. A backtrace would almost certainly be necessary, and a `podman
> --log-level=debug` log from the stuck command would be helpful.

The bootstrap host has no internet connectivity, and the log is very clear: release-image.service got into a restart loop because of image pull failures.

> Are _full_ logs from that machine available (full contents of the journal,
> or ideally all of /var/log)? Logs from the installation? I can’t find any
> indication of a problem in the above archive.

I already explained in comment 3 why the logs may not be _full_ logs: many services never started. The very first service, bootkube, is stuck, so I do not expect much useful information in the logs.

> Did this just start failing? If so, what was the last known good release?

At least as of last week, I did not see such a failure. I would have to try old builds one by one to tell you which is the last known good build.

> Also, crio can fail to start if there is no default route (the internet
> does not have to be reachable). Are the nodes configured with default
> routes?

Is that really relevant to the crio service? Which service is supposed to trigger crio startup? At least from the `systemctl list-unit-files` output, crio is not a service enabled at boot. Now the core issue is that podman fails to pull the image from the mirror registry; bootkube.sh has not even reached the point of starting the crio service yet.

> I was not able to recreate on my disconnected cluster.
Today I re-ran two disconnected installs: one UPI-on-vSphere, the other UPI-on-baremetal (running on OpenStack). The UPI-on-baremetal disconnected install succeeded, while the UPI-on-vSphere one failed (tried twice, both failed).

I dug a bit more on the bootstrap node and found some useful information for this issue.

[root@bootstrap-0 ~]# journalctl -f -u release-image.service
-- Logs begin at Thu 2019-09-05 04:24:11 UTC. --
Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376 +0000 UTC m=+0.185638572 system refresh
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 04:24:33 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 04:26:41 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:26:41 bootstrap-0 release-image-download.sh[1079]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
^C

[root@bootstrap-0 ~]# stat /etc/containers/registries.conf
  File: /etc/containers/registries.conf
  Size: 461        Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d Inode: 11762523    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:etc_t:s0
Access: 2019-09-05 04:24:28.616203710 +0000
Modify: 2019-09-05 04:24:25.706724218 +0000
Change: 2019-09-05 04:24:27.746724121 +0000
 Birth: -

[root@bootstrap-0 ~]# podman images --log-level=debug
DEBU[0000] Initializing boltdb state at /var/lib/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /var/lib/containers/storage
DEBU[0000] Using run root /var/run/containers/storage
DEBU[0000] Using static dir /var/lib/containers/storage/libpod
DEBU[0000] Using tmp dir /var/run/libpod
DEBU[0000] Using volume path /var/lib/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
^C (hangs here; I had to terminate it with Ctrl+C)

Comparing the timestamps of /etc/containers/registries.conf and the release-image.service startup: release-image started pulling the image at "04:26:41", while the last status change of /etc/containers/registries.conf happened at "04:24:27", which means the image pull happened before the mirror registry configuration landed on the system. So the pull being stuck there is somewhat reasonable behaviour.

Then I restarted release-image.service:

[root@bootstrap-0 ~]# systemctl restart release-image.service
[root@bootstrap-0 ~]# journalctl -u release-image.service
-- Logs begin at Thu 2019-09-05 04:24:11 UTC, end at Thu 2019-09-05 05:14:40 UTC. --
Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376 +0000 UTC m=+0.185638572 system refresh
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 04:24:33 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 04:26:41 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:26:41 bootstrap-0 release-image-download.sh[1079]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:14:03 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 05:14:03 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 05:14:03 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
Sep 05 05:14:03 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:14:03 bootstrap-0 release-image-download.sh[6150]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:14:03 bootstrap-0 podman[6151]: 2019-09-05 05:14:03.941044057 +0000 UTC m=+0.121074679 system refresh
Sep 05 05:14:11 bootstrap-0 podman[6151]: 2019-09-05 05:14:11.469523077 +0000 UTC m=+7.649553742 image pull
Sep 05 05:14:11 bootstrap-0 release-image-download.sh[6150]: 462c5de6c77e1445a53906d58032f3d38f0c9229db3fd722ea8433d127ac2983
Sep 05 05:14:11 bootstrap-0 systemd[1]: Started Download the OpenShift Release Image.

This time it succeeded, so my assumption (that the image pull started before registries.conf landed on the system) seems right.

There is a race condition between the release-image service and the creation of registries.conf. So the fix should ensure that the mirror registry configuration in /etc/containers/registries.conf is complete before release-image.service starts.
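One hypothetical way to express that ordering constraint, sketched as a systemd drop-in for release-image.service (the drop-in path and the pre-start wait are assumptions for illustration, not the actual fix shipped by the installer):

```ini
# /etc/systemd/system/release-image.service.d/10-wait-registries.conf
# Hypothetical drop-in: block the download until the mirror configuration
# written during provisioning is present and non-empty on disk.
[Service]
ExecStartPre=/bin/sh -c 'until test -s /etc/containers/registries.conf; do sleep 1; done'
```

With something like this in place, the `podman pull` in release-image-download.sh would only ever run after registries.conf exists, closing the suspected window.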
Some more hints after one more failure of the disconnected UPI-on-vSphere install.

[root@bootstrap-0 ~]# journalctl -u release-image.service
-- Logs begin at Thu 2019-09-05 05:45:28 UTC, end at Thu 2019-09-05 07:19:39 UTC. --
Sep 05 05:45:45 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:45:45 bootstrap-0 release-image-download.sh[1524]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:45:45 bootstrap-0 podman[1527]: 2019-09-05 05:45:45.724904114 +0000 UTC m=+0.215514181 system refresh
Sep 05 05:45:50 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 05:45:50 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 05:45:50 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 05:47:58 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:47:58 bootstrap-0 release-image-download.sh[1062]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>

[root@bootstrap-0 ~]# podman --log-level=debug pull registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0
DEBU[0000] Initializing boltdb state at /var/lib/containers/storage/libpod/bolt_state.db
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /var/lib/containers/storage
DEBU[0000] Using run root /var/run/containers/storage
DEBU[0000] Using static dir /var/lib/containers/storage/libpod
DEBU[0000] Using tmp dir /var/run/libpod
DEBU[0000] Using volume path /var/lib/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""

It gets stuck here forever.

I checked /usr/local/bin/release-image-download.sh:

RELEASE_IMAGE=registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0
echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
    echo "Pull failed. Retrying $RELEASE_IMAGE..."
done

Obviously the script is stuck at `while ! podman pull --quiet "$RELEASE_IMAGE"`: podman pull never exits, so the whole script hangs there. So besides my proposed fix in comment 9, another potential fix is to enhance podman with some timeout setting, so that it exits once the timeout limit is reached rather than waiting forever.
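A minimal sketch of that second option in shell, assuming GNU coreutils `timeout` is available on the host; the function name and the retry bound are hypothetical additions, not part of the actual release-image-download.sh:

```shell
#!/bin/sh
# Hypothetical hardened retry loop: bound each pull attempt so a hung
# `podman pull` cannot block the loop forever. `timeout` kills the
# command and exits 124 when the limit is reached.
pull_with_timeout() {
    image=$1        # image reference to pull
    limit=${2:-5m}  # per-attempt time limit
    tries=${3:-0}   # give up after N attempts (0 = retry forever)
    n=0
    until timeout "$limit" podman pull --quiet "$image"; do
        echo "Pull failed or timed out. Retrying $image..." >&2
        n=$((n + 1))
        if [ "$tries" -gt 0 ] && [ "$n" -ge "$tries" ]; then
            return 1
        fi
        sleep 5
    done
}
```

release-image-download.sh could then call `pull_with_timeout "$RELEASE_IMAGE" 5m` instead of the bare `podman pull` loop, so a wedged pull process is killed and retried instead of hanging the service.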
> Compare the timestamps of /etc/containers/registries.conf and
> release-image.service bootup, release-image is starting to pull image at
> "04:26:41", while last status change of /etc/container/registries.conf
> happen at "04:24:27", that means pulling image happened prior to mirror
> registry configuration landed on the system. So pulling image is pending
> there is some reasonable behaviour.

I'm confused here. 04:24:27 is before 04:26:41. Thus the registries.conf file is in place before the pull starts, as expected. The registries.conf file is laid down by Ignition during the initrd.

It is strange that this only happens on UPI for vSphere. There shouldn't be any difference between platforms at this stage, AFAIK.
(In reply to Johnny Liu from comment #9)
> > The kubelet logs go up to 5:49. Was that single (podman pull) really stuck
> > for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive,
> > so there actually was a connection.
> >
> > Is it possible that the pull has actually succeeded
> > (release-image-download.sh does not log on success)? Or podman is really
> > hung on some kind of lock, but that’s not going to be possible to diagnose
> > from this. A backtrace would almost certainly be necessary, and a `podman
> > --log-level=debug` log from the stuck command would be helpful.
>
> The bootstrap have no any internet connectivity, and the log is very clear,
> release-image.service got into restart loop because of pulling images
> failure.

Is it a restart loop or a four-hour wait, then? If it is a restart loop, why don’t the logs in the attached archive show any restarts?

> > Are _full_ logs from that machine available (full contents of the journal,
> > or ideally all of /var/log)? Logs from the installation? I can’t find any
> > indication of a problem in the above archive.
>
> I already gave the reason in comment 3 why the log maybe not _full_ logs,
> because many services are never started.

Well, the logs as they are now don’t give us anything to go on, and there clearly is _more_ available: comment 0 already quotes the output of `journalctl -f -u release-image.service`, which contains more than the three lines available in tmp/artifacts/bootstrap/journals/release-image.logs in the archive.

I was not asking for the output of any OpenShift-specific tools, just the raw logs from /var/log and/or the journal. Plain command-line access should be sufficient for that, even without OpenShift succeeding at start-up.
Of course I don’t know that there really is useful information somewhere in there, but if we can’t reproduce this ourselves and no information is available from the system where it does reproduce, I don’t see many paths forward other than blind guessing.

---

More mysteries:

> [root@bootstrap-0 ~]# journalctl -f -u release-image.service
> -- Logs begin at Thu 2019-09-05 04:24:11 UTC. --
> Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift
> Release Image...
> Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
> Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376
> +0000 UTC m=+0.185638572 system refresh

What is this log line (and the "image pull" line), and where is it coming from? I can’t find it in the podman source code.

> Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process
> exited, code=killed, status=15/TERM

What is terminating podman after a mere five seconds, and why? And why does the "Sep 05 04:26:41" instance take over 47 minutes?
(In reply to Johnny Liu from comment #9)
> [root@bootstrap-0 ~]# podman images --log-level=debug
> DEBU[0000] Initializing boltdb state at
> /var/lib/containers/storage/libpod/bolt_state.db
> DEBU[0000] Using graph driver overlay
> DEBU[0000] Using graph root /var/lib/containers/storage
> DEBU[0000] Using run root /var/run/containers/storage
> DEBU[0000] Using static dir /var/lib/containers/storage/libpod
> DEBU[0000] Using tmp dir /var/run/libpod
> DEBU[0000] Using volume path /var/lib/containers/storage/volumes
> DEBU[0000] Set libpod namespace to ""
> ^C (pending here, have to terminate it by ctl + c)

Oh well, let’s give blind guessing a chance, anyway. Matt, does the above ring any bells?
Hm. It looks like the Libpod runtime comes up successfully, and it hangs sometime thereafter. Unfortunately, there isn't much in the way of debug logging in `images`, so it's hard to say where exactly that hang is.

I have a vague suspicion that this could be related to retrieving images from c/storage and the locking thereof. You're running CRI-O simultaneously, correct? That does give us a potential recipe for a deadlock on the storage locks.

Can you get an `lslocks` when the hang happens, to see if Podman and/or CRI-O are holding locks at the point it happens?
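For reference, one way to capture the requested lock state while the pull is hung (assuming util-linux `lslocks` is on the host; the grep filter is just a convenience, not part of any existing tooling):

```shell
#!/bin/sh
# Snapshot which processes currently hold file locks; filtering on
# podman/CRI-O/containers-storage paths highlights any contention on the
# shared storage locks. `|| true` because "no match" is a valid result.
lslocks --noheadings --output PID,COMMAND,TYPE,MODE,PATH \
    | grep -E 'podman|crio|containers' || true
```

Running this both before and during the hang would show whether a lock appears (and which process holds it) exactly when podman stops making progress.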
The system refresh event log indicates that Podman has successfully completed a state refresh - so, from the point of view of Libpod, everything is running happily just before this freeze happens. We have no locks on images, though we could be looking at an infinite loop of some sort in the image processing code. So I'm leaning towards either a storage lock or a bug in the Podman image code.
(In reply to Seth Jennings from comment #11)
> > Compare the timestamps of /etc/containers/registries.conf and
> > release-image.service bootup, release-image is starting to pull image at
> > "04:26:41", while last status change of /etc/container/registries.conf
> > happen at "04:24:27", that means pulling image happened prior to mirror
> > registry configuration landed on the system.
>
> I'm confused here. 04:24:27 is before 04:26:41. Thus the registries.conf
> file is in place before the pull starts, as expected. The registries.conf
> file is laid down by ignition during initrd.
>
> It is strange that this only happens on upi for vshpere. There shouldn't be
> any difference between platforms at this stage afaik.

Sorry, that was a copy/paste error on my part: I meant the first image pull started at "04:24:28", which is really close to the time the registries.conf file was laid down. Then the system was rebooted. (QE is following [1] for our disconnected install; in [2] there is a restart operation.)

Once the system is rebooted, registries.conf is laid down again and overwritten by Ignition during the initrd, and release-image.service also starts at almost the same time, right? I guess there is some race condition between those two operations.

In my initial report this was also reproduced on a baremetal install, but the reproduction ratio is lower than on UPI-on-vSphere; UPI-on-vSphere now reproduces at about 100%. So this has nothing to do with the platform as such, but I guess it is related to how the Ignition config is composed, which may cause a boot-order difference (for example, in this vSphere install there is a restart step in the Ignition flow).

[1]: https://github.com/openshift/installer/tree/master/upi/vsphere
[2]: https://github.com/openshift/installer/blob/master/upi/vsphere/machine/ignition.tf#L43

> Is it a restart loop or a four-hour wait, then? If it is a restart loop, why
> don’t the logs in the attached archive show any restarts?

Sorry for the misleading statement.
It is not a restart loop but a forever wait; I confirmed that in comments 9 and 10.

> there clearly is _more_ available: comment 0 already quotes the output of
> `journalctl -f -u release-image.service`, which contains more than the three
> lines available in tmp/artifacts/bootstrap/journals/release-image.logs in
> the archive.

Sorry, I do not know why the log in tmp/artifacts/bootstrap/journals/release-image.logs is shorter than the output of `journalctl -f -u release-image.service`; it seems journals/release-image.logs only captured the logs from after the system reboot. I did not make any modification to journals/release-image.logs.

> > Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process
> > exited, code=killed, status=15/TERM
>
> What is terminating podman after mere five seconds, and why? And why does
> the "Sep 05 04:26:41" instance take over 47 minutes?

Because the system is rebooted, as defined in the ignition files.

[1]: https://github.com/openshift/installer/tree/master/upi/vsphere
[2]: https://github.com/openshift/installer/blob/master/upi/vsphere/machine/ignition.tf#L43
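Since the race hypothesis hinges on relative ordering, a small helper can print the two timestamps side by side on an affected host. This is an illustrative sketch only (the script, its defaults, and its fallbacks are hypothetical, not part of the installer); the path and unit name are the ones from this report:

```shell
#!/bin/sh
# Hypothetical check: compare a config file's last change time against a
# service's main-process start time. If the service start predates the
# config change, the pull raced the configuration.
cfg=${1:-/etc/containers/registries.conf}
unit=${2:-release-image.service}

if [ -e "$cfg" ]; then
    stat -c "config changed:  %z  (%n)" "$cfg"
else
    echo "config missing: $cfg"
fi
systemctl show "$unit" -p ExecMainStartTimestamp 2>/dev/null \
    || echo "unit not found: $unit"
```

Capturing this immediately after the bootstrap node comes up would confirm or refute the suspected ordering without relying on hand-compared journal timestamps.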
Verified this bug with RHCOS 42.80.20190916.1: PASS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922