Bug 1748682 - [disconnected] release-image.service failed in disconnected env which is blocking bootkube service boot up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Miloslav Trmač
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-04 01:47 UTC by Johnny Liu
Modified: 2019-10-16 06:40 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:40:12 UTC
Target Upstream Version:
Embargoed:


Attachments
bootstrap journals log (8.24 MB, application/gzip)
2019-09-04 05:58 UTC, Johnny Liu


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:40:22 UTC

Description Johnny Liu 2019-09-04 01:47:37 UTC
Description of problem:

Version-Release number of the following components:
4.2.0-0.nightly-2019-09-02-162943
podman-1.4.2-3.el8.x86_64
cri-o-1.14.10-0.7.dev.rhaos4.2.git7e863dd.el8.x86_64
# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a9022e9f1c7b760efaf608e63dc1f41e5f97113d467daf584ba453bc7cdde368
              CustomOrigin: Image generated via coreos-assembler
                   Version: 42.80.20190809.0 (2019-08-09T13:37:09Z)


How reproducible:
Always

Steps to Reproduce:
1. Mirror release payload image to internal registry
2. Trigger a disconnected install
3.

Actual results:
Bootkube service on bootstrap node never booted up.
# journalctl -f  -u bootkube
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
^C



Expected results:
Disconnected install should be completed successfully.

Additional info:
# systemctl cat bootkube
# /etc/systemd/system/bootkube.service
[Unit]
Description=Bootstrap a Kubernetes cluster
Requires=release-image.service
Wants=kubelet.service
After=kubelet.service release-image.service
ConditionPathExists=!/opt/openshift/.bootkube.done

[Service]
WorkingDirectory=/opt/openshift
ExecStart=/usr/local/bin/bootkube.sh

Restart=on-failure
RestartSec=5s

From the bootkube service unit, it has dependencies on release-image.service and kubelet.service.
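For reference, the full ordering chain can be inspected directly on the node with standard systemd tooling (generic diagnostics, nothing specific to this bug):
# systemctl list-dependencies bootkube.service
# systemctl status release-image.service kubelet.service crio.service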

From the kubelet service log, it seems the crio service is not started.
# journalctl -f  -u kubelet.service
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151073    2627 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/crio/crio.sock 0  <nil>}]
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151128    2627 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: W0904 01:34:43.151162    2627 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/run/crio/crio.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory". Reconnecting...
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: I0904 01:34:43.151180    2627 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000260090, TRANSIENT_FAILURE
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: E0904 01:34:43.151199    2627 remote_runtime.go:85] Version from runtime service failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: E0904 01:34:43.151291    2627 kuberuntime_manager.go:196] Get runtime version failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 hyperkube[2627]: F0904 01:34:43.151313    2627 server.go:267] failed to run Kubelet: failed to create kubelet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Sep 04 01:34:43 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Failed to start Kubernetes Kubelet

Per my understanding, the crio service should be started by the bootkube service, but unfortunately the bootkube service never starts.

Continuing to check release-image.service:
# journalctl -f  -u release-image.service 
-- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 release-image-download.sh[1482]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
Sep 04 01:22:20 qe-jialiu-osp-9mr28-bootstrap-0 podman[1490]: 2019-09-04 01:22:20.006542259 +0000 UTC m=+0.385063491 system refresh
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 release-image-download.sh[1052]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...

From the log, the service is stuck pulling the image. Checking the container config file, the mirror registry configuration has landed there.
# cat /etc/containers/registries.conf 
[[registry]]
location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
insecure = false
mirror-by-digest-only = true

[[registry.mirror]]
location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
insecure = false


[[registry]]
location = "registry.svc.ci.openshift.org/ocp/release"
insecure = false
mirror-by-digest-only = true

[[registry.mirror]]
location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
insecure = false

It seems the mirror setting never takes effect for podman.
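If skopeo is available on the host (it normally is on RHCOS), mirror resolution can be checked independently of podman's local storage, since skopeo reads the same /etc/containers/registries.conf; with a working mirror configuration this should be answered by the internal registry even though the upstream registry is unreachable:
# skopeo inspect docker://registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0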

If I remember right, release-image.service was introduced recently; I am not sure which PR introduced it. Before it was introduced, disconnected installs completed successfully. So this is a regression bug, and it is totally blocking disconnected install testing.

I also tried switching to a newer RHCOS, 42.80.20190828.2, and hit the same issue.

Comment 1 Abhinav Dahiya 2019-09-04 02:16:36 UTC
> Continue checking release-image.service.
> # journalctl -f  -u release-image.service 
> -- Logs begin at Wed 2019-09-04 01:22:06 UTC. --
> Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> Download the OpenShift Release Image...
> Sep 04 01:22:19 qe-jialiu-osp-9mr28-bootstrap-0
> release-image-download.sh[1482]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
> Sep 04 01:22:20 qe-jialiu-osp-9mr28-bootstrap-0 podman[1490]: 2019-09-04
> 01:22:20.006542259 +0000 UTC m=+0.385063491 system refresh
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]:
> release-image.service: Main process exited, code=killed, status=15/TERM
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]:
> release-image.service: Failed with result 'signal'.
> Sep 04 01:22:24 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Stopped Download
> the OpenShift Release Image.
> -- Reboot --
> Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> Download the OpenShift Release Image...
> Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0
> release-image-download.sh[1052]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...


The release-image.service seems to be stuck trying to pull the image.

And based on https://bugzilla.redhat.com/attachment.cgi?id=1611301, where the release image cannot be pulled (https://bugzilla.redhat.com/show_bug.cgi?id=1748635), the service is correctly set up to retry pulling images.

> From the log, the service is pending on pulling image. Check container
> config file, the mirror registry configuration is landed there.
> # cat /etc/containers/registries.conf 
> [[registry]]
> location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
> insecure = false
> mirror-by-digest-only = true
> 
> [[registry.mirror]]
> location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
> insecure = false
> 
> 
> [[registry]]
> location = "registry.svc.ci.openshift.org/ocp/release"
> insecure = false
> mirror-by-digest-only = true
> 
> [[registry.mirror]]
> location = "internal-registry.qe.devcluster.openshift.com:5000/ocp/release"
> insecure = false
> 
> Seem like the mirror setting never take effected for podman.

This makes me lean towards a podman bug, as the setup of the bootstrap host seems to be working as expected.

> If I remember right, release-image.service is something newly introduced
> recently, I am not sure which PR introduce it. Before it is introduced,
> disconnected install could be completed successfully. So this is regression
> bug, and this toally is blocking disconnected install testing.
> 
> I also tried to switch to newer rhcos - 42.80.20190828.2, also hit the same
> issue.

Comment 2 Abhinav Dahiya 2019-09-04 02:19:03 UTC
Also, attaching the `gather bootstrap` output would be very helpful in debugging the issue.
And since this is UPI, please provide the version details of the bootstrap host binaries:
- `cat /etc/os-release`
- versions of podman, crictl, and cri-o

Comment 3 Johnny Liu 2019-09-04 05:57:57 UTC
(In reply to Abhinav Dahiya from comment #2)
> Also attaching the `gather bootstrap` would be very helpful in debugging the
> issue. 
Because no crio service is running on the bootstrap node and there are no containers at all, `gather bootstrap` seems to get stuck.
$ ./openshift-install gather  bootstrap  --dir install-dir  --bootstrap 10.0.148.102 --master 10.0.151.213 --key ~/libra-new.pem  --log-level debug
DEBUG OpenShift Installer v4.2.0-201908282219-dirty 
DEBUG Built from commit 4f3e73a0143ba36229f42e8b65b6e65342bb826b 
INFO Pulling debug logs from the bootstrap machine 
DEBUG Gathering bootstrap journals ...             
DEBUG Gathering bootstrap containers ...           
DEBUG time="2019-09-04T05:33:26Z" level=fatal msg="failed to connect: failed to connect: context deadline exceeded" 
DEBUG cp: cannot stat '/var/log/bootstrap-control-plane/': No such file or directory 
^C

Then I logged into the bootstrap node, manually gathered the journal logs, and attached them.
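Something along these lines is enough to grab the per-unit journals by hand (the unit list here is just an example):
# for u in release-image bootkube kubelet crio; do journalctl --no-pager -u "${u}.service" > "/tmp/${u}.log"; done
# tar czf /tmp/bootstrap-journals.tar.gz -C /tmp release-image.log bootkube.log kubelet.log crio.log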

> And since this is UPI, please provide the version details of the
> bootstrap-host binaries..
> - `cat /etc/os-release`
[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190809.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190809.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20190809.0

> -  and versions of the podman, crictl, criod

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# podman version
Version:            1.4.2-stable1
RemoteAPI Version:  1
Go Version:         go1.12.6
OS/Arch:            linux/amd64

[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# crictl version
NAME:
   crictl version - Display runtime version information

USAGE:
   crictl version [arguments...]
FATA[0002] failed to connect: failed to connect: context deadline exceeded
[root@qe-jialiu-osp-9mr28-bootstrap-0 ~]# rpm -qa|grep cri
criu-3.10-7.el8.x86_64
cri-tools-1.14.0-1.rhaos4.2.el8.x86_64
subscription-manager-rhsm-certificates-1.23.8-35.el8.x86_64
cri-o-1.14.10-0.7.dev.rhaos4.2.git7e863dd.el8.x86_64

Comment 4 Johnny Liu 2019-09-04 05:58:52 UTC
Created attachment 1611354 [details]
bootstrap journals log

Comment 5 Seth Jennings 2019-09-04 16:06:56 UTC
I think this whole thing is due to cri-o not starting on the bootstrap node, IIUC. However, the crio logs in the latest attachment are empty.

Comment 6 Miloslav Trmač 2019-09-04 16:25:42 UTC
Are _full_ logs from that machine available (full contents of the journal, or ideally all of /var/log)? Logs from the installation? I can’t find any indication of a problem in the above archive.

(In reply to Abhinav Dahiya from comment #1)

> > Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0 systemd[1]: Starting
> > Download the OpenShift Release Image...
> > Sep 04 01:23:25 qe-jialiu-osp-9mr28-bootstrap-0
> > release-image-download.sh[1052]: Pulling
> > registry.svc.ci.openshift.org/ocp/release@sha256:
> > 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
> 
> The release-image.service seems to be stuck trying to pull the image..

The kubelet logs go up to 5:49. Was that single (podman pull) really stuck for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive, so there actually was a connection.

Is it possible that the pull has actually succeeded (release-image-download.sh does not log on success)? Or podman is really hung on some kind of lock, but that’s not going to be possible to diagnose from this. A backtrace would almost certainly be necessary, and a `podman --log-level=debug` log from the stuck command would be helpful.
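For reference, since podman is a Go binary, a goroutine dump can be captured from the stuck process with SIGQUIT (assuming default Go signal handling, i.e. GOTRACEBACK is not overridden); the stacks are written to the process's stderr, so when the pull was started by the service they should show up in the journal:
# pgrep -af 'podman pull'
# kill -QUIT <pid-of-stuck-podman>
After that, rerunning the pull by hand with `podman --log-level=debug pull <image>` would give the debug log.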

Comment 7 Seth Jennings 2019-09-04 16:29:47 UTC
additional needinfo

Did this just start failing? If so, what was the last known good release?

Also, crio can fail to start if there is no default route (the internet does not have to be reachable).  Are the nodes configured with default routes?

I am trying to recreate.

Comment 8 Seth Jennings 2019-09-04 18:24:38 UTC
I was not able to recreate on my disconnected cluster.

# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190904.1"

$ ogcv
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.ci-2019-09-04-140125   True        False         1s      Cluster version is 4.2.0-0.ci-2019-09-04-140125

Comment 9 Johnny Liu 2019-09-05 05:51:35 UTC
> The kubelet logs go up to 5:49. Was that single (podman pull) really stuck for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive, so there actually was a connection.
> 
> Is it possible that the pull has actually succeeded (release-image-download.sh does not log on success)? Or podman is really hung on some kind of lock, but that’s not going to be possible to diagnose from this. A backtrace would almost certainly be necessary, and a `podman --log-level=debug` log from the stuck command would be helpful.
The bootstrap node has no internet connectivity at all, and the log is very clear: release-image.service got into a restart loop because of the image pull failure.

> Are _full_ logs from that machine available (full contents of the journal, or ideally all of /var/log)? Logs from the installation? I can’t find any indication of a problem in the above archive.
I already gave the reason in comment 3 why the logs may not be _full_ logs: many services are never started. The very first service, bootkube, gets stuck, so I do not expect there to be much useful info in the logs.

> Did this just start failing? If so, what was the last known good release?
At least as of last week, I did not see such a failure. I would have to try old builds one by one to tell you which was the last known good build.

> Also, crio can fail to start if there is no default route (the internet does not have to be reachable).  Are the nodes configured with default routes?
Does this really matter for the crio service? Which service would trigger crio startup? At least from the `systemctl list-unit-files` output, the crio service is not enabled to start at boot.
Now the core issue is that podman is failing to pull the image from the mirror registry; we have not even reached the point where the bootkube.sh script starts the crio service.


> I was not able to recreate on my disconnected cluster.
Today I re-ran two disconnected installs, one UPI-on-vSphere and one UPI-on-baremetal (which is running on OpenStack).
The disconnected install on UPI-on-baremetal succeeded, while the disconnected install on UPI-on-vSphere failed (tried twice, both failed).

I dug a bit more on the bootstrap node and found some useful info for this issue.
[root@bootstrap-0 ~]# journalctl -f -u release-image.service 
-- Logs begin at Thu 2019-09-05 04:24:11 UTC. --
Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376 +0000 UTC m=+0.185638572 system refresh
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 04:24:33 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 04:26:41 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:26:41 bootstrap-0 release-image-download.sh[1079]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
^C
[root@bootstrap-0 ~]# stat /etc/containers/registries.conf
  File: /etc/containers/registries.conf
  Size: 461       	Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d	Inode: 11762523    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:etc_t:s0
Access: 2019-09-05 04:24:28.616203710 +0000
Modify: 2019-09-05 04:24:25.706724218 +0000
Change: 2019-09-05 04:24:27.746724121 +0000
 Birth: -
[root@bootstrap-0 ~]# podman images --log-level=debug
DEBU[0000] Initializing boltdb state at /var/lib/containers/storage/libpod/bolt_state.db 
DEBU[0000] Using graph driver overlay                   
DEBU[0000] Using graph root /var/lib/containers/storage 
DEBU[0000] Using run root /var/run/containers/storage   
DEBU[0000] Using static dir /var/lib/containers/storage/libpod 
DEBU[0000] Using tmp dir /var/run/libpod                
DEBU[0000] Using volume path /var/lib/containers/storage/volumes 
DEBU[0000] Set libpod namespace to ""                   
^C (hangs here; had to terminate it with Ctrl+C)


Comparing the timestamps of /etc/containers/registries.conf and the release-image.service startup: release-image starts pulling the image at "04:26:41", while the last status change of /etc/containers/registries.conf happened at "04:24:27", which means the image pull happened before the mirror registry configuration landed on the system. So the pull hanging there is somewhat reasonable behaviour.

Then I restarted release-image.service.
[root@bootstrap-0 ~]# systemctl restart release-image.service 
[root@bootstrap-0 ~]# journalctl -u release-image.service
-- Logs begin at Thu 2019-09-05 04:24:11 UTC, end at Thu 2019-09-05 05:14:40 UTC. --
Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376 +0000 UTC m=+0.185638572 system refresh
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 04:24:33 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 04:26:41 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 04:26:41 bootstrap-0 release-image-download.sh[1079]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:14:03 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 05:14:03 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 05:14:03 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
Sep 05 05:14:03 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:14:03 bootstrap-0 release-image-download.sh[6150]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:14:03 bootstrap-0 podman[6151]: 2019-09-05 05:14:03.941044057 +0000 UTC m=+0.121074679 system refresh
Sep 05 05:14:11 bootstrap-0 podman[6151]: 2019-09-05 05:14:11.469523077 +0000 UTC m=+7.649553742 image pull  
Sep 05 05:14:11 bootstrap-0 release-image-download.sh[6150]: 462c5de6c77e1445a53906d58032f3d38f0c9229db3fd722ea8433d127ac2983
Sep 05 05:14:11 bootstrap-0 systemd[1]: Started Download the OpenShift Release Image.

This time it succeeded, so my assumption (the image pull starts before registries.conf lands on the system) seems right. There is some race condition between the release-image service and the creation of registries.conf. So the fix should be to ensure the mirror registry configuration in /etc/containers/registries.conf is complete before release-image.service starts.
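Just to illustrate the shape of that proposal (a sketch only; the drop-in name and grep pattern are made up, and the eventual fix may order the unit completely differently), a drop-in such as /etc/systemd/system/release-image.service.d/10-wait-for-mirrors.conf could hold the pull until the mirror entries are present:

[Service]
# do not start pulling until the mirror configuration has landed
ExecStartPre=/bin/bash -c 'until grep -q "registry.mirror" /etc/containers/registries.conf; do sleep 1; done'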

Comment 10 Johnny Liu 2019-09-05 07:30:25 UTC
Some more hints after one more failure of a disconnected UPI-on-vSphere install.


[root@bootstrap-0 ~]# journalctl -u release-image.service 
-- Logs begin at Thu 2019-09-05 05:45:28 UTC, end at Thu 2019-09-05 07:19:39 UTC. --
Sep 05 05:45:45 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:45:45 bootstrap-0 release-image-download.sh[1524]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>
Sep 05 05:45:45 bootstrap-0 podman[1527]: 2019-09-05 05:45:45.724904114 +0000 UTC m=+0.215514181 system refresh
Sep 05 05:45:50 bootstrap-0 systemd[1]: release-image.service: Main process exited, code=killed, status=15/TERM
Sep 05 05:45:50 bootstrap-0 systemd[1]: release-image.service: Failed with result 'signal'.
Sep 05 05:45:50 bootstrap-0 systemd[1]: Stopped Download the OpenShift Release Image.
-- Reboot --
Sep 05 05:47:58 bootstrap-0 systemd[1]: Starting Download the OpenShift Release Image...
Sep 05 05:47:58 bootstrap-0 release-image-download.sh[1062]: Pulling registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7>

[root@bootstrap-0 ~]# podman --log-level=debug  pull  registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0
DEBU[0000] Initializing boltdb state at /var/lib/containers/storage/libpod/bolt_state.db 
DEBU[0000] Using graph driver overlay                   
DEBU[0000] Using graph root /var/lib/containers/storage 
DEBU[0000] Using run root /var/run/containers/storage   
DEBU[0000] Using static dir /var/lib/containers/storage/libpod 
DEBU[0000] Using tmp dir /var/run/libpod                
DEBU[0000] Using volume path /var/lib/containers/storage/volumes 
DEBU[0000] Set libpod namespace to ""    

It gets stuck here forever.

Checked /usr/local/bin/release-image-download.sh:
RELEASE_IMAGE=registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0

echo "Pulling $RELEASE_IMAGE..."
while ! podman pull --quiet "$RELEASE_IMAGE"
do
    echo "Pull failed. Retrying $RELEASE_IMAGE..."
done


Obviously the script gets stuck at `while ! podman pull --quiet "$RELEASE_IMAGE"`: podman pull never exits, so the whole script hangs there. So besides my proposed fix in comment 9, another potential fix is to enhance podman with some timeout setting, so it exits once the timeout limit is reached instead of waiting forever.
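A minimal sketch of that second option, bounding each attempt with coreutils `timeout` so that a hung pull cannot wedge the loop (the 120-second budget is an arbitrary example, not a value taken from any real fix):

RELEASE_IMAGE=registry.svc.ci.openshift.org/ocp/release@sha256:0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0

echo "Pulling $RELEASE_IMAGE..."
# kill an attempt that has not finished within 2 minutes, then retry
while ! timeout 120 podman pull --quiet "$RELEASE_IMAGE"
do
    echo "Pull failed or timed out. Retrying $RELEASE_IMAGE..."
    sleep 5
done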

Comment 11 Seth Jennings 2019-09-05 13:58:40 UTC
> Compare the timestamps of /etc/containers/registries.conf and release-image.service bootup, release-image is starting to pull image at "04:26:41", while last status change of /etc/container/registries.conf happen at "04:24:27", that means pulling image happened prior to mirror registry configuration landed on the system. So pulling image is pending there is some reasonable behaviour.

I'm confused here.  04:24:27 is before 04:26:41.  Thus the registries.conf file is in place before the pull starts, as expected.  The registries.conf file is laid down by ignition during initrd.

It is strange that this only happens on UPI for vSphere. There shouldn't be any difference between platforms at this stage, AFAIK.

Comment 12 Miloslav Trmač 2019-09-05 20:23:48 UTC
(In reply to Johnny Liu from comment #9)
> > The kubelet logs go up to 5:49. Was that single (podman pull) really stuck for FOUR HOURS?! The connection is supposed to have a 30-second keep-alive, so there actually was a connection.
> > 
> > Is it possible that the pull has actually succeeded (release-image-download.sh does not log on success)? Or podman is really hung on some kind of lock, but that’s not going to be possible to diagnose from this. A backtrace would almost certainly be necessary, and a `podman --log-level=debug` log from the stuck command would be helpful.
>
> The bootstrap have no any internet connectivity, and the log is very clear,
> release-image.service got into restart loop because of pulling images
> failure.

Is it a restart loop or a four-hour wait, then? If it is a restart loop, why don’t the logs in the attached archive show any restarts?


> > Are _full_ logs from that machine available (full contents of the journal, or ideally all of /var/log)? Logs from the installation? I can’t find any indication of a problem in the above archive.
>
> I already gave the reason in comment 3 why the log maybe not _full_ logs,
> because many services are never started. The very beginning service -
> bootkube get stuck, I do not expect there are too much useful info in logs.

Well, the logs as they are now don’t give us anything to go on, and there clearly is _more_ available: comment 0 already quotes the output of `journalctl -f -u release-image.service`, which contains more than the three lines available in tmp/artifacts/bootstrap/journals/release-image.logs in the archive. I was not asking for the output of any OpenShift-specific tools, just raw logs from /var/log and/or the journal. Plain command-line access should be sufficient for that, even without OpenShift successfully starting up.

Of course I don’t know that there really is useful information somewhere in there, but if we can’t reproduce this ourselves and no information is available from the system where it does reproduce, I don’t see many paths forward other than blind guessing.


---

More mysteries:

> [root@bootstrap-0 ~]# journalctl -f -u release-image.service 
> -- Logs begin at Thu 2019-09-05 04:24:11 UTC. --
> Sep 05 04:24:28 bootstrap-0 systemd[1]: Starting Download the OpenShift
> Release Image...
> Sep 05 04:24:28 bootstrap-0 release-image-download.sh[1579]: Pulling
> registry.svc.ci.openshift.org/ocp/release@sha256:
> 0dc868002248f5ae2dcdeb9d435a7ae335a71335b535bc7dd8daddeff03deaa0...
> Sep 05 04:24:28 bootstrap-0 podman[1585]: 2019-09-05 04:24:28.615403376
> +0000 UTC m=+0.185638572 system refresh

What is this log line (and the "image pull" line), and where is it coming from? I can’t find it in the podman source code.

> Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process
> exited, code=killed, status=15/TERM

What is terminating podman after mere five seconds, and why? And why does the "Sep 05 04:26:41" instance take over 47 minutes?

Comment 13 Miloslav Trmač 2019-09-05 20:25:37 UTC
(In reply to Johnny Liu from comment #9)
> [root@bootstrap-0 ~]# podman images --log-level=debug
> DEBU[0000] Initializing boltdb state at
> /var/lib/containers/storage/libpod/bolt_state.db 
> DEBU[0000] Using graph driver overlay                   
> DEBU[0000] Using graph root /var/lib/containers/storage 
> DEBU[0000] Using run root /var/run/containers/storage   
> DEBU[0000] Using static dir /var/lib/containers/storage/libpod 
> DEBU[0000] Using tmp dir /var/run/libpod                
> DEBU[0000] Using volume path /var/lib/containers/storage/volumes 
> DEBU[0000] Set libpod namespace to ""                   
> ^C (pending here, have to terminate it by ctl + c)

Oh well, let’s give blind guessing a chance, anyway. Matt, does the above ring any bells?

Comment 14 Matthew Heon 2019-09-05 20:35:07 UTC
Hm. It looks like the Libpod runtime comes up successfully, and it hangs sometime thereafter. Unfortunately, there isn't much in the way of debug logging in `images` so it's hard to say where exactly that hang is.

I have a vague suspicion that this could be related to retrieving images from c/storage and the locking thereof. You're running CRI-O simultaneously, correct? That gives us a potential recipe for a deadlock on the storage locks. Can you get an `lslocks` when the hang happens, to see if Podman and/or CRI-O are holding locks at that point?
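For reference, something along these lines captures the lock state while the hang is in progress (lslocks comes from util-linux, so it should already be on the host):
# lslocks | grep -Ei 'podman|crio|containers'
# ps -eo pid,comm,args | grep -E 'podman|crio' | grep -v grep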

Comment 15 Matthew Heon 2019-09-05 20:40:10 UTC
The system refresh event log indicates that Podman has successfully completed a state refresh - so, from the point of view of Libpod, everything is running happily just before this freeze happens. We have no locks on images, though we could be looking at an infinite loop of some sort in the image processing code. So I'm leaning towards either a storage lock or a bug in the Podman image code.

Comment 16 Johnny Liu 2019-09-06 01:24:49 UTC
(In reply to Seth Jennings from comment #11)
> > Compare the timestamps of /etc/containers/registries.conf and release-image.service bootup, release-image is starting to pull image at "04:26:41", while last status change of /etc/container/registries.conf happen at "04:24:27", that means pulling image happened prior to mirror registry configuration landed on the system. So pulling image is pending there is some reasonable behaviour.
> 
> I'm confused here.  04:24:27 is before 04:26:41.  Thus the registries.conf
> file is in place before the pull starts, as expected.  The registries.conf
> file is laid down by ignition during initrd.
> 
> It is strange that this only happens on upi for vshpere.  There shouldn't be
> any difference between platforms at this stage afaik.
Sorry, that was my copy/paste error; I meant the first image pull started at "04:24:28", which is really close to the time the registries.conf file was laid down. Then the system is rebooted. (QE is following [1] for our disconnected install, and in [2] there is a restart operation.) Once the system is rebooted, registries.conf would be laid down again and overwritten by Ignition during the initrd, and release-image.service would also start at almost the same time, right? I guess there is some race condition between the two operations.

In my initial report, it was also reproduced on a baremetal install, but the reproduction rate is lower than on UPI-on-vSphere; UPI-on-vSphere now reproduces at about 100%.
Yeah, this has nothing to do with the platform, but I guess it is related to how the Ignition config is composed, which may cause boot-order differences (for example, in this vSphere install there is a restart step in Ignition).


[1]: https://github.com/openshift/installer/tree/master/upi/vsphere
[2]: https://github.com/openshift/installer/blob/master/upi/vsphere/machine/ignition.tf#L43


> Is it a restart loop or a four-hour wait, then? If it is a restart loop, why don’t the logs in the attached archive show any restarts?
Sorry for the misleading statement. It is not a restart loop but a forever wait; I confirmed the forever wait in comments 9 and 10.

> there clearly is _more_ available: comment#0 already quotes the output of (# journalctl -f  -u release-image.service ) which contains more than the three lines available in tmp/artifacts/bootstrap/journals/release-image.logs) in the archive.
Sorry, I do not know why the log in tmp/artifacts/bootstrap/journals/release-image.logs contains less than the output of `journalctl -f -u release-image.service`; it seems journals/release-image.logs only captured the logs after the system reboot. I did not make any modification to journals/release-image.logs.

Comment 17 Johnny Liu 2019-09-06 01:28:50 UTC
> > Sep 05 04:24:33 bootstrap-0 systemd[1]: release-image.service: Main process
> > exited, code=killed, status=15/TERM
> 
> What is terminating podman after mere five seconds, and why? And why does
> the "Sep 05 04:26:41" instance take over 47 minutes?

Because the system is rebooted, which is defined in the Ignition files.

[1]: https://github.com/openshift/installer/tree/master/upi/vsphere
[2]: https://github.com/openshift/installer/blob/master/upi/vsphere/machine/ignition.tf#L43

Comment 41 Johnny Liu 2019-09-17 03:52:05 UTC
Verified this bug with RHCOS 42.80.20190916.1: PASS.

Comment 43 errata-xmlrpc 2019-10-16 06:40:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

