Bug 1708442

Summary:	[3.11] Symlinks under /var/lib/containers/storage/overlay/l are lost on reboot
Product:	OpenShift Container Platform	Reporter:	Steven Walter <stwalter>
Component:	Containers	Assignee:	Urvashi Mohnani <umohnani>
Status:	CLOSED ERRATA	QA Contact:	weiwei jiang <wjiang>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	3.11.0	CC:	adeshpan, aos-bugs, dwalsh, eparis, jokerman, mmccomas, umohnani, wjiang
Target Milestone:	---
Target Release:	3.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1704410	Environment:
Last Closed:	2019-06-26 09:08:09 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1704410
Bug Blocks:

Comment 1 Urvashi Mohnani 2019-05-10 09:30:59 UTC

New build is available at https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21597045

Comment 2 weiwei jiang 2019-05-13 05:30:26 UTC

(In reply to Urvashi Mohnani from comment #1)
> New build is available at
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21597045

Hi, is this build missing some important containers/storage changes?

One node got this error for a pod:
Events:
  Type     Reason   Age                From                                           Message
  ----     ------   ----               ----                                           -------
  Warning  Failed   10m                kubelet, qe-wjiang-311-node-registry-router-1  Failed to pull image "brewregistry.stage.redhat.io/openshift3/ose-node:v3.11": rpc error: code = Unknown desc = Error writing blob: error storing blob to file "/var/tmp/storage729637586/1": unexpected EOF
  Warning  Failed   10m                kubelet, qe-wjiang-311-node-registry-router-1  Error: ErrImagePull
  Normal   BackOff  10m                kubelet, qe-wjiang-311-node-registry-router-1  Back-off pulling image "brewregistry.stage.redhat.io/openshift3/ose-node:v3.11"
  Warning  Failed   10m                kubelet, qe-wjiang-311-node-registry-router-1  Error: ImagePullBackOff
  Normal   Pulling  10m (x2 over 16m)  kubelet, qe-wjiang-311-node-registry-router-1  pulling image "brewregistry.stage.redhat.io/openshift3/ose-node:v3.11"

Comment 3 Urvashi Mohnani 2019-05-13 12:40:56 UTC

Nope, nothing in containers/storage changed; we just cherry-picked the symlink fixes onto the containers/storage version already being used by cri-o 1.11.
Are you seeing this error on multiple pods? Did it eventually fix itself, or was it stuck in this state? Did you try killing the pod and letting it start up again?

Comment 4 Steven Walter 2019-05-13 17:11:01 UTC

My customer who is hitting the issue "rebuilds" the node to fix it (removing and reinstalling components), but this is a heavy workaround and not ideal. Curious if anyone has a less intrusive workaround.

Comment 5 Urvashi Mohnani 2019-05-13 18:19:30 UTC

So the "storing-the-layer-blob-to-a-file" logic comes from containers/image and not containers/storage.
If this issue is continuously happening, please open another bz for it. This shouldn't be blocking this bz as the symlink fixes went into containers/storage.

Comment 6 Steven Walter 2019-05-13 18:50:49 UTC

Hi,
Do you mean if it requires rebuilding the node (i.e. if it is not resolved by deleting pods, etc.)?

Comment 7 Urvashi Mohnani 2019-05-13 18:56:36 UTC

@Steven yeah, does deleting the pod resolve the issue? Also, how often is the customer seeing this happen?
If possible, can I get cri-o and kubelet logs from the cluster as well?

Comment 8 weiwei jiang 2019-05-14 09:13:51 UTC

Checked with cri-o 1.11.14 and rebooted the whole cluster 5 times; did not hit this issue, so moving to VERIFIED.

# oc get nodes -o wide 
NAME                                   STATUS    ROLES     AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiang-311-master-etcd-1            Ready     master    48m       v1.11.0+d4cacc0   10.0.76.16    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.12.1.el7.x86_64   cri-o://1.11.14-1.rhaos3.11.gitd56660e.el7
qe-wjiang-311-node-1                   Ready     compute   45m       v1.11.0+d4cacc0   10.0.77.60    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.12.1.el7.x86_64   cri-o://1.11.14-1.rhaos3.11.gitd56660e.el7
qe-wjiang-311-node-registry-router-1   Ready     <none>    45m       v1.11.0+d4cacc0   10.0.76.72    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.12.1.el7.x86_64   cri-o://1.11.14-1.rhaos3.11.gitd56660e.el7
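
For reference, a minimal sketch of how the symlinks under /var/lib/containers/storage/overlay/l could be checked on a node after a reboot. It assumes the usual overlay driver layout, where each layer directory holds a "link" file naming its short ID and l/<short ID> is a symlink to ../<layer>/diff. The function name find_missing_links and its overlay_root parameter are illustrative only; this is not the check that was run for the verification above.

```python
import os


def find_missing_links(overlay_root="/var/lib/containers/storage/overlay"):
    """Return (layer, short_id) pairs whose l/<short_id> symlink is absent or dangling.

    Assumes each layer directory contains a 'link' file with its short ID
    (hypothetical helper, not part of containers/storage).
    """
    if not os.path.isdir(overlay_root):
        return []
    link_dir = os.path.join(overlay_root, "l")
    missing = []
    for layer in sorted(os.listdir(overlay_root)):
        link_file = os.path.join(overlay_root, layer, "link")
        # Skip the 'l' directory itself and anything without a 'link' file.
        if layer == "l" or not os.path.isfile(link_file):
            continue
        with open(link_file) as f:
            short_id = f.read().strip()
        link_path = os.path.join(link_dir, short_id)
        # os.path.exists follows symlinks, so a dangling link is reported too.
        if not (os.path.islink(link_path) and os.path.exists(link_path)):
            missing.append((layer, short_id))
    return missing


if __name__ == "__main__":
    for layer, short_id in find_missing_links():
        print(f"layer {layer}: l/{short_id} is missing or dangling")
```

Run as root on the node, it prints one line per layer whose symlink is gone; no output means every layer with a "link" file still has its entry under l/.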

Comment 9 weiwei jiang 2019-05-14 09:15:22 UTC

For the "Error writing blob: error storing blob to file" issue, I tried 2 more times but could not reproduce it.
I will keep an eye on it and open a bug if I hit it again.

Comment 10 Steven Walter 2019-05-14 23:06:42 UTC

@Urvashi
Hm, well the issue seems to occur with new pods, so this might not apply. I opened a new bug.
https://bugzilla.redhat.com/show_bug.cgi?id=1710124

Comment 12 errata-xmlrpc 2019-06-26 09:08:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605