Bug 1405214
| Summary: | [paid][online-int][free][prod]Pods fail to mount a PV when a detach of same PV is followed by an attach. | ||
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | bernard |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yasun |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.x | CC: | abhgupta, aos-bugs, bernard, bingli, dakini, hchen, jgoulding, lxia, sampah_budi, xtian, yasun, yufchang, zhezli |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | --- | Flags: | yasun: needinfo- |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-11-09 18:54:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description (bernard, 2016-12-15 21:26:01 UTC)
ps. I first posted on the Google group and received the advice there to create a bug report, as this apparently is a known issue that requires the operations team to intervene. The advice I received was:

> This is most likely due to the fact that you have "Rolling" as your deployment strategy and your pods require a volume. In Developer Preview, the PVCs are backed by EBS volumes, and these cannot be mounted on two different nodes. In a rolling deployment, the new pod comes up first as a canary, and only when that succeeds does the deployment proceed and the old pod get taken down. The new pod (from the new deployment) will try to mount the volume and fail, since the old pod (from the current/existing deployment) still has the volume mounted. These pods can (and most likely will) be scheduled on different nodes, and hence this is not going to work. The solution is to use "Recreate" as the deployment strategy when the pods rely on PVCs. Can you confirm this to be the issue and that the suggested change resolves this?

I have a similar issue to the one the original poster experienced. I changed the YAML of the deployment (Rolling to Recreate) and recreated the PVC, and it worked 2-3 times. Now the PVC can't even be used anymore. Here is the log:

```
--> Scaling beta-23 down to zero
--> Scaling beta-24 to 1 before performing acceptance check
--> Waiting up to 10m0s for pods in deployment beta-24 to become ready
error: update acceptor rejected beta-24: pods for deployment "beta-24" took longer than 600 seconds to become ready
```

Let me know if you need anything else, thanks. My OpenShift email account is the same as my Bugzilla email address, and I haven't deleted my pod, so you can look at it. Thanks.

Yeah, this seems similar to the problem in https://bugzilla.redhat.com/show_bug.cgi?id=1404811, but the storage team has made a lot of improvements in the attach/detach code path for AWS. If you can, upgrade to 3.4 and try again; if it doesn't work, let us know.

The Online environment has recently been upgraded.
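The suggested change from "Rolling" to "Recreate" ensures the old pod is stopped (and its EBS volume detached) before the new pod starts. A minimal sketch of that strategy stanza in a DeploymentConfig follows; the name and timeout here are illustrative assumptions, not taken from the reporter's actual config:

```yaml
# Illustrative DeploymentConfig fragment: Recreate stops the old pod first,
# so the RWO/EBS-backed volume is free before the new pod tries to mount it.
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: myapp            # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate       # instead of Rolling
    recreateParams:
      timeoutSeconds: 600
```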
Can you please confirm whether this is still an issue?

It worked better (perhaps even flawlessly) until a few days ago; now I can't even mount a volume to a pod. This is the error that I got:

```
W0210 03:01:37.993356 1 reflector.go:330] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:468: watch of *api.Pod ended with: too old resource version: 850065961 (850090417)
```

Please look into this, thanks.

Can you post more details:
1. What do your pod and PV YAMLs look like? If you can't post them publicly, you can email them to me.
2. What kind of volume type were you using?
3. How did you deploy OpenShift?
4. The exact version of OpenShift you were using.
5. Also, more logs around that error would be helpful.

1. Here is my pod YAML:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: beta-2-deploy
  namespace: divvy
  selfLink: /api/v1/namespaces/divvy/pods/beta-2-deploy
  uid: 0343a47a-ef46-11e6-b599-0e63b9c1c48f
  resourceVersion: '850334773'
  creationTimestamp: '2017-02-10T04:04:26Z'
  labels:
    openshift.io/deployer-pod-for.name: beta-2
  annotations:
    kubernetes.io/limit-ranger: >-
      LimitRanger plugin set: cpu, memory request for container deployment; cpu,
      memory limit for container deployment
    openshift.io/deployment.name: beta-2
    openshift.io/scc: restricted
spec:
  volumes:
    - name: deployer-token-o8eyc
      secret:
        secretName: deployer-token-o8eyc
        defaultMode: 420
  containers:
    - name: deployment
      image: 'registry.ops.openshift.com/openshift3/ose-deployer:v3.4.1.2'
      env:
        - name: KUBERNETES_MASTER
          value: 'https://ip-172-31-10-24.ec2.internal'
        - name: OPENSHIFT_MASTER
          value: 'https://ip-172-31-10-24.ec2.internal'
        - name: BEARER_TOKEN_FILE
          value: /var/run/secrets/kubernetes.io/serviceaccount/token
        - name: OPENSHIFT_CA_DATA
          value: |
            -----BEGIN CERTIFICATE-----
            MIIC5jCCAdCgAwIBAgIBATALBgkqhkiG9w0BAQswJjEkMCIGA1UEAwwbb3BlbnNo
            aWZ0LXNpZ25lckAxNDYzMTU2NTg2MB4XDTE2MDUxMzE2MjMwNloXDTIxMDUxMjE2
            MjMwN1owJjEkMCIGA1UEAwwbb3BlbnNoaWZ0LXNpZ25lckAxNDYzMTU2NTg2MIIB
            IjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEArp4BlumhbaZiJxnPJPd78jqp
            scHOa71PnC8Pd/Uzg/cr6kCz8cqFadVpHyAYxR2MVPzwGEjJ2ScP2f5iVby8w10n
            408WfAv3HelPCcw5z1yp4pb2WnFNy1eglGl2fQp7Z/Od8TgO2OOpeVvLfxSL/K9V
            OXYmt9HFnfhO/0c5Cv5T7OJc997h3++006yi/qt0lGTHgeF/eUCmnZ0tosjCRhAS
            7AJrYAXN8ERI3s91mrzDMC4q3FjOLlWVa9ZrXeUrbvJYCYgbdtgG2wup2ETy2nFJ
            6meeYRYF/7JaVXsOZWkJYfH2K6Lg1wGjFyOXNZkA2jLqOlRMUZWHNnA/DTpL3wID
            AQABoyMwITAOBgNVHQ8BAf8EBAMCAKQwDwYDVR0TAQH/BAUwAwEB/zALBgkqhkiG
            9w0BAQsDggEBADQPZ3eyz2OtWdsxzG//lq1DXguV7T5KUfgp76mkZuDjp5ermC42
            m1DjFtEP8HvFTZgz+LYsAIhv7MShe/bZOieHnz4A/vc3oFi6uVrcLffR+CVjdlSP
            UDKZzOkf7/jTxOzSQImNk3AQAuIeVCcMXF4v4zVRlyMaWcTtOuNGWdEmLZUhUrjT
            E5Gh+KQOW1jFDYKeZ1RGkAMCL8aD6p7jNvmxVGzQasIleKylDteGblcEdn8M3Xjp
            hHUVIWnru5CBTwCxCqSXkxMFUsZqSIy+hiMeJPFmkDIdSBb7n2BwgcG0cXu/Zuju
            2PKZGzVqvgHhcIlwFZ2g9g1S/SwlVEGUvZs=
            -----END CERTIFICATE-----
        - name: OPENSHIFT_DEPLOYMENT_NAME
          value: beta-2
        - name: OPENSHIFT_DEPLOYMENT_NAMESPACE
          value: divvy
      resources:
        limits:
          cpu: '1'
          memory: 512Mi
        requests:
          cpu: 60m
          memory: 307Mi
      volumeMounts:
        - name: deployer-token-o8eyc
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      imagePullPolicy: Always
      securityContext:
        capabilities:
          drop:
            - KILL
            - MKNOD
            - NET_RAW
            - SETGID
            - SETUID
            - SYS_CHROOT
        privileged: false
        seLinuxOptions:
          level: 's0:c227,c194'
        runAsUser: 1051690000
  restartPolicy: Never
  terminationGracePeriodSeconds: 10
  activeDeadlineSeconds: 3600
  dnsPolicy: ClusterFirst
  nodeSelector:
    type: compute
  serviceAccountName: deployer
  serviceAccount: deployer
  nodeName: ip-172-31-10-175.ec2.internal
  securityContext:
    seLinuxOptions:
      level: 's0:c227,c194'
    fsGroup: 1051690000
  imagePullSecrets:
    - name: deployer-dockercfg-vk2yt
status:
  phase: Failed
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:04:26Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:14:56Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [deployment]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2017-02-10T04:04:26Z'
  hostIP: 172.31.10.175
  podIP: 10.1.102.15
  startTime: '2017-02-10T04:04:26Z'
  containerStatuses:
    - name: deployment
      state:
        terminated:
          exitCode: 1
          reason: Error
          startedAt: '2017-02-10T04:04:54Z'
          finishedAt: '2017-02-10T04:14:55Z'
          containerID: 'docker://2875722a329fe71b7b2eefe416395e0274f8e7aa623d2ec5a17995bf4dc65c9a'
      lastState: {}
      ready: false
      restartCount: 0
      image: 'registry.ops.openshift.com/openshift3/ose-deployer:v3.4.1.2'
      imageID: 'docker-pullable://registry.ops.openshift.com/openshift3/ose-deployer@sha256:37adf782e29f09c815ae0bd91299e99ae84e2849b25de100c6581df36c6a7920'
      containerID: 'docker://2875722a329fe71b7b2eefe416395e0274f8e7aa623d2ec5a17995bf4dc65c9a'
```
I don't know how to get the PV YAML; is it the same as the deployment YAML?
2. I've tried both RWO (ReadWriteOnce) & RWX (ReadWriteMany)
3. I am using the web interface (as opposed to the oc command line)
4. OpenShift 3
5. Here is the log from the failed pod:

```
--> Scaling beta-1 down to zero
--> Scaling beta-2 to 1 before performing acceptance check
--> Waiting up to 10m0s for pods in deployment beta-2 to become ready
W0210 04:14:06.692864 1 reflector.go:330] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:468: watch of *api.Pod ended with: too old resource version: 850302230 (850328358)
error: update acceptor rejected beta-2: pods for deployment "beta-2" took longer than 600 seconds to become ready
```
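The RWO/RWX distinction mentioned above matters here: AWS EBS volumes support only the ReadWriteOnce access mode, so an EBS-backed claim can never be mounted on two nodes at once, regardless of what the PVC requests. For reference, a minimal PVC sketch follows; the size is an illustrative assumption (only the claim name "data" and the namespace appear elsewhere in this report):

```yaml
# Hypothetical PVC sketch. EBS-backed storage honours only ReadWriteOnce,
# so requesting RWX does not make the volume mountable on multiple nodes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: divvy
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi       # illustrative size
```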
I don't see any persistent volumes mounted in your pod. What you are seeing is probably a different bug, unrelated to this one, which was specifically opened for persistent volumes.

I'm including the YAML from the deployment, which shows the persistent volume; hope this helps:
```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: beta
  namespace: divvy
  selfLink: /oapi/v1/namespaces/divvy/deploymentconfigs/beta
  uid: 7d91daff-ef45-11e6-b125-0e3d364e19a5
  resourceVersion: '850335639'
  generation: 5
  creationTimestamp: '2017-02-10T04:00:41Z'
  labels:
    app: beta
  annotations:
    openshift.io/generated-by: OpenShiftWebConsole
spec:
  strategy:
    type: Recreate
    recreateParams:
      timeoutSeconds: 600
    rollingParams:
      updatePeriodSeconds: 1
      intervalSeconds: 1
      timeoutSeconds: 600
      maxUnavailable: 25%
      maxSurge: 25%
    resources: {}
  triggers:
    - type: ImageChange
      imageChangeParams:
        automatic: true
        containerNames:
          - beta
        from:
          kind: ImageStreamTag
          namespace: divvy
          name: 'beta:latest'
        lastTriggeredImage: >-
          172.30.47.227:5000/divvy/beta@sha256:91ed279cee18e4f1ce31ae00a46d49192f7270c2c6253cf129cdfc7f56323e3e
    - type: ConfigChange
  replicas: 1
  test: false
  selector:
    deploymentconfig: beta
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: beta
        deploymentconfig: beta
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data
      containers:
        - name: beta
          image: >-
            172.30.47.227:5000/divvy/beta@sha256:91ed279cee18e4f1ce31ae00a46d49192f7270c2c6253cf129cdfc7f56323e3e
          ports:
            - containerPort: 8080
              protocol: TCP
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /data
          terminationMessagePath: /dev/termination-log
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
status:
  latestVersion: 2
  observedGeneration: 5
  replicas: 1
  availableReplicas: 1
  details:
    message: config change
    causes:
      - type: ConfigChange
  conditions:
    - type: Progressing
      status: 'False'
      lastTransitionTime: '2017-02-10T04:14:57Z'
      reason: ProgressDeadlineExceeded
      message: Replication controller "beta-2" has failed progressing
    - type: Available
      status: 'True'
      lastTransitionTime: '2017-02-10T04:15:12Z'
      message: Deployment config has minimum availability.
```
Okay, thank you for the deployment YAML. But I don't see any errors in the Deployment object you posted. It shows "availableReplicas: 1" and "replicas: 1", which means whatever pods were required for running the deployment are running correctly. Are you certain that the Deployment you posted above is not running properly and is stuck because it is unable to mount volumes? If that is indeed the case, can you also find the pods that were created for that deployment (you can do `oc get pods`) and then post the output of the following commands:

~> `oc describe pod <pod_name_from_above>`
~> `oc logs <pod_name_from_above>`

@budi - If you are still affected by this bug, can you open a new bug with the following items:
1. Steps to reproduce
2. Whatever logs you have (such as oc logs)
3. Output of `oc describe pod` for the pod that is stuck

Also, you give the OpenShift version as "3". Can you be more specific? There have been a lot of fixes between 3.3 and 3.4.

I was able to reproduce the bug consistently before (not sure which version of OpenShift), but I wasn't able to reproduce the error today (using OpenShift 3.4.0.13).

*** Bug 1441602 has been marked as a duplicate of this bug. ***

Testing on free-int:
1. Create a persistent application with the template mysql-persistent, and create some data on the persistent volume
2. After the pod is ready, scale down the pod to 0
3. Wait for a while, then scale up the pod to 1
4. Repeat steps 2-3 100 times

The PV can be attached successfully. Waiting for the online-int env to be ready for testing.