Bug 1303273 - Deployments Fail Often and Containers hang forever in Openshift Origin v1.1.1
Summary: Deployments Fail Often and Containers hang forever in Openshift Origin v1.1.1
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OKD
Classification: Red Hat
Component: Deployments
Version: 1.x
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Mace
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-30 03:47 UTC by Dean Peterson
Modified: 2017-05-31 18:22 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-31 18:22:11 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift web console (117.98 KB, image/png), 2016-01-30 03:47 UTC, Dean Peterson
oc get dc,rc -o yaml (113.00 KB, text/plain), 2016-01-30 03:48 UTC, Dean Peterson
Available images and the oldest is always being used for deployments (111.05 KB, image/png), 2016-01-30 03:50 UTC, Dean Peterson

Description Dean Peterson 2016-01-30 03:47:14 UTC
Created attachment 1119566 [details]
openshift web console

Description of problem:
I have been having a lot of trouble with this version. Deployments are very inconsistent. The latest manifestation is three separate deployments for one service, all hung in various states. This happens after earlier deployments of the same configuration succeeded without any changes on my side. OpenShift Origin v1.1.1 just randomly starts giving me errors. This is only the latest one: Failed to pull image "172.30.250.187:5000/abecorn/tradeclient@sha256:cbca9d885bf1c23bb518662cc51d61b5365ab321147a59d2be5b86869f50c08e": Driver devicemapper failed to create image rootfs 0dd71c06ef1387030a9c4c05e3cea6727405fce6eda53b5d01cb5ff442440d02: Unknown device e551632771dbc2fa0728e96af65300567243f2413311b76334e3abebbe836e19
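As a side note, a minimal diagnostic sketch for this kind of devicemapper failure (standard Docker/LVM commands, not part of the original report; output depends on the local storage setup):

  # Confirm the storage driver and thin pool Docker is actually using.
  docker info | grep -A 10 'Storage Driver'
  # Check thin pool data/metadata usage; a nearly full pool commonly causes
  # "failed to create image rootfs" / "Unknown device" errors.
  lvs -a -o lv_name,vg_name,data_percent,metadata_percent
  # List device-mapper devices and their current status.
  dmsetup status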

Subsequent deployments do not show errors in the event log or anywhere else I can find; they just hang. I did notice that all subsequent deployments, including the original failed one, keep trying to use the oldest container image ID even though there are many later successful builds and images of that container. I also noticed that every time I start a new deployment, it increments the container count of the oldest failed deployment and creates a new deployment with a container count of 1. I have attached an image of the mess in my web console, along with the output of: oc get dc,rc -o yaml
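For reference, a rough sketch of commands for inspecting and retrying a stuck deployment with the oc client shipped around this version (the deployment config name tradeclient is taken from the error above; if the flags differ in a given build, oc deploy --help lists the exact options):

  # Show the deployment config and its recent deployment history/events.
  oc describe dc tradeclient
  # Cancel the hung deployment, then kick off a fresh one.
  oc deploy tradeclient --cancel
  oc deploy tradeclient --latest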

Version-Release number of selected component (if applicable):
Openshift Origin v1.1.1, RHEL 7.2, Docker 1.8.2

How reproducible:
Deployments will work a few times, but ultimately start failing and hanging

Steps to Reproduce:
1. Set up a thin Docker storage pool backed by a free volume group (option B in the prerequisites); see the command sketch after this list
2. Start an OpenShift Origin master and node all-in-one on a RHEL 7.2 machine
3. Build and deploy containers
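A rough command sketch of the steps above (the volume group name docker-vg and the sample source repository are placeholders, not taken from this report):

  # 1. Thin Docker storage pool on a free volume group (RHEL 7.2, Docker 1.8.2).
  echo 'VG=docker-vg' >> /etc/sysconfig/docker-storage-setup
  docker-storage-setup
  systemctl restart docker

  # 2. All-in-one Origin master and node.
  openshift start

  # 3. Build and deploy an application from source.
  oc new-app https://github.com/example/app.git
  oc start-build app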

Actual results:


Expected results:


Additional info:

Comment 1 Dean Peterson 2016-01-30 03:48:37 UTC
Created attachment 1119567 [details]
oc get dc,rc -o yaml

Comment 2 Dean Peterson 2016-01-30 03:50:33 UTC
Created attachment 1119580 [details]
Available images and the oldest is always being used for deployments

Comment 3 Dean Peterson 2016-01-30 04:12:24 UTC
I also see a lot of pods when I run oc get pods on the project.  I don't remember seeing so many pods in earlier versions; it seemed like oc get pods used to show only running pods (a filtering sketch follows the listing):
abecornlandingpageservice-1-2jazv   1/1       Running            0          13h
abecornlandingpageservice-1-frbyd   1/1       Running            0          13h
batchservice-1-deploy               0/1       Error              0          5d
batchservice-6-deploy               0/1       Error              0          2d
batchservice-8-deploy               0/1       Error              0          1d
batchservice-9-7mjvz                1/1       Running            0          23h
batchservice-build-1-build          0/1       Error              0          5d
batchservice-build-2-build          0/1       Completed          0          5d
batchservice-build-3-build          0/1       Completed          0          3d
batchservice-build-4-build          0/1       Completed          0          3d
batchservice-build-5-build          0/1       Completed          0          3d
batchservice-build-6-build          0/1       Completed          0          2d
batchservice-build-7-build          0/1       Completed          0          1d
client-base-build-1-build           0/1       Completed          0          6d
client-base-build-2-build           0/1       Completed          0          2d
client-base-build-3-build           0/1       Error              0          2d
client-base-build-4-build           0/1       Completed          0          2d
client-base-build-5-build           0/1       Completed          0          1d
itemrepoclient-build-1-build        0/1       Completed          0          1d
itemrepoclientservice-2-grzzg       1/1       Running            1          1d
itemrepoclientservice-2-zatbp       1/1       Running            1          1d
tradeclient-build-3-build           0/1       Completed          0          1d
tradeclient-build-4-build           0/1       Completed          0          2h
tradeclient-build-5-build           0/1       Completed          0          1h
tradeclientservice-10-hzw2i         0/1       Terminating        0          1h
tradeclientservice-4-2szx0          0/1       Terminating        0          2h
tradeclientservice-4-dfx0e          0/1       Terminating        0          2h
tradeclientservice-4-rinm1          0/1       Terminating        0          1h
tradeclientservice-4-uc7ni          0/1       Terminating        0          1h
tradeclientservice-5-388dq          0/1       Terminating        0          1h
tradeclientservice-7-ezw21          0/1       Terminating        0          1h
tradeclientservice-9-1g1dk          0/1       Terminating        0          1h
tradeservice-1-deploy               0/1       Error              0          23h
tradeservice-2-deploy               0/1       DeadlineExceeded   0          23h
tradeservice-3-deploy               0/1       Error              0          23h
tradeservice-6-bs9c8                1/1       Running            0          2h
tradeservice-build-1-build          0/1       Error              0          1d
tradeservice-build-2-build          0/1       Error              0          1d
tradeservice-build-5-build          0/1       Completed          0          1d
tradeservice-build-6-build          0/1       Completed          0          7h
tradeservice-build-7-build          0/1       Completed          0          2h
tradeservicebase-build-1-build      0/1       Completed          0          7d
tradeservicebase-build-2-build      0/1       Completed          0          3d
tradeservicebase-build-3-build      0/1       Completed          0          3d
tradeservicebase-build-4-build      0/1       Completed          0          2d
tradeservicebase-build-5-build      0/1       Completed          0          1d
wildfly-jdk-8-build-1-build         0/1       Error              0          8d
wildfly-jdk-8-build-10-build        0/1       Completed          0          3d
wildfly-jdk-8-build-11-build        0/1       Completed          0          2d
wildfly-jdk-8-build-12-build        0/1       Completed          0          1d
wildfly-jdk-8-build-2-build         0/1       Error              0          8d
wildfly-jdk-8-build-3-build         0/1       Error              0          8d
wildfly-jdk-8-build-4-build         0/1       Error              0          8d
wildfly-jdk-8-build-5-build         0/1       Error              0          8d
wildfly-jdk-8-build-6-build         0/1       Error              0          8d
wildfly-jdk-8-build-7-build         0/1       Error              0          8d
wildfly-jdk-8-build-8-build         0/1       Error              0          8d
wildfly-jdk-8-build-9-build         0/1       Completed          0          8d
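
A rough way to filter or clean up the finished build/deploy pods shown above (assumes the default table output of oc get pods; verify the matched names before deleting anything):

  # Hide Completed/Error pods from the listing, keeping the header row.
  oc get pods | awk 'NR==1 || !/Completed|Error/'
  # Delete finished build/deploy pods once they are no longer needed.
  oc get pods | awk '/Completed|Error/ {print $1}' | xargs -r oc delete pod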

Comment 4 Dean Peterson 2016-01-30 04:14:06 UTC
The terminating pods have been stuck in "Terminating" for hours.  I have tried deleting them, deleting their deployment config, and deleting the associated service; the pods still remain.
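A hedged sketch of what can be tried for pods stuck in Terminating (the pod name is taken from the listing above; the docker commands assume direct access to the node):

  # Ask for deletion with no grace period.
  oc delete pod tradeclientservice-4-2szx0 --grace-period=0
  # Check whether a container for the pod is still present on the node.
  docker ps -a | grep tradeclientservice-4-2szx0
  # As a last resort, remove the leftover container directly.
  docker rm -f <container-id>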

Comment 5 Dean Peterson 2016-01-30 04:28:06 UTC
The only way to get the pods to clear was to kill OpenShift.  After restarting it, I ran the deployment again and it worked this time.  This happens far too often.  If it were something with my machine setup, why would it work after restarting OpenShift?  I have been working with OpenShift Origin for a while now, and my previous v1.0.7 setup seemed more reliable.  The logs were incredibly slow, builds would get slower over time, and deployments would often kill previous containers before the new containers were up; still, I rarely had pods hang forever and force me to manually delete deployment configs and restart OpenShift.  The only things I have changed between that setup and this one are moving from RHEL 7.1 to RHEL 7.2 and switching from the default loopback storage to a thin storage pool.

Comment 6 Kenjiro Nakayama 2016-02-10 11:13:33 UTC
As for the pods stuck in the "Terminating" state, do you remember whether you deleted them while their image was still being pulled, or after the pull had failed?

There are some related bug reports:

  https://bugzilla.redhat.com/show_bug.cgi?id=1271198#c8
  https://bugzilla.redhat.com/show_bug.cgi?id=1274598#c0

and in both of those reports the pods were deleted while their image was still being pulled.

NOTE: If your original issue is NOT related to the "Terminating" state, we should not discuss it here. But if you think it is related to your issue, please comment on this bug ticket.

Comment 7 Dean Peterson 2016-02-10 16:11:45 UTC
I don't think image pulling is part of the problem.  I see plenty of messages saying the image is pulled successfully.  The image is pulled, but the container never starts, and looking at the container logs directly with docker logs shows they are completely empty.  The last time I had containers failing to start with blank logs, I needed to set runAsUser to RunAsAny because I use the root user inside my Dockerfiles.  I checked that setting, and the restricted SCC is set to RunAsAny.  I only manually delete the pods after the event logs show the image was pulled and the containers supposedly started.  After they start, the event logs show the containers are killed for an unknown reason.
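For reference, a small sketch of the checks described above (viewing SCC objects requires cluster-admin access; the pod name is taken from the earlier listing and the container ID is a placeholder):

  # Confirm the runAsUser strategy on the restricted SCC.
  oc get scc restricted -o yaml | grep -A 2 runAsUser
  # Compare the pod's logs as seen by OpenShift and by Docker directly.
  oc logs tradeclientservice-10-hzw2i
  docker logs <container-id>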

Comment 8 Eric Paris 2017-05-31 18:22:11 UTC
We apologize, however, we do not plan to address this report at this time. The majority of our active development is for the v3 version of OpenShift. If you would like for Red Hat to reconsider this decision, please reach out to your support representative. We are very sorry for any inconvenience this may cause.

