Bug 1304266

Summary: Pod status keeps in pending status on dedicated env
Product: OpenShift Online
Reporter: Wang Haoran <haowang>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: DeShuai Ma <dma>
Severity: high
Docs Contact:
Priority: high
Version: 3.x
CC: agrimm, akostadi, aos-bugs, haowang, jhonce, jokerman, mmccomas, pruan, whearn, wzheng
Target Milestone: ---
Keywords: TestBlocker
Target Release: ---
Flags: jhonce: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-23 15:08:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1303130
Attachments: python build failed (flags: none)

Description Wang Haoran 2016-02-03 08:37:46 UTC
Description of problem:

The pod status becomes OutOfDisk on the dedicated env after running for some time.
Version-Release number of selected component (if applicable):


How reproducible:

Always
Steps to Reproduce:
1. Create a project
2. Create a pod
3. Check the pod status (see the command sketch below)
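For reference, a minimal shell sketch of these steps; the project name and sample application below are placeholders I am assuming, not taken from the original report:

$ oc login https://api.stage.openshift.com              # log in to the dedicated env
$ oc new-project pending-check                          # step 1: create a project (name is hypothetical)
$ oc new-app https://github.com/openshift/nodejs-ex     # step 2: creates build and deploy pods (example repo)
$ oc get pods -o wide                                   # step 3: check pod status and node placement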

Actual results:
[vagrant@ose ~]$ oc get pod -o wide
NAME                    READY     STATUS      RESTARTS   AGE       NODE
beego-example-1-build   0/1       OutOfDisk   0          24s       ip-172-31-5-177.ec2.internal

Expected results:
running

Additional info:

Comment 1 Wang Haoran 2016-02-04 02:34:45 UTC
The env cannot build and deploy now:
[vagrant@ose ~]$ oc get event
FIRSTSEEN   LASTSEEN   COUNT     NAME         KIND                    SUBOBJECT   REASON              SOURCE                           MESSAGE
17m         10m        11        database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Pod "database-1-deploy" is forbidden: service account haowang2/deployer was not found, retry after the service account is created
17m         12m        2         database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Internal error occurred: Get http://api.stage.openshift.com/api/v1/namespaces/haowang2: dial tcp 52.5.122.7:80: connection refused
17m         14m        2         database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Internal error occurred: Get http://api.stage.openshift.com/api/v1/namespaces/haowang2: dial tcp 52.72.220.72:80: connection refused
8m          17s        10        database-2   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-2: Pod "database-2-deploy" is forbidden: service account haowang2/deployer was not found, retry after the service account is created
17m         17m        1         database     DeploymentConfig                    DeploymentCreated   {deploymentconfig-controller }   Created new deployment "database-1" for version 1
8m          8m         1         database     DeploymentConfig                    DeploymentCreated   {deploymentconfig-controller }   Created new deployment "database-2" for version 2
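The FailedCreate events above point at a missing deployer service account in the haowang2 namespace. A quick way to confirm that (a hedged sketch, not from the original report) would be:

$ oc get serviceaccounts -n haowang2     # "deployer" should be listed next to "builder" and "default"
$ oc get events -n haowang2 -w           # watch whether the controller retries once the account shows up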

Comment 2 Wesley Hearn 2016-02-05 21:04:04 UTC
Regarding the initial report: the project does not exist and there are no other pods showing this.


Comment 1 is related to https://bugzilla.redhat.com/show_bug.cgi?id=1304586

Can you recreate the OutOfDisk error so I can actually look into it?
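If the OutOfDisk state does reappear, a hedged sketch of what I'd check first (node name taken from the output in the description):

$ oc describe node ip-172-31-5-177.ec2.internal    # the Conditions section shows whether OutOfDisk is set and since when
$ df -h /var/lib/docker                            # run on the node itself to check Docker storage headroom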

Comment 3 Wenjing Zheng 2016-02-06 04:01:04 UTC
There are no errors like those in comment #1 and no OutOfDisk error now; however, pods stay in Pending status on the nodes:
[wzheng@localhost ~]$ oc get builds
NAME                 TYPE      FROM      STATUS    STARTED   DURATION
php-sample-build-1   Source    Git       Pending             
[wzheng@localhost ~]$ oc get pods -o wide
NAME                       READY     STATUS      RESTARTS   AGE       NODE
database-1-deploy          1/1       Running     0          21m       ip-172-31-5-179.ec2.internal
database-1-kbnn8           1/1       Running     0          21m       ip-172-31-5-179.ec2.internal
database-1-posthook        0/1       Pending     0          21m       ip-172-31-5-180.ec2.internal
database-1-prehook         0/1       Completed   0          21m       ip-172-31-5-179.ec2.internal
php-sample-build-1-build   0/1       Pending     0          19m       ip-172-31-5-180.ec2.internal
[wzheng@localhost ~]$ oc get pods -n wzheng3 -o wide
NAME                       READY     STATUS      RESTARTS   AGE       NODE
php-sample-build-2-build   0/1       Pending     0          19m       ip-172-31-5-179.ec2.internal
[wzheng@localhost ~]$ oc get pods -o wide -n wzheng123
NAME                       READY     STATUS      RESTARTS   AGE       NODE
php-sample-build-3-build   0/1       Pending     0          21m       ip-172-31-5-177.ec2.internal
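To see why those build pods stay Pending, the pod events are usually the first stop; a hedged sketch using the pod and namespace names from the output above:

$ oc describe pod php-sample-build-1-build     # the Events section at the end shows scheduling or image-pull failures
$ oc get events -n wzheng3                     # same check from the namespace side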

Comment 5 Wesley Hearn 2016-02-08 15:32:26 UTC
This is related to the docker daemon hanging. It seems to be happening at a higher rate for us on 3.1.1.6.

Comment 6 Jhon Honce 2016-02-08 21:19:38 UTC
During the next hang, please attach strace to the docker process (-f -v -y -yy -s 4096) and log for approximately 5 minutes. Please attach the log here or forward it via email. Thanks.
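For reference, a hedged sketch of that capture on the affected node (the PID lookup and output file name are my assumptions; the flags are the ones requested above):

# run as root on the node with the hung daemon; stop with Ctrl-C after roughly 5 minutes
$ strace -f -v -y -yy -s 4096 -p $(pidof docker) -o /tmp/docker-hang.strace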

Comment 7 Aleksandar Kostadinov 2016-02-10 15:54:53 UTC
Created attachment 1122835 [details]
python build failed

It might be related to working in the web console; I've gotten
"the image cannot be retrieved" several times while trying to open:
https://console.stage.openshift.com/console/project/ctrnl/create/fromimage?imageName=python&imageTag=3.4&namespace=ctrnl

At some point it succeeded, but then the build failed with a failed-to-push-image error. See the attached console log.
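For a push failure like that, the build log normally carries the registry error; a hedged sketch (the build name is a guess, only the ctrnl namespace comes from the URL above):

$ oc logs build/python-1 -n ctrnl     # hypothetical build name; shows the push error from the builder pod
$ oc get imagestreams -n ctrnl        # confirm the target image stream exists in the project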

Comment 11 Peter Ruan 2016-02-12 19:18:10 UTC
I've tested it with my limited runs and have not seen the problem. I will need to run more tests and will mark it as VERIFIED if I still can't reproduce the problem.

Comment 12 Peter Ruan 2016-02-12 23:48:24 UTC
I've run more tests today on top of yesterday's and have not seen the issue again. Marking it as verified.