Bug 975206

Summary: oo-admin-clear-pending-ops can get stuck on certain users / apps...
Product: OpenShift Online
Component: Pod
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Version: 2.x
Reporter: Thomas Wiest <twiest>
Assignee: Rajat Chopra <rchopra>
QA Contact: libra bugs <libra-bugs>
CC: rchopra
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Last Closed: 2013-06-18 16:59:31 UTC

Description Thomas Wiest 2013-06-17 19:54:55 UTC
Description of problem:
In PROD, we're seeing that oo-admin-clear-pending-ops gets stuck on a certain user / app and never completes its run.

For one of the runs, it ran for 3 days straight and never got past the user / app.

I killed that process and started a new one, and the new one almost immediately got stuck on this user / app.

Below is strace output that shows the user / app it's stuck on.

Note: strace continually shows the same user / app over and over, as if oo-admin-clear-pending-ops were looping on it. rchopra assures me that there isn't a loop for this. However, as you can see in the strace output below, the first line and the last line are the same; I didn't include the rest of the second entry because it's identical to the first. This looks like a loop.
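
For context, the MOPED lines in the trace show each pass trying to take a per-user lock in the locks collection via findAndModify. Reconstructed from those log lines, the acquisition presumably looks roughly like the sketch below (assuming the Moped 1.x API; the now + 600 lock timeout matches the :timeout values in the trace). Note that in the logged query the second $or branch key appears as a stringified "#<Origin::Key:...>" object rather than a real field name, so this shows only the intended shape:

require 'moped'

# Assumed reconstruction of the lock acquisition seen in the MOPED debug
# lines; field names and values are taken from the trace below.
session = Moped::Session.new(["10.77.7.15:27017"])
session.use "openshift_broker"

now = Time.now.to_i
lock = session[:locks].find(
  "user_id" => "51bb31075973cab7fe0003b7",
  # Take the lock if it's free, or if the previous holder's timeout expired.
  # In the logged query this second branch's key is a stringified
  # "#<Origin::Key:...>" object instead of a real field name, so it can
  # never match an existing document.
  "$or" => [{ "locked" => false }, { "timeout" => { "$lte" => now } }],
  "app_ids.51bb31084382ec659e000470" => { "$exists" => true }
).modify(
  { "$set" => { "locked" => true, "timeout" => now + 600 } },
  new: true
)
# lock is the updated document on success, nil if the lock was not acquired.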



Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.9.7-1.el6oso.noarch

How reproducible:
Very reproducible in PROD (this happens over and over, each time I start oo-admin-clear-pending-ops). Note: this user / gear has not been cleaned up, but probably needs to be.

Steps to Reproduce:
1. Unknown, just started happening in PROD.


Actual results:
The script gets stuck on a user / app and does _not_ move on. It can run for days, or until it's killed.


Expected results:
I think a timeout would be good. If a user / app can't be cleaned up within a certain time, the script should error out on that entry and move on; see the sketch below.
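
Something like the following would do it. This is a hypothetical sketch using Ruby's stdlib Timeout; clean_app is a made-up stand-in for whatever per-app cleanup the script actually performs:

require 'timeout'

PER_APP_TIMEOUT = 300 # seconds; hypothetical value

begin
  # clean_app(user, app) stands in for the script's per-app cleanup work.
  Timeout.timeout(PER_APP_TIMEOUT) { clean_app(user, app) }
rescue Timeout::Error
  # Log the stuck user / app and keep going instead of hanging forever.
  $stderr.puts "Timed out clearing pending ops for user #{user} / app #{app}; skipping"
end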


Additional info:
[...snip...]
[pid 31386] write(5, "2013-06-17 15:42:21.479 [DEBUG] MOPED: 10.77.7.15:27017 COMMAND      database=openshift_broker command={:findAndModify=>\"locks\", :query=>{\"user_id\"=>\"51bb31075973cab7fe0003b7\", \"$or\"=>[{\"locked\"=>false}, {\"#<Origin::Key:0x000000093699b8>\"=>1371498141}], \"app_ids.51bb31084382ec659e000470\"=>{\"$exists\"=>true}}, :new=>true, :update=>{\"$set\"=>{:locked=>true, :timeout=>1371498741}}} (3.4101ms) (pid:31386)\n", 403) = 403
[pid 31386] gettimeofday({1371498141, 480671}, NULL) = 0
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931388, 920780643}) = 0
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931388, 920865776}) = 0
[pid 31386] futex(0x1b905b4, FUTEX_WAIT_BITSET_PRIVATE, 17937, {2931389, 920865776}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 31386] futex(0x1b90618, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 921388355}) = 0
[pid 31386] gettimeofday({1371498142, 481466}, NULL) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 481570408}) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 482125010}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 482904356}) = 0
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 923256779}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 923493863}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] write(8, "r\0\0\0NF\0\0\0\0\0\0\324\7\0\0\0\0\0\0openshift_broker.locks\0\0\0\0\0\377\377\377\377?\0\0\0\3$query\0\32\0\0\0\7user_id\0Q\2731\7Ys\312\267\376\0\3\267\0\3$orderby\0\16\0\0\0\20_id\0\1\0\0\0\0\0", 114) = 114
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 924200632}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] read(8, "s\0\0\0Y\373\364\7NF\0\0\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0", 36) = 36
[pid 31386] read(8, "O\0\0\0\7_id\0Q\2731\7Ys\312\267\376\0\3\270\3app_ids\0\5\0\0\0\0\10locked\0\0\20timeout\0\3255\273Q\7user_id\0Q\2731\7Ys\312\267\376\0\3\267\0", 79) = 79
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 486489648}) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 486706675}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] write(5, "2013-06-17 15:42:22.486 [DEBUG] MOPED: 10.77.7.15:27017 QUERY        database=openshift_broker collection=locks selector={\"$query\"=>{\"user_id\"=>\"51bb31075973cab7fe0003b7\"}, \"$orderby\"=>{:_id=>1}} flags=[] limit=-1 skip=0 batch_size=nil fields=nil (3.5851ms) (pid:31386)\n", 270) = 270
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 487814441}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 488631971}) = 0
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 928881307}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 929215761}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] write(8, "\36\1\0\0OF\0\0\0\0\0\0\324\7\0\0\0\0\0\0openshift_broker.$cmd\0\0\0\0\0\377\377\377\377\354\0\0\0\2findAndModify\0\6\0\0\0locks\0\3query\0\223\0\0\0\7user_id\0Q\2731\7Ys\312\267\376\0\3\267\4$or\0C\0\0\0\0030\0\16\0\0\0\10locked\0\0\0\0031\0*\0\0\0\20#<Origin::Key:0x0000000938b7c0>\0\236f\277Q\0\0\3app_ids.51bb31084382ec659e000470\0\17\0\0\0\10$exists\0\1\0\0\10new\0\1\3update\0&\0\0\0\3$set\0\33\0\0\0\10locked\0\1\20timeout\0\366h\277Q\0\0\0", 286) = 286
[pid 31386] clock_gettime(CLOCK_MONOTONIC, {2931389, 929689658}) = 0
[pid 31386] select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
[pid 31386] read(8, "<\0\0\0Z\373\364\7OF\0\0\1\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0", 36) = 36
[pid 31386] read(8, "\30\0\0\0\nvalue\0\1ok\0\0\0\0\0\0\0\360?\0", 24) = 24
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 491955858}) = 0
[pid 31386] clock_gettime(CLOCK_REALTIME, {1371498142, 492269851}) = 0
[pid 31386] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
[pid 31386] write(5, "2013-06-17 15:42:22.492 [DEBUG] MOPED: 10.77.7.15:27017 COMMAND      database=openshift_broker command={:findAndModify=>\"locks\", :query=>{\"user_id\"=>\"51bb31075973cab7fe0003b7\", \"$or\"=>[{\"locked\"=>false}, {\"#<Origin::Key:0x0000000938b7c0>\"=>1371498142}], \"app_ids.51bb31084382ec659e000470\"=>{\"$exists\"=>true}}, :new=>true, :update=>{\"$set\"=>{:locked=>true, :timeout=>1371498742}}} (3.3240ms) (pid:31386)\n", 403) = 403
[pid 31386] gettimeofday({1371498142, 492885}, NULL) = 0
[...snip...]
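
Decoding the trace: the 24-byte reply to the findAndModify (read(8, "\30\0\0\0\nvalue\0\1ok\0...", 24)) is the BSON document {value: null, ok: 1.0}, i.e. the lock was not acquired, and the futex call that returns ETIMEDOUT after one second looks like the sleep between attempts. The two MOPED COMMAND lines are exactly one second apart (15:42:21.479 and 15:42:22.492), so the observed behavior matches a retry loop of roughly this shape (a hypothetical reconstruction, not the actual broker code):

# Hypothetical reconstruction of the observed behavior; acquire_lock is a
# stand-in for the findAndModify shown in the trace, not a real broker method.
loop do
  lock = acquire_lock(user_id, app_id)
  break if lock   # never true here: the reply's value field is null
  sleep 1         # matches the 1-second futex wait in the trace
end

If the expired-lock branch of the $or never matches (see the garbled "#<Origin::Key:...>" key above), a loop like this could spin indefinitely on a stale lock.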

Comment 1 Rajat Chopra 2013-06-18 16:59:31 UTC
There was some unknown issue with mongo connectivity. The problem has since resolved itself. The daily cron job for clear-pending-ops ran successfully this morning.