Bug 1008283 - Corrupted pending_op in pending_op_groups queue
Status: CLOSED CURRENTRELEASE
Product: OpenShift Online
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Rajat Chopra
QA Contact: libra bugs
Reported: 2013-09-16 01:15 EDT by Rajat Chopra
Modified: 2015-05-14 20:20 EDT

Doc Type: Bug Fix
Last Closed: 2013-10-17 09:28:51 EDT
Type: Bug

Attachments: None
Description Rajat Chopra 2013-09-16 01:15:04 EDT
Description of problem:

We still see corrupted pending_ops in the Online deployment. For example, the output of clear-pending-ops says:

ERROR in cleaning up application's op because the type is nil. App uuid - 5217b47b4382ec8f25000223. Op - #<PendingAppOpGroup _id: 52362df6e0b8cd02df000001, _type: nil, created_at: nil, .. <truncated>

Version-Release number of selected component (if applicable):


How reproducible:
Seems to happen repeatedly with applications that see a lot of REST API volume.

Steps to Reproduce:
1. unknown

Actual results:
A pending_op becomes corrupted, possibly because Mongoid overwrites data at the wrong index.
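
To illustrate how a write at a stale index could leave a partial entry, here is a minimal sketch in plain Ruby. It does not use Mongoid or the broker code: the `set_by_index` helper and the sample data are hypothetical, and the helper only imitates a write addressed by array position. Whether this is the exact mechanism behind the corruption is an assumption.

```ruby
# Hypothetical in-memory stand-in for an application's pending_op_groups array.
# Not the broker's real model; it only imitates an index-addressed write.

# Imitates a write addressed by array position. If the slot no longer holds
# a document, a new partial document is created there.
def set_by_index(groups, index, field, value)
  groups[index] ||= {}
  groups[index][field] = value
end

groups = [
  { "_id" => "a", "_type" => "PendingAppOpGroup" },
  { "_id" => "b", "_type" => "PendingAppOpGroup" }
]

# Writer 1 looks up the position of group "b" (index 1) ...
stale_index = groups.index { |g| g["_id"] == "b" }

# ... while writer 2 finishes group "a" and removes it, shifting "b" to index 0.
groups.shift

# Writer 1 now writes with its stale index and creates a partial document.
set_by_index(groups, stale_index, "state", "completed")

corrupted = groups[stale_index]
puts corrupted.inspect
```

In this sketch the entry left at the stale index has no `_type`, while the real group "b" never receives its state change.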

Expected results:
No pending_op should ever be corrupted.

Additional info:
Previous debugging revealed that this happens when the lock invariant is violated. MCollective (mcoll) can sometimes take a long time (~15 minutes) to return. If that delay exceeds 30 minutes, the lock expires, leaving the broker thread free to write to MongoDB without holding a lock.
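
The timing hazard described above can be sketched with a simulated clock. The `ExpiringLock` class and `guarded_write` helper are hypothetical names for illustration, not the broker's lock API, and re-checking the lock before writing is only one possible guard, not necessarily what the broker does.

```ruby
# Hypothetical lock with an expiry time; the names here are illustrative and
# are not the broker's actual lock API.
class ExpiringLock
  def initialize(ttl_seconds, now)
    @expires_at = now + ttl_seconds
  end

  # True only while the lock has not yet expired.
  def held?(now)
    now < @expires_at
  end
end

# Refuses to write once the lock has expired instead of writing unprotected.
def guarded_write(lock, now)
  return :skipped_lock_expired unless lock.held?(now)
  yield
  :written
end

start = 0                                  # seconds on a simulated clock
lock = ExpiringLock.new(30 * 60, start)    # 30-minute expiry, as in the report
writes = []

# The node call returns after ~15 minutes: lock still held, write proceeds.
r1 = guarded_write(lock, start + 15 * 60) { writes << :op_completed }

# The node call returns after 31 minutes: lock has expired, write is refused.
r2 = guarded_write(lock, start + 31 * 60) { writes << :op_completed }

puts r1, r2
```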
Comment 1 Rajat Chopra 2013-09-18 00:58:59 EDT
Fixed with rev#ec4f4447e1b65e276f10c2695b9ba799835b7b5b in origin-server.
Atomic updates are now made to embedded documents.
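
As a rough illustration of why matching an embedded document by identity is safer than addressing it by position, here is a plain-Ruby sketch. The `set_by_id` helper and the sample data are hypothetical and do not reproduce the referenced commit. In MongoDB terms, the idea loosely corresponds to querying on the embedded document's `_id` and updating through the positional `$` operator.

```ruby
# Hypothetical in-memory stand-in for an id-matched update of an embedded
# document. Not taken from the origin-server commit.
def set_by_id(groups, id, field, value)
  target = groups.find { |g| g["_id"] == id }
  return false if target.nil?   # element already removed: write nothing
  target[field] = value
  true
end

groups = [
  { "_id" => "a", "_type" => "PendingAppOpGroup" },
  { "_id" => "b", "_type" => "PendingAppOpGroup" }
]

# Another writer removes group "a", shifting "b" to a new position.
groups.shift

# The update still reaches "b", whatever its current index is.
updated_b = set_by_id(groups, "b", "state", "completed")

# An update aimed at the removed group "a" is a no-op, not a partial document.
updated_a = set_by_id(groups, "a", "state", "completed")

puts groups.inspect
```

Here the array ends up with a single, complete entry for "b", and the late update for "a" changes nothing.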

This bug will be difficult to reproduce. We will have to wait for feedback from Online production to see whether it happens again.
Comment 2 Jianwei Hou 2013-09-29 21:44:24 EDT
This bug has not been reproduced on the QE side in recent days, so I'm moving it to verified.
