
Bug 1600695

Summary: Hypervisor Update jobs can fail due to PersistentObjectException
Product: [Community] Candlepin (Migrated to Jira)
Reporter: Shayne Riley <sriley>
Component: candlepin
Assignee: Alex Wood <awood>
Status: CLOSED CURRENTRELEASE
QA Contact: Katello QA List <katello-qa-list>
Severity: high
Priority: high
Version: 2.3
CC: asakpal, awood, khowell, redakkan, skallesh
Keywords: Triaged
Target Release: 2.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: candlepin-2.3.9-1
Last Closed: 2018-07-27 14:23:41 UTC
Type: Bug
Bug Blocks: 1608033, 1608034
Attachments: Exception log of a case when a hypervisor checkin fails. (Flags: none)

Description Shayne Riley 2018-07-12 19:29:50 UTC
Description of problem:
On the Customer Portal, some customers are reporting that their hypervisors' last checkin dates are older than expected. The virt-who checkins themselves appear to be fine, but on the Candlepin side the resulting jobs end with status "FAILED".

The abbreviated cause of the job failures is:
org.quartz.JobExecutionException: javax.persistence.RollbackException: Error while committing the transaction
Caused by: javax.persistence.PersistenceException: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
Caused by: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
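
When triaging many failed job logs, it can help to pull out only the innermost "Caused by:" entry, since that is where the Hibernate message appears in the trace above. The following is a minimal sketch, not part of Candlepin; the function name root_cause is hypothetical, and it assumes each cause sits on its own line as shown in this report.

```python
def root_cause(trace):
    """Return (exception_class, message) for the last 'Caused by:' line.

    Returns None when the trace contains no 'Caused by:' line.
    """
    cause = None
    for line in trace.splitlines():
        line = line.strip()
        if line.startswith("Caused by:"):
            # Keep overwriting so the innermost (last) cause wins.
            cause = line[len("Caused by:"):].strip()
    if cause is None:
        return None
    # Split on the first ': ' only; the message itself may contain colons.
    exc_class, _, message = cause.partition(": ")
    return exc_class, message


# Sample input: the abbreviated trace quoted in this report.
trace = """\
org.quartz.JobExecutionException: javax.persistence.RollbackException: Error while committing the transaction
Caused by: javax.persistence.PersistenceException: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
Caused by: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
"""

print(root_cause(trace))
```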

Version-Release number of selected component (if applicable):
Candlepin 2.3.8


How reproducible:
Only affects some orgs, and even in those orgs the hypervisor checkin jobs occasionally succeed.


Steps to Reproduce:
Details to follow. It involves submitting an asynchronous hypervisor checkin.


Actual results:
Hypervisor checkin fails.


Expected results:
Hypervisor checkin succeeds.


Additional info:
Logs indicate this became an issue 2018-07-10 at 8:00pm EDT, which is when we went live with Candlepin 2.3.8

Comment 1 Shayne Riley 2018-07-12 19:43:17 UTC
Created attachment 1458524
Exception log of a case when a hypervisor checkin fails.

Comment 2 Shayne Riley 2018-07-13 19:30:27 UTC
Candlepin in PROD is now rolled back to 2.2.3, the version we were on before, which doesn't have the issue.

So far, it looks like the hypervisor checkin jobs are completing successfully. Hypervisors should start showing on the Customer Portal with up-to-date information.

Comment 3 Shayne Riley 2018-07-30 20:08:23 UTC
Based on awood's commit
https://github.com/candlepin/candlepin/commit/e108ffff59902d483d141fab22f7c8c9b6ed3b10, which adds an rspec test for reproducing this error, we were able to reproduce it on 2.3.8 as well:

1. Submit a hypervisor checkin with two hosts. Each host should contain two guests apiece.
2. Wait for the job to complete. It should succeed.
3. Swap the guests in the two hosts and submit another checkin. That is, the guests that were in host1 should now be in host2, and the guests that were in host2 should now be in host1.
4. Wait for the job to complete.

Actual results (in Candlepin 2.3.8):
The second job fails.

Expected results:
The second job succeeds.

Additional notes:
- If this doesn't fail for you, and you're running 2.3.8, try repeating steps 3 and 4 multiple times. 90% of the time, I didn't need to repeat the steps, and if I did, it was usually caught after the first or second repeat.
- The checkins will always pass in production because we aren't running Candlepin 2.3.8 now.
- I didn't modify the test much, beyond: a) each host has two guests instead of one, and b) the hosts swap their guests, rather than what the test originally does, which is move host2's guest onto host1. With that configuration, my tests were only able to reproduce the error 30% of the time.
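
For anyone scripting the reproduction in this comment, the two checkin bodies can be sketched as follows. This is only an illustration under stated assumptions: the JSON layout ("hypervisors", "hypervisorId", "guestIds", "guestId") is my assumption about the asynchronous checkin format and should be checked against the rspec test in the commit above; the host and guest IDs and the helper names build_checkin and swap_guests are hypothetical. The sketch only builds the payloads and does not submit them.

```python
import json


def build_checkin(hosts):
    """Build a checkin body from a {hypervisor_id: [guest_id, ...]} mapping.

    The field names are an assumed layout, not confirmed by this report.
    """
    return {
        "hypervisors": [
            {
                "hypervisorId": {"hypervisorId": host_id},
                "guestIds": [{"guestId": guest} for guest in guests],
            }
            for host_id, guests in hosts.items()
        ]
    }


def swap_guests(hosts, a, b):
    """Return a copy of the mapping with the guest lists of a and b exchanged."""
    swapped = dict(hosts)
    swapped[a], swapped[b] = hosts[b], hosts[a]
    return swapped


# Step 1: two hosts, two guests apiece (all IDs are made up).
first = {"host1": ["guest-a", "guest-b"], "host2": ["guest-c", "guest-d"]}
# Step 3: the guests of host1 and host2 trade places.
second = swap_guests(first, "host1", "host2")

first_body = json.dumps(build_checkin(first))
second_body = json.dumps(build_checkin(second))
```

To repeat steps 3 and 4, call swap_guests again on the most recent mapping and submit the resulting body after the previous job has completed.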