Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1600695

Summary:

Hypervisor Update jobs can fail due to PersistentObjectException

Product:

[Community] Candlepin (Migrated to Jira)

Reporter:

Shayne Riley <sriley>

Component:

candlepin

Assignee:

Alex Wood <awood>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Katello QA List <katello-qa-list>

Severity:

high

Docs Contact:

Priority:

high

Version:

2.3

CC:

asakpal, awood, khowell, redakkan, skallesh

Target Milestone:

---

Keywords:

Triaged

Target Release:

2.3

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

candlepin-2.3.9-1

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1608033, 1608034

Environment:

Last Closed:

2018-07-27 14:23:41 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1608033, 1608034

Attachments:

Description	Flags
Exception log of a case when a hypervisor checkin fails.	none

Description Shayne Riley 2018-07-12 19:29:50 UTC

Description of problem:
On the Customer Portal, some customers are reporting that their hypervisors' last checkin dates are older than expected. Digging into the cause, the virt-who checkins appear to be fine, but on the Candlepin side, the job's status is "FAILED".

Digging further, the abbreviated cause of the job failures is:
org.quartz.JobExecutionException: javax.persistence.RollbackException: Error while committing the transaction
Caused by: javax.persistence.PersistenceException: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
Caused by: org.hibernate.PersistentObjectException: detached entity passed to persist: org.candlepin.model.GuestId
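When triaging many failed jobs, it can help to pull the innermost "Caused by:" entry out of each trace automatically. The following is a minimal sketch, not part of Candlepin; the function name root_cause is hypothetical, and it assumes each trace is available as a list of lines shaped like the ones above.

```python
import re

# Matches lines such as "Caused by: some.pkg.SomeException: message text".
CAUSED_BY = re.compile(r"^Caused by: (?P<cls>[\w.$]+): (?P<msg>.*)$")

def root_cause(trace_lines):
    """Return (exception_class, message) from the last 'Caused by:' line, or None."""
    cause = None
    for line in trace_lines:
        match = CAUSED_BY.match(line.strip())
        if match:
            # Later "Caused by:" lines are deeper in the chain, so keep overwriting.
            cause = (match.group("cls"), match.group("msg"))
    return cause

trace = [
    "org.quartz.JobExecutionException: javax.persistence.RollbackException: "
    "Error while committing the transaction",
    "Caused by: javax.persistence.PersistenceException: "
    "org.hibernate.PersistentObjectException: detached entity passed to persist: "
    "org.candlepin.model.GuestId",
    "Caused by: org.hibernate.PersistentObjectException: "
    "detached entity passed to persist: org.candlepin.model.GuestId",
]

print(root_cause(trace))
```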

Version-Release number of selected component (if applicable):
Candlepin 2.3.8


How reproducible:
Only some orgs are affected, and for some of those orgs the hypervisor checkin jobs still occasionally succeed.


Steps to Reproduce:
Details to follow. It involves submitting an asynchronous hypervisor checkin.


Actual results:
Hypervisor checkin fails.


Expected results:
Hypervisor checkin succeeds.


Additional info:
Logs indicate this became an issue on 2018-07-10 at 8:00pm EDT, which is when we went live with Candlepin 2.3.8.

Comment 1 Shayne Riley 2018-07-12 19:43:17 UTC

Created attachment 1458524
Exception log of a case when a hypervisor checkin fails.

Comment 2 Shayne Riley 2018-07-13 19:30:27 UTC

Candlepin in PROD is now rolled back to 2.2.3, the version we were on before, which doesn't have the issue.

So far, it looks like the hypervisor checkin jobs are completing successfully. Hypervisors should start showing on the Customer Portal with up-to-date information.

Comment 3 Shayne Riley 2018-07-30 20:08:23 UTC

Based on awood's commit
https://github.com/candlepin/candlepin/commit/e108ffff59902d483d141fab22f7c8c9b6ed3b10, which adds an rspec test for reproducing this error, we were able to reproduce it on 2.3.8 as well:

1. Submit a hypervisor checkin with two hosts. Each host should contain two guests apiece.
2. Wait for the job to complete. It should succeed.
3. Swap the guests in the two hosts and submit another checkin. That is, the guests that were in host1 should now be in host2, and the guests that were in host2 should now be in host1.
4. Wait for the job to complete.
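
The two checkins in the steps above can be sketched as data. This is only an illustration of the guest swap: the helper names (build_checkin, swap_guests), the host and guest IDs, and the payload field names are hypothetical and are not taken from Candlepin's API or from the rspec test.

```python
import json

def build_checkin(hosts):
    """Build an illustrative checkin payload from a {host_id: [guest_id, ...]} mapping."""
    return {
        "hypervisors": [
            {"hypervisorId": host, "guestIds": list(guests)}
            for host, guests in hosts.items()
        ]
    }

def swap_guests(hosts, a, b):
    """Return a new mapping in which hosts a and b have exchanged their guest lists."""
    swapped = dict(hosts)
    swapped[a], swapped[b] = hosts[b], hosts[a]
    return swapped

# Step 1: two hosts with two guests apiece.
first = {"host1": ["guest1a", "guest1b"], "host2": ["guest2a", "guest2b"]}
# Step 3: the same two hosts with their guests swapped.
second = swap_guests(first, "host1", "host2")

first_payload = json.dumps(build_checkin(first))
second_payload = json.dumps(build_checkin(second))
```

Submitting first_payload, waiting for the job, and then submitting second_payload corresponds to steps 1 through 4.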

Actual results (in Candlepin 2.3.8):
The second job fails.

Expected results:
The second job succeeds.

Additional notes:
- If this doesn't fail for you, and you're running 2.3.8, try repeating steps 3 and 4 multiple times. About 90% of the time I didn't need to repeat the steps; when I did, the failure usually appeared after the first or second repeat.
- The checkins will always pass in production because we aren't running Candlepin 2.3.8 now.
- I didn't modify the test much, beyond: a) each host has two guests instead of one, and b) the hosts swap their guests, rather than what the test originally does, which is move host2's guest onto host1. With the test's original configuration, I was only able to reproduce the error 30% of the time.
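
Repeating steps 3 and 4 until a job fails can be automated with a small loop. This is a minimal sketch under stated assumptions: submit is a hypothetical callable that sends a checkin for the given host-to-guests mapping, waits for the job to finish, and returns True if the job succeeded. It is not a Candlepin API, and max_rounds is an arbitrary limit.

```python
def repeat_until_failure(submit, hosts, a, b, max_rounds=5):
    """Swap the guests of hosts a and b and resubmit until a job fails.

    Returns the 1-based round in which the job failed, or None if every
    round up to max_rounds succeeded.
    """
    current = dict(hosts)
    for round_no in range(1, max_rounds + 1):
        # Step 3: swap the guests between the two hosts.
        current[a], current[b] = current[b], current[a]
        # Step 4: submit and wait; submit() is assumed to block until done.
        if not submit(dict(current)):
            return round_no
    return None
```

For example, a stub submit that starts returning False on its second call would make the function return 2.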