Bug 2300323

Summary: candlepin-4.4.12-1 DB upgrade gets stuck
Product: [Community] Candlepin Reporter: Nikos Moumoulidis <nmoumoul>
Component: candlepin Assignee: Nikos Moumoulidis <nmoumoul>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.4 CC: redakkan, vchepkov, vogt
Target Milestone: --- Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: candlepin-4.4.14-1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-09-03 14:35:48 UTC Type: Bug

Description Nikos Moumoulidis 2024-07-29 10:31:10 UTC
Description of problem:
Upstream users of foreman/katello/candlepin, during an upgrade from 4.4.10-1 to 4.4.12-1, notice that Candlepin takes too long or gets entirely stuck during startup (more specifically during the liquibase db migration step).

Version-Release number of selected component (if applicable):
candlepin-4.4.12-1

How reproducible:
Likely dependent on the size/shape of existing data in the DB. On a fresh deployment this is probably not reproducible.

Steps to Reproduce:
1. Have an upstream foreman/katello/candlepin setup with candlepin version 4.4.10-1.
2. Perform normal operations (it is unknown whether specific operations are required).
3. Upgrade candlepin to 4.4.12-1

Actual results:
Candlepin gets stuck in the DB migration on startup.

Expected results:
Candlepin should perform the DB migration on startup in a timely manner, without getting stuck.

Additional info:
Relevant posts: 
https://community.theforeman.org/t/feedback-for-foreman-3-11-katello-4-13/38375/29
https://community.theforeman.org/t/foreman-3-11-upgrade/38902
https://community.theforeman.org/t/upgrade-from-3-10-0-4-12-1-to-3-11-1-4-13-1-makes-foreman-unusable/38893/15
https://community.theforeman.org/t/status-code-403-for-error-failed-to-download-metadata-for-repo-after-upgrade-to-3-11/38601/26

Comment 1 Nikos Moumoulidis 2024-07-31 10:04:25 UTC
Copying here the RCA that Chris Rog posted on the Foreman community forum: https://community.theforeman.org/t/feedback-for-foreman-3-11-katello-4-13/38375/68

I’m pretty confident I’ve discovered the root cause here and it’s a bit of an adventure.
The immediate workaround for those of you hitting this is to use a utility daemon like haveged (available from the EPEL repos), which populates the system with entropy for /dev/random.
We’ll be putting together a build that tries to work around this soon™ so that’s not necessary.
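To judge whether a given host is actually short on entropy before installing such a daemon, the kernel's own estimate can be read. The sketch below assumes a Linux host that exposes /proc/sys/kernel/random/entropy_avail; the class and method names are hypothetical and not part of Candlepin.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalInt;

public class EntropyCheck {
    // Linux procfs entry holding the kernel's current entropy estimate, in bits.
    // Assumption: the host is Linux; elsewhere this file does not exist.
    static final Path ENTROPY_AVAIL = Path.of("/proc/sys/kernel/random/entropy_avail");

    /** Returns the entropy estimate stored in the given file, or empty if it cannot be read or parsed. */
    static OptionalInt readEntropy(Path source) {
        try {
            return OptionalInt.of(Integer.parseInt(Files.readString(source).trim()));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        OptionalInt bits = readEntropy(ENTROPY_AVAIL);
        if (bits.isPresent()) {
            System.out.println("entropy_avail = " + bits.getAsInt() + " bits");
        } else {
            System.out.println("entropy estimate not available on this host");
        }
    }
}
```

A persistently low value while Candlepin is stuck in the migration would be consistent with the diagnosis; how the estimate behaves depends on the kernel version, so treat it as a hint rather than proof.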
For those interested in the underlying details:
As established above, Candlepin uses Liquibase for managing its DB schema. As part of the various operations Liquibase performs while comparing the existing schema to its changesets, it occasionally generates random strings. It does so by calling RandomStringUtils.random(...) [1] from the Apache Commons lang3 library.

While this has generally been fine, recently-ish (v3.15+) [2] the folks managing the lang3 library updated RandomStringUtils to use the blocking SecureRandom implementation for generating these strings. This, unfortunately, is hard-coded to pull from /dev/random under the hood.
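To see the two kinds of random source side by side, here is a minimal sketch. It assumes the blocking behavior corresponds to the JDK's "strong" SecureRandom instance, which on Linux is typically backed by /dev/random, while the default instance typically reads from /dev/urandom. RandomSourceDemo and randomAlphanumeric are hypothetical names for illustration, not the lang3 implementation.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Random;

public class RandomSourceDemo {
    private static final String ALPHANUMERIC =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    /** Hypothetical helper: builds a random alphanumeric string from the given source. */
    static String randomAlphanumeric(Random source, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHANUMERIC.charAt(source.nextInt(ALPHANUMERIC.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Default instance: the platform's preferred, normally non-blocking generator.
        SecureRandom regular = new SecureRandom();
        // "Strong" instance: selected via the securerandom.strongAlgorithms
        // security property; this is the kind that can block on a low-entropy host.
        SecureRandom strong = SecureRandom.getInstanceStrong();

        // Only the algorithm names are printed for the strong instance, so this
        // demo does not itself draw bytes from a potentially blocking source.
        System.out.println("default: " + regular.getAlgorithm());
        System.out.println("strong:  " + strong.getAlgorithm());
        System.out.println("sample:  " + randomAlphanumeric(regular, 10));
    }
}
```

The algorithm names printed depend on the JDK and operating system, so the output is a way to check what a particular host would use rather than a fixed result.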
To Liquibase’s credit, their library depends on v3.14, from before this change to RandomStringUtils was made. However, Candlepin uses v3.15, and because of the dependency resolution magic and packaging, Liquibase is forced to use it as well, leading to this issue. It’s likely their devs aren’t aware of this yet, and/or will be addressing it in an upcoming build.

Our short-term fix is going to be to roll the dependency on Apache Commons lang3 back to v3.14, which removes the reliance on SecureRandom in this path. Unfortunately, because the issue is being hit multiple libraries away from our code in a way that isn’t configurable at all, this is our only immediate recourse, and I’d like to avoid incurring a system-level dependency on some kind of RNG tooling to work around it.
So that’s where we’re at. My apologies to those who burned their time dealing with this, and I appreciate all the logs and details provided. It really helped narrow down the root cause.

[1] https://github.com/liquibase/liquibase/blob/e854a6e18c29651da0d51265a28c58f22ef3248b/liquibase-standard/src/main/java/liquibase/util/StringUtil.java#L771
[2] https://github.com/apache/commons-lang/blob/535ec32c680a6581739a41bf97ecb8b8718c73b8/src/main/java/org/apache/commons/lang3/RandomStringUtils.java#L29


The root-cause issue has already been reported on the Apache bug tracker: https://issues.apache.org/jira/browse/LANG-1748

Comment 2 Nikos Moumoulidis 2024-08-01 11:05:03 UTC
Re-opening this because the issue is not actually fixed. The fix was supposed to be downgrading the Apache commons-lang3 library from 3.15.0 to 3.14.0. That works fine in a development setup, but unfortunately the internal build system we use does not respect the Gradle 'strictly' keyword (used to pin/downgrade a library to a specific version). We're working with the build team to fix this.
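
One way to confirm which commons-lang3 version a given build actually ships is to ask the JVM where the class was loaded from. This is a rough sketch: Lang3VersionCheck is a hypothetical helper, it assumes the jar's manifest carries an Implementation-Version entry, and it has to run with the same classpath as the packaged Candlepin for the answer to mean anything.

```java
import java.security.CodeSource;
import java.util.Optional;

public class Lang3VersionCheck {
    /** Returns "version @ location" for the named class, or empty if it is not on the classpath. */
    static Optional<String> describe(String className) {
        try {
            Class<?> cls = Class.forName(className);
            Package pkg = cls.getPackage();
            // Read from the jar manifest; null when the manifest has no such entry.
            String version = pkg == null ? null : pkg.getImplementationVersion();
            // Null for classes loaded by the bootstrap class loader.
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String location = src == null ? "unknown location" : String.valueOf(src.getLocation());
            return Optional.of((version == null ? "unknown version" : version) + " @ " + location);
        } catch (ClassNotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("org.apache.commons.lang3.RandomStringUtils")
            .orElse("commons-lang3 is not on the classpath"));
    }
}
```

If the reported version is still 3.15.0 in a build that was meant to pin 3.14.0, the downgrade did not take effect in that build.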