Bug 2300323 - candlepin-4.4.12-1 DB upgrade gets stuck
Summary: candlepin-4.4.12-1 DB upgrade gets stuck
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Candlepin
Classification: Community
Component: candlepin
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Nikos Moumoulidis
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-07-29 10:31 UTC by Nikos Moumoulidis
Modified: 2024-09-03 14:35 UTC (History)

Fixed In Version: candlepin-4.4.14-1
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-09-03 14:35:48 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Apache JIRA | LANG-1748 | 0 | Major | Open | RandomStringUtils.random() drains the systems entropy pool and blocks | 2024-07-31 10:04:25 UTC
Github | candlepin candlepin pull 4792 | 0 | None | open | [M] CANDLEPIN-901: Added missing validCheckSum to the entity layering changeset | 2024-07-31 10:04:25 UTC
Github | candlepin candlepin pull 4793 | 0 | None | Merged | [M] CANDLEPIN-901: Rolled back to previous version of Apache Commons lang | 2024-07-31 10:04:25 UTC

Description Nikos Moumoulidis 2024-07-29 10:31:10 UTC
Description of problem:
Upstream users of foreman/katello/candlepin, during an upgrade from 4.4.10-1 to 4.4.12-1, notice that Candlepin takes too long or gets entirely stuck during startup (more specifically during the liquibase db migration step).

Version-Release number of selected component (if applicable):
candlepin-4.4.12-1

How reproducible:
Likely dependent on the size/shape of existing data in the DB. On a fresh deployment this is probably not reproducible.

Steps to Reproduce:
1. Have an upstream foreman/katello/candlepin setup with candlepin version 4.4.10-1.
2. Perform normal operations (it is unknown whether specific operations are required).
3. Upgrade candlepin to 4.4.12-1

Actual results:
Candlepin gets stuck in the DB migration on startup.

Expected results:
Candlepin should perform the DB migration on startup in a timely manner, without getting stuck.

Additional info:
Relevant posts: 
https://community.theforeman.org/t/feedback-for-foreman-3-11-katello-4-13/38375/29
https://community.theforeman.org/t/foreman-3-11-upgrade/38902
https://community.theforeman.org/t/upgrade-from-3-10-0-4-12-1-to-3-11-1-4-13-1-makes-foreman-unusable/38893/15
https://community.theforeman.org/t/status-code-403-for-error-failed-to-download-metadata-for-repo-after-upgrade-to-3-11/38601/26

Comment 1 Nikos Moumoulidis 2024-07-31 10:04:25 UTC
Copying here the RCA that Chris Rog posted on the Foreman community forum: https://community.theforeman.org/t/feedback-for-foreman-3-11-katello-4-13/38375/68

I’m pretty confident I’ve discovered the root cause here and it’s a bit of an adventure.
The immediate workaround for those of you hitting this will be to use a utility daemon like haveged (available from the EPEL repos), which populates the system with entropy for /dev/random.
We’ll be putting together a build that tries to work around this soon™ so that’s not necessary.
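For anyone wanting to check whether a host is short on entropy before installing such a daemon, here is a minimal sketch in Java. It assumes a Linux host that exposes the kernel's estimate at /proc/sys/kernel/random/entropy_avail; the class and method names are made up for illustration, and newer kernels may report a constant value there, so treat the number as a hint only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

public class EntropyCheck {
    // Linux-specific procfs entry holding the kernel's entropy estimate in bits.
    static final Path ENTROPY_AVAIL = Paths.get("/proc/sys/kernel/random/entropy_avail");

    /** Reads the entropy estimate from the given file; empty if it is missing or unparsable. */
    static OptionalInt availableEntropyBits(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.US_ASCII).trim();
            return OptionalInt.of(Integer.parseInt(text));
        } catch (IOException | NumberFormatException e) {
            // Not Linux, no procfs, or unexpected content: report "unknown".
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        OptionalInt bits = availableEntropyBits(ENTROPY_AVAIL);
        if (bits.isPresent()) {
            System.out.println("Kernel entropy estimate: " + bits.getAsInt() + " bits");
        } else {
            System.out.println("Entropy estimate not available on this system");
        }
    }
}
```

A persistently low value on an affected host would be consistent with the stall described in this bug, but it is not proof on its own.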
For those interested in the underlying details:
As established above, Candlepin uses Liquibase for managing its db schema. As part of the various operations it does while comparing the existing schema to the changesets it has, it occasionally generates random strings. It does so by calling RandomStringUtils.random(...) [1] from the Apache Commons lang3 library.

While this has generally been fine, recently-ish (v3.15+) [2], the folks managing the lang3 library updated RandomStringUtils to use the blocking SecureRandom implementation for generating these strings. This, unfortunately, is hard-coded to pull from /dev/random under the hood.
To Liquibase’s credit, their library depends on v3.14, from before this change to RandomStringUtils was made. However, Candlepin uses v3.15, which, due to dependency resolution magic and packaging, means Liquibase is also forced to use it, leading to this issue. It’s likely their devs aren’t aware of this yet, and/or will be addressing it in an upcoming build.
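To see how a given JVM is set up in this respect, the following sketch prints the security properties that govern where SecureRandom gets its randomness. It uses only JDK APIs and does not call into commons-lang3 or draw any random bytes, so it should not block. The class and method names are illustrative, and the link to this bug rests on the assumption that the blocking behaviour described above comes from a "strong" SecureRandom instance (for example one obtained via SecureRandom.getInstanceStrong()).

```java
import java.security.SecureRandom;
import java.security.Security;

public class SecureRandomSources {
    /** Returns the configured seed source, e.g. "file:/dev/random". */
    static String seedSource() {
        // The java.security.egd system property, when set, overrides the security property.
        String override = System.getProperty("java.security.egd");
        return override != null ? override : Security.getProperty("securerandom.source");
    }

    /** Returns the algorithm list that SecureRandom.getInstanceStrong() chooses from. */
    static String strongAlgorithms() {
        return Security.getProperty("securerandom.strongAlgorithms");
    }

    /** Returns the algorithm name behind a plain new SecureRandom(). */
    static String defaultAlgorithm() {
        return new SecureRandom().getAlgorithm();
    }

    public static void main(String[] args) {
        System.out.println("seed source:       " + seedSource());
        System.out.println("strong algorithms: " + strongAlgorithms());
        System.out.println("default algorithm: " + defaultAlgorithm());
    }
}
```

If the strong algorithm list starts with a blocking /dev/random-backed provider while the default algorithm is a non-blocking one, that would match the difference in behaviour between the two library versions described here.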

Our short-term fix is going to be to roll the dependency on Apache Commons lang3 back to v3.14, which removes the reliance on SecureRandom in this path. Unfortunately, because the issue is being hit multiple libraries away from our code in a way that isn’t configurable at all, this is our only immediate recourse, and I’d like to not incur a system-level dependency on some kind of RNG tooling to work around it.
So that’s where we’re at. My apologies to those who burned their time dealing with this, and I appreciate all the logs and details provided. It really helped narrow down the root cause.

[1] https://github.com/liquibase/liquibase/blob/e854a6e18c29651da0d51265a28c58f22ef3248b/liquibase-standard/src/main/java/liquibase/util/StringUtil.java#L771
[2] https://github.com/apache/commons-lang/blob/535ec32c680a6581739a41bf97ecb8b8718c73b8/src/main/java/org/apache/commons/lang3/RandomStringUtils.java#L29


The root cause has already been reported on the Apache bug tracker here: https://issues.apache.org/jira/browse/LANG-1748

Comment 2 Nikos Moumoulidis 2024-08-01 11:05:03 UTC
Re-opening this because the issue is not actually fixed. The fix was supposed to be the downgrade of the Apache commons-lang3 library from 3.15.0 to 3.14.0, which works fine in a development setup. Unfortunately, the internal build system we use does not respect the Gradle 'strictly' keyword (used to pin/downgrade a library to a specific version). We're working with the build team to fix this.
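Since the version that ends up on the classpath can differ between a development setup and a packaged build, one way to confirm what a given build actually ships is to ask the running JVM. The sketch below is a hypothetical helper, not part of Candlepin: it looks the class up reflectively (so it compiles and runs even without commons-lang3 present) and assumes the jar's manifest carries an Implementation-Version entry.

```java
import java.util.Optional;

public class Lang3VersionCheck {
    /** Returns the manifest Implementation-Version of the jar providing the class, if any. */
    static Optional<String> implementationVersion(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class<?> clazz = Class.forName(className, false, Lang3VersionCheck.class.getClassLoader());
            Package pkg = clazz.getPackage();
            return Optional.ofNullable(pkg == null ? null : pkg.getImplementationVersion());
        } catch (ClassNotFoundException e) {
            // The library is not on the classpath at all.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.lang3.RandomStringUtils";
        System.out.println(name + " -> "
                + implementationVersion(name).orElse("not found, or no version in manifest"));
    }
}
```

Run with the same classpath as the deployed application, a result of 3.15.0 would indicate that the pin to 3.14.0 was not honoured by the build.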

