Bug 2006307
| Summary: | Pulp2-Pulp3 migration failed due to mongodb oom errors | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Stephen Wadeley <swadeley> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED NOTABUG | QA Contact: | Stephen Wadeley <swadeley> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9.0 | CC: | ggainey, jjeffers, jsherril, mdepaulo, osousa, ttereshc |
| Target Milestone: | 6.9.7 | Keywords: | Regression, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-07 10:24:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Stephen Wadeley
2021-09-21 13:20:42 UTC
I stood up a pulpcore-3.7.8 / pulp-2to3-0.11.5 / pulp-2.21.5 box on a VM with 15 GB of memory, issuing the following steps (shell history numbers included):

```
711  pulp-admin login -u admin -p admin
713  ./rhel8_setup.bsh
716  pulp migration plan create --plan '{"plugins": [{"type": "rpm"}]}'
729  pulp migration plan run --href /pulp/api/v3/migration-plans/9299eeba-81e2-4863-bfee-963e5a2b2ab6/
```

The script "rhel8_setup" syncs RHEL8 BaseOS and Kickstart into pulp2:

```bash
#!/bin/bash -v
BASE='rhel8-baseos'
STREAM='rhel8-appstream'
KS='rhel8-ks'
REMOTE1='http://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/'
REMOTE2='https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/kickstart/'
DEST='destination'

pulp-admin rpm repo create --repo-id=$BASE --relative-url=$BASE --feed=$REMOTE1 \
  --download-policy on_demand \
  --generate-sqlite false --repoview false \
  --feed-ca-cert /home/vagrant/devel/pulp_startup/CDN_cert/redhat-uep.pem \
  --feed-cert /home/vagrant/devel/pulp_startup/CDN_cert/cdn.crt \
  --feed-key /home/vagrant/devel/pulp_startup/CDN_cert/cdn.key

pulp-admin rpm repo sync run --repo-id=$BASE

pulp-admin rpm repo create --repo-id=$KS --relative-url=$KS --feed=$REMOTE2 \
  --download-policy on_demand \
  --generate-sqlite false --repoview false \
  --feed-ca-cert /home/vagrant/devel/pulp_startup/CDN_cert/redhat-uep.pem \
  --feed-cert /home/vagrant/devel/pulp_startup/CDN_cert/cdn.crt \
  --feed-key /home/vagrant/devel/pulp_startup/CDN_cert/cdn.key

pulp-admin rpm repo sync run --repo-id=$KS

pulp-admin repo list
```

After syncing into Pulp2, the machine was at 7.5 GB used. Across the migration, that rose to ~10.1 GB used. The migration completed successfully.

Questions:

* What versions of pulpcore/pulp-2to3/pulp2 were running?
* When "free -h" was run pre-migration, what were the results?
* When you say "oom failures between each step" - does that mean steps 2, 4, and 6 all failed with OOM?
  (If so - something is def Not Right on that machine, since pulp2 hasn't changed, and has been able to sync RHEL8/RHEL8-KS just fine in that environment.)
* What are the journalctl entries from the minute before to the minute after the OOM report?
* Why do we think this was a pulp problem? If you run a pulp2 system out of memory, for **any** reason, OOMKiller is probably going to choose Mongo to shoot, because it's the largest single-process memory use.

The foreman log doesn't seem to have anything useful in this case - I can't even tell from it when the problem was encountered :(

(In reply to Grant Gainey from comment #3)
> Questions:
> * What versions of pulpcore/pulp-2to3/pulp2 were running?

```
python3-pulpcore-3.7.8-1.el7pc.noarch
tfm-rubygem-pulpcore_client-3.7.1-1.el7sat.noarch
python3-pulp-2to3-migration-0.11.4-1.el7pc.noarch
pulp-client-1.0-2.noarch
pulp-katello-1.0.3-1.el7sat.noarch
pulp-maintenance-2.21.5.2-1.el7sat.noarch
pulp-rpm-plugins-2.21.5.1-1.el7sat.noarch
pulp-server-2.21.5.2-1.el7sat.noarch
```

> * When "free -h" was run pre-migration, what were the results?

```
~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            19G         11G        6.9G         83M        927M        7.4G
Swap:            0B          0B          0B
```

> * When you say "oom failures between each step" - does that mean steps 2, 4,
> and 6 all failed with OOM? (If so - something is def Not Right on that
> machine, since pulp2 hasn't changed, and has been able to sync
> RHEL8/RHEL8-KS just fine in that environment)

There were oom failures after step 6 in my testing, but in comment 0 I am putting in extra checks so we can track any leaks.

> * What are the journalctl entries from the minute before to the minute after
> the OOM report?

I no longer have that system, I will retest to get you that info.

> * Why do we think this was a pulp problem? If you run a pulp2 system out of
> memory, for **any** reason, OOMKiller is probably going to choose Mongo to
> shoot, because it's the largest single-process memory use.
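The "extra checks so we can track any leaks" between steps could take the shape of a small wrapper that records available memory before and after each pulp-admin step. This is a hypothetical sketch, not from the bug report: the script name, log path, and function names are invented for illustration; it reads MemAvailable from /proc/meminfo so numbers are comparable across steps.

```shell
#!/bin/bash
# mem_check.sh - hypothetical helper to track memory across migration steps.

# Current available memory in kB, as estimated by the kernel.
mem_available_kb() {
    awk '/^MemAvailable:/ {print $2}' /proc/meminfo
}

# Run a step and append a before/after memory snapshot to a log.
log_step() {
    local label="$1"; shift
    local before after
    before=$(mem_available_kb)
    "$@"                                    # the actual step, e.g. a pulp-admin command
    after=$(mem_available_kb)
    echo "${label}: MemAvailable ${before} kB -> ${after} kB" >> /tmp/mem.log
}

# Example usage (repo id taken from the reproducer script above):
# log_step "sync-baseos" pulp-admin rpm repo sync run --repo-id=rhel8-baseos
```

A steadily shrinking "after" value across steps would point at a leak rather than a one-off spike.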
interesting, let me reproduce on latest snap to get more logs

thank you

The log does help, yes thanks!

The machine from the log has no swap, and still only 20 GB of memory:

```
[root@host ~]# free
              total        used        free      shared  buff/cache   available
Mem:       20379856    16112344      185672      209452     4081840     3703988
Swap:             0           0           0
```

I see a services-restart happening at 08:22:07, followed by kernel-thread1 invoking OOMKiller at 08:22:28:

```
Sep 29 08:22:28 host kernel: thread1 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
```

with Mongo being chosen and killed. I don't see a single pulp task started yet - the workers came up, but mongo died before I see any "task started" messages.

At the point OOMKiller is invoked, it looks like there are only ~151 MB of memory free (assuming page size is 4K; the log shows free:37818). Also, we're still running without swap, it looks like - is that intended?

```
Sep 29 08:22:28 host kernel: 0 pages in swap cache
Sep 29 08:22:28 host kernel: Swap cache stats: add 0, delete 0, find 0/0
Sep 29 08:22:28 host kernel: Free swap  = 0kB
Sep 29 08:22:28 host kernel: Total swap = 0kB
```

MikeDep333 has added swap and restarted services, we'll see how it goes. So far, I don't see anything Pulp can be active on in this report.

(In reply to Grant Gainey from comment #6)
> The log does help, yes thanks!

good

> The machine from the log has no swap, and still only 20Gb of memory:

that is the default SatLab test system, and RAM is as per spec in docs; see bottom of comment 0

> Also, we're still running without swap, it looks like - is that intended?

I only discovered this SatLab VM shortcoming last week when working on this; the SatLab team will fix that in the coming sprint.

> MikeDep333 has added swap and restarted services, we'll see how it goes. So
> far, I don't see anything Pulp can be active on in this report.
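The "~151 MB free" estimate above comes from the free-page count in the kernel OOM report. As a quick sketch of the arithmetic (assuming the usual 4 kB x86_64 page size and the free:37818 value quoted from the log):

```shell
# Convert the OOM report's free-page count into kB and MB.
free_pages=37818     # "free:37818" from the kernel OOM report
page_kb=4            # default x86_64 page size in kB (the assumption in the comment)
free_kb=$((free_pages * page_kb))
echo "free: ${free_kb} kB (~$((free_kb / 1000)) MB)"
```

which prints `free: 151272 kB (~151 MB)` - far below what mongod needs to stay resident on a box this size.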
OK, and if swap proves to be crucial, we should file a PR against the docs to make this line stronger: "In addition, a minimum of 4 GB RAM of swap space is also recommended." Note it only says "recommended".

thank you
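For reference, adding the recommended swap on a VM like this one is a one-time setup along these lines. This is a sketch of the standard swap-file procedure, not commands taken from the bug report; the /swapfile path and the 4 GB size are assumptions matching the docs line quoted above. Run as root:

```shell
# Create and enable a 4 GB swap file.
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile            # swap files must not be readable by other users
mkswap /swapfile               # format the file as swap space
swapon /swapfile               # enable it immediately
echo '/swapfile none swap defaults 0 0' >> /etc/fstab   # persist across reboots
free -h                        # the Swap line should now show 4.0G total
```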