Bug 2172182

Summary: [Pulp-3] Orphan cleanup does not remove the artifact_id association from individual content units
Product: Red Hat Satellite
Component: Pulp
Version: 6.11.0
Status: NEW
Severity: high
Priority: medium
Reporter: Sayan Das <saydas>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Satellite QE Team <sat-qe-bz-list>
CC: dalley
Keywords: Triaged
Target Milestone: Unspecified
Target Release: Unused
Hardware: All
OS: All
Type: Bug
Doc Type: If docs needed, set a value

Description Sayan Das 2023-02-21 15:55:40 UTC
Description of problem:

While it is possible to get rid of all the repository information on a Capsule 6.13 server by removing all lifecycle environments from the capsule and running an orphan cleanup against it, the artifact_id remains associated with the individual content units at the database level.

Not only that, but at this point it is even possible to run the repair API against pulpcore on the capsule server, which reports as successfully completed but does not actually do anything at all.


Version-Release number of selected component (if applicable):

Satellite Capsule 6.13 (or any version of Satellite and Satellite Capsule running with Pulp 3)


How reproducible:

Always

Steps to Reproduce:
1. Install Satellite 6.13.
2. Enable the RHEL 8 BaseOS and AppStream repositories and sync them.
3. Configure the repos required for Capsule installation and associate a Capsule 6.13 system with the Satellite.
4. Create a PROD lifecycle environment on the Satellite and associate the same lifecycle with the capsule for content syncing.
5. Use the repos from Step 2 to create a CV called "RHEL8-CV", publish it, and promote the new version to the PROD lifecycle, e.g. as sketched below.
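
   (For reference, Steps 4-5 roughly correspond to the following hammer commands; the organization name, version number, and repository IDs are placeholders.)

    # hammer lifecycle-environment create --name "PROD" --prior "Library" --organization "MyOrg"
    # hammer content-view create --name "RHEL8-CV" --organization "MyOrg"
    # hammer content-view add-repository --name "RHEL8-CV" --organization "MyOrg" --repository-id <baseos-repo-id>
    # hammer content-view add-repository --name "RHEL8-CV" --organization "MyOrg" --repository-id <appstream-repo-id>
    # hammer content-view publish --name "RHEL8-CV" --organization "MyOrg"
    # hammer content-view version promote --content-view "RHEL8-CV" --version "1.0" --to-lifecycle-environment "PROD" --organization "MyOrg"
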
6. Wait for the automated capsule sync task to finish.
7. On the Satellite, configure pulp-cli with a profile named "proxy" to query pulp data from the external capsule server, and use the following command to list the repos synced to the capsule.
   # pulp --profile proxy repository list  --limit 9999 | jq .[].name
   
8. Register a RHEL 8.6 client system to the capsule server, with the "RHEL8-CV" content view and the "PROD" lifecycle environment, e.g. as below.
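
   (One way to do this, assuming the katello-ca-consumer RPM is still published on the capsule and that an activation key tied to "RHEL8-CV"/"PROD" exists; the organization and activation key names below are illustrative.)

    # rpm -Uvh http://capsule613.example.com/pub/katello-ca-consumer-latest.noarch.rpm
    # subscription-manager register --org="MyOrg" --activationkey="ak-rhel8cv-prod"
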
9. Execute "dnf clean all && dnf update --downloadonly -y" on the client system and it should download about 270+ packages. 
10. On capsule, Remove all the RPM type artifacts from the filesystem i.e. 

    # cd /var/lib/pulp/media/artifact/
    # file */* | grep RPM -c            ## NOTE down the count 
    # file */* | grep RPM | awk -F':' '{print $1}' | xargs rm -f
    # cd ~

11. Repeat Step 9 and notice the errors with the RPM downloads. Also check the pulp logs on the capsule server, which show that expected artifacts are missing from the filesystem.
12. Collect the output of the following command from the capsule:

    # echo "select ca.pulp_id,cca.artifact_id,ca.file,cca.relative_path from core_artifact ca LEFT JOIN core_contentartifact cca on cca.artifact_id = ca.pulp_id where ca.file = 'artifact/fe/3c5fe47fcde23b567759bc05dd0e8f294d6cb8997cd7c7c18072bc30fc1896';"   | su - postgres -c "psql -x pulpcore"

13. Use the hammer command from https://access.redhat.com/solutions/6685201 on the Satellite to reduce the orphan protection timeout to 3 minutes, as below.
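
    (Based on the command shown in Step 20, this is roughly:)

    # hammer settings set --name orphan_protection_time --value 3
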
14. Disassociate the PROD lifecycle from the capsule and ensure that no lifecycle environments remain selected to sync with the capsule.
15. Invoke a sync for the capsule server (which should finish in seconds). Both steps can be done with hammer, e.g. as below.
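
    (For reference, assuming capsule ID 2 as in Step 16; the organization name is a placeholder and option names may differ slightly between versions.)

    # hammer capsule content remove-lifecycle-environment --id 2 --organization "MyOrg" --lifecycle-environment "PROD"
    # hammer capsule content synchronize --id 2
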
16. Wait for 5 minutes (> the orphan protection timeout) and then execute this command on the Satellite to initiate orphan cleanup on the capsule:

    # SMART_PROXY_ID=2 foreman-rake katello:delete_orphaned_content RAILS_ENV=production --trace
    
17. Wait for the task to complete
18. Repeat the command from Step 7; the list of repos should now be empty, i.e. no repos are listed.
19. Repeat Step 12 on Capsule and check the output.
20. Execute the following on the Satellite to set the orphan protection timeout back to its default value, i.e.:

    # hammer settings set --name orphan_protection_time --value 1440
    
21. Try running the repair API on the capsule from the Satellite while monitoring the pulp logs on the capsule:

    # curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X POST https://capsule613.example.com/pulp/api/v3/repair/ | jq .
    # curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X GET https://capsule613.example.com/<task href> | jq .

22. Add the PROD lifecycle back to the Capsule server for syncing, initiate a "Complete Sync" (e.g. as below), and wait for the sync to complete successfully.
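
    (Assuming "Complete Sync" corresponds to skipping the optimized-sync metadata check; capsule ID, organization, and lifecycle environment name are as before.)

    # hammer capsule content add-lifecycle-environment --id 2 --organization "MyOrg" --lifecycle-environment "PROD"
    # hammer capsule content synchronize --id 2 --skip-metadata-check true
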
23. Repeat Step 9 on client. 
24. Repeat Step 21 on the capsule server.
25. Now for the final time, repeat Step 9 on the client host
 
Actual results:

Step 7: We will see more than 2 repo names/IDs.

Step 11: 

On client:
[MIRROR] kernel-modules-4.18.0-425.10.1.el8_7.x86_64.rpm: Status code: 500 for https://capsule613.example.com/pulp/content/RedHat/PROD/RHEL8/content/dist/rhel8/8/x86_64/baseos/os/Packages/k/kernel-modules-4.18.0-425.10.1.el8_7.x86_64.rpm (IP: 192.168.124.3)

In pulp logs of capsule:

Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:     return await self._match_and_stream(path, request)
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:   File "/usr/lib/python3.9/site-packages/pulpcore/content/handler.py", line 542, in _match_and_stream
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:     return await self._serve_content_artifact(ca, headers, request)
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:   File "/usr/lib/python3.9/site-packages/pulpcore/content/handler.py", line 815, in _serve_content_artifact
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:     raise Exception(_("Expected path '{}' is not found").format(path))
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]: Exception: Expected path '/var/lib/pulp/media/artifact/fe/3c5fe47fcde23b567759bc05dd0e8f294d6cb8997cd7c7c18072bc30fc1896' is not found
Feb 21 19:38:19 capsule613.example.com pulpcore-content[22961]:  [21/Feb/2023:14:08:19 +0000] "GET /pulp/content/RedHat/PROD/RHEL8/content/dist/rhel8/8/x86_64/baseos/os/Packages/k/kernel-modules-4.18.0-425.10.1.el8_7.x86_64.rpm HTTP/1.1" 500 244 "-" "libdnf (Red Hat Enterprise Linux 8.6; generic; Linux.x86_64)"


Steps 12 and 19:

# echo "select ca.pulp_id,cca.artifact_id,ca.file,cca.relative_path from core_artifact ca LEFT JOIN core_contentartifact cca on cca.artifact_id = ca.pulp_id where ca.file = 'artifact/fe/3c5fe47fcde23b567759bc05dd0e8f294d6cb8997cd7c7c18072bc30fc1896';"   | su - postgres -c "psql -x pulpcore"
-[ RECORD 1 ]-+---------------------------------------------------------------------------
pulp_id       | 4d9ffd03-e0a1-4a1b-b019-84239a295e7f
artifact_id   | 4d9ffd03-e0a1-4a1b-b019-84239a295e7f
file          | artifact/fe/3c5fe47fcde23b567759bc05dd0e8f294d6cb8997cd7c7c18072bc30fc1896
relative_path | kernel-modules-4.18.0-425.10.1.el8_7.x86_64.rpm

Step 21:

# curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X GET https://capsule613.example.com/pulp/api/v3/tasks/a0a2dec4-1da1-4955-828c-8539d4977dd6/ | jq .progress_reports
[
    {
      "message": "Identify missing units",
      "code": "repair.missing",
      "state": "completed",
      "total": null,
      "done": 278,
      "suffix": null
    },
    {
      "message": "Identify corrupted units",
      "code": "repair.corrupted",
      "state": "completed",
      "total": null,
      "done": 0,
      "suffix": null
    },
    {
      "message": "Repair corrupted units",
      "code": "repair.repaired",
      "state": "completed",
      "total": null,
      "done": 0,  --------> nothing was done but the task still shows completed
      "suffix": null
    }
]

# curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X GET https://capsule613.example.com/pulp/api/v3/tasks/a0a2dec4-1da1-4955-828c-8539d4977dd6/ | jq .state
"completed"


Step 24:

# curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X GET https://capsule613.example.com/pulp/api/v3/tasks/771c701d-6839-49d3-9ac4-56ee53ad76f4/ | jq .state
"completed"

# curl -s --cert /etc/foreman/client_cert.pem --key /etc/foreman/client_key.pem  -H "Content-Type: application/json" -X GET https://capsule613.example.com/pulp/api/v3/tasks/771c701d-6839-49d3-9ac4-56ee53ad76f4/ | jq .progress_reports
[
  {
    "message": "Identify missing units",
    "code": "repair.missing",
    "state": "completed",
    "total": null,
    "done": 278,
    "suffix": null
  },
  {
    "message": "Identify corrupted units",
    "code": "repair.corrupted",
    "state": "completed",
    "total": null,
    "done": 0,
    "suffix": null
  },
  {
    "message": "Repair corrupted units",
    "code": "repair.repaired",
    "state": "completed",
    "total": null,
    "done": 278,
    "suffix": null
  }
]


Step 25: Successful yum/dnf transaction on the client host


Expected results:

* If no lifecycle environments are associated with a capsule server and all orphaned content has been deleted after the orphan protection timeout, then all content unit information should be deleted from the database, or at least disassociated from the corresponding artifact_ids (see the query sketch below).

* The repair API should not work at all when no remotes (i.e. no syncable repos) are present on the capsule in question.
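
  (For reference, after a correct orphan cleanup, re-running the query from Step 12 on the capsule would be expected to return no populated record for that artifact, roughly:)

    # echo "select ca.pulp_id,cca.artifact_id,ca.file,cca.relative_path from core_artifact ca LEFT JOIN core_contentartifact cca on cca.artifact_id = ca.pulp_id where ca.file = 'artifact/fe/3c5fe47fcde23b567759bc05dd0e8f294d6cb8997cd7c7c18072bc30fc1896';"   | su - postgres -c "psql -x pulpcore"
    (0 rows)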


Additional info:

On 6.10 or 6.11, this scenario would be completely unrecoverable if someone does exactly what I did, or simply removes the entire artifacts directory from the capsule. Reason: missing modulemd metadata cannot be repaired by the repair API.

On 6.12+ it is still possible to get the artifacts back as expected, but users will additionally need to run the repair API on the capsule, which I (and end users) would very much like to be able to skip.

Comment 2 Daniel Alley 2023-02-28 20:22:00 UTC
>>>>>
On 6.10 or 6.11, this scenario would be completely unrecoverable if someone does exactly what I did, or simply removes the entire artifacts directory from the capsule. Reason: missing modulemd metadata cannot be repaired by the repair API.

On 6.12+ it is still possible to get the artifacts back as expected, but users will additionally need to run the repair API on the capsule, which I (and end users) would very much like to be able to skip.
>>>>>

Two sidenotes:

1) Keep in mind that published metadata is also stored in /var/lib/pulp/artifacts/ and can't be regenerated on an individual basis in an exact way - it can't be "repaired".

2) Possibly "orphan cleanup protection time" is no longer an ideal solution, given that the direction we would like to move in for other reasons (RBAC, etc.) is that uploaded content must be immediately added to a repo in order to set permissions on it appropriately. Uploading content independently of adding it to a repo conflicts with several features which we want to adopt.

Comment 3 Sayan Das 2023-03-01 08:58:40 UTC
> 1) Keep in mind that published metadata is also stored in /var/lib/pulp/artifacts/ and can't be regenerated on an individual basis in an exact way - it can't be "repaired".

Correct, and for that the very first action we take is to force a full sync of the repos and also republish the content-view version metadata. Once that is done, we go for "Validate Sync" or the "Repair" API to get back the other content, roughly as sketched below.
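
(On the Satellite, the force-full-sync and validate-sync parts roughly correspond to the following hammer commands; the repository ID is a placeholder and option names may vary slightly by version. The content-view version metadata republish is done via the corresponding content-view version action.)

    # hammer repository synchronize --id <repo-id> --skip-metadata-check true    ## force a full (non-optimized) sync
    # hammer repository synchronize --id <repo-id> --validate-contents true      ## "Validate Content Sync"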

That is why I mentioned that on 6.10 or 6.11 it is impossible to fix the issue, as modulemd metadata also used to be stored on the filesystem and cannot be recovered by the repair API. That is no longer a blocker on 6.12+.