1904681 – "fetch the archive" play from tripleo-transfer is not reliable for huge files

Keywords:
Status:	CLOSED DUPLICATE of bug 1916162
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	tripleo-ansible
Sub Component:
Version:	16.1 (Train)
Hardware:	All
OS:	All
Priority:	medium
Severity:	high
Target Milestone:	z4
Target Release:	16.1 (Train on RHEL 8.2)
Assignee:	Jose Luis Franco
QA Contact:	Joe H. Rahme
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-05 14:42 UTC by Alex Stupnikov
Modified:	2024-03-25 17:24 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-21 16:46:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1908425	0	None	None	None	2020-12-16 16:57:25 UTC
OpenStack gerrit	767177	0	None	NEW	Do not use fetch with become in transfer_data role	2021-02-18 16:16:58 UTC

Description Alex Stupnikov 2020-12-05 14:42:43 UTC

Description of problem:

During a RHOSP 13 -> 16.1 upgrade of a RHOSP deployment with a separate Database control role, the customer faced a situation where command [1] failed because of a timeout. When we analyzed the Python logs, it turned out that the "fetch the archive" play had been initiated but was still in progress after a few hours.

On the director node we saw an active ansible-playbook process that used ~2GB of RAM but wasn't actually doing anything. Our first assumption was that the DB archive was too big, but it was only ~7.2GB and there was plenty of free space on both the controller node and the director node.

The problem was reproduced on the second run, so I modified /usr/share/ansible/roles/tripleo-transfer/tasks/main.yml: I removed the "become" parameter from the fetch play and adjusted the related permissions. On the next run we were able to overcome the limitation.

It appears to be known to some extent that fetch plays use a different mechanism and create extra load when the "become" parameter is used. I am not sure what kind of solution should be applied here, so I kindly ask the developers to check.

[1]
"openstack overcloud external-upgrade run --stack overcloud --tags system_upgrade_transfer_data -y"

Comment 1 Jose Luis Franco 2020-12-14 17:02:54 UTC

Hello Alex,

Your modification was correct, as stated in the Ansible docs:

https://docs.ansible.com/ansible/latest/collections/ansible/builtin/fetch_module.html

When running fetch with become, the ansible.builtin.slurp module will also be used to fetch the contents of the file for determining the remote checksum. This effectively doubles the transfer size, and depending on the file size can consume all available memory on the remote or local hosts causing a MemoryError. Due to this it is advisable to run this module without become whenever possible.

I accessed PSI's Undercloud and retrieved the modifications that were made to the role tasks:

[jfrancoa@localhost tripleo-ansible]$ diff roles/tripleo-transfer/tasks/main.yml ~/tripleo-transfer-main.yml
1,17d0
< ---
< # Copyright 2019 Red Hat, Inc.
< # All Rights Reserved.
< #
< # Licensed under the Apache License, Version 2.0 (the "License"); you may
< # not use this file except in compliance with the License. You may obtain
< # a copy of the License at
< #
< #     http://www.apache.org/licenses/LICENSE-2.0
< #
< # Unless required by applicable law or agreed to in writing, software
< # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
< # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
< # License for the specific language governing permissions and limitations
< # under the License.
< 
< 
33c16,18
<     mode: 0700
---
>     # changed just in case:
>     #mode: 0700
>     mode: 0777
57a43,50
> - name: change ownership of archive
>   # added this play just in case
>   file:
>     name: "{{ tripleo_transfer_tempfile.path }}"
>     mode: 0777
>   become: "{{ tripleo_transfer_src_become }}"
>   delegate_to: "{{ tripleo_transfer_src_host }}"
> 
58a52
>   # we removed become here because of known issue with fetch and become (check BZ)
63d56
<   become: "{{ tripleo_transfer_src_become }}"
103a97,99
>   # (Added by Alex) added become and delegate_to because of selinux failure
>   become: "{{ tripleo_transfer_src_become }}"
>   delegate_to: localhost
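
Read together, the hunks above suggest tasks roughly like the following. This is only a sketch reconstructed from the diff, not the patch that was eventually submitted. The diff does not show the fetch task's own parameters, so the src, dest, flat, and delegate_to values of the fetch task are assumptions, and example_local_archive_path is a hypothetical variable introduced purely for illustration.

```yaml
# Sketch only: reconstructed from the diff hunks above, not the merged patch.
- name: change ownership of archive
  # Despite its name, this task only relaxes the mode so that an
  # unprivileged fetch can read the archive.
  file:
    path: "{{ tripleo_transfer_tempfile.path }}"
    mode: "0777"
  become: "{{ tripleo_transfer_src_become }}"
  delegate_to: "{{ tripleo_transfer_src_host }}"

- name: fetch the archive
  # become is deliberately omitted here, following the advice in the
  # Ansible documentation quoted earlier in this comment.
  fetch:
    # Assumption: the fetch source is the same temporary archive.
    src: "{{ tripleo_transfer_tempfile.path }}"
    # Assumption: hypothetical variable, not taken from the role.
    dest: "{{ example_local_archive_path }}"
    flat: true
  # Assumption: fetch is delegated to the same source host as the task above.
  delegate_to: "{{ tripleo_transfer_src_host }}"
```

Mode 0777 mirrors the troubleshooting change recorded in the diff ("changed just in case"); a narrower mode that still lets the connecting user read the archive would probably be enough in practice.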

I will submit a patch based on the modifications done.

Comment 2 Alex Stupnikov 2020-12-15 08:39:31 UTC

Thank you for this update and resolving the problem!

Comment 5 Jose Luis Franco 2021-01-21 16:46:09 UTC

We will handle this issue in https://bugzilla.redhat.com/show_bug.cgi?id=1916162, since that fix solves the problem with the large DB and at the same time improves the whole DB transfer.

*** This bug has been marked as a duplicate of bug 1916162 ***
