Description of problem:
During the RHOSP 13 -> 16.1 upgrade of a RHOSP deployment with a separate Database control role, the customer hit a situation where command [1] failed because of a timeout. When we analyzed the python logs, it turned out that the "fetch the archive" play had been initiated but was still in progress after a few hours. On the director node we saw an active ansible-playbook process that used ~2GB of RAM but wasn't actually doing anything. Our first assumption was that the DB archive was too big, but it was only ~7.2GB and there was plenty of space on both the controller node and the director node.

The problem was reproduced on a second run, so I modified /usr/share/ansible/roles/tripleo-transfer/tasks/main.yml: I removed the "become" parameter from the fetch play and adjusted the related permissions. On the next run we were able to get past the limitation. It appears to be known to some extent that fetch plays use a different mechanism and create extra load when the "become" parameter is used. I am not sure what kind of solution should be applied here, so I am kindly asking the developers to check.

[1] "openstack overcloud external-upgrade run --stack overcloud --tags system_upgrade_transfer_data -y"
Hello Alex,

Your modification was correct, as stated in the Ansible docs:

https://docs.ansible.com/ansible/latest/collections/ansible/builtin/fetch_module.html

"When running fetch with become, the ansible.builtin.slurp module will also be used to fetch the contents of the file for determining the remote checksum. This effectively doubles the transfer size, and depending on the file size can consume all available memory on the remote or local hosts causing a MemoryError. Due to this it is advisable to run this module without become whenever possible."

I have accessed PSI's Undercloud and retrieved the modifications that were made to the role tasks:

[jfrancoa@localhost tripleo-ansible]$ diff roles/tripleo-transfer/tasks/main.yml ~/tripleo-transfer-main.yml
1,17d0
< ---
< # Copyright 2019 Red Hat, Inc.
< # All Rights Reserved.
< #
< # Licensed under the Apache License, Version 2.0 (the "License"); you may
< # not use this file except in compliance with the License. You may obtain
< # a copy of the License at
< #
< # http://www.apache.org/licenses/LICENSE-2.0
< #
< # Unless required by applicable law or agreed to in writing, software
< # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
< # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
< # License for the specific language governing permissions and limitations
< # under the License.
<
<
33c16,18
<     mode: 0700
---
>     # changed just in case:
>     #mode: 0700
>     mode: 0777
57a43,50
> - name: change ownership of archive
>   # added this play just in case
>   file:
>     name: "{{ tripleo_transfer_tempfile.path }}"
>     mode: 0777
>   become: "{{ tripleo_transfer_src_become }}"
>   delegate_to: "{{ tripleo_transfer_src_host }}"
>
58a52
>   # we removed become here because of known issue with fetch and become (check BZ)
63d56
<   become: "{{ tripleo_transfer_src_become }}"
103a97,99
>   # (Added by Alex) added become and delegate_to because of selinux failure
>   become: "{{ tripleo_transfer_src_become }}"
>   delegate_to: localhost

I will submit a patch based on these modifications.
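
For readers who prefer tasks over a diff, here is a minimal sketch of how the two relevant tasks might look after the workaround. It is not the actual role content: the variables tripleo_transfer_tempfile, tripleo_transfer_src_become and tripleo_transfer_src_host come from the diff above, while example_local_dest, the flat setting and the exact src value are assumptions made only for illustration.

```yaml
# Sketch only, assuming the archive lives at tripleo_transfer_tempfile.path.
- name: change ownership of archive
  # Relax permissions with privilege escalation so that the unprivileged
  # connection user can read the archive in the next task.
  file:
    path: "{{ tripleo_transfer_tempfile.path }}"
    mode: "0777"  # value used in the workaround above
  become: "{{ tripleo_transfer_src_become }}"
  delegate_to: "{{ tripleo_transfer_src_host }}"

- name: fetch the archive
  # No "become" here: with become, fetch also reads the file through slurp
  # to compute the remote checksum, which can exhaust memory on large files.
  fetch:
    src: "{{ tripleo_transfer_tempfile.path }}"
    dest: "{{ example_local_dest }}"  # hypothetical variable, not from the role
    flat: true                        # assumption for this sketch
  delegate_to: "{{ tripleo_transfer_src_host }}"
```

The idea is to confine privilege escalation to the small permissions task, so the large transfer itself runs as the regular connection user.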
Thank you for this update and resolving the problem!
We will be handling this issue in https://bugzilla.redhat.com/show_bug.cgi?id=1916162, as it solves the problem with the large DB and at the same time improves the whole DB transfer.

*** This bug has been marked as a duplicate of bug 1916162 ***