Bug 2178951

Summary: Leapp upgrade to 8.x hangs
Product: Red Hat Enterprise Linux 7
Reporter: walid.douagi <walid.douagi>
Component: leapp-repository
Assignee: Leapp Notifications Bot <leapp-notifications-bot>
Status: NEW
QA Contact: upgrades-and-conversions
Severity: urgent
Priority: unspecified
Version: 7.9
CC: cbesson, igulina, lagordon, pstodulk, walid.douagi
Target Milestone: rc
Keywords: Upgrades
Target Release: ---
Flags: walid.douagi: needinfo-
Hardware: x86_64
OS: Linux
Type: Bug

Attachments:
screenshot of leapp log
leapp db
sosreport link: https://we.tl/t-ALHzWfvO9w

Description walid.douagi@avaxia-group.com 2023-03-16 09:29:55 UTC
Description of problem:
Trying to upgrade from RHEL 7.9 to RHEL 8, "sudo leapp preupgrade --no-rhsm" hangs immediately after being executed.

Version-Release number of selected component (if applicable):
leapp --version: 0.15.0
/leapp-data-19.tar.gz 

How reproducible:
Steps to Reproduce:
1. sudo leapp preupgrade --no-rhsm
2. sudo leapp preupgrade --no-rhsm --debug


Actual results:
2023-03-16 18:04:29.648 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has finished: ['rm', '-rf', '/var/lib/leapp/scratch/mounts/root_/upper']
2023-03-16 18:04:29.652 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has started: ['rm', '-rf', '/var/lib/leapp/scratch/mounts/root_/work']
2023-03-16 18:04:29.664 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has finished: ['rm', '-rf', '/var/lib/leapp/scratch/mounts/root_/work']
2023-03-16 18:04:29.666 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has started: ['rm', '-rf', '/var/lib/leapp/scratch/mounts/root_/system_overlay']
2023-03-16 18:04:29.676 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has finished: ['rm', '-rf', '/var/lib/leapp/scratch/mounts/root_/system_overlay']
2023-03-16 18:04:29.678 DEBUG    PID: 27411 leapp.workflow.TargetTransactionFactsCollection.target_userspace_creator: External command has started: ['mount', '-t', 'overlay', 'overlay2', '-o', 'lowerdir=/,upperdir=/var/lib/leapp/scratch/mounts/root_/upper,workdir=/var/lib/leapp/scratch/mounts/root_/work', '/var/lib/leapp/scratch/mounts/root_/system_overlay']
[ 1680.196092] INFO: task rm:19192 blocked for more than 120 seconds.
[ 1680.202875] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1680.209713] rm              D ffff949dbfd1acc0     0 19192  14392 0x00000082
[ 1680.218826] Call Trace:
[ 1680.221394]  [<ffffffffafbb8bf9>] schedule_preempt_disabled+0x29/0x70
[ 1680.229761]  [<ffffffffafbb6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
[ 1680.235060]  [<ffffffffaf65ec88>] ? __sb_start_write+0x58/0x120
[ 1680.240715]  [<ffffffffafbb602f>] mutex_lock+0x1f/0x33
[ 1680.245764]  [<ffffffffaf668231>] lock_rename+0x31/0x100
[ 1680.251788]  [<ffffffffc0a4b34c>] ovl_lock_rename_workdir+0x1c/0x50 [overlay]
[ 1680.258505]  [<ffffffffc0a4ddd0>] ovl_do_remove+0x220/0x420 [overlay]
[ 1680.264174]  [<ffffffffc0a4dfe6>] ovl_rmdir+0x16/0x20 [overlay]
[ 1680.269474]  [<ffffffffaf668b6c>] vfs_rmdir+0xdc/0x150
[ 1680.273842]  [<ffffffffaf66e711>] do_rmdir+0x1f1/0x220
[ 1680.278265]  [<ffffffffaf65e13e>] ? ____fput+0xe/0x20
[ 1680.282698]  [<ffffffffaf4c7f70>] ? task_work_run+0xc0/0xe0
[ 1680.287868]  [<ffffffffaf66f9dd>] SyS_unlinkat+0x2d/0x50
[ 1680.292886]  [<ffffffffafbc539a>] system_call_fastpath+0x25/0x2a
[ 1800.299088] INFO: task rm:19192 blocked for more than 120 seconds.
[ 1800.305752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.312556] rm              D ffff949dbfd1acc0     0 19192  14392 0x00000082
[ 1800.321165] Call Trace:
[ 1800.323628]  [<ffffffffafbb8bf9>] schedule_preempt_disabled+0x29/0x70
[ 1800.330066]  [<ffffffffafbb6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
[ 1800.335748]  [<ffffffffaf65ec88>] ? __sb_start_write+0x58/0x120
[ 1800.344977]  [<ffffffffafbb602f>] mutex_lock+0x1f/0x33
[ 1800.354599]  [<ffffffffaf668231>] lock_rename+0x31/0x100
[ 1800.359024]  [<ffffffffc0a4b34c>] ovl_lock_rename_workdir+0x1c/0x50 [overlay]
[ 1800.367295]  [<ffffffffc0a4ddd0>] ovl_do_remove+0x220/0x420 [overlay]
[ 1800.373837]  [<ffffffffc0a4dfe6>] ovl_rmdir+0x16/0x20 [overlay]
[ 1800.381923]  [<ffffffffaf668b6c>] vfs_rmdir+0xdc/0x150
[ 1800.386759]  [<ffffffffaf66e711>] do_rmdir+0x1f1/0x220
[ 1800.392250]  [<ffffffffaf65e13e>] ? ____fput+0xe/0x20
[ 1800.397702]  [<ffffffffaf4c7f70>] ? task_work_run+0xc0/0xe0
[ 1800.403570]  [<ffffffffaf66f9dd>] SyS_unlinkat+0x2d/0x50
[ 1800.409270]  [<ffffffffafbc539a>] system_call_fastpath+0x25/0x2a
[ 1800.414627] INFO: task mount:27425 blocked for more than 120 seconds.
[ 1800.420386] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.427017] mount           D ffff949d365f6300     0 27425  27411 0x00000080
[ 1800.433686] Call Trace:
[ 1800.436503]  [<ffffffffafbb8bf9>] schedule_preempt_disabled+0x29/0x70
[ 1800.443173]  [<ffffffffafbb6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
[ 1800.449800]  [<ffffffffafbb602f>] mutex_lock+0x1f/0x33
[ 1800.455414]  [<ffffffffaf668231>] lock_rename+0x31/0x100
[ 1800.460923]  [<ffffffffc0a47dc1>] ovl_fill_super+0x581/0xc00 [overlay]
[ 1800.467617]  [<ffffffffaf5dca64>] ? register_shrinker+0x44/0x50
[ 1800.473560]  [<ffffffffaf65fc0d>] ? sget+0x7d/0xb0
[ 1800.478740]  [<ffffffffc0a47840>] ? ovl_get_lowerstack.isra.10+0x6a0/0x6a0 [overlay]
[ 1800.486834]  [<ffffffffaf65fc8f>] mount_nodev+0x4f/0xb0
[ 1800.493190]  [<ffffffffc0a4615f>] ovl_mount+0x4f/0x60 [overlay]
[ 1800.500260]  [<ffffffffaf66083e>] mount_fs+0x3e/0x1c0
[ 1800.505622]  [<ffffffffaf680327>] vfs_kern_mount+0x67/0x110
[ 1800.511057]  [<ffffffffaf682b4f>] do_mount+0x1ef/0xd00
[ 1800.516583]  [<ffffffffaf6587fa>] ? __check_object_size+0x1ca/0x250
[ 1800.522442]  [<ffffffffaf635cfc>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 1800.529191]  [<ffffffffaf6839a3>] SyS_mount+0x83/0xd0
[ 1800.534104]  [<ffffffffafbc539a>] system_call_fastpath+0x25/0x2a
[ 1920.540101] INFO: task rm:19192 blocked for more than 120 seconds.
[ 1920.545725] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.551584] rm              D ffff949dbfd1acc0     0 19192  14392 0x00000082
[ 1920.557721] Call Trace:
[ 1920.560384]  [<ffffffffafbb8bf9>] schedule_preempt_disabled+0x29/0x70
[ 1920.565335]  [<ffffffffafbb6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
[ 1920.570754]  [<ffffffffaf65ec88>] ? __sb_start_write+0x58/0x120
[ 1920.575866]  [<ffffffffafbb602f>] mutex_lock+0x1f/0x33
[ 1920.580339]  [<ffffffffaf668231>] lock_rename+0x31/0x100
[ 1920.585126]  [<ffffffffc0a4b34c>] ovl_lock_rename_workdir+0x1c/0x50 [overlay]
[ 1920.590744]  [<ffffffffc0a4ddd0>] ovl_do_remove+0x220/0x420 [overlay]
[ 1920.595633]  [<ffffffffc0a4dfe6>] ovl_rmdir+0x16/0x20 [overlay]
[ 1920.600206]  [<ffffffffaf668b6c>] vfs_rmdir+0xdc/0x150
[ 1920.604199]  [<ffffffffaf66e711>] do_rmdir+0x1f1/0x220
[ 1920.608356]  [<ffffffffaf65e13e>] ? ____fput+0xe/0x20
[ 1920.612607]  [<ffffffffaf4c7f70>] ? task_work_run+0xc0/0xe0
[ 1920.616865]  [<ffffffffaf66f9dd>] SyS_unlinkat+0x2d/0x50
[ 1920.621018]  [<ffffffffafbc539a>] system_call_fastpath+0x25/0x2a
[ 1920.625642] INFO: task mount:27425 blocked for more than 120 seconds.
[ 1920.630521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.636417] mount           D ffff949d365f6300     0 27425  27411 0x00000080
[ 1920.641658] Call Trace:
[ 1920.643832]  [<ffffffffafbb8bf9>] schedule_preempt_disabled+0x29/0x70
[ 1920.648611]  [<ffffffffafbb6ca7>] __mutex_lock_slowpath+0xc7/0x1e0
[ 1920.653267]  [<ffffffffafbb602f>] mutex_lock+0x1f/0x33
[ 1920.657200]  [<ffffffffaf668231>] lock_rename+0x31/0x100
[ 1920.661153]  [<ffffffffc0a47dc1>] ovl_fill_super+0x581/0xc00 [overlay]
[ 1920.665699]  [<ffffffffaf5dca64>] ? register_shrinker+0x44/0x50
[ 1920.670180]  [<ffffffffaf65fc0d>] ? sget+0x7d/0xb0
[ 1920.673631]  [<ffffffffc0a47840>] ? ovl_get_lowerstack.isra.10+0x6a0/0x6a0 [overlay]
[ 1920.678965]  [<ffffffffaf65fc8f>] mount_nodev+0x4f/0xb0
[ 1920.682823]  [<ffffffffc0a4615f>] ovl_mount+0x4f/0x60 [overlay]
[ 1920.687238]  [<ffffffffaf66083e>] mount_fs+0x3e/0x1c0
[ 1920.691071]  [<ffffffffaf680327>] vfs_kern_mount+0x67/0x110
[ 1920.695105]  [<ffffffffaf682b4f>] do_mount+0x1ef/0xd00
[ 1920.698853]  [<ffffffffaf6587fa>] ? __check_object_size+0x1ca/0x250
[ 1920.703462]  [<ffffffffaf635cfc>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 1920.707841]  [<ffffffffaf6839a3>] SyS_mount+0x83/0xd0
[ 1920.711629]  [<ffffffffafbc539a>] system_call_fastpath+0x25/0x2a

Expected results:
successful preupgrade


Additional info:

Comment 3 walid.douagi@avaxia-group.com 2023-03-16 09:45:04 UTC
Created attachment 1951215 [details]
screenshot of leapp log

Comment 4 Petr Stodulka 2023-03-20 12:32:17 UTC
Hello, thank you for the report. We are not able to reproduce the issue in our scenarios, so we cannot say why this is happening. We need /var/lib/leapp/leapp.db and the sosreport to be able to do any investigation.

Comment 5 Petr Stodulka 2023-03-20 12:36:57 UTC
Also, judging from the storage shown in the screenshot, it's possible you do not have enough space on /var, so I would try to extend the /var partition. If the system has XFS partitions formatted without the ftype attribute (the default until RHEL 7.3), that is most likely the root cause.
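
For reference, one minimal way to check this on each mounted XFS filesystem (a sketch; it relies on findmnt and xfs_info, both available on RHEL 7 via util-linux and xfsprogs):

~~~
# Print the ftype setting of every mounted XFS filesystem; ftype=0 means
# the filesystem was formatted without the ftype feature (the default
# before RHEL 7.3) and needs the leapp workaround.
for mp in $(findmnt -n -t xfs -o TARGET); do
    echo "== $mp =="
    xfs_info "$mp" | grep ftype
done
~~~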

Comment 6 walid.douagi@avaxia-group.com 2023-03-21 07:35:10 UTC
Created attachment 1952282 [details]
leapp db

Comment 7 walid.douagi@avaxia-group.com 2023-03-21 07:47:59 UTC
Created attachment 1952284 [details]
sosreport link
https://we.tl/t-ALHzWfvO9w

Comment 8 walid.douagi@avaxia-group.com 2023-03-21 08:35:31 UTC
@pstodulk Thanks for your feedback, I uploaded the db file and the sosreport.
Regarding /var, there is enough space; I use it regularly and have had no issue with it.
As for the XFS type: I already upgraded the Quality system from 7.9 to 8.6 without issues, as shown in the screenshots at this link: https://we.tl/t-5GGxzvsE9u.

Comment 9 Petr Stodulka 2023-03-24 14:55:35 UTC
Hi, sorry for the late response. Neither of the we.tl links works for me - I get to the page but cannot download anything; I see just adverts.
Regarding the leapp.db file, we probably have enough information for further investigation, but we may still need the sosreport later.

When you speak about the Quality system, do you mean a different system or the same one? If it is a different one, the leapp.db file from that system (it should still be present unless the leapp packages have already been removed manually to finish the upgrade) could help us see the differences between the two systems.

We are currently busy, so I cannot say when we will be able to continue with the investigation.

Additional notes:
* Detected SAP solutions on Azure (adding Irina Gulina to Cc).
* I see some parameters that differ from the standard/default XFS format used in my VM, which could be (but does not have to be) related to the issue.
* * We should compare it with the configuration on our Azure machines, just to see whether it's a possible clue or not.
* The system seems to be up to date regarding the installed packages.

Comment 10 Petr Stodulka 2023-03-24 15:15:09 UTC
Just got a tip: you could try to enforce our XFS ftype=0 workaround for all of your XFS partitions; note that it will need much more space on /var/lib/leapp. If it works, we will know the issue is caused by some of the parameters used for the XFS partitions:

~~~
In /usr/share/leapp-repository/repositories/system_upgrade/common/libraries/overlaygen.py, insert one line (line 52) to define the list in _prepare_required_mounts():

 46 def _prepare_required_mounts(scratch_dir, mounts_dir, mount_points, xfs_info):
 47     result = {
 48         mount_point.fs_file: mounting.NullMount(
 49             _mount_dir(mounts_dir, mount_point.fs_file)) for mount_point in mount_points
 50     }
 51 
 52     xfs_info.mountpoints_without_ftype = ["/", "/var", "/usr", "/boot"]  # <<< e.g. like this; adapt to your list of XFS partitions
 53     if not xfs_info.mountpoints_without_ftype:
 54         return result
~~~

If you have the possibility to set up a dedicated partition for /var/lib/leapp to provide the needed storage space for the experiment, that would probably be the easiest way to handle it (format such a partition with EXT4 in that case); a rough sketch follows below.
See "How to set up a dedicated /var/lib/leapp for upgrade" for detailed instructions:
    https://access.redhat.com/solutions/5057391
It's just something to try.
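
For illustration only, a rough sketch of that setup, assuming an LVM volume group with free space (the volume group name "rootvg" and the 5G size are placeholders, not values from this report; the KB article above is the authoritative procedure):

~~~
# Create and format a dedicated EXT4 volume for /var/lib/leapp
# (names and size are placeholders; adapt to your storage layout).
lvcreate -n leapp -L 5G rootvg
mkfs.ext4 /dev/rootvg/leapp

# Mount it and make the mount persistent across the upgrade reboots.
# If /var/lib/leapp already contains data, move it onto the new volume first.
mkdir -p /var/lib/leapp
mount /dev/rootvg/leapp /var/lib/leapp
echo '/dev/rootvg/leapp /var/lib/leapp ext4 defaults 0 0' >> /etc/fstab
~~~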

Comment 13 Petr Stodulka 2023-03-27 14:12:21 UTC
Just sharing the info: we already have an sosreport provided by one of our customers who hit the very same issue, so no other data is needed.

Comment 14 Petr Stodulka 2023-03-27 14:45:45 UTC
Sharing the info that unmounting /usr/sap and /opt/sap before the leapp execution resolved the problem. We still have no idea what exactly caused the original issue.
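
For completeness, a minimal sketch of that workaround, assuming /usr/sap and /opt/sap are separate mount points listed in /etc/fstab and not in use while leapp runs:

~~~
# Unmount the SAP filesystems before running leapp...
sudo umount /usr/sap /opt/sap
sudo leapp preupgrade --no-rhsm
# ...and remount them from their /etc/fstab entries afterwards.
sudo mount /usr/sap
sudo mount /opt/sap
~~~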