Bug 1784536

Summary: Segfaults in agetty during Cloud image testing since util-linux 2.35 - "double free or corruption"
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: util-linux Assignee: Karel Zak <kzak>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rawhide CC: dustymabe, jonathan, kzak
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: util-linux-2.35-0.4.fc32 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-21 16:16:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
backtrace of the first time the crash happened none

Description Adam Williamson 2019-12-17 16:25:54 UTC
Since util-linux 2.35-0.1 landed in Rawhide, I've noticed the Cloud tests in openQA (which use the test suite formerly run by autocloud) are sometimes failing due to agetty segfaulting. At first I mentioned this in https://bugzilla.redhat.com/show_bug.cgi?id=1783066 , but the cause of that turned out to be something else and with that fixed the agetty segfaults are still happening, so I'm filing this separately.

It doesn't seem to always segfault at the same time - it seems like it can happen at almost any time during the test. The test has run ten times since the new util-linux landed (5 in prod, 5 in stg) and it's failed 4 times due to agetty crashes, each time in a different place, and passed 6 (I guess it's possible agetty segfaults may have happened in the passed test too but somehow not interfered with the test).

I backtraced the first crash and will attach that backtrace to the bug; it pointed to "double free or corruption (out)".

Comment 1 Adam Williamson 2019-12-17 16:26:28 UTC
Created attachment 1645905
backtrace of the first time the crash happened

Comment 2 Adam Williamson 2019-12-17 16:29:26 UTC
I'm not sure if this really qualifies as a release-blocking bug as I guess it's fairly odd to use a local tty on a Cloud image...the typical way to interact with the system would be to ssh into it...

Comment 3 Dusty Mabe 2019-12-18 00:59:14 UTC
(In reply to Adam Williamson from comment #2)
> I'm not sure if this really qualifies as a release-blocking bug as I guess
> it's fairly odd to use a local tty on a Cloud image...the typical way to
> interact with the system would be to ssh into it...

I guess a good piece of information would be whether something about the
way the cloud image is configured is causing this, i.e. would/could it happen on
server and workstation as well, and we're just seeing it here first?

Comment 4 Adam Williamson 2019-12-18 09:29:28 UTC
Well, I think the VT consoles on typical Fedora installs are actually run by mingetty, not agetty. I did not look into it in detail yet, but my hand-wavy assumption so far has been that mingetty is being left out of the Cloud images either intentionally or by some kind of accident, and something is just falling back on using agetty instead.

Comment 5 Adam Williamson 2019-12-19 17:20:19 UTC
In fact, another interesting thing is that the serial console install test started failing in Rawhide around the same time. Now I believe mingetty doesn't support serial lines, so presumably we run something else on serial consoles. I guess it may well be agetty, and so serial consoles may be broken due to this bug as well. I'll have to dig into that, because if that's the case, it *will* be a release blocker.

Comment 6 Adam Williamson 2019-12-19 21:44:57 UTC
Hmm. So, we *do* use agetty on serial consoles. However, I can actually launch a serial console install in a local VM manually and have it work OK - it doesn't fail like it does in openQA. I'll have to fiddle with that a bit more.

Comment 7 Karel Zak 2019-12-20 09:15:18 UTC
This is very probably related to /etc/issue or /etc/issue.d (or /run/issue, /run/issue.d, /usr/lib/issue and /usr/lib/issue.d).
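
To see which of these locations are actually present on an affected image, something like the following could be used. This is only a rough sketch: the helper name `count_present_issue_paths` and the table `issue_paths` are made up for illustration and are not part of agetty; the path list is simply the one given above.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Candidate issue file locations, copied from the list in this comment. */
static const char *const issue_paths[] = {
    "/etc/issue",     "/etc/issue.d",
    "/run/issue",     "/run/issue.d",
    "/usr/lib/issue", "/usr/lib/issue.d",
};

#define ISSUE_PATH_COUNT (sizeof(issue_paths) / sizeof(issue_paths[0]))

/* Hypothetical helper: sets present[i] to 1 if issue_paths[i] exists,
 * 0 otherwise, and returns how many of the paths exist.
 * The caller must supply room for ISSUE_PATH_COUNT ints. */
size_t count_present_issue_paths(int *present)
{
    size_t found = 0;

    assert(present != NULL);
    for (size_t i = 0; i < ISSUE_PATH_COUNT; i++) {
        /* F_OK only tests for existence, not readability. */
        present[i] = (access(issue_paths[i], F_OK) == 0);
        if (present[i])
            found++;
    }
    return found;
}
```

Calling it with an `int present[ISSUE_PATH_COUNT]` array and printing the entries that are set shows which files or directories a getty could pick up on that system.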

Comment 8 Karel Zak 2019-12-20 14:03:10 UTC
OK, I'm able to reproduce this problem. It's related to issue file autoreload -- this is the reason why you see it on some machines where some network stuff (IP/hostname etc.) is updated after agetty starts. In this case, agetty reloads the issue file. Unfortunately, it uses an already freed pointer...
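
For readers unfamiliar with this class of bug, here is a minimal sketch of how a reload path can end up freeing the same buffer twice, and the usual way to guard against it. It is not the agetty code and not the upstream fix: `struct issue_state`, `issue_reset` and `issue_reload` are hypothetical names chosen for illustration. The hazard is a cached heap pointer that is freed but left dangling; the next reload then frees (or reads) it again, which glibc reports as "double free or corruption".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the state a getty keeps about the issue file.
 * Field and function names are illustrative, not taken from agetty. */
struct issue_state {
    char *text;   /* heap buffer with the rendered issue text, or NULL */
    size_t len;   /* length of text in bytes, excluding the terminator */
};

/* Release the cached text. Resetting the pointer to NULL is the important
 * step: without it, a second call would pass a dangling pointer to free(),
 * which is the double-free pattern. With it, a repeated call is harmless
 * because free(NULL) is defined to do nothing. */
void issue_reset(struct issue_state *st)
{
    assert(st != NULL);
    free(st->text);
    st->text = NULL;
    st->len = 0;
}

/* Replace the cached text with a fresh copy of content, as a reload
 * triggered by a hostname or address change might do.
 * Returns 0 on success, -1 if the allocation fails. */
int issue_reload(struct issue_state *st, const char *content)
{
    size_t n;
    char *copy;

    assert(st != NULL && content != NULL);
    issue_reset(st);            /* safe even if the state was already reset */

    n = strlen(content);
    copy = malloc(n + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, content, n + 1);

    st->text = copy;
    st->len = n;
    return 0;
}
```

Starting from `struct issue_state st = { NULL, 0 };`, any sequence of `issue_reload` and `issue_reset` calls is then safe, including resetting twice in a row. Dropping the `st->text = NULL;` line would bring back the crash pattern on the second reset or reload.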

Comment 9 Karel Zak 2019-12-20 14:12:00 UTC
Fixed by upstream commit 9418ba6d05feed6061f5343741b1bc56e7bde663.

Comment 10 Adam Williamson 2019-12-20 16:39:51 UTC
Great, thanks! I owe you several beers for not just telling me to run valgrind on it :P

Do you mind if I backport it to Rawhide to see which openQA tests it fixes?

Comment 11 Adam Williamson 2019-12-20 16:40:48 UTC
Oh, never mind, I see you did it already! That's great.

Comment 12 Adam Williamson 2019-12-21 16:16:25 UTC
OK, the Cloud image test and serial console install both passed with today's Rawhide, so this is looking good. Thanks.

Comment 13 Dusty Mabe 2019-12-21 19:33:01 UTC
Thanks Adam for chasing this down and thanks Karel for fixing it!