Bug 434826

Summary: unaligned access warnings during install from libnl
Product: Red Hat Enterprise Linux 5
Reporter: Doug Chapman <dchapman>
Component: libnl
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: desktop-bugs <desktop-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 5.2
CC: atodorov, dwa, gbeshers, prarit, rick.hester, tgraf
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: ia64
OS: Linux
Whiteboard:
Fixed In Version: RHBA-2008-0457
Doc Type: Bug Fix
Last Closed: 2008-05-21 14:28:53 UTC
Bug Depends On: 441922    
Bug Blocks: 409971    
Attachments (description, flags):
simple test code to reproduce unaligned access warnings (flags: none)
stack trace of one of the libnl unaligned access warnings (flags: none)
potential fix? (flags: none)

Description Doug Chapman 2008-02-25 18:11:08 UTC
Description of problem:
This is seen under anaconda during install but it appears the root of the
problem lies in libnl.

These warnings are seen during install.  Occasionally there are enough of them
to make installing via serial console very difficult as the screen becomes
unreadable:

anaconda(781): unaligned access to 0x6000000001010f4c, ip=0x20000000013632d0
anaconda(781): unaligned access to 0x6000000001010f54, ip=0x2000000001363310
anaconda(781): unaligned access to 0x6000000001010f5c, ip=0x2000000001363320
anaconda(781): unaligned access to 0x6000000001011244, ip=0x20000000013632d0
anaconda(781): unaligned access to 0x600000000101124c, ip=0x2000000001363310
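
The faulting addresses in these messages share a pattern: each is a multiple of 4 but not of 8. A minimal sketch of that check follows. The helper names are hypothetical, and the address list is copied from the output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when addr is a multiple of size (size must be a power of two). */
int is_aligned(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* Faulting data addresses copied from the console messages. */
const uint64_t logged_addrs[] = {
    0x6000000001010f4cULL,
    0x6000000001010f54ULL,
    0x6000000001010f5cULL,
    0x6000000001011244ULL,
    0x600000000101124cULL,
};

/* Number of logged addresses that are NOT aligned to the given size. */
size_t count_misaligned(uint64_t size)
{
    size_t i, n = 0;

    for (i = 0; i < sizeof logged_addrs / sizeof logged_addrs[0]; i++)
        if (!is_aligned(logged_addrs[i], size))
            n++;
    return n;
}
```

With these inputs, count_misaligned(4) is 0 and count_misaligned(8) is 5. That would be consistent with an 8-byte load from a location that is only 4-byte aligned, although the messages alone do not state the access width.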



If I drop to a shell and examine /proc/781/maps, I see that the faulting code is
not in anaconda itself but in libnl, which it is linked against:

2000000001338000-2000000001394000 r-xp 00000000 07:00 3632 /mnt/runtime/usr/lib/libnl.so.1.0-pre5
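
The lookup above amounts to checking whether the reported ip falls inside a mapped range. A minimal sketch, assuming only the leading "start-end" pair of a /proc/<pid>/maps line is needed; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Address range taken from one line of /proc/<pid>/maps. */
struct map_range {
    uint64_t start;
    uint64_t end;   /* exclusive */
};

/* Parse the leading "start-end" hex pair; returns 0 on success, -1 otherwise. */
int parse_map_range(const char *line, struct map_range *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%" SCNx64 "-%" SCNx64, &out->start, &out->end) != 2)
        return -1;
    return 0;
}

/* Nonzero when ip lies inside the mapping. */
int range_contains(const struct map_range *r, uint64_t ip)
{
    return ip >= r->start && ip < r->end;
}

/* Offset of ip from the start of the mapping (caller checks containment). */
uint64_t offset_in_range(const struct map_range *r, uint64_t ip)
{
    return ip - r->start;
}
```

For the first warning, ip=0x20000000013632d0 lies inside the libnl mapping at offset 0x2b2d0. Because this mapping has file offset 00000000, that is also the offset into the library file, which is the value to look up in a disassembly.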



I will try to do some further triage on this and hopefully come up with a
specific line of code that generates the warning.


Version-Release number of selected component (if applicable):
libnl-1.0-0.10.pre5.4

How reproducible:
100%

Steps to Reproduce:
1. serial console install on ia64
  
Actual results:


Expected results:


Additional info:

Comment 1 Doug Chapman 2008-02-26 00:03:44 UTC
Created attachment 295861 [details]
simple test code to reproduce unaligned access warnings

I did some additional digging and was surprised to see that libnl has not changed
since RHEL5.1. It turns out that we are seeing these unaligned access warnings
because anaconda now uses some libnl functionality that it did not use
before.

I wrote a simple standalone program to do the same steps anaconda is doing at
the point of the warnings.  That code is attached.  It is taken nearly directly
from the anaconda source.

Running the code on an ia64 system will generate the following on the console:

a.out(4542): unaligned access to 0x60000000000042ac, ip=0x200000000007b2d0
a.out(4542): unaligned access to 0x60000000000042b4, ip=0x200000000007b310
a.out(4542): unaligned access to 0x60000000000042bc, ip=0x200000000007b320
a.out(4542): unaligned access to 0x60000000000045a4, ip=0x200000000007b2d0
a.out(4542): unaligned access to 0x60000000000045ac, ip=0x200000000007b310

Comment 2 Dan Williams 2008-02-26 14:50:08 UTC
It's a pretty old build of libnl, but newer versions break ABI compatibility, so
we must tread carefully here.  Thomas, any ideas?

Doug, any chance you could get a backtrace of the spot where some of the
accesses happen?

Comment 3 Doug Chapman 2008-02-26 15:15:14 UTC
Created attachment 295933 [details]
stack trace of one of the libnl unaligned access warnings

Here is a sample stack trace.  Note that this was taken with a slightly hacked up
version of libnl.  Since the unaligned trap doesn't give a stack trace, I had to
determine where the access was happening from the ip, add my own unaligned check
to libnl at that point, and then sleep there so that I could attach gdb and get
the stack trace.
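
A minimal sketch of the kind of check-and-sleep hook described above. This is not the code that was actually added to libnl; the function names, the message text, and the sleep parameter are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Nonzero when p is not a multiple of align (align must be a power of two). */
int is_misaligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) != 0;
}

/*
 * Hypothetical debugging hook: if p is misaligned, print the pid and then
 * sleep so that a debugger can be attached while the interesting frames
 * are still on the stack. Returns 1 if the hook fired, 0 otherwise.
 */
int pause_if_misaligned(const void *p, size_t align, const char *where,
                        unsigned int seconds)
{
    if (!is_misaligned(p, align))
        return 0;

    fprintf(stderr, "%s: misaligned pointer %p in pid %ld, attach with: gdb -p %ld\n",
            where, p, (long)getpid(), (long)getpid());
    sleep(seconds);
    return 1;
}
```

A call such as pause_if_misaligned(map, 8, "link.c", 60) placed just before the suspect reads would leave time to attach and run a backtrace.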

One thing in the stack confused me at first: level 11, which is at nl.c:199, is
in the nl.c that is part of anaconda, not libnl.  Level 10 is the actual entry
into libnl code (at nl.c:49).

Comment 4 RHEL Program Management 2008-03-05 17:49:02 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 5 Doug Chapman 2008-03-26 20:42:27 UTC
Any updates on this?  While this may not sound like a critical issue, it makes a
mess of the console during serial console text mode installs.  At times this
makes it so that the user is unable to read the screen.


Comment 6 Dan Williams 2008-03-27 15:36:51 UTC
Not yet; I need some guidance from Thomas first.  Thomas, thoughts on this issue?

Comment 7 Dan Williams 2008-03-27 15:41:56 UTC
*** Bug 434823 has been marked as a duplicate of this bug. ***

Comment 8 Thomas Graf 2008-03-27 15:50:13 UTC
Sorry, this slipped by me. Can you post a source code listing relevant to the
backtrace? I bet it's an unaligned 64-bit read from part of the netlink message.

Comment 9 Doug Chapman 2008-03-27 16:34:48 UTC
Thomas,

Getting info like this out of shared libs is tricky (for me at least), and
unfortunately I didn't keep my hacked up version of libnl around.  However, iirc
the unaligned access happened in this section of code from route/link.c (there
certainly could have been other locations as well; this is just the one I
caught in the act):

    if (tb[IFLA_MAP]) {
            struct rtnl_link_ifmap *map = nla_data(tb[IFLA_MAP]);
            link->l_map.lm_mem_start = map->mem_start;
            link->l_map.lm_mem_end   = map->mem_end;
            link->l_map.lm_base_addr = map->base_addr;
            link->l_map.lm_irq       = map->irq;
            link->l_map.lm_dma       = map->dma;
            link->l_map.lm_port      = map->port;
            link->l_mask |= LINK_ATTR_MAP;
    }


I think it was a few of the map-> reads that triggered it, not the link->
writes.  Inside the kernel the "fix" would be to wrap these in get_unaligned(),
but that doesn't exist in userspace headers.
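
A common userspace stand-in for get_unaligned() is to copy the bytes into a properly aligned local with memcpy(). A minimal sketch for a 64-bit field; the function name is hypothetical and is not part of libnl or of any system header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 64-bit value from an address that may not be 8-byte aligned.
 * The bytes are copied into an aligned local instead of being loaded
 * through a uint64_t pointer.
 */
uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Applied to the listing above, a read such as map->mem_start would become load_u64_unaligned(&map->mem_start), assuming the field is 64 bits wide.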

I will do some more hacking to double check that what I said here matches reality.


Comment 10 Doug Chapman 2008-03-27 16:43:36 UTC
Created attachment 299357 [details]
potential fix?

Disclaimer!!! I don't know that this fixes _all_ the unaligned accesses we were
seeing during install. The only way to test that would be to build a full
RHEL5.2 install image.

But something like this patch might do the trick.  It does fix the case caught
by my reproducer.  It seems it would be better to make sure that the addresses
are aligned correctly in the first place, but I imagine that, due to the format
of the netlink message, that may not be possible?
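
On the alignment question: netlink attributes are padded to 4-byte boundaries and carry a 4-byte header, so a payload can start at an offset that is 4 modulo 8. A minimal sketch of that arithmetic follows. The macros are local stand-ins modeled on NLA_ALIGNTO, NLA_ALIGN and NLA_HDRLEN from <linux/netlink.h>, redefined here (as an assumption about the layout) so that the sketch builds without kernel headers; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the netlink attribute alignment macros. */
#define ATTR_ALIGNTO    ((size_t)4)
#define ATTR_ALIGN(len) (((len) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))
#define ATTR_HDRLEN     ATTR_ALIGN((size_t)4)  /* two 16-bit fields: len, type */

/* Offset of an attribute's payload, given the attribute's own offset. */
size_t attr_payload_offset(size_t attr_off)
{
    assert(attr_off % ATTR_ALIGNTO == 0);
    return attr_off + ATTR_HDRLEN;
}

/* Offset of the next attribute, given this one's offset and payload length. */
size_t next_attr_offset(size_t attr_off, size_t payload_len)
{
    return attr_off + ATTR_ALIGN(ATTR_HDRLEN + payload_len);
}
```

An attribute that itself starts on an 8-byte boundary has its payload at offset 4 modulo 8, so a 64-bit field at the start of that payload cannot be 8-byte aligned unless the sender inserts extra padding.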

Comment 11 Thomas Graf 2008-03-27 17:32:53 UTC
It's where I suspected the problem to be. It's not the only 64-bit member inside
netlink message structs, but to my knowledge it is the only one that is not
properly aligned with the use of __attribute__((aligned(8))).

The problem with the memcpy() solution is that I've noticed gcc producing code
which optimized the memcpy() away again. The use of ifmap is mainly deprecated
anyway; I don't know of any application that actually makes use of it. I'll
push your fix into my tree, although the problem is really on the kernel side,
though not fixable without breaking interfaces.

Comment 27 Issue Tracker 2008-04-04 01:07:44 UTC
HP has confirmed that they're willing to help QA for this bug. 


This event sent from IssueTracker by dwa 
 issue 171602

Comment 34 Doug Chapman 2008-04-10 17:10:11 UTC
This patch fixes some of the unaligned accesses but not all.  We have made some
improvement in that the unaligned accesses seen during anaconda stage2 are gone
but we still see some in stage1.  I was unable to fully test this because the
only way to verify was to try an install with a tree built with the latest libnl
which did not happen until last night.

Please note that I did file these as 2 separate BZs to start with but the other
one was closed as a dup of this one (granted they were both in libnl so they did
look like the same issue on the surface).

To avoid confusion I have opened BZ 441878 for the issues we are still seeing.



Comment 35 Doug Chapman 2008-04-10 18:37:24 UTC
Follow up to comment #34.

It turns out the reason I still see this in anaconda stage1 is that the stage1
"/sbin/loader" is statically linked.  Since anaconda has not been rebuilt, we
still see this issue there.  Since stage2 is dynamically linked, that also
explains why we no longer see the unaligned accesses there.

I will open a new anaconda BZ to ask for a rebuild with the new libnl.


Comment 36 Alexander Todorov 2008-04-21 15:42:36 UTC
*** Bug 441518 has been marked as a duplicate of this bug. ***

Comment 43 Doug Chapman 2008-04-24 19:20:32 UTC
With the rebuild of anaconda I no longer see these errors.

- Doug


Comment 44 Suzanne Hillman 2008-04-24 19:33:07 UTC
Verified by reporter.

Comment 47 errata-xmlrpc 2008-05-21 14:28:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0457.html