Bug 651890

Summary: [LTC 5.7 FEAT] Large memory machine spends huge amount of time in sysfs add of memory nodes (performance/boot)
Product: Red Hat Enterprise Linux 5 Reporter: IBM Bug Proxy <bugproxy>
Component: kernel Assignee: Steve Best <sbest>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 5.7 CC: jarod, jfeeney, jjarvis, lwoodman, nobody+PNT0273897, qcai, riel, sbest
Target Milestone: beta Keywords: FutureFeature, OtherQA, Reopened
Target Release: 5.7   
Hardware: ppc64   
OS: All   
Doc Type: Enhancement
Last Closed: 2011-02-14 15:02:20 UTC
Bug Blocks: 580522, 618260, 668558    

Description IBM Bug Proxy 2010-11-10 15:01:31 UTC
1. Feature Overview:
Feature Id: [68164]
a. Name of Feature: [LTC 5.7 FEAT] Large memory machine spends huge amount of time in sysfs add of
memory nodes (performance/boot)
b. Feature Description
We have noticed very long boot times on PowerPC64 machines with a lot of RAM (> 512GB). Almost all
of that time is spent in memory_dev_init(). Durations for that function at two RAM sizes:

0.5TB RAM - 1 minute
1.5TB RAM - 30 minutes

The backtrace looks like:

c000000000248ee0 .__sysfs_add_one+0x28/0x128
c0000000002492a8 .sysfs_add_one+0x38/0x188
c000000000249c88 .create_dir+0x70/0x138
c000000000249d98 .sysfs_create_dir+0x48/0x78
c00000000032bad8 .kobject_add_internal+0x140/0x308
c00000000032beb4 .kobject_init_and_add+0x4c/0x68
c00000000046c2c0 .sysdev_register+0xa0/0x220
c00000000047b1dc .add_memory_block+0x124/0x1e8
c0000000008d1f28 .memory_dev_init+0xf4/0x168

With 1TB RAM we have about 64k memory nodes. The problem is that sysfs checks for a duplicate name
on every add by scanning the parent's whole child list, so registering n entries costs O(n^2)
string comparisons:

int __sysfs_add_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
{
        struct sysfs_inode_attrs *ps_iattr;

        if (sysfs_find_dirent(acxt->parent_sd, sd->s_name))
                return -EEXIST;

...

struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
                                       const unsigned char *name)
{
        struct sysfs_dirent *sd;

        for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
                if (!strcmp(sd->s_name, name))
                        return sd;
        return NULL;
}

So with 64k nodes, each of the last additions walks a list of nearly 64k entries and does a strcmp
on every one.
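
To make the growth concrete, here is a minimal user-space sketch of the same pattern. It is not
kernel code: struct toy_dirent, toy_find(), toy_add_all() and toy_section_count() are hypothetical
names invented for this illustration, the list is simply prepended to (the real sysfs linking
order may differ), and the 16MB section size is only inferred from the "1TB RAM, about 64k nodes"
figure above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sysfs_dirent: a name and a sibling link. */
struct toy_dirent {
        char name[32];
        struct toy_dirent *sibling;
};

/* Number of memory sections for a given RAM size (assumed section size). */
unsigned long long toy_section_count(unsigned long long ram_bytes,
                                     unsigned long long section_bytes)
{
        assert(section_bytes != 0);
        return ram_bytes / section_bytes;
}

/* Linear duplicate scan modelled on sysfs_find_dirent(); counts strcmp calls. */
struct toy_dirent *toy_find(struct toy_dirent *children, const char *name,
                            unsigned long long *cmps)
{
        struct toy_dirent *sd;

        for (sd = children; sd; sd = sd->sibling) {
                (*cmps)++;
                if (!strcmp(sd->name, name))
                        return sd;
        }
        return NULL;
}

/*
 * Add n uniquely named entries ("memory0" .. "memory<n-1>"), running the
 * duplicate scan before each add, and return the total number of strcmp
 * calls. Entry i scans i existing entries, so the total is n * (n - 1) / 2.
 */
unsigned long long toy_add_all(unsigned int n)
{
        struct toy_dirent *head = NULL, *sd, *next;
        unsigned long long cmps = 0;
        unsigned int i;

        for (i = 0; i < n; i++) {
                sd = malloc(sizeof(*sd));
                assert(sd != NULL);
                snprintf(sd->name, sizeof(sd->name), "memory%u", i);
                if (toy_find(head, sd->name, &cmps)) {
                        free(sd);       /* would be -EEXIST in the kernel */
                        continue;
                }
                sd->sibling = head;     /* prepend to the child list */
                head = sd;
        }
        for (sd = head; sd; sd = next) {        /* release the list */
                next = sd->sibling;
                free(sd);
        }
        return cmps;
}
```

Under these assumptions, toy_section_count(1ULL << 40, 16ULL << 20) gives 65536 sections, and
n * (n - 1) / 2 for n = 65536 is 2,147,450,880 string comparisons. Doubling n roughly quadruples
the count, which is the quadratic behaviour described above.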


2. Feature Details:
Sponsor: Power Virtualization
Architectures: ppc64

Arch Specificity: both
Affects Kernel Modules: No
Delivery Mechanism: Backport
Category: kernel
Request Type: Package - Update Version
d. Upstream Acceptance: In Progress
Sponsor Priority P3
f. Severity: normal
IBM Confidential: No
Code Contribution: IBM code
g. Component Version Target: ---
h. Package - Version Update

3. Business Case
Customers purchasing large Power systems will experience extremely long boot times without this
patch, which will result in service calls. 

4. Primary contact at Red Hat:
John Jarvis, jjarvis

5. Primary contacts at Partner:
Project Management Contact:
Michael W. Wortman, wortman.com

Technical contact(s):
Nathan D. Fontenot, nfonteno.com

Comment 1 John Jarvis 2010-12-06 20:33:03 UTC
IBM is signed up to test and provide feedback, setting OtherQA

Comment 3 Qian Cai 2011-01-26 06:47:02 UTC
Is this the upstream patchset for this feature?
http://marc.info/?l=linux-mm&m=129554141716331&w=2

This is not ppc64 specific though.

Comment 6 RHEL Program Management 2011-01-31 02:05:22 UTC
Quality Engineering Management has reviewed and declined this request.  You may
appeal this decision by reopening this request.