Bug 841275
Summary: | Facts slow to generate | |
---|---|---|---
Product: | OKD | Reporter: | Mike McGrath <mmcgrath>
Component: | Containers | Assignee: | Mrunal Patel <mpatel>
Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 2.x | CC: | jialiu, mpatel, twiest
Target Milestone: | --- | Keywords: | Triaged
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | devenv_1933+ | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-08-07 20:42:41 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Mike McGrath
2012-07-18 14:35:53 UTC
I just watched how long /usr/libexec/mcollective/update_yaml.rb takes to run on a few of our nodes in PROD that have a lot of gears, and on average it takes around 7 minutes. The longest I've seen it run was 15 minutes. Since it takes so long to run, I see several of them running at the same time, which exacerbates the problem.

Can we get timing information for calculating each fact by using the -t flag? This script will get the time spent calculating each fact - http://fpaste.org/oduh/
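For reference, a per-fact timing wrapper along the following lines would produce output like the runs below. This is only a sketch assuming the Facter 1.x Ruby API (Facter.loadfacts, Facter.list, Facter.value); it is not necessarily the script behind the fpaste link.

```ruby
#!/usr/bin/env ruby
# Sketch of a per-fact timing wrapper (illustrative only, assuming the
# Facter 1.x Ruby API): resolve each known fact individually and print
# how long it took, similar to `facter --timing`.
require 'facter'

Facter.loadfacts                      # load built-in and custom facts
Facter.list.each do |name|
  start = Time.now
  Facter.value(name)                  # force the fact to resolve
  elapsed_ms = (Time.now - start) * 1000.0
  printf("%s: %.2fms\n", name, elapsed_ms)
end
```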
# time facter -t
kernel: 104.50ms macaddress: 16.76ms git_repos: 0.03ms max_active_apps: 0.02ms max_apps: 0.03ms lsbdistid: 328.16ms operatingsystem: 332.13ms hardwaremodel: 49.01ms architecture: 382.23ms lib: 0.02ms ec2_local_hostname: 0.02ms selinux: 11.16ms selinux_config_mode: 28.40ms processorcount: 1.04ms memoryfree: 1.83ms cart_list: 0.03ms mountpoints: 0.14ms ipaddress: 22.77ms netmask: 34.36ms netmask_eth0: 31.10ms cfkey: 0.12ms ec2_profile: 0.03ms sshdsakey: 0.52ms embed_cart_list: 0.03ms macaddress_lo: 19.11ms processor1: 0.03ms memorysize: 1.51ms memorytotal: 1.80ms district_uuid: 0.03ms virtual: 132.11ms ec2_public_keys_0_openssh_key: 0.03ms timezone: 0.46ms ec2_ami_id: 0.04ms augeasversion: 2.35ms public_ip: 0.02ms facterversion: 0.02ms uptime_seconds: 15.22ms id: 32.68ms path: 0.05ms ec2_block_device_mapping_ami: 0.03ms hardwareisa: 23.73ms kernelrelease: 22.79ms ec2_block_device_mapping_root: 0.04ms lsbdistcodename: 196.37ms ec2_instance_type: 0.04ms lsbdistrelease: 151.37ms lsbmajdistrelease: 151.87ms selinux_policyversion: 0.43ms capacity: 0.03ms ipaddress6: 11.91ms ipaddress6_eth0: 18.98ms osfamily: 0.09ms ec2_local_ipv4: 0.03ms selinux_config_policy: 12.02ms selinux_mode: 12.43ms ipaddress_lo: 19.90ms swapsize: 1.36ms devices: 0.03ms ec2_public_hostname: 0.04ms sshecdsakey: 0.04ms sshecdsakey: 0.04ms sshecdsakey: 0.03ms sshecdsakey: 0.03ms arp_eth0: 35.94ms netmask_lo: 21.30ms ps: 0.03ms ec2_reservation_id: 0.03ms uniqueid: 26.60ms district_active: 0.05ms uptime_hours: 0.13ms uptime_days: 0.36ms ec2_ami_launch_index: 0.03ms rubysitedir: 0.13ms public_hostname: 0.02ms is_virtual: 0.05ms hostname: 37.08ms domain: 28.94ms fqdn: 66.73ms lsbrelease: 139.62ms ec2_block_device_mapping_ephemeral0: 0.04ms network_eth0: 40.81ms operatingsystemrelease: 0.17ms swapfree: 1.18ms kernelversion: 0.05ms ec2_hostname: 0.03ms interfaces: 11.90ms ec2_kernel_id: 0.03ms selinux_current_mode: 10.39ms sshfp: 54.55ms macaddress_eth0: 19.29ms ec2_placement_availability_zone: 0.04ms ipaddress6_lo: 18.81ms processor0: 0.03ms ec2_public_ipv4: 0.03ms sshrsakey: 0.18ms physicalprocessorcount: 50.81ms puppetversion: 461.74ms arp_lo: 110.10ms arp: 41.38ms ec2_security_groups: 0.04ms uptime: 0.11ms shortdomain: 28.31ms ec2_ami_manifest_path: 0.04ms rubyversion: 0.02ms node_profile: 0.03ms lsbdistdescription: 170.76ms ec2_block_device_mapping_ephemeral1: 0.04ms network_lo: 64.46ms active_capacity: 0.04ms vlans: 0.10ms selinux_enforced: 0.53ms kernelmajversion: 0.09ms ec2_instance_id: 0.03ms ipaddress_eth0: 31.31ms
[...snip...]
real 0m43.383s user 0m3.112s sys 0m33.628s

----------------------------------------------------------------

# time ./facter_timing_script
kernel: 32.24ms macaddress: 10.59ms git_repos: 0.03ms max_active_apps: 0.02ms max_apps: 0.03ms lsbdistid: 147.57ms operatingsystem: 151.02ms hardwaremodel: 26.98ms architecture: 179.28ms ipaddress6_eth0: 69.42ms selinux: 6.45ms selinux_config_mode: 13.30ms osfamily: 0.10ms swapsize: 1.52ms mountpoints: 0.03ms ec2_local_ipv4: 0.03ms processorcount: 0.39ms ipaddress_lo: 20.81ms sshecdsakey: 0.06ms sshecdsakey: 0.05ms sshecdsakey: 0.03ms sshecdsakey: 0.03ms domain: 47.05ms cart_list: 0.04ms ec2_public_hostname: 0.03ms lsbrelease: 154.11ms processor1: 0.03ms embed_cart_list: 0.03ms virtual: 69.32ms netmask_lo: 20.01ms timezone: 0.45ms ipaddress6: 16.58ms ec2_reservation_id: 0.03ms lsbdistrelease: 141.87ms lsbmajdistrelease: 142.36ms district_uuid: 0.03ms uptime_seconds: 14.51ms ec2_ami_launch_index: 0.04ms id: 41.66ms network_eth0: 41.50ms public_ip: 0.04ms path: 0.03ms kernelrelease: 22.09ms hostname: 32.40ms fqdn: 58.57ms ec2_block_device_mapping_ephemeral0: 0.06ms arp_eth0: 37.42ms operatingsystemrelease: 0.22ms swapfree: 1.22ms kernelversion: 0.07ms arp: 88.25ms ec2_hostname: 0.04ms interfaces: 13.81ms selinux_policyversion: 0.38ms ec2_kernel_id: 0.03ms memorysize: 1.16ms capacity: 0.04ms macaddress_eth0: 31.49ms selinux_config_policy: 23.39ms selinux_mode: 23.81ms ec2_placement_availability_zone: 0.04ms lib: 0.02ms devices: 0.03ms ipaddress6_lo: 20.05ms sshrsakey: 0.54ms physicalprocessorcount: 74.32ms ec2_public_ipv4: 0.06ms lsbdistdescription: 195.58ms ps: 0.03ms cfkey: 0.11ms uniqueid: 23.94ms district_active: 0.04ms uptime_hours: 0.12ms uptime_days: 0.34ms ec2_security_groups: 0.03ms rubysitedir: 0.15ms is_virtual: 0.05ms ec2_ami_manifest_path: 0.03ms network_lo: 49.35ms public_hostname: 0.03ms ec2_block_device_mapping_ephemeral1: 0.03ms arp_lo: 40.60ms ec2_instance_id: 0.04ms kernelmajversion: 0.13ms ipaddress_eth0: 21.18ms selinux_current_mode: 8.86ms ec2_local_hostname: 0.03ms memoryfree: 1.21ms sshfp: 51.11ms netmask_eth0: 19.98ms ec2_profile: 0.05ms lsbdistcodename: 203.44ms processor0: 0.03ms memorytotal: 0.07ms macaddress_lo: 20.52ms sshdsakey: 0.19ms ipaddress: 11.28ms ec2_public_keys_0_openssh_key: 0.04ms puppetversion: 475.06ms uptime: 0.07ms ec2_ami_id: 0.02ms shortdomain: 28.15ms netmask: 16.35ms facterversion: 0.03ms rubyversion: 0.04ms ec2_block_device_mapping_ami: 0.02ms hardwareisa: 40.59ms node_profile: 0.03ms vlans: 0.13ms ec2_block_device_mapping_root: 0.02ms augeasversion: 1.95ms active_capacity: 0.03ms selinux_enforced: 0.53ms ec2_instance_type: 0.02ms
real 0m42.341s user 0m3.286s sys 0m35.333s

The numbers we got from loaded nodes don't really indicate any issue in calculating active_capacity. However, the Dir.glob used to generate the git repos count took 2-3 minutes when run by hand. It has already been replaced by a more efficient version this sprint (it now takes 2-3 seconds). Also, https://bugzilla.redhat.com/show_bug.cgi?id=841681 fixed the other issue by preventing multiple copies of the script from running at the same time and wiping out the facts file for long periods.

Expect to close today.

Added an optimization - https://github.com/openshift/crankcase/pull/310

Verified this bug on devenv-stage_228, and it PASSES. Checking the code of the patch, it still counts the git repos in the gear dirs as the return value of git_repos. The main optimization is that the directory glob counting now happens only once (in the old logic it ran twice, once for the git_repos fact and once for the active_capacity fact).
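As a rough illustration of that caching (this is a sketch, not the actual code from openshift/crankcase#310; the gear path and the active_capacity formula are assumptions), the idea is that the expensive Dir.glob runs once when the fact file is loaded and both facts reuse the result:

```ruby
require 'facter'

# Assumed gear location; the nodes in this bug keep gear dirs under
# /var/lib/stickshift.
GEAR_BASE_DIR = '/var/lib/stickshift'

# Run the expensive directory glob exactly once, when the fact file is
# loaded, instead of once per fact as in the old logic.
git_repo_count = Dir.glob(File.join(GEAR_BASE_DIR, '*', 'git', '*.git')).length

Facter.add(:git_repos) do
  setcode { git_repo_count }
end

Facter.add(:active_capacity) do
  setcode do
    # Hypothetical formula: capacity expressed relative to max_active_apps.
    max_active = Facter.value(:max_active_apps).to_f
    format('%.2f', (git_repo_count * 100.0) / max_active)
  end
end
```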
1. Create a lot of gears. Here I copied a lot of gear dirs (~2000) into the /var/lib/stickshift dir and added memory and CPU stress via the stress tool to create a dummy env (stress --vm 1 --vm-bytes 1G --vm-keep -c 10).

Using the new code:
# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m12.022s user 0m0.435s sys 0m0.405s
2501.25 real 0m11.081s user 0m0.437s sys 0m0.331s
2501.25 real 0m12.090s user 0m0.469s sys 0m0.361s
2501.25 real 0m12.327s user 0m0.488s sys 0m0.382s
2501.25 real 0m12.281s user 0m0.438s sys 0m0.397s
2501.25 real 0m10.622s user 0m0.401s sys 0m0.350s
The average value is about 11.737s.

Using the old code:
# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m13.423s user 0m0.505s sys 0m0.429s
2501.25 real 0m14.739s user 0m0.541s sys 0m0.476s
2501.25 real 0m13.620s user 0m0.476s sys 0m0.473s
2501.25 real 0m11.868s user 0m0.412s sys 0m0.409s
2501.25 real 0m12.074s user 0m0.424s sys 0m0.417s
2501.25 real 0m11.481s user 0m0.404s sys 0m0.389s
The average value is about 12.867s.

Comparing the two values, the optimization is not very pronounced, at least in my testing env, but it is indeed an optimization. In the initial report, I think the root cause of active_capacity taking 30+ seconds was heavy load on that node. In the test results (comment 8), it takes about 12 seconds to get the active_capacity fact under heavy memory load; once the memory load is released, it only takes about 1.2 seconds.

# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m1.103s user 0m0.401s sys 0m0.341s
2501.25 real 0m1.206s user 0m0.433s sys 0m0.370s
2501.25 real 0m1.177s user 0m0.387s sys 0m0.365s
2501.25 real 0m1.172s user 0m0.404s sys 0m0.345s
2501.25 real 0m1.102s user 0m0.402s sys 0m0.341s
2501.25 real 0m1.203s user 0m0.398s sys 0m0.345s
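For reference, the averages quoted above work out to (12.022 + 11.081 + 12.090 + 12.327 + 12.281 + 10.622) / 6 ≈ 11.74s for the new code and (13.423 + 14.739 + 13.620 + 11.868 + 12.074 + 11.481) / 6 ≈ 12.87s for the old code, i.e. roughly a 9% improvement while the node is under heavy load, compared to about 1.2s per run once the load is removed.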