Tutorial: Nimble Page Management for Tiered Memory Systems
- The kernel of “Nimble Page Management for Tiered Memory Systems” is here.
- Its companion userspace applications and microbenchmarks can be find here.
- A newer kernel is available at this branch, which is rebased on top of Linux v5.6_rc6 but not tested yet. You need to modify the userspace launcher and microbenchmarks, since the exchange_pages and mm_manage syscall numbers are changed to 439 (from 333) and 440 (from 334), respectively. For end-to-end launcher, you need to change the syscall number at this line. For exchange page microbenchmark, you need to change the syscall number at this line and this line.
Kernel compilation
Use make menuconfig
and select NIMBLE_PAGE_MANAGEMENT
to make sure the
kernel can be compiled correctly. (Use /
to search for that option.)
Make sure you have CONFIG_PAGE_MIGRATION_PROFILE=y
in your .config if you want
to run microbenchmarks. (Use make menuconfig
to search and enable this option.)
Microbenchmarks for page migration mechanisms
You need a two socket NUMA machine to run microbenchmarks properly. numactl and libnuma-dev packages are required. numactl -H
will tell your machine’s NUMA topology:
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 31374 MB
node 0 free: 203 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 32231 MB
node 1 free: 5535 MB
node distances:
node 0 1
0: 10 20
1: 20 10
This machine has two NUMA nodes (0-1), each has 32GB memory (Note: it is not necessarily the same machine as the one described in the paper).
-
Clone microbenchmarks from GitHub repository here.
-
In microbenchmarks folder, you can find microbenchmarks for all four page migration optimizations in different subfolders.
-
In each subfolder,
make thp_move_pages
generates page migrations using THPs, whilemake non_thp_move_pages
generates page migration using non-THPs, namely 4KB base pages. -
In thp_page_migration_and_parallel subfolder, there are four shell scripts migrating pages between NUMA node 0 and NUMA node 1.
- run_non_thp_test.sh: it migrates 20 to 29 4KB pages between node 0 and node 1.
- run_non_thp_2mb_page_test.sh: it migrates 20 to 29 512 4KB pages (each set of 512 4KB pages are equivalent to 1 2MB page size) between node 0 and node 1.
- run_split_thp_test.sh: it migrates 20 to 29 2MB THPs between node 0 and node 1 but each THP will be split before migration.
- run_thp_test.sh: it migrates 20 to 29 2MB THPs natively between node 0 and node 1.
In each of these scripts:
MULTI="1 2 4 8 16"
means 1, 2, 4, 8, 16 threads will be used to migrate pages. -
In concurrent_page_migration and exchange_page_migration subfolders, you will see similar scripts: run_non_thp_test.sh for 4KB page migration and run_thp_test.sh for 2MB THP migration.
After running the scripts, a stats_*
folder will be created, stats files have the names like [mt|seq]_[thp|non_thp]_[2mb|4kb]_page_order_[0-9]
, meaning sequential or multithreaded migration method for THP or non-THP (namely 4KB base pages) with page sizes 2MB or 4KB and number of 20 to 29 pages. In each file, you will see results like:
Total_cycles Begin_timestamp End_timestamp
6005843 24045933446997535 24045933453003378
syscall_timestamp check_rights_cycles migrate_prep_cycles form_page_node_info_cycles form_physical_page_list_cycles enter_unmap_and_move_cycles split_thp_page_cycles get_new_page_cycles lock_page_cycles unmap_page_cycles change_page_mapping_cycles copy_page_cycles remove_migration_ptes_cycles putback_old_page_cycles putback_new_page_cycles migrate_pages_cleanup_cycles store_page_status_cycles return_to_syscall_cycles last_timestamp
24045933447007103 1716 199242 66342 461837 33925 0 433761 55648 1315997 59190 2274685 391541 679706 0 299 20532 506 24045933453002030
--
Test successful.
The numbers are all in CPU cycles, read from x86_64 rtdsc
instructions. Here are the meaning of these items, where all numbers from Row 3 and 4 are recorded inside the kernel and read via /proc/<pid>/move_pages_breakdown
. The listed items below from 4 onwards are all kernel activities.
- Total_cycles: total CPU cycles spent on this migration.
- Begin_timestamp: a number returned by
rtdsc
instruction at the beginning of the migration.rtdsc
always gives an increasing number in your machine, so it can be used as a unique timestamp. - End_timestamp: a number returned by
rtdsc
at the end of the migration. - syscall_timestamp:
rtdsc
number recorded inside the kernel at the beginning ofmigrate_pages()
system call. - check_rights_cycles: cycles spent on checking process permissions in the kernel.
- migrate_prep_cycles: cycles spent on preparing page migrations.
- form_page_node_info_cycles: cycles spent on reading page node information.
- form_physical_page_list_cycles: cycles spent on making a page list for migration.
- enter_unmap_and_move_cycles: cycles spent before entering
unmap_and_move
kernel function. - split_thp_page_cycles: cycles spent on splitting a THP (will be 0 if migration 4KB pages or 2MB THPs without splitting).
- get_new_page_cycles: cycles spent on allocating new pages on remote node.
- lock_page_cycles: cycles spent on locking pages before migration.
- unmap_page_cycles: cycles spent on unmapping to-be-migrated pages.
- change_page_mapping_cycles: cycles spent on copying page mapping data structure from old pages to new pages.
- copy_page_cycles: cycles spent on copying page data from old pages to new pages.
- remove_migration_ptes_cycles: cycles spent on removing migration page table entries and mapping new pages.
- putback_old_page_cycles: cycles spent on putting back old pages.
- putback_new_page_cycles: cycles spent on putting back new pages on LRU page lists.
- migrate_pages_cleanup_cycles: cycles spent on cleaning up page migration process.
- store_page_status_cycles: cycles spent on storing page migration results to the status array passed via
migrate_pages()
system call. - return_to_syscall_cycles: cycles spent before exiting the system call.
- last_timestamp:
rtdsc
number recorded inside the kernel at the end ofmigrate_pages()
system call.
With these numbers, you should be able to get detail breakdown of Linux page migrations.
User application launcher
To launch benchmarks for end-to-end results, you will need the scripts and the launcher from end_to_end_launcher folder. Also you need to enable cgroup v2 support by adding systemd.unified_cgroup_hierarchy=1
to your kernel boot parameters. What I did is adding systemd.unified_cgroup_hierarchy=1
to GRUB_CMDLINE_LINUX
in /etc/default/grub
file, like GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
and run sudo update_grub2
.
Note: You also need to disable Linux Auto NUMA balancing to avoid its interference either temporarily by executing sudo sysctl kernel.numa_balancing=0
or permenantly by adding kernel.numa_balancing=0
to /etc/sysctl.conf
.1
In end_to_end_launcher folder, run make
to create launcher binary from launch.c file.
- create_die_stack_mem.sh can help you create a cgroup limiting your fast memory size.
- run_bench.sh uses launcher to run your applications in cgroup created by create_die_stack_mem.sh. You should change
PROJECT_LOC
to point to the folder of launcher binary. - Each of your applications should be in a separate folder like graph500-omp from the GitHub repository. You also need to create a
bench_run.sh
file to tellrun_bench.sh
the name of your application (BENCH_NAME
) and how to run it (BENCH_RUN
). Note: you should adjust your benchmark size, so that it can fit in your slow memory. For example, in this tutorial, it is better to use 30GB as your benchmark footprint, considering kernel might use some memory in each 32GB memory node. - run_all.sh will loop through
BENCHMARK_LIST
and run each of them for end-to-end performance evaluation.SHRINK_PAGE_LISTS
means page migration policy will be used.NR_RUNS
means how many times you want repeat the experiments.MIGRATION_BATCH_SIZE
means how many pages you want to migrate at the same time when concurrent page migration is used.MIGRATION_MT
means how many CPU threads you want to use for page migration.MEM_SIZE_LIST
allows you to specify a list of fast memory sizes.PAGE_REPLACEMENT_SCHEMES
: you can provide different migration options showing in my paper: all-remote-access and all-local-access are All Slow and All Fast policies. non-thp-migration, thp-migration, exchange-pages are using different ways of migrating pages in end-to-end evaluation with the same Linux page replacement policy.THREAD_LIST
allows you to specify how many hardware threads your applications want to use.
For slow memory, you can use a real NVM, slowing down one of your NUMA nodes with memhog, or use HP’s quartz tool to slow down one of your NUMA nodes.
-
Jiaolin Luo’s questions reminded me of this important setup. Thanks Jiaolin. ↩
Leave a Comment