I started adding 1GB THP support on Linux x86_64 as a part of Translation Ranger (See commits from 286b744 to 93d558e3 in Translation Ranger Github Repo). Then, I sent TR as a LSF/MM proposal including 1GB THP support. Now I pulled 1GB THP support part out and am trying to get it upstream. I put some text here to describe the high-level design of 1GB THP support. The text is based on the code from 1GB THP Github Repo.
THP clarification: there are two major components of a THP, physical page and virtual page mapping using PUD page table entries. The former helps Linux manage physical pages belonging to a THP as a whole using
struct page (all 262144 4KB physical pages are seen as a single compound page), whereas the latter enables applications to use THPs via mappings established by page tables (using page table entries). It is easy to get confused and mix them together. But in fact, kernel can just have the 1GB THP support for the physical page part only and still works, since kernel can map 512 2MB PMD entries to the 1GB THP. The other way around, namely only having PUD entry mapping, however would not work, since you can map one page table entry to multiple physical pages. What if one of the pages is migrated or swapped out? Accesses to the corresponding part of the virtual addresses will get corrupted data.
Page table entries: on x86_64, PGD (level 4) points to a 512GB virtual address range, PUD (level 3) points to a 1GB virtual address range, PMD (level 2) points to a 2MB virtual address range, PTE (level 1) points to a 4KB virtual address range. YMMV for other architectures.
Why we need 1GB THP? IMHO, although we already have libhugetlb to allocate 1GB pages, THP is more flexible in terms of memory management. You can split a 1GB THP and swap out its subpages if the 1GB page is no longer used as a whole and only part of it is hot. The usage of THP is counted as other 4KB and 2MB pages in the kernel stats, so kernel and user can easily discover application memory usage, whereas libhugetlb has to be explicitly managed by admin/user, which is fine when you know what you are doing. But most of cases, you might not want to do that explicit memory management. 1GB THP can save you the hassle and still reduce the virtual memory overheads by using a single PUD page table entry to map 1GB data.
At the high level, 1GB THP is very similar to existing 2MB THP. But it requires changes to accommodate the larger page size and handle additional page table entry mappings to a 1GB THP and its subpages. Linux already has some support 1GB pages (not hugetlb) for DAX, which provides a skeleton to start with. The changes include:
- the ability of allocating 1GB pages (Linux x86_64 can allocate up to 4MB pages) either by changing buddy allocator or via other kernel mechanism for page allocation.
- pre-allocating page table pages at PMD-level and PTE-level which are used for THP mapping split (existing 2MB THP only pre-allocates at PTE-level).
- more complicated THP map counting for a 1GB THP mapped by PUD entries, PMD entries, and PTE entries at the same time. (existing 2MB THP supports PMD and PTE mapping).
- THP splitting, both compound page split and PUD/PMD entries split (existing 2MB THP supports compound page split and PMD entries split).
Get 1GB pages in Linux
Pages are usually obtained from buddy allocator via
alloc_pages_* calls. The largest size you can get from buddy allocator is decided by
MAX_ORDER, since during the boot time kernel free pages from
MAX_ORDER is used to enforce what the largest order can be. (See the call sequence mem_init() -> memblock_free_all() -> free_low_memory_core_early() -> __free_memory_core -> __free_pages_memory where
order = min(MAX_ORDER - 1UL, __ffs(start));) On x86_64,
MAX_ORDER is set to 11, so you can get pages up to 2^(11 - 1 + 12) = 4MB. This mean getting 1GB pages is impossible through usual code path. But there are other choices:
MAX_ORDER in buddy allocator and why we cannot simply do that
First thing comes to mind would be “let’s bump MAX_ORDER, so buddy allocator can give 1GB pages”. Yeah, it can be done and I did that in my TR work too (see this commit). But if you look at the changes, you might notice that
SECTION_SIZE_BITS is changed too, otherwise, you will get compilation error Allocator MAX_ORDER exceeds SECTION_SIZE from here. So “let’s change SECTION_SIZE_BITS too”. No, please do not do that. It is not possible for Linux kernel used everywhere in the world.
The problem comes from the section size, which will be increased by larger
SECTION_SIZE_BITS, cannot be arbitrarily large. According to the email exchange with David Hildenbrand, I learnt that the section size is the minimum size of adding memory to Linux kernel. So if we have a large section size, we will lose available physical memory. Also in the virtualized environment, memory hot-plug is used to provisioning guest machines for memory usage increase. You do not want to give a guest machine 1GB memory when it only needs several MB more.
MAX_ORDER does not work.
alloc_contig_pages and when it might not work
I was using alloc_contig_pages for 1GB THP allocation initially and it works when the system has a lot of free memory and not many unmovable pages. But if we look into the code of
alloc_contig_pages, we can see that it allocates a larger than
2^(MAX_ORDER - 1) page by cleaning out a free physical frame range using page migration by calling
alloc_contig_range. The process scans through zone by zone from the beginning of a zone to the end of it, which is pretty time consuming.
When it might not work: If there are enough scattered unmovable pages or pages pinned by others, the function might not be able to find a free physical frame range that are large enough for page allocation and still wastes a lot of time during the process.
Contiguous memory allocator (CMA) for 1GB page allocation as an alternative
CMA allocation serves as boost for
alloc_contig_pages, since under the hood it still uses
alloc_contig_range (see cma_alloc). To use CMA allocation, user needs to add a kernel parameter like
hugetlb_cma=<size> to convert a region (or regions) of pages to
MIGRATE_CMA instead of the original
MIGRATE_MOVABLE. According to the page allocation fallback path table, unmovable page allocations from
MIGRATE_UNMOVABLE can fall back to
MIGRATE_MOVABLE page allocation but not
MIGRATE_CMA. This means we should not see unmovable pages in CMA regions, which significantly boost the successful rate of getting pages from
alloc_contig_pages. In addition, cma_alloc uses a bitmap to tell which part of CMA regions has been used, so a lot of page scanning time can be saved during page allocation process by checking the bitmap.
CMA allocation is not the final solution: CMA allocation was initially used for devices that need contiguous physical memory accesses. It is a kind of abuse to repurpose it for 1GB THP allocation. Additionally, an overly-sized CMA region can cause kernel running out of memory, since there is not enough
MIGRATE_MOVABLE pages for kernel memory usage to fall back on, which is very dangerous. If user applications run out of memory, we can do page reclaim or even kill other applications to make free memory, but if kernel runs out of memory, the system can only go down.