Commit | Line | Data |
---|---|---|
c24b7201 DH |
1 | ============================== |
2 | UNEVICTABLE LRU INFRASTRUCTURE | |
3 | ============================== | |
4 | ||
5 | ======== | |
6 | CONTENTS | |
7 | ======== | |
8 | ||
9 | (*) The Unevictable LRU | |
10 | ||
11 | - The unevictable page list. | |
12 | - Memory control group interaction. | |
13 | - Marking address spaces unevictable. | |
14 | - Detecting Unevictable Pages. | |
15 | - vmscan's handling of unevictable pages. | |
16 | ||
17 | (*) mlock()'d pages. | |
18 | ||
19 | - History. | |
20 | - Basic management. | |
21 | - mlock()/mlockall() system call handling. | |
22 | - Filtering special vmas. | |
23 | - munlock()/munlockall() system call handling. | |
24 | - Migrating mlocked pages. | |
25 | - mmap(MAP_LOCKED) system call handling. | |
26 | - munmap()/exit()/exec() system call handling. | |
27 | - try_to_unmap(). | |
28 | - try_to_munlock() reverse map scan. | |
29 | - Page reclaim in shrink_*_list(). | |
30 | ||
31 | ||
32 | ============ | |
33 | INTRODUCTION | |
34 | ============ | |
35 | ||
36 | This document describes the Linux memory manager's "Unevictable LRU" | |
37 | infrastructure and the use of this to manage several types of "unevictable" | |
38 | pages. | |
39 | ||
40 | The document attempts to provide the overall rationale behind this mechanism | |
41 | and the rationale for some of the design decisions that drove the | |
42 | implementation. The latter design rationale is discussed in the context of an | |
43 | implementation description. Admittedly, one can obtain the implementation | |
44 | details - the "what does it do?" - by reading the code. One hopes that the | |
45 | descriptions below add value by providing the answer to "why does it do that?". | |
46 | ||
47 | ||
48 | =================== | |
49 | THE UNEVICTABLE LRU | |
50 | =================== | |
51 | ||
52 | The Unevictable LRU facility adds an additional LRU list to track unevictable | |
53 | pages and to hide these pages from vmscan. This mechanism is based on a patch | |
54 | by Larry Woodman of Red Hat to address several scalability problems with page | |
fa07e787 | 55 | reclaim in Linux. The problems have been observed at customer sites on large |
c24b7201 DH |
56 | memory x86_64 systems. |
57 | ||
58 | To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of | |
59 | main memory will have over 32 million 4k pages in a single zone. When a large | |
60 | fraction of these pages are not evictable for any reason [see below], vmscan | |
61 | will spend a lot of time scanning the LRU lists looking for the small fraction | |
62 | of pages that are evictable. This can result in a situation where all CPUs are | |
63 | spending 100% of their time in vmscan for hours or days on end, with the system | |
64 | completely unresponsive. | |
65 | ||
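As a rough check of the page count quoted above, the arithmetic can be sketched as below. The helper name pages_in_memory() is purely illustrative and is not a kernel interface; the 4KB page size is the one assumed by the example in the text.

```c
#include <stdint.h>

/*
 * Illustrative helper (not a kernel function): number of whole pages of
 * page_size bytes contained in mem_bytes of memory.
 */
uint64_t pages_in_memory(uint64_t mem_bytes, uint64_t page_size)
{
	return mem_bytes / page_size;
}
```

For 128GB of memory and 4KB pages, pages_in_memory(128ULL << 30, 4096) gives 33554432, that is, a little over 32 million pages that vmscan may have to walk.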
66 | The unevictable list addresses the following classes of unevictable pages: | |
67 | ||
68 | (*) Those owned by ramfs. | |
69 | ||
70 | (*) Those mapped into SHM_LOCK'd shared memory regions. | |
71 | ||
72 | (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. | |
73 | ||
74 | The infrastructure may also be able to handle other conditions that make pages | |
fa07e787 LS |
75 | unevictable, either by definition or by circumstance, in the future. |
76 | ||
77 | ||
c24b7201 DH |
78 | THE UNEVICTABLE PAGE LIST |
79 | ------------------------- | |
fa07e787 LS |
80 | |
81 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | |
82 | called the "unevictable" list and an associated page flag, PG_unevictable, to | |
c24b7201 DH |
83 | indicate that the page is being managed on the unevictable list. |
84 | ||
85 | The PG_unevictable flag is analogous to, and mutually exclusive with, the | |
86 | PG_active flag in that it indicates on which LRU list a page resides when | |
e6e8dd50 | 87 | PG_lru is set. |
fa07e787 LS |
88 | |
89 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | |
90 | LRU list for a few reasons: | |
91 | ||
c24b7201 DH |
92 | (1) We get to "treat unevictable pages just like we treat other pages in the |
93 | system - which means we get to use the same code to manipulate them, the | |
94 | same code to isolate them (for migrate, etc.), the same code to keep track | |
95 | of the statistics, etc..." [Rik van Riel] | |
96 | ||
97 | (2) We want to be able to migrate unevictable pages between nodes for memory | |
98 | defragmentation, workload management and memory hotplug. The Linux kernel | |
99 | can only migrate pages that it can successfully isolate from the LRU | |
100 | lists. If we were to maintain pages elsewhere than on an LRU-like list, | |
101 | where they can be found by isolate_lru_page(), we would prevent their | |
102 | migration, unless we reworked migration code to find the unevictable pages | |
103 | itself. | |
fa07e787 | 104 | |
fa07e787 | 105 | |
c24b7201 DH |
106 | The unevictable list does not differentiate between file-backed and anonymous, |
107 | swap-backed pages. This differentiation is only important while the pages are, | |
108 | in fact, evictable. | |
fa07e787 | 109 | |
c24b7201 DH |
110 | The unevictable list benefits from the "arrayification" of the per-zone LRU |
111 | lists and statistics originally proposed and posted by Christoph Lameter. | |
fa07e787 | 112 | |
c24b7201 DH |
113 | The unevictable list does not use the LRU pagevec mechanism. Rather, |
114 | unevictable pages are placed directly on the page's zone's unevictable list | |
115 | under the zone lru_lock. This allows us to prevent the stranding of pages on | |
116 | the unevictable list when one task has the page isolated from the LRU and other | |
117 | tasks are changing the "evictability" state of the page. | |
fa07e787 | 118 | |
fa07e787 | 119 | |
c24b7201 DH |
120 | MEMORY CONTROL GROUP INTERACTION |
121 | -------------------------------- | |
fa07e787 | 122 | |
c24b7201 DH |
123 | The unevictable LRU facility interacts with the memory control group [aka |
124 | memory controller; see Documentation/cgroups/memory.txt] by extending the | |
125 | lru_list enum. | |
126 | ||
127 | The memory controller data structure automatically gets a per-zone unevictable | |
128 | list as a result of the "arrayification" of the per-zone LRU lists (one per | |
129 | lru_list enum element). The memory controller tracks the movement of pages to | |
130 | and from the unevictable list. | |
fa07e787 | 131 | |
fa07e787 LS |
132 | When a memory control group comes under memory pressure, the controller will |
133 | not attempt to reclaim pages on the unevictable list. This has a couple of | |
c24b7201 DH |
134 | effects: |
135 | ||
136 | (1) Because the pages are "hidden" from reclaim on the unevictable list, the | |
137 | reclaim process can be more efficient, dealing only with pages that have a | |
138 | chance of being reclaimed. | |
139 | ||
140 | (2) On the other hand, if too many of the pages charged to the control group | |
141 | are unevictable, the evictable portion of the working set of the tasks in | |
142 | the control group may not fit into the available memory. This can cause | |
143 | the control group to thrash or to OOM-kill tasks. | |
144 | ||
145 | ||
146 | MARKING ADDRESS SPACES UNEVICTABLE | |
147 | ---------------------------------- | |
148 | ||
149 | For facilities such as ramfs none of the pages attached to the address space | |
150 | may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE | |
151 | address space flag is provided, and this can be manipulated by a filesystem | |
152 | using a number of wrapper functions: | |
153 | ||
154 | (*) void mapping_set_unevictable(struct address_space *mapping); | |
155 | ||
156 | Mark the address space as being completely unevictable. | |
157 | ||
158 | (*) void mapping_clear_unevictable(struct address_space *mapping); | |
159 | ||
160 | Mark the address space as being evictable. | |
161 | ||
162 | (*) int mapping_unevictable(struct address_space *mapping); | |
163 | ||
164 | Query the address space, and return true if it is completely | |
165 | unevictable. | |
166 | ||
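The behaviour of the three wrappers can be modelled in userspace as shown below. This is only a sketch, assuming the flag is a single bit in a per-mapping flags word. The struct model_mapping type, the MODEL_AS_UNEVICTABLE bit position and the model_ prefixes are hypothetical; the kernel's struct address_space and its bit operations differ.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_AS_UNEVICTABLE	(1UL << 0)	/* hypothetical bit position */

/* Hypothetical stand-in for struct address_space. */
struct model_mapping {
	unsigned long flags;
};

/* Mark the whole mapping as unevictable. */
void model_mapping_set_unevictable(struct model_mapping *m)
{
	m->flags |= MODEL_AS_UNEVICTABLE;
}

/* Mark the mapping as evictable again. */
void model_mapping_clear_unevictable(struct model_mapping *m)
{
	m->flags &= ~MODEL_AS_UNEVICTABLE;
}

/* Query: true if the mapping is completely unevictable. */
bool model_mapping_unevictable(const struct model_mapping *m)
{
	return m != NULL && (m->flags & MODEL_AS_UNEVICTABLE) != 0;
}
```

In this model a NULL mapping is treated as evictable, which is an assumption of the sketch rather than something stated above.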
167 | These are currently used in two places in the kernel: | |
168 | ||
169 | (1) By ramfs to mark the address spaces of its inodes when they are created, | |
170 | and this mark remains for the life of the inode. | |
171 | ||
172 | (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. | |
173 | ||
174 | Note that SHM_LOCK is not required to page in the locked pages if they're | |
175 | swapped out; the application must touch the pages manually if it wants to | |
176 | ensure they're in memory. | |
177 | ||
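From the application side, the note about SHM_LOCK might translate into something like the following sketch. It uses the standard System V shmget(), shmat() and shmctl() calls; the function name shm_lock_and_touch() and its error convention are assumptions made for this example only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private SysV segment, SHM_LOCK it and touch every page so that
 * the locked pages are actually resident.
 * Returns 0 on success or an errno value on failure.
 */
int shm_lock_and_touch(size_t len, size_t page_size)
{
	int err = 0;
	char *p;
	int id;

	id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	if (id < 0)
		return errno;

	p = shmat(id, NULL, 0);
	if (p == (char *)-1) {
		err = errno;
		shmctl(id, IPC_RMID, NULL);
		return err;
	}

	if (shmctl(id, SHM_LOCK, NULL) < 0) {
		err = errno;	/* e.g. over RLIMIT_MEMLOCK */
	} else {
		/* SHM_LOCK is lazy: fault each page in by writing to it. */
		for (size_t off = 0; off < len; off += page_size)
			((volatile char *)p)[off] = 0;
		shmctl(id, SHM_UNLOCK, NULL);
	}

	shmdt(p);
	shmctl(id, IPC_RMID, NULL);
	return err;
}
```

Whether SHM_LOCK succeeds depends on the caller's privileges and memory lock limit, so a caller should be prepared for a non-zero return.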
178 | ||
179 | DETECTING UNEVICTABLE PAGES | |
180 | --------------------------- | |
181 | ||
182 | The function page_evictable() in vmscan.c determines whether a page is | |
183 | evictable or not using the query function outlined above [see section "Marking | |
184 | address spaces unevictable"] to check the AS_UNEVICTABLE flag. | |
185 | ||
186 | For address spaces that are so marked after being populated (as SHM regions | |
187 | might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate | |
188 | the page tables for the region as does, for example, mlock(), nor need it make | |
189 | any special effort to push any pages in the SHM_LOCK'd area to the unevictable | |
190 | list. Instead, vmscan will do this if and when it encounters the pages during | |
191 | a reclamation scan. | |
192 | ||
193 | On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan | |
194 | the pages in the region and "rescue" them from the unevictable list if no other | |
195 | condition is keeping them unevictable. If an unevictable region is destroyed, | |
196 | the pages are also "rescued" from the unevictable list in the process of | |
197 | freeing them. | |
198 | ||
199 | page_evictable() also checks for mlocked pages by testing an additional page | |
200 | flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, | |
201 | and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is | |
fa07e787 LS |
202 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and |
203 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | |
204 | efficient "culling" of pages in the fault path that are being faulted in to | |
c24b7201 | 205 | VM_LOCKED VMAs. |
fa07e787 LS |
206 | |
207 | ||
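The checks described in this section can be summarised by the simplified userspace model below. It is a sketch under stated assumptions, not the code in vmscan.c: the model_ types are hypothetical, booleans stand in for AS_UNEVICTABLE, PG_mlocked and VM_LOCKED, the statistics update is omitted, and the order of the first two checks is a guess.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_mapping {
	bool unevictable;		/* stands in for AS_UNEVICTABLE */
};

struct model_vma {
	bool vm_locked;			/* stands in for VM_LOCKED */
};

struct model_page {
	struct model_mapping *mapping;	/* may be NULL */
	bool mlocked;			/* stands in for PG_mlocked */
};

/* Returns true if the modelled page may be placed on an evictable list. */
bool model_page_evictable(struct model_page *page, struct model_vma *vma)
{
	/* ramfs and SHM_LOCK'd regions: the whole mapping is unevictable */
	if (page->mapping && page->mapping->unevictable)
		return false;

	/* already noticed as mlocked */
	if (page->mlocked)
		return false;

	/* fault path culling: models is_mlocked_vma() marking the page */
	if (vma && vma->vm_locked) {
		page->mlocked = true;
		return false;
	}

	return true;
}
```

Passing a NULL vma models callers, such as vmscan, that have no VMA at hand and rely on the page and mapping flags alone.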
c24b7201 DH |
208 | VMSCAN'S HANDLING OF UNEVICTABLE PAGES |
209 | -------------------------------------- | |
fa07e787 LS |
210 | |
211 | If unevictable pages are culled in the fault path, or moved to the unevictable | |
c24b7201 DH |
212 | list at mlock() or mmap() time, vmscan will not encounter the pages until they |
213 | have become evictable again (via munlock() for example) and have been "rescued" | |
214 | from the unevictable list. However, there may be situations where we decide, | |
215 | for the sake of expediency, to leave an unevictable page on one of the regular | |
216 | active/inactive LRU lists for vmscan to deal with. vmscan checks for such | |
217 | pages in all of the shrink_{active|inactive|page}_list() functions and will | |
218 | "cull" such pages that it encounters: that is, it diverts those pages to the | |
219 | unevictable list for the zone being scanned. | |
220 | ||
221 | There may be situations where a page is mapped into a VM_LOCKED VMA, but the | |
222 | page is not marked as PG_mlocked. Such pages will make it all the way to | |
fa07e787 | 223 | shrink_page_list() where they will be detected when vmscan walks the reverse |
c24b7201 DH |
224 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, |
225 | shrink_page_list() will cull the page at that point. | |
fa07e787 | 226 | |
c24b7201 DH |
227 | To "cull" an unevictable page, vmscan simply puts the page back on the LRU list |
228 | using putback_lru_page() - the inverse operation to isolate_lru_page() - after | |
229 | dropping the page lock. Because the condition which makes the page unevictable | |
230 | may change once the page is unlocked, putback_lru_page() will recheck the | |
231 | unevictable state of a page that it places on the unevictable list. If the | |
232 | page has become evictable, putback_lru_page() removes it from the list and | |
233 | retries, including the page_evictable() test. Because such a race is a rare | |
234 | event and movement of pages onto the unevictable list should be rare, these | |
235 | extra evictability checks should not occur in the majority of calls to | |
236 | putback_lru_page(). | |
fa07e787 LS |
237 | |
238 | ||
c24b7201 DH |
239 | ============= |
240 | MLOCKED PAGES | |
241 | ============= | |
fa07e787 | 242 | |
c24b7201 DH |
243 | The unevictable page list is also useful for mlock(), in addition to ramfs and |
244 | SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in | |
245 | NOMMU situations, all mappings are effectively mlocked. | |
246 | ||
247 | ||
248 | HISTORY | |
249 | ------- | |
250 | ||
251 | The "Unevictable mlocked Pages" infrastructure is based on work originally | |
fa07e787 | 252 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". |
c24b7201 DH |
253 | Nick posted his patch as an alternative to a patch posted by Christoph Lameter |
254 | to achieve the same objective: hiding mlocked pages from vmscan. | |
255 | ||
256 | In Nick's patch, he used one of the struct page LRU list link fields as a count | |
257 | of VM_LOCKED VMAs that map the page. This use of the link field for a count | |
258 | prevented the management of the pages on an LRU list, and thus mlocked pages | |
259 | were not migratable as isolate_lru_page() could not find them, and the LRU list | |
260 | link field was not available to the migration subsystem. | |
261 | ||
262 | Nick resolved this by putting mlocked pages back on the LRU list before | |
263 | attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When | |
264 | Nick's patch was integrated with the Unevictable LRU work, the count was | |
265 | replaced by walking the reverse map to determine whether any VM_LOCKED VMAs | |
266 | mapped the page. More on this below. | |
267 | ||
268 | ||
269 | BASIC MANAGEMENT | |
270 | ---------------- | |
271 | ||
272 | mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable | |
273 | pages. When such a page has been "noticed" by the memory management subsystem, | |
274 | the page is marked with the PG_mlocked flag. This can be manipulated using the | |
275 | PageMlocked() functions. | |
276 | ||
277 | A PG_mlocked page will be placed on the unevictable list when it is added to | |
278 | the LRU. Such pages can be "noticed" by memory management in several places: | |
279 | ||
280 | (1) in the mlock()/mlockall() system call handlers; | |
281 | ||
282 | (2) in the mmap() system call handler when mmapping a region with the | |
283 | MAP_LOCKED flag; | |
284 | ||
285 | (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE | |
286 | flag; | |
287 | ||
288 | (4) in the fault path, if mlocked pages are "culled" in the fault path, | |
289 | and when a VM_LOCKED stack segment is expanded; or | |
290 | ||
291 | (5) as mentioned above, in vmscan:shrink_page_list() when attempting to | |
292 | reclaim a page in a VM_LOCKED VMA via try_to_unmap() | |
293 | ||
294 | all of which result in the VM_LOCKED flag being set for the VMA if it doesn't | |
295 | already have it set. | |
296 | ||
297 | mlocked pages become unlocked and rescued from the unevictable list when: | |
298 | ||
299 | (1) mapped in a range unlocked via the munlock()/munlockall() system calls; | |
300 | ||
301 | (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including | |
302 | unmapping at task exit; | |
303 | ||
304 | (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file; | |
305 | or | |
306 | ||
307 | (4) before a page is COW'd in a VM_LOCKED VMA. | |
308 | ||
309 | ||
310 | mlock()/mlockall() SYSTEM CALL HANDLING | |
311 | --------------------------------------- | |
fa07e787 LS |
312 | |
313 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | |
c24b7201 | 314 | for each VMA in the range specified by the call. In the case of mlockall(), |
fa07e787 | 315 | this is the entire active address space of the task. Note that mlock_fixup() |
c24b7201 DH |
316 | is used for both mlocking and munlocking a range of memory. A call to mlock() |
317 | an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is | |
318 | treated as a no-op, and mlock_fixup() simply returns. | |
319 | ||
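From userspace, the calls handled here might be exercised as in the following sketch. mlock(), munlock(), posix_memalign() and sysconf() are standard interfaces; the function name lock_use_unlock() and its return convention are assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate npages page-aligned pages, mlock() them, use them and
 * munlock() them again.
 * Returns 0 on success or an errno value on failure.
 */
int lock_use_unlock(size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	size_t len;
	int err;

	if (psz <= 0 || npages == 0)
		return EINVAL;
	len = npages * (size_t)psz;

	err = posix_memalign(&buf, (size_t)psz, len);
	if (err)
		return err;

	if (mlock(buf, len) != 0) {
		err = errno;	/* e.g. ENOMEM when over RLIMIT_MEMLOCK */
	} else {
		memset(buf, 0, len);

		/* Locking an already locked range is harmless. */
		if (mlock(buf, len) != 0)
			err = errno;

		if (munlock(buf, len) != 0 && !err)
			err = errno;
	}

	free(buf);
	return err;
}
```

The second mlock() call corresponds to the no-op case described above: the VMA is already VM_LOCKED, so there is nothing further to do.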
320 | If the VMA passes some filtering as described in "Filtering Special VMAs" | |
321 | below, mlock_fixup() will attempt to merge the VMA with its neighbors or split | |
322 | off a subset of the VMA if the range does not cover the entire VMA. Once the | |
323 | VMA has been merged or split or neither, mlock_fixup() will call | |
324 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and to | |
325 | mark the pages as mlocked via mlock_vma_page(). | |
326 | ||
327 | Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, | |
328 | get_user_pages() will be unable to fault in the pages. That's okay. If pages | |
329 | do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the | |
fa07e787 LS |
330 | fault path or in vmscan. |
331 | ||
332 | Also note that a page returned by get_user_pages() could be truncated or | |
c24b7201 DH |
333 | migrated out from under us, while we're trying to mlock it. To detect this, |
334 | __mlock_vma_pages_range() checks page_mapping() after acquiring the page lock. | |
335 | If the page is still associated with its mapping, we'll go ahead and call | |
336 | mlock_vma_page(). If the mapping is gone, we just unlock the page and move on. | |
337 | In the worst case, this will result in a page mapped in a VM_LOCKED VMA | |
338 | remaining on a normal LRU list without being PageMlocked(). Again, vmscan will | |
339 | detect and cull such pages. | |
340 | ||
341 | mlock_vma_page() will call TestSetPageMlocked() for each page returned by | |
342 | get_user_pages(). We use TestSetPageMlocked() because the page might already | |
343 | be mlocked by another task/VMA and we don't want to do extra work. We | |
344 | especially do not want to count an mlocked page more than once in the | |
345 | statistics. If the page was already mlocked, mlock_vma_page() need do nothing | |
346 | more. | |
fa07e787 LS |
347 | |
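The reason for using a test-and-set operation can be illustrated with C11 atomics. This is a userspace model only: the model_ names, the bit position and the single nr_mlock counter are hypothetical, and the kernel's page flag and statistics code is not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PG_MLOCKED	(1UL << 0)	/* hypothetical bit position */

struct model_page {
	atomic_ulong flags;
};

struct model_zone_stats {
	atomic_long nr_mlock;		/* count of mlocked pages */
};

/* Atomically set the bit and report whether it was already set. */
bool model_test_set_page_mlocked(struct model_page *page)
{
	unsigned long old = atomic_fetch_or(&page->flags, MODEL_PG_MLOCKED);

	return (old & MODEL_PG_MLOCKED) != 0;
}

/* Returns true if this call was the one that accounted the page. */
bool model_mlock_vma_page(struct model_page *page,
			  struct model_zone_stats *stats)
{
	if (model_test_set_page_mlocked(page))
		return false;	/* already mlocked by another task/VMA */

	atomic_fetch_add(&stats->nr_mlock, 1);
	/* The kernel would now try to move the page to the unevictable list. */
	return true;
}
```

Because only the caller that observes the bit changing from clear to set updates the counter, a page mlocked through several VMAs is still counted once.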
348 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | |
349 | page from the LRU, as it is likely on the appropriate active or inactive list | |
c24b7201 DH |
350 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put |
351 | back the page - by calling putback_lru_page() - which will notice that the page | |
352 | is now mlocked and divert the page to the zone's unevictable list. If | |
fa07e787 | 353 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle |
c24b7201 | 354 | it later if and when it attempts to reclaim the page. |
fa07e787 LS |
355 | |
356 | ||
c24b7201 DH |
357 | FILTERING SPECIAL VMAS |
358 | ---------------------- | |
fa07e787 | 359 | |
c24b7201 | 360 | mlock_fixup() filters several classes of "special" VMAs: |
fa07e787 | 361 | |
c24b7201 | 362 | 1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind |
fa07e787 | 363 | these mappings are inherently pinned, so we don't need to mark them as |
c24b7201 DH |
364 | mlocked. In any case, most of the pages have no struct page in which to so |
365 | mark the page. Because of this, get_user_pages() will fail for these VMAs, | |
366 | so there is no sense in attempting to visit them. | |
367 | ||
368 | 2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We | |
369 | neither need nor want to mlock() these pages. However, to preserve the | |
370 | prior behavior of mlock() - before the unevictable/mlock changes - | |
371 | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to | |
372 | allocate the huge pages and populate the ptes. | |
373 | ||
374 | 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of | |
375 | kernel pages, such as the VDSO page, relay channel pages, etc. These pages | |
fa07e787 | 376 | are inherently unevictable and are not managed on the LRU lists. |
c24b7201 | 377 | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls |
fa07e787 LS |
378 | make_pages_present() to populate the ptes. |
379 | ||
c24b7201 | 380 | Note that for all of these special VMAs, mlock_fixup() does not set the |
fa07e787 | 381 | VM_LOCKED flag. Therefore, we won't have to deal with them later during |
c24b7201 DH |
382 | munlock(), munmap() or task exit. Neither does mlock_fixup() account these |
383 | VMAs against the task's "locked_vm". | |
384 | ||
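The three filtering rules can be condensed into a small decision function. The sketch below is a model under stated assumptions: the MODEL_VM_* bit values and the enum are hypothetical, and treating hugetlbfs as a flag bit is a simplification made for this example.

```c
/* What the modelled mlock_fixup() does with a VMA. */
enum model_mlock_action {
	MODEL_SKIP,		/* VM_IO / VM_PFNMAP: do not visit at all */
	MODEL_MAKE_PRESENT,	/* populate ptes only, no VM_LOCKED */
	MODEL_MLOCK_PAGES,	/* normal VMA: set VM_LOCKED, mlock pages */
};

/* Hypothetical flag bits; the kernel's values differ. */
#define MODEL_VM_IO		(1UL << 0)
#define MODEL_VM_PFNMAP		(1UL << 1)
#define MODEL_VM_HUGETLB	(1UL << 2)
#define MODEL_VM_DONTEXPAND	(1UL << 3)
#define MODEL_VM_RESERVED	(1UL << 4)

enum model_mlock_action model_filter_vma(unsigned long vm_flags)
{
	/* 1) inherently pinned, mostly without struct page */
	if (vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP))
		return MODEL_SKIP;

	/* 2) and 3) already pinned, but keep the old populate behaviour */
	if (vm_flags & (MODEL_VM_HUGETLB | MODEL_VM_DONTEXPAND |
			MODEL_VM_RESERVED))
		return MODEL_MAKE_PRESENT;

	return MODEL_MLOCK_PAGES;
}
```

Only the MODEL_MLOCK_PAGES case corresponds to a VMA that ends up VM_LOCKED and charged to the task's locked_vm.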
385 | ||
386 | munlock()/munlockall() SYSTEM CALL HANDLING | |
387 | ------------------------------------------- | |
388 | ||
389 | The munlock() and munlockall() system calls are handled by the same functions - | |
390 | do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs | |
391 | lock operation indicated by an argument. So, these system calls are also | |
392 | handled by mlock_fixup(). Again, if called for an already munlocked VMA, | |
393 | mlock_fixup() simply returns. Because of the VMA filtering discussed above, | |
394 | VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be | |
fa07e787 LS |
395 | ignored for munlock. |
396 | ||
c24b7201 DH |
397 | If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the |
398 | specified range. The range is then munlocked via the function | |
399 | __mlock_vma_pages_range() - the same function used to mlock a VMA range - | |
fa07e787 LS |
400 | passing a flag to indicate that munlock() is being performed. |
401 | ||
c24b7201 | 402 | Because the VMA access protections could have been changed to PROT_NONE after |
63d6c5ad | 403 | faulting in and mlocking pages, get_user_pages() was unreliable for visiting |
c24b7201 | 404 | these pages for munlocking. Because we don't want to leave pages mlocked, |
fa07e787 | 405 | get_user_pages() was enhanced to accept a flag to ignore the permissions when |
c24b7201 DH |
406 | fetching the pages - all of which should be resident as a result of previous |
407 | mlocking. | |
fa07e787 LS |
408 | |
409 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | |
410 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | |
c24b7201 DH |
411 | flag using TestClearPageMlocked(). As with mlock_vma_page(), |
412 | munlock_vma_page() uses the Test*PageMlocked() function to handle the case where | |
413 | the page might have already been unlocked by another task. If the page was | |
414 | mlocked, munlock_vma_page() updates the zone statistics for the number of | |
415 | mlocked pages. Note, however, that at this point we haven't checked whether | |
416 | the page is mapped by other VM_LOCKED VMAs. | |
417 | ||
418 | We can't call try_to_munlock(), the function that walks the reverse map to | |
419 | check for other VM_LOCKED VMAs, without first isolating the page from the LRU. | |
fa07e787 | 420 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page |
c24b7201 DH |
421 | not be on an LRU list [more on these below]. However, the call to |
422 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, | |
423 | we go ahead and clear PG_mlocked up front, as this might be the only chance we | |
424 | have. If we can successfully isolate the page, we go ahead and | |
fa07e787 | 425 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone |
c24b7201 | 426 | page statistics if it finds another VMA holding the page mlocked. If we fail |
fa07e787 | 427 | to isolate the page, we'll have left a potentially mlocked page on the LRU. |
c24b7201 DH |
428 | This is fine, because we'll catch it later if and when vmscan tries to reclaim | |
429 | the page. This should be relatively rare. | |
430 | ||
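The "clear first, then check" ordering described above can be modelled as follows. This is a sketch, assuming hypothetical model_ types: the count of other VM_LOCKED VMAs stands in for the reverse map walk done by try_to_munlock(), and the can_isolate argument stands in for the outcome of isolate_lru_page().

```c
#include <stdbool.h>

struct model_page {
	bool mlocked;		/* stands in for PG_mlocked */
	int other_locked_vmas;	/* other VM_LOCKED VMAs mapping the page */
};

struct model_stats {
	long nr_mlock;		/* count of mlocked pages */
};

void model_munlock_vma_page(struct model_page *page,
			    struct model_stats *st, bool can_isolate)
{
	/* Models TestClearPageMlocked(): nothing to do if already clear. */
	if (!page->mlocked)
		return;
	page->mlocked = false;
	st->nr_mlock--;

	/* Isolation failed: leave the page for vmscan to sort out later. */
	if (!can_isolate)
		return;

	/* Models try_to_munlock() finding another VM_LOCKED VMA. */
	if (page->other_locked_vmas > 0) {
		page->mlocked = true;
		st->nr_mlock++;
	}
}
```

When isolation fails while other VM_LOCKED VMAs still map the page, the model leaves the page unmarked, which is the "potentially mlocked page on the LRU" case that vmscan later detects.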
431 | ||
432 | MIGRATING MLOCKED PAGES | |
433 | ----------------------- | |
434 | ||
435 | A page that is being migrated has been isolated from the LRU lists and is held | |
436 | locked across unmapping of the page, updating the page's address space entry | |
437 | and copying the contents and state, until the page table entry has been | |
438 | replaced with an entry that refers to the new page. Linux supports migration | |
439 | of mlocked pages and other unevictable pages. This involves simply moving the | |
440 | PG_mlocked and PG_unevictable states from the old page to the new page. | |
441 | ||
442 | Note that page migration can race with mlocking or munlocking of the same page. | |
443 | This has been discussed from the mlock/munlock perspective in the respective | |
444 | sections above. Both processes (migration and m[un]locking) hold the page | |
445 | locked. This provides the first level of synchronization. Page migration | |
446 | zeros out the page_mapping of the old page before unlocking it, so m[un]lock | |
447 | can skip these pages by testing the page mapping under page lock. | |
448 | ||
449 | To complete page migration, we place the new and old pages back onto the LRU | |
450 | after dropping the page lock. The "unneeded" page - old page on success, new | |
451 | page on failure - will be freed when the reference count held by the migration | |
452 | process is released. To ensure that we don't strand pages on the unevictable | |
453 | list because of a race between munlock and migration, page migration uses the | |
454 | putback_lru_page() function to add migrated pages back to the LRU. | |
455 | ||
456 | ||
457 | mmap(MAP_LOCKED) SYSTEM CALL HANDLING | |
458 | ------------------------------------- | |
fa07e787 LS |
459 | |
460 | In addition to the mlock()/mlockall() system calls, an application can request | |
c24b7201 | 461 | that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap() |
fa07e787 LS |
462 | call. Furthermore, any mmap() call or brk() call that expands the heap by a |
463 | task that has previously called mlockall() with the MCL_FUTURE flag will result | |
c24b7201 DH |
464 | in the newly mapped memory being mlocked. Before the unevictable/mlock |
465 | changes, the kernel simply called make_pages_present() to allocate pages and | |
466 | populate the page table. | |
fa07e787 LS |
467 | |
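A userspace request of this kind might look like the sketch below. mmap(), munmap() and the MAP_LOCKED flag are the standard Linux interfaces; the function name map_locked_region() and its return convention are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory with MAP_LOCKED, touch it and unmap it.
 * Returns 0 on success or an errno value on failure.
 */
int map_locked_region(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
	if (p == MAP_FAILED)
		return errno;	/* e.g. EAGAIN when over RLIMIT_MEMLOCK */

	/* The mapping is expected to be resident already; use it. */
	((volatile char *)p)[0] = 1;

	/* Unmapping also undoes the lock for this range. */
	if (munmap(p, len) != 0)
		return errno;

	return 0;
}
```

No explicit munlock() is needed before munmap() here; the unmap path described later in this document takes care of munlocking the pages.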
468 | To mlock a range of memory under the unevictable/mlock infrastructure, the | |
469 | mmap() handler and task address space expansion functions call | |
470 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | |
c24b7201 DH |
471 | mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in |
472 | "Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have | |
473 | already been set by the caller, in filtered VMAs. Thus these VMAs need not be | |
474 | visited for munlock when the region is unmapped. | |
fa07e787 | 475 | |
c24b7201 | 476 | For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to |
fa07e787 LS |
477 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), |
478 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | |
c24b7201 | 479 | attempting to fault/allocate and mlock the pages and "upgrades" the semaphore |
fa07e787 LS |
480 | back to write mode before returning. |
481 | ||
c24b7201 DH |
482 | The callers of mlock_vma_pages_range() will have already added the memory range |
483 | to be mlocked to the task's "locked_vm". To account for filtered VMAs, | |
fa07e787 | 484 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the |
c24b7201 DH |
485 | callers then subtract a non-negative return value from the task's locked_vm. A |
486 | negative return value represents an error - for example, from get_user_pages() | |
487 | attempting to fault in a VMA with PROT_NONE access. In this case, we leave the | |
488 | memory range accounted as locked_vm, as the protections could be changed later | |
489 | and pages allocated into that region. | |
fa07e787 LS |
490 | |
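The caller-side accounting described above can be modelled as in the following sketch. The struct model_mm type and the function name are hypothetical; the not_locked argument stands in for the return value of mlock_vma_pages_range().

```c
/* Hypothetical stand-in for the task's mm_struct. */
struct model_mm {
	long locked_vm;		/* pages accounted as locked */
};

/*
 * nr_pages:   size of the range the caller is about to mlock
 * not_locked: modelled return value, i.e. the number of pages NOT mlocked,
 *             or a negative error code
 */
void model_account_locked(struct model_mm *mm, long nr_pages, long not_locked)
{
	/* The caller charges the whole range up front. */
	mm->locked_vm += nr_pages;

	/* Filtered VMAs: give back the pages that were not mlocked. */
	if (not_locked >= 0)
		mm->locked_vm -= not_locked;

	/* Negative value: an error; the range stays accounted. */
}
```

A fully filtered range therefore nets out to zero, while a range that failed to fault in, for example because of PROT_NONE, remains charged in case pages are allocated into it later.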
491 | ||
c24b7201 DH |
492 | munmap()/exit()/exec() SYSTEM CALL HANDLING |
493 | ------------------------------------------- | |
fa07e787 LS |
494 | |
495 | When unmapping an mlocked region of memory, whether by an explicit call to | |
496 | munmap() or via an internal unmap from exit() or exec() processing, we must | |
c24b7201 | 497 | munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. |
63d6c5ad HD |
498 | Before the unevictable/mlock changes, mlocking did not mark the pages in any |
499 | way, so unmapping them required no processing. | |
fa07e787 LS |
500 | |
501 | To munlock a range of memory under the unevictable/mlock infrastructure, the | |
c24b7201 | 502 | munmap() handler and task address space tear down functions call |
fa07e787 | 503 | munlock_vma_pages_all(). The name reflects the observation that one always |
c24b7201 DH |
504 | specifies the entire VMA range when munlock()ing during unmap of a region. |
505 | Because of the VMA filtering when mlocking() regions, only "normal" VMAs that | |
fa07e787 LS |
506 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). |
507 | ||
c24b7201 | 508 | munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup() |
fa07e787 | 509 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table |
c24b7201 DH |
510 | for the VMA's memory range and munlock_vma_page() each resident page mapped by |
511 | the VMA. This effectively munlocks the page, only if this is the last | |
512 | VM_LOCKED VMA that maps the page. | |
fa07e787 | 513 | |
fa07e787 | 514 | |
c24b7201 DH |
515 | try_to_unmap() |
516 | -------------- | |
fa07e787 | 517 | |
c24b7201 | 518 | Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may |
fa07e787 | 519 | have the VM_LOCKED flag set. It is possible for a page mapped into one or more |
c24b7201 DH |
520 | VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one |
521 | of the active or inactive LRU lists. This could happen if, for example, a task | |
522 | in the process of munlocking the page could not isolate the page from the LRU. | |
523 | As a result, vmscan/shrink_page_list() might encounter such a page as described | |
524 | in section "vmscan's handling of unevictable pages". To handle this situation, | |
525 | try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse | |
526 | map. | |
fa07e787 LS |
527 | |
528 | try_to_unmap() is always called, by either vmscan for reclaim or for page | |
c24b7201 DH |
529 | migration, with the argument page locked and isolated from the LRU. Separate |
530 | functions handle anonymous and mapped file pages, as these types of pages have | |
531 | different reverse map mechanisms. | |
532 | ||
533 | (*) try_to_unmap_anon() | |
534 | ||
535 | To unmap anonymous pages, each VMA in the list anchored in the anon_vma | |
536 | must be visited - at least until a VM_LOCKED VMA is encountered. If the | |
537 | page is being unmapped for migration, VM_LOCKED VMAs do not stop the | |
538 | process because mlocked pages are migratable. However, for reclaim, if | |
539 | the page is mapped into a VM_LOCKED VMA, the scan stops. | |
540 | ||
541 | try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of | |
542 | the mm_struct to which the VMA belongs. If this is successful, it will | |
543 | mlock the page via mlock_vma_page() - we wouldn't have gotten to | |
544 | try_to_unmap_anon() if the page were already mlocked - and will return | |
545 | SWAP_MLOCK, indicating that the page is unevictable. | |
546 | ||
547 | If the mmap semaphore cannot be acquired, we are not sure whether the page | |
548 | is really unevictable or not. In this case, try_to_unmap_anon() will | |
549 | return SWAP_AGAIN. | |
550 | ||
551 | (*) try_to_unmap_file() - linear mappings | |
552 | ||
553 | Unmapping of a mapped file page works the same as for anonymous mappings, | |
554 | except that the scan visits all VMAs that map the page's index/page offset | |
555 | in the page's mapping's reverse map priority search tree. It also visits | |
556 | each VMA in the page's mapping's non-linear list, if the list is | |
557 | non-empty. | |
558 | ||
559 | As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file | |
560 | page, try_to_unmap_file() will attempt to acquire the associated | |
561 | mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this | |
562 | is successful, and SWAP_AGAIN, if not. | |
563 | ||
564 | (*) try_to_unmap_file() - non-linear mappings | |
565 | ||
566 | If a page's mapping contains a non-empty non-linear mapping VMA list, then | |
567 | try_to_un{map|lock}() must also visit each VMA in that list to determine | |
568 | whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit | |
569 | all VMAs in the non-linear list to ensure that the page is not/should not | |
570 | be mlocked. | |
571 | ||
572 | If a VM_LOCKED VMA is found in the list, the scan could terminate. | |
573 | However, there is no easy way to determine whether the page is actually | |
574 | mapped in a given VMA - either for unmapping or testing whether the | |
575 | VM_LOCKED VMA actually pins the page. | |
576 | ||
577 | try_to_unmap_file() handles non-linear mappings by scanning a certain | |
578 | number of pages - a "cluster" - in each non-linear VMA associated with the | |
579 | page's mapping, for each file mapped page that vmscan tries to unmap. If | |
580 | this happens to unmap the page we're trying to unmap, try_to_unmap() will | |
581 | notice this on return (page_mapcount(page) will be 0) and return | |
582 | SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to | |
583 | recirculate this page. We take advantage of the cluster scan in | |
584 | try_to_unmap_cluster() as follows: | |
585 | ||
586 | For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the | |
587 | mmap semaphore of the associated mm_struct for read without blocking. | |
588 | ||
589 | If this attempt is successful and the VMA is VM_LOCKED, | |
590 | try_to_unmap_cluster() will retain the mmap semaphore for the scan; | |
591 | otherwise it drops it here. | |
592 | ||
593 | Then, for each page in the cluster, if we're holding the mmap semaphore | |
594 | for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to | |
595 | mlock the page. This call is a no-op if the page is already locked, | |
596 | but will mlock any pages in the non-linear mapping that happen to be | |
597 | unlocked. | |
598 | ||
599 | If one of the pages so mlocked is the page passed in to try_to_unmap(), | |
600 | try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default | |
601 | SWAP_AGAIN. This will allow vmscan to cull the page, rather than | |
602 | recirculating it on the inactive list. | |
603 | ||
604 | Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it | |
605 | returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED | |
606 | VMA, but couldn't be mlocked. | |
607 | ||
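The anonymous and linear file cases above share one decision pattern, which can be
sketched as a small user-space model. This is not the kernel code: struct
model_vma, struct model_page, model_try_to_unmap() and the mmap_sem_free field
are hypothetical stand-ins, and the enum values are arbitrary. Only the return
code names (SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK) come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named as in the text; the numeric values are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical stand-in for one VMA found on the page's reverse map. */
struct model_vma {
    bool vm_locked;      /* models VM_LOCKED being set on the VMA */
    bool mmap_sem_free;  /* would a non-blocking read-lock attempt succeed? */
};

/* Hypothetical stand-in for the page being unmapped. */
struct model_page {
    bool mlocked;        /* models PG_mlocked */
    int mapcount;        /* number of VMAs still mapping the page */
};

/*
 * Walk the VMAs that map the page.  For reclaim, the walk stops at the
 * first VM_LOCKED VMA: the page is mlocked if the mmap semaphore can be
 * taken, otherwise the caller is told to retry later.  For migration,
 * VM_LOCKED VMAs do not stop the walk.
 */
enum swap_result model_try_to_unmap(struct model_page *page,
                                    const struct model_vma *vmas, size_t n,
                                    bool migration)
{
    assert(page != NULL && (vmas != NULL || n == 0));

    for (size_t i = 0; i < n; i++) {
        if (vmas[i].vm_locked && !migration) {
            if (!vmas[i].mmap_sem_free)
                return SWAP_AGAIN;   /* unsure whether page is unevictable */
            page->mlocked = true;    /* models mlock_vma_page() */
            return SWAP_MLOCK;
        }
        if (page->mapcount > 0)
            page->mapcount--;        /* models removing this VMA's mapping */
    }
    return page->mapcount == 0 ? SWAP_SUCCESS : SWAP_AGAIN;
}
```

In this model a reclaim walk that meets a VM_LOCKED VMA never reports
SWAP_SUCCESS, whereas the same VMA list walked for migration is unmapped
completely.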
608 | ||
609 | try_to_munlock() REVERSE MAP SCAN | |
610 | --------------------------------- | |
611 | ||
612 | [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the | |
613 | page_referenced() reverse map walker. | |
614 | ||
615 | When munlock_vma_page() [see section "munlock()/munlockall() System Call | |
616 | Handling" above] tries to munlock a page, it needs to determine whether or not | |
617 | the page is mapped by any VM_LOCKED VMA without actually attempting to unmap | |
618 | all PTEs from the page. For this purpose, the unevictable/mlock infrastructure | |
619 | introduced a variant of try_to_unmap() called try_to_munlock(). | |
fa07e787 LS |
620 | |
621 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | |
622 | mapped file pages with an additional argument specifying unlock versus unmap | |
623 | processing. Again, these functions walk the respective reverse maps looking | |
c24b7201 | 624 | for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file |
fa07e787 LS |
625 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions |
626 | attempt to acquire the associated mmap semaphore, mlock the page via | |
627 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | |
63d6c5ad | 628 | pre-clearing of the page's PG_mlocked done by munlock_vma_page().
fa07e787 | 629 | |
c24b7201 DH |
630 | If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap |
631 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to | |
632 | recycle the page on the inactive list and hope that it has better luck with the | |
633 | page next time. | |
634 | ||
635 | For file pages mapped into non-linear VMAs, the try_to_munlock() logic works | |
636 | slightly differently. On encountering a VM_LOCKED non-linear VMA that might | |
637 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the | |
638 | page. munlock_vma_page() will just leave the page unlocked and let vmscan deal | |
639 | with it - the usual fallback position. | |
640 | ||
641 | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's | |
642 | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. | |
643 | However, the scan can terminate when it encounters a VM_LOCKED VMA and can | |
644 | successfully acquire the VMA's mmap semaphore for read and mlock the page. | |
645 | Although try_to_munlock() might be called a great many times when munlocking a | |
646 | large region or tearing down a large address space that has been mlocked via | |
647 | mlockall(), overall this is a fairly rare event. | |
648 | ||
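The termination rule described above (visit every VMA to prove the page is not
mlocked, but stop early once a VM_LOCKED VMA has been found and the page
re-mlocked) can be sketched as a user-space model. Everything here is
hypothetical: struct rmap_vma, model_try_to_munlock(), the MODEL_* result names
and the visited counter exist only for illustration. The model also assumes
that the walk continues past a VM_LOCKED VMA whose mmap semaphore could not be
acquired, and reports the retry case at the end of the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for the model. */
enum munlock_result {
    MODEL_NOT_MLOCKED,  /* full walk found no VM_LOCKED VMA */
    MODEL_SWAP_AGAIN,   /* VM_LOCKED VMA seen, but page was not mlocked */
    MODEL_SWAP_MLOCK    /* page was re-mlocked; walk stopped early */
};

/* Hypothetical stand-in for one VMA on the page's reverse map. */
struct rmap_vma {
    bool vm_locked;      /* models VM_LOCKED */
    bool nonlinear;      /* VMA is a non-linear file mapping */
    bool mmap_sem_free;  /* would a non-blocking read-lock attempt succeed? */
};

enum munlock_result model_try_to_munlock(const struct rmap_vma *vmas, size_t n,
                                         bool *pg_mlocked, size_t *visited)
{
    bool saw_locked = false;

    assert(pg_mlocked != NULL && visited != NULL);
    *visited = 0;

    for (size_t i = 0; i < n; i++) {
        (*visited)++;
        if (!vmas[i].vm_locked)
            continue;            /* must keep looking at the other VMAs */
        saw_locked = true;
        if (vmas[i].nonlinear)
            continue;            /* cannot tell whether it maps the page */
        if (!vmas[i].mmap_sem_free)
            continue;            /* assumption: try the remaining VMAs */
        *pg_mlocked = true;      /* models mlock_vma_page() */
        return MODEL_SWAP_MLOCK; /* early termination */
    }
    return saw_locked ? MODEL_SWAP_AGAIN : MODEL_NOT_MLOCKED;
}
```

The visited counter makes the cost asymmetry visible: a page with no VM_LOCKED
mapping always costs a full walk, while a page that is still locked elsewhere
can end the walk at the first usable VM_LOCKED VMA.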
649 | ||
650 | PAGE RECLAIM IN shrink_*_list() | |
651 | ------------------------------- | |
652 | ||
653 | shrink_active_list() culls any obviously unevictable pages - i.e. | |
654 | !page_evictable(page, NULL) - diverting these to the unevictable list. | |
655 | However, shrink_active_list() only sees unevictable pages that made it onto the | |
656 | active/inactive LRU lists. Note that these pages do not have PageUnevictable | |
657 | set - otherwise they would be on the unevictable list and shrink_active_list() | |
658 | would never see them. | |
fa07e787 LS |
659 | |
660 | Some examples of these unevictable pages on the LRU lists are: | |
661 | ||
c24b7201 DH |
662 | (1) ramfs pages that have been placed on the LRU lists when first allocated. |
663 | ||
664 | (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to | |
665 | allocate or fault in the pages in the shared memory region. This happens | |
666 | when an application accesses the page the first time after SHM_LOCK'ing | |
667 | the segment. | |
fa07e787 | 668 | |
c24b7201 DH |
669 | (3) mlocked pages that could not be isolated from the LRU and moved to the |
670 | unevictable list in mlock_vma_page(). | |
fa07e787 | 671 | |
c24b7201 DH |
672 | (4) Pages mapped into multiple VM_LOCKED VMAs for which try_to_munlock() couldn't |
673 | acquire the VMA's mmap semaphore to test the flags and set PageMlocked. | |
674 | munlock_vma_page() was forced to let the page back on to the normal LRU | |
675 | list for vmscan to handle. | |
fa07e787 | 676 | |
c24b7201 DH |
677 | shrink_inactive_list() also diverts any unevictable pages that it finds on the |
678 | inactive lists to the appropriate zone's unevictable list. | |
fa07e787 | 679 | |
c24b7201 DH |
680 | shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd |
681 | after shrink_active_list() had moved them to the inactive list, or pages mapped | |
682 | into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to | |
683 | recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, | |
684 | but will pass them on to shrink_page_list(). | |
fa07e787 LS |
685 | |
686 | shrink_page_list() again culls obviously unevictable pages that it could | |
63d6c5ad | 687 | encounter for similar reasons to shrink_inactive_list(). Pages mapped into
c24b7201 | 688 | VM_LOCKED VMAs but without PG_mlocked set will make it all the way to |
63d6c5ad HD |
689 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list |
690 | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |