This document describes the Linux memory management "Unevictable LRU"
infrastructure and the use of this infrastructure to manage several types
of "unevictable" pages. The document attempts to provide the overall
rationale behind this mechanism and the rationale for some of the design
decisions that drove the implementation. The latter design rationale is
discussed in the context of an implementation description. Admittedly, one
can obtain the implementation details--the "what does it do?"--by reading the
code. One hopes that the descriptions below add value by providing the answer
to "why does it do that?".
11 | ||
12 | Unevictable LRU Infrastructure: | |
13 | ||
14 | The Unevictable LRU adds an additional LRU list to track unevictable pages | |
15 | and to hide these pages from vmscan. This mechanism is based on a patch by | |
16 | Larry Woodman of Red Hat to address several scalability problems with page | |
17 | reclaim in Linux. The problems have been observed at customer sites on large | |
18 | memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB | |
19 | of main memory will have over 32 million 4k pages in a single zone. When a | |
20 | large fraction of these pages are not evictable for any reason [see below], | |
21 | vmscan will spend a lot of time scanning the LRU lists looking for the small | |
22 | fraction of pages that are evictable. This can result in a situation where | |
23 | all cpus are spending 100% of their time in vmscan for hours or days on end, | |
24 | with the system completely unresponsive. | |
25 | ||
26 | The Unevictable LRU infrastructure addresses the following classes of | |
27 | unevictable pages: | |
28 | ||
29 | + page owned by ramfs | |
30 | + page mapped into SHM_LOCKed shared memory regions | |
31 | + page mapped into VM_LOCKED [mlock()ed] vmas | |
32 | ||
33 | The infrastructure might be able to handle other conditions that make pages | |
34 | unevictable, either by definition or by circumstance, in the future. | |
35 | ||
36 | ||
37 | The Unevictable LRU List | |
38 | ||
39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | |
40 | called the "unevictable" list and an associated page flag, PG_unevictable, to | |
41 | indicate that the page is being managed on the unevictable list. The | |
42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active | |
43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. | |
44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU | |
45 | Kconfig option. | |
46 | ||
47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | |
48 | LRU list for a few reasons: | |
49 | ||
50 | 1) We get to "treat unevictable pages just like we treat other pages in the | |
51 | system, which means we get to use the same code to manipulate them, the | |
52 | same code to isolate them (for migrate, etc.), the same code to keep track | |
53 | of the statistics, etc..." [Rik van Riel] | |
54 | ||
55 | 2) We want to be able to migrate unevictable pages between nodes--for memory | |
56 | defragmentation, workload management and memory hotplug. The linux kernel | |
57 | can only migrate pages that it can successfully isolate from the lru lists. | |
58 | If we were to maintain pages elsewise than on an lru-like list, where they | |
59 | can be found by isolate_lru_page(), we would prevent their migration, unless | |
60 | we reworked migration code to find the unevictable pages. | |
61 | ||
62 | ||
The unevictable LRU list does not differentiate between file backed and swap
backed [anon] pages. This differentiation is only important while the pages
are, in fact, evictable.

The unevictable LRU list benefits from the "arrayification" of the per-zone
LRU lists and statistics originally proposed and posted by Christoph Lameter.

The unevictable list does not use the lru pagevec mechanism. Rather,
unevictable pages are placed directly on the page's zone's unevictable
list under the zone lru_lock. The reason for this is to prevent stranding
of pages on the unevictable list when one task has the page isolated from the
lru and other tasks are changing the "evictability" state of the page.


Unevictable LRU and Memory Controller Interaction

The memory controller data structure automatically gets a per-zone unevictable
lru list as a result of the "arrayification" of the per-zone LRU lists. The
memory controller tracks the movement of pages to and from the unevictable list.
When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list. This has a couple of
effects. Because the pages are "hidden" from reclaim on the unevictable list,
the reclaim process can be more efficient, dealing only with pages that have
a chance of being reclaimed. On the other hand, if too many of the pages
charged to the control group are unevictable, the evictable portion of the
working set of the tasks in the control group may not fit into the available
memory. This can cause the control group to thrash or to oom-kill tasks.


Unevictable LRU: Detecting Unevictable Pages

The function page_evictable(page, vma) in vmscan.c determines whether a
page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions,
page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's
address space using a wrapper function. Wrapper functions are used to set,
clear and test the flag to reduce the requirement for #ifdef's throughout the
source code. AS_UNEVICTABLE is set on the ramfs inode/mapping when it is
created. This flag remains for the life of the inode.
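
For reference, a minimal sketch of these wrappers [the exact kernel
definitions may differ slightly]:

    static inline void mapping_set_unevictable(struct address_space *mapping)
    {
            set_bit(AS_UNEVICTABLE, &mapping->flags);
    }

    static inline void mapping_clear_unevictable(struct address_space *mapping)
    {
            clear_bit(AS_UNEVICTABLE, &mapping->flags);
    }

    static inline int mapping_unevictable(struct address_space *mapping)
    {
            if (mapping)
                    return test_bit(AS_UNEVICTABLE, &mapping->flags);
            return 0;
    }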
101 | ||
102 | For shared memory regions, AS_UNEVICTABLE is set when an application | |
103 | successfully SHM_LOCKs the region and is removed when the region is | |
104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page | |
105 | tables for the region as does, for example, mlock(). So, we make no special | |
106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. | |
107 | Vmscan will do this when/if it encounters the pages during reclaim. On | |
108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the | |
109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed | |
110 | region is destroyed, the pages are also "rescued" from the unevictable list in | |
111 | the process of freeing them. | |
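
A condensed sketch of how shmctl() drives this from shmem_lock() [the
"rescue" helper name scan_mapping_unevictable_pages() and the exact flag
handling are illustrative]:

    /* inside shmem_lock(); accounting and error handling omitted */
    if (lock && !(info->flags & VM_LOCKED)) {
            info->flags |= VM_LOCKED;
            mapping_set_unevictable(file->f_mapping);
    }
    if (!lock && (info->flags & VM_LOCKED)) {
            info->flags &= ~VM_LOCKED;
            mapping_clear_unevictable(file->f_mapping);
            /* "rescue" pages already stranded on the unevictable list */
            scan_mapping_unevictable_pages(file->f_mapping);
    }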
112 | ||
113 | page_evictable() detects mlock()ed pages by testing an additional page flag, | |
114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a | |
115 | non-NULL vma is supplied, page_evictable() will check whether the vma is | |
116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | |
117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | |
118 | efficient "culling" of pages in the fault path that are being faulted in to | |
119 | VM_LOCKED vmas. | |
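
Putting the two tests together, page_evictable() looks roughly like this
[a sketch; the signature of is_mlocked_vma() is assumed and the code in
mm/vmscan.c may differ in detail]:

    int page_evictable(struct page *page, struct vm_area_struct *vma)
    {
            if (mapping_unevictable(page_mapping(page)))
                    return 0;       /* ramfs or SHM_LOCKed shmem page */

            if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
                    return 0;       /* mlocked page */

            return 1;
    }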
120 | ||
121 | ||
122 | Unevictable Pages and Vmscan [shrink_*_list()] | |
123 | ||
124 | If unevictable pages are culled in the fault path, or moved to the unevictable | |
125 | list at mlock() or mmap() time, vmscan will never encounter the pages until | |
126 | they have become evictable again, for example, via munlock() and have been | |
127 | "rescued" from the unevictable list. However, there may be situations where we | |
128 | decide, for the sake of expediency, to leave a unevictable page on one of the | |
129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for | |
130 | such pages in all of the shrink_{active|inactive|page}_list() functions and | |
131 | will "cull" such pages that it encounters--that is, it diverts those pages to | |
132 | the unevictable list for the zone being scanned. | |
133 | ||
134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the | |
135 | page is not marked as PageMlocked. Such pages will make it all the way to | |
136 | shrink_page_list() where they will be detected when vmscan walks the reverse | |
137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() | |
138 | will cull the page at that point. | |
139 | ||
140 | Note that for anonymous pages, shrink_page_list() attempts to add the page to | |
141 | the swap cache before it tries to unmap the page. To avoid this unnecessary | |
142 | consumption of swap space, shrink_page_list() calls try_to_munlock() to check | |
143 | whether any VM_LOCKED vmas map the page without attempting to unmap the page. | |
144 | If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page | |
145 | without consuming swap space. try_to_munlock() will be described below. | |
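
A sketch of where this check sits in the shrink_page_list() loop [the
labels and the add_to_swap() call are illustrative, not the literal code]:

    /* page is locked and has been isolated from the LRU */
    if (PageAnon(page) && !PageSwapCache(page)) {
            /*
             * Ask the reverse map first, so we don't allocate swap
             * space for a page that a VM_LOCKED vma pins anyway.
             */
            if (try_to_munlock(page) == SWAP_MLOCK)
                    goto cull_mlocked;      /* divert to unevictable list */
            if (!add_to_swap(page))
                    goto activate_locked;
    }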
146 | ||
147 | To "cull" an unevictable page, vmscan simply puts the page back on the lru | |
148 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- | |
149 | after dropping the page lock. Because the condition which makes the page | |
150 | unevictable may change once the page is unlocked, putback_lru_page() will | |
151 | recheck the unevictable state of a page that it places on the unevictable lru | |
152 | list. If the page has become unevictable, putback_lru_page() removes it from | |
153 | the list and retries, including the page_unevictable() test. Because such a | |
154 | race is a rare event and movement of pages onto the unevictable list should be | |
155 | rare, these extra evictabilty checks should not occur in the majority of calls | |
156 | to putback_lru_page(). | |
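
A simplified sketch of that recheck loop [reference counting and statistics
are glossed over; the real function is more careful]:

    void putback_lru_page(struct page *page)
    {
    redo:
            ClearPageUnevictable(page);
            if (page_evictable(page, NULL)) {
                    /* back onto a regular active/inactive lru list */
                    lru_cache_add_lru(page, page_lru(page));
            } else {
                    SetPageUnevictable(page);
                    add_page_to_unevictable_list(page);
                    /*
                     * The page's state may have changed while we were
                     * adding it; if it is evictable after all, pull it
                     * back off and retry, repeating the evictable test.
                     */
                    if (page_evictable(page, NULL) &&
                        !isolate_lru_page(page)) {
                            put_page(page); /* isolate took a reference */
                            goto redo;
                    }
            }
            put_page(page); /* drop the isolate_lru_page() reference */
    }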
157 | ||
158 | ||
159 | Mlocked Page: Prior Work | |
160 | ||
161 | The "Unevictable Mlocked Pages" infrastructure is based on work originally | |
162 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | |
163 | Nick posted his patch as an alternative to a patch posted by Christoph | |
164 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. | |
165 | In Nick's patch, he used one of the struct page lru list link fields as a count | |
166 | of VM_LOCKED vmas that map the page. This use of the link field for a count | |
167 | prevented the management of the pages on an LRU list. Thus, mlocked pages were | |
168 | not migratable as isolate_lru_page() could not find them and the lru list link | |
169 | field was not available to the migration subsystem. Nick resolved this by | |
170 | putting mlocked pages back on the lru list before attempting to isolate them, | |
171 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated | |
172 | with the Unevictable LRU work, the count was replaced by walking the reverse | |
173 | map to determine whether any VM_LOCKED vmas mapped the page. More on this | |
174 | below. | |
175 | ||
176 | ||
177 | Mlocked Pages: Basic Management | |
178 | ||
179 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of | |
180 | unevictable pages. When such a page has been "noticed" by the memory | |
181 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] | |
182 | flag. A PageMlocked() page will be placed on the unevictable LRU list when | |
183 | it is added to the LRU. Pages can be "noticed" by memory management in | |
184 | several places: | |
185 | ||
186 | 1) in the mlock()/mlockall() system call handlers. | |
187 | 2) in the mmap() system call handler when mmap()ing a region with the | |
188 | MAP_LOCKED flag, or mmap()ing a region in a task that has called | |
189 | mlockall() with the MCL_FUTURE flag. Both of these conditions result | |
190 | in the VM_LOCKED flag being set for the vma. | |
191 | 3) in the fault path, if mlocked pages are "culled" in the fault path, | |
192 | and when a VM_LOCKED stack segment is expanded. | |
193 | 4) as mentioned above, in vmscan:shrink_page_list() with attempting to | |
194 | reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). | |
195 | ||
196 | Mlocked pages become unlocked and rescued from the unevictable list when: | |
197 | ||
198 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. | |
199 | 2) munmapped() out of the last VM_LOCKED vma that maps the page, including | |
200 | unmapping at task exit. | |
201 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. | |
202 | 4) before a page is COWed in a VM_LOCKED vma. | |
203 | ||
204 | ||
205 | Mlocked Pages: mlock()/mlockall() System Call Handling | |
206 | ||
207 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | |
208 | for each vma in the range specified by the call. In the case of mlockall(), | |
209 | this is the entire active address space of the task. Note that mlock_fixup() | |
210 | is used for both mlock()ing and munlock()ing a range of memory. A call to | |
211 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED | |
212 | is treated as a no-op--mlock_fixup() simply returns. | |
213 | ||
214 | If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" | |
215 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split | |
216 | off a subset of the vma if the range does not cover the entire vma. Once the | |
217 | vma has been merged or split or neither, mlock_fixup() will call | |
218 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and | |
219 | to mark the pages as mlocked via mlock_vma_page(). | |
220 | ||
221 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, | |
222 | get_user_pages() will be unable to fault in the pages. That's OK. If pages | |
223 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the | |
224 | fault path or in vmscan. | |
225 | ||
226 | Also note that a page returned by get_user_pages() could be truncated or | |
227 | migrated out from under us, while we're trying to mlock it. To detect | |
228 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring | |
229 | the page lock. If the page is still associated with its mapping, we'll | |
230 | go ahead and call mlock_vma_page(). If the mapping is gone, we just | |
231 | unlock the page and move on. Worse case, this results in page mapped | |
232 | in a VM_LOCKED vma remaining on a normal LRU list without being | |
233 | PageMlocked(). Again, vmscan will detect and cull such pages. | |
234 | ||
235 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will | |
236 | TestSetPageMlocked() for each page returned by get_user_pages(). We use | |
237 | TestSetPageMlocked() because the page might already be mlocked by another | |
238 | task/vma and we don't want to do extra work. We especially do not want to | |
239 | count an mlocked page more than once in the statistics. If the page was | |
240 | already mlocked, mlock_vma_page() is done. | |
241 | ||
242 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | |
243 | page from the LRU, as it is likely on the appropriate active or inactive list | |
244 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will | |
245 | putback the page--putback_lru_page()--which will notice that the page is now | |
246 | mlocked and divert the page to the zone's unevictable LRU list. If | |
247 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | |
248 | it later if/when it attempts to reclaim the page. | |
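
A minimal sketch of the above, assuming an NR_MLOCK zone counter for the
statistics [details of the real function may differ]:

    void mlock_vma_page(struct page *page)
    {
            BUG_ON(!PageLocked(page));

            if (!TestSetPageMlocked(page)) {
                    inc_zone_page_state(page, NR_MLOCK);
                    if (!isolate_lru_page(page))
                            /* notices PG_mlocked and diverts the page
                             * to the zone's unevictable list */
                            putback_lru_page(page);
            }
    }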
249 | ||
250 | ||
251 | Mlocked Pages: Filtering Special Vmas | |
252 | ||
253 | mlock_fixup() filters several classes of "special" vmas: | |
254 | ||
255 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind | |
256 | these mappings are inherently pinned, so we don't need to mark them as | |
257 | mlocked. In any case, most of the pages have no struct page in which to | |
258 | so mark the page. Because of this, get_user_pages() will fail for these | |
259 | vmas, so there is no sense in attempting to visit them. | |
260 | ||
261 | 2) vmas mapping hugetlbfs page are already effectively pinned into memory. | |
262 | We don't need nor want to mlock() these pages. However, to preserve the | |
263 | prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() | |
264 | will call make_pages_present() in the hugetlbfs vma range to allocate the | |
265 | huge pages and populate the ptes. | |
266 | ||
267 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of | |
268 | kernel pages, such as the vdso page, relay channel pages, etc. These pages | |
269 | are inherently unevictable and are not managed on the LRU lists. | |
270 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls | |
271 | make_pages_present() to populate the ptes. | |
272 | ||
273 | Note that for all of these special vmas, mlock_fixup() does not set the | |
274 | VM_LOCKED flag. Therefore, we won't have to deal with them later during | |
275 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() | |
276 | account these vmas against the task's "locked_vm". | |
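
The filter at the top of mlock_fixup() might look roughly like this [a
sketch following the three classes above; the exact tests differ]:

    /* sketch of the "special" vma filter in mlock_fixup() */
    if (newflags == vma->vm_flags ||
        (vma->vm_flags & (VM_IO | VM_PFNMAP)))
            goto out;       /* no-op, or class 1: skip entirely */

    if (is_vm_hugetlb_page(vma) ||
        (vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED))) {
            /* classes 2 and 3: populate the ptes, but don't set
             * VM_LOCKED and don't count the vma against locked_vm */
            if (newflags & VM_LOCKED)
                    make_pages_present(start, end);
            goto out;
    }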
277 | ||
278 | Mlocked Pages: Downgrading the Mmap Semaphore. | |
279 | ||
280 | mlock_fixup() must be called with the mmap semaphore held for write, because | |
281 | it may have to merge or split vmas. However, mlocking a large region of | |
282 | memory can take a long time--especially if vmscan must reclaim pages to | |
283 | satisfy the regions requirements. Faulting in a large region with the mmap | |
284 | semaphore held for write can hold off other faults on the address space, in | |
285 | the case of a multi-threaded task. It can also hold off scans of the task's | |
286 | address space via /proc. While testing under heavy load, it was observed that | |
287 | the ps(1) command could be held off for many minutes while a large segment was | |
288 | mlock()ed down. | |
289 | ||
290 | To address this issue, and to make the system more responsive during mlock()ing | |
291 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode | |
292 | during the call to __mlock_vma_pages_range(). This works fine. However, the | |
293 | callers of mlock_fixup() expect the semaphore to be returned in write mode. | |
294 | So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not | |
295 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore | |
296 | and reacquire it in write mode. In a multi-threaded task, it is possible for | |
297 | the task memory map to change while the semaphore is dropped. Therefore, | |
298 | mlock_fixup() looks up the vma at the range start address after reacquiring | |
299 | the semaphore in write mode and verifies that it still covers the original | |
300 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of | |
301 | mlock_fixup() have been changed to deal with this new error condition. | |
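
In outline, the sequence looks like this [a sketch built on the stock
rw_semaphore primitives; the surrounding bookkeeping is omitted]:

    downgrade_write(&mm->mmap_sem);         /* write -> read, atomically */
    ret = __mlock_vma_pages_range(vma, start, end);

    /* no atomic read -> write upgrade exists: drop and retake */
    up_read(&mm->mmap_sem);
    down_write(&mm->mmap_sem);

    /* the map may have changed while the semaphore was dropped */
    vma = find_vma(mm, start);
    if (!vma || vma->vm_start > start || vma->vm_end < end)
            return -EAGAIN;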
302 | ||
303 | Note: when munlocking a region, all of the pages should already be resident-- | |
304 | unless we have racing threads mlocking() and munlocking() regions. So, | |
305 | unlocking should not have to wait for page allocations nor faults of any kind. | |
306 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). | |
307 | ||
308 | ||
309 | Mlocked Pages: munlock()/munlockall() System Call Handling | |
310 | ||
311 | The munlock() and munlockall() system calls are handled by the same functions-- | |
312 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock | |
313 | vs lock operation indicated by an argument. So, these system calls are also | |
314 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, | |
315 | mlock_fixup() simply returns. Because of the vma filtering discussed above, | |
316 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be | |
317 | ignored for munlock. | |
318 | ||
319 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off | |
320 | the specified range. The range is then munlocked via the function | |
321 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- | |
322 | passing a flag to indicate that munlock() is being performed. | |
323 | ||
324 | Because the vma access protections could have been changed to PROT_NONE after | |
325 | faulting in and mlocking some pages, get_user_pages() was unreliable for visiting | |
326 | these pages for munlocking. Because we don't want to leave pages mlocked(), | |
327 | get_user_pages() was enhanced to accept a flag to ignore the permissions when | |
328 | fetching the pages--all of which should be resident as a result of previous | |
329 | mlock()ing. | |
330 | ||
331 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | |
332 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | |
333 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() | |
334 | use the Test*PageMlocked() function to handle the case where the page might | |
335 | have already been unlocked by another task. If the page was mlocked, | |
336 | munlock_vma_page() updates that zone statistics for the number of mlocked | |
337 | pages. Note, however, that at this point we haven't checked whether the page | |
338 | is mapped by other VM_LOCKED vmas. | |
339 | ||
340 | We can't call try_to_munlock(), the function that walks the reverse map to check | |
341 | for other VM_LOCKED vmas, without first isolating the page from the LRU. | |
342 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | |
343 | not be on an lru list. [More on these below.] However, the call to | |
344 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). | |
345 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance | |
346 | we have. If we can successfully isolate the page, we go ahead and | |
347 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | |
348 | page statistics if it finds another vma holding the page mlocked. If we fail | |
349 | to isolate the page, we'll have left a potentially mlocked page on the LRU. | |
350 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the | |
351 | page. This should be relatively rare. | |
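
A sketch of this ordering [statistics names and the re-mlock path are
simplified]:

    static void munlock_vma_page(struct page *page)
    {
            BUG_ON(!PageLocked(page));

            if (TestClearPageMlocked(page)) {       /* clear up front */
                    dec_zone_page_state(page, NR_MLOCK);
                    if (!isolate_lru_page(page)) {
                            /*
                             * Walk the reverse map; if another VM_LOCKED
                             * vma maps the page, this restores PG_mlocked
                             * and the statistics.
                             */
                            try_to_munlock(page);
                            putback_lru_page(page);
                    }
                    /* else: a potentially mlocked page stays on the LRU;
                     * vmscan will recheck it later */
            }
    }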
352 | ||
353 | Mlocked Pages: Migrating Them... | |
354 | ||
355 | A page that is being migrated has been isolated from the lru lists and is | |
356 | held locked across unmapping of the page, updating the page's mapping | |
357 | [address_space] entry and copying the contents and state, until the | |
358 | page table entry has been replaced with an entry that refers to the new | |
359 | page. Linux supports migration of mlocked pages and other unevictable | |
360 | pages. This involves simply moving the PageMlocked and PageUnevictable states | |
361 | from the old page to the new page. | |
362 | ||
363 | Note that page migration can race with mlocking or munlocking of the same | |
364 | page. This has been discussed from the mlock/munlock perspective in the | |
365 | respective sections above. Both processes [migration, m[un]locking], hold | |
366 | the page locked. This provides the first level of synchronization. Page | |
367 | migration zeros out the page_mapping of the old page before unlocking it, | |
368 | so m[un]lock can skip these pages by testing the page mapping under page | |
369 | lock. | |
370 | ||
371 | When completing page migration, we place the new and old pages back onto the | |
372 | lru after dropping the page lock. The "unneeded" page--old page on success, | |
373 | new page on failure--will be freed when the reference count held by the | |
374 | migration process is released. To ensure that we don't strand pages on the | |
375 | unevictable list because of a race between munlock and migration, page | |
376 | migration uses the putback_lru_page() function to add migrated pages back to | |
377 | the lru. | |
378 | ||
379 | ||
380 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling | |
381 | ||
382 | In addition the the mlock()/mlockall() system calls, an application can request | |
383 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() | |
384 | call. Furthermore, any mmap() call or brk() call that expands the heap by a | |
385 | task that has previously called mlockall() with the MCL_FUTURE flag will result | |
386 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, | |
387 | the kernel simply called make_pages_present() to allocate pages and populate | |
388 | the page table. | |
389 | ||
390 | To mlock a range of memory under the unevictable/mlock infrastructure, the | |
391 | mmap() handler and task address space expansion functions call | |
392 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | |
393 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in | |
394 | "Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will | |
395 | have already been set by the caller, in filtered vmas. Thus these vma's need | |
396 | not be visited for munlock when the region is unmapped. | |
397 | ||
398 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to | |
399 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), | |
400 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | |
401 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore | |
402 | back to write mode before returning. | |
403 | ||
404 | The callers of mlock_vma_pages_range() will have already added the memory | |
405 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, | |
406 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the | |
407 | callers then subtract a non-negative return value from the task's locked_vm. | |
408 | A negative return value represent an error--for example, from get_user_pages() | |
409 | attempting to fault in a vma with PROT_NONE access. In this case, we leave | |
410 | the memory range accounted as locked_vm, as the protections could be changed | |
411 | later and pages allocated into that region. | |
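
A condensed sketch of that accounting, as it might appear in a caller such
as mmap_region() [illustrative only]:

    /* caller has already done: mm->locked_vm += len >> PAGE_SHIFT; */
    if (vma->vm_flags & VM_LOCKED) {
            long nr_pages = mlock_vma_pages_range(vma, addr, addr + len);
            if (nr_pages >= 0)
                    mm->locked_vm -= nr_pages;      /* pages NOT mlocked */
            /* nr_pages < 0: error; leave the range accounted as locked */
    }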
412 | ||
413 | ||
414 | Mlocked Pages: munmap()/exit()/exec() System Call Handling | |
415 | ||
416 | When unmapping an mlocked region of memory, whether by an explicit call to | |
417 | munmap() or via an internal unmap from exit() or exec() processing, we must | |
418 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. | |
419 | Before the unevictable/mlock changes, mlocking did not mark the pages in any way, | |
420 | so unmapping them required no processing. | |
421 | ||
422 | To munlock a range of memory under the unevictable/mlock infrastructure, the | |
423 | munmap() hander and task address space tear down function call | |
424 | munlock_vma_pages_all(). The name reflects the observation that one always | |
425 | specifies the entire vma range when munlock()ing during unmap of a region. | |
426 | Because of the vma filtering when mlocking() regions, only "normal" vmas that | |
427 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | |
428 | ||
429 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() | |
430 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | |
431 | for the vma's memory range and munlock_vma_page() each resident page mapped by | |
432 | the vma. This effectively munlocks the page, only if this is the last | |
433 | VM_LOCKED vma that maps the page. | |
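
A minimal sketch:

    void munlock_vma_pages_all(struct vm_area_struct *vma)
    {
            vma->vm_flags &= ~VM_LOCKED;
            /* munlock_vma_page() each resident page in the range */
            __munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
    }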
434 | ||
435 | ||
436 | Mlocked Page: try_to_unmap() | |
437 | ||
438 | [Note: the code changes represented by this section are really quite small | |
439 | compared to the text to describe what happening and why, and to discuss the | |
440 | implications.] | |
441 | ||
442 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may | |
443 | have VM_LOCKED flag set. It is possible for a page mapped into one or more | |
444 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one | |
445 | of the active or inactive LRU lists. This could happen if, for example, a | |
446 | task in the process of munlock()ing the page could not isolate the page from | |
447 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page | |
448 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To | |
449 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED | |
450 | vmas while it is walking a page's reverse map. | |
451 | ||
452 | try_to_unmap() is always called, by either vmscan for reclaim or for page | |
453 | migration, with the argument page locked and isolated from the LRU. BUG_ON() | |
454 | assertions enforce this requirement. Separate functions handle anonymous and | |
455 | mapped file pages, as these types of pages have different reverse map | |
456 | mechanisms. | |
457 | ||
458 | try_to_unmap_anon() | |
459 | ||
460 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be | |
461 | visited--at least until a VM_LOCKED vma is encountered. If the page is being | |
462 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked | |
463 | pages are migratable. However, for reclaim, if the page is mapped into a | |
464 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap | |
465 | semphore of the mm_struct to which the vma belongs in read mode. If this is | |
466 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we | |
467 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and | |
468 | will return SWAP_MLOCK, indicating that the page is unevictable. If the | |
469 | mmap semaphore cannot be acquired, we are not sure whether the page is really | |
470 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. | |
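
The per-vma check looks roughly like this [a sketch; the real walker
interleaves it with the pte manipulation]:

    /* for each vma in the page's reverse map, during reclaim */
    if (vma->vm_flags & VM_LOCKED) {
            int ret = SWAP_AGAIN;

            if (!down_read_trylock(&vma->vm_mm->mmap_sem))
                    return ret;     /* can't tell; vmscan will retry */
            if (vma->vm_flags & VM_LOCKED) {        /* re-test under sem */
                    /* page is locked and off the LRU, as required */
                    mlock_vma_page(page);
                    ret = SWAP_MLOCK;
            }
            up_read(&vma->vm_mm->mmap_sem);
            return ret;
    }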
471 | ||
472 | try_to_unmap_file() -- linear mappings | |
473 | ||
474 | Unmapping of a mapped file page works the same, except that the scan visits | |
475 | all vmas that maps the page's index/page offset in the page's mapping's | |
476 | reverse map priority search tree. It must also visit each vma in the page's | |
477 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, | |
478 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will | |
479 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, | |
480 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. | |
481 | ||
482 | try_to_unmap_file() -- non-linear mappings | |
483 | ||
484 | If a page's mapping contains a non-empty non-linear mapping vma list, then | |
485 | try_to_un{map|lock}() must also visit each vma in that list to determine | |
486 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit | |
487 | all vmas in the non-linear list to ensure that the pages is not/should not be | |
488 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. | |
489 | However, there is no easy way to determine whether the page is actually mapped | |
490 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma | |
491 | actually pins the page. | |
492 | ||
493 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain | |
494 | number of pages--a "cluster"--in each non-linear vma associated with the page's | |
495 | mapping, for each file mapped page that vmscan tries to unmap. If this happens | |
496 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on | |
497 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it | |
498 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take | |
499 | advantage of the cluster scan in try_to_unmap_cluster() as follows: | |
500 | ||
501 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap | |
502 | semaphore of the associated mm_struct for read without blocking. If this | |
503 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will | |
504 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, | |
505 | for each page in the cluster, if we're holding the mmap semaphore for a locked | |
506 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This | |
507 | call is a no-op if the page is already locked, but will mlock any pages in | |
508 | the non-linear mapping that happen to be unlocked. If one of the pages so | |
509 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will | |
510 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan | |
511 | to cull the page, rather than recirculating it on the inactive list. Again, | |
512 | if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns | |
513 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but | |
514 | couldn't be mlocked. | |
515 | ||
516 | ||
517 | Mlocked pages: try_to_munlock() Reverse Map Scan | |
518 | ||
519 | TODO/FIXME: a better name might be page_mlocked()--analogous to the | |
520 | page_referenced() reverse map walker--especially if we continue to call this | |
521 | from shrink_page_list(). See related TODO/FIXME below. | |
522 | ||
523 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System | |
524 | Call Handling" above--tries to munlock a page, or when shrink_page_list() | |
525 | encounters an anonymous page that is not yet in the swap cache, they need to | |
526 | determine whether or not the page is mapped by any VM_LOCKED vma, without | |
527 | actually attempting to unmap all ptes from the page. For this purpose, the | |
528 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called | |
529 | try_to_munlock(). | |
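
In outline, try_to_munlock() just dispatches to the shared walkers with an
unlock flag set [a sketch; the actual argument conventions may differ]:

    int try_to_munlock(struct page *page)
    {
            /* same entry conditions as try_to_unmap() */
            BUG_ON(!PageLocked(page) || PageLRU(page));

            if (PageAnon(page))
                    return try_to_unmap_anon(page, 0, 1);   /* !migration,
                                                               unlock */
            return try_to_unmap_file(page, 0, 1);
    }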
530 | ||
531 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | |
532 | mapped file pages with an additional argument specifing unlock versus unmap | |
533 | processing. Again, these functions walk the respective reverse maps looking | |
534 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file | |
535 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | |
536 | attempt to acquire the associated mmap semphore, mlock the page via | |
537 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | |
538 | pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs | |
539 | shrink_page_list() that the anonymous page should be culled rather than added | |
540 | to the swap cache in preparation for a try_to_unmap() that will almost | |
541 | certainly fail. | |
542 | ||
543 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap | |
544 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() | |
545 | to recycle the page on the inactive list and hope that it has better luck | |
546 | with the page next time. | |
547 | ||
548 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works | |
549 | slightly differently. On encountering a VM_LOCKED non-linear vma that might | |
550 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking | |
551 | the page. munlock_vma_page() will just leave the page unlocked and let | |
552 | vmscan deal with it--the usual fallback position. | |
553 | ||
554 | Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' | |
555 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. | |
556 | However, the scan can terminate when it encounters a VM_LOCKED vma and can | |
557 | successfully acquire the vma's mmap semphore for read and mlock the page. | |
558 | Although try_to_munlock() can be called many [very many!] times when | |
559 | munlock()ing a large region or tearing down a large address space that has been | |
560 | mlocked via mlockall(), overall this is a fairly rare event. In addition, | |
561 | although shrink_page_list() calls try_to_munlock() for every anonymous page that | |
562 | it handles that is not yet in the swap cache, on average anonymous pages will | |
563 | have very short reverse map lists. | |
564 | ||
565 | Mlocked Page: Page Reclaim in shrink_*_list() | |
566 | ||
567 | shrink_active_list() culls any obviously unevictable pages--i.e., | |
568 | !page_evictable(page, NULL)--diverting these to the unevictable lru | |
569 | list. However, shrink_active_list() only sees unevictable pages that | |
570 | made it onto the active/inactive lru lists. Note that these pages do not | |
571 | have PageUnevictable set--otherwise, they would be on the unevictable list and | |
572 | shrink_active_list would never see them. | |
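
The cull itself is tiny; in the scan loop it reads roughly [a sketch]:

    /* in shrink_active_list(), for each page taken off the list */
    if (unlikely(!page_evictable(page, NULL))) {
            putback_lru_page(page); /* diverts to the unevictable list */
            continue;
    }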
573 | ||
574 | Some examples of these unevictable pages on the LRU lists are: | |
575 | ||
576 | 1) ramfs pages that have been placed on the lru lists when first allocated. | |
577 | ||
578 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to | |
579 | allocate or fault in the pages in the shared memory region. This happens | |
580 | when an application accesses the page the first time after SHM_LOCKing | |
581 | the segment. | |
582 | ||
583 | 3) Mlocked pages that could not be isolated from the lru and moved to the | |
584 | unevictable list in mlock_vma_page(). | |
585 | ||
586 | 3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't | |
587 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. | |
588 | munlock_vma_page() was forced to let the page back on to the normal | |
589 | LRU list for vmscan to handle. | |
590 | ||
591 | shrink_inactive_list() also culls any unevictable pages that it finds | |
592 | on the inactive lists, again diverting them to the appropriate zone's unevictable | |
593 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became | |
594 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or | |
595 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from | |
596 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice | |
597 | the latter, but will pass on to shrink_page_list(). | |
598 | ||
599 | shrink_page_list() again culls obviously unevictable pages that it could | |
600 | encounter for similar reason to shrink_inactive_list(). As already discussed, | |
601 | shrink_page_list() proactively looks for anonymous pages that should have | |
602 | PG_mlocked set but don't--these would not be detected by page_evictable()--to | |
603 | avoid adding them to the swap cache unnecessarily. File pages mapped into | |
604 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to | |
605 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list when | |
606 | try_to_unmap() returns SWAP_MLOCK, as discussed above. | |
607 | ||
608 | TODO/FIXME: If we can enhance the swap cache to reliably remove entries | |
609 | with page_count(page) > 2, as long as all ptes are mapped to the page and | |
610 | not the swap entry, we can probably remove the call to try_to_munlock() in | |
611 | shrink_page_list() and just remove the page from the swap cache when | |
612 | try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() | |
613 | doesn't seem to allow that. | |
614 | ||
615 |