Appearance
text
6.1810 2024 Lecture 16: User-level virtual memory
Reading: VM primitives for user programs by Appel & Li (1991)
Plan:
OS kernel uses virtual memory in creative ways
Paper argues user-level application can also benefit from VM
Concurrent garbage collector
Generation garbage collector
Concurrent check-pointing
Data-compression paging
Persistent stores
Most OSes have mmap() and user-level pagefaults
What VM user-level VM primitives?
Trap: handle page-fault traps in user mode
Prot1: decrease the accessibility of a page
ProtN: decrease the accessibility of N pages
Unprot: increase the accessibility of a page
Dirty: return a list of dirtied pages since previous calls
Map2: map the same physical page at two different VAs with
different prot levels.
Xv6 supports none of them
One way to see the paper is that a good OS should support them
What about Unix today?
Unix today: mmap()
Maps memory into the address space (many flags and options)
Example: mapping file
mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, offset)
Examples: anonymous memory
mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
(preferred over sbrk)
Unix today: mprotect()
Changes the permission of a mapping
Examples:
mprotect(addr, len, PROT_READ)
mprotect(addr, len, PROT_NONE)
results in a trap on access
Unix today: munmap()
Removes a mapping
Example:
munmap(addr, len)
Unix today: sigaction()
Configures a signal handler
(Think sigalarm() lab)
act.sa_sigaction = handle_sigsegv;
act.sa_flags = SA_SIGINFO;
segemptyset(&act.sa_mask)
sigaction(SIGSEGV, &act, NULL);
Additional Linux VM calls
Madvise(), Mincore(), Mremap(), Msync(), Mlock(), Mbind(), Shmat(),..
VM Implementation
Address spaces consists of VMAs and page tables
VMA (virtual memory area)
contiguous range of virtual addresses
same permissions
backed by the same object (file, anonymous memory)
VMA helps the kernel to decide how to handle page faults
Trap/sigaction implementation
PTE (or TLB entry) marked "protected"
CPU saves user state, jumps into kernel.
Kernel asks VM system what to do?
I.e. page in from disk? Core dump?
VM system looks at VMA
Generate signal -- upcall into user process.
Lower on user stack, or on separate stack...
Run user handler, can do anything.
Probably must call UNPROT for referenced page.
That is, must avoid repeated fault.
User handler returns to kernel.
Kernel returns to user program.
Continue or re-start instruction that trapped.
Can we support user-level VM primitives?
Trap: sigaction and SIGSEGV
Prot1: mprotect()
ProtN: mprotect()
Unprot: mprotect()
Dirty: No, but workaround exists
Map2: Not directly, but shm_open/mmap/mmap
Demo: large sqrt table backed by a single page
Application code thinks there is a pre-computed big table
n -> sqrt(n)
Application can lookup sqrt: sqrts[n]
If present, very fast!
Table is bigger than physical memory
User-level VM primitives allow it to occupy only a *single* page
Use case: Concurrent GC
Application allocates memory and computes with it
Application is called the "mutator"
Application doesn't have to call free
(which is error prone, in particular in concurrent programs)
Collector concurrently finds free memory
Memory that isn't in use by application anymore
Not reachable from root pointer or registers of any thread
Traditional implementation requires language/compiler support
Instrument loads/stores
User-level virtual memory can avoid instrumenting load/stores
Faster
Example: Baker's real-time GC algorithm (see toy version in baker.c)
https://dl.acm.org/citation.cfm?id=359460.359470
Divide heap into 2 regions: from-space and to-space
to-space is further divided in: scanned, unscanned, and new
At start of collection: all objects are in from-space
Copy roots to to-space (register and stack)
Put forwarding pointers in from-space to the to-space copy
At the end of scan, from-space is free memory
Don't have to stop the world for GC
Do a little of scanning on each allocation
Or when dereferencing pointer in from-space
Observations
Why attractive?
Alloc is cheap. Compacts, so no free list.
Incremental: every allocation scan/copy a little
This is the "real-time" aspect
What are the cost?
Does pointer reside in the from space? (If so, it needs to be copied)
Requires test and branch for every dereference
Difficult to run collector and program at same time
Race condition between collector tracing heap and program threads
Risk: two copies of the same object
Solution: let application use VM
https://dl.acm.org/doi/10.1145/53990.53992
Avoid explicit checks for references in application threads
After copying roots, unmap the unscanned part of to-space
Initially a page with roots etc.
Page fault when thread accesses unscanned region
handlers scans just the page and inspects all objects, then UNPROT
at most one fault per page
no compiler changes
Easy to make concurrent
A collector thread can run concurrently with application thread
A collector thread can UNPROT a page after scanning
Only synchronization needed is for which thread is scanning which page
Are existing VM primitives good enough for concurrent GC?
MAP2 is the only functionality issue
Can work around with shm_open and mmap the object at different addresses
Are traps &c fast enough?
They say no: 500 us to scan a page, 1200 us to take the trap.
Why not scan 3 pages?
How much slower to run Baker's actual algorithm, w/ checks?
VM version might be faster! Even w/ slow traps.
What about time saved by 2nd CPU scanning? They don't count this.
Is it an issue how often faults occur for concurrent GC?
Not really -- more faults means more scanning.
I.e. we'll get <= one fault per page, at most.
Measurements
Sun3/60: 2080/0.12, 17333 adds
i7-8550U CPU @ 1.80GHz (kaby lake), Linux 6.0.7
One fault in sqrts.c is 5.4us on my laptop (wo. sqrt)
One add @ 2.4Ghz is ~ 0.417ns; thus, ~ 12980 adds
with power policy performance
2024: 10233us for nfault 1000, Linux 6.11.6
1.2 GHz Athlon, FreeBSD 4.3. For trap, unprot, prot.
12 microseconds
Is user-level VM a good idea?
Most of the use cases can be implemented by adding instructions
E.g., check in application thread on reference
Pro:
Avoid compiler changes
CPU provides VM support
Con:
Requires OS support, and efficient support
Most OS kernel can't expose the raw hardware performance of paging
OS imposes much abstraction
What has changed between 1991 and 2022?
Tons of changes to VM API and implementations
VM system evolves continuously
Continued research too (e.g., see OSDI 2020)
Other primitives
mlock, munlock, madvise, mremap, mincore,...
Switching address space is free (tagged TLBs)
Extended addressibility doesn't matter
2^52 bytes of virtual address space