Project Panama
Project Loom
Project Leyden
Project Valhalla
Project Babylon
Project Amber
Slides at slides.nipafx.dev/java-next.
Interconnecting JVM and native code
Profile:
launched July 2014
led by Maurizio Cimadamore
vector API
foreign memory API
foreign function API
Given two float arrays a and b, compute c = -(a² + b²):
// a, b, c have same length
void compute(float[] a, float[] b, float[] c) {
for (int i = 0; i < a.length; i++) {
// c = -(a² + b²)
c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
}
}
Vectorization - modern CPUs:
have multi-word registers (e.g. 512 bit)
can store several numbers (e.g. 16 floats)
can execute several computations at once
⇝ single instruction, multiple data (SIMD)
Just-in-time compiler tries to vectorize loops.
⇝ Auto-vectorization
Works but isn’t reliable.
static final VectorSpecies<Float> VS =
FloatVector.SPECIES_PREFERRED;
// a, b, c length is multiple of vector length
void compute(float[] a, float[] b, float[] c) {
int upperBound = VS.loopBound(a.length);
for (int i = 0; i < upperBound; i += VS.length()) {
var va = FloatVector.fromArray(VS, a, i);
var vb = FloatVector.fromArray(VS, b, i);
// c = -(a² + b²)
var vc = va.mul(va)
.add(vb.mul(vb))
.neg();
vc.intoArray(c, i);
}
}
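If the array lengths weren't a multiple of the vector length, a scalar tail loop (not shown on the slide) would finish the remaining elements after the vectorized loop, reusing the upperBound from VS.loopBound:

// after the vectorized loop: process the tail elements scalar-wise
for (int i = upperBound; i < a.length; i++) {
    c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
}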
Properties:
clear and concise API (given the requirements)
platform agnostic
reliable run-time compilation and performance
graceful degradation
Storing data off-heap is tough:
ByteBuffer is limited (2GB) and inefficient
Unsafe is… unsafe and not supported
Safe and performant foreign-memory API:
control (de)allocation: Arena, MemorySegment, SegmentAllocator
to access/manipulate: MemoryLayout, VarHandle
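A minimal sketch (not on the slide) of how these pieces fit together, using the finalized FFM API; the sizes and values are only illustrative:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// allocates 100 off-heap ints; the memory is freed deterministically
// when the confined arena is closed at the end of the try block
try (Arena arena = Arena.ofConfined()) {
    MemorySegment ints = arena.allocate(ValueLayout.JAVA_INT, 100);
    for (int i = 0; i < 100; i++) {
        ints.setAtIndex(ValueLayout.JAVA_INT, i, i * i);
    }
    int square = ints.getAtIndex(ValueLayout.JAVA_INT, 7); // 49
}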
JNI isn’t ideal:
involves several tedious artifacts (header file, impl, …)
can only interoperate with languages that align
with OS/architecture the JVM was built for
doesn’t reconcile Java/C type systems
Streamlined tooling/API for foreign functions
based on method handles:
jextract: generates method handles from header file
classes to call foreign functions: Linker, FunctionDescriptor, SymbolLookup
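A hedged sketch (not on the slide) of those classes in action, calling the C standard library's strlen; the surrounding method and variable names are made up for illustration:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

static long cStringLength(String text) throws Throwable {
    Linker linker = Linker.nativeLinker();
    // look up "strlen" in the standard library and describe its signature
    MethodHandle strlen = linker.downcallHandle(
        linker.defaultLookup().find("strlen").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    try (Arena arena = Arena.ofConfined()) {
        // copy the Java string into off-heap memory as a C string
        MemorySegment cString = arena.allocateFrom(text);
        return (long) strlen.invokeExact(cString);
    }
}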
connects Java with the native world
offers safe, detailed, and performant APIs
Current work:
improve memory access performance
reduce startup/warmup cost
refine record mappers
improve jextract
Vector API:
🎥 Fast Java Code with the Vector API (Mar 2023)
🎥 The Vector API in JDK 17 (Sep 2021)
📝 FizzBuzz – SIMD Style! (Mar 2021)
Foreign APIs:
📝 design documents
🎥 Panama Update with Maurizio Cimadamore (Jul 2019)
🎥 ByteBuffers are dead, long live ByteBuffers! (Feb 2020)
🎥 The State of Project Panama with Maurizio Cimadamore (Jun 2021)
JVM features and APIs for supporting easy-to-use, high-throughput, lightweight concurrency and new programming models
Profile:
project / wiki / mailing list
launched January 2018
led by Ron Pressler
An application with many blocking operations
had two options:
block platform (OS) threads until task completion:
simple-to-use programming paradigm
can limit throughput
use asynchronous programming:
harder to write and harder still to debug
allows higher throughput
Resolve the conflict between:
simplicity
throughput
A virtual thread:
is a regular Thread
low memory footprint (stack + bytes)
small switching cost
scheduled by the Java runtime
executes on platform thread
waits in memory
(no platform thread blocked)
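As a quick illustration (not on the slide), the thread-builder API creates and starts virtual threads directly:

// start a virtual thread and wait for it to finish
// (join() throws InterruptedException)
Thread virtual = Thread.ofVirtual()
    .name("my-virtual-thread")
    .start(() -> System.out.println(
        "running on " + Thread.currentThread()));
virtual.join();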
a pinned VT will block the PT
caused by object monitors,
native calls, class initialization
a captured VT blocks the PT
caused by file I/O
Object monitor implementation:
was bound to OS threads
required deep refactoring
to work with VTs
fix ships with JDK 24
⇝ No more pinning for synchronized.
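An illustrative sketch (not from the slides) of the pattern that caused pinning; the class and its I/O source are made up. Before the JDK 24 fix, a virtual thread blocking inside the synchronized block pinned its platform (carrier) thread:

class RemoteCounter {
    private final Object lock = new Object();

    // the blocking read happens while holding the monitor:
    // before JDK 24 this pinned the virtual thread's carrier thread
    int nextValue(java.io.InputStream remote) throws java.io.IOException {
        synchronized (lock) {
            return remote.read();
        }
    }
}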
Cause:
native code works on PT’s stack
switching PTs would wreak havoc
Fix:
possible in the JVM, but expensive
fairly easy to avoid
⇝ Don’t call native code, then back to Java, then block.
File I/O capture is caused by JVM/OS limitations.
Linux io_uring allows async I/O but:
adoption incurs overhead
considerable compared to cached SSD-reads
cost/benefit is not good
⇝ No fix for now.
Virtual threads aren’t "faster threads":
Each task takes the same time (same latency).
Virtual threads increase throughput:
when workload is not CPU-bound and
when number of concurrent tasks is high
Virtual threads are cheap and plentiful:
no pooling necessary
allows thread per task
allows liberal creation
of threads for subtasks
⇝ Enables new concurrency programming models.
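A sketch (not on the slide) of the thread-per-task model this enables; the sleeping task is just a placeholder for blocking work:

import java.time.Duration;
import java.util.concurrent.Executors;

// one cheap virtual thread per task - no pool sizing, no thread reuse
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 10_000; i++) {
        executor.submit(() -> {
            Thread.sleep(Duration.ofSeconds(1)); // blocks only the virtual thread
            return "done";
        });
    }
} // close() waits for submitted tasks to finish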
Structured programming:
prescribes single entry point
and clearly defined exit points
influenced languages and runtimes
When the flow of execution splits into multiple concurrent flows, they rejoin in the same code block.
⇝ Threads are short-lived:
start when task begins
end on completion
⇝ Enables parent-child/sibling relationships
and logical grouping of threads.
void handle(Request request, Response response)
throws InterruptedException {
// implicitly short-circuits on error
try (var scope = StructuredTaskScope.open()) {
var subtaskA = scope.fork(this::taskA);
var subtaskB = scope.fork(this::taskB);
// wait explicitly for success
// (throws errors if there were any)
scope.join();
response.send(subtaskA.get() + subtaskB.get());
} catch (ExecutionException ex) {
response.fail(ex);
}
}
Use Joiner to configure completion:
how are results collected?
when are subtasks cancelled?
when does join throw?
Pass to StructuredTaskScope.open(Joiner).
Existing joiners for heterogeneous results:
awaitAllSuccessfulOrThrow():
cancels/throws on first error
default behavior of open()
awaitAll():
never cancels/throws
Existing joiners for homogeneous results:
allSuccessfulOrThrow():
cancels/throws on first error
returns Stream<RESULT>
anySuccessfulResultOrThrow():
cancels/throws if all fail
returns RESULT
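A sketch (not on the slide) of passing a joiner, following the preview API's shape where join() returns the joiner's result; details may shift while the API is in preview:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.StructuredTaskScope;
import java.util.concurrent.StructuredTaskScope.Joiner;

// race several equivalent tasks - the first successful result wins,
// remaining subtasks are cancelled by the joiner
String fetchFromAnyMirror(List<Callable<String>> mirrors)
        throws InterruptedException {
    try (var scope = StructuredTaskScope.open(
            Joiner.<String>anySuccessfulResultOrThrow())) {
        mirrors.forEach(scope::fork);
        return scope.join();
    }
}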
forked tasks are children of the scope
(visible in thread dumps)
creates relationship between threads
success/failure policy can be defined
across all children
With ThreadLocal:
static final ThreadLocal<Principal> PRINCIPAL =
new ThreadLocal<>();
public void serve(Request request, Response response) {
var level = request.isAdmin() ? ADMIN : GUEST;
var principal = new Principal(level);
PRINCIPAL.set(principal);
Application.handle(request, response);
}
// elsewhere
PRINCIPAL.get()
ThreadLocal downsides:
unconstrained mutability
unbounded lifetime
expensive inheritance
ScopedValues improve on that:
write-once (per thread)
clearly scoped
free inheritance
static final ScopedValue<Principal> PRINCIPAL =
ScopedValue.newInstance();
public void serve(Request request, Response response) {
var level = request.isAdmin() ? ADMIN : GUEST;
var principal = new Principal(level);
ScopedValue
.where(PRINCIPAL, principal)
.run(() -> Application
.handle(request, response));
}
// elsewhere
PRINCIPAL.get()
Virtual threads:
code is simple to write, debug, profile
allows high throughput
Structured concurrency:
clearer concurrency code
simpler failure/success policies
better debugging
Scoped values:
safer, more scalable data sharing
Current work:
finalize structured concurrency and
scoped values APIs
reduce pinning during class initialization
improve lock info in thread dumps
JDK 25:
structured concurrency in 5th preview (JEP 505)
scoped values 🤷🏾‍♂️
Faster startup, shorter time to peak performance, smaller footprint
Profile:
launched May 2022
led by Mark Reinhold
Java has really good peak performance,
but also tends to have:
slow startup time
slow warmup time
large footprint
For now, Leyden focusses on startup/warmup.
Two kinds of computation:
expressed by the program
done on behalf of the program, e.g.:
class-loading
JIT compilation
garbage collection
For now, Leyden focusses on the latter.
Early computation on behalf of the program:
class loading
callsite linkage
constant pool resolution
interpretation
profile gathering
JIT compilation (C1, C2)
Java already shifts computation:
compile-time constant folding
class loading
garbage collection
out-of-order execution
Let’s shift more computation ahead of time!
What computation?
Shift everything ahead of time?
class loading & linking
JIT compilation
method profiling
lambda resolution
dead-code elimination
…
But…
Java is highly dynamic:
class loading
class redefinition
linkage
access control
method dispatch
run-time typing (e.g. casting)
introspection
JIT compilation, decompilation
How to AOT everything?
Leyden introduces AOTCache:
observe JVM
capture decisions in AOTCache
(expansion of CDS Archive)
use as "initial state" during future run
fall back to live observation/optimization
if necessary and possible
# training run (⇝ profile)
$ java -XX:AOTMode=record \
    -XX:AOTConfiguration=app.aotconf \
    -cp app.jar com.example.App ...
# assembly phase (profile ⇝ AOTCache)
$ java -XX:AOTMode=create \
    -XX:AOTConfiguration=app.aotconf \
    -XX:AOTCache=app.aot \
    -cp app.jar
# production run (AOTCache ⇝ performance)
$ java -XX:AOTCache=app.aot \
    -cp app.jar com.example.App ...
(Open to improvements.)
Introduced by JEP 483:
Improve startup time by making the classes of an application instantly available, in a loaded and linked state, when the HotSpot JVM starts.
Spring PetClinic benchmarks:
up to ~40% startup time reduction
AOT cache size of ~130 MB
Limitations:
same JDK release / hardware / OS
consistent class path for training and production
consistent module options
limited use of JVMTI agents
Otherwise, AOTCache is ignored.
Leyden’s early access builds AOT more:
method profiling
constant resolution
code compilation
dynamic proxies
reflection data
unfound classes
Benchmarks show ~70% startup time reduction.
Most cached data can be:
validated at runtime
replaced with more accurate
or better data (e.g. JIT code)
More optimizations are possible:
if dynamism is constrained
if program is constrained
Let developers accept constraints, e.g.:
limited class redefinition
closed-world assumption
fixed program configuration
Let Java apply suitable optimizations.
⇝ Performance is an emergent property.
improves Java’s overall footprint
for now: focusses on startup/warmup time
by caching early JVM work
in the future: explores stricter constraints
for more aggressive optimization
Advanced Java VM and Language feature candidates
Profile:
launched July 2014
led by Brian Goetz
Java has a split type system:
primitives
classes
We can only create classes, but classes always:
have identity
come as references
All classes come with identity:
extra memory for header
mutability
locking, synchronization, etc.
But not all custom types need that!
All class instances come as references:
memory access indirection
nullability
But not all custom types need that!
Valhalla’s goal is to unify the type system:
value types (disavow identity)
null-restriction + implicit constructors
(disavow identity + references)
universal generics (ArrayList<int>)
specialized generics (backed by int[])
value class ComplexNumber {
private double real;
private double imaginary;
// constructor, etc.
}
Codes (almost) like a class - exceptions:
class and fields are implicitly final
superclasses are limited
No identity:
some runtime operations throw exceptions
"identity" check == compares by state
null is default value
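A hedged sketch of those proposed semantics, assuming the ComplexNumber value class from before with a two-argument constructor:

var a = new ComplexNumber(1.0, 2.0);
var b = new ComplexNumber(1.0, 2.0);
// value classes have no identity, so == compares by state
boolean same = (a == b); // true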
Benefits:
guaranteed immutability
more expressiveness
more optimizations
The JDK (as well as other libraries) has many value-based classes, such as Optional and LocalDateTime. […] We plan to migrate many value-based classes in the JDK to value classes.
In general, value types have references:
allow null
prevent flattening
How do we get rid of them?
Details are in flux, but possibly:
null-restricted variables and fields:
// number can't be null
ComplexNumber! number = // ...
implicit constructor marks good default instance
value class ComplexNumber {
private double real;
private double imaginary;
// implicitly sets all fields to default values
public implicit ComplexNumber();
public ComplexNumber(double r, double i) {
// ...
}
// etc.
}
The just-in-time compiler can
inline/flatten variables …
of a value type
with implicit constructor
that are null-restricted
Performance comparable to today’s primitives! 🚀
Don’t create a type in order to get performance.
Instead:
"Is the type value-ish?" ⇝ value type
"Is all-fields-default usable?" ⇝ implicit constructor
"Is no null
needed?" ⇝ restrict nullness
Performance emerges from domain decisions!
When everybody creates their own value classes,
boxing becomes omni-present and very painful!
Universal generics allow value classes
as type parameters:
List<long> ids = new ArrayList<>();
List<RationalNumber> numbers = new ArrayList<>();
Healing the rift in the type system is great!
But if ArrayList<int> is backed by Object[], it will still be avoided in many cases.
Specialized generics will fix that:
Generics over primitives will avoid references!
Value types, implicit constructors, null-restriction
plus universal and specialized generics:
fewer trade-offs between
design and performance
no more manual specializations
better performance
can express design more clearly
more robust APIs
Makes Java more expressive and performant.
🤷🏾‍♂️
(All effort is focussed on JEP 401.)
📝 State of Valhalla
🎥 Valhalla - Java’s Epic Refactor (Aug 2021)
Extend the reach of Java to foreign programming models such as SQL, differentiable programming, machine learning models, and GPUs
Profile:
launched January 2024
led by Paul Sandoz
Java is adjacent to other programmable systems:
GPUs and FPGAs
SQL databases
differentiable functions
Allow programming them with Java code.
Don’t adapt to each realm in a separate project.
Instead:
make Java code accessible
provide API to read and transform it
let ecosystem provide adaptions
Babylons’s central mechanism is code reflection:
enhancement of "regular" reflection
reaches down into methods/lambdas
symbolic representation of (Java) code
These are called code models.
Abstract syntax tree:
constructed during compilation
closely aligned with Java grammar
too much syntactic info
Bytecode:
created by compiler
specified by JVM Specification
too little important info
The code model design is heavily influenced by the design of data structures used by many modern compilers to represent code. These data structures are commonly referred to as Intermediate Representations (IRs). The design is further influenced by Multi-Level Intermediate Representation (MLIR), a sub-project of the LLVM Compiler Infrastructure project.
Identify code (e.g. with annotation):
@CodeReflection
static double sub(double a, double b) {
return a - b;
}
Then:
compiler creates code model
stored in class files
accessible via reflection API
can be transformed by Java code
"Direct" GPU programming:
transform to GPU kernels (OpenCL C or CUDA C)
compile with GPU-specific toolchain
Triton-style:
offer class Triton
with static methods
transform to Triton code model
compile with Triton toolchain
@CodeReflection
static void add_kernel2(
Ptr xPtr, Ptr yPtr, Ptr result, int n, int size) {
var pid = Triton.programId(0);
var block_start = pid * size;
var range = Triton.arange(0, size);
var offsets = Triton.add(block_start, range);
var mask = Triton.compare(
offsets, n, Triton.CompareKind.LessThan);
var x = Triton.load(Triton.add(xPtr, offsets), mask);
var y = Triton.load(Triton.add(yPtr, offsets), mask);
var output = Triton.add(x, y);
Triton.store(
Triton.add(result, offsets), output, mask);
}
introduces code reflection & code models
allows their transformation
expands Java to foreign programming models
spearheads Java-on-GPU efforts (HAT)
🤷🏾‍♂️
📝 Exploring Triton GPU programming for neural networks in Java
🎥 Code Reflection (Aug 2024)
🎥 Heterogeneous Accelerator Toolkit (Sep 2024)
🎥 Translating Java to SPIR-V (Aug 2024)
Smaller, productivity-oriented Java language features
Profile:
project / wiki / mailing list
launched March 2017
led by Brian Goetz
Some downsides of Java:
can be cumbersome
tends to require boilerplate
situational lack of expressiveness
Amber continuously improves that situation.
multi-file source launcher ㉒ (JEP 458)
unnamed variables and patterns ㉒ (JEP 456)
patterns in switch ㉑ (JEP 441)
record patterns ㉑ (JEP 440)
sealed types ⑰ (JEP 409)
records ⑯ (JEP 395)
type pattern matching ⑯ (JEP 394)
text blocks ⑮ (JEP 378)
switch expressions ⑭ (JEP 361)
local-variable type inference with var ⑩ (JEP 286)
Amber’s main thrust is pattern matching:
records
sealed types
improved switch
patterns
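A small sketch (not from the slides) of how those pieces combine on JDK 21:

sealed interface Shape permits Circle, Rectangle { }
record Circle(double radius) implements Shape { }
record Rectangle(double width, double height) implements Shape { }

static double area(Shape shape) {
    // record patterns in an exhaustive switch over the sealed hierarchy
    return switch (shape) {
        case Circle(double radius) -> Math.PI * radius * radius;
        case Rectangle(double width, double height) -> width * height;
    };
}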
Other endeavors and conversations:
primitive types in patterns (JEP 488)
simplified main (JEP 495)
flexible constructor bodies (JEP 492)
deconstruction of classes
derived record creation ("withers") (JEP 468)
deconstruction assignment (announcement)
serialization 2.0 (talk at Devoxx BE)
concise method bodies (JEP draft)
makes Java more expressive
reduces amount of code
makes us more productive
JDK 21:
records & sealed types
pattern matching basics
text blocks
single-file source launcher
JDK 22:
unnamed patterns
multi-file source launcher
babylon: Pieter Brueghel the Elder, The Tower of Babel (Vienna), public domain
valhalla: Emil Doepler, public domain