Most AL code doesn’t need to be fast. You read a few records, run some business logic, post a document — the database round-trips dominate, and the language overhead is noise. But every so often you end up doing real computation in AL: parsing a binary format, transforming a buffer, hashing, encrypting, compressing. The moment you’re touching millions of bytes in a loop, AL stops being “fast enough by default” and starts punishing every habit you brought from record-level code.
I recently took a compute-heavy routine from 16 seconds down to about 1 second. None of it required leaving AL or adding a dependency. It came down to a handful of patterns that are worth knowing before you write your next tight loop.
The cost model: boundary crossings are the enemy
Here’s the mental model that explains almost everything below. In a hot loop, the expensive thing is rarely the arithmetic. It’s crossing a boundary:
- An InStream.Read/OutStream.Write call crosses into the platform’s stream layer. Each call is cheap in isolation and ruinous a million times over.
- A call into another codeunit (TypeHelper, Math, anything) pays a dispatch cost on every invocation.
- Even a call to a local procedure has overhead that’s invisible at record scale and very visible at per-byte scale.
So the recurring move is: do more work per boundary crossing, or eliminate the crossing entirely. Let’s make that concrete.
Pattern 1: Batch your stream I/O
The naive way to transform a stream is one byte at a time:
for i := 1 to ByteCount do begin
    InStr.Read(B); // one boundary crossing
    OutStr.Write(Xform(B)); // another
end;
For a megabyte of data that’s two million interop calls. The fix depends on what you’re doing:
If you’re copying or skipping, let the platform move the bytes in bulk. CopyStream with a length does in one native call what the per-byte loop spreads over millions:
CopyStream(OutStr, InStr, ByteCount); // bulk copy, no per-byte loop
A bulk skip is the same idea — copy into a throwaway stream rather than reading byte by byte.
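A minimal sketch of that bulk skip, assuming a Temp Blob is an acceptable throwaway target. The procedure name SkipBytes is hypothetical, and the discarded bytes do occupy memory until the temp blob goes out of scope, so this suits moderate skip sizes rather than huge ones:

```al
// Hypothetical helper: advance InStr by ByteCount bytes without a per-byte loop.
local procedure SkipBytes(var InStr: InStream; ByteCount: Integer)
var
    TempBlob: Codeunit "Temp Blob";
    Discard: OutStream;
begin
    if ByteCount <= 0 then
        exit;
    TempBlob.CreateOutStream(Discard);
    // One bulk call; the copied bytes are never read back.
    CopyStream(Discard, InStr, ByteCount);
end;
```

The caller keeps using InStr afterwards and simply continues from the new position.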
If you must touch each byte, read and write four at a time through an Integer. AL’s InStream.Read(MyInteger) reads four bytes (little-endian) in a single call, and OutStream.Write(MyInteger) writes four:
// Read 4 source bytes per interop call instead of 1
for n := 1 to ByteCount div 4 do begin
    if InStr.Read(V) <> 4 then
        Error('stream ended early');
    // unpack V into 4 bytes, transform, repack into an Integer, write once
    ...
end;
// then handle the 0–3 leftover bytes with a small byte loop
That’s a 4× cut in interop calls for free. Mind the sign bit: an Integer is signed, so the top byte ≥ 128 means V is negative — unpack it through a BigInteger (add 2^32) in that case.
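The unpack and repack steps could look roughly like the sketch below. Both procedures are illustrative rather than the routine's actual code; they assume byte values are held as Integers in the range 0..255 and that the stream order is little-endian, as described above. The bit pattern 0x80000000 sits at the very edge of the Integer range, so it is worth a dedicated test on your platform version:

```al
// Hypothetical helper: split a 4-byte Integer into byte values 0..255.
local procedure UnpackQuad(V: Integer; var Quad: array[4] of Integer)
var
    U: BigInteger;
    i: Integer;
begin
    U := V;
    if U < 0 then
        U += 4294967296L; // add 2^32: reinterpret the signed value as unsigned
    for i := 1 to 4 do begin
        Quad[i] := U mod 256; // least significant byte first (little-endian)
        U := U div 256;
    end;
end;

// Hypothetical helper: the inverse of UnpackQuad.
local procedure PackQuad(Quad: array[4] of Integer): Integer
var
    U: BigInteger;
    i: Integer;
begin
    // Accumulate in a BigInteger: a top byte of 128 or more overflows Integer math.
    for i := 4 downto 1 do
        U := U * 256 + Quad[i];
    if U >= 2147483648L then
        U -= 4294967296L; // subtract 2^32: back into the signed Integer range
    exit(U);
end;
```

In a proven hot loop these bodies would eventually be inlined (see Pattern 3), but keeping them as procedures first makes the sign handling easy to test in isolation.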
If you’re accumulating text, never concatenate in a loop and never write fragments to a stream as you go. Build into a TextBuilder and emit once:
// Before: a stream write per fragment, many crossings per item
Buffer.WriteText(Piece);
// After: accumulate in memory, flush once at the end
Builder.Append(Piece);
// ... later ...
OutStr.WriteText(Builder.ToText());
In the routine I optimized, the per-item output was making roughly eight stream writes apiece. Buffering the whole thing and flushing once per batch collapsed that to a single write — same bytes, a fraction of the crossings.
Pattern 2: Replace per-byte computation calls with lookup tables
AL has no native bitwise operators, so you reach for Codeunit "Type Helper":
Result := TypeHelper.BitwiseXor(A, B);
That’s exactly right — for cold code. But in a loop that XORs every byte of a large buffer, you’re now paying a cross-codeunit dispatch per byte. In one profiling run, this single pattern accounted for 12 of 16 seconds — roughly 80 million calls.
The fix is to precompute the answer into a table once, then index it. Byte XOR has only 256×256 possible inputs, so a full table is 65,536 entries:
// One-time setup (seeded from Type Helper, paid once)
for i := 0 to 255 do
    for j := 0 to 255 do
        XorTable[i * 256 + j + 1] := TypeHelper.BitwiseXor(i, j);
// Hot path: a single array index, zero calls
Result := XorTable[A * 256 + B + 1];
A 64K-entry array of Integer costs 256 KB of memory and turns a procedure call into an array read. If memory is tight, a 16×16 nibble table (256 entries) does the same job with two lookups instead of one, which is still vastly cheaper than the call.
The same trick generalizes: any pure function over a small integer domain (transforms, gamma/scaling curves, character classification, x*2 in a finite field) can be a precomputed array. Build it once, guard it with an Initialized boolean, and the hot loop never calls anything.
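As a sketch of the nibble-table variant with the Initialized guard: the codeunit ID, its name, the procedure names, and the 1024-element buffer size below are all placeholders, and byte values are assumed to be stored as Integers in 0..255. XOR works nibble by nibble, so the high and low halves of each byte can be looked up separately and recombined:

```al
codeunit 50100 "Xor Nibble Sketch" // object ID and name are placeholders
{
    var
        XorNibble: array[256] of Integer; // 16 x 16 entries, 1-based
        Initialized: Boolean;

    local procedure EnsureTable()
    var
        TypeHelper: Codeunit "Type Helper";
        i: Integer;
        j: Integer;
    begin
        if Initialized then
            exit;
        // Seeded from Type Helper once: 256 calls in total.
        for i := 0 to 15 do
            for j := 0 to 15 do
                XorNibble[i * 16 + j + 1] := TypeHelper.BitwiseXor(i, j);
        Initialized := true;
    end;

    // XORs the first Len entries of Data (byte values 0..255) with KeyByte, in place.
    procedure XorInPlace(var Data: array[1024] of Integer; Len: Integer; KeyByte: Integer)
    var
        i: Integer;
        KeyHi: Integer;
        KeyLo: Integer;
    begin
        EnsureTable();
        // Loop-invariant row offsets for the key's two nibbles.
        KeyHi := (KeyByte div 16) * 16;
        KeyLo := (KeyByte mod 16) * 16;
        for i := 1 to Len do
            Data[i] := XorNibble[KeyHi + Data[i] div 16 + 1] * 16
                     + XorNibble[KeyLo + Data[i] mod 16 + 1];
    end;
}
```

The loop body contains only array reads and integer arithmetic; the single EnsureTable call sits outside it.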
Pattern 3: Inline the hottest procedures, hoist the invariants
Once the boundary calls are gone, the next layer is your own helper procedures. A clean design factors the inner step into a tidy local procedure:
local procedure EmitByte(B: Integer)
begin
...
end;
Lovely for readability, costly when called per byte in the innermost loop. For the one or two hottest loops — not everywhere; this is a targeted move — inline the body:
// Was: EmitByte(Value); now the body sits directly in the loop
QuadPos += 1;
Quad[QuadPos] := Value;
if QuadPos = 4 then begin
    OutStr.Write(PackQuad(Quad));
    QuadPos := 0;
end;
While you’re there, hoist anything loop-invariant out of the loop. Recomputing a bound, a length, or a base offset on every iteration is pure waste:
// Before: MaxLen recomputed every pass
while ... do begin
    MaxLen := Limit - Pos + 1;
    ...
end;
// After: computed once
MaxLen := Limit - Pos + 1;
while ... do begin ... end;
Keep the readable, factored version everywhere else. Inlining is a scalpel for proven hot spots, not a style.
Pattern 4: Cache decoded data instead of re-deriving it
A subtle one. Suppose a measurement routine needs some reference data that lives in a blob. The straightforward implementation fetches and decodes it on every call:
local procedure GetMetrics(Key: Text): Text
begin
    Rec.Get(Key);
    Rec.CalcFields(BlobField); // re-reads the blob
    // ... stream it out and rebuild a string every time ...
end;
If a higher-level routine calls this per item — say, measuring every word in a paragraph — you’re re-reading and re-parsing the same blob hundreds of times. Cache it:
// Dictionary cache of the raw data, plus a decoded array for the current key
if not Cache.ContainsKey(Key) then
    Cache.Add(Key, ComputeMetrics(Key));
Data := Cache.Get(Key);
Better still, decode it into a typed array once and index that, so the per-character path is an array read rather than a substring-plus-parse. Decoding “0278033305560556…” into integers with CopyStr + Evaluate per character allocates and parses on every lookup; doing it once into an array of Integer and indexing turns the inner loop into arithmetic.
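A minimal sketch of that one-time decode, assuming each entry is a fixed-width, 4-digit decimal field (which is what the sample string suggests, but is a guess about the real format). The procedure name, the array size of 256, and the error texts are hypothetical:

```al
// Hypothetical helper: decode fixed-width 4-digit fields into an Integer array, once.
local procedure DecodeWidths(Encoded: Text; var Widths: array[256] of Integer) EntryCount: Integer
var
    i: Integer;
begin
    EntryCount := StrLen(Encoded) div 4;
    if EntryCount > ArrayLen(Widths) then
        Error('Encoded data has %1 entries; the array holds only %2.', EntryCount, ArrayLen(Widths));
    for i := 1 to EntryCount do
        // CopyStr + Evaluate run here once per entry, not once per lookup.
        if not Evaluate(Widths[i], CopyStr(Encoded, (i - 1) * 4 + 1, 4)) then
            Error('Entry %1 is not a number.', i);
end;
```

After this runs once per key, the per-character path reduces to an index such as Widths[CharIndex], where CharIndex is whatever mapping from character to table position your data uses.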
Pattern 5: Fuse passes, don’t round-trip through buffers
When a transformation has stages, the tidy approach gives each stage its own input and output blob:
StageA(Source, BlobA);
StageB(BlobA, BlobB);
StageC(BlobB, Result);
Every intermediate blob is written byte by byte and then read back byte by byte — two boundary crossings per byte per handoff. If the stages can run in lockstep (each consumes what the previous produces, in order), fuse them into a single pass that keeps the working row or chunk in an array and never materializes the intermediates:
// One pass: read a chunk, transform it in place, feed it straight
// to the next stage's accumulator — no intermediate blob
for each chunk do begin
    Reconstruct(chunk); // in an array
    Split(chunk); // straight into the downstream writers
end;
In my case, fusing three stages that had round-tripped pixel-like data through intermediate blobs roughly halved the total stream traffic.
The gotcha that will bite you: AL does not short-circuit
This one caused a real crash, and it’s worth burning into memory because it’s invisible if you test by porting your logic to another language first.
AL evaluates both operands of and / or. Always. There is no short-circuit. So this innocent-looking loop:
// BUG: when L reaches MaxLen, AL still evaluates the right side,
// reading Data[Pos + MaxLen] — one past the end. Index out of bounds.
while (L < MaxLen) and (Data[Pos + L] = Data[Cand + L]) do
    L += 1;
…reads out of bounds exactly when the guard L < MaxLen is false, because AL evaluates the array access anyway. The fix is to move the indexed expression inside the body, where the guard actually protects it:
Done := false;
while (not Done) and (L < MaxLen) do
    if Data[Pos + L] = Data[Cand + L] then
        L += 1
    else
        Done := true;
Note why this is so easy to ship by accident: most languages a developer might prototype in — C#, JavaScript, PowerShell — do short-circuit, so a reference implementation runs clean and the AL port crashes only on specific inputs (here, when a run reached the exact end of a buffer). If you verify AL logic against a prototype, either guard the prototype the same way or have it assert that no index ever exceeds the bound.
A related safe case worth recognizing: when the index variable is itself clamped to the array’s range by the loop structure (while (i > 1) and (Table[i] > x) with i always in 1..N), the always-evaluated access stays in bounds. It’s only dangerous when the guard is the only thing keeping the index legal.
The takeaway
AL will never be C, but it’s far faster than its reputation when you respect the cost model:
- Minimize boundary crossings: bulk stream ops, 4-bytes-per-call I/O, TextBuilder accumulation.
- Turn per-byte computation into table lookups: precompute once, index in the loop.
- Inline and hoist the proven hot spots, and only those.
- Cache decoded data instead of re-reading and re-parsing it.
- Fuse pipeline stages so intermediate data never round-trips through a buffer.
Measure first — a profiler will tell you which of these matters for your code. In mine, one table replaced 80 million calls, the compression and batching shrank both the work and the output, and a sixteen-second routine became a one-second one. The platform was never the bottleneck. The patterns were.