Predicting Masked Assembly with Diffusion Models

How I built a continuous diffusion system that predicts masked spans of x86-64 assembly instructions — and what it teaches us about binary understanding.

HunterADyer/binary-patch-diffusion

The Problem

When reverse engineering stripped binaries, analysts frequently encounter corrupted or obfuscated code regions. What if we could train a model to predict what should be there?

This project explores using continuous diffusion models — the same family behind image generators like Stable Diffusion — to predict masked spans of x86-64 assembly instructions.

Why Diffusion?

The initial approach used a BERT-style masked language model, which reached ~85% accuracy when predicting individual masked tokens. But assembly isn’t natural language — instruction sequences have rich structural dependencies that benefit from iterative refinement.

Diffusion models offer:

  • Iterative denoising: The model refines predictions over multiple steps
  • Continuous representations: Embed discrete tokens into continuous space for smoother gradients
  • Conditional generation: Condition on surrounding context naturally
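The continuous-representation point is worth making concrete: discrete opcodes are first embedded into a continuous space, and the forward diffusion process interpolates those embeddings toward Gaussian noise. A minimal sketch, assuming a precomputed cumulative-alpha schedule (the function name, schedule, and shapes here are illustrative, not the project's actual code):

```python
import torch

def add_noise(x0, t, alphas_cumprod):
    """Forward diffusion: blend clean token embeddings with Gaussian noise.

    x0: (batch, seq, dim) continuous embeddings of the token sequence
    t:  (batch,) integer timesteps
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1)   # per-sample signal level
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

# toy example: 2 sequences of 5 token embeddings in an 8-dim continuous space
x0 = torch.randn(2, 5, 8)
schedule = torch.linspace(0.999, 0.001, 1000)  # placeholder alpha-bar values
x_t, noise = add_noise(x0, torch.tensor([10, 900]), schedule)
```

At small `t` the embedding is nearly clean; at large `t` it is nearly pure noise, which is what lets the model learn coarse-to-fine denoising.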

Architecture

The model uses a Diffusion Transformer (DiT) with adaptive layer normalization (adaLN-Zero):

class DiTBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning."""

    def __init__(self, d_model, n_heads, ff_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, d_model),
        )
        # adaLN-Zero: learn per-sub-layer scale/shift/gate from the timestep
        # embedding; zero-init so each block starts as the identity mapping
        self.adaln = nn.Linear(d_model, 6 * d_model)
        nn.init.zeros_(self.adaln.weight)
        nn.init.zeros_(self.adaln.bias)

The key insight is that timestep conditioning via adaLN-Zero lets the model modulate its behavior depending on the noise level, producing coarse predictions early and fine details late.
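To make the mechanism concrete, here is a sketch of how the `6 * d_model` adaLN output is used in the forward pass: it is chunked into shift/scale/gate triples for the attention and feed-forward sub-layers, and zero-initialization means every gate starts at zero, so each block initially passes its input through unchanged. Names like `t_emb` are assumed for illustration; this is not the project's exact code.

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 10, d_model)    # (batch, seq, d_model) noisy embeddings
t_emb = torch.randn(2, d_model)    # timestep embedding (assumed name)

adaln = nn.Linear(d_model, 6 * d_model)
nn.init.zeros_(adaln.weight)
nn.init.zeros_(adaln.bias)

# six chunks: shift/scale/gate for attention, then for the feed-forward
shift1, scale1, gate1, shift2, scale2, gate2 = adaln(t_emb).chunk(6, dim=-1)

norm = nn.LayerNorm(d_model, elementwise_affine=False)
h = norm(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
# attention would run on h; its output is added back through the gate:
#   x = x + gate1.unsqueeze(1) * attn_out
# with zero-init, gate1 == 0, so the residual branch contributes nothing yet
```

Because `shift1` and `scale1` are also zero at initialization, `h` is just the layer-normalized input, which is what makes the "Zero" in adaLN-Zero a stable starting point for training.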

Dataset

Training data comes from 5,675 system ELF binaries on a standard Linux installation:

  • Total examples: 518,000
  • Vocabulary size: 386 tokens
  • Sequence length: 512
  • Mask ratio: 15-30%
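The post doesn't spell out the span-masking procedure, but a plausible sketch that hits the 15-30% mask ratio might look like the following (the helper name, mask-token id, and maximum span length are all assumptions):

```python
import random

MASK_ID = 3  # hypothetical [MASK] token id

def mask_spans(tokens, mask_id=MASK_ID, ratio=(0.15, 0.30), max_span=8):
    """Mask contiguous spans until a sampled fraction of tokens is hidden."""
    out = list(tokens)
    target = int(len(tokens) * random.uniform(*ratio))
    masked = 0
    while masked < target:
        span = random.randint(1, max_span)
        start = random.randint(0, len(out) - span)
        for i in range(start, start + span):
            if out[i] != mask_id:
                out[i] = mask_id
                masked += 1
    return out
```

Masking contiguous spans rather than isolated tokens forces the model to reconstruct whole instructions, not just fill in single operands.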

Training

The model has 134.9M parameters and trains on a single RTX PRO 6000 (96GB VRAM):

  • Batch size: 128, gradient accumulation: 2
  • Cosine noise schedule, 1000 timesteps
  • Mixed precision (AMP) training
  • ~42 minutes per epoch, 30 epochs total
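For reference, a cosine noise schedule over 1000 timesteps can be sketched as follows, along the lines of Nichol & Dhariwal's formulation (the function name and clamp value are illustrative):

```python
import math
import torch

def cosine_alphas_cumprod(T=1000, s=0.008):
    """Cosine schedule: cumulative signal level alpha-bar_t for t = 1..T."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    # normalize so alpha-bar starts near 1, and clamp away exact zeros
    return (f / f[0]).clamp(1e-5, 1.0)[1:]
```

Compared with a linear schedule, the cosine curve destroys signal more gradually at both ends, which tends to spread the learning signal more evenly across timesteps.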

What’s Next

In the next post, I’ll cover the sampling process and show how the model’s predictions compare to ground truth across different instruction types — arithmetic, control flow, and memory operations.

Stay tuned.