FLUX.2 is a modern image generator, and it can be used in two ways. In text-to-image (T2I) mode, you give it just a text prompt, like "a photograph of a cat on a skateboard", and it generates an image from scratch. In image-to-image (I2I) mode, you give it a reference image, say a photo of an auditorium, together with an edit instruction like "add a podium to the stage", and it returns an edited version. Both behaviors come out of the exact same model, with the same weights and the same forward pass. So how does FLUX.2 actually handle them internally?
Before we can answer that, we need to know that FLUX.2 represents its inputs as sequences of little pieces called tokens. Your sentence is broken into short chunks of characters, and each chunk becomes one text token. Your reference image is sliced into a grid of small square patches, and each patch becomes one reference token. The output image, which starts as a sample of Gaussian noise, is sliced into a grid of small square patches the same way, and each patch becomes one image token. These three token sequences are concatenated into a single multimodal attention stream, where every token attends to every other token at every layer.
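To make the bookkeeping concrete, here is a minimal sketch of how the three token sequences could be laid out in one stream. The ordering (text, then reference, then image), the helper names `patch_count` and `build_layout`, and all sizes are assumptions for illustration, not details taken from FLUX.2.

```python
def patch_count(height, width, patch):
    """Number of square patches that tile a height x width grid."""
    if height % patch or width % patch:
        raise ValueError("grid must be divisible by the patch size")
    return (height // patch) * (width // patch)


def build_layout(n_text, n_ref, n_image):
    """Map each token kind to its positions in the concatenated stream.

    Assumed order: text, then reference, then image tokens.
    """
    layout = {}
    start = 0
    for name, count in (("text", n_text), ("ref", n_ref), ("image", n_image)):
        layout[name] = range(start, start + count)
        start += count
    return layout


# Hypothetical sizes: 8 text tokens, 4x4 grids cut into 2x2 patches.
i2i = build_layout(8, patch_count(4, 4, 2), patch_count(4, 4, 2))
t2i = build_layout(8, 0, patch_count(4, 4, 2))  # T2I: no reference tokens

total = sum(len(positions) for positions in i2i.values())
attention_pairs = total * total  # every token attends to every token
```

In T2I mode the reference range is simply empty, which is one way to picture how both modes can share a single forward pass.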
With three very different kinds of tokens dumped into the same attention pool, it's not obvious which kind of token ends up carrying which information. When a reference image is provided, where does its content actually live during the forward pass: does it go directly to the image tokens, or does some of it transfer to the text tokens along the way?
We answer these questions with three experiments on the FLUX.2 attention stream. Together, they reveal that the text tokens and the reference tokens end up carrying different kinds of reference content, a behavior the model learns on its own without any part of the training objective enforcing it.
We start our investigation with the text tokens, since text is the one input shared by both the T2I and I2I modes. What information do the text tokens carry inside the model? To find out, we decode the text tokens back into pixel space using an unconditional T2I forward pass.
Concretely, we run a normal I2I edit and save the text-token activations at one of the intermediate blocks of the model. We then start a separate T2I generation with no reference image and an empty prompt, and copy the saved activations onto its raw text embeddings (the input to layer 0). Because this second pass has no reference and no instruction of its own, anything in its output that resembles the original reference must have ridden in on the patched activations.
[Figure panels: Reference | T2I baseline ("Add a podium", no patching) | T2I Lens (text-token activations patched in)]
It is not obvious this should work at all. By the middle of the forward pass, the text tokens have been mixing with the reference tokens, and there is no reason they should still look like text to the rest of the model. The fact that the lens recovers something coherent says they do: the reference content they have picked up is held in a text-compatible form.
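The capture-and-patch procedure described above can be sketched on a toy model. Everything here is a stand-in: `toy_block` replaces a real transformer block with simple integer mixing, zeros stand in for both the initial noise and the empty-prompt embeddings, and the function names are hypothetical. Only the control flow (save the text tokens at an intermediate block, then write them onto the layer-0 text input of a reference-free run) follows the description.

```python
def toy_block(text, ref, image):
    """Stand-in for one block: every token adds the sum of all tokens."""
    total = sum(text) + sum(ref) + sum(image)
    return ([t + total for t in text],
            [r + total for r in ref],
            [x + total for x in image])


def forward(text, ref, image, n_blocks, capture_at=None):
    """Run the toy model; optionally save the text tokens entering one block."""
    captured = None
    for i in range(n_blocks):
        if i == capture_at:
            captured = list(text)
        text, ref, image = toy_block(text, ref, image)
    return image, captured


def t2i_lens(instruction, reference, n_image, n_blocks, layer):
    """Capture text activations in an I2I run, decode them without a reference."""
    noise = [0] * n_image  # toy stand-in for the initial noise
    _, saved = forward(instruction, reference, noise, n_blocks, capture_at=layer)

    empty_prompt = [0] * len(instruction)
    baseline, _ = forward(empty_prompt, [], noise, n_blocks)

    # Patch: the saved activations replace the raw text input at layer 0.
    lens, _ = forward(saved, [], noise, n_blocks)
    return baseline, lens


baseline, lens = t2i_lens([1, 2], [5, 5], n_image=2, n_blocks=4, layer=2)
```

In this toy, the reference-free run has no information of its own, so any difference between `lens` and `baseline` can only come from the patched activations, which mirrors the argument in the text.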
FLUX.2's text encoder produces a fixed-length sequence of 512 tokens. The content tokens hold the words of the actual instruction; whenever the instruction is shorter than 512 tokens, the rest of the sequence is filled up with padding tokens (a special placeholder token that carries no instruction text). Short edit instructions like "put it in an auditorium" leave the vast majority of the sequence as padding. To localize which part of the sequence is holding the reference content surfaced by T2I Lens, we re-run the same intervention but patch only one of the two subsets at a time.
[Figure panels: Padding tokens only | Content tokens only]
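Patching one subset at a time comes down to choosing which positions of the 512-token sequence get overwritten. The sketch below assumes content tokens come first and padding fills the tail; that layout, and the helper names `split_indices` and `patch_subset`, are illustrative assumptions rather than confirmed details of the text encoder.

```python
SEQ_LEN = 512  # fixed text sequence length stated in the text


def split_indices(n_content, seq_len=SEQ_LEN):
    """Return (content positions, padding positions).

    Assumes content tokens come first and padding fills the rest.
    """
    if not 0 <= n_content <= seq_len:
        raise ValueError("n_content must lie between 0 and seq_len")
    return list(range(n_content)), list(range(n_content, seq_len))


def patch_subset(base, saved, indices):
    """Copy saved[i] over base[i] at the chosen positions only."""
    chosen = set(indices)
    return [saved[i] if i in chosen else base[i] for i in range(len(base))]


# Hypothetical example: an instruction that occupies 6 content tokens.
content, padding = split_indices(6)
base = ["clean"] * SEQ_LEN    # text input of the reference-free run
saved = ["saved"] * SEQ_LEN   # activations captured from the I2I run

padding_only = patch_subset(base, saved, padding)
content_only = patch_subset(base, saved, content)
```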
T2I Lens shows that the text tokens encode reference content during an I2I edit, but it does not tell us whether the model actually uses them as a routing channel into the generated image.
In every FLUX.2 transformer block, an attention module lets every token attend to every other token in the sequence and pull information from it. That means, for example, image tokens can attend to text tokens, text tokens can attend to reference tokens, and image tokens can attend directly to reference tokens too. The diagram below shows these edges; our experiment selectively knocks out attention pathways at every layer to ask which pathway the model actually relies on.
Concretely, KOref→text blocks the reference information from flowing through text tokens to the output image, and KOref→image blocks reference information from flowing directly from the reference tokens to the output image. If a property of the output disappears under one knockout but survives the other, it was reaching the output through the pathway that was cut.
[Figure panels: Reference | Normal I2I edit | Block ref → text | Block ref → image]
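One common way to implement such a knockout is a boolean attention mask. In attention, information flows from key tokens to query tokens, so cutting ref → text means text queries may not read from reference keys. The following is a minimal sketch under that convention; the function names, the six-token layout, and the uniform scores are made up for illustration and are not the authors' code.

```python
import math


def knockout_mask(kinds, src, dst):
    """allowed[q][k] is False exactly when a `dst` query reads a `src` key."""
    return [[not (q_kind == dst and k_kind == src) for k_kind in kinds]
            for q_kind in kinds]


def masked_softmax(scores, allowed):
    """Softmax over one query's scores; blocked keys get exactly zero weight."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("query has no key left to attend to")
    peak = max(kept)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) if ok else 0.0
               for s, ok in zip(scores, allowed)]
    norm = sum(weights)
    return [w / norm for w in weights]


kinds = ["text", "text", "ref", "ref", "image", "image"]
ko_ref_to_text = knockout_mask(kinds, src="ref", dst="text")
ko_ref_to_image = knockout_mask(kinds, src="ref", dst="image")

# Attention weights of the first image token under KO ref -> image,
# using uniform toy scores.
image_row = masked_softmax([0.0] * len(kinds), ko_ref_to_image[4])
```

Under KO ref → image the image token still reads from text tokens, so any reference content that reaches it must have passed through the text tokens first.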
Our hypothesis is that once the reference has written its style and color into the text tokens, the reference image itself plays no further role in carrying them. We call the point in the forward pass at which this writing is complete the vision-language binding.
The KOref→image knockout above only blocked the reference from reaching the generated image directly. But if the reference really stops contributing to the style and color after the vision-language binding, we can be far more aggressive: we keep KOref→image in place up to the binding, so the text tokens stay the reference's only outlet, and past the binding we drop the reference tokens out of the computation entirely, cutting every attention edge into and out of them. If the style still survives, then after the vision-language binding, the text tokens must have carried it to the output on their own.
[Figure panels: Removing Reference Tokens Pre-Binding | Removing Reference Tokens Post-Binding]
KOref→image up to the binding, then the reference tokens dropped entirely. The style still comes through, routed through the text tokens.
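The two-phase intervention can be written as a per-layer plan. This sketch assumes that layers with an index below a hypothetical `binding_layer` count as "up to the binding", and it models dropping the reference tokens by removing them from the sequence, which cuts every edge into and out of them. The function name and the layer numbers are illustrative only.

```python
def plan_for_layer(layer, binding_layer, kinds):
    """Return (kept token positions, allowed[q][k] over the kept tokens).

    Before the binding: all tokens stay, but image queries may not read
    reference keys (KO ref -> image).
    From the binding on: reference tokens are removed from the sequence.
    """
    pre_binding = layer < binding_layer
    keep = [i for i, kind in enumerate(kinds) if pre_binding or kind != "ref"]
    allowed = [
        [not (pre_binding and kinds[q] == "image" and kinds[k] == "ref")
         for k in keep]
        for q in keep
    ]
    return keep, allowed


kinds = ["text", "text", "ref", "ref", "image", "image"]
BINDING_LAYER = 5  # hypothetical value, chosen only for this example

early_keep, early_allowed = plan_for_layer(3, BINDING_LAYER, kinds)
late_keep, late_allowed = plan_for_layer(5, BINDING_LAYER, kinds)
```

In the early phase the text tokens can still read the reference, so they remain its only outlet; in the late phase the reference no longer exists in the computation at all.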
T2I Lens and Attention Knockout together tell us that the text tokens absorb reference content and that the model uses them as a routing channel. What remains is to check that the text-token activations themselves causally carry reference-specific properties, that is, that the property travels with the activations from one run to another.
We run two I2I edits side by side, a source and a target, that share the same instruction but use different reference images. We then copy the text-token activations from the source run into the target run at the same intermediate layer. If a property of the source reference appears in the target output, the text tokens were causally carrying that property between the two runs.
[Figure panels: Source reference (solid green) | Target reference (solid yellow) | Target output, no patching | Target output, source text tokens patched in]
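The source-to-target patch can be sketched on a toy model in which text tokens read from the reference and image tokens read from the text. The block, the integer "activations", and the function names are all hypothetical stand-ins; only the procedure (capture text tokens in the source run, overwrite them at the same layer in the target run) follows the description above.

```python
def toy_block(text, ref, image):
    """Stand-in block: text reads from the reference, image reads from text."""
    ref_sum, text_sum = sum(ref), sum(text)
    return ([t + ref_sum for t in text],
            list(ref),
            [x + text_sum for x in image])


def forward(text, ref, image, n_blocks,
            capture_at=None, patch_at=None, patch_text=None):
    """Run the toy model, optionally overwriting or saving the text tokens."""
    captured = None
    for i in range(n_blocks):
        if i == patch_at:
            text = list(patch_text)   # overwrite with the donor activations
        if i == capture_at:
            captured = list(text)
        text, ref, image = toy_block(text, ref, image)
    return image, captured


instruction, noise = [1, 1], [0, 0]   # same instruction in both runs
source_ref, target_ref = [3], [7]     # toy stand-ins for green and yellow
LAYER, N_BLOCKS = 2, 4                # hypothetical layer and depth

_, source_text = forward(instruction, source_ref, noise, N_BLOCKS,
                         capture_at=LAYER)
plain, target_text = forward(instruction, target_ref, noise, N_BLOCKS,
                             capture_at=LAYER)
patched, _ = forward(instruction, target_ref, noise, N_BLOCKS,
                     patch_at=LAYER, patch_text=source_text)
# Sanity check: patching the target with its own activations is a no-op.
self_patched, _ = forward(instruction, target_ref, noise, N_BLOCKS,
                          patch_at=LAYER, patch_text=target_text)
```

The self-patch control is a useful habit in this kind of experiment: if it changes the output, the patching code itself is at fault rather than the activations.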
For the full set of experiments, additional reference images and instructions, and quantitative results across all three methods above, see our paper.
@misc{ge2026visionlanguagebindingincontextimage,
title={Vision-Language Binding in In-Context Image Generation},
author={Chris Ge and Rohit Gandikota and Antonio Torralba and Tamar Rott Shaham},
year={2026},
eprint={2605.24624},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.24624},
}