Vision-Language Binding in In-Context Image Generation

MIT, Northeastern University

When editing a reference image, FLUX.2 routes the reference image's content through the text tokens rather than writing it directly to the output image.

Setup: what does FLUX.2 actually do internally?

FLUX.2 is a modern image generator, and it can be used in two ways. In text-to-image (T2I) mode, you give it just a text prompt, like "a photograph of a cat on a skateboard", and it generates an image from scratch. In image-to-image (I2I) mode, you give it a reference image, say a photo of an auditorium, together with an edit instruction like "add a podium to the stage", and it returns an edited version. Both behaviors come out of the exact same model, with the same weights and the same forward pass. So how does FLUX.2 actually handle them internally?

Before we can answer that, we need to know that FLUX.2 represents its inputs as sequences of little pieces called tokens. Your sentence is broken into short chunks of characters, and each chunk becomes one text token. Your reference image is sliced into a grid of small square patches, and each patch becomes one reference token. The output image, which starts as a sample of Gaussian noise, is sliced into a grid of small square patches the same way, and each patch becomes one image token. These three token sequences are concatenated into a single multimodal attention stream, where every token attends to every other token at every layer.
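As a rough illustration of this layout, the sketch below computes which positions in the concatenated sequence belong to each token type. The 512-token text length is the figure quoted later on this page. The patch-grid sizes, the text/reference/image ordering, and the `TokenLayout` name are assumptions made for illustration, not FLUX.2's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLayout:
    """Index ranges of each token type in one concatenated sequence."""
    n_text: int  # text tokens (content + padding)
    n_ref: int   # reference-image patches
    n_img: int   # output-image patches (these start as noise)

    @property
    def text(self) -> range:
        return range(0, self.n_text)

    @property
    def ref(self) -> range:
        return range(self.n_text, self.n_text + self.n_ref)

    @property
    def img(self) -> range:
        start = self.n_text + self.n_ref
        return range(start, start + self.n_img)

    @property
    def total(self) -> int:
        return self.n_text + self.n_ref + self.n_img


# Hypothetical sizes: 512 text tokens and two 32x32 patch grids.
layout = TokenLayout(n_text=512, n_ref=32 * 32, n_img=32 * 32)
print(layout.total)      # length of the joint attention sequence
print(layout.ref.start)  # first reference-token position
```

Keeping the index ranges in one place makes it straightforward to select "all text tokens" or "all reference tokens" when patching activations or masking attention.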

With three very different kinds of tokens dumped into the same attention pool, it's not obvious which kind of token ends up carrying which information. When a reference image is provided, where does its content actually live during the forward pass: does it go directly to the image tokens, or does some of it transfer to the text tokens along the way?

We answer these questions with three experiments on the FLUX.2 attention stream. Together, they reveal that the text tokens and the reference tokens end up carrying different kinds of reference content, a behavior the model learns on its own without any part of the training objective enforcing it.

T2I Lens — what information do the text tokens carry?

We start with the text tokens, since text is the one input shared by the T2I and I2I modes. What information do the text tokens carry inside the model? To find out, we decode them back into pixel space using an otherwise unconditional T2I forward pass.

Concretely, we run a normal I2I edit and save the text-token activations at one of the intermediate blocks of the model. We then start a separate T2I generation with no reference image and an empty prompt, and copy the saved activations onto its raw text embeddings (the input to layer 0). Because this second pass has no reference and no instruction of its own, anything in its output that resembles the original reference must have ridden in on the patched activations.
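The procedure can be sketched on a toy model. Everything below is a stand-in: `mix_layer` is a crude averaging step playing the role of an attention block, the two-dimensional vectors are placeholders for real activations, and the saved block index is arbitrary. It is meant to show the mechanics of the lens (save text activations mid-pass, then feed them as the layer-0 text input of a reference-free pass), not FLUX.2's actual computation.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Tokens = List[Vector]


def mix_layer(tokens: Tokens) -> Tokens:
    """Toy stand-in for one attention block: every token moves halfway
    toward the mean of all tokens in the sequence."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def run(text: Tokens, ref: Tokens, img: Tokens, n_layers: int,
        save_text_at: Optional[int] = None) -> Tuple[Tokens, Optional[Tokens]]:
    """Run the toy model. Returns the final image tokens and, if requested,
    the text-token activations at the input of block `save_text_at`."""
    n_text = len(text)
    seq = text + ref + img
    saved = None
    for layer in range(n_layers):
        if layer == save_text_at:
            saved = [list(v) for v in seq[:n_text]]
        seq = mix_layer(seq)
    return seq[n_text + len(ref):], saved


# 1) Conditioned I2I pass: save text activations at an intermediate block.
instruction = [[0.0, 0.0]] * 4
reference = [[1.0, 0.0]] * 4   # the reference carries its signal in dim 0
noise = [[0.0, 0.0]] * 4
_, saved_text = run(instruction, reference, noise, n_layers=4, save_text_at=2)

# 2) Reference-free, empty-prompt T2I passes: one untouched baseline, and
#    one whose layer-0 text input is overwritten with the saved activations.
empty_prompt = [[0.0, 0.0]] * 4
baseline, _ = run(empty_prompt, [], noise, n_layers=4)
lens, _ = run(saved_text, [], noise, n_layers=4)

print(baseline[0], lens[0])
```

In this toy, the baseline output contains no reference signal, while the lens output does, and the only route by which it could have arrived is the patched text activations.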

Figure (method diagram): T2I Lens. Text-token activations from an I2I edit are patched into a reference-free, empty-prompt T2I pass.

Reference: an empty auditorium with rows of seats facing a stage.

T2I baseline ("Add a podium", no patching, same noise seed): a podium against a blank background, no auditorium.

T2I Lens (text-token activations patched in): the auditorium is recovered, with a podium on stage.

T2I Lens applied to "Add a podium" with an auditorium reference image. A plain T2I generation with the same prompt and no reference produces a podium on a blank background (middle). The T2I Lens output (right) recovers a full auditorium with the podium on stage. Since the only difference between the two passes is the patched text-token activations, the auditorium must have ridden in on those activations: reference information lives in the text tokens during the conditioned forward pass.

It is not obvious this should work at all. By the middle of the forward pass, the text tokens have been mixing with the reference tokens, and there is no reason they should still look like text to the rest of the model. The fact that the lens recovers something coherent says they do: the reference content they have picked up is held in a text-compatible form.

Where in the text tokens? Padding vs. content

FLUX.2's text encoder produces a fixed-length sequence of 512 tokens. The content tokens hold the words of the actual instruction. Whenever the instruction is shorter than 512 tokens, the rest of the sequence is filled with padding tokens (special placeholder tokens that carry no instruction text). Short edit instructions like "put it in an auditorium" leave the vast majority of the sequence as padding. To localize which part of the sequence holds the reference content surfaced by T2I Lens, we re-run the same intervention but patch only one of the two subsets at a time.
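A minimal sketch of subset patching follows. The helper `patch_positions`, the content-token count of 7, and the assumption that content tokens come first with padding after them are all hypothetical. The one-dimensional placeholder vectors only mark which positions were overwritten.

```python
from typing import List, Sequence

Vector = List[float]


def patch_positions(base: Sequence[Vector], saved: Sequence[Vector],
                    positions: Sequence[int]) -> List[Vector]:
    """Return a copy of `base` with `saved` activations written in at
    `positions` only; every other position keeps its `base` value."""
    if len(base) != len(saved):
        raise ValueError("base and saved must have the same length")
    chosen = set(positions)
    return [list(saved[i]) if i in chosen else list(base[i])
            for i in range(len(base))]


MAX_TEXT_LEN = 512  # fixed text sequence length quoted above
n_content = 7       # hypothetical token count for a short instruction

# Assumption: content tokens occupy the front of the sequence.
content_idx = range(0, n_content)
padding_idx = range(n_content, MAX_TEXT_LEN)

# Placeholders: `empty` stands for the empty-prompt text embeddings,
# `saved` for the text activations captured from the I2I edit.
empty = [[0.0] for _ in range(MAX_TEXT_LEN)]
saved = [[1.0] for _ in range(MAX_TEXT_LEN)]

padding_only = patch_positions(empty, saved, padding_idx)
content_only = patch_positions(empty, saved, content_idx)

print(sum(v[0] for v in padding_only), sum(v[0] for v in content_only))
```

With a short instruction, the padding-only patch overwrites almost the entire sequence, which is one way to see why the maximum text length bounds how much the text channel can hold.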

Padding tokens only: with only padding-token activations patched in, the auditorium is recovered.

Content tokens only: with only content-token activations patched in, the output is a podium-like object against a blank background, no auditorium.

Patching only the padding tokens still recovers an auditorium; patching only the content tokens does not. The reference-text binding lives in the padding tokens, which makes the maximum text length an implicit capacity hyperparameter for how much reference information the text channel can hold.

Attention Knockout — testing the important pathways in the model

T2I Lens shows that the text tokens encode reference content during an I2I edit, but it does not tell us whether the model actually uses them as a routing channel into the generated image.

In every FLUX.2 transformer block, an attention module lets every token attend to every other token in the sequence and pull information from it. That means, for example, image tokens can attend to text tokens, text tokens can attend to reference tokens, and image tokens can attend directly to reference tokens too. The diagram below shows these edges; our experiment selectively knocks out attention pathways at every layer to ask which pathway the model actually relies on.

Concretely, KOref→text stops the text tokens from attending to the reference tokens, so reference information can no longer reach the output image by way of the text tokens. KOref→image stops the image tokens from attending to the reference tokens, so reference information can no longer flow directly from the reference to the output image. If a property of the output disappears under one knockout but survives the other, it was reaching the output through the pathway that was cut.
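One way to picture a knockout is as a boolean attention mask applied at every layer. The sketch below assumes the arrow names the direction of information flow, so "ref->text" removes the edges on which text-token queries read from reference-token keys. The function names and the text/reference/image ordering are illustrative assumptions.

```python
import math
from typing import List, Optional


def build_mask(n_text: int, n_ref: int, n_img: int,
               knockout: Optional[str] = None) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    readers = {None: None, "ref->text": "text", "ref->image": "image"}
    if knockout not in readers:
        raise ValueError(f"unknown knockout: {knockout!r}")
    blocked_reader = readers[knockout]
    groups = ["text"] * n_text + ["ref"] * n_ref + ["image"] * n_img
    # Block only the edges where the chosen group reads from reference keys.
    return [[not (q == blocked_reader and k == "ref") for k in groups]
            for q in groups]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over one query's scores; blocked keys get exactly zero weight.
    Assumes at least one key is allowed (each token can still see itself)."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0
            for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Two tokens per group; knock out the ref -> text pathway.
mask = build_mask(2, 2, 2, knockout="ref->text")
weights = masked_softmax([0.0] * 6, mask[0])  # attention row of a text token
print(weights)
```

Under this mask the text token places no weight on the two reference positions, while image tokens can still read the reference directly. Swapping in "ref->image" gives the complementary experiment.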

Figure (method diagram): Attention Knockout. The ref-to-text or ref-to-image attention edges are blocked across all layers of an I2I edit.

Reference: a woman runner drawn in a distinctive illustrated style.

Normal I2I edit: the unmodified edit output.

Block ref → text: output with ref-to-text attention knocked out.

Block ref → image: output with ref-to-image attention knocked out.

Attention Knockout applied to "A photograph of the woman in this image sprinting down a red running track" with an illustrated reference. The normal I2I edit preserves the cartoon style of the reference. Blocking ref→text changes the style of the output to realistic instead of cartoonish; blocking ref→image leaves it almost untouched. The style is being routed through the text tokens, not read directly off the reference.

Do the style and color survive even if we drop the reference?

Our hypothesis is that once the reference has written its style and color into the text tokens, the reference image itself plays no further role in carrying them. We call the point in the forward pass where that writing finishes the vision-language binding.

The KOref→image knockout above only blocked the reference from reaching the generated image directly. But if the reference really stops contributing to style and color after the vision-language binding, we can be far more aggressive. Up to the binding, we keep KOref→image in place, so the text tokens stay the reference's only outlet. Past the binding, we drop the reference tokens out of the computation entirely, cutting every attention edge into and out of them. If the style still survives, then after the vision-language binding the text tokens must have carried it to the output on their own.
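This two-phase schedule can be sketched as a per-layer attention mask. The `binding_layer` index is a hypothetical parameter (this page does not say at which layer the binding occurs), and `mask_for_layer` is an illustrative helper rather than the authors' implementation.

```python
from typing import List


def mask_for_layer(layer: int, binding_layer: int,
                   groups: List[str]) -> List[List[bool]]:
    """Attention mask for one block; mask[q][k] is True when query q may
    attend to key k. `groups` labels each token "text", "ref", or "image"."""
    n = len(groups)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_ref = groups[q] == "ref"
            k_is_ref = groups[k] == "ref"
            if layer < binding_layer:
                # Before the binding: only image tokens are stopped from
                # reading the reference, so text is its only outlet.
                if groups[q] == "image" and k_is_ref:
                    mask[q][k] = False
            elif q_is_ref != k_is_ref:
                # After the binding: cut every edge between reference and
                # non-reference tokens. Ref-to-ref edges are left in place,
                # which is harmless because nothing reads from them.
                mask[q][k] = False
    return mask


groups = ["text", "ref", "image"]
early = mask_for_layer(layer=0, binding_layer=3, groups=groups)
late = mask_for_layer(layer=3, binding_layer=3, groups=groups)
print(early[0][1], early[2][1])  # text may read ref, image may not
print(late[0][1], late[2][1])    # nothing outside ref may read ref
```

Setting `binding_layer=0` in this sketch isolates the reference from the very first layer, so the reference never influences the other tokens at all.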

Removing Reference Tokens Pre-Binding: a plain T2I generation of the woman on a running track; the illustrated reference style is absent.

Removing Reference Tokens Post-Binding: output with the reference disconnected after the vision-language binding; the illustrated style still appears.

Dropping the reference image after the vision-language binding. Removing Reference Tokens Pre-Binding drops the reference from the very first layer. The model never binds it at all, so this is a plain T2I generation, and the illustrated style is absent. Removing Reference Tokens Post-Binding is the knockout described above: KOref→image up to the binding, then the reference tokens dropped entirely. The style still comes through, routed through the text tokens.

I2I → I2I Patching — do the text tokens causally transfer content?

T2I Lens and Attention Knockout together tell us that the text tokens absorb reference content and that the model uses them as a routing channel. What remains is to check that the text-token activations themselves causally carry reference-specific properties — that the property travels with the activations from one run to another.

We run two I2I edits side by side, a source and a target, that share the same instruction but use different reference images. We then copy the text-token activations from the source run into the target run at the same intermediate layer. If a property of the source reference appears in the target output, the text tokens were causally carrying that property between the two runs.
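The mechanics of this swap can be sketched with per-layer hooks on a toy model. As before, `mix_layer` is a crude averaging stand-in for an attention block, the two vector dimensions are labeled "green" and "yellow" purely for illustration, and the hook-based `run` function and layer index are assumptions rather than the authors' code.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Tokens = List[Vector]
Hook = Callable[[int, Tokens], Tokens]


def mix_layer(tokens: Tokens) -> Tokens:
    """Toy stand-in for one attention block: every token moves halfway
    toward the mean of all tokens in the sequence."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def run(text: Tokens, ref: Tokens, img: Tokens, n_layers: int,
        hook: Optional[Hook] = None) -> Tokens:
    """Run the toy model, calling `hook` on the input of every block."""
    seq = text + ref + img
    for layer in range(n_layers):
        if hook is not None:
            seq = hook(layer, seq)
        seq = mix_layer(seq)
    return seq[len(text) + len(ref):]  # final image tokens


N_TEXT, LAYER = 2, 1  # hypothetical text length and patch layer
captured: Dict[str, Tokens] = {}


def capture(layer: int, seq: Tokens) -> Tokens:
    """Source run: save the text-token activations at LAYER."""
    if layer == LAYER:
        captured["text"] = [list(v) for v in seq[:N_TEXT]]
    return seq


def patch(layer: int, seq: Tokens) -> Tokens:
    """Target run: overwrite the text tokens at LAYER with the saved ones."""
    if layer == LAYER:
        return [list(v) for v in captured["text"]] + seq[N_TEXT:]
    return seq


prompt = [[0.0, 0.0]] * N_TEXT
noise = [[0.0, 0.0]] * 2
green = [[1.0, 0.0]] * 2   # dim 0 stands for "green"
yellow = [[0.0, 1.0]] * 2  # dim 1 stands for "yellow"

run(prompt, green, noise, n_layers=3, hook=capture)  # source run
unpatched = run(prompt, yellow, noise, n_layers=3)
patched = run(prompt, yellow, noise, n_layers=3, hook=patch)
print(unpatched[0], patched[0])
```

In this toy, the unpatched target output has no "green" component at all, while the patched output does. The toy only demonstrates that the patched activations carry source information into the target run; it does not reproduce the full color replacement seen in the real model.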

Figure (method diagram): I2I → I2I Patching. Text-token activations from a source I2I edit are patched into a target I2I edit that shares the instruction but uses a different reference image.

Source reference: solid green.

Target reference: solid yellow.

Target output, no patching.

Target output, source text tokens patched in.

I2I → I2I Patching on the shared prompt "draw a car in this color". The source run uses a solid green reference; the target run uses a solid yellow reference. Without patching, the target output takes the yellow color from its own reference. With the source run's text-token activations patched in at the same intermediate layer, the target output instead takes on the green of the source, even though the model is still looking at the solid yellow image as its own reference. The color is being causally carried between runs by the text-token activations alone.

For the full set of experiments, additional reference images and instructions, and quantitative results across all three methods above, see our paper.

BibTeX

@misc{ge2026visionlanguagebindingincontextimage,
      title={Vision-Language Binding in In-Context Image Generation},
      author={Chris Ge and Rohit Gandikota and Antonio Torralba and Tamar Rott Shaham},
      year={2026},
      eprint={2605.24624},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.24624},
}