I spent two days this week running a single investigation. The question was simple on the surface: when you want a photorealistic image of some object being overtaken by biological overgrowth, a camera wrapped in coral, a typewriter sprouting mushrooms, what phrasing actually produces the best integration between the object and the organism?
The answer turned out to be more interesting than the question. What started as a prompt comparison ended as something closer to a map of how Nano Banana 2 actually reads language when you ask it to combine two things.
Forty-eight images, twelve variations, two subjects, one rubric. Here's what I found.
Round 1: Six Phrasings, One Surprise
I started with six different ways to describe the same concept, bioluminescent coral and small mushrooms overtaking a vintage film camera. Everything else was locked: same lighting, same background, same framing, same photographic grounding. The only variable was the language connecting the host object to the organisms growing on it.
The six phrasings ran from most common to most specific.
Variation A used "growing from," the phrase most people reach for first.
Variation B used "reclaimed by" and "colonizing," nature documentary language.
Variation C used "fused with" and "integrated into," borrowed from material transformation.
Variation D used "transitioning into a living reef of," a phrase that names a destination.
Variation E used "specimen photography style documenting the biological takeover," genre framing.
Variation F used anatomical specificity, naming exactly which organism grew on which camera part: "coral tendrils extending from the lens barrel, cup fungi clustered along the top plate seams, tiny mushroom stems emerging from the viewfinder and winding lever."
Four images per variation. I scored all 24 against a five-dimension rubric weighted for visual quality, prompt alignment, consistency, uniqueness, and X engagement.
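For anyone who wants to replicate the scoring, the rubric math is nothing exotic: each image gets a 0-10 score per dimension, then a weighted average. Here's a minimal sketch; the weights below are illustrative placeholders, not my exact ones.

```python
# A weighted five-dimension rubric, sketched.
# These weights are illustrative placeholders, not the exact ones I used.
WEIGHTS = {
    "visual_quality": 0.30,
    "prompt_alignment": 0.25,
    "consistency": 0.20,
    "uniqueness": 0.15,
    "x_engagement": 0.10,
}

def rubric_score(dims: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-10 scale."""
    assert set(dims) == set(WEIGHTS), "score every dimension exactly once"
    return sum(WEIGHTS[d] * v for d, v in dims.items())

print(rubric_score({
    "visual_quality": 9.5, "prompt_alignment": 9.0, "consistency": 9.5,
    "uniqueness": 8.0, "x_engagement": 9.0,
}))  # -> 9.1 (up to float rounding)
```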

Variation A, the baseline. "Growing from" is the phrase most prompters reach for, and it averaged 8.01. The organisms read as additions to the camera rather than as integrated growth.
Here's where they landed:
| Variation | Phrasing | Average | Peak | Floor |
|---|---|---|---|---|
| F | Anatomical specificity | 9.13 | 9.25 | 9.00 |
| D | "Transitioning into a living reef" | 9.09 | 9.40 | 8.85 |
| B | "Reclaimed by" / "colonizing" | 8.95 | 9.40 | 8.70 |
| E | "Specimen photography style" | 8.66 | 9.25 | 8.00 |
| C | "Fused with" / "integrated into" | 8.51 | 9.35 | 8.00 |
| A | "Growing from" (baseline) | 8.01 | 8.70 | 7.45 |
The anatomical specificity prompt won on average. But D, the one that said "transitioning into a living reef," beat it on peak, 9.40 to 9.25. And they looked completely different.

Variation F, anatomical specificity. Coral tendrils from the lens barrel, cup fungi along the top plate seams. A 9.00 floor across the batch.
F produced tightly controlled images where coral tendrils emerged exactly from named camera parts. Studio-clean, anatomically precise, photographically disciplined. Every image scored 9.00 or higher. No outliers.

Variation D, "transitioning into a living reef." No named parts. No anatomical targeting. The phrase "living reef" alone pulled Nano Banana 2 into a full underwater render with god rays, substrate, and marine life that the prompt never explicitly asked for.
D produced cinematic underwater scenes. The camera wasn't just overgrown, it had transformed. Light rays from above. Seafloor substrate. Reef-accurate coral species. One image had the lens barrel becoming a glowing portal. The word "living reef" had pulled Nano Banana 2 into a full environmental render that the prompt never explicitly asked for.
Two prompts. Roughly the same score. Totally different strategies. F was optimizing for reliability. D was optimizing for spectacle.
This was the first thing I didn't expect: these weren't competing strategies for the same job. They looked like they might be operating on different dimensions. I didn't have the data yet to prove it, but I had a hypothesis.
The Detour That Turned Into the Actual Finding
Before testing the hypothesis, I wanted to understand what made F work. Because F wasn't one prompt technique, it was three stacked on top of each other.
It named specific organism types, "coral tendrils" instead of "coral," "cup fungi" instead of "mushrooms," "mushroom stems" instead of "more mushrooms."
It named specific host parts, "lens barrel" instead of "lens," "top plate seams" instead of "body," "viewfinder and winding lever" instead of "mechanism."
And it used three different action verbs, "extending from," "clustered along," "emerging from," instead of placeholder verbs.
Any one of those could have been doing the work. Or all three. Or some could have been actively hurting results, and F was winning despite them.
So I stripped them one at a time.
I switched subjects for round 2, a vintage mechanical typewriter instead of the camera. Different anatomy, different iconic recognition, a stress test of whether any findings would generalize. Then I ran four strip tests.
G kept host-part specificity but stripped organism names. Generic "coral" and "mushrooms" instead of "coral tendrils" and "cup fungi." Host parts still specific.
H did the opposite. Kept organism names but stripped host-part specificity. "Coral tendrils extending from the body" instead of "the carriage and platen."
Variation I kept both specificity layers but replaced the action verbs with a neutral "on the."
J ran the full F formula on the typewriter as a control, a check on whether F generalized to a new subject at all.
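If you want to run this kind of teardown on your own subject, the strip tests are easy to mechanize. Here's a sketch of G through J as programmatic variants; the fragment strings are paraphrases of my prompts, not the exact text.

```python
# Round-2 strip tests (G, H, I, J) as programmatic prompt variants.
# Fragment strings are paraphrases of my prompts, not the exact text.
FULL_ORGANISMS = ["bioluminescent coral tendrils", "cup fungi", "tiny mushroom stems"]
FULL_PARTS = ["the carriage and platen", "the type bar seams", "the keys and ribbon spools"]
FULL_VERBS = ["extending from", "clustered along", "emerging from"]

def build(organisms, parts, verbs) -> str:
    clauses = ", ".join(f"{o} {v} {p}" for o, v, p in zip(organisms, parts, verbs))
    return f"vintage mechanical typewriter with {clauses}, soft lighting, plain background"

variants = {
    "G": build(["coral", "mushrooms", "mushrooms"], FULL_PARTS, FULL_VERBS),          # strip organism names
    "H": build(FULL_ORGANISMS, ["the body", "the top surface", "the body"], FULL_VERBS),  # strip host parts
    "I": build(FULL_ORGANISMS, FULL_PARTS, ["on", "on", "on"]),                       # strip action verbs
    "J": build(FULL_ORGANISMS, FULL_PARTS, FULL_VERBS),                               # full F formula (control)
}
for name, prompt in variants.items():
    print(name, "->", prompt)
```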

Variation H. Organism names kept, host parts stripped to "body" and "top surface." The coral drifts above the subject instead of integrating with it. This image scored a 7.50, the floor of the entire teardown.
Here's the data:
| Variation | What was stripped | Average | Peak | Floor |
|---|---|---|---|---|
| I | Action verbs | 9.33 | 9.40 | 9.25 |
| J | Nothing (full formula) | 9.33 | 9.40 | 9.25 |
| G | Organism-type names | 9.15 | 9.40 | 8.70 |
| H | Host-part names | 8.11 | 8.75 | 7.50 |
Three things fell out of this almost immediately.
H is a full point below everything else. Stripping host-part names, writing "body" instead of "carriage and platen," "top surface" instead of "type bar seams," caused the biggest quality drop I've ever measured in a strip test. One H image drifted so far that the coral stopped reading as coral and became abstract alien tendrils. Without integration points, Nano Banana 2 doesn't know where to put things, so it puts them in a floating cloud above the subject and hopes for the best.
G and J scored nearly identically. Stripping the organism-type names, writing "coral" instead of "coral tendrils," cost nothing. Nano Banana 2's visual defaults filled in tendril-shaped coral anyway. Naming the morphology was redundant because the model already knew it.
Variation I tied J exactly. Replacing the three varied action verbs with a neutral "on the" produced identical results. The verbs weren't doing connective work. They were grammatical filler, and the model ignored them.
Translation: of the three components I thought were making F win, only one was actually responsible. Host-part specificity is the engine. Organism-type names and action verbs are ornamental.

Variation J, the full F formula applied to a typewriter. Coral on the carriage and platen. Cup fungi on the type bar seams. Mushroom stems on the keys. Score: 9.25. The formula generalizes from cameras to typewriters without quality loss.
That finding alone would be a useful article. But at this point in round 2, I still had the hypothesis from round 1 to test.
The Test That Produced the First 9.85
Back to the original question. Were F and D competing strategies, or were they operating on different dimensions?
If they were competing, combining them would produce noise, two prompts arguing with each other. If they were operating on different dimensions, combining them would compound.
I ran Variation K: the F structure plus D's genre phrase at the front.
Vintage mechanical typewriter transitioning into a living reef, with bioluminescent coral tendrils extending from the carriage and platen, cup fungi clustered along the type bar seams, tiny mushroom stems emerging from the keys and ribbon spools...

Four images.
Three of them scored 9.85.

Variation K. Anatomical specificity plus genre invocation. Three of four images scored 9.85, the first break in the study's 9.40 ceiling.
That's the first 9.85 I'd seen anywhere in the study. Across the 40 images generated up to that point, the ceiling had been 9.40. K broke through it by nearly half a point, three times in a single four-image batch. The fourth image scored 9.40, meaning every single K image landed at or above what had previously been the absolute ceiling of everything else.
The full Round 2 numbers:
| Variation | Description | Average | Peak | Floor |
|---|---|---|---|---|
| K | Specificity + "transitioning into a living reef" | 9.74 | 9.85 | 9.40 |
| I | No action verbs | 9.33 | 9.40 | 9.25 |
| J | Full F formula | 9.33 | 9.40 | 9.25 |
| G | Generic organisms | 9.15 | 9.40 | 8.70 |
| L | 5 anatomy pairs (see below) | 8.90 | 9.15 | 8.15 |
| H | Generic host parts | 8.11 | 8.75 | 7.50 |
The hypothesis was right. F and D weren't competing. They were working on different layers.
Why They Compound
Once the K numbers came in, I could see the structure.
Host-part specificity tells Nano Banana 2 where to integrate organisms. Coral on the carriage. Fungi on the type bar seams. Mushrooms on the keys. It controls structural placement. Without it, the model scatters growth in whatever arrangement looks vaguely pleasing. With it, every named anatomy point gets populated.
Genre invocation tells Nano Banana 2 what atmosphere to render. "Living reef" doesn't just add more coral. It pulls in underwater lighting, substrate, secondary marine life, god rays from above, the full visual vocabulary of reef photography. It controls environmental context. Without it, you get an object on a background. With it, the object is inside a world.

Another K output. God rays from above. Species-accurate coral. Secondary marine life Nano Banana 2 added without prompting. The anatomical pins kept the typewriter recognizable inside an environment the genre noun generated.
These are genuinely independent axes. One handles subject integration. The other handles scene atmosphere. Neither does the other's job. Neither substitutes for the other. And when both are present, the image has to satisfy both simultaneously, which is why the K images didn't just score higher, they scored higher on every dimension at once.
The numbers back this up. Specificity alone (J) averaged 9.33. Genre alone (D in round 1) averaged 9.09. Combined (K), they averaged 9.74, 0.41 above the stronger parent, a jump neither axis had produced on its own anywhere in the study.
The Ceiling Test
I ran one more variation to close out round 2. L, five anatomy pairs instead of three. I wanted to know if host-part specificity, the proven engine, would scale linearly or hit a ceiling.
It hit a ceiling.

Variation L. Five anatomy pairs instead of three. Mycelium got rendered four different ways across four images: tree branches, electrical discharge, cobwebs, ambiguous stringy stuff. Score: 8.15.
L dropped to 8.90, 0.43 points below J with the same subject. Three things went wrong. The mycelium I added ("thread-like mycelium branching across the paper guide") got interpreted four different ways across four images, as tree branches, as electrical discharge, as cobwebs, as ambiguous stringy stuff. "Encrusting coral polyps" and "cup fungi" visually conflated because they share morphology and the model couldn't differentiate under load. And the prompt got long enough that the length penalty I've documented in earlier testing reasserted itself.
The practical finding: three anatomy pairs is the sweet spot. Probably works at two. Actively degrades at five. If you're naming host parts, pick the three most important ones and stop.
The Formula
Here's what two rounds and 48 images say about how to prompt biological overgrowth, or, I suspect, any integration problem where two things need to combine into one image.
[Host object] transitioning into a [genre-invoking destination],
with [organism] on the [specific part 1],
[organism] on the [specific part 2],
[organism] on the [specific part 3],
[lighting], [background], [photographic grounding]

Two rules do almost all the work.
Name three specific parts of the host object. Not the body, the carriage and platen. Not the surface, the type bar seams. Not the mechanism, the winding lever. Three pairs. The most anatomically precise terms you can think of. This is the single biggest driver of quality in the study and most prompters skip it in favor of describing the organism in detail. Reverse the instinct. The organism renders fine on its own; the placement is what you have to specify.
Add a destination noun that invokes a photographic genre. "Living reef." "Ancient forest." "Crystal cavern." "Alien ecosystem." The word has to carry visual conventions the model can draw from. "A green environment" won't do what "a mossy woodland" will. The destination noun is what triggers atmosphere.
If you use both, you get both. Structural integration that's anatomically targeted and environmental transformation that's atmospherically complete.

K3. The heaviest coral integration in the full 48-image study. The typewriter body reads as reef formation. Anatomical pins plus genre noun, compounding all the way to the top of the rubric.
If you only use one, you're leaving 0.4 to 0.7 points on a 10-point scale, depending on which one you drop. Which is, in my rubric, the difference between a good AI image and an image that looks like it came out of a camera.
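And if you'd rather not retype the template every time, here's the formula as a small helper. The function name and the example grounding string are my own, and it enforces the three-pair ceiling from the L test.

```python
# The two-axis formula as a reusable template. Names and the example
# grounding string are my own; swap in whatever fragments you like.
def overgrowth_prompt(host: str, destination: str,
                      pairs: list[tuple[str, str]], grounding: str) -> str:
    # The L test showed quality degrades past three anatomy pairs, so cap it.
    if len(pairs) > 3:
        raise ValueError("name the three most important host parts and stop")
    clauses = ", ".join(f"{organism} on the {part}" for organism, part in pairs)
    return f"{host} transitioning into a {destination}, with {clauses}, {grounding}"

print(overgrowth_prompt(
    host="vintage mechanical typewriter",
    destination="living reef",
    pairs=[
        ("bioluminescent coral tendrils", "carriage and platen"),
        ("cup fungi", "type bar seams"),
        ("tiny mushroom stems", "keys and ribbon spools"),
    ],
    grounding="soft directional lighting, dark neutral background, photorealistic",
))
```

Swap the pairs list for your own host's three most recognizable parts and the destination noun for whatever genre you're invoking, and you're most of the way there.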
What I'd Still Like to Know
Two rounds and 48 images is enough to call this a reliable pattern. It's not enough to call it proven. Three things are still open.
Does the pattern hold on a third subject? I've tested cameras and typewriters. Both have complex anatomy with lots of named parts. I don't yet know whether the formula works as well on something without that structure, a sneaker, a food item, a piece of furniture. Round 3 is the generalization check.
Is "living reef" specifically privileged, or will any strong genre noun work? I've only tested one destination noun across both rounds. If "ancient forest" and "crystal cavern" produce equivalent quality jumps, the compounding-axes principle generalizes cleanly. If they don't, there's something specific to underwater contexts that I haven't understood yet.
Does this work on other models? This whole study was on Nano Banana 2, which has genre intelligence I've documented before. Firefly Image 5 responds to different language. GPT Image 1.5 responds to different language again. Whether the two-axis structure translates is the cross-model question.
I'll run those tests this week. If you want the full data, all 48 images, session logs, scoring breakdowns, reply to this newsletter and I'll send it over.
In the meantime: if you're doing overgrowth, material transformation, or really any prompt where you're trying to combine two things into one coherent image, try the formula. Three anatomy pairs plus one genre noun. See what happens.
I'd be curious to see what you get.
Testing methodology: Nano Banana 2 (via Adobe Firefly). All images scored using a weighted 5-dimension rubric. Minimum 4 generations per variation before drawing conclusions.

