This article documents the creation of Episode 1 of The Aperture, a five-part psychological cosmic horror miniseries built using a stack of AI generation tools. Episode 1 runs 3 minutes 21 seconds. It tells the story of Elias Marsh, a Brooklyn painter who realizes he hasn't been the one painting his work. The episode was produced through a workflow combining still image generation in Adobe Firefly using the Nano Banana 2 model, video generation in Seedance 2.0, voice generation in ElevenLabs V3, and final assembly in Adobe Premiere Pro.

https://x.com/GlennHasABeard/status/2048429428242297095?s=20

What follows is the actual production process. Some of these decisions were good. Some I had to walk back. Documenting both means anyone attempting similar work has access to the iteration trail, not just the polished end state.

The Concept

The Aperture is structured as a descent. Each of the five episodes takes a different protagonist (a painter, a pianist, a cartographer, an archivist, a dreamer) and shows the moment they realize they have become a conduit for something on the other side of reality. The barrier between worlds thins with each episode. Episode 1 represents the narrowest breach: a man who has been an unwitting door for months. Episode 5 represents full collapse.

The visual language is grounded in specific craft. Shot on 35mm with anamorphic 2.39:1 framing, painterly chiaroscuro, a desaturated palette warmed by tungsten. The tonal register sits closer to Robert Eggers's The Lighthouse than to mainstream contemporary horror. Articulate, calm, unsettling because it stays composed.

The horror itself is delivered through a recurring visual motif: the Witness. In Episode 1, the Witness appears only as visual glitch corruption inside the protagonist's paintings. RGB chromatic aberration. Localized datamosh. Pixel displacement. A vertical region in the painted background where the image hasn't rendered correctly. Across the series, the Witness becomes progressively more humanoid, but always with proportions that are wrong in ways the audience can't quite name.

The Production Stack

The pipeline for Episode 1 broke down into seven phases (a sketch of the per-shot record they imply follows the tool list below):

  1. Script and shot breakdown

  2. Tier A reference image generation (foundational character and environment stills)

  3. Shot-specific still generation (one or more stills per shot, using Tier A references plus shot-specific prompts)

  4. Voiceover generation

  5. Seedance video generation (using the shot stills as keyframes)

  6. Lip sync application to one specific shot

  7. Final editorial assembly in Premiere

The tools used:

  • Adobe Firefly with Nano Banana 2 model. All still image generation

  • Seedance 2.0. All video generation, with native audio output

  • ElevenLabs V3. Narration and diegetic dialogue

  • Premiere Pro. Final cut, color grade, audio mix

  • Lip sync tool. Applied to Shot 18 only, using ElevenLabs audio
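Tracking 24 shots through seven phases gets unwieldy without some bookkeeping, so it helps to think of each shot as one record that travels through the whole pipeline. A minimal sketch in Python; the structure and every example value here are mine, not any tool's schema:

    # Hypothetical per-shot record: one of these travels through all seven
    # phases. Field names and example values are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Shot:
        shot_id: str                     # e.g. "10A"
        duration_s: int                  # Seedance target length, 4-15 s
        tier_a_refs: list[str] = field(default_factory=list)  # REF-01, REF-06, ...
        scene_anchor: str | None = None  # Anchor A-D, where one applies
        prev_still: str | None = None    # previous-shot reference, if chained
        prompt: str = ""                 # natural-language cinematic prose
        negative: str = ""               # load-bearing exclusions (see Phase 5)
        vo_blocks: list[int] = field(default_factory=list)    # VO segments placed here

    # Example: the regenerated Shot 13 still, with the closed-mouth
    # negatives quoted later in this article. Duration is a guess.
    shot_13 = Shot(
        shot_id="13",
        duration_s=6,
        tier_a_refs=["REF-15"],  # Martha the neighbor
        negative="open mouth, talking, mid-speech, lips parted, teeth showing",
    )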

Phase 1: Script and Structural Decisions

The script went through several drafts before locking. Key structural decisions:

24 distinct shots across 3:21. Some shots are single-still animations of about 5-10 seconds. Others (particularly the flip-through sequence and the transformation) are longer compound generations. One shot (Shot 01) is pure black with audio only. The opening establishes a three-beat rhythm in the audio that becomes the structural bookend of the episode.

The descent structure inside the episode itself. Even within the 3:21 runtime, Elias's grip on reality erodes in measurable beats. He paints. He notices things. He tries to confirm what he sees with a neighbor. He finds physical evidence that can't be explained. He finds a meta-painting that depicts him being watched. He speaks the realization aloud. He transforms.

Audio as a narrative force. The episode opens with three brush strokes on canvas in pure black, no visual. It closes with three piano notes from a different protagonist in a different apartment. The same three-beat rhythm plays in both. An audio rhyme that the audience clocks subliminally before they understand its significance. Three is also the number of dream-records Elias kept (one written, the rest abandoned), the number of mornings he counted the wrong tower's windows, and the implied number of times he repeats key phrases as his testimony fragments.

The "we" pronoun as structural payload. This was the single most important writing decision in the episode. Across the voiceover, Elias uses "we" five times. Early instances read as figurative or rhetorical. Later instances are impossible to read as anything but literal. By the closing whisper "What we let through," the audience knows there has always been a second presence in the room with him. The horror lands because the audience realizes it was always there in the language.

Phase 2: Tier A Reference Image Generation

Before generating any shot-specific stills, I built a foundational reference library. Character sheets, environment establishing images, key prop images. These Tier A references would be uploaded as @-tagged references in subsequent shot prompts to maintain continuity.

The Tier A library for Episode 1 contained 15 reference images:

  • Character sheets for Elias (REF-01), the transformed Elias (REF-02), the Witness in glitch state (REF-04), Mara from Episode 2 in cameo (REF-05), Martha the neighbor (REF-15)

  • Environment establishing shots for Elias's studio (REF-06), the kitchen (REF-07), and Mara's loft (REF-12)

  • Key prop images for the wrong tower (REF-08), the collapse painting (REF-09), the meta-painting of the studio (REF-11), the newspaper (REF-13), the canvas stack (REF-14)

Each reference was generated in Firefly with prompts under 1800 characters, using natural-language cinematic prose rather than tag-list syntax. The generation followed strict rules: 2.39:1 aspect ratio, 35mm film grain, painterly chiaroscuro, desaturated palette with warm tungsten accents.

The Witness in Episode 1's "Glitch" state was generated as a vertical region of corruption rather than a figure. Three persistent visual signatures show up everywhere the glitch does. A 5-degree head-tilt offset in the upper third. RGB channel separation, red shifted left and blue right, by 4-6 pixels. Localized datamosh in roughly a quarter of the corrupted area. These signatures had to remain consistent across every painting in the episode and across the in-world tower exterior, which meant generating reference variants at multiple scales (3% of canvas height for the early paintings, then 6%, 10%, and 15%, up to 25% for the final reveal).
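Signature consistency at this level of specificity is easier to verify in pixels than by eye. As a sanity-check sketch, here is how the RGB-separation signature could be baked into a vertical band of a rendered still with Pillow. The band coordinates are placeholders, and ImageChops.offset wraps pixels around the image edge, which is acceptable at shifts this small:

    # Minimal sketch: apply the Witness's RGB-separation signature to a
    # vertical band of a still. Red shifts left, blue shifts right, by the
    # 4-6 px the reference spec calls for. Band geometry is a placeholder.
    from PIL import Image, ImageChops

    def apply_rgb_separation(src, dst, band_left, band_width, shift_px=5):
        img = Image.open(src).convert("RGB")
        band = img.crop((band_left, 0, band_left + band_width, img.height))
        r, g, b = band.split()
        r = ImageChops.offset(r, -shift_px, 0)  # red channel shifted left
        b = ImageChops.offset(b, shift_px, 0)   # blue channel shifted right
        img.paste(Image.merge("RGB", (r, g, b)), (band_left, 0))
        img.save(dst)

    apply_rgb_separation("ref04_glitch.png", "ref04_shifted.png",
                         band_left=1400, band_width=220)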

Phase 3: Shot-Specific Still Generation

With Tier A locked, I generated shot-specific stills using a three-tier reference chain:

  1. Tier A foundational references. Character and environment continuity

  2. Scene Anchors. Locked compositional reference frames from key shots (Anchor A: studio at dawn from Shot 02; Anchor B: studio at night from Shot 09; Anchor C: kitchen from Shot 06; Anchor D: Mara's loft from Shot 23)

  3. Previous-shot references. For sequences requiring tight continuity, each still references the prior generated still as well

The three-tier chain mattered most where one physical space had to hold across multiple compositions. The flip-through sequence (Shots 10A through 10E) was the most reference-dense. Each canvas needed identical hand position, identical sleeve fall, identical out-of-focus background, with only the painted content changing between them.
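Mechanically, the chain reduces to a simple assembly rule per still. A sketch, assuming the @-tag reference syntax described above; the helper and its character-budget check are mine, not part of Firefly:

    # Assemble a still prompt from the three-tier reference chain.
    # Illustrative helper only; tag syntax follows the workflow above.
    def build_still_prompt(body, tier_a, anchor=None, prev_still=None, budget=1800):
        refs = list(tier_a)            # Tier 1: foundational references
        if anchor:
            refs.append(anchor)        # Tier 2: scene anchor
        if prev_still:
            refs.append(prev_still)    # Tier 3: previous generated still
        prompt = " ".join(f"@{r}" for r in refs) + " " + body
        assert len(prompt) <= budget, f"prompt over budget: {len(prompt)} chars"
        return prompt

    # One flip-through beat: Tier A refs, the night anchor, and the
    # previous beat's still for hand and sleeve continuity.
    prompt_10c = build_still_prompt(
        "Identical hand position and sleeve fall; only the painted content changes.",
        tier_a=["REF-01", "REF-06"],
        anchor="Anchor-B",
        prev_still="shot_10B_still",
    )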

Some shots required rework during this phase. The Shot 13 still of Martha (the neighbor) initially produced an open-mouthed mid-conversation pose, which created problems in later video generation when Seedance interpreted the parted lips as ongoing speech. I regenerated Shot 13 with explicit closed-mouth direction and negative prompts ("open mouth, talking, mid-speech, lips parted, teeth showing") to lock the still as a listening pose. The narrative still works. Elias's voiceover covers what Martha allegedly told him, and visually we see a woman who has just spoken, not one who is speaking.

A similar issue came up with Shot 24, the closing close-up of fingers at piano keys. The initial still generation tripped content moderation despite being a completely innocuous music performance composition. The fix was to reframe the entire still. Instead of "a woman's hand lowering to piano keys" with descriptors of her body in shadow, I rewrote it as "an extreme close-up on piano keys with fingertips visible at the top edge of frame." Subject became the keys. The fingers became compositional accents. No body language at all. The still generated cleanly.

Production lesson: when moderation flags something innocuous, don't argue with the prompt. Reframe what the image is about. A close-up that's "a woman doing something" reads differently to moderators than a close-up that's "an object with a hand visible at frame edge."

Phase 4: Voiceover Generation

The voiceover went through three substantial drafts before locking.

Draft 1 narrated psychological horror from outside it. Calm narrator. Past tense. Workmanlike sentences. The script described strange events rather than enacting them. This version felt safe and inert. Discarded.

Draft 2 swung in the opposite direction. Fragmented syntax. Hesitations. Self-corrections in the middle of sentences. The speaker was visibly losing his grip. The problem was that this version performed disintegration too explicitly. The technique was visible, which made it feel like watching an actor pretend to crack rather than watching a person actually crack. The horror became theatrical. Also discarded.

Draft 3 committed to a specific tonal register: Robert Eggers's The Lighthouse. Calm. Articulate. Slightly archaic cadence. Sentences that finish. Grammar that holds. The horror lives entirely in what he says, delivered with composed precision. He sounds like a man giving testimony before a tribunal he doesn't quite remember being summoned to.

This was the version that worked. It also unlocked the "we" pronoun arc. Because Elias sounds composed, early uses of "we" read as figurative. The audience accepts them without alarm. By the time it's clear the "we" is literal, the structure has already done its work.

[INSERT: A short audio clip, 15-30 seconds, of a representative voiceover passage. Recommend Block 07 ("The tower outside my window is wrong in proportion...") or Block 12 (the closing testimony). These are the strongest examples of the Lighthouse register working. Block 12 with the whispered "What we let through" close is the most powerful single VO clip but might be too spoiler-forward depending on whether readers have watched the episode yet.]

The full script was rendered as a single continuous audio file in ElevenLabs V3, using surgical bracketed delivery tags ([measured], [reflective], [matter-of-fact], [casual unstressed], [testimonial not sad], [quiet], [whispers]). Most lines have no tag. The voice and prompt do the work. Tags appear only at moments where the wrong reading would damage the line. The whispered "What we let through" is the only whisper in the entire 3-minute file. It lands because it has been earned by 3 minutes of restraint.

The voice was created from a custom prompt:

A weary, articulate male voice in his mid-thirties. Mid-range timbre with a slight gravelly weight, like someone who has not spoken to anyone in several days. Soft-spoken with an unhurried, deliberate pace. Slightly old-fashioned cadence without sounding affected, the kind of voice that uses full sentences and finishes them. Neutral American accent with a faint Atlantic formality. Calm and grounded throughout, with no theatrical inflection. The voice of a man giving testimony rather than a man telling a story. Perfect audio quality.

ElevenLabs V3 settings: stability 55, similarity 80, style 15, speaker boost on. Multiple full-script generations produced varying results. I selected the take where the "we" pronouns landed most naturally and the closing whisper was gentle rather than breathy.

The diegetic line ("I'm not the one painting. We never were.") was generated separately with different settings (stability 65, style 10) and lower expressiveness, since it needed to read as a man speaking quietly to himself in a room rather than as narration. Light room reverb was applied in Premiere to place it in the studio's acoustic space.
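For anyone driving this from the API rather than the web UI, the diegetic-line generation looks roughly like the sketch below, using the ElevenLabs Python SDK. Flagging the assumptions: the UI sliders above run 0-100 while the SDK's VoiceSettings takes 0.0-1.0, "eleven_v3" as the V3 model identifier is my best guess, the similarity value is carried over from the narration settings, and the voice ID is a placeholder:

    # Hedged sketch via the elevenlabs Python SDK. Voice ID is a
    # placeholder and the model id is an assumption.
    from elevenlabs import VoiceSettings, save
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key="YOUR_API_KEY")
    audio = client.text_to_speech.convert(
        voice_id="ELIAS_VOICE_ID",
        model_id="eleven_v3",
        text="I'm not the one painting. We never were.",
        voice_settings=VoiceSettings(
            stability=0.65,          # UI 65
            similarity_boost=0.80,   # assumed carry-over from narration
            style=0.10,              # UI 10: lower expressiveness
            use_speaker_boost=True,
        ),
    )
    save(audio, "shot18_diegetic.mp3")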

Phase 5: Seedance 2.0 Video Generation

Each of the 21 video generations used the previously generated stills as keyframes, with prompts under 3500 characters describing the motion to interpolate between (or animate within) the stills.

Seedance 2.0's flexibility on duration (4-15 seconds in 1-second increments) mattered. Some shots wanted only 5-6 seconds (single-still atmospheric motion). The transformation needed two 10-second clips. The flip-through wanted the full 15-second ceiling for a single continuous generation.

The flip-through (Shot 10) deserves particular attention. The original plan was to split it into two 10-second generations, since this was the historical Seedance ceiling. With the new flexibility, I unified it into a single 15-second continuous shot. This unified treatment was critical for the audio bed. The breathing escalation, the heartbeat-like thuds at each cut, and the lamp fluctuation in the final 3 seconds all needed to compound continuously without resetting at a generation boundary. Splitting would have forced the audio to restart and undermined the dread.

Several specific Seedance lessons emerged.

The flip-through has hard cuts at 2.5s, 5s, 7.5s, and 10s. The prompt described these as "the canvas content swaps in the frame as if cutting between five takes of the same shot" rather than as physical canvas-handling actions. Naming the cuts as cuts gave Seedance the right interpretive frame.

Static descriptions of state can imply motion. Shot 16 originally described the canvas as "flipped backwards on the easel, the wooden stretcher-bar back now faces the room, painted front turned away." Seedance read these descriptors as motion to perform (turning, flipping, rotating) rather than as the existing static state of the canvas. The fix was to remove all descriptors of how the canvas got that way and just say "hold static for 5 seconds. Nothing moves."

Negative prompts are load-bearing. Several shots (Shot 13 with Martha, Shot 14 with the dusty canvas, Shot 18 with the diegetic-line close-up) used explicit negative direction to prevent Seedance from defaulting to its priors. "No lip movement, no mouth opening, no speaking" got repeated three times in different ways for Shot 13 because it had to land. "Painted front of canvas visible, painted image visible, tower visible" was the negative for Shot 14 because the audience seeing the painted side of that canvas was narratively wrong.

Continuity-critical shots need budget for regeneration. The flip-through specifically (five canvases held in identical hand position with only the painted content changing) required eight to twelve generations before getting a take where hand position remained stable across all five beats. This is a real production cost. Plan for it.

Continuity transitions between split shots require end-frame discipline. The transformation (Shots 20-22) is split across two 10-second generations. After generating the first half, I exported its final frame as a still, uploaded it as the start frame for the second half. The cut between the two clips became invisible because both halves shared a literal identical frame at the seam.
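The frame export itself is mechanical if ffmpeg is available: seek close to the end of the file and keep overwriting a single output image until only the last decoded frame remains. A sketch, wrapped in Python:

    # Export the final frame of the first transformation clip so it can
    # be uploaded as the start frame of the second. Assumes ffmpeg.
    import subprocess

    def export_last_frame(video_in, frame_out):
        subprocess.run(
            ["ffmpeg", "-y", "-sseof", "-1", "-i", video_in,
             "-update", "1", "-q:v", "1", frame_out],
            check=True,
        )

    export_last_frame("shots_20_21_first_half.mp4", "transform_seam.png")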

The Shot 18 close-up presented a unique problem. Elias speaks the diegetic line "I'm not the one painting" while on camera. If I generated him with lip movement, the lips wouldn't sync to the ElevenLabs audio that gets layered in post, producing the worst of both worlds. Visible speech that doesn't match what we hear. The solution was to generate Shot 18 with a completely closed mouth, no lip movement at all, then apply lip sync in post using the diegetic line audio. This required extensive negative direction to keep Seedance from defaulting to mouth animation: "no lip movement, no mouth opening, no speaking, no talking, no mouthed words, no jaw motion, no mouth animation, lips parted, teeth visible, tongue visible."

Phase 6: Lip Sync Application

Shot 18 is the only shot in the episode that received post-production lip sync. The closed-mouth Seedance video was paired with the separately generated ElevenLabs audio of the diegetic line, and a lip sync tool animated the mouth to match the audio.

Generating audio and video separately, then syncing at the end, beats generating speech at the video level for projects like this. The video gets to be the face. The audio gets to be the voice. The sync tool's job is to introduce them. Nothing has to "look close enough."

For a single-shot lip sync requirement on a 7-second clip, the result is essentially seamless. Anyone who needed lip sync across multiple shots or longer durations would face additional challenges (mouth-shape consistency across shots, performance-quality control), but for one isolated diegetic moment in a 3-minute piece, this approach worked well.

Phase 7: Final Assembly in Premiere Pro

The final cut was assembled in Premiere from:

  • 21 Seedance video clips

  • 1 lip-synced version of Shot 18

  • 1 continuous ElevenLabs voiceover file (sliced into 12 segments and placed against picture; see the slicing sketch after this list)

  • 1 separately-generated diegetic line file

  • Sound design layers for additional ambient texture and sonic threading
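Slicing the continuous VO file into its 12 segments is mechanical once the in and out points are chosen. A sketch with pydub; the timestamps are placeholders, since the real ones came from the locked shot breakdown:

    # Slice the single continuous ElevenLabs file into per-block WAVs.
    # Timestamps are placeholders; pydub slices in milliseconds.
    from pydub import AudioSegment

    BLOCKS = [(0.0, 14.2), (16.0, 27.5), (29.0, 41.8)]  # (start_s, end_s): blocks 1-3 of 12

    vo = AudioSegment.from_file("episode1_vo.mp3")
    for i, (start, end) in enumerate(BLOCKS, start=1):
        vo[int(start * 1000):int(end * 1000)].export(f"vo_block_{i:02d}.wav", format="wav")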

Premiere did the work that no AI tool can yet do: it integrated the pieces. Specifically:

Slip-syncing voiceover to picture. The VO segments had target placement timestamps but were nudged frame by frame in the cut so each line breathed naturally with what was on screen. VO that lands too perfectly on cuts feels artificial. VO that breathes with picture feels lived-in.

Color grading for unification. Each Seedance clip came out with slightly different color characteristics depending on the generation. A unified grade in Premiere (slightly desaturated, warm tungsten lift in the highlights, deep cool shadows) pulled all 21 clips into one visual world.

Audio mixing and sonic threading. The episode has recurring sonic threads. A slightly-too-slow clock tick that enters in Shot 06 and slows further by Shot 17. Off-camera floor creaks at irregular intervals. A tungsten lamp hum that fluctuates in the flip-through finale. Some of these were in Seedance's native audio output. Others were layered or strengthened in Premiere's audio editor.

Reverb on the diegetic line. Light room reverb was applied to "I'm not the one painting. We never were." to place it in the studio acoustic, distinguishing it from the dry intimacy of the voiceover.

The fade-to-black and the closing piano note. The episode's last second resolves Elias's whispered final phrase into the first piano note of the closing image. The handoff between the audio of the whisper and the audio of the piano note was timed in Premiere, with the dissolve happening visually as the audio shifts.

What Worked

The three-tier reference chain (Tier A, Scene Anchors, Previous-Shot) made cross-shot continuity manageable. Without it, the studio would have looked subtly different from shot to shot, breaking the dream of a unified physical space.

Generating the voiceover as one continuous file rather than 12 separate generations preserved emotional arc continuity. The voice in Block 12 has been building from Block 1 across the same audio session, and you can hear it. Disjointed regenerations would have produced 12 islands of voice that didn't quite belong to each other.

Committing to the Lighthouse tonal register over the more conventional "fragmented narrator" approach turned out to be the most important writing decision. Calm narration of impossible content lands harder than panicked narration of strange content.

The "we" pronoun arc proves out in the final cut. By the time the whispered "What we let through" arrives, the audience has retroactively reorganized the entire episode around the question of who Elias has been talking about. This is a kind of structural payoff only available if you commit early and trust the audience to do the work.

The unified flip-through generation (15 seconds, single continuous Seedance clip) worked. The audio compounds across the duration in a way that splitting would have prevented. The longest hold (5 seconds on the meta-painting) earns the dread because the audience has time to see what they're looking at without the shot moving on too quickly.

What I Had to Walk Back

Several decisions had to be reversed during production. Documenting them is more useful than pretending they didn't happen.

The first voiceover draft. Calm narrator describing strange events. Felt safe. Was inert. Discarded.

The second voiceover draft. Fragmented syntax. Hesitations. Performed disintegration. Felt theatrical. Also discarded. The third draft committed to a specific register and worked.

Shot 13's open-mouth reference still. Caused Seedance to animate ongoing speech in a shot where Martha was supposed to be silent. Regenerated with closed-mouth direction.

Shot 14's canvas reveal. Original staging had Elias turning the canvas to show its painted front before flipping it to show the date. The painted front was the wrong tower, but no Tier A reference existed for that specific painting in handheld context. The audience also didn't need to see the painted front since they had just seen the wrong tower outside the window in Shot 12. Restaging the shot to keep the painted side away from camera throughout was simpler and more disturbing. The painting becomes a thing in the world that exists but cannot be looked at directly.

Shot 16's canvas description. The Beat 2 prose described the backwards-facing canvas using motion-implying language ("flipped backwards," "painted front turned away," "wooden stretcher-bar back now faces the room"). Seedance read the descriptors as motion to perform. Stripping the prose to "hold static for 5 seconds, nothing moves" produced the intended held-still composition.

Shot 17's start position. Original Shot 17 began with Elias already at the easel holding the canvas. But the previous shot (Shot 16) ended with him seated against the right wall. There was no in-between shot showing him cross to the easel, a continuity teleport. The fix was to start Shot 17 on the Shot 16 end-frame composition and have him perform the cross-and-lift action within Shot 17's own 10 seconds. This actually made the shot stronger. Watching him approach the canvas in real time builds dread that the cleaner cut would have skipped.

Shot 24's content moderation flag. Innocuous music close-up was tripping moderation due to body-related descriptors. Reframing the prompt so the keys were the subject and the fingers were a compositional accent at frame edge resolved the flag.

Generation count. The original plan was 22-24 video generations. The final episode used 21, plus one shot (Shot 04 montage) that was assembled in Premiere from stills rather than generated as video. Prefer smaller generation counts when the visual material allows.

Tools and Their Specific Roles

Adobe Firefly with Nano Banana 2 handled all stills. Strong prompt adherence, especially for cinematic compositions and consistent character work across multiple references.

Seedance 2.0 handled all video. The native audio output is genuinely useful. Having ambient room tone, breathing, lamp hum, and mechanical sounds generated alongside the picture saved significant sound design work in post. The 4-15 second flexible duration support changes what's possible structurally. Longer continuous generations with internal hard cuts are now viable, which the flip-through would not have been before.

ElevenLabs V3 handled voice. The bracketed delivery tags work as advertised when used surgically. Over-tagging produces theatrical reads. Single-tag direction at moments that matter produces clean reads. The model's ability to maintain emotional continuity across a continuous generation is meaningful for narrative work.

Premiere Pro handled assembly. Color grading, audio mixing, slip-syncing voiceover to picture, applying reverb, sequencing the visual cuts, and resolving the final whisper into the closing piano note.

What's Next

Episode 1 is done. Episode 2 follows Mara, the pianist. Same descent structure. Same three-tier reference workflow. The Witness in Episode 2 advances from pure glitch corruption to something with the suggestion of form, a "Suggestion" state. Mara's testimony will not have the same Lighthouse register as Elias's. Her voice belongs to a different character with a different relationship to her work and her household.

The framework built for Episode 1 should accelerate Episode 2's production substantially. The reference library structure, the script-to-stills-to-video pipeline, the voiceover generation approach, the lip sync workflow for diegetic dialogue. Most of the iteration work happened in Episode 1.

Episode 2 next.

https://youtu.be/or9eLVQrzGA?si=cCDTkbMz8QDTNOAB

Glenn Williams (@glennhasabeard) is a luthier at PRS Guitars and an Adobe Firefly Ambassador. The Render is his newsletter on AI-driven creative work.
