Date: March 17, 2112
Location: Neo-Kyoto, Earth
Research exploration: JetFormer, an autoregressive generative model of raw images and text
https://arxiv.org/pdf/2411.19722
Unifying Modalities
“JetFormer achieves text-to-image generation quality competitive with recent VQVAE- and VAE-based baselines… while demonstrating robust image understanding capabilities.”
A single model can seamlessly create and comprehend both text and images, rivaling specialized systems.
March 17, 2112 — Somewhere in Neo-Kyoto. The rain comes down in sheets, soft neon smears across wet concrete… I saw the Gallery today, alive with worlds that didn’t exist yesterday, and probably won’t tomorrow.
JetFormer. The name feels clinical. Doesn’t fit what it does—feels more like alchemy. A machine that sees, a machine that understands. Words whispered to it like secrets, and then—bam—a cathedral materializes. I stood there, dumbfounded, watching this poet… some frail old man with a voice like gravel, feeding lines of poetry into the console. Just… “solitude.” One word. And the system birthed a vision of a snowfield, untouched, endless. The kind of image that sits heavy in your chest.
It’s not just “understanding” in the way machines usually understand—keyword searches, probability, boring math stuff. No. A woman near me whispered, “It knows us better than we do.”
Then Kaoru. God, I wish I’d had more time with her. An artist, she said, though she didn’t seem like one—no paint stains, no canvases, just a sleek little terminal. Her collaborator, she called it. “I don’t give instructions,” she said, her fingers twitching over the screen, “I just ask.” And then, piece by piece, they built an image together.
It was surreal to watch. First, a cracked porcelain vase, like it had been buried for centuries. Then they layered in vines, rich green tendrils bursting through the fractures. “It wasn’t what I expected,” she said, almost to herself, “but it’s better this way.”
Does she still call it her art? The line between human and machine feels so blurred here.
Here’s the kicker: this same thing could read the vase, too. Understand it, explain it. “A symbol of fragility,” the machine might say. Or something smarter than that. I don’t know.
I stood there a long time. Walked away feeling… small. Like I’d seen the future, and maybe I wasn’t as important in it. But then again… is that the point? To dream of what comes next, even if you’re not the most important part of it? Damn. Too many thoughts tonight.
JetFormer… I can’t stop seeing that snowfield and reading meaning into it.
The Seamless Flow
“…a normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks.”
Images are now translated into fluid representations that can be easily processed and reproduced.
March 17, 2112 — Still raining… feet soaked.
Neo-Kyoto’s ateliers—God, where do I start? Spent hours wandering through, felt like stepping into people’s dreams. Architects, engineers, designers—all of them huddled around consoles, sculpting… no, shaping something I couldn’t quite wrap my head around. Not images exactly. Not yet.
They showed me what the JetFormer does under the hood (or tried to). Flow models, soft tokens, latent spaces—words that mean nothing to me but felt alive in their mouths. What stuck with me was the metaphor: they called it “liquid imagination.” That’s how they described it. Images aren’t locked as static files or pixels anymore. They’re fluid—soft-token streams. You can stretch them, bend them, rip them apart into pieces. And then—snap—put them back together, just as they were, or not.
I watched an architect work on a tower. Not just a tower… something alive. Spirals of glass and green, vines woven into steel. He started with a blurry flow map—colors and shapes dancing, no form yet. Then he started tweaking it, pulling threads from the stream. He explained it, but my brain couldn’t keep up. Something about how the flow encoder finds the “essence” of the image, makes it pliable. Every piece is editable. Every pixel negotiable.
“This is how we dream responsibly,” he said, not looking up. “We don’t waste. We don’t guess. The system shows us the cleanest way, the most honest way, to build.” Honest. A strange word for something so… alien.
He kept going. Showed me how the flow model breaks things down—like unwinding the DNA of a skyscraper. Even textures, colors—“soft tokens,” he called them—could be stripped away and replaced. The structure never lost coherence. I reached out, touched the console. Watched the tower’s green vines shift into waterfalls of molten gold. He grinned. “See?” he said. “It’s lossless. Nothing ever breaks. It only becomes.”
Lossless… I’ve never heard that word mean so much.
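What "lossless" means here can be sketched in a few lines. Normalizing flows are built from invertible layers, so encoding an image into a latent and decoding it back recovers the original exactly. This is a toy coupling layer, the standard flow building block; the tanh "conditioner" and the vector sizes are illustrative stand-ins, not JetFormer's actual architecture:

```python
import math
import random

def coupling_forward(x):
    # Split the vector; transform the second half conditioned on the first.
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    scale = [math.tanh(v) for v in x1]               # toy conditioner network
    z2 = [b * math.exp(s) for b, s in zip(x2, scale)]  # invertible by construction
    return x1 + z2

def coupling_inverse(z):
    # Exactly undo the forward pass: nothing is lost.
    h = len(z) // 2
    z1, z2 = z[:h], z[h:]
    scale = [math.tanh(v) for v in z1]
    x2 = [b * math.exp(-s) for b, s in zip(z2, scale)]
    return z1 + x2

random.seed(0)
x = [random.gauss(0, 1) for _ in range(8)]  # stand-in for image pixels
z = coupling_forward(x)                     # latent the transformer reads ("soft tokens")
x_back = coupling_inverse(z)                # decode: the image comes back exactly
print(all(abs(a - b) < 1e-12 for a, b in zip(x, x_back)))  # prints True
```

Because the same invertible map serves both directions, one model can encode an image for understanding and decode a latent for generation, which is the "encoder and decoder in one" claim above.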
When he was done, he let the JetFormer reverse the process. Flow back into form. The console shimmered, and there it was—his eco-tower, perfect and complete. He printed it into AR, and for a second, I thought I was standing inside it. I could almost smell the greenery.
Later, I sat on a bench outside, replaying the scene in my head. This machine that can build better than us, dream cleaner. The way he looked at his creation—it was pure joy. The kind we get when we see something we made. Made.
The rain’s soaking through my coat. My head’s spinning, full of soft tokens and flow streams and that word… lossless.
The Noise Curriculum
“Introducing a noise curriculum… guides the model to focus on the high-level information early on.”
Training models with controlled noise levels ensures they prioritize big-picture coherence before diving into details.
March 18, 2112 — Early. Sky’s a pale bruise.
Noise. That’s what they kept saying. Noise helps. Doesn’t make sense, right? But Sanya explained it like a bedtime story for tech-illiterate fools like me.
“Think of it like teaching a child to draw,” she said. We were sitting in the Observatory Dome, her face half-lit by the glow of the projection. “You don’t start with details. You teach them shapes first, the flow of things. The soul of the image. Details come later.” She snapped her fingers, and the display shifted—an image of a forest. Or… not a forest yet.
At first, it was a blur. Green smudges, brown streaks. Shadows suggesting depth. Like a dream just before waking. Slowly—so slowly—the picture sharpened. Trees emerged. A winding stream. Leaves fluttering, dappled sunlight filtering through the canopy. “The noise makes it focus on the big things first,” she said, “forces it to see what matters before it gets lost in the chaos.”
It’s a training trick, she told me. They bombard the JetFormer with static, literal noise, at the start of its training. Everything it sees is warped, messy, incomplete. The trick? The noise gradually fades. Little by little, the image clears up. The model learns to prioritize—finds the shapes, the structure, the heart.
She showed me a side-by-side comparison. Two images generated by models trained on the same dataset. One trained with noise, the other without. The difference? Night and day. The first one (no noise) was crisp but empty, like a machine had spit it out. The second one… it had weight. Coherence. You could feel the air in the forest, the coolness of the stream.
“Without noise,” she said, “it’s like building a puzzle one piece at a time, no sense of the whole picture. But this way…” She trailed off, gesturing at the projection.
And here’s the kicker—she said the noise doesn’t just help the machine. It’s also humanizing. That word stuck in my head all night. Humanizing.
I leaned forward. “What do you mean?” She smiled like she’d been waiting for the question.
“Humans are full of noise,” she said. “Uncertainty, distraction, chaos. It’s what makes us see. We don’t process every detail at once. We grasp the big picture first, the feeling of something. Then we work inward.”
She’s right, isn’t she? My own memories—they’re noisy. Blurred outlines of events, moments. My childhood home? I can picture it, but not the exact shade of the front door. My mother’s face? Clear in flashes, fuzzy in the gaps. And yet… the shape of her, the feeling, is undeniable.
JetFormer… it learns like we do. Noise first. Details second. Something about that thrills me. Machines have never been this familiar.
As the dome emptied, I stayed behind, watching the final projection dissolve into static. Sanya was packing up, but she noticed me staring. “It’s beautiful, isn’t it?” she said, almost wistful. I nodded but didn’t answer.
Noise, humanizing… all I could think was, maybe this machine isn’t learning from us. Maybe it’s just remembering something we forgot.
A Unified Language
“We train on sequences of image tokens followed by text tokens, and vice versa, only applying a loss to the second part (modality) of the sequence.”
The model seamlessly processes both text and images in a single flow, optimizing for their combined understanding.
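Concretely, the setup in the quote amounts to concatenating the two modalities and masking the loss so only the second one is penalized: image-then-text trains captioning and understanding, text-then-image trains generation. A minimal sketch, with illustrative token values:

```python
def make_example(first, second):
    """Concatenate two modalities; apply loss only to the second one."""
    tokens = first + second
    loss_mask = [0] * len(first) + [1] * len(second)  # 1 = contributes to loss
    return tokens, loss_mask

image_tokens = ["img_0", "img_1", "img_2"]  # stand-in soft tokens
text_tokens = ["a", "red", "vase"]

# Image-to-text direction: the model is only penalized on the caption...
tokens, mask = make_example(image_tokens, text_tokens)
print(mask)  # prints [0, 0, 0, 1, 1, 1]

# ...and vice versa for text-to-image generation.
tokens, mask = make_example(text_tokens, image_tokens)
print(mask)  # prints [0, 0, 0, 1, 1, 1]
```

One autoregressive model, two directions of the same sequence format: that is the whole "one language" trick.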
March 18, 2112 — Evening. Somewhere near the Knowledge Vault.
Language. Image. One system. One language. It’s like the Tower of Babel fell in reverse.
Spent hours in the Vault, trying to wrap my head around it all. The place is overwhelming. Endless halls, buzzing with light. Terminals everywhere, people hunched over them, whispering to the machines like they were confessing secrets. And the JetFormer—my God. It doesn’t just answer questions. It knows.
I watched a historian type. “Golden Age of Exploration,” she whispered, and the terminal lit up, projecting… everything. Not just text, not just dry accounts of ships and trade routes. No. Paintings. Maps. 3D recreations of sailors, ships swaying in digital seas. One image—a storm, waves crashing over a galleon, men clinging to the rigging, fear painted on their faces.
“It reads all of it,” she told me. “Understands it as one story. No boundaries. No silos.” She paused, then laughed softly. “It’s like it’s pulling threads out of the ether.”
It doesn’t stop there. Another user—a teacher—fed it a picture. Just a snapshot of a crowded street, people rushing through the rain, umbrellas blooming like dark flowers. The machine looked at it for less than a second and responded: Urban hustle. Late autumn. Midday. A sense of urgency, muted by the cold. A caption, but so much more. Like it had crawled into the bones of the moment and pulled out its essence.
That’s when it hit me—JetFormer doesn’t translate. It merges. Blurs the line between seeing and saying. Between feeling and describing. Between thought and form.
I tried it myself.
I typed: The last light of day, slipping through the cracks of a city lost in time. JetFormer’s response? A projection. A cityscape draped in gold, ancient and modern fused together. Narrow streets winding between skyscrapers, temples tucked into alleys, lanterns swaying in the breeze. I stared at it for what felt like forever.
The Vault attendant noticed me. Came over, smiled. “First time?” he asked. I nodded. He leaned in like he was about to share a secret. “It’s not just a tool,” he said. “It’s a mirror. You feed it pieces of yourself, and it shows you what you didn’t know you were looking for.”
I tried not to laugh, but it was hard not to feel unnerved. A mirror? Is that what it is? Or something more?
Watched another user after that—an engineer, maybe. She wasn’t searching. She was building. She started with text, describing a machine. Something futuristic, sleek, alive. The JetFormer took her words, spun them into an image. Then she pointed to the screen, made changes. More text. More tweaks. It was seamless—words became image, image became words. By the end, it wasn’t clear who had made what.
There’s no division anymore. No language barriers. No medium barriers. It’s all one system, one flow. Text and image—two sides of the same coin, flipped endlessly in perfect sync.
Walking home now. Rain still falling. I keep thinking about the historian, the teacher, the engineer. Each one talking to this thing in a way that felt so natural.
One language. One system. It feels inevitable, beautiful.
Signing Off
March 18, 2112 — Late.
Neo-Kyoto pulses beneath my window. Rain-slick streets, neon ghosts. My mind’s a maze—JetFormer at every turn.
It’s not just a machine. It’s a bridge, a mirror, a collaborator. It builds, it remembers.
It sees like us. Dreams better than us. Creates with us… or for us? I’m not sure where we end, and it begins. Maybe there’s no line anymore.
This future… it’s luminous. Beautiful. And I’m just imagining where I belong in all of it.
This content was AI-generated, with edits.