Fable Knows. AI & Tech, decoded

Grok Imagine Video 1.5: Real Pricing, Honest Limits, and How to Actually Use It

By Ved Vyas | June 16, 2026 | 12 min read

Most write-ups of xAI’s new video model read like a press release with the serial numbers filed off. They list the same six features and move on.

This one answers the questions those pages skip. What does a finished clip actually cost? Can it really do text-to-video, or is that a myth half the internet is repeating? And where should you run it? The price swings by more than 2x depending on the door you walk through.

Here is the full picture, including the parts xAI would rather you not dwell on.

What Grok Imagine Video 1.5 actually is

Grok Imagine Video 1.5 is an image-to-video model from xAI. You hand it a still image and a prompt describing motion, and it animates that frame into a short clip with sound baked in. It shipped as a preview under the API name grok-imagine-video-1.5-preview, with the dated alias grok-imagine-video-1.5-2026-05-30.

That date in the alias is the cleanest answer to a small mess you will run into if you read around. One platform says the model launched May 30. Another says May 31. xAI’s own news post is stamped June 3. The model itself carries the 2026-05-30 tag, so treat the end of May as the real ship date and the June 3 post as the announcement catching up.

One clarification saves a lot of confusion: this is not the Grok chatbot. Same brand, different product, different job. Grok the assistant answers questions and writes code. Grok Imagine Video 1.5 turns pictures into moving footage. Judging the video model by the chatbot’s reasoning is like reviewing a camera based on how the company’s phone takes calls. They share a logo and nothing else that matters here.

The text-to-video question nobody answers straight

Here is the contradiction that sent me digging.

xAI’s official model documentation states plainly that this model does not support text-to-video. The “modalities” line reads image and video as inputs, and there is a one-line note saying text-to-video is not available.

Yet Morphic’s model page and Imagine Art’s guide both describe text-to-video as a supported mode. Two reputable platforms, flatly contradicting the model’s own spec sheet.

So which is true? Both, in a sense, and the gap explains a lot about how these models reach you.

The raw grok-imagine-video-1.5-preview endpoint, the thing you call directly through xAI’s API, takes an image as a required input. No image, no video. When a third-party platform offers you a “text-to-video” box for this model, it is almost always running a text-to-image step first, then feeding that generated frame into the video model. You type words, a still gets made behind the scenes, and the video model animates it. The platform calls the whole pipeline “text-to-video” because that is what it feels like from your seat.

Why this matters in practice: if you go straight to the API expecting to type a sentence and get a clip, you will hit a wall. Generate or supply a first frame first. If you use a wrapper that advertises text-to-video, understand you are getting a two-step pipeline, and the quality of your clip depends heavily on that intermediate image you never see. The image-to-video path, where you control the starting frame yourself, is the one xAI actually built and the one that produces the most predictable results.
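To make the two-step pipeline concrete, here is a minimal sketch of what a "text-to-video" wrapper plausibly does. Both callables are hypothetical stand-ins that I made up for illustration; neither is an xAI function or any platform's real API.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: neither type maps to a real xAI or platform API.
TextToImage = Callable[[str], bytes]          # prompt -> still image
ImageToVideo = Callable[[bytes, str], bytes]  # (first frame, prompt) -> clip


def generate_clip(
    prompt: str,
    image_to_video: ImageToVideo,
    first_frame: Optional[bytes] = None,
    text_to_image: Optional[TextToImage] = None,
) -> bytes:
    """Animate a first frame, creating one from text only if none is supplied."""
    if first_frame is None:
        if text_to_image is None:
            # Mirrors the raw endpoint as described above: no image, no video.
            raise ValueError("a first frame is required without a text-to-image step")
        # The hidden step a "text-to-video" wrapper runs on your behalf.
        first_frame = text_to_image(prompt)
    return image_to_video(first_frame, prompt)
```

Passing your own `first_frame` skips the hidden step entirely, which is the image-to-video path with the starting frame under your control.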

What it can do

Six capabilities matter. I will rank them by how much you will actually use them.

Image-to-video. The core mode and the strong one. Your image becomes the literal first frame, not a loose suggestion. Subject, framing, color, and lighting carry forward, and the prompt steers how the scene moves. If you already have a look you like, this is the fastest way to put it in motion.

Native synchronized audio. This is the headline. The model generates sound in the same pass as the picture: dialogue with lip-sync, ambient noise, sound effects, and background music, all timed to the motion. Most video models hand you a silent clip and leave you to source and align audio yourself. Cutting that step out is the single biggest time saving here. Reports describe the audio engine even shifting positioning as subjects move across the frame, so a character walking left pulls the sound left. Whether you notice that on a phone speaker is another matter, but the dialogue timing genuinely improved over version 1.0, which had a stiff, mechanical cadence.

Video extension. Each generation caps out at 15 seconds. To go longer, you take the final frame of a clip and tell the model to continue from there, then chain those extensions into a longer sequence. Version 1.5 reduced the quality drop at the join, which is the whole reason this feature is usable rather than a gimmick. Creators stitch these into runs of 60 to 90 seconds.
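The chaining workflow can be sketched as a simple loop. This is an illustration under assumptions, not xAI's implementation: `generate` and `last_frame_of` are hypothetical callables you would back with a real API call and a frame extractor of your choice.

```python
from typing import Callable, List

MAX_SECONDS_PER_GENERATION = 15  # per-generation cap stated in the article


def plan_segments(target_seconds: int, cap: int = MAX_SECONDS_PER_GENERATION) -> List[int]:
    """Split a target running time into clip lengths no longer than the cap."""
    if target_seconds <= 0 or cap <= 0:
        raise ValueError("target_seconds and cap must be positive")
    full, remainder = divmod(target_seconds, cap)
    segments = [cap] * full
    if remainder:
        segments.append(remainder)
    return segments


def chain_clips(
    first_frame: bytes,
    prompt: str,
    segments: List[int],
    generate: Callable[[bytes, str, int], bytes],  # hypothetical: (frame, prompt, seconds) -> clip
    last_frame_of: Callable[[bytes], bytes],       # hypothetical: clip -> final frame
) -> List[bytes]:
    """Generate each segment from the final frame of the previous one."""
    clips = []
    frame = first_frame
    for seconds in segments:
        clip = generate(frame, prompt, seconds)
        clips.append(clip)
        frame = last_frame_of(clip)  # the next segment continues from here
    return clips
```

A 90-second target plans out as six 15-second segments. The sketch does not guard against a very short final segment, so check the remainder against whatever minimum clip length your platform enforces.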

Reference-to-video. Feed in reference images to hold a character or style steady across separate generations, rather than to animate a specific composition. Useful when you need the same face or the same aesthetic across five different shots.

Prompt-based editing. Describe a change to an existing clip and the model applies it while leaving the rest alone. The language-model foundation handles plain-language edit instructions without you touching parameters.

Text-to-video. Covered above. Real on some platforms through a behind-the-scenes image step, absent from the raw API.

The specs, without the marketing gloss

| Spec | Grok Imagine Video 1.5 |
| --- | --- |
| Model name | grok-imagine-video-1.5-preview |
| Primary input | Still image plus text prompt |
| Resolution | 480p (drafts) or 720p (output) |
| Frame rate | 24 fps |
| Clip length | Roughly 6 to 15 seconds per generation |
| Aspect ratios | 16:9, 9:16, 1:1 and others |
| Audio | Native, generated in the same pass |
| Generation speed | Around 5 to 30 seconds depending on complexity |
| Status | Preview, region us-east-1 |
| API rate limit | 60 requests per minute |

A note on the engine. Imagine Art’s guide attributes the model to an “Aurora” autoregressive engine trained on a large GB200 GPU cluster, and credits it with the top spot on an image-to-video leaderboard with a 52-point jump over version 1.0. That is a single-source claim, so I am flagging it as a claim rather than passing it off as confirmed fact. xAI’s own pages do not publish the training details or leaderboard numbers. The capabilities above are what hold up across multiple sources. The benchmark bragging is what one platform reports.

The autoregressive part does have a visible consequence worth knowing. The model builds the clip frame by frame from the start, so actions you describe early in your prompt render early in the clip. Bury the key motion at the end of a long prompt and it may arrive too late to land cleanly. Front-load the action. More on that below.

What it actually costs

This is the section every other page leaves out, and it is the one that decides whether this model fits your budget.

Through xAI’s API, pricing is per second of output, split by resolution:

| Resolution | Price per second | 15-second clip |
| --- | --- | --- |
| 480p | $0.08 | $1.20 |
| 720p | $0.14 | $2.10 |

Each input image adds one cent. Audio, when generated, is included at no extra charge. fal.ai, which serves the model commercially, matches these exact rates: a 5-second 480p clip runs $0.40, a 5-second 720p clip runs $0.70.

Now the math nobody runs. Say you want a polished 90-second piece at 720p, built by chaining six 15-second extensions. That is roughly six generations at $2.10 each, plus the small per-input charges for the frames you pass between them. Call it about $13 for a minute and a half of finished, scored footage. At 480p for drafts, the same 90 seconds drops to around $7.20.
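If you want to rerun that arithmetic for your own project, a small calculator does it. This sketch uses only the rates quoted above, held as integer cents to avoid float rounding. The one assumption is mine: each generation is billed for exactly one input image.

```python
# Rates quoted in the article, in integer cents.
CENTS_PER_SECOND = {"480p": 8, "720p": 14}
CENTS_PER_INPUT_IMAGE = 1


def sequence_cost_cents(seconds_per_clip, clips, resolution, input_images=None):
    """Total cost in cents for a chain of equal-length clips."""
    if input_images is None:
        input_images = clips  # assumption: one input frame per generation
    rate = CENTS_PER_SECOND[resolution]
    return seconds_per_clip * clips * rate + input_images * CENTS_PER_INPUT_IMAGE


def as_dollars(cents):
    """Format integer cents as a dollar string."""
    return f"${cents / 100:.2f}"


print(as_dollars(sequence_cost_cents(15, 6, "720p")))  # six chained 720p clips
print(as_dollars(sequence_cost_cents(15, 6, "480p")))  # same sequence as a 480p draft
```

Under that assumption the 720p run comes to $12.66, which is the "about $13" above, and the 480p draft comes to $7.26 once the six one-cent input charges are added to the $7.20 of footage.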

For perspective, $13 of human video editing buys you a few minutes of someone’s attention. Here it buys a complete sequence with synchronized sound. That is the actual value proposition, stated in dollars instead of adjectives.

Watch the markup if you use a credit-based reseller. One platform charges 2 credits per second for 480p and 3 per second for 720p, with starter credits priced around ten cents each. That works out to roughly $0.30 per 720p second, more than double the raw API rate. You are paying for the interface, the convenience, and not having to manage an xAI account. Sometimes worth it. Worth knowing you are paying it.
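The same check works for any credit-based reseller. This is a rough sketch using the figures above; the ten-cent credit price is the approximate starter rate quoted here, so treat the output as an estimate and plug in your platform's real numbers.

```python
from fractions import Fraction

# Raw API rates from the article, in integer cents per second.
API_CENTS_PER_SECOND = {"480p": 8, "720p": 14}


def reseller_cents_per_second(credits_per_second: int, cents_per_credit: int) -> int:
    """Effective per-second price when a platform bills in credits."""
    return credits_per_second * cents_per_credit


def markup_ratio(resolution: str, credits_per_second: int, cents_per_credit: int) -> Fraction:
    """Reseller price divided by the raw API price for the same second."""
    reseller = reseller_cents_per_second(credits_per_second, cents_per_credit)
    return Fraction(reseller, API_CENTS_PER_SECOND[resolution])


# The example platform: 3 credits per 720p second, about ten cents per credit.
print(reseller_cents_per_second(3, 10))                # cents per second
print(round(float(markup_ratio("720p", 3, 10)), 2))    # multiple of the API rate
```

With those inputs the 720p markup is about 2.14 times the API rate, and the 480p markup (2 credits per second) is 2.5 times.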

Where to actually run it

You have four real doors into this model, and they suit different people.

Straight to the xAI API. Cheapest per second, full control, and you write code. The starter snippet is about a dozen lines of Python: import the SDK, point it at your image URL, describe the motion, set duration and resolution, print the result URL. Best for developers and anyone building an automated pipeline.
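I am not reproducing xAI's SDK call here, because I would be guessing at its exact signature. What can be sketched safely is the set of inputs that snippet needs, with the limits from the spec table checked before you spend money on a request. The field names below are illustrative placeholders, not xAI's documented schema, and the image URL is a made-up example.

```python
from dataclasses import dataclass, asdict

MODEL_NAME = "grok-imagine-video-1.5-preview"  # API name given in the article
ALLOWED_RESOLUTIONS = ("480p", "720p")
MAX_DURATION_SECONDS = 15


@dataclass(frozen=True)
class VideoRequest:
    # Field names are placeholders for illustration, not xAI's real schema.
    image_url: str
    prompt: str
    duration_seconds: int
    resolution: str
    model: str = MODEL_NAME

    def __post_init__(self):
        if not self.image_url:
            raise ValueError("an input image is required: no image, no video")
        if not self.prompt.strip():
            raise ValueError("describe the motion you want in the prompt")
        if not 1 <= self.duration_seconds <= MAX_DURATION_SECONDS:
            raise ValueError(f"duration must be 1 to {MAX_DURATION_SECONDS} seconds")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")


def to_payload(request: VideoRequest) -> dict:
    """Plain dict you would adapt to whichever client or REST call you use."""
    return asdict(request)


request = VideoRequest(
    image_url="https://example.com/bottle.png",  # hypothetical image location
    prompt="Slow push-in on the bottle as condensation forms",
    duration_seconds=10,
    resolution="720p",
)
print(to_payload(request))
```

Validating locally catches the common mistakes, such as a missing image or a 20-second duration, before they turn into a failed or billed call.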

fal.ai. Same per-second pricing as xAI, plus a hosted playground and a clean REST API with no cold starts. A sensible middle ground if you want the API economics without standing up everything yourself.

Morphic. A full studio interface with the model in a picker alongside others, video mode, and conversational revision. Good if you want to work in a UI and iterate by chatting rather than re-running calls.

Credit-based consumer tools like the JXP and Imagine Art front ends. No account juggling, simple upload-and-generate flows, and that convenience markup on every second. Best for non-technical creators who value the interface over the per-clip cost.

There is no single right answer. A developer running a content pipeline should go straight to the API and pocket the difference. A solo creator who makes three clips a week and never wants to see a line of code is better served by a UI, even at a premium, because their time is the real cost.

How to write prompts that work

The model rewards specificity and punishes vagueness, like every video model, but it has two quirks worth building into your habit.

First, front-load the action. Because the clip generates from the first frame forward, the model handles early instructions more reliably than late ones. Lead with the main movement, then layer in atmosphere and detail.

Second, name your camera move explicitly. Camera behavior is where this model is genuinely strong. Plain-language directions like slow push-in, dolly forward, pan left, or tracking shot get executed cleanly, without the stutter that gave earlier models away. A directed camera is most of what separates a clip that looks intentional from one that looks generated.

A workable prompt formula:

[Subject and what it does] + [camera move] + [lighting and atmosphere] +

Applied:

Slow push-in on the bottle as condensation forms, gentle rim light, soft ambient room tone

Or for a talking portrait:

The subject smiles, looks to camera, and says a short line of welcome, soft studio light, natural lip-sync
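If you generate prompts in bulk, the formula is easy to encode. This is a minimal sketch of my own, not anything the model requires: it joins the three named components in order so the action always leads, and accepts optional extra descriptors (an audio cue, for instance) after them.

```python
def build_prompt(subject_action: str, camera_move: str, atmosphere: str, *extras: str) -> str:
    """Join prompt components in order, with the main action first."""
    if not subject_action.strip():
        raise ValueError("lead with the subject and what it does")
    parts = [subject_action, camera_move, atmosphere, *extras]
    # Drop empty parts and trailing punctuation so the joins stay clean.
    cleaned = [p.strip().rstrip(",.") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


print(build_prompt(
    "Condensation forms on the bottle",
    "slow push-in",
    "gentle rim light",
    "soft ambient room tone",
))
```

Keeping the action in the first slot enforces the front-loading habit automatically, even when the rest of the prompt is assembled from a spreadsheet of variations.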

Keep scenes clean. Sparse, well-defined frames stay stable. Dense scenes packed with competing elements drift more, especially during fast camera moves. If consistency matters, give the model less to track.

Strengths and the honest weaknesses

The strengths are real and specific:

  • Image-to-video with strong anchoring to your source frame, so the output continues your image rather than reinventing it
  • Native audio in one pass, which removes an entire production step
  • Fast generation, often under 30 seconds, which changes how you work because you can test five directions in the time a heavier model renders one
  • Clean, directable camera movement, among the best in this class

The weaknesses are just as real, and you should hear them before you commit:

  • The 15-second ceiling per generation is a hard limit. Longer work means chaining, and chaining adds cost and small quality joins.
  • Fine detail drifts. Packaging typography, logos, intricate garment details, and other precise brand elements can shift during camera movement. For a lifestyle teaser, fine. For an ad where the label has to be pixel-accurate in every frame, this will bite you, and a model tuned for frame-to-frame product consistency is the safer call.
  • Camera control is good but not surgical. Tools like Kling offer more granular path specification if you need to choreograph exact moves.
  • Dense scenes are less stable than clean ones, so busy compositions are a gamble.
  • It is a preview. Expect rough edges and changes.

The pattern is clear. This is a speed-and-sound model, not a precision model. Knowing which one you need is the whole game.

How it stacks up against the alternatives

Versus Seedance 2.0. Both ship native audio. Seedance holds the edge on frame-to-frame product detail and takes more input types in a single generation, which makes it the steadier pick for complex commercial work. Grok 1.5 counters with faster generation, lip-synced dialogue, and the extension workflow.

Versus Kling 3.0. Kling gives you finer camera-path control, multi-shot construction, and a longer 20-second clip. Grok 1.5 generates faster and handles audio natively with less setup per clip. Choose Kling to choreograph, choose Grok to move quickly.

Versus Runway Gen 4.5. Runway brings an in-browser editing suite, multi-reference character consistency, and timeline tools, which win for branded series where cross-clip consistency is the standard. Grok 1.5 moves faster with less per-clip overhead, which suits high-volume short-form output.

The honest summary: Grok Imagine Video 1.5 is not trying to be the most controllable or the highest fidelity model on the market. It is trying to be the fastest way to get an image-anchored clip with sound out the door. On that narrow goal, it delivers.

Who should use it, and who should skip it

Use it if you make short-form social content, animate product or portrait stills for quick teasers, need synchronized audio without a separate pass, or want to test creative directions fast before committing to a heavier production model. The speed and the built-in sound are tailored to exactly this work.

Skip it, or at least pair it with something else, if you need pixel-accurate brand detail across every frame, long-form narrative beyond what chaining comfortably handles, surgical camera choreography, or anything past basic prompt-based editing. Those jobs belong to specialist tools, and forcing this model into them will cost you reshoots.

Frequently asked questions

What is Grok Imagine Video 1.5? It is xAI’s preview image-to-video model. You give it a still image and a prompt, and it produces a short clip, up to 15 seconds at 480p or 720p and 24 fps, with native synchronized audio including dialogue, sound effects, and music. It is a standalone generation model, separate from the Grok chatbot.

Does it really do text-to-video? The raw API does not. xAI’s documentation lists it as image-to-video only. Some platforms offer a text-to-video box by generating an image from your text first, then animating it, so the experience exists even though the underlying model needs a starting frame.

Does it generate audio? Yes, in the same pass as the video. A single generation can include lip-synced dialogue, ambient sound, effects, and background music, with no separate audio step.

How long can clips be? Up to about 15 seconds per generation. For longer sequences, chain the extend-from-frame feature, continuing each new clip from the last frame of the previous one.

What does it cost? Through the API, 480p runs $0.08 per second and 720p runs $0.14 per second, plus a cent per input image, with audio included. A 15-second 720p clip is about $2.10. Credit-based resellers add a markup, often pushing 720p past $0.27 per second.

Where can I use it? Directly through xAI’s API, through fal.ai at matching rates, inside Morphic’s studio, or through consumer credit tools like JXP and Imagine Art. Pick based on whether you value the lowest cost or the easiest interface.

Is it good for product ads? For concept testing and teasers, yes. For final ads needing exact packaging detail in every frame, a model built for frame-to-frame consistency is the safer choice.

Ved Vyas

Writer at Fable Knows, covering AI and the technology shaping everyday life.
