
Midjourney V8.1 Text-to-Image API by MIDJOURNEY
Midjourney V8.1 generates four images from a text prompt, with optional native 2K HD, a style reference, and aspect-ratio / stylize / chaos / weird controls.
1. Introduction
Midjourney V8.1 is the latest iteration of Midjourney's image-synthesis model. This README covers the two core generation endpoints:
- midjourney/v8.1/text-to-image
- midjourney/v8.1/image-to-video
These endpoints belong to a larger Midjourney V8.1 family on this platform, which also includes midjourney/v8.1/image-to-image, midjourney/v8.1/blend, midjourney/v8.1/style-transfer, and midjourney/v8.1/remove-background (each documented separately).
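As a rough illustration of how the text-to-image controls might be assembled into a request, the sketch below builds and validates a payload without sending it. Only the model identifier comes from this README. The field names (`aspect_ratio`, `stylize`, `chaos`, `weird`, `hd`, `style_reference_url`), the `model`/`input` envelope, and the defaults are assumptions made for illustration, so check the platform's actual request schema before relying on them.

```python
from dataclasses import asdict, dataclass
from typing import Optional

MODEL_ID = "midjourney/v8.1/text-to-image"  # identifier taken from this README


@dataclass
class TextToImageRequest:
    """Hypothetical request shape; every field name here is an assumption."""

    prompt: str
    aspect_ratio: str = "1:1"
    stylize: Optional[int] = None
    chaos: Optional[int] = None
    weird: Optional[int] = None
    hd: bool = False
    style_reference_url: Optional[str] = None

    def to_payload(self) -> dict:
        """Validate the basics and return a JSON-serializable dict."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        width, sep, height = self.aspect_ratio.partition(":")
        if not (sep and width.isdigit() and height.isdigit()
                and int(width) > 0 and int(height) > 0):
            raise ValueError("aspect_ratio must look like '16:9'")
        # Drop unset optional controls so the service can apply its own defaults.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return {"model": MODEL_ID, "input": body}


if __name__ == "__main__":
    request = TextToImageRequest(
        prompt="a lighthouse at dawn", aspect_ratio="16:9", stylize=250, hd=True
    )
    print(request.to_payload())
```

Keeping the payload construction separate from any HTTP client makes the validation easy to test and leaves the transport (endpoint URL, authentication) to whatever the platform actually documents.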
Midjourney V8.1 is designed to produce high-aesthetic, prompt-faithful imagery at native 2K resolution with substantially faster generation than prior versions. It is built by Midjourney, an independent, self-funded San Francisco research lab (~11–50 staff) founded in August 2021 by David Holz, and is positioned as a speed- and quality-focused evolution of the company's image pipeline.
The V8 line is a full from-scratch rewrite of Midjourney's image model, accompanied by a migration from TPU-based to GPU-native PyTorch infrastructure. The model's defining methodology is a human-preference aesthetic tuning loop combined with per-user personalization, prioritizing visually compelling output over raw fidelity to a reference dataset. V8.1 entered alpha on April 14, 2026, reached general availability across web and Discord on April 30, 2026, and became Midjourney's default model on June 10, 2026. A few capabilities from the prior V7 model are not yet present in V8.1 (see below), but it is now the company's primary image model.
2. Key Features & Innovations
- Native 2K HD output without a separate upscaler: V8.1 generates directly at 2048px resolution, eliminating the dedicated upscaling step required by earlier versions. HD renders take roughly 1.33 GPU-minutes and standard-definition renders under 1 GPU-minute, with HD running approximately 3× faster and cheaper than in V8.
- ~4–5× faster generation: The GPU-native PyTorch rewrite delivers an estimated four-to-fivefold speedup in generation time over previous Midjourney versions (a Midjourney-stated figure, not an independent benchmark), improving iteration speed for creative workflows.
- Improved text rendering: V8.1 renders in-image text more reliably, with quoted strings in prompts used to specify the intended text, narrowing a long-standing weakness relative to text-specialized competitors.
- Stronger prompt-following: The model adheres more closely to prompt instructions, improving controllability and reducing the prompt-engineering effort needed to achieve a target composition.
- Restored image conditioning: Image prompts and image weights, absent in the V8.0 alpha, returned in V8.1, alongside backward compatibility with V7 style references (srefs), moodboards, and personalization profiles. (Image-driven generation is offered here as the dedicated midjourney/v8.1/image-to-image and blend models.)
- Workflow tooling: V8.1 ships with a Prompt Shortener and an updated /describe command, and its aesthetic has been re-tuned "in the spirit of V7" to preserve the look users prefer.
- Personalized aesthetic tuning: A human-preference (RLHF-style) aesthetic tuning loop combined with per-user personalization shapes outputs toward individually preferred visual styles.
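The quoted-string convention for in-image text, mentioned in the list above, can be applied with a small helper. This is a minimal sketch: the function name and the "with the text" phrasing are illustrative choices rather than a documented Midjourney template. The only behavior taken from this README is that the intended text goes inside double quotes.

```python
def with_quoted_text(description: str, text: str) -> str:
    """Append the exact text to render, wrapped in double quotes.

    The wording of the joined prompt is a hypothetical template.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if '"' in text:
        # A nested double quote would make the quoted span ambiguous.
        raise ValueError("text must not contain double quotes")
    return f'{description.strip()} with the text "{text}"'


if __name__ == "__main__":
    print(with_quoted_text("a neon diner sign at dusk", "OPEN LATE"))
```

Rejecting embedded double quotes up front keeps the quoted span unambiguous, which matters when the rendered wording has to match exactly.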
3. Model Architecture & Technical Details
Midjourney V8.1 is a complete from-scratch rewrite of the company's image model. As part of the V8 program, Midjourney migrated from TPU-based infrastructure to a GPU-native PyTorch stack; David Holz has publicly stated that the original TPU choice "set research back a year." The underlying generative approach is understood to be latent diffusion, though Midjourney has not published a technical paper or model card, and the specific backbone, parameter count, and text encoder remain undisclosed.
Training details are not publicly documented. The dataset has never been disclosed and is the subject of active, unresolved copyright litigation — Disney Enterprises, Inc. v. Midjourney, Inc. (No. 2:25-cv-05275, U.S. District Court for the Central District of California), filed June 11, 2025 by a coalition of major studios including Disney, Marvel, Lucasfilm, Twentieth Century, Universal, and DreamWorks Animation. The studios' infringement claims are allegations in pending litigation and have not been adjudicated. The defining training methodology is a human-preference aesthetic tuning loop (an RLHF-style process) layered with per-user personalization, which together steer the model toward high-aesthetic, user-aligned outputs rather than optimizing for a single fixed objective.
Because V8.1 began as an alpha, several capabilities present in V7 were initially unavailable; the gaps still being closed include Omni Reference (--oref), Character Reference, the --no negative prompt, multi-prompts, the Niji model, Draft Mode, and Turbo mode. (Image prompts and image weights have returned; quality is supported at the 1 and 4 levels; speed is fixed to the fast tier.)
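A client could screen prompts for these gaps before submitting them. The sketch below is one possible pre-flight check, not part of any official tooling. This README names only `--oref` and `--no`; the other flag spellings (`--cref`, `--niji`, `--draft`, `--turbo`, `--q`/`--quality`) and the `::` multi-prompt separator are assumed from Midjourney's customary parameter names and should be verified against current documentation.

```python
from typing import List

# Flags for features this README lists as not yet available in V8.1.
# Spellings other than --oref and --no are assumptions.
UNSUPPORTED_FLAGS = {
    "--oref": "Omni Reference",
    "--cref": "Character Reference",
    "--no": "negative prompt",
    "--niji": "Niji model",
    "--draft": "Draft Mode",
    "--turbo": "Turbo mode",
}
SUPPORTED_QUALITY = {"1", "4"}  # quality levels this README says are supported


def check_prompt(prompt: str) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    tokens = prompt.split()
    for i, token in enumerate(tokens):
        if token in UNSUPPORTED_FLAGS:
            problems.append(f"{token} ({UNSUPPORTED_FLAGS[token]}) is not available yet")
        elif token in ("--q", "--quality"):
            value = tokens[i + 1] if i + 1 < len(tokens) else ""
            if value not in SUPPORTED_QUALITY:
                problems.append(f"quality {value!r} is not supported; use 1 or 4")
    if "::" in prompt:
        problems.append("multi-prompt separator '::' is not available yet")
    return problems


if __name__ == "__main__":
    for issue in check_prompt("cat:: dog --no text --q 2"):
        print(issue)
```

Returning a list of problems instead of raising on the first one lets a caller show every issue at once. The check is a simple whitespace tokenizer, so it will not catch flags glued to other text.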
Regarding the midjourney/v8.1/image-to-video identifier: Midjourney's video capability is separately branded V1, launched June 18, 2025, and is image-to-video only (no text-to-video). It produces 5-second base clips at 24fps, extendable to roughly 21 seconds, with a 480p base resolution and 720p available on higher tiers. It offers Low/High Motion, Auto/Manual settings, and looping with end-frame control (added July 2025). No V8-native or "V8.1" video model has been confirmed, so the "v8.1" tag on the video endpoint reflects the model-family naming on this platform rather than a distinct V8.1 video model.
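The clip figures above translate into frame counts with simple arithmetic. The sketch below uses only the numbers stated in this README (5-second base, 24fps, roughly 21 seconds maximum) and treats the approximate maximum as exactly 21 seconds, which is a simplification.

```python
FPS = 24            # frames per second, as stated above
BASE_SECONDS = 5    # base clip length
MAX_SECONDS = 21    # approximate extended length, treated as exact here


def frame_count(seconds: float, fps: int = FPS) -> int:
    """Number of frames in a clip of the given duration."""
    return round(seconds * fps)


if __name__ == "__main__":
    print(frame_count(BASE_SECONDS))    # 120 frames in a base clip
    print(frame_count(MAX_SECONDS))     # 504 frames at the extended length
    print(MAX_SECONDS - BASE_SECONDS)   # 16 seconds gained by extending
```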
4. Performance Highlights
Midjourney has not published quantitative benchmarks, ELO scores, or arena rankings for V8.1, and the absence of an official public API limits the model's presence in third-party evaluation arenas. Performance is therefore best described qualitatively:
- Speed and efficiency: Approximately 4–5× faster generation overall (Midjourney-stated), with native 2K HD rendering at ~1.33 GPU-minutes and SD under 1 GPU-minute.
- Resolution: Direct 2048px output with no separate upscaling pass.
- Text fidelity: Materially improved in-image text rendering versus prior Midjourney versions.
- Prompt adherence: Stronger instruction-following and controllability.
- Aesthetics: Re-tuned to preserve the visual character of V7 while improving fidelity.
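The GPU-minute figures above lend themselves to a quick budget estimate. The sketch below is back-of-envelope arithmetic only: it assumes the stated figures apply per render job, treats "under 1 GPU-minute" as an upper bound of 1.0, and backs out an implied V8 HD cost from the "approximately 3×" claim. None of these derived numbers are published by Midjourney.

```python
HD_GPU_MINUTES = 1.33        # approximate HD cost stated in this README
SD_GPU_MINUTES_UPPER = 1.0   # "under 1 GPU-minute", used as an upper bound
V8_HD_FACTOR = 3.0           # HD in V8.1 is described as about 3x cheaper than in V8


def estimate_gpu_minutes(jobs: int, hd: bool) -> float:
    """Upper-bound GPU-minutes for a batch, assuming per-job figures."""
    if jobs < 0:
        raise ValueError("jobs must be non-negative")
    per_job = HD_GPU_MINUTES if hd else SD_GPU_MINUTES_UPPER
    return jobs * per_job


def implied_v8_hd_gpu_minutes() -> float:
    """HD cost in V8 implied by the 3x claim (a derived estimate)."""
    return HD_GPU_MINUTES * V8_HD_FACTOR


if __name__ == "__main__":
    print(f"30 HD jobs: about {estimate_gpu_minutes(30, hd=True):.1f} GPU-minutes")
    print(f"30 SD jobs: at most {estimate_gpu_minutes(30, hd=False):.1f} GPU-minutes")
    print(f"Implied V8 HD cost: about {implied_v8_hd_gpu_minutes():.2f} GPU-minutes")
```

Under these assumptions, 30 HD jobs come to roughly 39.9 GPU-minutes, and the implied V8 HD cost is roughly 4 GPU-minutes per render.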
The table below summarizes the competitive landscape for context. No directly comparable arena scores are available across these systems.
| Category | Model | Developer | Notable Strength |
|---|---|---|---|
| Text-to-image | Midjourney V8.1 | Midjourney | Aesthetics, native 2K HD, speed |
| Text-to-image | Flux 2 | Black Forest Labs | Photorealism, open weights |
| Text-to-image | Imagen 4 | Google | In-image text |
| Text-to-image | Ideogram v3 | Ideogram | In-image text |
| Text-to-image | GPT Image / DALL·E | OpenAI | Instruction-following |
| Text-to-image | Firefly 3 | Adobe | Commercial licensing |
| Video | Sora | OpenAI | Text-to-video |
| Video | Veo | Google | High-fidelity video |
| Video | Runway / Kling / Luma | Various | Motion control, length |
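As a rule of thumb, V8.1 is the stronger choice when aesthetics, speed, and native HD resolution matter most. Its in-image text rendering is improved over prior Midjourney versions but still competes with text-specialized models such as Imagen 4 and Ideogram v3.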
5. Intended Use & Applications
- Concept art & pre-production: Rapid generation of high-resolution concept imagery for games, film, and product design, accelerating early ideation with fast 2K output.
- Marketing & social content: Production of on-brand visuals and social media assets at scale, leveraging improved text rendering for graphics that include words and short phrases.
- Film storyboarding & previsualization: Creation of storyboard frames and previs imagery, optionally animated into short clips via Midjourney's separate V1 image-to-video pipeline (midjourney/v8.1/image-to-video).
- Brand & graphic design: Exploration of visual identities, typography-inclusive layouts, and stylistic directions using image prompts, style references, and moodboards.
- Personalized creative iteration: Per-user aesthetic personalization tailors outputs to an individual's preferred visual style, supporting consistent look-and-feel across a body of work.
For image-guided workflows, see the companion models: image-to-image (generate from a reference), blend (fuse multiple images), style-transfer (restyle while preserving composition), and remove-background (isolate a subject on transparency).

















