Iris Coleman
Sep 18, 2024 03:29
Mistral AI has unveiled Pixtral 12B, an advanced multimodal model that handles both text and image tasks, with strong instruction following and reasoning capabilities.
Mistral AI has officially rolled out Pixtral 12B, the company's first multimodal model, handling both text and image inputs. The model is released under the Apache 2.0 license, according to Mistral AI.
Pixtral 12B Key Highlights
Pixtral 12B is natively multimodal, trained on interleaved image and text data. It pairs a newly trained 400M-parameter vision encoder with a 12B-parameter multimodal decoder built on Mistral Nemo. This design supports variable image sizes and aspect ratios and can process multiple images within its 128K-token context window.
In terms of performance, Pixtral 12B excels at multimodal tasks while maintaining strong results on text-only evaluations. It scores 52.5% on the MMMU reasoning benchmark, outperforming a number of larger models.
Evaluation and Performance Metrics
Designed as a drop-in replacement for Mistral Nemo 12B, Pixtral 12B delivers leading multimodal reasoning without sacrificing performance on text-only tasks such as instruction following, coding, and math. Evaluations used a consistent protocol across datasets, with Pixtral outperforming both open-source and proprietary models, including Claude 3 Haiku. Notably, it matches or exceeds much larger models such as LLaVA-OneVision 72B on multimodal benchmarks.
On instruction following, Pixtral shows a 20% relative improvement over the nearest open-source model on text IF-Eval and MT-Bench. It also leads on multimodal instruction following, outperforming models such as Qwen2-VL 7B and Phi-3.5 Vision.
Design and Functional Capabilities
Pixtral 12B's architecture is optimized for both speed and performance. The vision encoder tokenizes images at their native resolution and aspect ratio, converting each 16×16 pixel patch into a token. These tokens are flattened into a sequence, with [IMG BREAK] tokens marking the end of each row of patches and an [IMG END] token marking the end of the image. This lets the model understand complex figures and documents in detail while keeping inference fast for smaller images.
The final architecture comprises two components: the vision encoder, which tokenizes images, and the multimodal transformer decoder, which is trained to predict the next text token given interleaved image and text data. This allows the model to process any number of images of varying sizes within its 128K-token context window.
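To make the patch-and-separator scheme concrete, the sketch below lays out the token sequence for a single image. The [IMG], [IMG BREAK], and [IMG END] strings are illustrative placeholders rather than the model's actual vocabulary, and the exact separator placement is an assumption based on the description above.

```python
# Minimal sketch of Pixtral-style image tokenization: one token per 16x16 patch,
# an [IMG BREAK] token after each row of patches, and [IMG END] closing the image.
# Token names here are placeholders, not the released model's vocabulary.

PATCH = 16  # patch side length in pixels

def image_token_sequence(width: int, height: int) -> list[str]:
    """Return a placeholder token layout for an image of the given size."""
    cols = width // PATCH   # patches per row
    rows = height // PATCH  # number of patch rows
    tokens: list[str] = []
    for _ in range(rows):
        tokens.extend(["[IMG]"] * cols)  # one token per 16x16 patch in this row
        tokens.append("[IMG BREAK]")     # marks the end of a patch row
    tokens[-1] = "[IMG END]"             # the final separator closes the image
    return tokens

# Example: a 64x48 image -> 4 patches per row, 3 rows -> 12 patch tokens plus separators.
seq = image_token_sequence(64, 48)
print(len(seq), seq)
```

Because the number of patch tokens scales with image area, small images keep prompts short and inference fast, while large documents receive proportionally more tokens.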
Real-World Implementations
Pixtral 12B performs well across a range of practical scenarios, such as reasoning over complex figures, interpreting charts, and following multi-image instructions. For example, it can merge data from several tables into a single markdown table or generate the HTML code for a website from an image prompt.
Getting Started with Pixtral
Users can explore Pixtral through Le Chat, Mistral AI's conversational chat interface, or via La Plateforme, which exposes the model through API calls. Documentation is available for those who want to integrate Pixtral into their own projects.
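As a rough illustration of the API route, the snippet below sends a text-plus-image prompt through La Plateforme. It assumes the mistralai Python client and the pixtral-12b-2409 model identifier; the official documentation should be consulted for the current client interface, model names, and image-input format.

```python
# Sketch of calling Pixtral via La Plateforme with the `mistralai` Python client.
# Model name and message format are assumptions based on Mistral's public docs.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the chart in this image as a markdown table."},
                {"type": "image_url", "image_url": "https://example.com/chart.png"},  # placeholder URL
            ],
        }
    ],
)

print(response.choices[0].message.content)
```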
For local deployment, Pixtral can be run with the mistral-inference library or with vLLM, which offers higher serving throughput. Detailed setup and usage instructions are included in the accompanying documentation.
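For the vLLM path, a minimal offline-inference sketch might look like the following. It assumes the mistralai/Pixtral-12B-2409 checkpoint on Hugging Face and vLLM's chat interface; exact arguments can vary between vLLM versions.

```python
# Sketch of running Pixtral locally with vLLM's offline chat interface.
# Checkpoint name and arguments are assumptions; check the vLLM docs for your version.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},  # placeholder URL
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```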