What Your Organization Can Actually Self-Host: A Plain Guide to Open-Weight AI Models, Hardware, and Licenses

What Your Organization Can Actually Self-Host: A Plain Guide to Open-Weight AI Models, Hardware, and Licenses

Enterprise AI Infrastructure · Reference Guide
Sixty-three percent of organizations already run self-hosted AI models. Most of them didn't choose to. This guide covers which models you can run, on what hardware, under which licenses.
63% of organizations run self-hosted AI models (Wiz, 2026)
68% of those doing so transitively; model arrived in third-party software, not by choice (Wiz, 2026)
8 GB VRAM floor for 7–8B models at INT4 quantization
55 GB VRAM floor for Llama 4 Scout on a single H100 at INT4
Apache 2.0 / MIT the only licenses with zero commercial deployment restrictions

Your organization is probably already self-hosting AI models. The question is whether anyone in legal, compliance, or infrastructure knows which ones, under what licenses, and on hardware they control. A 2026 cloud security report from Wiz found that 63 percent of organizations run self-hosted models, and 68 percent of those are doing so transitively: the model arrived bundled inside a third-party application, not through a deliberate deployment decision. Eighteen percent of organizations have a self-hosted AI footprint made up entirely of models they didn't knowingly deploy.

The standard framing in coverage of open-weight AI is that self-hosting is a cost trade-off. Pay for API access or buy the GPU. That framing is accurate but incomplete. The decisions that matter are about data residency, license risk, and geopolitical exposure. A model running on your infrastructure costs differently than one running on a vendor's servers in another jurisdiction, and that difference doesn't show up in a token price comparison.

This guide is organized the way the chip stack reference worked: by the job the hardware does, not alphabetically by model name. Your video random access memory, or VRAM, budget determines your options before your use case does.

Key Takeaway

The self-hosting conversation most enterprises are having is about cost. The conversation they should be having is about control: which models are running, on whose hardware, under which legal terms, and in which jurisdiction. VRAM sets the floor; the license sets the ceiling.

Tier 1: A Workstation or High-End Laptop (8 to 24 GB VRAM)

This is the tier most organizations already have deployed, whether they know it or not. A developer with a recent workstation or an M-series Mac can run a capable model today without any infrastructure request.

Model Maker Size VRAM (INT4) License Best for
Phi-4-mini Microsoft 3.8B ~2 GB Apache 2.0 Edge, CPU-only, offline deployments. 128K context window at this size is the standout spec.
Qwen3 8B / 14B Alibaba 8B–14B 4–8 GB Apache 2.0 Strong reasoning and coding across 100+ languages. The practical local default for most developers in 2026. No commercial restrictions.
Gemma 3 27B Google DeepMind 27B ~16 GB Gemma ToS Strongest single-GPU general-purpose option in this tier. Verify commercial terms against your use case before deploying at scale.
DeepSeek R1 Distill 32B DeepSeek 32B ~20 GB Apache 2.0 Reasoning performance that exceeds many larger alternatives. Fits on a single RTX 4090. MIT-licensed distilled variants also available at 7B and 14B. See the DeepSeek flag section below before deploying API-connected versions.

Tier 2: A Single Data Center GPU (40 to 80 GB VRAM)

A single NVIDIA H100 or AMD MI300X handles this tier. These are the models where the capability-to-infrastructure ratio starts to make a credible economic argument against closed-API pricing at production volume.

Model Maker Size VRAM (INT4) License Best for
Llama 4 Scout Meta 109B total / 17B active ~55 GB Meta Community ⚠ 10 million token context window. The most cost-effective Llama 4 entry point. Single H100 at INT4. See license note below.
Mistral Large 2 Mistral AI 123B ~62 GB Apache 2.0 Strong instruction following and function calling. Fits on a single H100 at INT4. Apache 2.0 license with no commercial restrictions, the cleanest option at this tier for regulated industries.
Arcee Trinity-Large Arcee AI Large Varies Custom (verify) US-origin model built specifically for enterprises that cannot send data to a third-party API. Healthcare, financial services, and defense organizations with data residency requirements are the primary fit.

The Llama 4 license requires attention before deployment. Meta's Community License prohibits using model outputs to train competing models and requires a separate commercial agreement for products with more than 700 million monthly active users. The Open Source Initiative does not recognize it as meeting the open-source definition. More critically for European buyers: the license explicitly excludes EU-domiciled entities. Any European organization evaluating Llama 4 Scout or Maverick needs legal review before production deployment.

Tier 3: Multi-GPU Server (4 to 8x H100 or H200)

This tier requires a real infrastructure commitment. The models that live here are frontier-competitive on benchmarks and cost-competitive against closed APIs at sufficient volume. The breakeven calculation matters: at very high token volumes, the infrastructure cost justifies itself. Below that threshold, managed API access usually wins on total cost of ownership.

Model Maker Size VRAM (INT4) License Best for
Llama 4 Maverick Meta 400B total / 17B active ~200 GB (4× H100) Meta Community ⚠ Benchmark-competitive with GPT-4o class models. Same EU exclusion and commercial restrictions as Scout.
DeepSeek V4-Flash DeepSeek 284B total / 13B active ~158 GB (2× H100 FP8) MIT MIT license. Self-hosting the weights is the mitigation path for teams where Chinese-jurisdiction API routing is a compliance concern. See the DeepSeek flag section below.
Qwen3-235B-A22B Alibaba 235B total / 22B active ~120 GB (multi-GPU) Apache 2.0 Currently leads the broadest range of public benchmarks across reasoning, coding, and multilingual capability. Apache 2.0 license, the cleanest option at this tier. Strong EU and commercial deployment case.
The license table is the thing most model comparison guides skip. VRAM determines what you can run. The license determines what you're permitted to do with the output.

The License Table

Apache 2.0 and MIT are the only licenses with zero commercial deployment restrictions and no geographic carve-outs. Every other license in the table below carries at least one condition that legal teams in regulated industries will flag.

Model Family License EU Deployment Commercial Use Train Other Models on Outputs
Qwen3 family Apache 2.0 Yes Yes Yes
Phi-4-mini Apache 2.0 Yes Yes Yes
Mistral Large 2 Apache 2.0 Yes Yes Yes
DeepSeek V4 / R1 MIT Yes Yes Yes
Gemma 3 Gemma ToS Yes Verify Restrictions apply
Llama 4 Scout / Maverick Meta Community No (EU excluded) Yes (<700M MAU) No
Arcee Trinity Custom Verify Yes Verify

The DeepSeek Flag

DeepSeek's models ship under MIT and Apache licenses, which are among the most permissive in this guide. The weights themselves are jurisdiction-neutral once downloaded and deployed on your own infrastructure. The problem is the hosted API. Every prompt sent to DeepSeek's API goes to servers operated in China, retained under China's National Intelligence Law. Multiple governments have banned the hosted API: Italy removed it from app stores, Australia blocked it across government systems, and the US Navy, NASA, and the Commerce Department followed. The EU has investigations open across 13 member states.

Self-hosting the weights is the clean mitigation. An organization that downloads DeepSeek R1 Distill 32B and runs it on its own GPU cluster in its own data center has no cross-border data exposure. The model performs, the license permits it, and no prompt ever leaves the organization's perimeter. The risk is in the API, not the weights.

One forward risk worth tracking: US lawmakers have escalated calls to add DeepSeek to the Commerce Department's Entity List. No listing has been finalized as of this publication. Enterprise procurement and legal teams should plan against that scenario. Already-downloaded weights are not retroactively prohibited, but redistribution and commercial deployment in restricted sectors could become complicated if the regulatory situation shifts.

Key Takeaway

Self-hosting DeepSeek weights eliminates the data sovereignty problem. It does not eliminate the regulatory risk of a future Entity List designation. Track this actively, not after the fact.

The Hardware Floor Is Not a Server

The three tiers above assume GPU infrastructure. A fourth category is emerging that doesn't fit the tier model: inference on purpose-built edge silicon with no GPU server in the equation.

EdgeCortix's SAKURA-II accelerator delivers 60 trillion operations per second at a typical power draw of 8 watts, in an M.2 or PCIe form factor that drops into existing edge hardware rather than requiring a dedicated server. The company closed an oversubscribed Series B exceeding $110 million in 2026. That investment signal matters: purpose-built inference silicon at the edge is being treated as a distinct infrastructure category, not a smaller version of the data center problem.

Qualcomm's Hexagon neural processing unit, or NPU, in Snapdragon X Series laptops is already deployed inside most enterprise fleets, whether infrastructure teams have mapped it or not. Snapdragon X Elite devices run Llama 3.1 8B at approximately 5 tokens per second entirely on-device, with no cloud dependency. A developer can turn a Snapdragon laptop into a self-contained inference appliance accessible over the local network. That capability is already sitting on desks across your organization. The policy question is whether your AI governance framework covers it.

mimik's mimOE platform addresses the orchestration layer that neither EdgeCortix nor Qualcomm provides on its own. It turns any device into a node in a distributed inference mesh, with an AI Router that scores available nodes on token speed, model size, available memory, and hardware capability, then routes each workload to the best available machine. All payload is encrypted between nodes. Data does not leave the organization's defined scope unless explicitly configured to do so. The combination of purpose-built inference silicon at the edge and an orchestration layer that routes across a mix of edge nodes, on-premises servers, and cloud describes a self-hosting architecture that doesn't start with "how many H100s do you have."

Qualcomm's Server Tier

Qualcomm occupies both ends of the self-hosting picture. The Snapdragon NPU lives on the laptop on your developer's desk. The Cloud AI 100 Ultra lives in the data center and is optimized for models up to 100 billion parameters on a single 150-watt card. Qualcomm's Dragonwing AI on-premises appliance bundles the Cloud AI 100 Ultra with a software stack covering chatbot deployment, retrieval-augmented generation, image generation, transcription, and multi-agent orchestration, packaged for enterprises that need low-latency, private inference without building GPU infrastructure from scratch.

The AI 200, Qualcomm's next-generation rack-level inference accelerator, is targeting commercial availability in 2026. It introduces near-memory computing architecture and direct liquid cooling at 160 kilowatts per rack. An AI 250 generation is projected for 2027.

Procurement teams evaluating on-premises AI inference have defaulted to NVIDIA for three years. Qualcomm's data center line warrants inclusion in any request for proposal for dedicated inference hardware, particularly for organizations where performance-per-watt and total cost of ownership matter more than CUDA ecosystem compatibility.

Running Tools

Ollama gets a model running in under five minutes and defaults to 4-bit quantization. It's the right starting point for any tier. vLLM is the production serving layer: it handles batching, tensor parallelism across multiple GPUs, and the OpenAI-compatible API surface that most enterprise integrations expect. llama.cpp runs on CPU without a GPU, which makes it the only option in air-gapped environments or hardware with no dedicated accelerator.

Modular's MAX platform addresses a constraint none of the three above solve: hardware portability across vendors. MAX runs the same inference stack across NVIDIA, AMD, Intel, ARM, and Apple Silicon with no code changes, supports over 1,000 open-weight models including DeepSeek and Kimi out of the box, and deploys either in Modular's managed cloud or directly in a customer's own virtual private cloud. Modular acquired BentoML in early 2026, absorbing one of the primary open-source model serving frameworks into the platform. For organizations running mixed GPU environments, or planning to avoid NVIDIA lock-in at the software layer, MAX is the option that makes hardware flexibility operational rather than theoretical.

CIO / CTO Viability Question

Your legal team has reviewed your AI vendor contracts. The question is whether they have ever seen a list of the models running inside your third-party software. If they haven't, the self-hosting conversation is already overdue: you have a self-hosting situation, not a self-hosting strategy.

Two questions before your next infrastructure procurement cycle: does your VRAM budget match the models your teams need to run, and does your license inventory cover the jurisdictions where your data lives? The first is an engineering question. The second belongs in the same conversation as your vendor contract renewals.

Sources

Wiz. "State of AI in the Cloud 2026." Wiz, 2026. wiz.io.

Spheron. "GPU Requirements Cheat Sheet 2026." Spheron, May 2026. spheron.network.

Meta. "Llama 4 Community License Agreement." Meta, Apr. 2026. llama.meta.com.

Meta. "Llama 4 Scout and Maverick Model Cards." Meta, Apr. 2026. huggingface.co.

Mistral AI. "Mistral Large 2 Model Card." Mistral AI, 2026. huggingface.co.

Alibaba. "Qwen3 Model Family." Alibaba DAMO Academy, 2026. huggingface.co.

DeepSeek. "DeepSeek V4 Technical Report." DeepSeek, Apr. 2026. deepseek.com.

DeepSeek. "DeepSeek Privacy Policy." DeepSeek, 2026. deepseek.com.

Arcee AI. "Trinity-Large-Thinking Model Release." Arcee AI, Apr. 2026. arcee.ai.

EdgeCortix. "SAKURA-II Platform." EdgeCortix, 2026. edgecortix.com.

mimik Technology. "mimOE Studio Platform." mimik, Apr. 2026. mimik.com.

Qualcomm. "Cloud AI 100 Ultra and Dragonwing AI On-Premises Appliance." Qualcomm, 2026. qualcomm.com.

GrapeUp. "Running LLMs On-Device with Qualcomm Snapdragon 8 Elite." GrapeUp, Mar. 2026. grapeup.com.

Modular. "MAX Platform." Modular, 2026. modular.com.

Onyx AI. "Self-Hosted LLM Leaderboard 2026." Onyx, Mar. 2026. onyx.app.

Big Hat Group. "DeepSeek V4 for Microsoft-Shop Enterprises." Big Hat Group, Apr. 2026. bighatgroup.com.

Bellamkonda, Shashi. "Every Layer of the Network Is Becoming a Data Center." shashi.co, May 2026. shashi.co.

Bellamkonda, Shashi. "Arcee AI Built the U.S. AI Model Enterprises Can Actually Own." shashi.co, Apr. 2026. shashi.co.

Bellamkonda, Shashi. "mimik Launches mimOE Studio." shashi.co, May 2026. shashi.co.

Bellamkonda, Shashi. "Wiz State of AI in the Cloud 2026." shashi.co, Apr. 2026. shashi.co.

Disclaimer: This blog reflects my personal views only. Content does not represent the views of my employer, Info-Tech Research Group. AI tools may have been used for brevity, structure, or research support. Please independently verify any information before relying on it.