Running LLMs on AWS g6e: vLLM Deployment & GPU Memory Analysis
After experimenting with self-hosted large language models on my Apple MacBook Air (M4), I wanted to document what a practical vLLM deployment looks like on a g6e instance — AWS’s NVIDIA L40S-based compute tier — and share real GPU memory consumption data across three different vision-capable models.
The Hardware: NVIDIA L40S
The g6e family ships with the NVIDIA L40S, a professional-grade Ada Lovelace GPU targeting AI inference and rendering workloads. Key specs relevant here:
- VRAM: 46,068 MiB usable (~45 GiB)
- TDP: 350W
- Architecture: Ada Lovelace
- CUDA: 13.0 (driver 580.95.05)
With ~45 GB of VRAM, the L40S can comfortably host 7–12B parameter models at 16-bit precision (BF16/FP16), or significantly larger models with quantization.
Deployment Stack
The setup uses two containers orchestrated via Docker Compose:
vLLM Container
The inference engine uses AWS’s own Deep Learning Container (DLC), which comes pre-configured with all CUDA dependencies:
```yaml
image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
command: --model Qwen/Qwen3-VL-8B-Instruct --max-model-len 112000
```
Using the AWS DLC over the upstream vllm/vllm-openai image has a practical benefit: the image is hosted on ECR Public, so pulls from within an AWS region are free and fast.
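Put together, the vLLM service definition might look like the following sketch. The cache mount path and GPU reservation block are assumptions based on common vLLM container defaults, not taken from the actual file:

```yaml
services:
  vllm-qwen3vl-8b:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model Qwen/Qwen3-VL-8B-Instruct --max-model-len 112000
    volumes:
      - huggingface-cache:/root/.cache/huggingface  # assumed cache path inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```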
Open-WebUI
Open-WebUI provides a ChatGPT-style interface connected to the vLLM OpenAI-compatible endpoint:
```yaml
environment:
  - OPENAI_API_BASE_URL=http://vllm-qwen3vl-8b:8000/v1
```
Both services share a private bridge network (vllm-ai-net), so the WebUI communicates with the inference engine over the Docker internal network without exposing the API port publicly.
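A sketch of the Open-WebUI service and the shared bridge network — the host port mapping and API key value here are placeholders, not the real configuration:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"  # only the UI is published; the mapping is an assumption
    environment:
      - OPENAI_API_BASE_URL=http://vllm-qwen3vl-8b:8000/v1
      - OPENAI_API_KEY=placeholder-key  # vLLM only validates keys if launched with --api-key
    networks:
      - vllm-ai-net

networks:
  vllm-ai-net:
    driver: bridge
```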
Persistent Storage
Two named volumes with bind mounts under /opt/volumes/ handle persistence:
| Volume | Purpose |
|---|---|
| huggingface-cache | Model weights cache — avoids re-downloading on restart |
| open-webui | Chat history, user settings, uploaded files |
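One common way to express named volumes backed by bind mounts under /opt/volumes/ is the local driver with bind options; the exact subpaths below are assumptions:

```yaml
volumes:
  huggingface-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /opt/volumes/huggingface-cache
  open-webui:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /opt/volumes/open-webui
```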
GPU Memory Consumption: Three Models Compared
I tested three vision-language models on the same L40S. Each measurement was taken from nvidia-smi during active inference load.
Model 1: Qwen/Qwen3-VL-8B-Instruct
```
Memory-Usage: 37,378 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 36,834 MiB
GPU-Util: 0% (idle between requests)
Power: 77W / 350W
```
The Qwen3-VL-8B is the heaviest of the three despite being an 8B model. The large VRAM footprint is consistent with a max-model-len of 112,000 tokens — vLLM pre-allocates KV-cache blocks for the full context length at startup.
Reducing --max-model-len is the most effective lever to reclaim VRAM if you don’t need 112K context.
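To see why context length dominates, here is a back-of-envelope KV-cache estimate. The layer and head counts below are illustrative GQA-style values, not Qwen3-VL-8B's published config:

```python
# Back-of-envelope KV-cache sizing:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
# The architecture numbers used here are hypothetical, for illustration only.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_tokens: int, bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_tokens

# Hypothetical 36-layer model, 8 KV heads, head_dim 128, 112K context, FP16:
gib = kv_cache_bytes(36, 8, 128, 112_000) / 2**30
print(f"{gib:.1f} GiB")  # → 15.4 GiB for this hypothetical config
```

Even with made-up but plausible numbers, a 112K-token cache lands in the double-digit-GiB range, which is why shrinking `--max-model-len` reclaims so much VRAM.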
Model 2: FARA-7B
```
Memory-Usage: 31,184 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 30,640 MiB
GPU-Util: 0% (idle between requests)
Power: 82W / 350W
```
FARA-7B uses ~6 GB less VRAM than Qwen3-VL-8B. With roughly 15 GB of free VRAM remaining, you could technically run a second small model concurrently on the same GPU — though scheduling and memory fragmentation make this non-trivial in practice.
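If you did want to co-locate two engines, vLLM's `--gpu-memory-utilization` flag caps each engine's fraction of total VRAM (it defaults to 0.9, which is why a single engine grabs most of the card). A hedged sketch — the 0.6/0.3 split and the second model ID are hypothetical:

```yaml
services:
  vllm-fara:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model microsoft/Fara-7B --gpu-memory-utilization 0.6
  vllm-small:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization 0.3
```

Both engines pre-allocate their slice at startup, so the fractions must leave headroom for CUDA context overhead or the second engine will fail to initialize.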
Model 3: nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
```
Memory-Usage: 34,292 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 33,748 MiB
GPU-Util: 0% (idle between requests)
Power: 84W / 350W
```
Launch flags: --trust-remote-code --dtype bfloat16 --video-pruning-rate 0
Nemotron-Nano-12B sits in the middle despite having more parameters than the other two. The bfloat16 dtype and --video-pruning-rate 0 flag (disabling temporal frame pruning) keep memory predictable. It requires --trust-remote-code as it ships with custom modeling code not yet merged upstream.
Memory Comparison Summary
| Model | Parameters | VRAM Used | VRAM Free | Notes |
|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 8B | 37,378 MiB | 8,690 MiB | 112K context window |
| FARA-7B | 7B | 31,184 MiB | 14,884 MiB | Smallest footprint |
| Nemotron-Nano-12B-VL | 12B | 34,292 MiB | 11,776 MiB | BF16, custom code |
All three models fit comfortably within the L40S’s ~45 GB. The constant 530 MiB consumed by the Python host process is the vLLM API server overhead — consistent across all runs.
Key Takeaways
- KV-cache dominates VRAM, not just model weights. Setting --max-model-len appropriately for your use case can save several GB.
- AWS DLC images are the pragmatic choice on EC2 — ECR pulls are fast and free within a region.
- The L40S handles all three models without VRAM pressure. A g6e.xlarge (single L40S) is a cost-effective option for single-model VLM inference.
- Open-WebUI + vLLM is a solid self-hosted stack — the OpenAI-compatible API means near-zero configuration on the frontend side.
The Docker Compose file is a good starting template for spinning up similar stacks. Before going to production, the main change is replacing the placeholder API key with a proper secrets management solution.
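As a final sanity check, the vLLM endpoint accepts a standard OpenAI chat-completions payload. A minimal standard-library sketch of building that request — the localhost URL assumes you have mapped port 8000 to the host or are running inside the compose network, which the setup above deliberately does not do by default:

```python
import json
from urllib import request

# Standard OpenAI-compatible chat-completions payload for the deployed model.
payload = {
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [{"role": "user", "content": "Describe this deployment in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# URL is an assumption: port 8000 is internal-only in the compose setup,
# so reach it from inside the network or via a temporary port mapping.
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment against a running server
print(body.decode()[:60])
```

Because the API is OpenAI-compatible, any OpenAI SDK pointed at the same base URL works just as well; nothing here is vLLM-specific beyond the model name.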