Running LLMs on AWS g6e: vLLM Deployment & GPU Memory Analysis
After experimenting with self-hosted large language models on my Apple MacBook Air (M4), I wanted to document what a practical vLLM deployment looks like on a g6e instance — AWS’s NVIDIA L40S-based compute tier — and share real GPU memory consumption data across three different vision-capable models.
The Hardware: NVIDIA L40S
The g6e family ships with the NVIDIA L40S, a professional-grade Ada Lovelace GPU targeting AI inference and rendering workloads. Key specs relevant here:
- VRAM: 46,068 MiB usable (~45 GiB)
- TDP: 350W
- Architecture: Ada Lovelace
- CUDA: 13.0 (driver 580.95.05)
With ~45 GB of VRAM, the L40S can comfortably host 7–12B parameter models at 16-bit precision (BF16/FP16), or significantly larger models with quantization.
Deployment Stack
The setup uses two containers orchestrated via Docker Compose:
vLLM Container
The inference engine uses AWS’s own Deep Learning Container (DLC), which comes pre-configured with all CUDA dependencies:
```yaml
image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
command: --model Qwen/Qwen3-VL-8B-Instruct --max-model-len 112000
```
Using the AWS DLC over the upstream vllm/vllm-openai image has a practical benefit: the image is hosted on ECR Public, so pulls from within an AWS region are free and fast.
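Put together, the vLLM service definition might look like the following sketch. The cache mount path and GPU reservation block are assumptions based on common vLLM container defaults, not taken from the actual file:

```yaml
services:
  vllm-qwen3vl-8b:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model Qwen/Qwen3-VL-8B-Instruct --max-model-len 112000
    volumes:
      - huggingface-cache:/root/.cache/huggingface  # assumed cache path inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```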
Open-WebUI
Open-WebUI provides a ChatGPT-style interface connected to the vLLM OpenAI-compatible endpoint:
```yaml
environment:
  - OPENAI_API_BASE_URL=http://vllm-qwen3vl-8b:8000/v1
```
Both services share a private bridge network (vllm-ai-net), so the WebUI communicates with the inference engine over the Docker internal network without exposing the API port publicly.
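A sketch of the Open-WebUI service and the shared bridge network — the host port mapping and API key value here are placeholders, not the real configuration:

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"  # only the UI is published; the mapping is an assumption
    environment:
      - OPENAI_API_BASE_URL=http://vllm-qwen3vl-8b:8000/v1
      - OPENAI_API_KEY=placeholder-key  # vLLM only validates keys if launched with --api-key
    networks:
      - vllm-ai-net

networks:
  vllm-ai-net:
    driver: bridge
```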
Persistent Storage
Two named volumes with bind mounts under /opt/volumes/ handle persistence:
| Volume | Purpose |
|---|---|
| huggingface-cache | Model weights cache — avoids re-downloading on restart |
| open-webui | Chat history, user settings, uploaded files |
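One common way to express named volumes backed by bind mounts under /opt/volumes/ is the local driver with bind options; the exact subpaths below are assumptions:

```yaml
volumes:
  huggingface-cache:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /opt/volumes/huggingface-cache
  open-webui:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /opt/volumes/open-webui
```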
GPU Memory Consumption: Three Models Compared
I tested three vision-language models on the same L40S. Each measurement was taken from nvidia-smi during active inference load.
Model 1: Qwen/Qwen3-VL-8B-Instruct
```
Memory-Usage: 37,378 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 36,834 MiB
GPU-Util: 0% (idle between requests)
Power: 77W / 350W
```
The Qwen3-VL-8B is the heaviest of the three despite being an 8B model. The large VRAM footprint is consistent with a max-model-len of 112,000 tokens — vLLM pre-allocates KV-cache blocks for the full context length at startup.
Reducing --max-model-len is the most effective lever to reclaim VRAM if you don’t need 112K context.
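To see why context length dominates, here is a back-of-envelope KV-cache estimate. The layer and head counts below are illustrative GQA-style values, not Qwen3-VL-8B's published config:

```python
# Back-of-envelope KV-cache sizing:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
# The architecture numbers used here are hypothetical, for illustration only.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_tokens: int, bytes_per_elem: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * max_tokens

# Hypothetical 36-layer model, 8 KV heads, head_dim 128, 112K context, FP16:
gib = kv_cache_bytes(36, 8, 128, 112_000) / 2**30
print(f"{gib:.1f} GiB")  # → 15.4 GiB for this hypothetical config
```

Even with made-up but plausible numbers, a 112K-token cache lands in the double-digit-GiB range, which is why shrinking `--max-model-len` reclaims so much VRAM.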
Model 2: FARA-7B
```
Memory-Usage: 31,184 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 30,640 MiB
GPU-Util: 0% (idle between requests)
Power: 82W / 350W
```
FARA-7B uses ~6 GB less VRAM than Qwen3-VL-8B. With roughly 15 GB of free VRAM remaining, you could technically run a second small model concurrently on the same GPU — though scheduling and memory fragmentation make this non-trivial in practice.
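If you did want to co-locate two engines, vLLM's `--gpu-memory-utilization` flag caps each engine's fraction of total VRAM (it defaults to 0.9, which is why a single engine grabs most of the card). A hedged sketch — the 0.6/0.3 split and the second model ID are hypothetical:

```yaml
services:
  vllm-fara:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model microsoft/Fara-7B --gpu-memory-utilization 0.6
  vllm-small:
    image: public.ecr.aws/deep-learning-containers/vllm:0.11.2-gpu-py312-cu129-ubuntu22.04-ec2-v1.3-soci
    command: --model Qwen/Qwen2.5-1.5B-Instruct --gpu-memory-utilization 0.3
```

Both engines pre-allocate their slice at startup, so the fractions must leave headroom for CUDA context overhead or the second engine will fail to initialize.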
Model 3: nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
```
Memory-Usage: 34,292 MiB / 46,068 MiB
└─ Python process: 530 MiB
└─ VLLM::EngineCore: 33,748 MiB
GPU-Util: 0% (idle between requests)
Power: 84W / 350W
```
Launch flags: --trust-remote-code --dtype bfloat16 --video-pruning-rate 0
Nemotron-Nano-12B sits in the middle despite having more parameters than the other two. The bfloat16 dtype and --video-pruning-rate 0 flag (disabling temporal frame pruning) keep memory predictable. It requires --trust-remote-code as it ships with custom modeling code not yet merged upstream.
Memory Comparison Summary
| Model | Parameters | VRAM Used | VRAM Free | Notes |
|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 8B | 37,378 MiB | 8,690 MiB | 112K context window |
| FARA-7B | 7B | 31,184 MiB | 14,884 MiB | Smallest footprint |
| Nemotron-Nano-12B-VL | 12B | 34,292 MiB | 11,776 MiB | BF16, custom code |
All three models fit comfortably within the L40S’s ~45 GB. The constant 530 MiB consumed by the Python host process is the vLLM API server overhead — consistent across all runs.
Key Takeaways
- KV-cache dominates VRAM, not just model weights. Setting --max-model-len appropriately for your use case can save several GB.
- AWS DLC images are the pragmatic choice on EC2 — ECR pulls are fast and free within a region.
- The L40S handles all three models without VRAM pressure. A g6e.xlarge (single L40S) is a cost-effective option for single-model VLM inference.
- Open-WebUI + vLLM is a solid self-hosted stack — the OpenAI-compatible API means near-zero configuration on the frontend side.
The Docker Compose file is a good starting template for spinning up similar stacks. Before going to production, the main change is replacing the placeholder API key with a proper secrets management solution.
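As a final sanity check, the vLLM endpoint accepts a standard OpenAI chat-completions payload. A minimal standard-library sketch of building that request — the localhost URL assumes you have mapped port 8000 to the host or are running inside the compose network, which the setup above deliberately does not do by default:

```python
import json
from urllib import request

# Standard OpenAI-compatible chat-completions payload for the deployed model.
payload = {
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [{"role": "user", "content": "Describe this deployment in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

# URL is an assumption: port 8000 is internal-only in the compose setup,
# so reach it from inside the network or via a temporary port mapping.
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment against a running server
print(body.decode()[:60])
```

Because the API is OpenAI-compatible, any OpenAI SDK pointed at the same base URL works just as well; nothing here is vLLM-specific beyond the model name.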