Compress LLMs.
Keep the originals.

One command to shrink every model in your Ollama library. Originals stay intact — compressed versions get a -cx suffix.

$ npm install -g compressx

Universal install via npm. Works on any OS with Node.js.

Requires Node.js 18+


NEW · v0.7 · Benchmark release

Benchmark before & after

The new compressx benchmark command compares speed, perplexity, and a 10-prompt regression battery side by side, then gives a verdict.

Live progress bar

Real-time per-tensor progress with percent and ETA while quantization runs. No more wondering if it hung.

Self-installing

First run auto-downloads llama.cpp binaries. No manual setup, no brew install prereqs.

Post-compression smoke test

Every compressed model gets a sanity check. Catches broken quants before you ever load them.
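To make the live progress bar described above concrete, here is a minimal sketch of how a percent and ETA could be derived from tensor counts. It assumes a simple linear extrapolation and a hypothetical helper named progress_line; CompressX may compute its figures differently (for example, weighting by tensor size).

```python
def progress_line(done: int, total: int, elapsed_s: float) -> str:
    """Format percent, tensor counts, elapsed time, and a linear ETA."""
    if not 0 < done <= total:
        raise ValueError("done must be between 1 and total")

    percent = 100.0 * done / total
    # Linear extrapolation (an assumption): remaining tensors take as long,
    # on average, as the ones already finished.
    eta_s = elapsed_s * (total - done) / done

    def mmss(seconds: float) -> str:
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes}:{secs:02d}"

    return (f"{percent:.1f}% {done}/{total} tensors "
            f"{mmss(elapsed_s)} elapsed eta {mmss(eta_s)}")


print(progress_line(169, 291, 14))
```

By tensor count alone, 169 of 291 comes out at about 58.1%, so a progress bar that reports a slightly different percent is presumably weighting tensors by something other than count.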

~/my-project

$ compressx

CompressX - LLM Compression for Ollama

✓ Ollama running with 20 models

✓ NVIDIA RTX 5060 | 32 GB RAM | 8 GB VRAM

Found 4 models that could be smaller:

Model          Current  →  CompressX         Savings
──────────────────────────────────────────────────
qwen3:14b      8.4 GB      6.2 GB  Q3_K_M    -26%
gemma4:12b     9.6 GB      5.8 GB  Q3_K_M    -40%
llama3.1:8b    4.9 GB      3.1 GB  Q4_K_M    -37%

? Select models to compress: (space to toggle)

❯ ◉ qwen3:14b

◉ gemma4:12b

◯ llama3.1:8b

[1/1] Re-quantizing local blob to Q3_K_M...

████████████████░░░░░░░░░░░░ 58.2% 169/291 tensors 0:14 elapsed eta 0:10

Using local Ollama blobs. ~30 sec each, zero download.
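The Savings column in the demo above is plain arithmetic on the two sizes. A short sketch, using a hypothetical savings_percent helper and the sizes shown in the demo:

```python
def savings_percent(current_gb: float, compressed_gb: float) -> int:
    """Relative size change, rounded to a whole percent (negative = smaller)."""
    return round(100 * (compressed_gb - current_gb) / current_gb)


# Sizes copied from the demo output above.
rows = [
    ("qwen3:14b", 8.4, 6.2),
    ("gemma4:12b", 9.6, 5.8),
    ("llama3.1:8b", 4.9, 3.1),
]

for name, current, compressed in rows:
    change = savings_percent(current, compressed)
    print(f"{name:<12} {current:.1f} GB -> {compressed:.1f} GB  {change:+d}%")
```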

Know what you're shipping.

compressx benchmark qwen3:4b runs a side-by-side comparison with a color-coded verdict.

benchmark

CompressX Benchmark: qwen3:4b vs qwen3:4b-cx
────────────────────────────────────────────
                 Original     Compressed   Delta
Size on disk     8.10 GB      2.60 GB      -68%
Prompt eval      142 tok/s    187 tok/s    +32%
Generation       48 tok/s     74 tok/s     +54%
Perplexity       7.42         7.89         +6.3%
Prompt battery   10/10 ok     9/10 ok      1 diverged

Assessment: Good — typical quantization trade-off

• Size reduced by 68%

• Perplexity delta of 6.3% — within expected range

• Generation speed up 54%

Recommendation: Ship it unless quality-critical.
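The Delta column in the report above follows directly from the two measurements. The sketch below recomputes it and adds a toy verdict. The assess function and its thresholds (10% perplexity increase, at most one diverged prompt) are assumptions made for illustration, not CompressX's actual scoring rules.

```python
def pct_delta(original: float, compressed: float) -> float:
    """Relative change from original to compressed, in percent."""
    return 100 * (compressed - original) / original


def assess(ppl_delta_pct: float, diverged: int,
           max_ppl_delta_pct: float = 10.0, max_diverged: int = 1) -> str:
    """Toy verdict; both thresholds are hypothetical."""
    if ppl_delta_pct <= max_ppl_delta_pct and diverged <= max_diverged:
        return "Good"
    return "Review"


# Figures copied from the benchmark report above.
print(f"Size on disk  {pct_delta(8.10, 2.60):+.0f}%")
print(f"Prompt eval   {pct_delta(142, 187):+.0f}%")
print(f"Generation    {pct_delta(48, 74):+.0f}%")
print(f"Perplexity    {pct_delta(7.42, 7.89):+.1f}%")
print("Assessment:  ", assess(pct_delta(7.42, 7.89), diverged=1))
```

Lower perplexity is better, so a positive perplexity delta is the quality cost paid for the size and speed gains.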

Works with your runtime

Source models from Ollama or LM Studio with --source. Deploy anywhere with --target.

Ollama

DEFAULT

Auto-registers as model:tag-cx. No extra steps.

compressx compress qwen3:4b

LM Studio

SOURCE + TARGET

Scan your LM Studio library with --source lmstudio, or deploy compressed GGUFs into ~/.lmstudio/models/.

compressx scan --source lmstudio

Everything else

Leaves the raw GGUF file in the output directory. Use with any GGUF-compatible tool.

compressx compress qwen3:4b --target gguf

Compatible with: Ollama · LM Studio · llama.cpp · Jan · GPT4All · Msty · text-generation-webui · koboldcpp
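The model:tag-cx naming used for the Ollama target can be sketched in a few lines. The cx_name helper is hypothetical. Falling back to Ollama's default latest tag for untagged names is an assumption about how CompressX handles them.

```python
def cx_name(model: str, suffix: str = "-cx") -> str:
    """Derive the compressed model's name, e.g. qwen3:4b -> qwen3:4b-cx."""
    # Only the last path segment carries the tag; a registry host may
    # itself contain a colon (host:port), so split that part off first.
    head, slash, last = model.rpartition("/")
    name, sep, tag = last.partition(":")
    if not sep or not tag:
        tag = "latest"  # assumption: untagged names resolve to the default tag
    return f"{head}{slash}{name}:{tag}{suffix}"


print(cx_name("qwen3:4b"))
print(cx_name("llama3.1"))
```

Because the suffix goes on the tag, the original qwen3:4b entry is left untouched and both versions can be listed side by side.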

Scan

Run compressx. It scans your local Ollama library (or --source lmstudio) and auto-detects your GPU/RAM to find models that could be smaller.

Compress

CompressX re-quantizes the GGUF file already in your Ollama library: about 30 seconds, zero download. If the model isn't in your library yet, it falls back to fetching the original weights automatically. Use --from-source to quantize from the original weights instead for pristine quality (slower).

Deploy

Auto-registers in Ollama (default), LM Studio, or leaves a raw GGUF file for llama.cpp, Jan, GPT4All, and friends. Pick with --target. Originals are never touched.

Why CompressX?

Originals stay intact

We never modify your existing models. Compressed versions live alongside them with a clear -cx suffix.

Fully local

Uses your own GPU. No upload, no cloud processing, no data leaving your machine. Privacy by design.

Hardware-aware

Auto-detects your VRAM and picks a quantization level that fits. No guessing, no out-of-memory (OOM) errors.

Open source

MIT licensed on GitHub. Free forever. No account required. No credits. No rate limits.
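As a rough illustration of the hardware-aware selection described above, the sketch below walks a list of candidate quantizations from highest quality to lowest and returns the first one that fits in VRAM. The pick_quant helper and the 1.5 GB headroom are assumptions, not CompressX's actual heuristic. The sizes come from the qwen3:14b row of the demo.

```python
from typing import Optional, Sequence, Tuple


def pick_quant(candidates: Sequence[Tuple[str, float]], vram_gb: float,
               headroom_gb: float = 1.5) -> Optional[str]:
    """Return the first candidate that fits in VRAM, or None.

    candidates: (label, size in GB) pairs, ordered from highest quality
    to lowest. headroom_gb is a hypothetical reserve for the KV cache and
    runtime overhead.
    """
    budget = vram_gb - headroom_gb
    for label, size_gb in candidates:
        if size_gb <= budget:
            return label
    return None


# qwen3:14b from the demo: 8.4 GB as installed, 6.2 GB at Q3_K_M, 8 GB VRAM.
print(pick_quant([("current", 8.4), ("Q3_K_M", 6.2)], vram_gb=8.0))
```

Returning None when nothing fits keeps that case explicit, so a caller can skip the model instead of producing a quant that still overflows memory.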

Commands

$ compressx

Scan Ollama library and interactively compress models

$ compressx --all

Show every installed model, even ones that already fit your hardware

$ compressx --preview

Library-wide preview: what compression would save for every installed model (read-only)

$ compressx preview qwen3:14b

See every quant level side-by-side for a specific model

$ compressx compress qwen3:4b

Compress a specific model to the auto-recommended quant level

$ compressx compress qwen3:4b -q q4_k_m

Compress with a specific quantization type

$ compressx compress qwen3:4b --from-source

Download original weights from HuggingFace for pristine quality (slower)

$ compressx scan --source lmstudio

Scan your LM Studio models directory for GGUFs to compress

$ compressx compress Qwen/Qwen3-4B --source lmstudio -q q3_k_m

Re-quantize a model from LM Studio (~30 sec, no download)

$ compressx compress qwen3:4b --target lmstudio

Deploy to LM Studio instead of Ollama

$ compressx compress qwen3:4b --target gguf

Just produce a GGUF file (for llama.cpp, Jan, GPT4All, Msty, etc.)

$ compressx compress qwen3:4b --benchmark

Compress and immediately run a side-by-side benchmark

$ compressx benchmark qwen3:4b

Full benchmark: speed, perplexity, 10-prompt battery, verdict (2-3 min)

$ compressx benchmark qwen3:4b --fast

Benchmark without perplexity — speed + prompts only (~30 sec)

$ compressx hardware

Show detected GPU, VRAM, RAM, and recommended model sizes

$ compressx models

List all supported models

$ compressx update

Update CompressX to the latest version

$ compressx uninstall

Remove CompressX data directory (CLI removal is one more step)
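For scripting, the compress variants listed above can be assembled programmatically. This is a minimal sketch with a hypothetical compress_argv helper. It only uses flags shown in the command list, and it assumes those flags can be combined freely, which the list does not confirm.

```python
from typing import List, Optional


def compress_argv(model: str, quant: Optional[str] = None,
                  source: Optional[str] = None, target: Optional[str] = None,
                  from_source: bool = False,
                  benchmark: bool = False) -> List[str]:
    """Build an argument list for `compressx compress` without running it."""
    argv = ["compressx", "compress", model]
    if source:
        argv += ["--source", source]   # e.g. lmstudio
    if quant:
        argv += ["-q", quant]          # e.g. q4_k_m
    if target:
        argv += ["--target", target]   # e.g. lmstudio or gguf
    if from_source:
        argv.append("--from-source")
    if benchmark:
        argv.append("--benchmark")
    return argv


print(" ".join(compress_argv("Qwen/Qwen3-4B", quant="q3_k_m", source="lmstudio")))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where compressx is installed. Building a list rather than a shell string avoids quoting problems with model names.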

Updates & Uninstall

CompressX checks for updates automatically once per day. You can also manage it manually.

▲ Update

Get the latest version with new models, bug fixes, and features.