LexSubLM-Lite: Lightweight Lexical Substitution That Runs Anywhere

Lexical substitution is one of those NLP tasks that seems simple until you try to do it well. You want a tool that can replace a word in a sentence with a substitute that makes sense, stays grammatically correct, and doesn’t rely on a 60GB language model. 😅

Meet LexSubLM-Lite — a compact, context-aware Python toolkit for one-word substitution that actually runs on your laptop. Built for researchers, tinkerers, and developers, it’s fast, extensible, and doesn’t assume you have a GPU cluster sitting idle.


✨ What does it do?

You give it a sentence and a target word. It gives you back top-k replacements that keep the original meaning and stay syntactically legal. For example:

lexsub run \
  --sentence "The bright student aced the exam." \
  --target bright \
  --top_k 5 \
  --model llama3-mini

Might return:

[
  "brilliant",
  "smart",
  "gifted",
  "clever",
  "talented"
]

Cool, right?


⚙️ How it works

It’s a modular pipeline, with each part doing one job:

  1. Prompted generation: A quantized LLM generates candidate words.
  2. Sanitisation: Removes multi-word noise and junk.
  3. POS + morphology filtering: Ensures correct tense, number, etc.
  4. Ranking: Uses cosine similarity with e5-small-v2 or next-token log-probs (see the sketch just after this list).
  5. Evaluation: Built-in scoring against SWORDS, ProLex, and TSAR-2022 datasets.
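
To make the ranking step concrete, here's a minimal sketch of embedding-based ranking with sentence-transformers and the e5-small-v2 checkpoint. The function name, the "query:" prefixing, and the naive in-place substitution are illustrative assumptions, not the toolkit's actual internals:

# Sketch of cosine-similarity ranking (step 4). Assumes the
# sentence-transformers package and the intfloat/e5-small-v2 checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-small-v2")

def rank_by_similarity(sentence: str, target: str, candidates: list[str]) -> list[str]:
    """Rank candidates by cosine similarity between the original sentence
    and the sentence with the target swapped for each candidate."""
    original = model.encode(f"query: {sentence}")  # E5 models expect a prefix
    variants = [f"query: {sentence.replace(target, c, 1)}" for c in candidates]
    scores = util.cos_sim(original, model.encode(variants))[0]
    return [c for _, c in sorted(zip(scores.tolist(), candidates), reverse=True)]

print(rank_by_similarity("The bright student aced the exam.", "bright",
                         ["brilliant", "smart", "loud"]))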

All that with:

  • 4-bit GGUF model support (blazingly fast on macOS)
  • No need to edit code to add new models (model_registry.yaml FTW)
  • Docker setup and dataset download scripts for reproducibility

🧪 Evaluate like a researcher

Run one command and get metrics like P@1, Recall@5, GAP, and ProF1:

lexsub eval \
  --dataset prolex \
  --split dev \
  --model distilgpt2

Whether you’re experimenting with LLMs or benchmarking your own model, LexSubLM-Lite helps you keep things measurable and reproducible.
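
If you want to sanity-check the numbers, the simpler metrics are easy to reproduce by hand. Here's a minimal sketch of P@1 and Recall@5 (the gold set and predictions below are made up for illustration, not real dataset entries):

# Sketch of two of the reported metrics; gold/preds are invented examples.
def precision_at_1(preds: list[str], gold: set[str]) -> float:
    """1.0 if the top-ranked prediction is an accepted substitute, else 0.0."""
    return float(bool(preds) and preds[0] in gold)

def recall_at_k(preds: list[str], gold: set[str], k: int = 5) -> float:
    """Fraction of gold substitutes recovered in the top-k predictions."""
    return len(set(preds[:k]) & gold) / len(gold) if gold else 0.0

gold = {"brilliant", "smart", "clever"}
preds = ["brilliant", "gifted", "smart", "talented", "clever"]
print(precision_at_1(preds, gold))   # 1.0
print(recall_at_k(preds, gold))      # 1.0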


🛠️ Built to hack

The toolkit is super easy to extend:

  • Drop a new alias in model_registry.yaml
  • Tweak the filters or add your own
  • Swap in your own generator or ranking logic

Everything is cleanly organized — no surprises, just Python done right.
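
As an example of the kind of piece you'd swap in, here's what a custom ranker could look like. The (sentence, target, candidates) signature is a hypothetical shape chosen for illustration, not the toolkit's actual plug-in interface:

# Hypothetical custom ranker: prefer shorter candidates, break ties alphabetically.
def brevity_ranker(sentence: str, target: str, candidates: list[str]) -> list[str]:
    return sorted(candidates, key=lambda c: (len(c), c))

print(brevity_ranker("The bright student aced the exam.", "bright",
                     ["brilliant", "smart", "gifted", "clever"]))
# ['smart', 'clever', 'gifted', 'brilliant']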


🚀 Quick start

git clone https://github.com/shamspias/lexsublm-lite
cd lexsublm-lite
pip install -e .

If you’ve got CUDA, install bitsandbytes to unlock true 4-bit quantization:

pip install bitsandbytes

Then start substituting.


📈 Benchmarks (Sample)

Model         RAM (GB)   P@1    R@5    Jaccard
tinyllama     0.8        0.20   0.04   0.04
distilgpt2    1.1        0.10   0.05   0.08
llama3-mini   1.2        0.00   0.16   0.13

(M2 Pro, no GPU)


🧠 Nerdy Details

  • Uses spaCy + pymorphy3 for morphological matching (a simplified sketch appears below)
  • Ranks with sentence-transformers (SBERT) or native LLM logprobs
  • Data download via bash script — one command and done
  • Metrics via tabulate2, pydantic, orjson, etc.

All dependencies are cleanly listed in pyproject.toml.
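
As a taste of the morphology side, here's a simplified sketch of POS-level filtering with spaCy. Loading en_core_web_sm and the keep_pos_matches helper are assumptions for the example; the toolkit's actual filter also handles tense and number agreement, with pymorphy3 in the mix:

# Simplified POS filter with spaCy (assumes en_core_web_sm is installed).
import spacy

nlp = spacy.load("en_core_web_sm")

def keep_pos_matches(sentence: str, target: str, candidates: list[str]) -> list[str]:
    """Keep candidates whose POS tag in context matches the target's."""
    doc = nlp(sentence)
    target_pos = next(t.pos_ for t in doc if t.text == target)
    kept = []
    for cand in candidates:
        cand_doc = nlp(sentence.replace(target, cand, 1))
        if any(t.text == cand and t.pos_ == target_pos for t in cand_doc):
            kept.append(cand)
    return kept

print(keep_pos_matches("The bright student aced the exam.", "bright",
                       ["brilliant", "quickly", "clever"]))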


🗺️ Roadmap

Coming soon:

  • LoRA fine-tuning
  • Gradio UI playground
  • Full multilingual eval on TSAR-2022 ES/PT

📚 Citation

If you use it in your work, cite it!

@software{lexsublm_lite_2025,
  author  = {Shamsuddin Ahmed},
  title   = {LexSubLM-Lite: Lightweight Contextual Lexical Substitution Toolkit},
  year    = {2025},
  url     = {https://github.com/shamspias/lexsublm-lite},
  license = {MIT}
}

🧵 TL;DR

LexSubLM-Lite gives you fast, controllable synonym generation with research-quality metrics. If you’re into language models but tired of setting up Hugging Face Transformers on an EC2 instance just to get synonyms — this one’s for you.

GitHub: github.com/shamspias/lexsublm-lite