Data & Training

NovaEdit is trained to emit a patch DSL from code regions and their diagnostics. This repo includes scaffolding that you can extend with real data and checkpoints.

Data layout

bash scripts/prepare_python_data.sh
# mines small diffs and synthetic samples into data/python/...

scripts/mine_git_diffs.py — collect recent Python diffs from a repo.
scripts/generate_synthetic_bugs.py — inject varied bugs (missing imports, typos, off-by-one errors, swapped comparators, missing returns) and save them as JSONL. Pass --validate to skip samples that fail AST parsing.
scripts/build_edit_dataset.py — merge multiple JSONL sources into one.
scripts/split_dataset.py --input data/python/processed/edits.jsonl --train-out ... — split a merged dataset into train/val/test.
scripts/train_tokenizer.py --input-glob 'data/python/raw/**/*.py' --output model/tokenizer.json — train a BPE tokenizer.
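
The AST validation and train/val/test split described above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the code in the scripts themselves: the record key `before`, the function names, and the default split fractions are all hypothetical, and the real scripts may validate a different field or split differently.

```python
import ast
import random


def parses_as_python(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def keep_valid(samples, field="before"):
    """Drop samples whose `field` is missing or does not parse.

    `field` is a hypothetical key; adjust it to the real JSONL schema.
    """
    return [s for s in samples if field in s and parses_as_python(s[field])]


def split_samples(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle with a fixed seed, then cut into train/val/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Seeding a private `random.Random` instance keeps the split reproducible across runs without touching the global random state, which is useful when comparing checkpoints trained on the same data.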

trainer/pretrain.py — tiny character LM to smoke-test pipelines.
trainer/sft_edit.py — SFT scaffold that uses the heuristic model's outputs as pseudo-labels.
Config examples live in model/config/*.yaml (small/base).

Push weights/config with scripts/push_to_hub.py --repo <org/model> --path weights/novaedit-small.
For Spaces, run the FastAPI app (novaedit.server.main:app) via the provided Dockerfile.

data/python/processed/sample_edits.jsonl — tiny, ready-to-use edit dataset for smoke tests.
model/tokenizer-sample.json — toy tokenizer generated from examples/ for pipeline checks (train your own for real runs).
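
Before wiring the sample dataset into a pipeline, it can help to confirm that the file loads and to see which fields it carries. The sketch below makes no assumption about the schema beyond "one JSON object per line"; the helper names `load_jsonl` and `field_coverage` are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, failing loudly on bad lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            records.append(record)
    return records


def field_coverage(records):
    """Count how many records contain each top-level key."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

A possible check is `field_coverage(load_jsonl("data/python/processed/sample_edits.jsonl"))`: a key that appears in fewer records than the total suggests an inconsistent row worth inspecting before training.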