Data & Training
NovaEdit is trained to emit a patch DSL from code regions and their diagnostics. This repo includes scaffolding that you can extend with real data and checkpoints.
Data layout
- `data/python/raw`: raw corpora
- `data/python/diffs`: mined git diffs
- `data/python/synthetic_bugfixes`: generated bugfix samples
- `data/javascript/*`: future languages
Prep scripts
```bash
bash scripts/prepare_python_data.sh
# mines small diffs and synthetic samples into data/python/...
```

- `scripts/mine_git_diffs.py`: collect recent Python diffs from a repo.
- `scripts/generate_synthetic_bugs.py`: inject varied bugs (missing imports, typos, off-by-one, comparators, missing returns) and save JSONL.
- `scripts/build_edit_dataset.py`: merge multiple JSONL sources into one.
- `scripts/train_tokenizer.py --input-glob 'data/python/raw/**/*.py' --output model/tokenizer.json`: train a BPE tokenizer.
- Use `--validate` on `generate_synthetic_bugs.py` to skip samples that fail AST parsing.
- `scripts/split_dataset.py --input data/python/processed/edits.jsonl --train-out ...`: split a merged dataset into train/val/test.
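
The generate, validate, and split steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `scripts/`: the function names, the single comparator-flip bug, the `buggy`/`fixed` record keys, and the split fractions are all hypothetical.

```python
import ast
import random
from typing import Dict, List, Optional, Sequence


def inject_comparator_bug(source: str) -> Optional[str]:
    """Flip the first ` < ` to ` <= ` (one hypothetical bug type)."""
    if " < " not in source:
        return None  # nothing to mutate in this snippet
    return source.replace(" < ", " <= ", 1)


def is_valid_python(source: str) -> bool:
    """Return True if the source parses; this is the AST validation idea."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def make_sample(clean: str) -> Optional[Dict[str, str]]:
    """Build one bugfix record, or None if no parseable bug could be injected."""
    buggy = inject_comparator_bug(clean)
    if buggy is None or not is_valid_python(buggy):
        return None
    # Key names are assumptions; the repo's JSONL schema may differ.
    return {"buggy": buggy, "fixed": clean}


def split_records(records: Sequence, val_frac: float = 0.1,
                  test_frac: float = 0.1, seed: int = 0) -> Dict[str, List]:
    """Shuffle deterministically, then cut into train/val/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }
```

Seeding the shuffle keeps the split reproducible across runs, which matters when comparing checkpoints on the same held-out set.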
Training stubs
- `trainer/pretrain.py`: tiny character LM to smoke-test pipelines.
- `trainer/sft_edit.py`: SFT scaffold that uses the heuristic model's outputs as pseudo-labels.
- Config examples live in `model/config/*.yaml` (small/base).
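
To illustrate what a character-level smoke test checks, here is a minimal count-based bigram LM with add-one smoothing. It is a stand-in written for this README, not the model in `trainer/pretrain.py` (which may well be neural); the class name `CharBigramLM` and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict


class CharBigramLM:
    """Character bigram model with add-one (Laplace) smoothing."""

    def __init__(self, text: str):
        self.vocab = sorted(set(text))
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            self.counts[prev][nxt] += 1

    def prob(self, prev: str, nxt: str) -> float:
        # Assumes `nxt` is in the training vocabulary; unseen characters
        # would need an explicit unknown token to keep this normalized.
        row = self.counts.get(prev, Counter())
        return (row[nxt] + 1) / (sum(row.values()) + len(self.vocab))

    def avg_nll(self, text: str) -> float:
        """Average negative log-likelihood per character transition."""
        pairs = list(zip(text, text[1:]))
        if not pairs:
            raise ValueError("need at least two characters to score")
        return -sum(math.log(self.prob(p, n)) for p, n in pairs) / len(pairs)
```

A pipeline smoke test then only needs to assert that the loss is finite and that text resembling the training data scores better than text that does not.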
Evaluation
- `eval/run_eval_bugfix.py --data <jsonl>`: measures diagnostic count reduction.
- `eval/run_eval_regression.py`: prints patches for a small regression suite.
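
A diagnostic count reduction metric might look like the sketch below. It assumes each evaluation item is a pair of source strings (before and after the patch) and uses a syntax check as a stand-in diagnostic source; the real script may use a linter or type checker, and the names `count_syntax_diagnostics` and `diagnostic_reduction` are hypothetical.

```python
import ast
from typing import Callable, Dict, Iterable, Tuple


def count_syntax_diagnostics(source: str) -> int:
    """Stand-in diagnostic counter: 1 if the source fails to parse, else 0."""
    try:
        ast.parse(source)
    except SyntaxError:
        return 1
    return 0


def diagnostic_reduction(
    pairs: Iterable[Tuple[str, str]],
    count: Callable[[str], int] = count_syntax_diagnostics,
) -> Dict[str, float]:
    """Sum diagnostics before and after patching and report the reduction."""
    before = after = 0
    for original, patched in pairs:
        before += count(original)
        after += count(patched)
    reduction = before - after
    # Guard against division by zero when the inputs had no diagnostics.
    rate = reduction / before if before else 0.0
    return {"before": before, "after": after,
            "reduction": reduction, "rate": rate}
```

Passing the counter as a parameter makes it easy to swap in a richer diagnostic source without changing the aggregation.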
Hugging Face
- Push weights/config with `scripts/push_to_hub.py --repo <org/model> --path weights/novaedit-small`.
- For Spaces, run the FastAPI app (`novaedit.server.main:app`) via the provided Dockerfile.
Samples
- `data/python/processed/sample_edits.jsonl`: tiny, ready-to-use edit dataset for smoke tests.
- `model/tokenizer-sample.json`: toy tokenizer generated from `examples/` for pipeline checks (train your own for real runs).
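
Before wiring the sample dataset into a pipeline, a quick sanity check can confirm that it loads and show which fields it carries. The sketch below assumes only that each non-empty line is a JSON object; it makes no assumption about field names, and `read_jsonl` and `summarize` are hypothetical helpers rather than part of the repo.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterator, Tuple


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def summarize(path: str) -> Tuple[int, Dict[str, int]]:
    """Return the record count and how often each top-level key appears."""
    key_counts: Counter = Counter()
    total = 0
    for record in read_jsonl(path):
        total += 1
        key_counts.update(record.keys())
    return total, dict(key_counts)
```

For example, `summarize("data/python/processed/sample_edits.jsonl")` would report the number of records and their keys, which is enough to spot a truncated or malformed file.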