feat(scraper): add checkpointing and richer page extraction

Add resumable checkpoint support so long scrapes can recover from
interruptions instead of restarting from scratch.

- introduce autosave/load/clear checkpoint flow in `.cache/scrape-state.json`, including SIGINT/SIGTERM save-on-exit handling
- expand parsing/model output to capture legacy and portable infobox fields, primary image URLs, effects, recipes, and raw tables, and improve category extraction
- skip infobox tables during recipe parsing to avoid false recipe matches
- add cache log event type, ignore cache/output artifacts, and document new autosave tuning options in README
2026-03-15 17:08:24 +02:00
parent 42e2083ece
commit 6bf221de3f
9 changed files with 607 additions and 29 deletions


@@ -36,6 +36,7 @@ var (
 	"status": {emoji: "🌀", label: "STATUS", color: colorYellow},
 	"done": {emoji: "✅", label: "DONE", color: colorGreen},
 	"write": {emoji: "💾", label: "WRITE", color: colorBlue},
+	"cache": {emoji: "🗂️", label: "CACHE", color: colorCyan},
 	"skip": {emoji: "⏭️", label: "SKIP", color: colorGray},
 	"warn": {emoji: "⚠️", label: "WARN", color: colorYellow},
 	"error": {emoji: "💥", label: "ERROR", color: colorRed},