feat(scraper): add checkpointing and richer page extraction

Add resumable checkpoint support so long scrapes can recover from interruptions instead of restarting from scratch.

- introduce autosave/load/clear checkpoint flow in `.cache/scrape-state.json`, including SIGINT/SIGTERM save-on-exit handling
- expand parsing/model output to capture legacy and portable infobox fields, primary image URLs, effects, recipes, raw tables, and improved category extraction
- skip infobox tables during recipe parsing to avoid false recipe matches
- add cache log event type, ignore cache/output artifacts, and document new autosave tuning options in README
@@ -29,6 +29,8 @@ go run ./cmd/scrappr
 - Limits crawl depth and queue size to avoid drifting into junk pages
 - Retries temporary failures with short backoff
 - Prints colored emoji logs for queueing, requests, responses, parsing, retries, and periodic status
+- Stores legacy and portable infobox fields, primary item image URLs, recipes, effects, and raw content tables for later processing
+- Saves resumable checkpoints into `.cache/scrape-state.json` on a timer, during progress milestones, and on `Ctrl+C`
 - Writes a stable, sorted JSON dataset to `outward_data.json`
 
 ## Tuning
@@ -37,4 +39,4 @@ Scraper defaults live in `internal/scraper/config.go`.
 
 - Lower or raise `RequestDelay` / `RequestJitter`
 - Tighten or relax `MaxQueuedPages`
-- Adjust `RequestTimeout`, `MaxRetries`, and `ProgressEvery`
+- Adjust `RequestTimeout`, `MaxRetries`, `ProgressEvery`, `AutosaveEvery`, and `AutosavePages`