feat(scraper): add checkpointing and richer page extraction

Add resumable checkpoint support so long scrapes can recover from interruptions instead of restarting from scratch.

- introduce autosave/load/clear checkpoint flow in `.cache/scrape-state.json`, including SIGINT/SIGTERM save-on-exit handling
- expand parsing/model output to capture legacy and portable infobox fields, primary image URLs, effects, recipes, raw tables, and improved category extraction
- skip infobox tables during recipe parsing to avoid false recipe matches
- add cache log event type, ignore cache/output artifacts, and document new autosave tuning options in README
Scrappr
Small Go scraper for the Outward Fandom wiki.
Layout
.
├── cmd/scrappr/main.go # binary entrypoint
├── internal/app # bootstrapping and output writing
├── internal/logx # colored emoji logger
├── internal/model # dataset models
├── internal/scraper # crawl flow, parsing, queueing, retries
├── go.mod
├── go.sum
└── outward_data.json # generated output
Run
go run ./cmd/scrappr
What It Does
- Crawls item and crafting pages from outward.fandom.com
- Uses browser-like headers and rotating user agents
- Limits crawl depth and queue size to avoid drifting into junk pages
- Retries temporary failures with short backoff
- Prints colored emoji logs for queueing, requests, responses, parsing, retries, and periodic status
- Stores legacy and portable infobox fields, primary item image URLs, recipes, effects, and raw content tables for later processing
- Saves resumable checkpoints into .cache/scrape-state.json on a timer, during progress milestones, and on Ctrl+C
- Writes a stable, sorted JSON dataset to outward_data.json
Tuning
Scraper defaults live in internal/scraper/config.go.
- Lower or raise RequestDelay / RequestJitter
- Tighten or relax MaxQueuedPages
- Adjust RequestTimeout, MaxRetries, ProgressEvery, AutosaveEvery, and AutosavePages