Daniel Legt 6bf221de3f feat(scraper): add checkpointing and richer page extraction
Add resumable checkpoint support so long scrapes can recover from
interruptions instead of restarting from scratch.

- introduce autosave/load/clear checkpoint flow in `.cache/scrape-state.json`, including SIGINT/SIGTERM save-on-exit handling
- expand parsing/model output to capture legacy and portable infobox fields, primary image URLs, effects, recipes, raw tables, and improved category extraction
- skip infobox tables during recipe parsing to avoid false recipe matches
- add cache log event type, ignore cache/output artifacts, and document new autosave tuning options in README
2026-03-15 17:08:24 +02:00
2026-03-15 16:42:43 +02:00

Scrappr

Small Go scraper for the Outward Fandom wiki.

Layout

.
├── cmd/scrappr/main.go       # binary entrypoint
├── internal/app              # bootstrapping and output writing
├── internal/logx             # colored emoji logger
├── internal/model            # dataset models
├── internal/scraper          # crawl flow, parsing, queueing, retries
├── go.mod
├── go.sum
└── outward_data.json         # generated output

Run

go run ./cmd/scrappr

What It Does

  • Crawls item and crafting pages from outward.fandom.com
  • Uses browser-like headers and rotating user agents
  • Limits crawl depth and queue size to avoid drifting into junk pages
  • Retries temporary failures with short backoff
  • Prints colored emoji logs for queueing, requests, responses, parsing, retries, and periodic status
  • Stores legacy and portable infobox fields, primary item image URLs, recipes, effects, and raw content tables for later processing
  • Saves resumable checkpoints into .cache/scrape-state.json on a timer, during progress milestones, and on Ctrl+C
  • Writes a stable, sorted JSON dataset to outward_data.json

Tuning

Scraper defaults live in internal/scraper/config.go.

  • Lower or raise RequestDelay / RequestJitter
  • Tighten or relax MaxQueuedPages
  • Adjust RequestTimeout, MaxRetries, ProgressEvery, AutosaveEvery, and AutosavePages