# Scrappr

Small Go scraper for the Outward Fandom wiki.

## Layout

```text
.
├── cmd/outward-web/main.go # web UI entrypoint
├── cmd/scrappr/main.go # binary entrypoint
├── internal/app # bootstrapping and output writing
├── internal/logx # colored emoji logger
├── internal/model # dataset models
├── internal/scraper # crawl flow, parsing, queueing, retries
├── internal/webui # embedded web server + static UI
├── go.mod
├── go.sum
└── outward_data.json # generated output
```

## Run

Scrape the wiki and write `outward_data.json`:

```bash
go run ./cmd/scrappr
```

Then serve the craft-planner web UI:

```bash
go run ./cmd/outward-web
```
## What It Does
- Crawls item and crafting pages from `outward.fandom.com`
- Uses browser-like headers and rotating user agents
- Limits crawl depth and queue size to avoid drifting into junk pages
- Retries temporary failures with short backoff
- Prints colored emoji logs for queueing, requests, responses, parsing, retries, and periodic status
- Stores legacy and portable infobox fields, primary item image URLs, recipes, effects, and raw content tables for later processing
- Saves resumable checkpoints into `.cache/scrape-state.json` on a timer, during progress milestones, and on `Ctrl+C`
- Writes a stable, sorted JSON dataset to `outward_data.json`
- Serves a local craft-planner UI backed by recipes from `outward_data.json`

## Tuning

Scraper defaults live in `internal/scraper/config.go`.

- Lower or raise `RequestDelay` / `RequestJitter`
- Tighten or relax `MaxQueuedPages`
- Adjust `RequestTimeout`, `MaxRetries`, `ProgressEvery`, `AutosaveEvery`, and `AutosavePages`