# Scrappr

Small Go scraper for the Outward Fandom wiki.

## Layout

```text
.
├── cmd/outward-web/main.go # web UI entrypoint
├── cmd/scrappr/main.go # binary entrypoint
├── internal/app # bootstrapping and output writing
├── internal/logx # colored emoji logger
├── internal/model # dataset models
├── internal/scraper # crawl flow, parsing, queueing, retries
├── internal/webui # embedded web server + static UI
├── go.mod
├── go.sum
└── outward_data.json # generated output
```

## Run

Scrape the wiki and write `outward_data.json`:

```bash
go run ./cmd/scrappr
```

Then serve the craft-planner web UI:

```bash
go run ./cmd/outward-web
```
## What It Does
- Crawls item and crafting pages from `outward.fandom.com`
- Uses browser-like headers and rotating user agents
- Limits crawl depth and queue size to avoid drifting into junk pages
- Retries temporary failures with short backoff
- Prints colored emoji logs for queueing, requests, responses, parsing, retries, and periodic status
- Stores legacy and portable infobox fields, primary item image URLs, recipes, effects, and raw content tables for later processing
- Saves resumable checkpoints into `.cache/scrape-state.json` on a timer, during progress milestones, and on `Ctrl+C`
- Writes a stable, sorted JSON dataset to `outward_data.json`
- Serves a local craft-planner UI backed by recipes from `outward_data.json`

## Tuning

Scraper defaults live in `internal/scraper/config.go`.

- Lower or raise `RequestDelay` / `RequestJitter`
- Tighten or relax `MaxQueuedPages`
- Adjust `RequestTimeout`, `MaxRetries`, `ProgressEvery`, `AutosaveEvery`, and `AutosavePages`