Collecting Links
Input Sources
The Collector extracts links from four input types:
import io.mvnpm.raclette.collector.Collector;
import io.mvnpm.raclette.collector.Input;
import io.mvnpm.raclette.types.Uri;
Set<Uri> links = Collector.builder()
.build()
.collectLinks(Set.of(
new Input.FsPath(Path.of("target/site")), // local file or directory
new Input.FsGlob("docs/**/*.html"), // glob pattern
new Input.RemoteUrl("https://example.com"), // remote URL
new Input.StringContent("<a href='...'>link</a>") // raw HTML string
));
Input.FsPath
Checks a single file or walks a directory recursively. Hidden files and directories are skipped by default.
Input.FsGlob
Matches files using glob patterns like docs/**/*.html.
Input.RemoteUrl
Fetches the page over HTTP and extracts links from the response body. The URL itself is used as the base for resolving relative links.
Input.StringContent
Parses a raw HTML string. Useful for testing or checking dynamically generated content.
Base URL Resolution
Relative links need a base URL to resolve against. Set it on the builder:
Collector.builder()
.base("https://example.com/docs/")
.build()
.collectLinks(Set.of(new Input.StringContent(html)));
With base https://example.com/docs/:
../index.htmlresolves tohttps://example.com/index.html/aboutresolves tohttps://example.com/aboutpage.htmlresolves tohttps://example.com/docs/page.html
For RemoteUrl inputs, the fetched URL is automatically used as the base.
Collector Options
| Option | Default | Description |
|---|---|---|
base |
none | Base URL for resolving relative links |
includeVerbatim |
false | Extract links from verbatim blocks (<pre>, <code>) and process non-HTML files |
skipHidden |
true | Skip hidden files and directories |
userAgent |
raclette/0.1 |
User-Agent for fetching remote URLs |
What Gets Extracted
The collector uses JSoup to extract links from HTML elements:
<a href="..."><img src="..."><link href="..."><script src="..."><source src="..."><img srcset="...">(all URLs in srcset)<form action="...">- Bare email addresses in text content