synkit
A toolkit for building round-trip parsers with logos.
What is synkit?
synkit generates syn-like parsing infrastructure from token definitions. Define your tokens once, and get:
- Token enum with logos lexer integration
- Token structs (EqToken, IdentToken, etc.) with typed values
- TokenStream with whitespace skipping, fork/rewind, and span tracking
- Parse/Peek/ToTokens traits for building parsers
- Printer for round-trip code formatting
When to use synkit
| Use Case | synkit | Alternative |
|---|---|---|
| Custom DSL with formatting | ✅ | - |
| Config file parser | ✅ | serde + format-specific crate |
| Code transformation tool | ✅ | - |
| Rust source parsing | ❌ | syn |
| Simple regex matching | ❌ | logos alone |
synkit is ideal when you need:
- Round-trip fidelity: Parse → modify → print without losing formatting
- Span tracking: Precise error locations and source mapping
- Type-safe AST: Strongly-typed nodes with Spanned<T> wrappers
Architecture
flowchart LR
Source["Source<br/>String"] --> TokenStream["TokenStream<br/>(lexer)"]
TokenStream --> AST["AST<br/>(parser)"]
TokenStream --> Span["Span<br/>tracking"]
AST --> Printer["Printer<br/>(output)"]
Quick Example
use synkit::parser_kit;
parser_kit! {
error: MyError,
skip_tokens: [Space],
tokens: {
#[token(" ")]
Space,
#[token("=")]
Eq,
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
Ident(String),
},
delimiters: {},
span_derives: [Debug, Clone, PartialEq],
token_derives: [Debug, Clone, PartialEq],
}
This generates:
- Token enum with Eq and Ident(String) variants
- EqToken, IdentToken structs
- TokenStream with lex(), parse(), peek()
- Tok![=], Tok![ident] macros
- Parse, Peek, ToTokens, Diagnostic traits
Getting Started
Installation
Add synkit and logos to your Cargo.toml:
[dependencies]
synkit = "0.1"
logos = "0.15"
thiserror = "2" # recommended for error types
Optional Features
# For async streaming with tokio
synkit = { version = "0.1", features = ["tokio"] }
# For async streaming with futures (runtime-agnostic)
synkit = { version = "0.1", features = ["futures"] }
# For std::error::Error implementations
synkit = { version = "0.1", features = ["std"] }
Minimal Example
A complete parser in ~30 lines:
use thiserror::Error;
#[derive(Error, Debug, Clone, Default, PartialEq)]
pub enum LexError {
#[default]
#[error("unknown token")]
Unknown,
#[error("expected {expect}, found {found}")]
Expected { expect: &'static str, found: String },
#[error("expected {expect}")]
Empty { expect: &'static str },
}
synkit::parser_kit! {
error: LexError,
skip_tokens: [Space],
tokens: {
#[token(" ")]
Space,
#[token("=")]
Eq,
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
#[fmt("identifier")]
Ident(String),
#[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
#[fmt("number")]
Number(i64),
},
delimiters: {},
span_derives: [Debug, Clone, PartialEq],
token_derives: [Debug, Clone, PartialEq],
}
Using the Generated Code
After parser_kit!, you have access to:
use crate::{
// Span types
Span, Spanned,
// Token enum and structs
tokens::{Token, EqToken, IdentToken, NumberToken},
// Parsing infrastructure
stream::TokenStream,
// Traits
Parse, Peek, ToTokens, Diagnostic,
};
// Lex source into tokens
let mut stream = TokenStream::lex("x = 42")?;
// Parse tokens
let name: Spanned<IdentToken> = stream.parse()?;
let eq: Spanned<EqToken> = stream.parse()?;
let value: Spanned<NumberToken> = stream.parse()?;
assert_eq!(*name.value, "x");
assert_eq!(value.value.0, 42);
Generated Modules
parser_kit! generates these modules in your crate:
| Module | Contents |
|---|---|
| span | Span, RawSpan, Spanned<T> |
| tokens | Token enum, *Token structs, Tok!/SpannedTok! macros |
| stream | TokenStream, MutTokenStream |
| printer | Printer implementation |
| delimiters | Delimiter structs (e.g., Bracket, Brace) |
| traits | Parse, Peek, ToTokens, Diagnostic |
Error Type Requirements
Your error type must:
- Implement Default (for unknown token errors from logos)
- Have variants for parse errors (recommended pattern):
#[derive(Error, Debug, Clone, Default, PartialEq)]
pub enum MyError {
#[default]
#[error("unknown")]
Unknown,
#[error("expected {expect}, found {found}")]
Expected { expect: &'static str, found: String },
#[error("expected {expect}")]
Empty { expect: &'static str },
}
Next Steps
- Concepts - Understand tokens, parsing, spans
- Tutorial - Build a complete TOML parser
- Reference - Full macro documentation
Concepts Overview
This section covers the core concepts in synkit:
- Tokens - Token enum, token structs, and the Tok! macro
- Parsing - Parse and Peek traits, stream operations
- Spans & Errors - Source locations, Spanned<T>, error handling
- Printing - ToTokens trait and round-trip formatting
Core Flow
Source → Lexer → TokenStream → Parse → AST → ToTokens → Output
- Lexer (logos): Converts source string to token sequence
- TokenStream: Wraps tokens with span tracking and skip logic
- Parse: Trait for converting tokens to AST nodes
- AST: Your domain-specific tree structure
- ToTokens: Trait for converting AST back to formatted output
Tokens
synkit generates two representations for each token: an enum variant and a struct.
Token Enum
The Token enum contains all token variants, used by the lexer:
#[derive(Logos, Debug, Clone, PartialEq)]
pub enum Token {
#[token("=")]
Eq,
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
Ident(String),
#[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
Number(i64),
}
Token Structs
For each variant, synkit generates a corresponding struct:
// Unit token (no value)
pub struct EqToken;
impl EqToken {
pub fn new() -> Self { Self }
pub fn token(&self) -> Token { Token::Eq }
}
// Token with value
pub struct IdentToken(pub String);
impl IdentToken {
pub fn new(value: String) -> Self { Self(value) }
pub fn token(&self) -> Token { Token::Ident(self.0.clone()) }
}
impl std::ops::Deref for IdentToken {
type Target = String;
fn deref(&self) -> &Self::Target { &self.0 }
}
Token Attributes
#[token(...)] and #[regex(...)]
Standard logos attributes for matching:
#[token("=")] // Exact match
#[regex(r"[a-z]+")] // Regex pattern
#[regex(r"[0-9]+", |lex| lex.slice().parse().ok())] // With callback
#[fmt(...)]
Custom display name for error messages:
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
#[fmt("identifier")] // Error: "expected identifier, found ..."
Ident(String),
Without #[fmt], uses the variant name in snake_case.
#[derive(...)] on tokens
Additional derives for a specific token struct:
#[regex(r"[A-Za-z_]+", |lex| lex.slice().to_string())]
#[derive(Hash, Eq)] // Only for IdentToken
Ident(String),
priority
Logos priority for overlapping patterns:
#[token("true", priority = 2)] // Higher priority than bare keys
True,
#[regex(r"[A-Za-z]+", priority = 1)]
BareKey(String),
The Tok! Macro
Access token types by their pattern:
// Punctuation - use the literal
Tok![=] // → EqToken
Tok![.] // → DotToken
Tok![,] // → CommaToken
// Keywords - use the keyword
Tok![true] // → TrueToken
Tok![false] // → FalseToken
// Regex tokens - use snake_case name
Tok![ident] // → IdentToken
Tok![number] // → NumberToken
SpannedTok!
Shorthand for Spanned<Tok![...]>:
SpannedTok![=] // → Spanned<EqToken>
SpannedTok![ident] // → Spanned<IdentToken>
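These macros also work in type position, so AST fields can name tokens by their pattern; a small sketch, assuming the macros expand to the plain type names listed above (the Assignment struct is illustrative):
// Sketch: Tok!/SpannedTok! as field types (token names from the examples above)
struct Assignment {
    name: SpannedTok![ident],   // same as Spanned<IdentToken>
    eq: SpannedTok![=],         // same as Spanned<EqToken>
    value: SpannedTok![number], // same as Spanned<NumberToken>
}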
Auto-generated Trait Implementations
Each token struct automatically implements:
| Trait | Purpose |
|---|---|
| Parse | Parse from TokenStream |
| Peek | Check if token matches without consuming |
| Diagnostic | Format name for error messages |
| Display | Human-readable output |
Parsing
Parsing converts a token stream into an AST using the Parse and Peek traits.
The Parse Trait
pub trait Parse: Sized {
fn parse(stream: &mut TokenStream) -> Result<Self, Error>;
}
Token structs implement Parse automatically. For AST nodes, implement manually:
impl Parse for KeyValue {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
Ok(Self {
key: stream.parse()?,
eq: stream.parse()?,
value: stream.parse()?,
})
}
}
The Peek Trait
Check the next token without consuming:
pub trait Peek {
fn is(token: &Token) -> bool;
fn peek(stream: &TokenStream) -> bool;
}
Use in conditionals and loops:
// Check before parsing
if SimpleKey::peek(stream) {
let key: Spanned<SimpleKey> = stream.parse()?;
}
// Parse while condition holds
while Value::peek(stream) {
items.push(stream.parse()?);
}
TokenStream Operations
Basic Operations
// Create from source
let mut stream = TokenStream::lex("x = 42")?;
// Parse with type inference
let token: Spanned<IdentToken> = stream.parse()?;
// Peek at next token
if stream.peek::<EqToken>() {
// ...
}
// Get next raw token (including skipped)
let raw = stream.next_raw();
Fork and Rewind
Speculatively parse without committing:
let mut fork = stream.fork();
if let Ok(result) = try_parse(&mut fork) {
stream.advance_to(&fork); // Commit
return Ok(result);
}
// Didn't advance - stream unchanged
Whitespace Handling
skip_tokens in parser_kit! defines tokens to skip:
skip_tokens: [Space, Tab],
- stream.next() - Skips whitespace
- stream.next_raw() - Includes whitespace
- stream.peek_token() - Skips whitespace
- stream.peek_token_raw() - Includes whitespace
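A small sketch of the difference, using the Minimal Example's tokens (the exact raw token seen depends on the input):
// Sketch: skipping vs. raw access (tokens from the Minimal Example)
let mut stream = TokenStream::lex("x = 42")?;
let x: Spanned<IdentToken> = stream.parse()?; // consumes `x`
assert!(stream.peek::<EqToken>());            // peek skips the space before `=`
let raw = stream.peek_token_raw();            // raw peek can still see the Space token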
Parsing Patterns
Sequential Fields
impl Parse for Assignment {
fn parse(stream: &mut TokenStream) -> Result<Self, Error> {
Ok(Self {
name: stream.parse()?, // Spanned<IdentToken>
eq: stream.parse()?, // Spanned<EqToken>
value: stream.parse()?, // Spanned<Value>
})
}
}
Enum Variants
Use peek to determine variant:
impl Parse for Value {
fn parse(stream: &mut TokenStream) -> Result<Self, Error> {
if stream.peek::<IntegerToken>() {
Ok(Value::Integer(stream.parse()?))
} else if stream.peek::<StringToken>() {
Ok(Value::String(stream.parse()?))
} else {
Err(Error::expected("value"))
}
}
}
Optional Fields
// Option<T> auto-implements Parse via Peek
let comma: Option<Spanned<CommaToken>> = stream.parse()?;
Repeated Items
// Manual loop
let mut items = Vec::new();
while Value::peek(stream) {
items.push(stream.parse()?);
}
// Using synkit::Repeated
use synkit::Repeated;
let items: Repeated<Value, CommaToken, Spanned<Value>> =
Repeated::parse(stream)?;
Delimited Content
Extract content between delimiters:
// Using the bracket! macro
let mut inner;
let bracket = bracket!(inner in stream);
// inner is a new TokenStream with bracket contents
let items = parse_items(&mut inner)?;
Spans & Errors
Spans track source locations for error reporting and source mapping.
Span Types
RawSpan
Byte offsets into source:
pub struct RawSpan {
pub start: usize,
pub end: usize,
}
Span
Handles both known and synthetic locations:
pub enum Span {
CallSite, // No source location (generated code)
Known(RawSpan), // Actual source position
}
Spanned<T>
Wraps a value with its source span:
pub struct Spanned<T> {
pub value: T,
pub span: Span,
}
Always use Spanned<T> for AST node fields:
pub struct KeyValue {
pub key: Spanned<Key>, // ✓
pub eq: Spanned<EqToken>, // ✓
pub value: Spanned<Value>, // ✓
}
This enables:
- Precise error locations
- Source mapping for transformations
- Hover information in editors
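On the first point, a Known span can be mapped straight back to the offending source text when reporting errors; a minimal sketch (show_error is illustrative):
// Sketch: using a Known span to quote the offending source text
fn show_error(source: &str, span: &Span, msg: &str) {
    match span {
        Span::Known(raw) => {
            let snippet = &source[raw.start..raw.end];
            eprintln!("error: {msg} at bytes {}..{}: `{snippet}`", raw.start, raw.end);
        }
        Span::CallSite => eprintln!("error: {msg} (generated code)"),
    }
}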
Error Handling
Error Type Pattern
#[derive(Error, Debug, Clone, Default, PartialEq)]
pub enum MyError {
#[default]
#[error("unknown token")]
Unknown,
#[error("expected {expect}, found {found}")]
Expected { expect: &'static str, found: String },
#[error("expected {expect}")]
Empty { expect: &'static str },
#[error("{source}")]
Spanned {
#[source]
source: Box<MyError>,
span: Span,
},
}
SpannedError Trait
Attach spans to errors:
impl synkit::SpannedError for MyError {
type Span = Span;
fn with_span(self, span: Span) -> Self {
Self::Spanned {
source: Box::new(self),
span,
}
}
fn span(&self) -> Option<&Span> {
match self {
Self::Spanned { span, .. } => Some(span),
_ => None,
}
}
}
Diagnostic Trait
Provide display names for error messages:
pub trait Diagnostic {
fn fmt() -> &'static str;
}
// Auto-implemented for tokens using #[fmt(...)] or snake_case name
impl Diagnostic for IdentToken {
fn fmt() -> &'static str { "identifier" }
}
Error Helpers
impl MyError {
pub fn expected<D: Diagnostic>(found: &Token) -> Self {
Self::Expected {
expect: D::fmt(),
found: format!("{}", found),
}
}
pub fn empty<D: Diagnostic>() -> Self {
Self::Empty { expect: D::fmt() }
}
}
Error Propagation
Parse implementations automatically wrap errors with spans:
impl Parse for KeyValue {
fn parse(stream: &mut TokenStream) -> Result<Self, MyError> {
Ok(Self {
// If parse fails, error includes span of failed token
key: stream.parse()?,
eq: stream.parse()?,
value: stream.parse()?,
})
}
}
Accessing Spans
let kv: Spanned<KeyValue> = stream.parse()?;
// Get span of entire key-value
let full_span = &kv.span;
// Get span of just the key
let key_span = &kv.value.key.span;
// Get span of the value
let value_span = &kv.value.value.span;
Printing
The ToTokens trait enables round-trip formatting: parse source, modify AST, print back.
The ToTokens Trait
pub trait ToTokens {
fn write(&self, printer: &mut Printer);
}
Implement for each AST node:
impl ToTokens for KeyValue {
fn write(&self, p: &mut Printer) {
self.key.value.write(p);
p.space();
self.eq.value.write(p);
p.space();
self.value.value.write(p);
}
}
Printer Methods
Basic Output
p.word("text"); // Append literal text
p.token(&tok); // Append token's string form
p.space(); // Single space
p.newline(); // Line break
Indentation
p.open_block(); // Increase indent, add newline
p.close_block(); // Decrease indent, add newline
p.indent(); // Just increase indent level
p.dedent(); // Just decrease indent level
Separators
// Write items with separator
p.write_separated(&items, ", ");
// Write with custom logic
for (i, item) in items.iter().enumerate() {
if i > 0 { p.word(", "); }
item.write(p);
}
Converting to String
// Using the trait method
let output = kv.to_string_formatted();
// Manual printer usage
let mut printer = Printer::new();
kv.write(&mut printer);
let output = printer.finish();
Round-trip Example
// Parse
let mut stream = TokenStream::lex("key = 42")?;
let kv: Spanned<KeyValue> = stream.parse()?;
// Modify
let mut modified = kv.value.clone();
modified.value = Spanned {
value: Value::Integer(IntegerToken::new(100)),
span: Span::CallSite,
};
// Print
let output = modified.to_string_formatted();
assert_eq!(output, "key = 100");
Implementation Patterns
Token Structs
Token structs write their token's string form. parser_kit! generates these impls automatically; tokens whose text is transformed during lexing (marked #[no_to_tokens]) need a manual impl to restore it:
impl ToTokens for EqToken {
fn write(&self, p: &mut Printer) {
p.token(&self.token());
}
}
impl ToTokens for BasicStringToken {
fn write(&self, p: &mut Printer) {
// Re-add quotes stripped during lexing
p.word("\"");
p.word(&self.0);
p.word("\"");
}
}
Enum Variants
impl ToTokens for Value {
fn write(&self, p: &mut Printer) {
match self {
Value::String(s) => s.write(p),
Value::Integer(n) => n.write(p),
Value::True(t) => t.write(p),
Value::False(f) => f.write(p),
Value::Array(a) => a.write(p),
Value::InlineTable(t) => t.write(p),
}
}
}
Collections
impl ToTokens for Array {
fn write(&self, p: &mut Printer) {
self.lbracket.value.write(p);
for (i, item) in self.items.iter().enumerate() {
if i > 0 { p.word(", "); }
item.value.value.write(p);
}
self.rbracket.value.write(p);
}
}
Preserving Trivia
For exact round-trip, preserve comments and whitespace:
impl ToTokens for Trivia {
fn write(&self, p: &mut Printer) {
match self {
Trivia::Newline(_) => p.newline(),
Trivia::Comment(c) => {
p.token(&c.value.token());
}
}
}
}
Async Streaming
synkit supports incremental, asynchronous parsing for scenarios where data arrives in chunks:
- Network streams (HTTP, WebSocket, TCP)
- Large file processing
- Real-time log parsing
- Interactive editors
Architecture
┌─────────────┐ chunks ┌──────────────────┐
│ Source │ ──────────────► │ IncrementalLexer │
│ (network, │ │ (tokenizer) │
│ file, etc) │ └────────┬─────────┘
└─────────────┘ │
tokens│
▼
┌────────────────┐
│ IncrementalParse│
│ (parser) │
└────────┬───────┘
│
AST │
nodes ▼
┌────────────────┐
│ Consumer │
└────────────────┘
Key Traits
IncrementalLexer
Lex source text incrementally as chunks arrive:
pub trait IncrementalLexer: Sized {
type Token: Clone;
type Span: Clone;
type Spanned: Clone;
type Error: Display;
fn new() -> Self;
fn feed(&mut self, chunk: &str) -> Result<Vec<Self::Spanned>, Self::Error>;
fn finish(self) -> Result<Vec<Self::Spanned>, Self::Error>;
fn offset(&self) -> usize;
}
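A typical driving loop, sketched with a hypothetical Lexer type implementing this trait:
// Sketch: feeding chunks into an IncrementalLexer (Lexer is a hypothetical impl)
let mut lexer = Lexer::new();
let mut tokens = Vec::new();
for chunk in ["key = ", "42\n"] {
    tokens.extend(lexer.feed(chunk)?);   // complete tokens lexed so far
}
tokens.extend(lexer.finish()?);          // flush anything held at the final boundary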
IncrementalParse
Parse AST nodes incrementally from token buffers:
pub trait IncrementalParse: Sized {
type Token: Clone;
type Error: Display;
fn parse_incremental<S>(
tokens: &[S],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Self::Error>
where
S: AsRef<Self::Token>;
fn can_parse<S>(tokens: &[S], checkpoint: &ParseCheckpoint) -> bool
where
S: AsRef<Self::Token>;
}
ParseCheckpoint
Track parser state across incremental calls:
pub struct ParseCheckpoint {
pub cursor: usize, // Position in token buffer
pub tokens_consumed: usize, // Total tokens processed
pub state: u64, // Parser-specific state
}
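A sketch of how a caller threads the checkpoint between calls (Node, tokens, and process are hypothetical):
// Sketch: pulling AST nodes out of a growing token buffer
let mut checkpoint = ParseCheckpoint::default();
loop {
    let (node, next) = Node::parse_incremental(&tokens, &checkpoint)?;
    checkpoint = next;
    match node {
        Some(node) => process(node),   // a complete node was parsed
        None => break,                 // need more tokens before continuing
    }
}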
Feature Flags
Enable async streaming with feature flags:
# Tokio-based (channels, spawn)
synkit = { version = "0.1", features = ["tokio"] }
# Futures-based (runtime-agnostic Stream trait)
synkit = { version = "0.1", features = ["futures"] }
Tokio Integration
With the tokio feature, use channel-based streaming:
use synkit::async_stream::tokio_impl::{AsyncTokenStream, AstStream};
use tokio::sync::mpsc;
async fn parse_stream<L, T>(mut source_rx: mpsc::Receiver<String>)
where
L: IncrementalLexer,
T: IncrementalParse<Token = L::Token>,
{
let (token_tx, token_rx) = mpsc::channel(32);
let (ast_tx, mut ast_rx) = mpsc::channel(16);
// Lexer task
tokio::spawn(async move {
let mut lexer = AsyncTokenStream::<L>::new(token_tx);
while let Some(chunk) = source_rx.recv().await {
lexer.feed(&chunk).await?;
}
lexer.finish().await?;
});
// Parser task
tokio::spawn(async move {
let mut parser = AstStream::<T, L::Token>::new(token_rx, ast_tx);
parser.run().await?;
});
// Consume AST nodes
while let Some(node) = ast_rx.recv().await {
process(node);
}
}
Futures Integration
With the futures feature, use the Stream trait:
use synkit::async_stream::futures_impl::ParseStream;
use futures::{Stream, StreamExt};
async fn parse_tokens<S, T>(tokens: S)
where
S: Stream<Item = Token>,
T: IncrementalParse<Token = Token>,
{
let mut parse_stream: ParseStream<_, T, _> = ParseStream::new(tokens);
while let Some(result) = parse_stream.next().await {
match result {
Ok(node) => process(node),
Err(e) => handle_error(e),
}
}
}
Error Handling
The StreamError enum covers streaming-specific failures:
pub enum StreamError {
ChannelClosed, // Channel unexpectedly closed
LexError(String), // Lexer error
ParseError(String), // Parser error
IncompleteInput, // EOF with incomplete input
}
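Consumers usually just surface these to the caller; a sketch:
// Sketch: reporting streaming failures
match err {
    StreamError::ChannelClosed => eprintln!("producer task ended unexpectedly"),
    StreamError::LexError(msg) => eprintln!("lex error: {msg}"),
    StreamError::ParseError(msg) => eprintln!("parse error: {msg}"),
    StreamError::IncompleteInput => eprintln!("stream ended in the middle of a node"),
}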
Configuration
Customize buffer sizes and limits:
let config = StreamConfig {
token_buffer_size: 1024, // Token buffer capacity
ast_buffer_size: 64, // AST node buffer capacity
max_chunk_size: 64 * 1024, // Max input chunk size
};
let stream = AsyncTokenStream::with_config(tx, config);
Best Practices
- Return None when incomplete: If parse_incremental can't complete a node, return Ok((None, checkpoint)) rather than an error.
- Implement can_parse: This optimization prevents unnecessary parse attempts when tokens are clearly insufficient.
- Use checkpoints for backtracking: Store parser state in checkpoint.state for complex grammars.
- Handle IncompleteInput: At stream end, incomplete input may be valid (e.g., a truncated file) or an error, depending on your grammar.
- Buffer management: AstStream automatically compacts its buffer. For custom implementations, drain consumed tokens periodically (see the sketch below).
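A sketch of such periodic draining for a hand-rolled buffer (the threshold and variable names are illustrative):
// Sketch: compact a hand-rolled token buffer once enough tokens are consumed
const COMPACT_THRESHOLD: usize = 4096;
if checkpoint.cursor >= COMPACT_THRESHOLD {
    tokens.drain(..checkpoint.cursor); // drop tokens the parser has already consumed
    checkpoint.cursor = 0;             // positions are now relative to the new front
}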
Tutorial: TOML Parser
Build a complete TOML parser with round-trip printing using synkit.
Source Code
📦 Complete source: examples/toml-parser
What You’ll Build
A parser for a TOML subset supporting:
# Comment
key = "value"
number = 42
flag = true
[section]
nested = "data"
[section.subsection]
array = [1, 2, 3]
inline = { a = 1, b = 2 }
Source Code
The complete example lives in examples/toml-parser/. Each chapter references the actual code.
Chapters
- Project Setup - Dependencies, error type, parser_kit! invocation
- Defining Tokens - Token patterns and attributes
- AST Design - Node types with Spanned<T>
- Parse Implementations - Converting tokens to AST
- Round-trip Printing - ToTokens for output
- Visitors - Traversing the AST
- Testing - Parse and round-trip tests
Project Setup
Create the Project
cargo new toml-parser --lib
cd toml-parser
Dependencies
[package]
name = "toml-parser"
version = "0.1.0"
edition = "2024"
[dependencies]
synkit = "0.1"
thiserror = "2"
logos = "0.15"
Error Type
Define an error type that implements Default (required by logos):
#[derive(Error, Debug, Clone, Default, PartialEq)]
pub enum TomlError {
#[default]
#[error("unknown lexing error")]
Unknown,
#[error("expected {expect}, found {found}")]
Expected { expect: &'static str, found: String },
#[error("expected {expect}, found EOF")]
Empty { expect: &'static str },
#[error("unclosed string")]
UnclosedString,
#[error("{source}")]
Spanned {
#[source]
source: Box<TomlError>,
span: Span,
},
}
Key requirements:
- #[default] variant for unknown tokens
- Expected variant with expect and found fields
- Empty variant for EOF errors
- Spanned variant wrapping errors with location
parser_kit! Invocation
The macro generates all parsing infrastructure:
synkit::parser_kit! {
error: TomlError,
skip_tokens: [Space, Tab],
tokens: {
// Whitespace
#[token(" ", priority = 0)]
Space,
#[token("\t", priority = 0)]
Tab,
#[regex(r"\r?\n")]
#[fmt("newline")]
#[no_to_tokens]
Newline,
// Comments
#[regex(r"#[^\n]*", allow_greedy = true)]
#[fmt("comment")]
Comment,
// Punctuation
#[token("=")]
Eq,
#[token(".")]
Dot,
#[token(",")]
Comma,
#[token("[")]
LBracket,
#[token("]")]
RBracket,
#[token("{")]
LBrace,
#[token("}")]
RBrace,
// Keywords/literals
#[token("true")]
True,
#[token("false")]
False,
// Bare keys: alphanumeric, underscores, dashes
#[regex(r"[A-Za-z0-9_-]+", |lex| lex.slice().to_string(), priority = 1)]
#[fmt("bare key")]
#[derive(PartialOrd, Ord, Hash, Eq)]
BareKey(String),
// Basic strings (double-quoted) - needs custom ToTokens for quote handling
#[regex(r#""([^"\\]|\\.)*""#, |lex| {
let s = lex.slice();
// Remove surrounding quotes
s[1..s.len()-1].to_string()
})]
#[fmt("string")]
#[no_to_tokens]
BasicString(String),
// Integers
#[regex(r"-?[0-9]+", |lex| lex.slice().parse::<i64>().ok())]
#[fmt("integer")]
Integer(i64),
},
delimiters: {
Bracket => (LBracket, RBracket),
Brace => (LBrace, RBrace),
},
span_derives: [Debug, Clone, PartialEq, Eq, Hash, Copy],
token_derives: [Clone, PartialEq, Debug],
}
This generates:
- span module with Span, Spanned<T>
- tokens module with the Token enum and *Token structs
- stream module with TokenStream
- traits module with Parse, Peek, ToTokens
- delimiters module with Bracket, Brace
Error Helpers
Add convenience methods for error creation:
impl TomlError {
pub fn expected<D: Diagnostic>(found: &Token) -> Self {
Self::Expected {
expect: D::fmt(),
found: format!("{}", found),
}
}
pub fn empty<D: Diagnostic>() -> Self {
Self::Empty { expect: D::fmt() }
}
}
impl synkit::SpannedError for TomlError {
type Span = Span;
fn with_span(self, span: Span) -> Self {
Self::Spanned {
source: Box::new(self),
span,
}
}
fn span(&self) -> Option<&Span> {
match self {
Self::Spanned { span, .. } => Some(span),
_ => None,
}
}
}
Module Structure
// lib.rs
mod ast;
mod parse;
mod print;
mod visitor;
pub use ast::*;
pub use parse::*;
pub use visitor::*;
Verify Setup
cargo check
The macro should expand without errors. If you see errors about missing traits, ensure your error type has the required variants.
Defining Tokens
The tokens: block in parser_kit! defines your grammar’s lexical elements.
Token Categories
Whitespace Tokens
Skipped during parsing but tracked for round-trip:
// Skipped automatically
#[token(" ", priority = 0)]
Space,
#[token("\t", priority = 0)]
Tab,
// Not skipped - we track these for formatting
#[regex(r"\r?\n")]
#[fmt("newline")]
Newline,
Use skip_tokens: [Space, Tab] to mark tokens to skip.
Punctuation
Simple exact-match tokens:
#[token("=")]
Eq,
#[token(".")]
Dot,
#[token(",")]
Comma,
#[token("[")]
LBracket,
#[token("]")]
RBracket,
#[token("{")]
LBrace,
#[token("}")]
RBrace,
Keywords
Keywords need higher priority than identifiers:
#[token("true")]
True,
#[token("false")]
False,
Value Tokens
Tokens with captured data use callbacks:
// Bare keys: alphanumeric, underscores, dashes
#[regex(r"[A-Za-z0-9_-]+", |lex| lex.slice().to_string(), priority = 1)]
#[fmt("bare key")]
#[derive(PartialOrd, Ord, Hash, Eq)]
BareKey(String),
// Basic strings (double-quoted)
#[regex(r#""([^"\\]|\\.)*""#, |lex| {
let s = lex.slice();
s[1..s.len()-1].to_string() // Strip quotes
})]
#[fmt("string")]
BasicString(String),
// Integers
#[regex(r"-?[0-9]+", |lex| lex.slice().parse::<i64>().ok())]
#[fmt("integer")]
Integer(i64),
Comments
Track but don’t interpret:
#[regex(r"#[^\n]*")]
#[fmt("comment")]
Comment,
Generated Types
For each token, synkit generates:
| Token | Struct | Macro |
|---|---|---|
| Eq | EqToken | Tok![=] |
| Dot | DotToken | Tok![.] |
| BareKey(String) | BareKeyToken(String) | Tok![bare_key] |
| BasicString(String) | BasicStringToken(String) | Tok![basic_string] |
| Integer(i64) | IntegerToken(i64) | Tok![integer] |
Delimiters
Define delimiter pairs for extraction:
delimiters: {
Bracket => (LBracket, RBracket),
Brace => (LBrace, RBrace),
},
Generates Bracket and Brace structs with span information, plus bracket! and brace! macros.
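The generated macros mirror the bracket! example from the Concepts chapter; a sketch with brace! (the KeyValue parse is illustrative):
// Sketch: extracting inline-table contents with the generated brace! macro
let mut inner;
let brace = brace!(inner in stream);        // Brace carries the `{` / `}` spans
let kv: Spanned<KeyValue> = inner.parse()?; // parse from the extracted contents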
Priority Handling
When patterns overlap, use priority:
#[token("true", priority = 2)] // Higher wins
True,
#[regex(r"[A-Za-z]+", priority = 1)]
BareKey(String),
Input "true" matches True, not BareKey("true").
Derives
Control derives at different levels:
// For all tokens
token_derives: [Clone, PartialEq, Debug],
// For specific token
#[derive(Hash, Eq)] // Additional derives for BareKeyToken only
BareKey(String),
// For span types
span_derives: [Debug, Clone, PartialEq, Eq, Hash],
AST Design
Design AST nodes that preserve all information for round-trip formatting.
Design Principles
- Use Spanned<T> for all children - Enables error locations and source mapping
- Include punctuation tokens - Needed for exact round-trip
- Track trivia - Comments and newlines for formatting
Document Structure
/// The root of a TOML document.
/// Contains a sequence of items (key-value pairs or tables).
#[derive(Debug, Clone)]
pub struct Document {
pub items: Vec<DocumentItem>,
}
/// A single item in the document: either a top-level key-value or a table section.
#[derive(Debug, Clone)]
pub enum DocumentItem {
/// A blank line or comment
Trivia(Trivia),
/// A key = value pair at the top level
KeyValue(Spanned<KeyValue>),
/// A [table] section
Table(Spanned<Table>),
}
- Document is the root containing all items
- DocumentItem distinguishes top-level elements
- Trivia captures non-semantic content
Keys
/// A TOML key, which can be bare, quoted, or dotted.
#[derive(Debug, Clone)]
pub enum Key {
/// Bare key: `foo`
Bare(tokens::BareKeyToken),
/// Quoted key: `"foo.bar"`
Quoted(tokens::BasicStringToken),
/// Dotted key: `foo.bar.baz`
Dotted(DottedKey),
}
/// A dotted key like `server.host.name`
#[derive(Debug, Clone)]
pub struct DottedKey {
pub first: Spanned<SimpleKey>,
pub rest: Vec<(Spanned<tokens::DotToken>, Spanned<SimpleKey>)>,
}
/// A simple (non-dotted) key
#[derive(Debug, Clone)]
pub enum SimpleKey {
Bare(tokens::BareKeyToken),
Quoted(tokens::BasicStringToken),
}
- Key enum handles all key forms
- DottedKey preserves dot tokens for round-trip
- SimpleKey is the base case (bare or quoted)
Values
/// A TOML value.
#[derive(Debug, Clone)]
pub enum Value {
/// String value
String(tokens::BasicStringToken),
/// Integer value
Integer(tokens::IntegerToken),
/// Boolean true
True(tokens::TrueToken),
/// Boolean false
False(tokens::FalseToken),
/// Array value
Array(Array),
/// Inline table value
InlineTable(InlineTable),
}
Each variant stores its token type directly, preserving the original representation.
Key-Value Pairs
/// A key-value pair: `key = value`
#[derive(Debug, Clone)]
pub struct KeyValue {
pub key: Spanned<Key>,
pub eq: Spanned<tokens::EqToken>,
pub value: Spanned<Value>,
}
Note how eq stores the equals token—this enables formatting choices like key=value vs key = value.
Tables
/// A table section: `[section]` or `[section.subsection]`
#[derive(Debug, Clone)]
pub struct Table {
pub lbracket: Spanned<tokens::LBracketToken>,
pub name: Spanned<Key>,
pub rbracket: Spanned<tokens::RBracketToken>,
pub items: Vec<TableItem>,
}
/// An item within a table section.
#[derive(Debug, Clone)]
pub enum TableItem {
Trivia(Trivia),
KeyValue(Box<Spanned<KeyValue>>),
}
- Brackets stored explicitly for round-trip
- Items include trivia for blank lines/comments within table
Arrays
/// An array: `[1, 2, 3]`
#[derive(Debug, Clone)]
pub struct Array {
pub lbracket: Spanned<tokens::LBracketToken>,
pub items: Vec<ArrayItem>,
pub rbracket: Spanned<tokens::RBracketToken>,
}
/// An item in an array, including trailing trivia.
#[derive(Debug, Clone)]
pub struct ArrayItem {
pub value: Spanned<Value>,
pub comma: Option<Spanned<tokens::CommaToken>>,
}
ArrayItem includes optional trailing comma—essential for preserving:
[1, 2, 3] # No trailing comma
[1, 2, 3,] # With trailing comma
Inline Tables
/// An inline table: `{ key = value, ... }`
#[derive(Debug, Clone)]
pub struct InlineTable {
pub lbrace: Spanned<tokens::LBraceToken>,
pub items: Vec<InlineTableItem>,
pub rbrace: Spanned<tokens::RBraceToken>,
}
/// An item in an inline table.
#[derive(Debug, Clone)]
pub struct InlineTableItem {
pub kv: Spanned<KeyValue>,
pub comma: Option<Spanned<tokens::CommaToken>>,
}
Similar structure to arrays, with key-value pairs instead of values.
Why This Design?
Span Preservation
Every Spanned<T> carries source location:
let kv: Spanned<KeyValue> = stream.parse()?;
let key_span = &kv.value.key.span; // Location of key
let eq_span = &kv.value.eq.span; // Location of '='
let val_span = &kv.value.value.span; // Location of value
Round-trip Fidelity
Storing tokens enables exact reconstruction:
// Original: key = "value"
// After parse → print:
// key = "value" (identical)
Trivia Handling
Without trivia tracking:
# Comment lost!
key = value
With trivia in AST:
# Comment preserved
key = value
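The Trivia type referenced throughout this chapter is not shown above; a minimal sketch, consistent with how it is matched in the printing chapter (the variant payloads are assumed to be Spanned tokens):
/// Sketch: non-semantic content kept in the AST for round-tripping.
#[derive(Debug, Clone)]
pub enum Trivia {
    /// A line break
    Newline(Spanned<tokens::NewlineToken>),
    /// A `# ...` comment
    Comment(Spanned<tokens::CommentToken>),
}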
Parse Implementations
Convert token streams into AST nodes.
Basic Pattern
impl Parse for MyNode {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
Ok(Self {
field1: stream.parse()?,
field2: stream.parse()?,
})
}
}
Implementing Peek
For types used in conditionals:
impl Peek for SimpleKey {
fn is(token: &Token) -> bool {
matches!(token, Token::BareKey(_) | Token::BasicString(_))
}
}
impl Parse for SimpleKey {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
match stream.peek_token().map(|t| &t.value) {
Some(Token::BareKey(_)) => {
let tok: Spanned<tokens::BareKeyToken> = stream.parse()?;
Ok(SimpleKey::Bare(tok.value))
}
Some(Token::BasicString(_)) => {
let tok: Spanned<tokens::BasicStringToken> = stream.parse()?;
Ok(SimpleKey::Quoted(tok.value))
}
Some(other) => Err(TomlError::Expected {
expect: "key",
found: format!("{}", other),
}),
None => Err(TomlError::Empty { expect: "key" }),
}
}
}
Peek::is() checks a token variant; Peek::peek() checks the stream’s next token.
Parsing Keys
impl Peek for Key {
fn is(token: &Token) -> bool {
SimpleKey::is(token)
}
}
impl Parse for Key {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
let first: Spanned<SimpleKey> = stream.parse()?;
// Check if this is a dotted key
if stream.peek::<tokens::DotToken>() {
let mut rest = Vec::new();
while stream.peek::<tokens::DotToken>() {
let dot: Spanned<tokens::DotToken> = stream.parse()?;
let key: Spanned<SimpleKey> = stream.parse()?;
rest.push((dot, key));
}
Ok(Key::Dotted(DottedKey { first, rest }))
} else {
// Single key
match first.value {
SimpleKey::Bare(tok) => Ok(Key::Bare(tok)),
SimpleKey::Quoted(tok) => Ok(Key::Quoted(tok)),
}
}
}
}
Parsing Values
Match on peeked token to determine variant:
impl Peek for Value {
fn is(token: &Token) -> bool {
matches!(
token,
Token::BasicString(_)
| Token::Integer(_)
| Token::True
| Token::False
| Token::LBracket
| Token::LBrace
)
}
}
impl Parse for Value {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
match stream.peek_token().map(|t| &t.value) {
Some(Token::BasicString(_)) => {
let tok: Spanned<tokens::BasicStringToken> = stream.parse()?;
Ok(Value::String(tok.value))
}
Some(Token::Integer(_)) => {
let tok: Spanned<tokens::IntegerToken> = stream.parse()?;
Ok(Value::Integer(tok.value))
}
Some(Token::True) => {
let tok: Spanned<tokens::TrueToken> = stream.parse()?;
Ok(Value::True(tok.value))
}
Some(Token::False) => {
let tok: Spanned<tokens::FalseToken> = stream.parse()?;
Ok(Value::False(tok.value))
}
Some(Token::LBracket) => {
let arr = Array::parse(stream)?;
Ok(Value::Array(arr))
}
Some(Token::LBrace) => {
let tbl = InlineTable::parse(stream)?;
Ok(Value::InlineTable(tbl))
}
Some(other) => Err(TomlError::Expected {
expect: "value",
found: format!("{}", other),
}),
None => Err(TomlError::Empty { expect: "value" }),
}
}
}
Arrays with Delimiters
This implementation parses the [ and ] tokens explicitly so they can be reproduced on output:
impl Peek for Array {
fn is(token: &Token) -> bool {
matches!(token, Token::LBracket)
}
}
impl Parse for Array {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
let lbracket: Spanned<tokens::LBracketToken> = stream.parse()?;
let mut items = Vec::new();
// Skip any leading newlines inside array
while peek_newline(stream) {
let _: Spanned<tokens::NewlineToken> = stream.parse()?;
}
// Parse array items
while stream.peek::<Value>() {
let value: Spanned<Value> = stream.parse()?;
// Skip newlines after value
while peek_newline(stream) {
let _: Spanned<tokens::NewlineToken> = stream.parse()?;
}
let comma = if stream.peek::<tokens::CommaToken>() {
let c: Spanned<tokens::CommaToken> = stream.parse()?;
// Skip newlines after comma
while peek_newline(stream) {
let _: Spanned<tokens::NewlineToken> = stream.parse()?;
}
Some(c)
} else {
None
};
items.push(ArrayItem { value, comma });
}
let rbracket: Spanned<tokens::RBracketToken> = stream.parse()?;
Ok(Array {
lbracket,
items,
rbracket,
})
}
}
Key points:
- The lbracket and rbracket tokens are stored on the Array so they round-trip exactly
- Newlines inside the array are consumed explicitly because only Space and Tab are skipped
- As an alternative, bracket!(inner in stream) extracts the content between [ and ], returns a Bracket struct with span information, and binds inner to a new TokenStream containing only the bracket contents
Inline Tables
Inline tables follow the same pattern, parsing the { and } tokens explicitly:
impl Peek for InlineTable {
fn is(token: &Token) -> bool {
matches!(token, Token::LBrace)
}
}
impl Parse for InlineTable {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
let lbrace: Spanned<tokens::LBraceToken> = stream.parse()?;
let mut items = Vec::new();
// Parse inline table items
while stream.peek::<Key>() {
let kv: Spanned<KeyValue> = stream.parse()?;
let comma = if stream.peek::<tokens::CommaToken>() {
Some(stream.parse()?)
} else {
None
};
items.push(InlineTableItem { kv, comma });
}
let rbrace: Spanned<tokens::RBraceToken> = stream.parse()?;
Ok(InlineTable {
lbrace,
items,
rbrace,
})
}
}
Tables and Documents
impl Peek for Table {
fn is(token: &Token) -> bool {
matches!(token, Token::LBracket)
}
}
impl Parse for Table {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
let lbracket: Spanned<tokens::LBracketToken> = stream.parse()?;
let name: Spanned<Key> = stream.parse()?;
let rbracket: Spanned<tokens::RBracketToken> = stream.parse()?;
let mut items = Vec::new();
// Consume trailing content on the header line
while Trivia::peek(stream) {
let trivia = Trivia::parse(stream)?;
items.push(TableItem::Trivia(trivia));
// Stop after we hit a newline
if matches!(items.last(), Some(TableItem::Trivia(Trivia::Newline(_)))) {
break;
}
}
// Parse table contents until we hit another table or EOF
loop {
// Check for trivia (newlines, comments)
if Trivia::peek(stream) {
let trivia = Trivia::parse(stream)?;
items.push(TableItem::Trivia(trivia));
continue;
}
// Check for key-value pair
if stream.peek::<Key>() {
// But first make sure this isn't a table header by checking for `[`
// This is tricky - we need to distinguish `[table]` from key-value
// Since Key::peek checks for bare keys and strings, and table headers
// start with `[`, we need to check `[` first in the document parser
let kv = Box::new(stream.parse()?);
items.push(TableItem::KeyValue(kv));
continue;
}
// Either EOF or another table section
break;
}
Ok(Table {
lbracket,
name,
rbracket,
items,
})
}
}
impl Peek for DocumentItem {
fn is(token: &Token) -> bool {
Trivia::is(token) || Key::is(token) || matches!(token, Token::LBracket)
}
}
impl Parse for Document {
fn parse(stream: &mut TokenStream) -> Result<Self, TomlError> {
let mut items = Vec::new();
loop {
// Check for trivia (newlines, comments)
if Trivia::peek(stream) {
let trivia = Trivia::parse(stream)?;
items.push(DocumentItem::Trivia(trivia));
continue;
}
// Check for table header `[name]`
if stream.peek::<tokens::LBracketToken>() {
let table: Spanned<Table> = stream.parse()?;
items.push(DocumentItem::Table(table));
continue;
}
// Check for key-value pair
if stream.peek::<Key>() {
let kv: Spanned<KeyValue> = stream.parse()?;
items.push(DocumentItem::KeyValue(kv));
continue;
}
// EOF or unknown token
if stream.is_empty() {
break;
}
// Unknown token - error
if let Some(tok) = stream.peek_token() {
return Err(TomlError::Expected {
expect: "key, table, or end of file",
found: format!("{}", tok.value),
});
}
break;
}
Ok(Document { items })
}
}
Error Handling
Expected Token Errors
Some(other) => Err(TomlError::Expected {
expect: "key",
found: format!("{}", other),
}),
EOF Errors
None => Err(TomlError::Empty { expect: "key" }),
Using Diagnostic
// Auto-generated for tokens
impl Diagnostic for BareKeyToken {
fn fmt() -> &'static str { "bare key" } // From #[fmt("bare key")]
}
// Use in errors
Err(TomlError::expected::<BareKeyToken>(found_token))
Parsing Tips
Use peek Before Consuming
if SimpleKey::peek(stream) {
// Safe to parse
let key: Spanned<SimpleKey> = stream.parse()?;
}
Fork for Lookahead
let mut fork = stream.fork();
if try_parse(&mut fork).is_ok() {
stream.advance_to(&fork);
}
Handle Optional Elements
// Option<T> auto-implements Parse if T implements Peek
let comma: Option<Spanned<CommaToken>> = stream.parse()?;
Raw Token Access
For tokens listed in skip_tokens (Space and Tab in this tutorial):
// Use peek_token_raw to see skipped tokens
fn peek_raw(stream: &TokenStream) -> Option<&Token> {
stream.peek_token_raw().map(|t| &t.value)
}
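The Array parser earlier calls a peek_newline helper that was not shown; a minimal sketch, assuming Newline stays out of skip_tokens as configured in this tutorial:
// Sketch: true when the next (non-skipped) token is a newline
fn peek_newline(stream: &TokenStream) -> bool {
    matches!(stream.peek_token().map(|t| &t.value), Some(Token::Newline))
}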
Round-trip Printing
Implement ToTokens to convert AST back to formatted output.
Basic Pattern
impl ToTokens for MyNode {
fn write(&self, p: &mut Printer) {
self.child1.value.write(p);
p.space();
self.child2.value.write(p);
}
}
Token Printing
// Custom ToTokens implementations for tokens that need special handling.
// Other token ToTokens are auto-generated by parser_kit!
impl ToTokens for tokens::BasicStringToken {
fn write(&self, p: &mut Printer) {
// BasicString stores content without quotes, so we add them back for round-trip
p.word("\"");
p.word(&self.0);
p.word("\"");
}
}
impl ToTokens for tokens::NewlineToken {
fn write(&self, p: &mut Printer) {
p.newline();
}
}
Note: BasicStringToken strips quotes during lexing, so we re-add them for output.
Trivia
Preserve newlines and comments:
impl ToTokens for Trivia {
fn write(&self, p: &mut Printer) {
match self {
Trivia::Newline(nl) => nl.value.write(p),
Trivia::Comment(c) => c.value.write(p),
}
}
}
Key-Value Pairs
impl ToTokens for KeyValue {
fn write(&self, p: &mut Printer) {
self.key.value.write(p);
p.space();
self.eq.value.write(p);
p.space();
self.value.value.write(p);
}
}
Spacing around = is a style choice—adjust as needed.
Arrays
Handle items with optional trailing commas:
impl ToTokens for ArrayItem {
fn write(&self, p: &mut Printer) {
self.value.value.write(p);
if let Some(comma) = &self.comma {
comma.value.write(p);
p.space();
}
}
}
impl ToTokens for Array {
fn write(&self, p: &mut Printer) {
self.lbracket.value.write(p);
for item in &self.items {
item.write(p);
}
self.rbracket.value.write(p);
}
}
Tables
impl ToTokens for TableItem {
fn write(&self, p: &mut Printer) {
match self {
TableItem::Trivia(trivia) => trivia.write(p),
TableItem::KeyValue(kv) => kv.value.write(p),
}
}
}
impl ToTokens for Table {
fn write(&self, p: &mut Printer) {
self.lbracket.value.write(p);
self.name.value.write(p);
self.rbracket.value.write(p);
for item in &self.items {
item.write(p);
}
}
}
Documents
impl ToTokens for DocumentItem {
fn write(&self, p: &mut Printer) {
match self {
DocumentItem::Trivia(trivia) => trivia.write(p),
DocumentItem::KeyValue(kv) => kv.value.write(p),
DocumentItem::Table(table) => table.value.write(p),
}
}
}
impl ToTokens for Document {
fn write(&self, p: &mut Printer) {
for item in &self.items {
item.write(p);
}
}
}
Using the Output
// Parse
let mut stream = TokenStream::lex(input)?;
let doc: Spanned<Document> = stream.parse()?;
// Print using trait method
let output = doc.value.to_string_formatted();
// Or manual printer
let mut printer = Printer::new();
doc.value.write(&mut printer);
let output = printer.finish();
Printer Methods Reference
| Method | Effect |
|---|---|
| word(s) | Append string |
| token(&tok) | Append token's display |
| space() | Single space |
| newline() | Line break |
| open_block() | Indent + newline |
| close_block() | Dedent + newline |
| indent() | Increase indent |
| dedent() | Decrease indent |
| write_separated(&items, sep) | Items with separator |
Formatting Choices
The ToTokens implementation defines your output format:
// Compact: key=value
self.key.value.write(p);
self.eq.value.write(p);
self.value.value.write(p);
// Spaced: key = value
self.key.value.write(p);
p.space();
self.eq.value.write(p);
p.space();
self.value.value.write(p);
For exact round-trip, store original spacing as trivia. For normalized output, apply consistent rules in write().
Visitors
The visitor pattern traverses AST nodes without modifying them.
Visitor Trait
/// Visitor trait for traversing TOML AST nodes.
///
/// Implement the `visit_*` methods you care about. Default implementations
/// call the corresponding `walk_*` methods to traverse children.
pub trait TomlVisitor {
fn visit_document(&mut self, doc: &Document) {
self.walk_document(doc);
}
fn visit_document_item(&mut self, item: &DocumentItem) {
self.walk_document_item(item);
}
fn visit_key_value(&mut self, kv: &KeyValue) {
self.walk_key_value(kv);
}
fn visit_key(&mut self, key: &Key) {
self.walk_key(key);
}
fn visit_simple_key(&mut self, key: &SimpleKey) {
let _ = key; // leaf node
}
fn visit_value(&mut self, value: &Value) {
self.walk_value(value);
}
fn visit_table(&mut self, table: &Table) {
self.walk_table(table);
}
fn visit_array(&mut self, array: &Array) {
self.walk_array(array);
}
fn visit_inline_table(&mut self, table: &InlineTable) {
self.walk_inline_table(table);
}
// Walk methods traverse child nodes
fn walk_document(&mut self, doc: &Document) {
for item in &doc.items {
self.visit_document_item(item);
}
}
fn walk_document_item(&mut self, item: &DocumentItem) {
match item {
DocumentItem::Trivia(_) => {}
DocumentItem::KeyValue(kv) => self.visit_key_value(&kv.value),
DocumentItem::Table(table) => self.visit_table(&table.value),
}
}
fn walk_key_value(&mut self, kv: &KeyValue) {
self.visit_key(&kv.key.value);
self.visit_value(&kv.value.value);
}
fn walk_key(&mut self, key: &Key) {
match key {
Key::Bare(tok) => self.visit_simple_key(&SimpleKey::Bare(tok.clone())),
Key::Quoted(tok) => self.visit_simple_key(&SimpleKey::Quoted(tok.clone())),
Key::Dotted(dotted) => {
self.visit_simple_key(&dotted.first.value);
for (_, k) in &dotted.rest {
self.visit_simple_key(&k.value);
}
}
}
}
fn walk_value(&mut self, value: &Value) {
match value {
Value::Array(arr) => self.visit_array(arr),
Value::InlineTable(tbl) => self.visit_inline_table(tbl),
_ => {}
}
}
fn walk_table(&mut self, table: &Table) {
self.visit_key(&table.name.value);
for item in &table.items {
match item {
TableItem::Trivia(_) => {}
TableItem::KeyValue(kv) => self.visit_key_value(&kv.value),
}
}
}
fn walk_array(&mut self, array: &Array) {
for item in &array.items {
self.visit_value(&item.value.value);
}
}
fn walk_inline_table(&mut self, table: &InlineTable) {
for item in &table.items {
self.visit_key_value(&item.kv.value);
}
}
}
Two method types:
- visit_*: Override to handle specific nodes; calls walk_* by default
- walk_*: Traverses children; typically not overridden
Example: Collecting Keys
/// Example visitor: collects all keys in the document.
pub struct KeyCollector {
pub keys: Vec<String>,
}
impl KeyCollector {
pub fn new() -> Self {
Self { keys: Vec::new() }
}
pub fn collect(doc: &Document) -> Vec<String> {
let mut collector = Self::new();
collector.visit_document(doc);
collector.keys
}
}
impl Default for KeyCollector {
fn default() -> Self {
Self::new()
}
}
impl TomlVisitor for KeyCollector {
fn visit_simple_key(&mut self, key: &SimpleKey) {
let name = match key {
SimpleKey::Bare(tok) => tok.0.clone(),
SimpleKey::Quoted(tok) => tok.0.clone(),
};
self.keys.push(name);
}
}
Usage:
let mut collector = KeyCollector::new();
collector.visit_document(&doc.value);
// collector.keys now contains all key names
Example: Counting Values
/// Example visitor: counts values by type.
#[derive(Default, Debug)]
pub struct ValueCounter {
pub strings: usize,
pub integers: usize,
pub booleans: usize,
pub arrays: usize,
pub inline_tables: usize,
}
impl ValueCounter {
pub fn new() -> Self {
Self::default()
}
pub fn count(doc: &Document) -> Self {
let mut counter = Self::new();
counter.visit_document(doc);
counter
}
}
impl TomlVisitor for ValueCounter {
fn visit_value(&mut self, value: &Value) {
match value {
Value::String(_) => self.strings += 1,
Value::Integer(_) => self.integers += 1,
Value::True(_) | Value::False(_) => self.booleans += 1,
Value::Array(arr) => {
self.arrays += 1;
self.visit_array(arr);
}
Value::InlineTable(tbl) => {
self.inline_tables += 1;
self.visit_inline_table(tbl);
}
}
}
}
Example: Finding Tables
/// Example visitor: finds all table names.
pub struct TableFinder {
pub tables: Vec<String>,
}
impl TableFinder {
pub fn new() -> Self {
Self { tables: Vec::new() }
}
pub fn find(doc: &Document) -> Vec<String> {
let mut finder = Self::new();
finder.visit_document(doc);
finder.tables
}
fn key_to_string(key: &Key) -> String {
match key {
Key::Bare(tok) => tok.0.clone(),
Key::Quoted(tok) => format!("\"{}\"", tok.0),
Key::Dotted(dotted) => {
let mut parts = vec![Self::simple_key_to_string(&dotted.first.value)];
for (_, k) in &dotted.rest {
parts.push(Self::simple_key_to_string(&k.value));
}
parts.join(".")
}
}
}
fn simple_key_to_string(key: &SimpleKey) -> String {
match key {
SimpleKey::Bare(tok) => tok.0.clone(),
SimpleKey::Quoted(tok) => format!("\"{}\"", tok.0),
}
}
}
impl Default for TableFinder {
fn default() -> Self {
Self::new()
}
}
impl TomlVisitor for TableFinder {
fn visit_table(&mut self, table: &Table) {
self.tables.push(Self::key_to_string(&table.name.value));
self.walk_table(table);
}
}
Visitor vs Direct Traversal
Visitor pattern when:
- Multiple traversal operations needed
- Want to separate traversal from logic
- Building analysis tools
Direct recursion when:
- One-off transformation
- Simple structure
- Need mutation
Transforming Visitors
For mutation, use a mutable visitor or return new nodes:
pub trait TomlTransform {
fn transform_value(&mut self, value: Value) -> Value {
self.walk_value(value)
}
fn walk_value(&mut self, value: Value) -> Value {
match value {
Value::Array(arr) => Value::Array(self.transform_array(arr)),
other => other,
}
}
// ...
}
Visitor Tips
Selective Traversal
Override visit_* to stop descent:
fn visit_inline_table(&mut self, _table: &InlineTable) {
// Don't call walk_inline_table - skip inline table contents
}
Accumulating Results
Use struct fields:
struct Stats {
tables: usize,
keys: usize,
values: usize,
}
impl TomlVisitor for Stats {
fn visit_table(&mut self, table: &Table) {
self.tables += 1;
self.walk_table(table);
}
// ...
}
Context Tracking
Track path during traversal:
struct PathTracker {
path: Vec<String>,
paths: Vec<String>,
}
impl TomlVisitor for PathTracker {
fn visit_table(&mut self, table: &Table) {
self.path.push(table_name(table));
self.paths.push(self.path.join("."));
self.walk_table(table);
self.path.pop();
}
}
Testing
Verify parsing correctness and round-trip fidelity.
Parse Tests
Test that parsing produces expected AST:
#[test]
fn test_simple_key_value() {
let mut stream = TokenStream::lex("key = \"value\"").unwrap();
let kv: Spanned<KeyValue> = stream.parse().unwrap();
match &kv.value.key.value {
Key::Bare(tok) => assert_eq!(&**tok, "key"),
_ => panic!("expected bare key"),
}
match &kv.value.value.value {
Value::String(tok) => assert_eq!(&**tok, "value"),
_ => panic!("expected string value"),
}
}
Round-trip Tests
Verify parse → print produces equivalent output:
fn roundtrip(input: &str) -> String {
let mut stream = TokenStream::lex(input).unwrap();
let doc: Spanned<Document> = stream.parse().unwrap();
doc.value.to_string_formatted()
}
#[test]
fn test_roundtrip_simple() {
let input = "key = \"value\"";
assert_eq!(roundtrip(input), input);
}
#[test]
fn test_roundtrip_table() {
let input = "[section]\nkey = 42";
assert_eq!(roundtrip(input), input);
}
Snapshot Testing with insta
For complex outputs, use snapshot testing:
use insta::assert_yaml_snapshot;
#[test]
fn snapshot_complex_document() {
let input = r#"
# Header comment
title = "Example"
[server]
host = "localhost"
port = 8080
"#.trim();
let mut stream = TokenStream::lex(input).unwrap();
let doc: Spanned<Document> = stream.parse().unwrap();
let output = doc.value.to_string_formatted();
assert_yaml_snapshot!(output);
}
Run cargo insta test to review and accept snapshots.
Error Tests
Verify error handling:
#[test]
fn test_error_missing_value() {
let mut stream = TokenStream::lex("key =").unwrap();
let result: Result<Spanned<KeyValue>, _> = stream.parse();
assert!(result.is_err());
}
#[test]
fn test_error_invalid_token() {
let result = TokenStream::lex("@invalid");
assert!(result.is_err());
}
Visitor Tests
#[test]
fn test_key_collector() {
let input = "a = 1\nb = 2\n[section]\nc = 3";
let mut stream = TokenStream::lex(input).unwrap();
let doc: Spanned<Document> = stream.parse().unwrap();
let mut collector = KeyCollector::new();
collector.visit_document(&doc.value);
assert_eq!(collector.keys, vec!["a", "b", "c"]);
}
Test Organization
tests/
├── parse_test.rs # Parse correctness
├── roundtrip_test.rs # Round-trip fidelity
└── visitor_test.rs # Visitor behavior
Testing Tips
Test Edge Cases
#[test] fn test_empty_document() { /* ... */ }
#[test] fn test_trailing_comma() { /* ... */ }
#[test] fn test_nested_tables() { /* ... */ }
#[test] fn test_unicode_strings() { /* ... */ }
Property-Based Testing
With proptest:
use proptest::prelude::*;
proptest! {
#[test]
fn roundtrip_integers(n: i64) {
let input = format!("x = {}", n);
let output = roundtrip(&input);
assert_eq!(input, output);
}
}
Debug Output
#[test]
fn debug_parse() {
let mut stream = TokenStream::lex("key = [1, 2]").unwrap();
let doc: Spanned<Document> = stream.parse().unwrap();
// AST structure
dbg!(&doc);
// Formatted output
println!("{}", doc.value.to_string_formatted());
}
Running Tests
# All tests
cargo test
# Specific test file
cargo test --test parse_test
# Update snapshots
cargo insta test --accept
Incremental Parsing
This chapter demonstrates how to add incremental parsing support to the TOML parser for streaming scenarios.
Overview
Incremental parsing allows processing TOML data as it arrives in chunks, useful for:
- Parsing large configuration files without loading entirely into memory
- Processing TOML streams from network connections
- Real-time parsing in editors
Implementing IncrementalLexer
First, wrap the logos lexer with incremental capabilities:
use synkit::async_stream::IncrementalLexer;
pub struct TomlIncrementalLexer {
buffer: String,
offset: usize,
pending_tokens: Vec<Spanned<Token>>,
}
impl IncrementalLexer for TomlIncrementalLexer {
type Token = Token;
type Span = Span;
type Spanned = Spanned<Token>;
type Error = TomlError;
fn new() -> Self {
Self {
buffer: String::new(),
offset: 0,
pending_tokens: Vec::new(),
}
}
fn feed(&mut self, chunk: &str) -> Result<Vec<Self::Spanned>, Self::Error> {
use logos::Logos;
self.buffer.push_str(chunk);
let mut tokens = Vec::new();
let mut lexer = Token::lexer(&self.buffer);
while let Some(result) = lexer.next() {
let span = lexer.span();
let token = result.map_err(|_| TomlError::Unknown)?;
tokens.push(Spanned {
value: token,
span: Span::new(self.offset + span.start, self.offset + span.end),
});
}
// Handle chunk boundaries - hold back potentially incomplete tokens
let emit_count = if self.buffer.ends_with('\n') {
tokens.len()
} else {
tokens.len().saturating_sub(1)
};
let to_emit: Vec<_> = tokens.drain(..emit_count).collect();
self.pending_tokens = tokens;
if let Some(last) = to_emit.last() {
let consumed = last.span.end() - self.offset;
self.buffer.drain(..consumed);
self.offset = last.span.end();
}
Ok(to_emit)
}
fn finish(mut self) -> Result<Vec<Self::Spanned>, Self::Error> {
// Process remaining buffer
if !self.buffer.is_empty() {
use logos::Logos;
let mut lexer = Token::lexer(&self.buffer);
while let Some(result) = lexer.next() {
let span = lexer.span();
let token = result.map_err(|_| TomlError::Unknown)?;
self.pending_tokens.push(Spanned {
value: token,
span: Span::new(self.offset + span.start, self.offset + span.end),
});
}
}
Ok(self.pending_tokens)
}
fn offset(&self) -> usize {
self.offset
}
}
Implementing IncrementalParse
Define an incremental document item that emits as soon as parseable:
use synkit::async_stream::{IncrementalParse, ParseCheckpoint};
#[derive(Debug, Clone)]
pub enum IncrementalDocumentItem {
Trivia(Trivia),
KeyValue(Spanned<KeyValue>),
TableHeader {
lbracket: Spanned<tokens::LBracketToken>,
name: Spanned<Key>,
rbracket: Spanned<tokens::RBracketToken>,
},
}
impl IncrementalParse for IncrementalDocumentItem {
type Token = Token;
type Error = TomlError;
fn parse_incremental<S>(
tokens: &[S],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Self::Error>
where
S: AsRef<Self::Token>,
{
let cursor = checkpoint.cursor;
if cursor >= tokens.len() {
return Ok((None, checkpoint.clone()));
}
let token = tokens[cursor].as_ref();
match token {
// Newline trivia - emit immediately
Token::Newline => {
let item = IncrementalDocumentItem::Trivia(/* ... */);
let new_cp = ParseCheckpoint {
cursor: cursor + 1,
tokens_consumed: checkpoint.tokens_consumed + 1,
state: 0,
};
Ok((Some(item), new_cp))
}
// Table header: need [, name, ]
Token::LBracket => {
if cursor + 2 >= tokens.len() {
// Need more tokens
return Ok((None, checkpoint.clone()));
}
// Parse [name] and emit TableHeader
// ...
}
// Key-value: need key, =, value
Token::BareKey(_) | Token::BasicString(_) => {
if cursor + 2 >= tokens.len() {
return Ok((None, checkpoint.clone()));
}
// Parse key = value and emit KeyValue
// ...
}
// Skip whitespace
Token::Space | Token::Tab => {
let new_cp = ParseCheckpoint {
cursor: cursor + 1,
tokens_consumed: checkpoint.tokens_consumed + 1,
state: checkpoint.state,
};
Self::parse_incremental(tokens, &new_cp)
}
_ => Err(TomlError::Expected {
expect: "key, table header, or trivia",
found: format!("{:?}", token),
}),
}
}
fn can_parse<S>(tokens: &[S], checkpoint: &ParseCheckpoint) -> bool
where
S: AsRef<Self::Token>,
{
checkpoint.cursor < tokens.len()
}
}
Using with Tokio
Stream TOML parsing with tokio channels:
use synkit::async_stream::tokio_impl::AstStream;
use tokio::sync::mpsc;
#[tokio::main]
async fn main() {
let (source_tx, mut source_rx) = mpsc::channel::<String>(8);
let (token_tx, token_rx) = mpsc::channel(32);
let (ast_tx, mut ast_rx) = mpsc::channel(16);
// Lexer task
tokio::spawn(async move {
let mut lexer = TomlIncrementalLexer::new();
while let Some(chunk) = source_rx.recv().await {
for token in lexer.feed(&chunk).unwrap() {
token_tx.send(token).await.unwrap();
}
}
for token in lexer.finish().unwrap() {
token_tx.send(token).await.unwrap();
}
});
// Parser task
tokio::spawn(async move {
let mut parser = AstStream::<IncrementalDocumentItem, Spanned<Token>>::new(
token_rx,
ast_tx
);
parser.run().await.unwrap();
});
// Feed source chunks
source_tx.send("[server]\n".to_string()).await.unwrap();
source_tx.send("host = \"localhost\"\n".to_string()).await.unwrap();
source_tx.send("port = 8080\n".to_string()).await.unwrap();
drop(source_tx);
// Process items as they arrive
while let Some(item) = ast_rx.recv().await {
match item {
IncrementalDocumentItem::TableHeader { name, .. } => {
println!("Found table: {:?}", name);
}
IncrementalDocumentItem::KeyValue(kv) => {
println!("Found key-value: {:?}", kv.value.key);
}
IncrementalDocumentItem::Trivia(_) => {}
}
}
}
Testing Incremental Parsing
Test with various chunk boundaries:
#[test]
fn test_incremental_lexer_chunked() {
let mut lexer = TomlIncrementalLexer::new();
// Split across chunk boundary
let t1 = lexer.feed("ke").unwrap();
let t2 = lexer.feed("y = ").unwrap();
let t3 = lexer.feed("42\n").unwrap();
let remaining = lexer.finish().unwrap();
let total = t1.len() + t2.len() + t3.len() + remaining.len();
// Should produce: key, =, 42, newline
assert!(total >= 4);
}
#[test]
fn test_incremental_parse_needs_more() {
let tokens = vec![
Spanned { value: Token::BareKey("name".into()), span: Span::new(0, 4) },
Spanned { value: Token::Eq, span: Span::new(5, 6) },
// Missing value!
];
let checkpoint = ParseCheckpoint::default();
let (result, _) = IncrementalDocumentItem::parse_incremental(&tokens, &checkpoint).unwrap();
// Should return None, not error
assert!(result.is_none());
}
Summary
Key points for incremental parsing:
- Buffer management: Hold back tokens at chunk boundaries that might be incomplete
- Return None for incomplete: Don’t error when more tokens are needed
- Track offset: Maintain byte offset across chunks for correct spans
- Emit early: Emit AST nodes as soon as they’re complete
- Test boundaries: Test parsing with data split at various points
Tutorial: JSONL Incremental Parser
Build a high-performance streaming JSON Lines parser using synkit’s incremental parsing infrastructure.
Source Code
📦 Complete source: examples/jsonl-parser
What You’ll Learn
- ChunkBoundary - Define where to split token streams
- IncrementalLexer - Buffer partial input, emit complete tokens
- IncrementalParse - Parse from token buffers with checkpoints
- Async streaming - tokio and futures integration
- Stress testing - Validate memory stability under load
JSON Lines Format
JSON Lines uses newline-delimited JSON:
{"user": "alice", "action": "login"}
{"user": "bob", "action": "purchase", "amount": 42.50}
{"user": "alice", "action": "logout"}
Each line is a complete JSON value. This makes JSONL ideal for:
- Log processing
- Event streams
- Large dataset processing
- Network protocols
Why Incremental Parsing?
Traditional parsing loads entire input into memory:
let input = fs::read_to_string("10gb_logs.jsonl")?; // ❌ OOM
let docs: Vec<Log> = parse(&input)?;
Incremental parsing processes chunks:
let mut lexer = JsonIncrementalLexer::new();
while let Some(chunk) = reader.read_chunk().await {
for token in lexer.feed(&chunk)? {
// Process tokens as they arrive
}
}
Prerequisites
- Completed the TOML Parser Tutorial (or familiarity with synkit basics)
- Understanding of async Rust (for chapters 5-6)
Chapters
| Chapter | Topic | Key Concepts |
|---|---|---|
| 1. Token Definition | Token enum and parser_kit! | logos patterns, #[no_to_tokens] |
| 2. Chunk Boundaries | ChunkBoundary trait | depth tracking, boundary detection |
| 3. Incremental Lexer | IncrementalLexer trait | buffering, offset tracking |
| 4. Incremental Parse | IncrementalParse trait | checkpoints, partial results |
| 5. Async Streaming | tokio/futures integration | channels, backpressure |
| 6. Stress Testing | Memory stability | 1M+ events, leak detection |
Token Definition
📦 Source: examples/jsonl-parser/src/lib.rs
Error Type
Define a parser error type with thiserror:
use thiserror::Error;
#[derive(Error, Debug, Clone, Default, PartialEq)]
pub enum JsonError {
#[default]
#[error("unknown lexing error")]
Unknown,
#[error("expected {expect}, found {found}")]
Expected { expect: &'static str, found: String },
#[error("expected {expect}, found EOF")]
Empty { expect: &'static str },
#[error("invalid number: {0}")]
InvalidNumber(String),
#[error("invalid escape sequence")]
InvalidEscape,
#[error("{source}")]
Spanned {
#[source]
source: Box<JsonError>,
span: Span,
},
}
Token Definition
Use parser_kit! to define JSON tokens:
synkit::parser_kit! {
error: JsonError,
skip_tokens: [Space, Tab],
tokens: {
// Whitespace (auto-skipped during parsing)
#[token(" ", priority = 0)]
Space,
#[token("\t", priority = 0)]
Tab,
// Newline is significant - it's our record delimiter
#[regex(r"\r?\n")]
#[fmt("newline")]
#[no_to_tokens] // Custom ToTokens impl
Newline,
// Structural tokens
#[token("{")]
LBrace,
#[token("}")]
RBrace,
#[token("[")]
LBracket,
#[token("]")]
RBracket,
#[token(":")]
Colon,
#[token(",")]
Comma,
// Literals
#[token("null")]
Null,
#[token("true")]
True,
#[token("false")]
False,
// Strings with escape sequences
#[regex(r#""([^"\\]|\\.)*""#, |lex| {
let s = lex.slice();
s[1..s.len()-1].to_string() // Strip quotes
})]
#[fmt("string")]
#[no_to_tokens]
String(String),
// JSON numbers (integers and floats)
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?",
|lex| lex.slice().to_string())]
#[fmt("number")]
Number(String),
},
delimiters: {
Brace => (LBrace, RBrace),
Bracket => (LBracket, RBracket),
},
}
Key Points
#[no_to_tokens] for Custom Printing
Some tokens need custom ToTokens implementations:
impl traits::ToTokens for tokens::StringToken {
fn write(&self, p: &mut printer::Printer) {
use synkit::Printer as _;
p.word("\"");
for c in self.0.chars() {
match c {
'"' => p.word("\\\""),
'\\' => p.word("\\\\"),
'\n' => p.word("\\n"),
'\r' => p.word("\\r"),
'\t' => p.word("\\t"),
c => p.char(c),
}
}
p.word("\"");
}
}
Newline as Boundary
Unlike whitespace, Newline is semantically significant in JSONL - it separates records. Keep it in the token stream but handle it specially in parsing.
Next
Chunk Boundaries
📦 Source: examples/jsonl-parser/src/incremental.rs
The ChunkBoundary trait defines where token streams can be safely split for incremental parsing.
The Problem
When processing streaming input, we need to know when we have enough tokens to parse a complete unit. For JSONL, a “complete unit” is a single JSON line ending with a newline.
But we can’t just split on any newline - consider:
{"message": "hello\nworld"}
The \n inside the string is NOT a record boundary.
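A quick check, assuming the Token and TokenStream generated in the previous chapter: the escaped \n stays inside the String token, so it never surfaces as a Newline token and a token-level boundary scan cannot fire inside the string.
let mut stream = TokenStream::lex(r#"{"message": "hello\nworld"}"#).unwrap();
let mut saw_newline = false;
while let Some(tok) = stream.next() {
    // The escape is part of the String token's value, not a separate token.
    saw_newline |= matches!(tok.value, Token::Newline);
}
assert!(!saw_newline);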
ChunkBoundary Trait
pub trait ChunkBoundary {
type Token;
/// Is this token a potential boundary?
fn is_boundary_token(token: &Self::Token) -> bool;
/// Depth change: +1 for openers, -1 for closers
fn depth_delta(token: &Self::Token) -> i32 { 0 }
/// Should this token be skipped when scanning?
fn is_ignorable(token: &Self::Token) -> bool { false }
/// Find next boundary at depth 0
fn find_boundary<S: AsRef<Self::Token>>(
tokens: &[S],
start: usize
) -> Option<usize>;
}
JSONL Implementation
impl ChunkBoundary for JsonLine {
type Token = Token;
#[inline]
fn is_boundary_token(token: &Token) -> bool {
matches!(token, Token::Newline)
}
#[inline]
fn depth_delta(token: &Token) -> i32 {
match token {
Token::LBrace | Token::LBracket => 1, // Opens nesting
Token::RBrace | Token::RBracket => -1, // Closes nesting
_ => 0,
}
}
#[inline]
fn is_ignorable(token: &Token) -> bool {
matches!(token, Token::Space | Token::Tab)
}
}
How It Works
The default find_boundary implementation:
- Starts at depth = 0
- For each token:
  - Adds depth_delta() to depth
  - If depth == 0 and is_boundary_token(): return position + 1
- Returns None if no boundary found (a sketch of this default follows the example below)
Example Token Stream
Tokens:  {    "a"  :    1    }    \n   {    "b"  :    2    }    \n
Depth:   1    1    1    1    0    0    1    1    1    1    0    0
         ^                        ^                             ^
         open                     boundary                      boundary
The first \n at index 5 (after }) is a valid boundary because depth is 0.
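Putting that algorithm into code, a minimal sketch written against the JsonLine impl above (the library's provided default may differ in detail; the function name here is only illustrative):
fn find_json_boundary<S: AsRef<Token>>(tokens: &[S], start: usize) -> Option<usize> {
    let mut depth: i32 = 0;
    for (i, tok) in tokens.iter().enumerate().skip(start) {
        let tok = tok.as_ref();
        if JsonLine::is_ignorable(tok) {
            continue; // whitespace never affects depth or boundaries
        }
        depth += JsonLine::depth_delta(tok);
        if depth == 0 && JsonLine::is_boundary_token(tok) {
            return Some(i + 1); // exclusive: the chunk ends just after this token
        }
    }
    None // no complete record buffered yet
}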
Finding Boundaries
Use find_boundary to locate complete chunks:
let tokens: Vec<Spanned<Token>> = /* from lexer */;
let mut start = 0;
while let Some(end) = JsonLine::find_boundary(&tokens, start) {
let chunk = &tokens[start..end];
let line = parse_json_line(chunk)?;
process(line);
start = end;
}
// tokens[start..] contains incomplete data - wait for more
Design Considerations
Delimiter Matching
For formats with paired delimiters (JSON, TOML, XML), track nesting depth. A boundary is only valid when all delimiters are balanced.
String Literals
Newlines inside strings don’t affect depth because the lexer treats the entire string as one token. The ChunkBoundary operates on tokens, not characters.
Multiple Boundary Types
Some formats have multiple boundary types. For TOML:
fn is_boundary_token(token: &Token) -> bool {
matches!(token, Token::Newline | Token::TableHeader)
}
Next
Chapter 3: Incremental Lexer →
Incremental Lexer
📦 Source: examples/jsonl-parser/src/incremental.rs
The IncrementalLexer trait enables lexing input that arrives in chunks.
The Problem
Network data arrives in arbitrary chunks:
Chunk 1: {"name": "ali
Chunk 2: ce"}\n{"nam
Chunk 3: e": "bob"}\n
We need to:
- Buffer incomplete tokens across chunks
- Emit complete tokens as soon as available
- Track source positions across all chunks
IncrementalLexer Trait
pub trait IncrementalLexer: Sized {
type Token: Clone;
type Span: Clone;
type Spanned: Clone;
type Error: fmt::Display;
/// Create with default capacity
fn new() -> Self;
/// Create with capacity hints for pre-allocation
fn with_capacity_hint(hint: LexerCapacityHint) -> Self;
/// Feed a chunk, return complete tokens
fn feed(&mut self, chunk: &str) -> Result<Vec<Self::Spanned>, Self::Error>;
/// Feed into existing buffer (avoids allocation)
fn feed_into(
&mut self,
chunk: &str,
buffer: &mut Vec<Self::Spanned>
) -> Result<usize, Self::Error>;
/// Finish and return remaining tokens
fn finish(self) -> Result<Vec<Self::Spanned>, Self::Error>;
/// Current byte offset
fn offset(&self) -> usize;
}
JSONL Implementation
pub struct JsonIncrementalLexer {
buffer: String, // Accumulated input
offset: usize, // Total bytes processed
token_hint: usize, // Capacity hint
}
impl IncrementalLexer for JsonIncrementalLexer {
type Token = Token;
type Span = Span;
type Spanned = Spanned<Token>;
type Error = JsonError;
fn new() -> Self {
Self {
buffer: String::new(),
offset: 0,
token_hint: 64,
}
}
fn with_capacity_hint(hint: LexerCapacityHint) -> Self {
Self {
buffer: String::with_capacity(hint.buffer_capacity),
offset: 0,
token_hint: hint.tokens_per_chunk,
}
}
fn feed(&mut self, chunk: &str) -> Result<Vec<Self::Spanned>, Self::Error> {
self.buffer.push_str(chunk);
self.lex_complete_lines()
}
fn finish(self) -> Result<Vec<Self::Spanned>, Self::Error> {
if self.buffer.is_empty() {
return Ok(Vec::new());
}
// Lex remaining buffer
self.lex_buffer(&self.buffer)
}
fn offset(&self) -> usize {
self.offset
}
}
Key Implementation: lex_complete_lines
fn lex_complete_lines(&mut self) -> Result<Vec<Spanned<Token>>, JsonError> {
use logos::Logos;
// Find last newline - only lex complete lines
let split_pos = self.buffer.rfind('\n').map(|p| p + 1);
let (to_lex, remainder) = match split_pos {
Some(pos) if pos < self.buffer.len() => {
// Have remainder after newline
let (prefix, suffix) = self.buffer.split_at(pos);
(prefix.to_string(), suffix.to_string())
}
Some(pos) if pos == self.buffer.len() => {
// Newline at end, no remainder
(std::mem::take(&mut self.buffer), String::new())
}
_ => return Ok(Vec::new()), // No complete lines yet
};
// Lex the complete portion
let mut tokens = Vec::with_capacity(self.token_hint);
let mut lexer = Token::lexer(&to_lex);
while let Some(result) = lexer.next() {
let token = result.map_err(|_| JsonError::Unknown)?;
let span = lexer.span();
tokens.push(Spanned {
value: token,
// Adjust span by global offset
span: Span::new(
self.offset + span.start,
self.offset + span.end
),
});
}
// Update state
self.offset += to_lex.len();
self.buffer = remainder;
Ok(tokens)
}
Capacity Hints
Pre-allocate buffers based on expected input:
// Small: <1KB inputs
let hint = LexerCapacityHint::small();
// Medium: 1KB-64KB (default)
let hint = LexerCapacityHint::medium();
// Large: >64KB
let hint = LexerCapacityHint::large();
// Custom: from expected chunk size
let hint = LexerCapacityHint::from_chunk_size(4096);
let lexer = JsonIncrementalLexer::with_capacity_hint(hint);
Using feed_into for Zero-Copy
Avoid repeated allocations with feed_into:
let mut lexer = JsonIncrementalLexer::new();
let mut token_buffer = Vec::with_capacity(1024);
while let Some(chunk) = source.next_chunk().await {
let added = lexer.feed_into(&chunk, &mut token_buffer)?;
println!("Added {} tokens", added);
// Process and drain tokens...
}
Span Tracking
All spans are global - they reference positions in the complete input:
Chunk 1 (offset 0): {"a":1}\n
Spans: 0-1, 1-4, 4-5, 5-6, 6-7, 7-8
Chunk 2 (offset 8): {"b":2}\n
Spans: 8-9, 9-12, 12-13, 13-14, 14-15, 15-16
^
offset added
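A quick way to verify this, assuming the JsonIncrementalLexer above and the default span derives (PartialEq):
let mut lexer = JsonIncrementalLexer::new();
let chunk1 = lexer.feed("{\"a\":1}\n").unwrap();
let chunk2 = lexer.feed("{\"b\":2}\n").unwrap();
// The first token of each chunk is '{'; the second chunk's span carries
// the 8-byte offset accumulated from chunk 1.
assert_eq!(chunk1[0].span, Span::new(0, 1));
assert_eq!(chunk2[0].span, Span::new(8, 9));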
Next
Chapter 4: Incremental Parse →
Incremental Parse
📦 Source: examples/jsonl-parser/src/incremental.rs
The IncrementalParse trait enables parsing from a growing token buffer.
IncrementalParse Trait
pub trait IncrementalParse: Sized {
type Token: Clone;
type Error: fmt::Display;
/// Attempt to parse from tokens starting at checkpoint
///
/// Returns:
/// - `Ok((Some(node), new_checkpoint))` - Parsed successfully
/// - `Ok((None, checkpoint))` - Need more tokens
/// - `Err(error)` - Unrecoverable error
fn parse_incremental<S>(
tokens: &[S],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Self::Error>
where
S: AsRef<Self::Token>;
/// Check if parsing might succeed with current tokens
fn can_parse<S>(tokens: &[S], checkpoint: &ParseCheckpoint) -> bool
where
S: AsRef<Self::Token>;
}
ParseCheckpoint
Track parser state between parse attempts:
#[derive(Debug, Clone, Default)]
pub struct ParseCheckpoint {
/// Position in token buffer
pub cursor: usize,
/// Tokens consumed (for buffer compaction)
pub tokens_consumed: usize,
/// Custom state (e.g., nesting depth)
pub state: u64,
}
JSONL Implementation Strategy
Rather than re-implementing parsing logic, we reuse the standard Parse trait:
impl IncrementalParse for JsonLine {
type Token = Token;
type Error = JsonError;
fn parse_incremental<S>(
tokens: &[S],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Self::Error>
where
S: AsRef<Self::Token>,
{
// 1. Find chunk boundary
let boundary = match Self::find_boundary(tokens, checkpoint.cursor) {
Some(b) => b,
None => return Ok((None, checkpoint.clone())), // Need more
};
// 2. Extract chunk tokens
let chunk = &tokens[checkpoint.cursor..boundary];
// 3. Build TokenStream from chunk
let stream_tokens: Vec<_> = chunk.iter()
.map(|s| /* convert to SpannedToken */)
.collect();
let mut stream = TokenStream::from_tokens(/* ... */);
// 4. Use standard Parse implementation
let line = JsonLine::parse(&mut stream)?;
// 5. Return with updated checkpoint
let consumed = boundary - checkpoint.cursor;
Ok((
Some(line),
ParseCheckpoint {
cursor: boundary,
tokens_consumed: checkpoint.tokens_consumed + consumed,
state: 0,
},
))
}
fn can_parse<S>(tokens: &[S], checkpoint: &ParseCheckpoint) -> bool
where
S: AsRef<Self::Token>,
{
// Can parse if there's a complete chunk
Self::find_boundary(tokens, checkpoint.cursor).is_some()
}
}
Key Design: Reuse Parse Trait
The incremental parser delegates to the standard Parse implementation. This ensures:
- Consistency - Same parsing logic for sync and async
- Maintainability - One parser implementation to update
- Testing - Sync tests validate incremental behavior
Using IncrementalBuffer
The IncrementalBuffer helper manages tokens efficiently:
use synkit::async_stream::{IncrementalBuffer, parse_available_chunks};
let mut buffer = IncrementalBuffer::with_capacity(1024);
let mut lexer = JsonIncrementalLexer::new();
// Feed tokens
buffer.extend(lexer.feed(chunk)?);
// Parse all available chunks
let results = parse_available_chunks::<JsonLine, _, _, _, _>(
&mut buffer,
|tokens| {
let mut stream = TokenStream::from_tokens(/* ... */);
JsonLine::parse(&mut stream)
},
)?;
// Compact buffer to release memory
buffer.compact();
IncrementalBuffer Operations
// Access unconsumed tokens
let remaining = buffer.remaining();
// Mark tokens as consumed
buffer.consume(count);
// Remove consumed tokens from memory
buffer.compact();
// Check size
let len = buffer.len(); // Unconsumed count
let total = buffer.total_tokens(); // Including consumed
Error Handling
Return errors for unrecoverable parsing failures:
fn parse_incremental<S>(
tokens: &[S],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Self::Error> {
// ...
match JsonLine::parse(&mut stream) {
Ok(line) => Ok((Some(line), new_checkpoint)),
Err(e) => {
// For recoverable errors, could return Ok((None, ...))
// For unrecoverable, propagate the error
Err(e)
}
}
}
Checkpoint State
Use state: u64 for parser-specific context:
// Example: Track nesting depth
let checkpoint = ParseCheckpoint {
cursor: 100,
tokens_consumed: 50,
state: 3, // Currently at depth 3
};
Next
Async Streaming
📦 Source: examples/jsonl-parser/src/incremental.rs
synkit provides async streaming support via tokio and futures feature flags.
Feature Flags
# Cargo.toml
# For tokio runtime
synkit = { version = "0.1", features = ["tokio"] }
# For runtime-agnostic futures
synkit = { version = "0.1", features = ["futures"] }
# For both
synkit = { version = "0.1", features = ["tokio", "futures"] }
Architecture
┌──────────┐ ┌───────────────────┐ ┌──────────────┐
│ Source │────▶│ AsyncTokenStream │────▶│ AstStream │────▶ Consumer
│ (chunks) │ │ (lexer) │ │ (parser) │
└──────────┘ └───────────────────┘ └──────────────┘
│ │
mpsc::channel mpsc::channel
Tokio Implementation
AsyncTokenStream
Receives source chunks, emits tokens:
use synkit::async_stream::tokio_impl::AsyncTokenStream;
use tokio::sync::mpsc;
let (token_tx, token_rx) = mpsc::channel(1024);
let mut lexer = AsyncTokenStream::<JsonIncrementalLexer>::new(token_tx);
// Feed chunks
lexer.feed(chunk).await?;
// Signal completion
lexer.finish().await?;
AstStream
Receives tokens, emits AST nodes:
use synkit::async_stream::tokio_impl::AstStream;
let (ast_tx, mut ast_rx) = mpsc::channel(64);
let mut parser = AstStream::<JsonLine, Token>::new(token_rx, ast_tx);
// Run until token stream exhausted
parser.run().await?;
Full Pipeline
use synkit::async_stream::{StreamConfig, tokio_impl::*};
use tokio::sync::mpsc;
use futures_util::{Stream, StreamExt};
async fn process_jsonl_stream(
mut source: impl Stream<Item = String> + Unpin,
) -> Result<Vec<JsonLine>, StreamError> {
let config = StreamConfig::medium();
// Create channels
let (token_tx, token_rx) = mpsc::channel(config.token_buffer_size);
let (ast_tx, mut ast_rx) = mpsc::channel(config.ast_buffer_size);
// Spawn lexer task
let lexer_handle = tokio::spawn(async move {
let mut lexer = AsyncTokenStream::<JsonIncrementalLexer>::with_config(
token_tx,
config.clone()
);
while let Some(chunk) = source.next().await {
lexer.feed(&chunk).await?;
}
lexer.finish().await
});
// Spawn parser task
let parser_handle = tokio::spawn(async move {
let mut parser = AstStream::<JsonLine, Token>::with_config(
token_rx,
ast_tx,
config,
);
parser.run().await
});
// Collect results
let mut results = Vec::new();
while let Some(line) = ast_rx.recv().await {
results.push(line);
}
// Wait for tasks
lexer_handle.await??;
parser_handle.await??;
Ok(results)
}
StreamConfig
Configure buffer sizes and limits:
let config = StreamConfig {
token_buffer_size: 1024, // Channel + buffer capacity
ast_buffer_size: 64, // AST channel capacity
max_chunk_size: 64 * 1024, // Reject chunks > 64KB
lexer_hint: LexerCapacityHint::medium(),
};
// Or use presets
let config = StreamConfig::small(); // <1KB inputs
let config = StreamConfig::medium(); // 1KB-64KB (default)
let config = StreamConfig::large(); // >64KB inputs
Futures Implementation
For runtime-agnostic streaming, use ParseStream:
use synkit::async_stream::futures_impl::ParseStream;
use futures_util::StreamExt;
let token_stream = /* any Stream<Item = Token> produced by the lexer stage */;
let mut parse_stream = ParseStream::<_, JsonLine, _>::new(token_stream);
while let Some(result) = parse_stream.next().await {
match result {
Ok(line) => process(line),
Err(e) => handle_error(e),
}
}
Error Handling
StreamError covers all streaming failure modes:
pub enum StreamError {
ChannelClosed, // Unexpected channel closure
LexError(String), // Lexer failure
ParseError(String), // Parser failure
IncompleteInput, // EOF with partial data
ChunkTooLarge { size, max }, // Input exceeds limit
BufferOverflow { current, max }, // Buffer exceeded
Timeout, // Deadline exceeded
ResourceLimit { resource, current, max },
}
Handle errors appropriately:
match parser.run().await {
Ok(()) => println!("Complete"),
Err(StreamError::IncompleteInput) => {
eprintln!("Warning: truncated input");
}
Err(StreamError::ParseError(msg)) => {
eprintln!("Parse error: {}", msg);
// Could log and continue with next record
}
Err(e) => return Err(e.into()),
}
Backpressure
Channel-based streaming provides natural backpressure:
- If consumer is slow, channels fill up
- Producers block on send().await
- Memory usage stays bounded
Configure based on throughput needs:
// High throughput, more memory
let (tx, rx) = mpsc::channel(4096);
// Low latency, less memory
let (tx, rx) = mpsc::channel(16);
Next
Stress Testing
Validate incremental parsers handle high throughput without memory leaks.
Test Strategy
- Volume - Process millions of events
- Memory stability - Track buffer sizes, detect leaks
- Varied input - Different object sizes and structures
- Buffer compaction - Verify consumed tokens are released
Million Event Test
#[test]
fn test_million_events_no_memory_leak() {
let config = StressConfig {
event_count: 1_000_000,
chunk_size: 4096,
memory_check_interval: 100_000,
max_memory_growth: 2.0,
};
// Use a real newline: inside r#"..."# the \n would be two literal characters.
let input = concat!(r#"{"id": 1, "name": "test", "value": 42.5}"#, "\n");
let mut lexer = JsonIncrementalLexer::new();
let mut token_buffer: Vec<Spanned<Token>> = Vec::new();
let mut checkpoint = ParseCheckpoint::default();
let mut total_parsed = 0;
let mut memory_tracker = MemoryTracker::new();
for i in 0..config.event_count {
// Feed one line
token_buffer.extend(lexer.feed(input).unwrap());
// Parse available
loop {
match JsonLine::parse_incremental(&token_buffer, &checkpoint) {
Ok((Some(_line), new_cp)) => {
total_parsed += 1;
checkpoint = new_cp;
}
Ok((None, _)) => break,
Err(e) => panic!("Parse error at event {}: {}", i, e),
}
}
// Compact frequently to avoid memory growth
if checkpoint.tokens_consumed > 500 {
token_buffer.drain(..checkpoint.tokens_consumed);
checkpoint.cursor -= checkpoint.tokens_consumed;
checkpoint.tokens_consumed = 0;
}
// Memory sampling
if i % config.memory_check_interval == 0 {
memory_tracker.sample(token_buffer.len(), 0);
}
}
assert_eq!(total_parsed, config.event_count);
assert!(memory_tracker.is_stable(config.max_memory_growth));
}
Memory Tracking
struct MemoryTracker {
initial_estimate: usize,
samples: Vec<usize>,
}
impl MemoryTracker {
fn sample(&mut self, token_buffer_size: usize, line_buffer_size: usize) {
let estimate = token_buffer_size * size_of::<Spanned<Token>>()
+ line_buffer_size * size_of::<JsonLine>();
if self.initial_estimate == 0 {
self.initial_estimate = estimate.max(1);
}
self.samples.push(estimate);
}
fn max_growth_ratio(&self) -> f64 {
let max = self.samples.iter().max().copied().unwrap_or(0);
max as f64 / self.initial_estimate as f64
}
fn is_stable(&self, max_growth: f64) -> bool {
self.max_growth_ratio() <= max_growth
}
}
Varied Input Test
Test with different JSON structures:
#[test]
fn test_varied_objects_stress() {
let objects = vec![
r#"{"type": "simple", "value": 1}"#,
r#"{"type": "nested", "data": {"inner": true}}"#,
r#"{"type": "array", "items": [1, 2, 3, 4, 5]}"#,
r#"{"type": "complex", "users": [{"name": "a"}], "count": 2}"#,
];
for i in 0..500_000 {
let obj = objects[i % objects.len()];
let input = format!("{}\n", obj);
// Feed, parse, verify...
}
}
Buffer Compaction
Critical for memory stability:
// Bad: Buffer grows unbounded
loop {
token_buffer.extend(lexer.feed(chunk)?);
while let Some(line) = parse(&token_buffer)? {
// Parse but never compact
}
}
// Good: Compact after consuming
loop {
token_buffer.extend(lexer.feed(chunk)?);
while let Some(line) = parse(&token_buffer)? {
checkpoint = new_checkpoint;
}
// Compact when enough consumed
if checkpoint.tokens_consumed > THRESHOLD {
token_buffer.drain(..checkpoint.tokens_consumed);
checkpoint.cursor -= checkpoint.tokens_consumed;
checkpoint.tokens_consumed = 0;
}
}
Performance Metrics
Track throughput:
let start = Instant::now();
// ... process events ...
let elapsed = start.elapsed();
let rate = total_parsed as f64 / elapsed.as_secs_f64();
println!(
"Processed {} events in {:?} ({:.0} events/sec)",
total_parsed, elapsed, rate
);
Expected performance (rough guidelines):
- Simple objects: 500K-1M events/sec
- Complex nested: 100K-300K events/sec
- Memory growth: <2x initial
Running Tests
# Run stress tests (may take minutes)
cd examples/jsonl-parser
cargo test stress -- --nocapture
# With release optimizations
cargo test --release stress -- --nocapture
Summary
Incremental parsing requires careful attention to:
- Buffer management - Compact regularly
- Memory bounds - Track growth, fail on overflow
- Throughput - Profile hot paths
- Correctness - Same results as sync parsing
The JSONL parser demonstrates these patterns at scale.
parser_kit! Macro
The parser_kit! macro generates parsing infrastructure from token definitions.
Syntax
synkit::parser_kit! {
error: ErrorType,
skip_tokens: [Token1, Token2],
#[logos(skip r"...")] // Optional logos-level attributes
tokens: {
#[token("=")]
Eq,
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
#[fmt("identifier")]
Ident(String),
},
delimiters: {
Bracket => (LBracket, RBracket),
},
span_derives: [Debug, Clone, PartialEq],
token_derives: [Debug, Clone, PartialEq],
custom_derives: [],
}
Fields
error: ErrorType (required)
Your error type. Must implement Default:
#[derive(Default)]
pub enum MyError {
#[default]
Unknown,
// ...
}
skip_tokens: [...] (required)
Tokens to skip during parsing. Typically whitespace:
skip_tokens: [Space, Tab],
Skipped tokens don’t appear in stream.next() but are visible in stream.next_raw().
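For example, assuming a grammar with Space in skip_tokens plus Eq and Ident tokens (a sketch, not exact generated output):
let mut stream = TokenStream::lex("a = b").unwrap();
// next() never yields the skipped Space tokens...
assert!(matches!(stream.next().map(|t| t.value), Some(Token::Ident(_))));
assert!(matches!(stream.next().map(|t| t.value), Some(Token::Eq)));

// ...while next_raw() walks every token, whitespace included.
let mut raw = TokenStream::lex("a = b").unwrap();
assert!(matches!(raw.next_raw().map(|t| t.value), Some(Token::Ident(_))));
assert!(matches!(raw.next_raw().map(|t| t.value), Some(Token::Space)));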
tokens: { ... } (required)
Token definitions using logos attributes.
Unit Tokens
#[token("=")]
Eq,
Generates EqToken with new() and token() methods.
Tokens with Values
#[regex(r"[a-z]+", |lex| lex.slice().to_string())]
Ident(String),
Generates IdentToken(String) implementing Deref<Target=String>.
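Which means String and &str methods are available directly on the token (illustrative):
let ident = IdentToken("width".to_string());
assert_eq!(ident.len(), 5);     // str methods via Deref
assert_eq!(&*ident, "width");   // explicit deref to the inner String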
Token Attributes
| Attribute | Purpose |
|---|---|
| #[token("...")] | Exact string match |
| #[regex(r"...")] | Regex pattern |
| #[regex(r"...", callback)] | Regex with value extraction |
| #[fmt("name")] | Display name for errors |
| #[derive(...)] | Additional derives for this token |
| priority = N | Logos priority for conflicts |
delimiters: { ... } (optional)
Delimiter pair definitions:
delimiters: {
Bracket => (LBracket, RBracket),
Brace => (LBrace, RBrace),
Paren => (LParen, RParen),
},
Generates:
- Struct (e.g., Bracket) storing spans
- Macro (e.g., bracket!) for extraction
span_derives: [...] (optional)
Derives for Span, RawSpan, Spanned<T>:
span_derives: [Debug, Clone, PartialEq, Eq, Hash],
Default: Debug, Clone, PartialEq, Eq, Hash
token_derives: [...] (optional)
Derives for all token structs:
token_derives: [Debug, Clone, PartialEq],
custom_derives: [...] (optional)
Additional derives for all generated types:
custom_derives: [serde::Serialize],
Generated Modules
span
pub struct RawSpan { pub start: usize, pub end: usize }
pub enum Span { CallSite, Known(RawSpan) }
pub struct Spanned<T> { pub value: T, pub span: Span }
tokens
pub enum Token { Eq, Ident(String), ... }
pub struct EqToken;
pub struct IdentToken(pub String);
// Macros
macro_rules! Tok { ... }
macro_rules! SpannedTok { ... }
stream
pub struct TokenStream { ... }
pub struct MutTokenStream<'a> { ... }
impl TokenStream {
pub fn lex(source: &str) -> Result<Self, Error>;
pub fn parse<T: Parse>(&mut self) -> Result<Spanned<T>, Error>;
pub fn peek<T: Peek>(&self) -> bool;
pub fn fork(&self) -> Self;
pub fn advance_to(&mut self, other: &Self);
}
printer
pub struct Printer { ... }
impl Printer {
pub fn new() -> Self;
pub fn finish(self) -> String;
pub fn word(&mut self, s: &str);
pub fn token(&mut self, tok: &Token);
pub fn space(&mut self);
pub fn newline(&mut self);
pub fn open_block(&mut self);
pub fn close_block(&mut self);
}
delimiters
For each delimiter definition:
pub struct Bracket { pub span: Span }
macro_rules! bracket {
($inner:ident in $stream:expr) => { ... }
}
traits
pub trait Parse: Sized {
fn parse(stream: &mut TokenStream) -> Result<Self, Error>;
}
pub trait Peek {
fn is(token: &Token) -> bool;
fn peek(stream: &TokenStream) -> bool;
}
pub trait ToTokens {
fn write(&self, printer: &mut Printer);
fn to_string_formatted(&self) -> String;
}
pub trait Diagnostic {
fn fmt() -> &'static str;
}
Expansion Example
Input:
synkit::parser_kit! {
error: E,
skip_tokens: [],
tokens: {
#[token("=")]
Eq,
},
delimiters: {},
span_derives: [Debug],
token_derives: [Debug],
}
Expands to ~500 lines including all modules, traits, and implementations.
Core Traits
synkit provides traits in two locations:
- synkit (synkit-core): Generic traits for library-level abstractions
- Generated traits module: Concrete implementations for your grammar
Parse
Convert tokens to AST nodes.
pub trait Parse: Sized {
fn parse(stream: &mut TokenStream) -> Result<Self, Error>;
}
Auto-implementations
Token structs implement Parse automatically:
// Generated for EqToken
impl Parse for EqToken {
fn parse(stream: &mut TokenStream) -> Result<Self, Error> {
match stream.next() {
Some(tok) => match &tok.value {
Token::Eq => Ok(EqToken::new()),
other => Err(Error::expected::<Self>(other)),
},
None => Err(Error::empty::<Self>()),
}
}
}
Blanket Implementations
// Option<T> parses if T::peek() succeeds
impl<T: Parse + Peek> Parse for Option<T> { ... }
// Box<T> wraps parsed value
impl<T: Parse> Parse for Box<T> { ... }
// Spanned<T> wraps with span
impl<T: Parse> Parse for Spanned<T> { ... }
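A sketch of the Option impl in use, assuming NumberToken implements both Parse and Peek (generated token structs normally do):
pub struct Assignment {
    pub name: Spanned<IdentToken>,
    pub eq: Spanned<EqToken>,
    // Present only if a number follows; otherwise parses to None.
    pub value: Option<NumberToken>,
}

impl Parse for Assignment {
    fn parse(stream: &mut TokenStream) -> Result<Self, Error> {
        Ok(Self {
            name: stream.parse()?,
            eq: stream.parse()?,
            value: Option::<NumberToken>::parse(stream)?,
        })
    }
}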
Peek
Check next token without consuming.
pub trait Peek {
fn is(token: &Token) -> bool;
fn peek(stream: &TokenStream) -> bool;
}
Usage
// In conditionals
if Value::peek(stream) {
let v: Spanned<Value> = stream.parse()?;
}
// In loops
while SimpleKey::peek(stream) {
items.push(stream.parse()?);
}
ToTokens
Convert AST back to formatted output.
pub trait ToTokens {
fn write(&self, printer: &mut Printer);
fn to_string_formatted(&self) -> String {
let mut p = Printer::new();
self.write(&mut p);
p.finish()
}
}
Blanket Implementations
impl<T: ToTokens> ToTokens for Spanned<T> {
fn write(&self, p: &mut Printer) {
self.value.write(p);
}
}
impl<T: ToTokens> ToTokens for Option<T> {
fn write(&self, p: &mut Printer) {
if let Some(v) = self { v.write(p); }
}
}
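Rendering is then a one-liner; this sketch assumes the generated impl for EqToken writes its literal text:
let eq = EqToken::new();
// Assumption: unit tokens print the string given to #[token("...")].
assert_eq!(eq.to_string_formatted(), "=");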
Diagnostic
Provide display name for error messages.
pub trait Diagnostic {
fn fmt() -> &'static str;
}
Auto-generated using #[fmt("...")] or snake_case variant name.
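For a token declared with #[fmt("identifier")], the generated impl amounts to roughly this sketch:
impl Diagnostic for IdentToken {
    fn fmt() -> &'static str {
        // Taken from #[fmt("identifier")]; without the attribute this falls
        // back to the snake_case variant name ("ident").
        "identifier"
    }
}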
IncrementalParse
Parse AST nodes incrementally from a token buffer with checkpoint-based state.
pub trait IncrementalParse: Sized {
fn parse_incremental(
tokens: &[Token],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), Error>;
fn can_parse(tokens: &[Token], checkpoint: &ParseCheckpoint) -> bool;
}
Usage
impl IncrementalParse for KeyValue {
fn parse_incremental(
tokens: &[Token],
checkpoint: &ParseCheckpoint,
) -> Result<(Option<Self>, ParseCheckpoint), TomlError> {
let cursor = checkpoint.cursor;
// Need at least 3 tokens: key = value
if cursor + 2 >= tokens.len() {
return Ok((None, checkpoint.clone()));
}
// Parse key = value pattern
// ...
let new_cp = ParseCheckpoint {
cursor: cursor + 3,
tokens_consumed: checkpoint.tokens_consumed + 3,
state: 0,
};
Ok((Some(kv), new_cp))
}
fn can_parse(tokens: &[Token], checkpoint: &ParseCheckpoint) -> bool {
checkpoint.cursor < tokens.len()
}
}
With Async Streaming
use synkit::async_stream::tokio_impl::AstStream;
let (token_tx, token_rx) = mpsc::channel(32);
let (ast_tx, mut ast_rx) = mpsc::channel(16);
tokio::spawn(async move {
let mut parser = AstStream::<KeyValue, Token>::new(token_rx, ast_tx);
parser.run().await?;
});
while let Some(kv) = ast_rx.recv().await {
process_key_value(kv);
}
TokenStream (core trait)
Generic stream interface from synkit-core:
pub trait TokenStream {
type Token;
type Span;
type Spanned;
type Error;
fn next(&mut self) -> Option<Self::Spanned>;
fn peek_token(&self) -> Option<&Self::Spanned>;
fn next_raw(&mut self) -> Option<Self::Spanned>;
fn peek_token_raw(&self) -> Option<&Self::Spanned>;
}
The generated stream::TokenStream implements this trait.
Printer (core trait)
Generic printer interface from synkit-core:
pub trait Printer {
fn word(&mut self, s: &str);
fn token<T: std::fmt::Display>(&mut self, tok: &T);
fn space(&mut self);
fn newline(&mut self);
fn open_block(&mut self);
fn close_block(&mut self);
fn indent(&mut self);
fn dedent(&mut self);
fn write_separated<T, F>(&mut self, items: &[T], sep: &str, f: F)
where F: Fn(&T, &mut Self);
}
SpannedError
Attach source spans to errors:
pub trait SpannedError: Sized {
type Span;
fn with_span(self, span: Self::Span) -> Self;
fn span(&self) -> Option<&Self::Span>;
}
Implementation pattern:
impl SpannedError for MyError {
type Span = Span;
fn with_span(self, span: Span) -> Self {
Self::Spanned { source: Box::new(self), span }
}
fn span(&self) -> Option<&Span> {
match self {
Self::Spanned { span, .. } => Some(span),
_ => None,
}
}
}
SpanLike / SpannedLike
Abstractions for span types:
pub trait SpanLike {
fn call_site() -> Self;
fn new(start: usize, end: usize) -> Self;
}
pub trait SpannedLike<T> {
type Span: SpanLike;
fn new(value: T, span: Self::Span) -> Self;
fn value(&self) -> &T;
fn span(&self) -> &Self::Span;
}
Enable generic code over different span implementations.
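For instance, a helper generic over any span type (a sketch; it relies on the start()/end() accessors referenced in the Safety & Clamping chapter):
fn covering_span<S: SpanLike>(first: &S, last: &S) -> S {
    // Union of two spans, mirroring SpanLike::join(): min of starts, max of ends.
    S::new(
        first.start().min(last.start()),
        first.end().max(last.end()),
    )
}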
Container Types
synkit provides container types for common parsing patterns.
Punctuated Sequence Types
Three wrapper types for punctuated sequences with different trailing policies:
| Type | Trailing Separator | Use Case |
|---|---|---|
| Punctuated<T, P> | Optional | Array literals: [1, 2, 3] or [1, 2, 3,] |
| Separated<T, P> | Forbidden | Function args: f(a, b, c) |
| Terminated<T, P> | Required | Statements: use foo; use bar; |
Punctuated
use synkit::Punctuated;
// Optional trailing comma
let items: Punctuated<Value, CommaToken> = parse_punctuated(&mut stream)?;
for value in items.iter() {
process(value);
}
// Check if trailing comma present
if items.trailing_punct() {
// ...
}
Separated
use synkit::Separated;
// Trailing separator is an error
let args: Separated<Arg, CommaToken> = parse_separated(&mut stream)?;
Terminated
use synkit::Terminated;
// Each statement must end with separator
let stmts: Terminated<Stmt, SemiToken> = parse_terminated(&mut stream)?;
Common Methods
All three types share these methods via PunctuatedInner:
fn new() -> Self;
fn with_capacity(capacity: usize) -> Self;
fn push_value(&mut self, value: T);
fn push_punct(&mut self, punct: P);
fn len(&self) -> usize;
fn is_empty(&self) -> bool;
fn iter(&self) -> impl Iterator<Item = &T>;
fn pairs(&self) -> impl Iterator<Item = (&T, Option<&P>)>;
fn first(&self) -> Option<&T>;
fn last(&self) -> Option<&T>;
fn trailing_punct(&self) -> bool;
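For example, pairs() yields each value with its following separator, which is handy for custom printing (illustrative; process is a placeholder):
// Walk values together with their (optional) following separators.
for (value, sep) in items.pairs() {
    process(value);
    if sep.is_none() {
        // Only the final value can lack a trailing separator here.
    }
}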
Repeated
Alternative sequence type preserving separator tokens:
use synkit::Repeated;
pub struct Repeated<T, Sep, Spanned> {
pub values: Vec<RepeatedItem<T, Sep, Spanned>>,
}
pub struct RepeatedItem<T, Sep, Spanned> {
pub value: Spanned,
pub sep: Option<Spanned>,
}
Use Repeated when you need to preserve separator token information (e.g., for source-accurate reprinting).
Methods
fn empty() -> Self;
fn with_capacity(capacity: usize) -> Self;
fn len(&self) -> usize;
fn is_empty(&self) -> bool;
fn iter(&self) -> impl Iterator<Item = &RepeatedItem<...>>;
fn push(&mut self, item: RepeatedItem<...>);
Delimited
Value enclosed by delimiters:
use synkit::Delimited;
pub struct Delimited<T, Span> {
pub span: Span, // Span covering "[...]" or "{...}"
pub inner: T, // The content
}
Created automatically by delimiter macros:
let mut inner;
let bracket = bracket!(inner in stream);
// bracket.span covers "[" through "]"
// inner is a TokenStream of the contents
Usage Patterns
Comma-Separated Arguments
pub struct FnCall {
pub name: Spanned<IdentToken>,
pub paren: Paren,
pub args: Separated<Expr, CommaToken>, // No trailing comma
}
Array with Optional Trailing
pub struct Array {
pub bracket: Bracket,
pub items: Punctuated<Expr, CommaToken>, // Optional trailing
}
Statement Block
pub struct Block {
pub brace: Brace,
pub stmts: Terminated<Stmt, SemiToken>, // Required trailing
}
When none of the container types fits, parse the repetition by hand with Peek:
// Parse arms manually for control
let mut arms = Vec::new();
while Pattern::peek(stream) {
arms.push(stream.parse::<Spanned<MatchArm>>()?);
}
Printing Containers
impl<T: ToTokens, Sep: ToTokens> ToTokens for Repeated<T, Sep, Spanned<T>> {
fn write(&self, p: &mut Printer) {
for (i, item) in self.iter().enumerate() {
if i > 0 {
// Separator between items
p.word(", ");
}
item.value.write(p);
}
// Handle trailing separator if present
if self.has_trailing() {
p.word(",");
}
}
}
Safety & Clamping Behavior
synkit uses safe Rust throughout and employs defensive clamping to prevent panics from edge-case inputs. This page documents behaviors where invalid inputs are silently corrected rather than rejected.
Span Length Calculation
The SpanLike::len() method uses saturating subtraction:
fn len(&self) -> usize {
self.end().saturating_sub(self.start())
}
Behavior: If end < start (an inverted span), returns 0 instead of panicking or wrapping around.
Rationale: Inverted spans can occur as sentinel values or from malformed input. Returning zero length treats them as empty spans.
Span Join
The SpanLike::join() method computes the union of two spans:
fn join(&self, other: &Self) -> Self {
Self::new(self.start().min(other.start()), self.end().max(other.end()))
}
Behavior: Uses min() for start and max() for end. No validation that inputs are well-formed.
Rationale: Mathematical min/max cannot overflow or panic. Even inverted input spans produce a consistent result.
Incremental Buffer Consumption
The IncrementalBuffer::consume() method advances the cursor:
pub fn consume(&mut self, n: usize) {
self.cursor = (self.cursor + n).min(self.tokens.len());
}
Behavior: If n exceeds remaining tokens, cursor clamps to buffer length.
Rationale: Allows callers to safely “consume all” by passing usize::MAX. Prevents out-of-bounds access.
Generated TokenStream Rewind
The TokenStream::rewind() method generated by parser_kit! uses clamp:
fn rewind(&mut self, pos: usize) {
self.cursor = pos.clamp(self.range_start, self.range_end);
}
Behavior: Invalid positions are silently adjusted to the valid range [range_start, range_end].
Rationale: Parsing backtrack positions may become stale after buffer modifications. Clamping ensures the cursor remains valid.
When Clamping Matters
These behaviors are designed to:
- Prevent panics in library code - synkit never panics on edge-case numeric inputs
- Allow sentinel values - Special spans like (0, 0) or (MAX, MAX) work safely
- Support defensive programming - Callers don’t need to pre-validate every operation
When to Validate Explicitly
If your application requires strict validation (e.g., rejecting inverted spans), add checks at parse boundaries:
fn validate_span(span: &impl SpanLike) -> Result<(), MyError> {
if span.end() < span.start() {
return Err(MyError::InvalidSpan);
}
Ok(())
}
Resource Limits
For protection against resource exhaustion (e.g., deeply nested input), see:
- StreamError::ResourceLimit for runtime limit checking
- StreamConfig for configuring buffer sizes
- ParseConfig for nesting depth (when using recursion limits)
Security Considerations
synkit is designed for parsing untrusted input. This page documents the security model, protections, and best practices for generated parsers.
No Unsafe Code
synkit uses zero unsafe blocks in core, macros, and kit crates. Memory safety is guaranteed by the Rust compiler.
# Verify yourself
grep -r "unsafe" core/src macros/src kit/src
# Returns no matches
Resource Exhaustion Protection
Recursion Limits
Deeply nested input like [[[[[[...]]]]]] can cause stack overflow. synkit provides configurable recursion limits:
use synkit::ParseConfig;
// Default: 128 levels (matches serde_json)
let config = ParseConfig::default();
// Stricter limit for untrusted input
let config = ParseConfig::new()
.with_max_recursion_depth(32);
// Track depth manually in your parser
use synkit::RecursionGuard;
struct MyParser {
depth: RecursionGuard,
config: ParseConfig,
}
impl MyParser {
fn parse_nested(&mut self) -> Result<(), MyError> {
self.depth.enter(self.config.max_recursion_depth)?;
// ... parse nested content ...
self.depth.exit();
Ok(())
}
}
Token Limits
Prevent CPU exhaustion from extremely long inputs:
let config = ParseConfig::new()
.with_max_tokens(100_000); // Fail after 100k tokens
Buffer Limits (Streaming)
For incremental parsing, StreamConfig controls memory usage:
use synkit::StreamConfig;
let config = StreamConfig {
max_chunk_size: 16 * 1024, // 16KB max per chunk
token_buffer_size: 1024, // Token buffer capacity
ast_buffer_size: 64, // AST node buffer
..StreamConfig::default()
};
Exceeding limits produces explicit errors:
| Error | Trigger |
|---|---|
| StreamError::ChunkTooLarge | Input chunk > max_chunk_size |
| StreamError::BufferOverflow | Token buffer exceeded capacity |
| StreamError::ResourceLimit | Generic limit exceeded |
| Error::RecursionLimitExceeded | Nesting depth > max_recursion_depth |
| Error::TokenLimitExceeded | Token count > max_tokens |
Integer Safety
All span arithmetic uses saturating operations to prevent overflow panics:
// Span length - saturating subtraction
fn len(&self) -> usize {
self.end().saturating_sub(self.start())
}
// Recursion guard - saturating increment
self.depth = self.depth.saturating_add(1);
// Cursor bounds - clamped to valid range
self.cursor = pos.clamp(self.range_start, self.range_end);
See Safety & Clamping for detailed behavior documentation.
Memory Safety
Generated TokenStream uses Arc for shared ownership:
pub struct TokenStream {
source: Arc<str>, // Shared source text
tokens: Arc<Vec<Token>>, // Shared token buffer
// ... cursors are Copy types
}
Benefits:
- fork() is zero-copy (Arc::clone only); see the sketch below
- Thread-safe: TokenStream is Send + Sync
- No dangling references possible
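A sketch of what zero-copy forking enables: speculative parsing that commits only on success (use_number is a placeholder).
// Speculatively try to parse a number; commit the fork only on success.
let mut attempt = stream.fork();
match attempt.parse::<NumberToken>() {
    Ok(num) => {
        stream.advance_to(&attempt); // commit: stream now points past `num`
        use_number(num);             // placeholder for real handling
    }
    Err(_) => {
        // `stream` was never touched; fall back to another production.
    }
}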
Fuzz Testing
synkit includes continuous fuzz testing via cargo-fuzz:
# Run lexer fuzzer
cargo +nightly fuzz run fuzz_lexer
# Run parser fuzzer
cargo +nightly fuzz run fuzz_parser
Fuzz targets exercise:
- Arbitrary UTF-8 input
- Edge cases in span arithmetic
- Token stream operations
- Incremental buffer management
Adding Fuzz Tests for Your Parser
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &[u8]| {
if let Ok(s) = std::str::from_utf8(data) {
// Ignore lex errors, just ensure no panics
let _ = my_parser::TokenStream::lex(s);
}
});
Security Checklist
When building a parser for untrusted input:
- Set max_recursion_depth appropriate for your format
- Set max_tokens to prevent CPU exhaustion
- Use StreamConfig limits for streaming parsers
- Handle all error variants (don’t unwrap)
- Add fuzz tests for your grammar
- Consider timeout limits at the application layer
Threat Model
synkit protects against:
| Threat | Protection |
|---|---|
| Stack overflow | Recursion limits |
| Memory exhaustion | Buffer limits, Arc sharing |
| CPU exhaustion | Token limits |
| Integer overflow | Saturating arithmetic |
| Undefined behavior | No unsafe code |
synkit does NOT protect against:
| Threat | Mitigation |
|---|---|
| Regex backtracking (logos) | Use logos’ regex restrictions |
| Application-level DoS | Add timeouts in your application |
| Malicious AST semantics | Validate AST after parsing |
Reporting Vulnerabilities
Please open a GitHub security advisory.
Testing Generated Code
This guide covers testing strategies for parsers built with synkit, from unit tests to fuzz testing.
Unit Testing
Token-Level Tests
Test individual token recognition:
#[test]
fn test_lex_identifier() {
let stream = TokenStream::lex("foo_bar").unwrap();
let tok = stream.peek_token().unwrap();
assert!(matches!(tok.value, Token::Ident(_)));
if let Token::Ident(s) = &tok.value {
assert_eq!(s, "foo_bar");
}
}
#[test]
fn test_lex_rejects_invalid() {
// Logos returns errors for unrecognized input
let result = TokenStream::lex("\x00\x01\x02");
assert!(result.is_err());
}
Span Accuracy Tests
Verify spans point to correct source locations:
#[test]
fn test_span_accuracy() {
let source = "let x = 42";
let mut stream = TokenStream::lex(source).unwrap();
let kw: Spanned<LetToken> = stream.parse().unwrap();
assert_eq!(&source[kw.span.start()..kw.span.end()], "let");
let name: Spanned<IdentToken> = stream.parse().unwrap();
assert_eq!(&source[name.span.start()..name.span.end()], "x");
}
Parse Tests
Test AST construction:
#[test]
fn test_parse_key_value() {
let mut stream = TokenStream::lex("name = \"Alice\"").unwrap();
let kv: Spanned<KeyValue> = stream.parse().unwrap();
assert!(matches!(kv.key.value, Key::Bare(_)));
assert!(matches!(kv.value.value, Value::String(_)));
}
#[test]
fn test_parse_error_recovery() {
let mut stream = TokenStream::lex("= value").unwrap();
let result: Result<Spanned<KeyValue>, _> = stream.parse();
assert!(result.is_err());
// Verify error message is helpful
let err = result.unwrap_err();
assert!(err.to_string().contains("expected"));
}
Round-Trip Testing
Verify parse-then-print produces equivalent output:
#[test]
fn test_roundtrip() {
let original = "name = \"value\"\ncount = 42";
let mut stream = TokenStream::lex(original).unwrap();
let doc: Document = stream.parse().unwrap();
let mut printer = Printer::new();
doc.write(&mut printer);
let output = printer.finish();
// Re-parse and compare AST
let mut stream2 = TokenStream::lex(&output).unwrap();
let doc2: Document = stream2.parse().unwrap();
assert_eq!(format!("{:?}", doc), format!("{:?}", doc2));
}
Snapshot Testing
Use insta for golden-file testing:
use insta::assert_snapshot;
#[test]
fn snapshot_complex_document() {
let input = include_str!("fixtures/complex.toml");
let mut stream = TokenStream::lex(input).unwrap();
let doc: Document = stream.parse().unwrap();
assert_snapshot!(format!("{:#?}", doc));
}
#[test]
fn snapshot_formatted_output() {
let input = "messy = \"spacing\"";
let doc: Document = parse(input).unwrap();
let mut printer = Printer::new();
doc.write(&mut printer);
assert_snapshot!(printer.finish());
}
Parameterized Tests
Use test-case for table-driven tests:
use test_case::test_case;
#[test_case("42", Value::Integer(42); "positive integer")]
#[test_case("-17", Value::Integer(-17); "negative integer")]
#[test_case("true", Value::Bool(true); "boolean true")]
#[test_case("false", Value::Bool(false); "boolean false")]
fn test_parse_value(input: &str, expected: Value) {
let mut stream = TokenStream::lex(input).unwrap();
let value: Spanned<Value> = stream.parse().unwrap();
assert_eq!(value.value, expected);
}
Edge Case Testing
Test boundary conditions:
#[test]
fn test_empty_input() {
let stream = TokenStream::lex("").unwrap();
assert!(stream.is_empty());
}
#[test]
fn test_whitespace_only() {
let mut stream = TokenStream::lex(" \t\n ").unwrap();
// peek_token skips whitespace
assert!(stream.peek_token().is_none());
}
#[test]
fn test_max_nesting() {
let nested = "[".repeat(200) + &"]".repeat(200);
let result = parse_array(&nested);
// Should fail with recursion limit error
assert!(matches!(
result,
Err(MyError::RecursionLimit { .. })
));
}
#[test]
fn test_unicode_boundaries() {
// Multi-byte UTF-8: emoji is 4 bytes
let input = "key = \"hello 🦀 world\"";
let mut stream = TokenStream::lex(input).unwrap();
let kv: Spanned<KeyValue> = stream.parse().unwrap();
// Spans should be valid UTF-8 boundaries
let slice = &input[kv.span.start()..kv.span.end()];
assert!(slice.is_char_boundary(0));
}
Fuzz Testing
Setup
Add fuzz targets to your project:
# fuzz/Cargo.toml
[package]
name = "my-parser-fuzz"
version = "0.0.0"
publish = false
edition = "2021"
[package.metadata]
cargo-fuzz = true
[[bin]]
name = "fuzz_lexer"
path = "fuzz_targets/fuzz_lexer.rs"
test = false
doc = false
bench = false
[[bin]]
name = "fuzz_parser"
path = "fuzz_targets/fuzz_parser.rs"
test = false
doc = false
bench = false
[dependencies]
libfuzzer-sys = "0.4"
my-parser = { path = ".." }
Lexer Fuzzing
// fuzz/fuzz_targets/fuzz_lexer.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &[u8]| {
if let Ok(s) = std::str::from_utf8(data) {
// Should never panic
let _ = my_parser::TokenStream::lex(s);
}
});
Parser Fuzzing
// fuzz/fuzz_targets/fuzz_parser.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &[u8]| {
if let Ok(s) = std::str::from_utf8(data) {
if let Ok(mut stream) = my_parser::TokenStream::lex(s) {
// Parse should never panic, only return errors
let _: Result<Document, _> = stream.parse();
}
}
});
Running Fuzzers
# Install cargo-fuzz (requires nightly)
cargo install cargo-fuzz
# Run lexer fuzzer
cargo +nightly fuzz run fuzz_lexer
# Run with timeout and iterations
cargo +nightly fuzz run fuzz_parser -- -max_total_time=60
# Run with corpus
cargo +nightly fuzz run fuzz_parser corpus/parser/
Integration Testing
Test complete workflows:
#[test]
fn test_parse_real_file() {
let content = std::fs::read_to_string("fixtures/config.toml").unwrap();
let doc = parse(&content).expect("should parse real config file");
// Verify expected structure
assert!(doc.get_table("server").is_some());
assert!(doc.get_value("server.port").is_some());
}
Benchmarking
Use divan or criterion for performance testing:
use divan::Bencher;
#[divan::bench]
fn bench_lex_small(bencher: Bencher) {
let input = include_str!("fixtures/small.toml");
bencher.bench(|| TokenStream::lex(input).unwrap());
}
#[divan::bench(args = [100, 1000, 10000])]
fn bench_lex_lines(bencher: Bencher, lines: usize) {
let input = "key = \"value\"\n".repeat(lines);
bencher.bench(|| TokenStream::lex(&input).unwrap());
}
CI Configuration
Example GitHub Actions workflow:
name: Test
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- run: cargo test --all-features
fuzz:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@nightly
- run: cargo install cargo-fuzz
- run: cargo +nightly fuzz run fuzz_lexer -- -max_total_time=30
- run: cargo +nightly fuzz run fuzz_parser -- -max_total_time=30
Test Coverage
Use cargo-llvm-cov for coverage reports:
cargo install cargo-llvm-cov
cargo llvm-cov --html
open target/llvm-cov/html/index.html
Aim for high coverage on:
- All token variants
- All AST node types
- Error paths
- Edge cases (empty, whitespace, limits)