Tree-Sitter Compression
- How It Works
- Supported Languages
- Usage
- What Gets Extracted
- What Gets Removed
- Compression Metrics
- When to Use Compression
- ABI Compatibility
ruley uses tree-sitter grammars to compress source code before sending it to the LLM. This reduces token count by approximately 70%, significantly lowering costs for large codebases.
How It Works
Tree-sitter parses source files into abstract syntax trees (ASTs). ruley walks these ASTs to extract structural elements – function signatures, type definitions, class declarations, imports – while removing implementation bodies. The result is a compressed representation that preserves the project’s API surface and architecture while discarding the details.
Before Compression
pub fn analyze_codebase(path: &Path, config: &Config) -> Result<Analysis> {
    let files = scan_files(path, config)?;
    let mut analysis = Analysis::new();
    for file in &files {
        let content = std::fs::read_to_string(&file.path)?;
        let tokens = tokenize(&content);
        analysis.add_file(file, tokens);
    }
    analysis.finalize()
}
After Compression
pub fn analyze_codebase(path: &Path, config: &Config) -> Result<Analysis> { ... }
The LLM sees the function signature, return type, and parameter types – enough to understand the codebase’s API surface without the implementation details.
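To make the before/after transformation concrete, here is a toy sketch that produces the same shape of output. It is not ruley's implementation: instead of walking a tree-sitter AST it uses naive brace matching, and the function name `collapse_bodies` is hypothetical.

```rust
/// Toy stand-in for AST-based compression (an assumption for illustration,
/// not ruley's code): replaces every top-level `{ ... }` block with the
/// literal placeholder `{ ... }` and keeps everything outside it.
fn collapse_bodies(source: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize; // current brace nesting level

    for ch in source.chars() {
        match ch {
            '{' => {
                // Emit the placeholder once, when a top-level block opens.
                if depth == 0 {
                    out.push_str("{ ... }");
                }
                depth += 1;
            }
            '}' => {
                // saturating_sub tolerates a stray closing brace.
                depth = depth.saturating_sub(1);
            }
            // Text outside any block (signatures, imports) is kept.
            _ if depth == 0 => out.push(ch),
            // Text inside a block is dropped.
            _ => {}
        }
    }
    out
}
```

This shortcut miscounts braces inside string literals and comments, and it cannot tell a function body from a struct definition, so it would also collapse type definitions that ruley keeps. Those limits are the reason a real parser is used.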
Supported Languages
Each language requires a tree-sitter grammar compiled into ruley via Cargo feature flags:
| Language | Feature Flag | Grammar Version |
|---|---|---|
| TypeScript | compression-typescript (default) | tree-sitter-typescript 0.23.2 |
| Python | compression-python | tree-sitter-python 0.25.0 |
| Rust | compression-rust | tree-sitter-rust 0.24.0 |
| Go | compression-go | tree-sitter-go 0.25.0 |
Enable all languages with:
cargo install ruley --features compression-all
Files in unsupported languages are included at full size (no compression applied).
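The fallback for unsupported languages can be pictured as a lookup from file extension to grammar. The sketch below is illustrative only: the `Grammar` enum, `grammar_for`, `prepare`, and the extension list are assumptions, not ruley's actual types, and it ignores whether a grammar was enabled through its feature flag.

```rust
use std::path::Path;

/// Hypothetical enum mirroring the four languages in the table above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Grammar {
    TypeScript,
    Python,
    Rust,
    Go,
}

/// Picks a grammar from the file extension (the extension list is an assumption).
fn grammar_for(path: &Path) -> Option<Grammar> {
    match path.extension()?.to_str()? {
        "ts" | "tsx" => Some(Grammar::TypeScript),
        "py" => Some(Grammar::Python),
        "rs" => Some(Grammar::Rust),
        "go" => Some(Grammar::Go),
        _ => None,
    }
}

/// Compresses when a grammar is available; otherwise passes the file through
/// at full size. `compress` stands in for the real compression step.
fn prepare(path: &Path, source: &str, compress: impl Fn(Grammar, &str) -> String) -> String {
    match grammar_for(path) {
        Some(grammar) => compress(grammar, source),
        None => source.to_string(),
    }
}
```

Returning `Option<Grammar>` keeps the "no grammar" case explicit, so the full-size fallback is a single `None` arm rather than a special case scattered through the pipeline.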
Usage
Enable compression with the --compress flag:
ruley --compress
Or in the config file:
[general]
compress = true
What Gets Extracted
The compression extracts structural elements that help the LLM understand your codebase:
- Functions: Signatures, parameters, return types
- Types: Struct/class definitions, enum variants, type aliases
- Traits/Interfaces: Method signatures
- Imports: Module dependencies
- Constants: Top-level constant definitions
- Module structure: File and directory organization
What Gets Removed
ruley strips implementation details that don’t affect the LLM’s understanding of conventions:
- Function bodies (replaced with { ... })
- Loop internals
- Conditional branches
- Local variable assignments
- Comments (optional, depending on grammar)
Compression Metrics
ruley tracks and reports compression statistics:
- Total files: Number of files processed
- Original size: Total bytes before compression
- Compressed size: Total bytes after compression
- Compression ratio: Ratio of compressed to original (lower is better)
These metrics are displayed during pipeline execution and in the final summary.
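As a rough illustration of how these four numbers relate, the sketch below accumulates per-file sizes and derives the ratio. The struct and method names are hypothetical, not ruley's API.

```rust
/// Hypothetical accumulator for the statistics listed above.
#[derive(Debug, Default)]
struct CompressionStats {
    total_files: usize,
    original_bytes: u64,
    compressed_bytes: u64,
}

impl CompressionStats {
    /// Records one file's size before and after compression.
    fn record(&mut self, original: &str, compressed: &str) {
        self.total_files += 1;
        self.original_bytes += original.len() as u64;
        self.compressed_bytes += compressed.len() as u64;
    }

    /// Compressed size divided by original size; lower is better.
    /// Returns 1.0 (no reduction) when nothing has been recorded.
    fn ratio(&self) -> f64 {
        if self.original_bytes == 0 {
            return 1.0;
        }
        self.compressed_bytes as f64 / self.original_bytes as f64
    }
}
```

Under this definition, the roughly 70% reduction mentioned at the top of the page corresponds to a ratio of about 0.30.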
When to Use Compression
Use compression when:
- Your codebase is large (>1000 files or >500K tokens)
- You want to minimize LLM costs
- The codebase is written mainly in languages that ruley has tree-sitter grammars for
Skip compression when:
- Your codebase is small (the cost savings are negligible)
- You need the LLM to see implementation details for accurate convention extraction
- Your primary language doesn’t have a tree-sitter grammar in ruley
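The guidance above can be condensed into a small rule of thumb. This is only a sketch of the checklist, not something ruley decides for you: the function name and parameters are hypothetical, and it leaves out the judgment call about whether the LLM needs to see implementation details.

```rust
/// Size thresholds taken from the guidance above.
const FILE_THRESHOLD: usize = 1_000;
const TOKEN_THRESHOLD: usize = 500_000;

/// Hypothetical helper: suggests `--compress` for large codebases whose
/// primary language has a grammar compiled into ruley.
fn should_compress(
    file_count: usize,
    estimated_tokens: usize,
    primary_language_supported: bool,
) -> bool {
    let is_large = file_count > FILE_THRESHOLD || estimated_tokens > TOKEN_THRESHOLD;
    is_large && primary_language_supported
}
```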
ABI Compatibility
ruley uses tree-sitter 0.26.x (ABI v15). Language parsers may use slightly older ABI versions:
- tree-sitter-go 0.25.0: ABI v15
- tree-sitter-python 0.25.0: ABI v15
- tree-sitter-rust 0.24.0: ABI v15
- tree-sitter-typescript 0.23.2: ABI v14 (one version behind, still accepted by the core library)
The tree-sitter core library supports backward-compatible ABI versions, so older grammar versions work correctly.
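The compatibility rule amounts to a range check: a grammar loads if its ABI version is no newer than the core library's and no older than the minimum the library still accepts. The sketch below hard-codes both bounds. The value 15 comes from the text above; the lower bound of 13 is an assumption for illustration. The tree-sitter Rust crate exposes the real values as the constants `LANGUAGE_VERSION` and `MIN_COMPATIBLE_LANGUAGE_VERSION`, which are the ones to check in practice.

```rust
/// ABI version of the core library, per the text above.
const CORE_ABI: u32 = 15;
/// Oldest grammar ABI the core library accepts (assumed value for illustration).
const MIN_COMPATIBLE_ABI: u32 = 13;

/// Returns true when a grammar built against `grammar_abi` can be loaded.
fn is_abi_compatible(grammar_abi: u32) -> bool {
    (MIN_COMPATIBLE_ABI..=CORE_ABI).contains(&grammar_abi)
}
```

With these bounds, both the v15 grammars and the v14 TypeScript grammar pass, while a grammar generated for a newer ABI than the core library would be rejected.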