This change enables the `split` command to accept multiple input paths, including directories, and mirrors their relative structure in the output. This significantly enhances usability by allowing batch processing of files and entire directory trees, eliminating the need for external scripting to manage complex input scenarios. Without this, users would be forced to manually iterate over files or directories, leading to cumbersome and error-prone workflows.

- Introduced a new `inputs` property in `CliOptions.split` to store multiple input paths.
- Modified `Chunkli::Parser` to populate `options.split.inputs` from repeated `--input` flags and backfill `options.global.input_path` for backward compatibility.
- Updated `Chunkli::Commands::Split::Processor.split` to iterate over resolved inputs, handling both files and directories.
- Implemented `InputPath` struct and helper methods (`resolve_inputs`, `expand_and_collect`, `collect_directory`) to manage input expansion and relative path generation.
- Ensured that directory structures are mirrored in the output, with chunks for files within a directory being placed in corresponding subdirectories.
- Adjusted `ensure_required_paths` to check `options.split.inputs` for the `split` command, allowing multiple inputs to satisfy the requirement.
- Removed the now-redundant task file `.tasks/027-char-overlap-validation.md`.
- Marked tasks `.tasks/038-multi-input-split.md` and `.tasks/039-multi-input-help-and-specs.md` as done.
- Added a new task file `.tasks/042-split-exec-mode.md` for future development.

Signed-off-by: Glenn Y. Rolland <glenux@glenux.net>
chunkli — Chunking CLI for RAG and beyond
chunkli is a Crystal CLI that splits and joins text with reproducible metadata so you can feed search, RAG, or analytics pipelines and later rebuild the original documents.
Highlights
- Split and join text with reversible metadata front matter.
- Discover available chunkers with `chunkli chunkers`.
- Built-in chunkers: `fixed_size_words` (default) and `fixed_size_chars` with overlap and UTF-8 safety.
- Metadata toggle: `--with-metadata` (default) or `--without-metadata`.
- Simple, dependency-light install: `make prepare && make build`.
Quick start
make prepare # install shards
make build # builds bin/chunkli
./bin/chunkli --help
Split with metadata (default chunker = fixed_size_words):
./bin/chunkli split --input=example.txt --output=chunks --with-metadata
Join chunks back together:
./bin/chunkli join --input=chunks --with-metadata
Discover available chunkers:
./bin/chunkli chunkers
Run without building:
crystal run src/cli.cr -- split --input=example.txt --output=chunks
Commands & flags
- `split`: produce chunk files with optional metadata.
  - `--input=PATH` (required when not piping)
  - `--output=PATH` (required)
  - `--chunker=NAME` (default `fixed_size_words`)
  - Word chunker: `--chunk-size-words=INT`, `--overlap-words=INT`, `--min-chunk-words=INT`, `--strip-punctuation` / `--keep-punctuation`
  - Char/byte chunker: `--chunk-size-chars=INT` or `--chunk-size-bytes=INT`, `--overlap-chars=INT`, `--overlap-bytes=INT`, `--preserve-graphemes` / `--no-preserve-graphemes`
  - Metadata toggle: `--with-metadata` (default) / `--without-metadata`
- `join`: reassemble chunk files.
  - `--input=PATH` (directory or file list)
  - `--allow-missing-metadata` (naive join; implied by `--without-metadata`)
- `chunkers`: list available chunkers.
- Help: `-h`, `--help`.
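To picture why a byte-sized chunk limit needs extra care with UTF-8 text, here is a minimal Python sketch. It is not chunkli's Crystal implementation: the function name `split_bytes_utf8_safe` and its parameters are hypothetical, and it only avoids cutting inside a multi-byte code point, which is a weaker guarantee than preserving whole grapheme clusters.

```python
def split_bytes_utf8_safe(text, chunk_size_bytes, overlap_bytes=0):
    """Split text into chunks of at most chunk_size_bytes UTF-8 bytes.

    Hypothetical helper: boundaries are moved left so that no chunk starts
    or ends in the middle of a multi-byte code point.
    """
    if chunk_size_bytes < 4:
        # A single UTF-8 code point can take up to 4 bytes.
        raise ValueError("chunk_size_bytes must be at least 4")
    if not 0 <= overlap_bytes < chunk_size_bytes:
        raise ValueError("overlap_bytes must be >= 0 and < chunk_size_bytes")

    data = text.encode("utf-8")

    def align_back(pos):
        # Continuation bytes look like 0b10xxxxxx; step left past them.
        while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    chunks = []
    start = 0
    while start < len(data):
        end = align_back(min(start + chunk_size_bytes, len(data)))
        chunks.append(data[start:end].decode("utf-8"))
        if end >= len(data):
            break
        next_start = align_back(end - overlap_bytes)
        # Guarantee forward progress when alignment eats the whole step.
        start = next_start if next_start > start else end
    return chunks
```

With `overlap_bytes=0`, concatenating the returned chunks gives back the input text, and each chunk stays within the byte budget even when a boundary would otherwise land inside an accented character.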
Chunkers
- `fixed_size_words`: split by words with overlap, minimum final chunk size, and optional punctuation stripping.
- `fixed_size_chars`: split by characters or bytes with overlap; preserves UTF-8 graphemes by default, with a byte-mode override.
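The idea behind a fixed-size word chunker with overlap can be sketched in a few lines of Python. This is an illustration only, not chunkli's Crystal code: the function name `chunk_words`, its parameter names, and the policy of merging an undersized final chunk into the previous one are all assumptions.

```python
def chunk_words(text, chunk_size=5, overlap=2, min_chunk=2):
    """Split text into chunks of chunk_size words sharing `overlap` words.

    Hypothetical sketch: a final chunk shorter than min_chunk is merged
    into the previous chunk (assumed policy, not confirmed by the source).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
        start += step

    # Merge an undersized final chunk into its predecessor.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk:
        tail = chunks.pop()
        # The first `overlap` words of the tail are already in the
        # previous chunk, so only the remaining words are appended.
        chunks[-1].extend(tail[overlap:])
    return [" ".join(chunk) for chunk in chunks]
```

For example, `chunk_words("a b c d e f g h i j", chunk_size=5, overlap=1, min_chunk=3)` produces two chunks, because the two-word remainder is too short to stand alone and is folded into the second chunk.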
Metadata format
When `--with-metadata` is set, each chunk starts with a YAML front matter containing:
- `version`, `source_id`, `chunk_index`, `chunk_count`
- `offsets` (byte and optional line ranges)
- `checksum`
- `created_at`
- `chunker` object (`name` + `params` map for the chosen chunker)

Details on the metadata fields are in `.design/20251123-2115-chunk-metadata-format.md`.
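To illustrate how front matter of this kind makes a split reversible, the Python sketch below writes a header, parses it back, and stitches chunks together using byte offsets. The exact YAML layout is defined in the design note and is not reproduced here, so the key names under `offsets` (`byte_start`, `byte_end`), the `sha256:` checksum prefix, and all function names are assumptions.

```python
import hashlib
from datetime import datetime, timezone


def render_chunk(body, source_id, index, count, byte_start, chunker, params):
    """Prefix a chunk body with a hypothetical YAML front matter block."""
    raw = body.encode("utf-8")
    lines = [
        "---",
        "version: 1",
        f"source_id: {source_id}",
        f"chunk_index: {index}",
        f"chunk_count: {count}",
        "offsets:",
        f"  byte_start: {byte_start}",
        f"  byte_end: {byte_start + len(raw)}",
        f"checksum: sha256:{hashlib.sha256(raw).hexdigest()}",  # assumed algorithm
        f'created_at: "{datetime.now(timezone.utc).isoformat()}"',
        "chunker:",
        f"  name: {chunker}",
        "  params:",
    ]
    lines += [f"    {key}: {value}" for key, value in sorted(params.items())]
    lines.append("---")
    return "\n".join(lines) + "\n" + body


def parse_chunk(chunk_text):
    """Return (fields, body); fields is a flat dict of 'key: value' lines."""
    if not chunk_text.startswith("---\n"):
        return {}, chunk_text
    header, sep, body = chunk_text[4:].partition("\n---\n")
    if not sep:
        return {}, chunk_text
    fields = {}
    for line in header.splitlines():
        key, _, value = line.strip().partition(":")
        if value.strip():
            fields[key] = value.strip()
    return fields, body


def join_chunks(chunk_texts):
    """Rebuild the source text, dropping bytes repeated by chunk overlap."""
    parsed = sorted(
        (parse_chunk(text) for text in chunk_texts),
        key=lambda item: int(item[0]["chunk_index"]),
    )
    out = bytearray()
    for fields, body in parsed:
        start = int(fields["byte_start"])
        if start > len(out):
            raise ValueError("gap between chunks: a chunk is missing")
        # Skip the bytes already written through the previous chunk.
        out += body.encode("utf-8")[len(out) - start:]
    return out.decode("utf-8")
```

Because each chunk records where it started in the source, the join step does not depend on file order and can discard overlapping bytes instead of duplicating them.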
Development
- Format: `make format`
- Tests: `make spec`
- Lint: `ameba` (if installed)
- Watch rebuild: `make watch` (requires `watchexec`)
- Workflow guides: `ACTIONS.md` and `.actions/`.

Design notes live in `DESIGN.md` and `.design/`; tasks are tracked in `.tasks/` with a consolidated checklist in `TASKS.md`.
Contributing
- Keep code, docs, and specs in English with concise comments on goals, stakes, constraints, and value.
- Follow Crystal defaults: 2-space indent, `snake_case` for methods/vars, `PascalCase` for types; avoid macros unless required.
- Add specs for every function/method and cover CLI edge cases (overlaps, invalid chunker names, missing metadata).
- Update task files and `TASKS.md` when finishing work; use conventional commits (e.g., `feat:`, `fix:`, `docs:`, `test:`, `chore:`).
License
MIT (see shard.yml).