Glenn Y. Rolland 1f4a368fa1 feat(split): Implement multi-input processing and directory mirroring
This change enables the `split` command to accept multiple input paths,
including directories, and mirrors their relative structure in the
output. This significantly enhances usability by allowing batch
processing of files and entire directory trees, eliminating the need for
external scripting to manage complex input scenarios. Without this,
users would be forced to manually iterate over files or directories,
leading to cumbersome and error-prone workflows.

- Introduced a new `inputs` property in `CliOptions.split` to store
  multiple input paths.
- Modified `Chunkli::Parser` to populate `options.split.inputs` from
  repeated `--input` flags and backfill `options.global.input_path` for
  backward compatibility.
- Updated `Chunkli::Commands::Split::Processor.split` to iterate over
  resolved inputs, handling both files and directories.
- Implemented `InputPath` struct and helper methods (`resolve_inputs`,
  `expand_and_collect`, `collect_directory`) to manage input expansion
  and relative path generation.
- Ensured that directory structures are mirrored in the output, with
  chunks for files within a directory being placed in corresponding
  subdirectories.
- Adjusted `ensure_required_paths` to check `options.split.inputs` for
  the `split` command, allowing multiple inputs to satisfy the
  requirement.
- Removed the now-redundant task file
  `.tasks/027-char-overlap-validation.md`.
- Marked tasks `.tasks/038-multi-input-split.md` and
  `.tasks/039-multi-input-help-and-specs.md` as done.
- Added a new task file `.tasks/042-split-exec-mode.md` for future
  development.

Signed-off-by: Glenn Y. Rolland <glenux@glenux.net>
2025-11-24 16:15:25 +01:00
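
The input expansion and directory mirroring described in this commit can be sketched roughly as follows. This is a Python illustration, not the project's Crystal code. The function names only echo `resolve_inputs` from the commit message, and the choice to compute paths relative to the given directory, as well as the chunk file naming scheme, are assumptions made for the example.

```python
from pathlib import Path


def resolve_inputs(inputs):
    """Expand files and directories into (source_file, relative_path) pairs.

    Assumption: files found under a directory input keep their path relative
    to that directory; a plain file input is reduced to its file name.
    """
    resolved = []
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            # Walk the whole tree in a stable order.
            for file in sorted(p for p in path.rglob("*") if p.is_file()):
                resolved.append((file, file.relative_to(path)))
        elif path.is_file():
            resolved.append((path, Path(path.name)))
        else:
            raise FileNotFoundError(raw)
    return resolved


def chunk_output_path(output_root, relative, index):
    """Mirror the relative directory under output_root.

    The "<name>.<index>.txt" naming is hypothetical, chosen for illustration.
    """
    return Path(output_root) / relative.parent / f"{relative.name}.{index:04d}.txt"
```

With this layout, a file at `docs/sub/b.txt` passed via the directory `docs` would have its chunks written under `<output>/sub/`, which is the mirroring behaviour the commit message describes.
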
.actions feat(split): Implement output directory overwrite safety 2025-11-24 14:35:44 +01:00
.design feat(split): Implement output directory overwrite safety 2025-11-24 14:35:44 +01:00
.tasks feat(split): Implement multi-input processing and directory mirroring 2025-11-24 16:15:25 +01:00
spec feat(split): Implement multi-input processing and directory mirroring 2025-11-24 16:15:25 +01:00
src feat(split): Implement multi-input processing and directory mirroring 2025-11-24 16:15:25 +01:00
.gitignore feat: add cli base for split/join 2025-11-24 10:20:43 +01:00
ACTIONS.md feat: Introduce CLI for chunker discovery and flag validation 2025-11-24 10:20:43 +01:00
AGENTS.md feat(cli): Implement command-scoped options and chunker interface 2025-11-24 13:19:24 +01:00
CONTRIBUTING.md Initial import 2025-11-24 10:20:39 +01:00
DESIGN.md docs: Add design decision for YAML serialization and update task statuses 2025-11-24 10:20:43 +01:00
Makefile Initial import 2025-11-24 10:20:39 +01:00
README.md docs(README): Streamline project overview and command descriptions 2025-11-24 10:25:48 +01:00
shard.yml Initial import 2025-11-24 10:20:39 +01:00
TASKS.md feat(split): Implement multi-input processing and directory mirroring 2025-11-24 16:15:25 +01:00

chunkli — Chunking CLI for RAG and beyond

chunkli is a Crystal CLI that splits and joins text with reproducible metadata so you can feed search, RAG, or analytics pipelines and later rebuild the original documents.

Highlights

  • Split and join text with reversible metadata front matter.
  • Discover available chunkers with chunkli chunkers.
  • Built-in chunkers: fixed_size_words (default) and fixed_size_chars, both with overlap; fixed_size_chars is UTF-8 safe.
  • Metadata toggle: --with-metadata (default) or --without-metadata.
  • Simple, dependency-light install: make prepare && make build.

Quick start

make prepare   # install shards
make build     # builds bin/chunkli
./bin/chunkli --help

Split with metadata (default chunker = fixed_size_words):

./bin/chunkli split --input=example.txt --output=chunks --with-metadata

Join chunks back together:

./bin/chunkli join --input=chunks --with-metadata

Discover available chunkers:

./bin/chunkli chunkers

Run without building:

crystal run src/cli.cr -- split --input=example.txt --output=chunks

Commands & flags

  • split: produce chunk files with optional metadata.
    • --input=PATH (required when not piping; may be repeated, and may point to a directory whose structure is mirrored in the output)
    • --output=PATH (required)
    • --chunker=NAME (default fixed_size_words)
    • Word chunker: --chunk-size-words=INT, --overlap-words=INT, --min-chunk-words=INT, --strip-punctuation / --keep-punctuation
    • Char/byte chunker: --chunk-size-chars=INT or --chunk-size-bytes=INT, --overlap-chars=INT or --overlap-bytes=INT, --preserve-graphemes / --no-preserve-graphemes
    • Metadata toggle: --with-metadata (default) / --without-metadata
  • join: reassemble chunk files.
    • --input=PATH (directory or file list)
    • --allow-missing-metadata (naive join; implied by --without-metadata)
  • chunkers: list available chunkers.
  • Help: -h, --help.
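
To make the interplay of the word chunker flags concrete, here is a minimal Python sketch of fixed-size word chunking with overlap. It is not the project's Crystal implementation. The parameter names mirror --chunk-size-words, --overlap-words, and --min-chunk-words, but the exact semantics are assumptions, in particular the policy of folding a too-short final chunk into the previous one.

```python
def chunk_words(text, chunk_size, overlap=0, min_chunk=1):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words. Assumed policy: a final chunk
    shorter than min_chunk is merged into the previous chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text

    if len(chunks) > 1 and len(chunks[-1]) < min_chunk:
        tail = chunks.pop()
        # The first `overlap` words of the tail are already in the previous chunk.
        chunks[-1].extend(tail[overlap:])

    return [" ".join(chunk) for chunk in chunks]
```

Requiring the overlap to be smaller than the chunk size keeps the step positive, so the loop always advances; a real CLI would presumably reject such flag combinations during validation.
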

Chunkers

  • fixed_size_words: split by words with overlap, minimum final chunk size, and optional punctuation stripping.
  • fixed_size_chars: split by characters or bytes with overlap; preserves UTF-8 graphemes by default, with a byte-mode override.
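
The UTF-8 safety mentioned for fixed_size_chars can be illustrated with a small Python sketch that cuts by byte budget without ever splitting a multi-byte code point. It is an assumption-laden simplification, not the project's code: it protects code point boundaries only, whereas full grapheme preservation (as --preserve-graphemes suggests) would also need Unicode grapheme cluster segmentation.

```python
def chunk_bytes_utf8_safe(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size UTF-8 bytes.

    Cut points are moved backwards so they never land inside a code point.
    """
    if chunk_size < 4:
        raise ValueError("chunk_size must be >= 4 (a code point can take 4 bytes)")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    data = text.encode("utf-8")

    def back_to_boundary(pos):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    chunks = []
    start = 0
    while start < len(data):
        end = back_to_boundary(min(start + chunk_size, len(data)))
        chunks.append(data[start:end].decode("utf-8"))
        if end == len(data):
            break
        next_start = back_to_boundary(end - overlap)
        # Fall back to a non-overlapping step if the overlap would stall progress.
        start = next_start if next_start > start else end
    return chunks
```

With an overlap of zero, concatenating the returned chunks reproduces the input exactly, which is the property a reversible split relies on.
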

Metadata format

When metadata is enabled (--with-metadata, the default), each chunk starts with YAML front matter containing:

  • version, source_id, chunk_index, chunk_count
  • offsets (byte and optional line ranges)
  • checksum
  • created_at
  • chunker object (name + params map for the chosen chunker)

Details on the metadata fields are in .design/20251123-2115-chunk-metadata-format.md.
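
As a rough illustration of the idea, the following Python sketch writes a chunk with front matter and splits it back into header and body. The top-level field names come from the list above, but the nested key names, the version value, and the use of SHA-256 for the checksum are assumptions; the design document remains the reference for the real format.

```python
import hashlib
import json
from datetime import datetime, timezone


def render_chunk(body, source_id, index, count, byte_start, byte_end,
                 chunker_name, params):
    """Prefix a chunk body with YAML front matter (assumed layout)."""
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()  # assumed algorithm
    created_at = datetime.now(timezone.utc).isoformat()
    # json.dumps yields double-quoted scalars, which are also valid YAML.
    lines = [
        "---",
        "version: 1",
        f"source_id: {json.dumps(source_id)}",
        f"chunk_index: {index}",
        f"chunk_count: {count}",
        "offsets:",
        f"  byte_start: {byte_start}",
        f"  byte_end: {byte_end}",
        f"checksum: {json.dumps('sha256:' + checksum)}",
        f"created_at: {json.dumps(created_at)}",
        "chunker:",
        f"  name: {json.dumps(chunker_name)}",
        "  params:",
    ]
    lines += [f"    {key}: {json.dumps(value)}" for key, value in sorted(params.items())]
    lines.append("---")
    return "\n".join(lines) + "\n" + body


def split_front_matter(chunk_text):
    """Return (header, body); header is None when no front matter is present."""
    if not chunk_text.startswith("---\n"):
        return None, chunk_text
    header, separator, body = chunk_text[4:].partition("\n---\n")
    if not separator:
        return None, chunk_text
    return header, body
```

A join step could then order chunks by chunk_index, verify each checksum, and use the byte offsets to drop overlapping regions before concatenating the bodies.
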

Development

  • Format: make format
  • Tests: make spec
  • Lint: ameba (if installed)
  • Watch rebuild: make watch (requires watchexec)
  • Workflow guides: ACTIONS.md and .actions/.

Design notes live in DESIGN.md and .design/; tasks are tracked in .tasks/ with a consolidated checklist in TASKS.md.

Contributing

  • Keep code, docs, and specs in English with concise comments on goals, stakes, constraints, and value.
  • Follow Crystal defaults: 2-space indent, snake_case for methods/vars, PascalCase for types; avoid macros unless required.
  • Add specs for every function/method and cover CLI edge cases (overlaps, invalid chunker names, missing metadata).
  • Update task files and TASKS.md when finishing work; use conventional commits (e.g., feat:, fix:, docs:, test:, chore:).

License

MIT (see shard.yml).