Splitter

The Splitter module divides a single large file into multiple smaller files, each holding a fixed number of lines. Use it to break massive datasets into manageable chunks that can be processed more efficiently or distributed across multiple systems.

Splitting large files is particularly useful when working with datasets that exceed system memory limits or when parallel processing of smaller chunks is more efficient than sequential processing of one large file.

Options

File or folder containing the files you want to split.

Example: large_file.txt

Usage Guide

Follow these steps to effectively use the Splitter module:

Select Source Files

Choose the file or folder containing the large files you want to split.

Define Chunk Size

Set the number of lines each output file should contain. Consider:

Your system's processing capabilities
The requirements of downstream tools
The desired granularity for parallel processing

Set Output Prefix

Specify a prefix for the output files to organize them meaningfully. The default "Chunk" works well for basic usage.

Execute Splitting

Run the module to split the selected files into multiple smaller files.

Verify Results

Check the output directory for the split files and verify that each one holds the expected number of lines and that, taken together, they cover all of the original data.
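
One way to run this check outside the module is to compare a hash of the source file with a hash of the chunks read back in order. The sketch below is illustrative only: the function names are hypothetical, and it assumes the chunks are byte-for-byte copies of the original lines with nothing added or removed, which this page does not confirm.

```python
import hashlib


def combined_digest(paths, block_size=1024 * 1024):
    """Return the SHA-256 of the given files read back to back."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as handle:
            # Read in fixed-size blocks so large files never load fully into memory.
            for block in iter(lambda: handle.read(block_size), b""):
                digest.update(block)
    return digest.hexdigest()


def chunks_match_source(source, chunk_paths):
    """True if the chunks, in the order given, reproduce the source exactly.

    Assumption: the splitter copies lines unchanged (no headers, no
    re-encoding). The caller must pass chunk_paths in original order.
    """
    return combined_digest([source]) == combined_digest(chunk_paths)
```

A mismatch does not say which chunk is wrong, only that the concatenation differs from the source, so it works best as a quick pass/fail test.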

Example Use Cases

When working with database files that are too large to load into memory:

Set Chunk to a size that fits comfortably in memory (e.g., 1,000,000 lines)
Use a descriptive Prefix (e.g., "DB_Part_")
Run the module
Process each smaller chunk sequentially

This approach allows you to work with extremely large datasets without overwhelming system resources.
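
The last step, processing each chunk in turn, might look like the following sketch. The function name and the callback are hypothetical, and the output file names are assumed to start with the chosen prefix and to sort in the original order (for example, zero-padded numbers); the module's actual naming scheme is not documented here.

```python
from pathlib import Path


def process_chunks(out_dir, prefix, handle_line):
    """Stream every chunk whose name starts with `prefix`, one line at a time.

    Returns the total number of lines passed to `handle_line`.
    Assumption: sorting the file names alphabetically yields the original order.
    """
    total = 0
    for chunk in sorted(Path(out_dir).glob(f"{prefix}*")):
        with open(chunk, "r", encoding="utf-8") as handle:
            for line in handle:
                # Only one line is held in memory at any moment.
                handle_line(line.rstrip("\n"))
                total += 1
    return total
```

For instance, process_chunks("output", "DB_Part_", print) would echo every line of every chunk while keeping memory use flat.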

Line Counting

The Splitter module counts the lines in the input file to determine how many chunks to create. Counting requires a full pass over the file, so for very large files expect a delay before splitting begins.
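
To estimate the number of chunks ahead of time, the counting step can be approximated as below. This is a sketch, not the module's own code: both function names are hypothetical, and it assumes lines end with "\n" and that a final line without a trailing newline still counts as a line.

```python
def count_lines(path, block_size=1024 * 1024):
    """Count lines by scanning the file in fixed-size binary blocks."""
    lines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            lines += block.count(b"\n")
            last_byte = block[-1:]
    # A non-empty file that does not end in a newline has one more line.
    if last_byte and last_byte != b"\n":
        lines += 1
    return lines


def chunks_needed(total_lines, chunk_size):
    """Ceiling division: the last chunk may hold fewer lines than the rest."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return (total_lines + chunk_size - 1) // chunk_size
```

With these definitions, a file of 2,500,000 lines and a chunk size of 1,000,000 gives chunks_needed(2_500_000, 1_000_000) == 3: two full chunks and one of 500,000 lines.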

Best Practices

Optimal Chunk Size: Choose a chunk size that balances file count with manageability. Too many small files can be unwieldy to manage, while too few large chunks may not provide enough granularity.
Meaningful Prefixes: Use descriptive prefixes that indicate the content and purpose of the files to simplify organization and tracking.
Tracking Original Order: When sequence matters, consider including sequential numbering in your prefix scheme to maintain the original order.
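
Why numbering style matters for order can be seen in a small sketch. The helper below is hypothetical and does not describe how the module names its files; it only shows that zero-padded numbers keep alphabetical order and numeric order in agreement.

```python
def chunk_name(prefix, index, total_chunks, extension=".txt"):
    """Build a name whose number is zero-padded to the width of total_chunks."""
    width = len(str(total_chunks))
    return f"{prefix}{index:0{width}d}{extension}"


# Unpadded numbers sort incorrectly as text: "10" comes before "2".
unpadded = sorted(f"Chunk{i}.txt" for i in (1, 2, 10))

# Padded numbers sort the same way alphabetically and numerically.
padded = sorted(chunk_name("Chunk", i, 10) for i in (1, 2, 10))
```

Here unpadded is ["Chunk1.txt", "Chunk10.txt", "Chunk2.txt"], while padded is ["Chunk01.txt", "Chunk02.txt", "Chunk10.txt"].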

Line Integrity

The Splitter module operates on a line-by-line basis and will never split in the middle of a line, preserving the integrity of individual entries.
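
A minimal sketch of line-based splitting shows why no line is ever cut: the input is read one whole line at a time, and each group of lines is written out unchanged. The function name, the numbered ".txt" output names, and the UTF-8 encoding are assumptions made for illustration, not the module's documented behavior.

```python
from itertools import islice
from pathlib import Path


def split_by_lines(source, out_dir, chunk_size, prefix="Chunk"):
    """Write consecutive groups of `chunk_size` lines to numbered files."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # newline="" keeps the original line endings untouched on read and write.
    with open(source, "r", encoding="utf-8", newline="") as src:
        index = 1
        while True:
            # islice pulls at most chunk_size complete lines from the file.
            lines = list(islice(src, chunk_size))
            if not lines:
                break
            target = out_dir / f"{prefix}{index}.txt"  # assumed naming scheme
            with open(target, "w", encoding="utf-8", newline="") as dst:
                dst.writelines(lines)
            written.append(target)
            index += 1
    return written
```

Because the unit of work is a whole line, the only chunk that can differ in size is the last one, which receives whatever lines remain.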

Related Modules

Joiner - For recombining split files
Filter - For selectively processing specific content within splits