Monday, 28 August 2006

Unix file processing magic...

Given a bunch of files like this


- divide each file into chunks small enough to upload to Dartmail.
Dartmail handles a maximum of 1,000,000 bytes.

-- find out how many lines are in each file, and how big each one is

$ wc -l 2.csv

$ ls -l 2.csv
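
With more than a couple of files, the two checks can be rolled into one loop. This is only a sketch: report_sizes is a made-up helper name, and it assumes the files are passed on the command line.

```shell
# report_sizes is a hypothetical helper, not part of the original notes.
# Prints "name lines bytes" for each file given as an argument.
report_sizes() {
    for f in "$@"; do
        # Reading from stdin makes wc print only the number, not the file name.
        lines=$(wc -l < "$f")
        bytes=$(wc -c < "$f")
        # $lines and $bytes are left unquoted on purpose: word splitting
        # drops the padding that some wc implementations add.
        printf '%s %s %s\n' "$f" $lines $bytes
    done
}

# Example: report_sizes *.csv
```

Any file whose byte count is over 1,000,000 needs splitting.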

-- split a file into equal-sized chunks, or into chunks with at most a specified number of lines per output file

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help              display this help and exit
      --version           output version information and exit

SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

Report bugs to <>.

$ split -l 110000 2.csv 2
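
Splitting by line count only stays under the limit if the lines happen to be short enough. The -C option in the help text above caps each piece by size while still breaking on line boundaries, so it may be a better fit. A sketch, assuming GNU split; split_by_size is a made-up helper name, and the default limit is written out in full because the m suffix means 1 Meg (1,048,576 bytes), which would overshoot 1,000,000.

```shell
# split_by_size is a hypothetical helper, not part of the original notes.
# Usage: split_by_size FILE PREFIX [MAX_BYTES]
split_by_size() {
    file=$1
    prefix=$2
    max=${3:-1000000}   # assumed default: the 1,000,000 byte upload limit
    # -C puts as many whole lines as fit within MAX_BYTES into each piece.
    split -C "$max" "$file" "$prefix"
}

# Example: split_by_size 2.csv 2   (creates 2aa, 2ab, ...)
```

If a header row is going to be prepended to the pieces afterwards, pass a slightly smaller limit to leave room for it.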

-- get the header row from the csv

$ head -1 2.csv > header.txt

-- prepend the header file to the chunks that don't have a header row

(doing it for the 3* files)

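$ for i in b c      # b and c are the list of values; we'll be processing 3ab and 3ac
> do
> # The 2nd i is written ${i} so the shell knows where the variable name ends.
> # Before a "." the braces are optional, but they are required when a
> # letter, digit or "_" follows.
> cat header.txt 3a$i > 3a${i}.csv
> rm 3a$i
> done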

do ... done delimits the body of the loop.

-- can also be written on one line, with the commands separated by ";"
$ for i in b c; do cat header.txt 3a$i > 3a${i}.csv ; rm 3a$i; done
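
One thing to watch in the loop above: rm runs even if the cat failed, which would lose the chunk. A slightly more defensive sketch follows. add_header is a made-up helper name, and it assumes the first chunk (3aa) already carries the header and is left out of the argument list.

```shell
# add_header is a hypothetical helper, not part of the original notes.
# Usage: add_header HEADER_FILE CHUNK...
# Writes CHUNK.csv with the header prepended, then removes CHUNK.
add_header() {
    header=$1
    shift
    for chunk in "$@"; do
        # && removes the original only if the .csv was written successfully.
        cat "$header" "$chunk" > "$chunk.csv" && rm "$chunk"
    done
}

# Example: add_header header.txt 3ab 3ac
```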


