Monday 28 August 2006

Unix file processing magic...


Given a bunch of files like this

2.csv
3.csv

- divide each file into chunks small enough to upload to Dartmail.
Dartmail handles a maximum of 1,000,000 bytes.

-- find out how many lines are in each file, and how big each one is

wc -l 2.csv

ls -l 2.csv
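
The two numbers above are enough for a rough estimate of how many lines fit under the upload limit. What follows is only a sketch: sample.csv and its contents are made-up stand-ins for the real files, and the result is an average, so a run of unusually long lines could still push a chunk over the limit.

```shell
# Work in a scratch directory with a generated sample file (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,email"; seq 1 50000 | sed 's/$/,user@example.com/'; } > sample.csv

limit=1000000                      # upload limit in bytes, from the note above
lines=$(wc -l < sample.csv)        # total number of lines, header included
bytes=$(wc -c < sample.csv)        # total size in bytes

# Average bytes per line (rounded up), then how many such lines fit in the limit.
avg=$(( (bytes + lines - 1) / lines ))
per_chunk=$(( limit / avg ))
echo "lines=$lines bytes=$bytes avg=$avg per_chunk=$per_chunk"
```

The per_chunk value is a starting point for split -l; picking a somewhat smaller round number leaves room for the header row and for longer-than-average lines.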

-- split a file into equal-sized chunks, or into segments with at most a
specified number of lines

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help              display this help and exit
      --version           output version information and exit

SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

Report bugs to <bug-coreutils@gnu.org>.

$ split -l 110000 2.csv 2
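
An alternative to guessing a line count is the -C option from the help text above, which caps each piece by bytes while keeping lines whole. This is a sketch assuming GNU split; sample.csv and the part_ prefix are made-up names, not the files from this post.

```shell
# Scratch directory with a generated sample file (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,email"; seq 1 50000 | sed 's/$/,user@example.com/'; } > sample.csv

# -C never breaks a line in the middle and keeps each output file
# at or below the given number of bytes.
split -C 1000000 sample.csv part_

# Show the resulting pieces and their sizes.
ls -l part_*
```

If a header row is going to be added to the pieces afterwards, the byte limit passed to -C should be lowered by the size of that header.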

-- get header row from csv

$ head -1 2.csv > header.txt

-- add header file to other files that don't have header row

(doing it for the 3* files)

$ for i in b c      # b and c are the list of values; we'll be processing 3ab and 3ac
> do
> cat header.txt 3a$i > 3a${i}.csv      # {} marks where the variable name ends; not strictly needed before a ., but harmless
> rm 3a$i
> done

do ... done delimits the body of the loop.

-- can also be written on one line, with commands separated by ";"
for i in b c; do cat header.txt 3a$i > 3a${i}.csv ; rm 3a$i; done
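
The steps above can be combined so that every piece gets the header, without treating the first piece as a special case. This is a sketch of one way to do it, not the exact commands used here: the header is taken off before splitting, and sample.csv, the chunk_ prefix and the 100-line size are made-up values for illustration.

```shell
# Scratch directory with a small generated sample file (hypothetical data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{ echo "id,email"; seq 1 250 | sed 's/$/,user@example.com/'; } > sample.csv

# Save the header row, then split only the data rows (line 2 onwards).
head -1 sample.csv > header.txt
tail -n +2 sample.csv | split -l 100 - chunk_

# Prepend the header to every piece and remove the headerless original.
for f in chunk_??; do
    cat header.txt "$f" > "$f.csv"
    rm "$f"
done

ls chunk_*.csv
```

The glob chunk_?? matches only the two-letter suffixes that split generates by default, so the new .csv files are never picked up by the loop.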

=================================

history of all commands:

split
split -h
split --help
cd e:
less 0.csv
q
wc -l 2.csv
split -l 110000 2.csv 2
ls 2*
ls -l 2*
head -1 2.csv > header.txt
cat header.txt 2ab > 2ab.csv
less 2ab.csv
cat header.txt 2ac > 2ac.csv
cat header.txt 3ab > 3ab.csv
mv 2aa 2aa.csv
rm 2ab 2ac 2ad
ls 2*
wc -l 3.csv
ls -l 3.csv
split -l 90000 3.csv 3
ls -l 3*
for i in b c; do cat header.txt 3a$i > 3a${i}.csv ; rm 3a$i; done
ls 3*
mv 3aa 3aa.csv
less 3ab.csv
wc -l 4.csv
ls -l 4.csv
ls -l 5.csv
ls -l 6.csv
ls -l 7.csv
split --help
history
