
Text file processing in BASH


Hb_Kai


Hey all. Long time, no see. Hope you're all doing well. I'd been thinking of popping in and starting to post again, but life brought me a problem, so I chose to do it now and ask for help instead. :lol:

I have two questions, both more or less about the same type of thing. I'm trying to use the standard Linux BASH shell for both. I don't think there are many differences between Linux distributions here, but just in case there are, the OS I'm doing this on is Ubuntu.

Basically, I have a 26.7GB text file which I would like to "split" into 5GB files. I have tried the "split" command in the terminal in loads of different ways, but I can't make enough sense of the man page to get the options right. I was wondering if anybody knows how I would do this?

All I want is for my world.txt to be broken up into separate 5GB files, with the last piece being a 1.7GB file. I don't mind much about whether .txt is appended to the names, as the command line can still cat the contents to stdout, so that's no bother; the only problem bugging me is breaking the file up into smaller pieces.
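If it helps, the byte-count splitting described above is what split's -b option does. A toy-sized sketch (the part_ output prefix is my own choice; the real call would use 5G):

```shell
# Toy-sized demonstration; the real call would be: split -b 5G world.txt part_
printf 'abcdefghij' > world.txt

# -b splits purely by byte count, regardless of line boundaries.
split -b 4 world.txt part_

# Three pieces result: part_aa (4 bytes), part_ab (4 bytes), part_ac (2 bytes);
# concatenating them in name order reproduces the original file exactly.
cat part_* | cmp - world.txt && echo "recombined OK"
```

The pieces are named with alphabetical suffixes (aa, ab, ac, ...), so a plain `cat part_*` puts them back together in the right order.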

The other thing is I have a database-like text file formatted like the following code snippet:

Name: ***, Dispname: ***, Email: *@*.*, Pass: *******

What I want to do with this file is extract only the email address (*@*.*). I have already gotten the file down to:

***@*.*, Pass

using grep, tsort and sed, until it finally worked with grep. But then I tried tr and cut, like:

tr -d ", P" <some-other-file.txt > yet-another.txt

...thinking it would only trim/cut that exact sequence of characters (", Pass", including the space), but instead it also trimmed/cut all the "p", "a" and "s" characters out of the email addresses themselves, and I need the addresses whole, not broken.
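For what it's worth, tr -d deletes every occurrence of each individual character listed in its argument, not the literal string, which is why the p/a/s characters vanished from the addresses too. One way to pull just the addresses out in a single step is grep's -o flag, sketched here on a made-up record (file names and sample data are mine):

```shell
# One made-up record in the format described above.
echo 'Name: jo, Dispname: joey, Email: jo@example.com, Pass: secret' > dbfile.txt

# grep -E -o prints only the part of each line that matches the pattern,
# so none of the surrounding fields reach the output.
grep -E -o '[^ ,]+@[^ ,]+' dbfile.txt > emails.txt

cat emails.txt   # jo@example.com
```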

I know these are probably unusual questions, but I have been trying to do this for the last day and it's really beginning to give me a headache, so I thought I would try to find some help instead.

Does anybody have a clue how I would go about either one of these tasks or did I lose everyone? :lol:

Thanks in advance anyway.


You can use gawk/awk to split each line so that only the email addresses are shown. I've only done this on a Windows machine, not on the Unix side where awk came from, but there is no reason it shouldn't work. Create a file called whatever.awk and paste this in:


BEGIN {

    FS = ","   # The field separator is a comma.
    OFS = ","  # Not actually needed as you only want one field, but won't hurt.

}

{

    sub(/^[ ]*Email:[ ]*/, "", $3)  # Drop the leading " Email: " label from the field.
    print $3                        # $3 is the email (third) field. $0 is the entire row/record.

}

On the terminal you would enter: gawk -f whatever.awk dbfile.txt > emailaddys.txt (the -f flag tells gawk to read the program from the file).
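A quick way to sanity-check the script against one made-up record (the sample data and file names here are mine, and the script is recreated inline so the example is self-contained):

```shell
# Recreate the awk script.
cat > whatever.awk <<'EOF'
BEGIN { FS = "," }
{
    sub(/^[ ]*Email:[ ]*/, "", $3)  # drop the " Email: " label
    print $3
}
EOF

# One record in the format from the original post.
echo 'Name: jo, Dispname: joey, Email: jo@example.com, Pass: secret' > dbfile.txt

awk -f whatever.awk dbfile.txt   # jo@example.com
```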


Why not write a tiny little C program for each of the problems? It wouldn't take more than a dozen lines of code for each. The split program could even be made to cut the chunks only at the end of a line (instead of cutting when 5,368,709,120 characters have been copied).
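For what it's worth, GNU split can already do that line-boundary variant without any C: the -C (--line-bytes) option caps each piece at a byte size but only ever cuts at a newline. A toy-sized sketch (sizes and file names scaled down and chosen by me):

```shell
# Small stand-in for the big file; the real call would be: split -C 5G world.txt part_
seq 1 1000 > world.txt

# Each piece is at most 1000 bytes, and no line is ever split across pieces.
split -C 1000 world.txt part_

# The pieces concatenate back to the original byte-for-byte.
cat part_* | cmp - world.txt && echo "pieces match"
```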


My understanding is that 'split' goes by line count

so :

split -l 5 content.txt output/data_

will split the file into chunks of 5 lines each. Now all you need to know is how many lines are in your original file (I believe 'wc -l' is the command for that) and divide it by the number of files you need.
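The count-then-divide idea can be sketched like this on toy data (the file and directory names, and the choice of four pieces, are mine):

```shell
# Toy stand-in: 100 lines instead of millions.
seq 1 100 > content.txt
mkdir -p output

# Count the lines, then split into four pieces of (roughly) equal line count.
# Rounding the division up ensures no fifth leftover file appears.
lines=$(wc -l < content.txt)
split -l $(( (lines + 3) / 4 )) content.txt output/data_

ls output | wc -l   # 4
```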

I suggest though, that you post the same question in the Debian forums, they will have the answer.

