
Text file processing in BASH


Hb_Kai


Hey all. Long time, no see. Hope you're all doing well. I'd been thinking of popping in and starting to post again, but life brought me a problem, so I chose to do it now and ask for help instead. :lol:

I have two questions, both more or less about the same type of thing. I'm trying to use the standard Linux BASH shell for both. I don't think there are many differences between Linux distributions here, but just in case there are, the OS I'm doing this on is Ubuntu.

Basically, I have a 26.7GB text file which I would like to "split" into 5GB files. I have tried the "split" command in the terminal in loads of different ways, but I can't make enough sense of the man page to get the options right. I was wondering if anybody knows how I would do this?

All I want is for my world.txt to be broken up into separate 5GB files, with the last piece being a 1.7GB file. I don't mind much about whether .txt is appended to the names, as the command line can still cat the contents to stdout, so that's no bother; the only problem bugging me is breaking the file up into smaller pieces.
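If it helps, the byte-count splitting described above is what split's -b option does. A toy-sized sketch (the part_ output prefix is my own choice; the real call would use 5G):

```shell
# Toy-sized demonstration; the real call would be: split -b 5G world.txt part_
printf 'abcdefghij' > world.txt

# -b splits purely by byte count, regardless of line boundaries.
split -b 4 world.txt part_

# Three pieces result: part_aa (4 bytes), part_ab (4 bytes), part_ac (2 bytes);
# concatenating them in name order reproduces the original file exactly.
cat part_* | cmp - world.txt && echo "recombined OK"
```

The pieces are named with alphabetical suffixes (aa, ab, ac, ...), so a plain `cat part_*` puts them back together in the right order.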

The other thing is I have a database-like text file formatted like the following code snippet:

Name: ***, Dispname: ***, Email: *@*.*, Pass: *******

What I want to do with this file is extract only the email address (*@*.*). I have already gotten the file down to:

***@*.*, Pass

using grep, tsort and sed, until it finally worked with grep. But then I tried tr and cut, like:

tr -d ", P" <some-other-file.txt > yet-another.txt

...thinking it would only trim/cut that exact sequence of characters (", Pass", including the space), but instead it also trimmed/cut all the "p", "a" and "s" characters out of the email addresses themselves, and I need the addresses whole, not broken.
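For what it's worth, tr -d deletes every occurrence of each individual character listed in its argument, not the literal string, which is why the p/a/s characters vanished from the addresses too. One way to pull just the addresses out in a single step is grep's -o flag, sketched here on a made-up record (file names and sample data are mine):

```shell
# One made-up record in the format described above.
echo 'Name: jo, Dispname: joey, Email: jo@example.com, Pass: secret' > dbfile.txt

# grep -E -o prints only the part of each line that matches the pattern,
# so none of the surrounding fields reach the output.
grep -E -o '[^ ,]+@[^ ,]+' dbfile.txt > emails.txt

cat emails.txt   # jo@example.com
```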

I know these are probably unusual questions, but I have been trying to do this for the last day and it's really beginning to give me a headache, so I thought I would try to find some help instead.

Does anybody have a clue how I would go about either one of these tasks or did I lose everyone? :lol:

Thanks in advance anyway.


You can use gawk/awk to split each line so that only the email addresses are shown. I've only done this on a Windows machine, not on the Unix side where awk came from, but there is no reason it shouldn't work. Create a file called whatever.awk and paste this in:


BEGIN {

    FS = ","   # The field separator is a comma.
    OFS = ","  # Not actually needed as you only want one field, but won't hurt.

}

{

    sub(/^[ ]*Email:[ ]*/, "", $3)  # Drop the leading " Email: " label from the field.
    print $3                        # $3 is the email (third) field. $0 is the entire row/record.

}

On the terminal you would enter: gawk -f whatever.awk dbfile.txt > emailaddys.txt (the -f flag tells gawk to read the program from the file).
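A quick way to sanity-check the script against one made-up record (the sample data and file names here are mine, and the script is recreated inline so the example is self-contained):

```shell
# Recreate the awk script.
cat > whatever.awk <<'EOF'
BEGIN { FS = "," }
{
    sub(/^[ ]*Email:[ ]*/, "", $3)  # drop the " Email: " label
    print $3
}
EOF

# One record in the format from the original post.
echo 'Name: jo, Dispname: joey, Email: jo@example.com, Pass: secret' > dbfile.txt

awk -f whatever.awk dbfile.txt   # jo@example.com
```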


Why not write a tiny little C program for each of the problems? It wouldn't take more than a dozen lines of code for each. The split program could even be made to cut the chunks only at the end of a line (instead of cutting when 5,368,709,120 characters have been copied).
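For what it's worth, GNU split can already do that line-boundary variant without any C: the -C (--line-bytes) option caps each piece at a byte size but only ever cuts at a newline. A toy-sized sketch (sizes and file names scaled down and chosen by me):

```shell
# Small stand-in for the big file; the real call would be: split -C 5G world.txt part_
seq 1 1000 > world.txt

# Each piece is at most 1000 bytes, and no line is ever split across pieces.
split -C 1000 world.txt part_

# The pieces concatenate back to the original byte-for-byte.
cat part_* | cmp - world.txt && echo "pieces match"
```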


My understanding is that 'split' goes by line count

so :

split -l 5 content.txt output/data_

will split the file into chunks of 5 lines each. Now all you need to know is how many lines are in your original file (I believe 'wc -l' is the command for that) and divide it by the number of files you need.
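The count-then-divide idea can be sketched like this on toy data (the file and directory names, and the choice of four pieces, are mine):

```shell
# Toy stand-in: 100 lines instead of millions.
seq 1 100 > content.txt
mkdir -p output

# Count the lines, then split into four pieces of (roughly) equal line count.
# Rounding the division up ensures no fifth leftover file appears.
lines=$(wc -l < content.txt)
split -l $(( (lines + 3) / 4 )) content.txt output/data_

ls output | wc -l   # 4
```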

I suggest though, that you post the same question in the Debian forums, they will have the answer.

