sidux.com
phen
Post subject: Cutting/deleting within textfiles  PostPosted: Apr 07, 2008 - 01:06 AM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Greetings everyone!
I am currently trying to cut certain lines/items from a textfile.
First off, a simple example:

Quote:
...
AAA - foo
BBB - bar
AAA - foobar
...

Now I want to delete all the lines starting with/containing BBB.
My approach is doing so by
Quote:
cat /path/to/textfile | grep BBB

This does in fact list all the lines containing BBB.

Now my question:
How can I wrap that cat command so that it finally deletes the listed lines from the text file itself?

Thanks in advance!


EDIT:
Hm okay.. I found something

For now I am using
Code:
grep -v "BBB" /path/to/file > /path/to/newfile

Perhaps followed by a
Code:
mv /path/to/newfile /path/to/file

if necessary. Thanks nevertheless!=D
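
For comparison, the same "filter into a new file, then move it over the old one" idea can be sketched in Python. This is only a minimal sketch assuming Python 3; the function name delete_matching_lines is made up for illustration and is not part of any tool mentioned in this thread.

```python
import os
import tempfile


def delete_matching_lines(path, needle):
    """Rewrite the file at `path` without the lines that contain `needle`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the lines to keep into a temporary file in the same directory
    # (the counterpart of "grep -v ... > newfile") ...
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp, open(path) as src:
        for line in src:
            if needle not in line:
                tmp.write(line)
    # ... then move it over the original (the counterpart of "mv newfile file").
    os.replace(tmp_path, path)
```

Calling delete_matching_lines("/path/to/file", "BBB") would then drop every line containing BBB. Note that the temporary file is created with restrictive permissions, which the rewritten file inherits; keep a backup while experimenting.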


Last edited by phen on Apr 07, 2008 - 03:35 AM; edited 3 times in total
 
phen
Post subject:   PostPosted: Apr 07, 2008 - 03:35 AM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Alright, I created a script from it and copied it to my /usr/local/bin folder. Still one tiny problem remains though:

The given example
Quote:
...
AAA - foo
BBB - bar
AAA - foobar
...

now results in
Quote:
...
AAA - foo
AAA - foobar
...


Now I wonder how can I get rid of the first column?

What I am trying with is
Quote:
cat /path/to/file | awk '{print $1}'


Any idea whether the output can be inverted somehow, similar to the -v option for grep?
Or is there some whole different way to achieve it?

EDIT:
Code:
awk ' { $1="" ; print }'
at least replaces the first column with an empty string, but a leading space remains - is there any placeholder for "deleting" it? (Like NR seems to be the one that counts the lines)
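
As an illustration of "deleting" the first column without leaving a leading space, here is a small Python sketch. The function name drop_first_field is hypothetical; it assumes columns are separated by whitespace, as in the example lines above.

```python
def drop_first_field(line):
    """Return `line` without its first whitespace-separated field."""
    # Split only once, on the first run of whitespace; the rest of the
    # line, including its inner spacing, stays untouched.
    parts = line.rstrip("\n").split(None, 1)
    return parts[1] if len(parts) > 1 else ""


print(drop_first_field("AAA - foo"))      # - foo
print(drop_first_field("AAA - foobar"))   # - foobar
```

Because the split happens only once, any amount of text after the first column is kept as it is.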
 
severin
Post subject:   PostPosted: Apr 07, 2008 - 08:50 AM



Joined: Dec 04, 2006
Posts: 1238

Status: Offline
Code:
awk '$1 !~ /BBB/ {print $3}' <filename>
(no need for cat'ing). "!" means NOT
 
muchan
Post subject:   PostPosted: Apr 07, 2008 - 10:13 AM



Joined: Aug 24, 2007
Posts: 472
Location: Ljubljana
Status: Offline
or
Code:

grep -v BBB /path/to/file

??
 
kelmo
Post subject:   PostPosted: Apr 07, 2008 - 11:31 AM



Joined: Dec 19, 2006
Posts: 1030

Status: Offline
Code:

sed -i '/BBB/d' file
 
phen
Post subject:   PostPosted: Apr 07, 2008 - 02:53 PM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Sorry, I think I forgot to mention two rather important pieces of information:
BBB varies from line to line since it's a timecode =x
Furthermore, considering Severin's line, "print $3" wouldn't be enough, since there are more than 3 columns; in fact a varying amount of text follows in each line.
The whole thing is some sort of log file.

While searching the net I noticed "sed" too; however, "awk" was the only tool I found that offers a way to detect columns.
Perhaps Severin's command can be altered a little so that it takes the first column independently of its content, a bit like "awk '$1 ! {print $3}'" (which, embarrassingly, doesn't work)?
 
kelmo
Post subject:   PostPosted: Apr 07, 2008 - 03:01 PM



Joined: Dec 19, 2006
Posts: 1030

Status: Offline
It's easier to prototype code when the exact input and the exact wanted output are shown. Good luck with your script ;)
 
severin
Post subject:   PostPosted: Apr 07, 2008 - 03:01 PM



Joined: Dec 04, 2006
Posts: 1238

Status: Offline
I guess it would be simpler if you just pasted some ten lines of your log.
 
phen
Post subject:   PostPosted: Apr 07, 2008 - 04:28 PM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Hm, thanks for your patience - I thought it would be easier with simplified lines.

Here's a snippet:
Quote:
(23:44:11) Slik steps up close to Deprah and looks over her shoulder "Oh, I guess I saw something interesting here... hm..", while he tries to dive his hand into Deprahs bag discretely
(23:44:19) Deprah counts the coins and drops them inside her purse
(23:44:33) Deprah tells you: [roll?]
(23:45:09) You tell Deprah: [you say o/]
(23:45:11) Deprah tells group: Player Slik has joined the group!
(23:45:15) >Player Deprah has rolled a 10-sided die for a 10.
(23:45:21) Slik says: [-.-]
(23:45:28) >Player Slik has rolled a 10-sided die for a 6.
(23:45:43) Deprah eyes down to Slik, catching his move "Shortie?"
(23:46:11) Deprah stretches her back and grabs the hilts of her swords
(23:46:12) Someone reaches for his hammer, eyeing Slik


For filtering out different types of lines I have this so far:
Code:
#!/bin/bash
grep -v "GuildChat" "$1" | grep -v "You tell " | grep -v "tells you:" | grep -v "tells group:" | grep -v '[[]' | grep -v "picked up a" | grep -v ">Player" | awk '{$1="" ;  print}' > "$1_cleaned"


Quote:
Slik steps up close to Deprah and looks over her shoulder "Oh, I guess I saw something interesting here... hm..", while he tries to dive his hand into Deprahs bag discretely
Deprah counts the coins and drops them inside her purse
Deprah eyes down to Slik, catching his move "Shortie?"
Deprah stretches her back and grabs the hilts of her swords
Someone reaches for his hammer, eyeing Slik


Everything works fine and as wished, except that awk currently "only" replaces the timecode with a space - it would be perfect to have it removed completely.

Oh, and one more thing I was giving a (short) try. I added
Code:
grep -v "$2"

right before "> $1_cleaned", giving the chance to add an additional filter term on the commandline. It worked, but when I left the extra term out, "grep -v $2" obviously got "nothing" as its pattern (no input for $2 = match all?). Ideally there would not be just one more chance to add a filter term ($2), but the freedom to add as many as I like ;)
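
A rough sketch of how any number of extra filter terms could be handled, written in Python rather than bash. The names DEFAULT_FILTERS, keep_line, clean and main are made up for illustration; the filter terms are the ones from the script above. Empty terms are skipped on purpose, because an empty string occurs in every line and would filter out everything.

```python
import sys

# filter terms taken from the grep chain above
DEFAULT_FILTERS = ["GuildChat", "You tell ", "tells you:", "tells group:",
                   "[", "picked up a", ">Player"]


def keep_line(line, filters):
    """True if none of the filter terms occurs in the line."""
    return not any(term in line for term in filters)


def clean(lines, extra_filters=()):
    """Return the lines that match neither the default nor the extra filters."""
    # Skip empty terms: "" is contained in every line.
    filters = DEFAULT_FILTERS + [term for term in extra_filters if term]
    return [line for line in lines if keep_line(line, filters)]


def main(argv):
    # hypothetical usage: script LOGFILE [TERM ...]
    with open(argv[1]) as log:
        sys.stdout.writelines(clean(log, argv[2:]))
```

Here argv[2:] is simply the list of everything after the log file name, so zero, one or many extra terms all work the same way.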
 
CaesarTjalbo
Post subject:   PostPosted: Apr 07, 2008 - 08:18 PM



Joined: Dec 01, 2007
Posts: 203
Location: Enschede NL
Status: Offline
Writing is fun and easy, testing sucks, so I accept no liability whatsoever ;)

Getting your own code to behave is more fun, and awk especially is a study in itself, I believe. But perhaps you want to check this out, because it meets some of your requirements, like removing the leading space and making it possible to add filters via the commandline.

You can call it what you want ("cut_delete_within_textfiles" if you like), the extension .py is not mandatory, chmod u+x to make it executable.

Code:
#!/usr/bin/env python

import sys
import optparse
import os

def main(argv):
    ## set path initially according to where the script's called
    index = argv[0].rfind('/')
    path = argv[0][:index] if index != -1 else '.'   ## no '/' in argv[0]: use the current directory
    script_name = argv[0][index + 1:]

    ## deal with commandline options
    p = optparse.OptionParser('%s [options]\na CT UltraCrapWare (tm) product, no (c) 2008' % script_name)
    p.add_option('--filename', '-f', default = 'textfile', action = 'store', help = 'input filename')
    p.add_option('--path', '-p', action = 'store', help = 'path to files')
    p.add_option('--key', '-k', action = 'append', help = 'add extra keys to filter, 1 by 1')
    p.add_option('--test', '-t', action = 'store_true', default = False, help = 'reset input filename back to original name before processing')
    options, arguments = p.parse_args()

    filename = options.filename

    if options.path is not None and len(options.path) > 0:
        path = options.path

    ## strings to search for
    cut_keys = ["GuildChat", "You tell "
                , "tells you:", "tells group:"
                , "[", "picked up a"
                , ">Player"]
    if options.key is not None and len(options.key) > 0:
        cut_keys += options.key

    ## reset file, for testing only
    if options.test:
        if os.path.exists(os.path.join(path, filename + '.original')):
            os.rename(os.path.join(path, filename + '.original'), os.path.join(path, filename))

    in_file = open(os.path.join(path, filename), 'r')                    ## file to read
    out_file = open(os.path.join(path, filename + '.processed'), 'w')    ## file to write

    ## loop through the lines in the file
    keep_going = True
    while keep_going:
        in_string = in_file.readline()
        if len(in_string) > 0:          ## input, continue processing
            key_found = False
            for key in cut_keys:                ## search for key strings
                if in_string.find(key) != -1:
                    key_found = True
            if not key_found:
                #in_string = in_string[11:]     ## chop off time code first 11 characters, I prefer:
                index = in_string.find(' ')     ## chop off time code: everything before first space
                if index != -1:
                    in_string = in_string[index + 1:]
                out_file.write(in_string)       ## write line, return to process next line
        else:                           ## no further input, stop processing
            keep_going = False

    ## finished
    in_file.close()
    out_file.close()

    os.rename(os.path.join(path, filename), os.path.join(path, filename + '.original'))
    os.rename(os.path.join(path, filename + '.processed'), os.path.join(path, filename))



if __name__ == '__main__':
    main(sys.argv)
 
phen
Post subject:   PostPosted: Apr 07, 2008 - 08:39 PM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Wow CaesarTjalbo!
I was searching the internet for awk and howtos about it, and exactly as you say, awk seems to be very complex while providing almost infinite possibilities.

That single commandline I posted above was actually my first attempt at writing a "script" to automate processes, and since it gave pleasant results rather soon (and kind of intuitively), I am pretty eager to get into it more.

I will probably need some time to analyze and understand your script, though - I will reply to it as soon as possible.
But for now, a huge thank you!=D
 
CaesarTjalbo
Post subject:   PostPosted: Apr 08, 2008 - 10:37 AM



Joined: Dec 01, 2007
Posts: 203
Location: Enschede NL
Status: Offline
Well, analysis of the script is cool but I should point out there's a thing you can't analyze because it's missing: proper error handling, so keep back-ups of your files! I'm happy to explain things of course.

Another warning, this says something about silly me: I remember thinking "Oh, this awk + sed + grep + tail (etcetera) thing is cool, I need to learn more about it" but ever since I started using Python for simple shell scripting, I never got round to it. Usually when choosing between "quick fix via known territory" and "let's learn something new" I choose the former and I'm still a novice with the GNU tools. And 'simple shell scripting' with Python can lead to not-so-simple and large scripts, often far larger than with the GNU power tools.
 
phen
Post subject:   PostPosted: Apr 09, 2008 - 10:54 AM



Joined: Nov 30, 2006
Posts: 342
Location: Berlin
Status: Offline
Thank you CaesarTjalbo. As far as I can see it's mostly the predefined terms I am not familiar with at all (such as that "." dot), while the syntax appears to be sort of intuitive (years ago I was doing some basic if-then etc. scripts on the old Commodore C64 =P)

With all the loops being easy to understand, I don't really have a clue what the imports at the start are about (import sys and so on). Perhaps you could say a few words about them?
However, your script made Python quite intriguing to me!=)
 
CaesarTjalbo
Post subject:   PostPosted: Apr 09, 2008 - 08:41 PM



Joined: Dec 01, 2007
Posts: 203
Location: Enschede NL
Status: Offline
1. Python uses indentation to mark code blocks. Most indentation therefore is not for convenience or readability only but is necessary because of the language. There are no begin/end or {/} statements.

2. Python is largely object-oriented. It's not enforced, my script isn't very object-oriented for example, but you'll find objects in most Python programs. Coming from C64 BASIC you have a lot of catching up to do on Object Oriented design and programming, I think.

3. Comments are what's behind a #

------

At the top there's a she-bang ("#!"), a comment that tells the system the following text should be sent to the Python interpreter.

Just below that are a few lines starting with "import". "sys", "os" and "optparse" are modules. Python is a small language with most functionality in libraries, called modules. A module can be another Python program or a wrapper around a binary library. A lot of libraries available to C and C++ programmers are that way also available in Python. The "import" statements tell the python interpreter to load those modules.

At the bottom of the script there are 2 lines of code which are the entry point to the program. Without going into details about the format all it does is call the function (sub / procedure) "main" and pass as argument "sys.argv".

"sys.argv" is an object "argv" from the module "sys". It's a list containing the commandline when the script was started, much like in C. A list is a Python datatype containing a collection of objects. The first item is accessed by "argv[0]", the second by "argv[1]", etcetera. It's the same as $0, $1, etc. in your code. So the 1st item contains the name of the script, the 2nd contains the first argument, etc.

Back to the top: below the "import" lines starts the definition of the main-function. I use argv[0], the first item on the commandline, which is a string containing the location and name of the script. A string (also a Python datatype) is an object and has methods (and properties), like "rfind" and "find" for example. The notation is object.method similar to modules. You can also use module.object.method if you like.
A string is also a list of characters, so if the first item of the list "argv" is called "argv[0]" and is of type string, then "argv[0][0]" is the first character of "argv[0]".
I use both approaches to separate path from scriptname, I'll explain later.
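
To make the slicing around the last '/' concrete, here is a small sketch. The value of argv0 is a made-up example of what sys.argv[0] might contain; os.path.split is the standard library function that performs the same split.

```python
import os

argv0 = '/usr/local/bin/cleanlog'    # hypothetical value of sys.argv[0]

# the approach from the script: slice around the last '/'
index = argv0.rfind('/')
path = argv0[:index]
script_name = argv0[index + 1:]

# the same split done by the standard library
lib_path, lib_name = os.path.split(argv0)

print(path, script_name)    # /usr/local/bin cleanlog
print(argv0[0])             # / (the first character of the string)
```

Both ways give the same two parts here; the slicing version only works as intended when there actually is a '/' in the string.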

Next comes a block with lots of references to the "optparse" module. Optparse is just another way of dealing with the commandline. With optparse I don't have to code all sorts of ways to deal with arguments myself; I easily get the commandline arguments the way I want them, and "optparse" automatically creates a neat help function: try typing "-h" or "--help" when invoking the script.
Without details: It starts with instantiating an object "p" from type "optparse.OptionParser". Next I create commandline options by calling method "add_option" on that object "p". Method "parse_args" returns an object called "options" with all the commandline arguments as properties. "options.filename" is a string and holds the filename, "options.test" is a boolean and holds True or False, etc.

"cut_keys" is a list containing the filter arguments I copied from your post, the indentation is for readability. If the user specified additional keys on the commandline then "options.key" is a list too and is appended to "cut_keys". ("cut_keys += options.key" means "cut_keys = cut_keys + options.key")
If a commandline option isn't given and has no default value, it's not returned so I test to see if it's "None" (null).
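
A small, self-contained sketch of that behaviour, using only two of the options from the script. Normally "parse_args" reads the real commandline; here a list is passed in by hand purely for illustration, and the variable names are made up.

```python
import optparse

parser = optparse.OptionParser()
parser.add_option('--key', '-k', action='append', help='extra filter key')
parser.add_option('--test', '-t', action='store_true', default=False)

# two -k options plus one plain argument
given, given_args = parser.parse_args(['-k', 'hammer', '-k', 'coins', 'logfile'])
print(given.key)     # ['hammer', 'coins']
print(given.test)    # False
print(given_args)    # ['logfile']

# nothing on the commandline: an option without a default comes back as None
empty, empty_args = parser.parse_args([])
print(empty.key)     # None

cut_keys = ["GuildChat", ">Player"]
if empty.key is not None:
    cut_keys += empty.key    # skipped here, so cut_keys stays unchanged
```

This is why the script tests for "None" before appending the extra keys.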

Then you'll find the reset part: "os" and "os.path" are modules dealing with Operating System tasks, like renaming files. "os.path" is a module inside the "os" module :). "os.path.join" is a function to create a path. I don't want to deal with path separators myself (because 'home/user/dir/file' is different from 'home/user/dirfile') and "join" takes care of that. Also, when you run this script on another OS it should use that system's separators. I use string concatenation (string "+" string) for making filenames, "join" for making complete paths.
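
A minimal sketch of the difference between the two; the directory and file names are invented for the example.

```python
import os

directory = os.path.join('home', 'user', 'dir')    # separators are added for you
backup_name = 'textfile' + '.original'             # plain string concatenation
full_path = os.path.join(directory, backup_name)

print(full_path)    # home/user/dir/textfile.original on Linux
```

On another operating system the same code would produce that system's separator instead of '/'.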

"open" opens a file for read or write. "readline" is a method on a file that returns 1 line at a time. If the length is 0 then the end of file is reached.
I check every line for all the filters from "cut_keys". If there's a match with a filter, no further processing takes place and "readline" provides the next line.
Chopping off the timecode can be done in two ways: by removing the first 11 characters from a line (the 10 characters of the timecode plus the space after it), which means taking the remainder from index 11 to the end ("in_string[11:]"), or by finding the location of the first space and taking the remainder starting from the first character after that first space. The first way is a comment in my script.
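
Both ways applied to one line from the log snippet posted earlier, as a quick sketch (the variable names are only for illustration):

```python
line = "(23:44:19) Deprah counts the coins and drops them inside her purse\n"

# way 1: fixed width, "(23:44:19)" is 10 characters plus 1 space
by_length = line[11:]

# way 2: everything after the first space, whatever the width of the timecode
index = line.find(' ')
by_space = line[index + 1:] if index != -1 else line

print(by_length == by_space)    # True
print(by_space, end='')         # Deprah counts the coins and drops them inside her purse
```

The second way keeps working if the timecode ever has a different width, which is why it is the one used in the script.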

I've taken the original input file and written to a file with the same name + ".processed". At the end I swap it around: the original file gets appended with ".original" and the result gets the original name, 'cleaned' in your code.

------

strings are lists and objects:
Code:

s = "sidux"
s[0]         # 's'
s[2:]        # 'dux'
s[:4]        # 'sidu'
s[-2]        # 'u'
s[-2:]       # 'ux'
len(s)       # 5
s += s       # 'siduxsidux'
len(s)       # 10
s[2:7]       # 'duxsi'
s.find('d')  # 2
s.rfind('d') # 7


------

Python is nice, the following links are to the documentation where you can find information about the language, tutorials, etc.
http://docs.python.org/index.html
http://python.org/doc/

Getting into Python via a book, imho the best resource for starting with Python and still useful when an advanced level is reached:
Core Python Programming 2nd edition - Wesley J. Chun - Prentice Hall 2007


Last edited by CaesarTjalbo on Apr 09, 2008 - 08:49 PM; edited 1 time in total
 
CaesarTjalbo
Post subject:   PostPosted: Apr 09, 2008 - 08:48 PM



Joined: Dec 01, 2007
Posts: 203
Location: Enschede NL
Status: Offline
Anyway, so far Python.

Perhaps it's possible to store the filter arguments in a separate file and use something like grep -vFf <filterfile> instead of that long line you're using now. (The file name has to follow the -f directly, so the f goes last.)

If awk doesn't give the right results, I think it's also possible to replace the time code via sed: everything between & including the 1st ( and 1st space should be removed.
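
The same idea (remove everything from the leading "(" up to and including the first space) sketched with a regular expression in Python instead of sed. The names TIMECODE and strip_timecode are made up for this example.

```python
import re

# a "(" at the start of the line, then any non-space characters, then one space
TIMECODE = re.compile(r'^\([^ ]* ')


def strip_timecode(line):
    """Remove a leading "(hh:mm:ss) " style prefix, if there is one."""
    return TIMECODE.sub('', line, count=1)


print(strip_timecode("(23:46:12) Someone reaches for his hammer, eyeing Slik"))
# Someone reaches for his hammer, eyeing Slik
```

Lines that do not start with "(" are returned unchanged.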
 