dserban

Bash Tips And Tricks From My Own Experience - Several Instalments


Quick overview of bash

As mentioned in this forum before, bash is the most popular shell environment in the Linux world, and a good tutorial can be found here:
http://www.tldp.org/LDP/abs/html/

The purpose of this topic is to show you a couple of quick how-to nuggets that I'm deriving from personal experience, both from using bash inside Linux itself and from using cygwin, a Win32 port of a collection of Linux utilities that includes bash as one of the shells you can run commands in.

All examples have been made using cygwin in a Windows environment.

Instalment 1:
- How to convert all file names in a directory from mixed (or upper) case to lower case:

Suppose we have a couple of files in our ztmptmp directory, like this:

# pwd
/cygdrive/c/ztmptmp
#
# ls -l
total 7488
-rw-r--r--    1 Administ Administ  1571302 Jun  6 14:55 02-DeFrag-xvid.avi
-rw-r--r--    1 Administ Administ  3431522 Jun  6 14:57 03-ResizeNTFS-xvid.avi
-rw-r--r--    1 Administ Administ  2649526 Jun  6 14:59 04-MakeLnxparts-xvid.avi
-rw-r--r--    1 Administ Administ    13292 Jul  1 17:56 AVSCAN-20070701-164946-8E31AAB2.LOG
#

and we throw another one in, for good measure:

# touch IN.ALL.UPPER.CASE.TXT
#
# ls -l
total 7488
-rw-r--r--    1 Administ Administ  1571302 Jun  6 14:55 02-DeFrag-xvid.avi
-rw-r--r--    1 Administ Administ  3431522 Jun  6 14:57 03-ResizeNTFS-xvid.avi
-rw-r--r--    1 Administ Administ  2649526 Jun  6 14:59 04-MakeLnxparts-xvid.avi
-rw-r--r--    1 Administ Administ    13292 Jul  1 17:56 AVSCAN-20070701-164946-8E31AAB2.LOG
-rw-r--r--    1 Administ Administ        0 Jul  1 18:45 IN.ALL.UPPER.CASE.TXT
#

The script that converts the file names to lower case looks like this:

# for myfile in *
> do
> mv $myfile `echo $myfile | tr 'A-Z' 'a-z'`
> done
#
and the result is the one I expected:

# ls -l
total 7488
-rw-r--r--    1 Administ Administ  1571302 Jun  6 14:55 02-defrag-xvid.avi
-rw-r--r--    1 Administ Administ  3431522 Jun  6 14:57 03-resizentfs-xvid.avi
-rw-r--r--    1 Administ Administ  2649526 Jun  6 14:59 04-makelnxparts-xvid.avi
-rw-r--r--    1 Administ Administ    13292 Jul  1 17:56 avscan-20070701-164946-8e31aab2.log
-rw-r--r--    1 Administ Administ        0 Jul  1 18:45 in.all.upper.case.txt
#

Advanced UNIX / Linux users already know what did the trick, but for completeness I am giving a detailed explanation.
When you run the script, the first thing that happens is that the shell sees the wildcard * and expands it into the full list of files. In other words, bash temporarily (and internally) rewrites the script to look like:

for myfile in 02-DeFrag-xvid.avi 03-ResizeNTFS-xvid.avi 04-MakeLnxparts-xvid.avi AVSCAN-20070701-164946-8E31AAB2.LOG IN.ALL.UPPER.CASE.TXT
do
mv $myfile `echo $myfile | tr 'A-Z' 'a-z'`
done
Each of the file names in that list is then in turn subject to the mv command.
mv is a pretty basic command that can be used to move or rename a file.
In this case, we are supplying a target file name as the second argument and not a directory, so it will know to perform a rename operation.
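For example (these file names are made up purely for illustration):

mv report.txt /tmp/            # second argument is a directory, so this is a move
mv report.txt report_old.txt   # second argument is a file name, so this is a rename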
The two arguments of the mv command are:
- $myfile
- `echo $myfile | tr 'A-Z' 'a-z'`
Note the use of backticks with which we surround the Linux command:

echo $myfile | tr 'A-Z' 'a-z'

These backticks instruct the bash shell to run the command inside and substitute its output for the entire string (including the backticks).
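As a side note, bash also offers the $( ) form of command substitution, which does the same thing as backticks but is easier to read and to nest. A quick sketch using a hypothetical file name:

# echo MIXED.Case.TXT | tr 'A-Z' 'a-z'
mixed.case.txt
# newname=$(echo MIXED.Case.TXT | tr 'A-Z' 'a-z')
# echo $newname
mixed.case.txt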
To understand what the command does, let's test it using the first file name:

# echo 02-DeFrag-xvid.avi | tr 'A-Z' 'a-z'
02-defrag-xvid.avi
#
Here, the output from the first command, which is simply the literal string 02-DeFrag-xvid.avi is fed to the second command as the input.
The tr command expands the character sequences A-Z and a-z into the full-blown ASCII sequences:

ABCD ... Z

and

abcd ... z

then makes a positional translation of the text it receives as its input (in our case the file name), hence the name tr, for translate.
So it will loop through each character of the file name and go:
Is this character an A? If yes, replace it with the character in the same position in the second string, which is a lowercase a. If not, move on to the next position in the first string.
Is this character a B? ... you get the idea...
until it exhausts the input string, at which time it will output the modified string.

And the modified string will replace the backticks and become the second argument to the mv command, as explained above.
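One caveat worth mentioning: the loop above breaks on file names containing spaces, and it also tries to rename files that are already lower case (mv then complains that source and target are the same file). A slightly more defensive sketch of the same technique (the variable name lower is mine; the quoting and the comparison are the only additions):

for myfile in *
do
  lower=`echo "$myfile" | tr 'A-Z' 'a-z'`
  if [ "$myfile" != "$lower" ]
  then
    mv "$myfile" "$lower"
  fi
done

On newer versions of bash (4.0 and later) you can also do the lowercasing without tr at all, using the parameter expansion ${myfile,,}.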


Very interesting topic, thanks for the info. I was very impressed, I learned a lot of things, explained in a very easy-to-understand way, which everyone appreciates in a tutorial.

The only thing I would say is that the topic title is not accurate. The topic title is "Bash Tips and Tricks", but this is not really about bash. It is also not specifically UNIX nor Linux. It is simply about usage of the tr command, which works the same in every UNIX shell: ksh, sh, csh, bash. So the title should be "tr usage, useful tricks from my own experience". People expecting to learn bash tips would then be less disappointed.

Regards,
Yordan


Instalment 2:

- How I use POSIX utilities to process text

 

One of the problems I run into very often is wanting to extract usable links from a very large web page whose source code looks messy when I view it using "View Source".

Have you ever looked at a cool web page with nice visual effects and the first thing that comes to your mind is "How did he do that? I would also like to have those effects on my own site."

Or perhaps the page has lots of links to pretty pictures and you just want the links without the surrounding "fat".

And you go to view the source and it looks like this:

 

<html><head><title>Fun Pics</title></head><body><a href=http://forums.xisto.com/no_longer_exists/ 1</a><br><a href=http://forums.xisto.com/no_longer_exists/ 2</a><br><a href=http://forums.xisto.com/no_longer_exists/ 3</a><br><a href=http://forums.xisto.com/no_longer_exists/ 4</a><br><a href=http://forums.xisto.com/no_longer_exists/ 5</a><br><a href=http://forums.xisto.com/no_longer_exists/ 6</a><br><a href=http://forums.xisto.com/no_longer_exists/ 7</a></body></html>

Now you want to make some sense out of that never ending HTML string.

You probably already have your favorite tool handy to do that easily, and that's fine, but my objective here is to showcase the use of POSIX utilities, which is why I will show you what might look like the hard way of doing it.

 

For simplicity, I will assume that the above HTML code is stored in file messyHTML.html.

 

My objective is to obtain a clean list of all the links pointing to jpg files, each one on its own line.

Something like this:

 

http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/

First of all, I would like to get a rough overview of the structure of the web page by having each HTML tag on its own line:

 


# cat messyHTML.html | sed 's#<#\
> <#g'
<html>
<head>
<title>Fun Pics
</title>
</head>
<body>
<a href=http://forums.xisto.com/no_longer_exists/ 1
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 2
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 3
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 4
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 5
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 6
</a>
<br>
<a href=http://forums.xisto.com/no_longer_exists/ 7
</a>
</body>
</html>
#

OK, so what did I do?

Let's look at the command again, it might seem complex and cryptic at first, so let's break it down.

To eliminate some confusion related to how I configured my shell prompts (the sets of characters "# " and "> "), let's see what the command looks like without them:

 


cat messyHTML.html | sed 's#<#\
<#g'

The two utilities cat and sed communicate with each other using a pipe (the character "|"). This pipe redirects the output of the cat command from where it would normally go (the terminal output) into the standard input of the sed command.

Of course I could have done this using only the sed command by pointing it to the file, but I have chosen to use two commands for several reasons, one of which is to make the sed command as easy to understand as possible, which is not an easy task. The other reason was to give this tutorial some integrity. You will see what I mean later on, when I use sed one more time, but in a slightly more complex manner.
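For the record, the cat-free version would look something like this (same sed expression, just pointing sed directly at the file):

sed 's#<#\
<#g' messyHTML.html

The output is identical; the cat | sed form simply makes the left-to-right flow of the data a little easier to follow.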

 

You are probably wondering why the command spans two lines. Try to imagine the two lines joined back together, but keep in mind that they are still separated by the binary representation of the newline (the character produced when you press Enter).

The imaginary command might look like this:

 


cat messyHTML.html | sed 's#<#\{binary representation of the newline}<#g'

The newline is a special character that the bash shell would normally interpret as the end of the command, unless we "escape" it. Escaping a character means instructing the bash shell not to give that character any special meaning and to treat it "as-is". Escaping is done in the bash shell by putting a backslash in front of the special character.
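Escaping is not specific to the newline; any character the shell treats specially can be protected the same way. A tiny self-contained example using spaces, which the shell normally collapses:

# echo one   two
one two
# echo one\ \ \ two
one   two

In the second command the escaped spaces lose their special word-separating meaning, so echo receives them as part of a single argument and prints them untouched.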

 

Now we need to look at the sed command in more detail.

sed is a stream editor that many people use primarily as a tool to mass replace patterned strings of text.

A very basic example of how sed works is this:

 


# echo "abcdefghi" | sed 's#de#WWWWW#'
abcWWWWWfghi
#

In this example I have used sed to replace the first occurrence of "de" with "WWWWW" in the string "abcdefghi".

Look at the following example, where I append an additional occurrence of "de" to the end of our string "abcdefghi" and run the same command again:

 


# echo "abcdefghide" | sed 's#de#WWWWW#'
abcWWWWWfghide
#

If our objective is to replace ALL occurrences of "de", we simply specify the "g" switch, which does a global replace.

 


# echo "abcdefghide" | sed 's#de#WWWWW#g'
abcWWWWWfghiWWWWW
#

On a side note, I personally tend to use the hash mark (the "#" character) as the separator in sed, because I often find myself needing to replace strings that contain slashes, but most people I have seen use the slash as a separator, like this:

 


# echo "abcdefghide" | sed 's/de/WWWWW/g'
abcWWWWWfghiWWWWW
#
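To see why the choice of separator matters, here is a hypothetical path replacement written both ways (the paths are made up):

# echo "/usr/local/bin" | sed 's#/usr/local#/opt#'
/opt/bin
# echo "/usr/local/bin" | sed 's/\/usr\/local/\/opt/'
/opt/bin

Both commands produce the same result, but the second one forces you to escape every slash inside the pattern, which quickly becomes hard to read.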

Excellent !!!

I'm only interested in links to pictures, so after quickly eyeballing the structure of the web page, I decide that I would only like to keep those lines that contain the string "images".

 


# cat messyHTML.html | sed 's#<#\
> <#g' | grep images
<a href=http://forums.xisto.com/no_longer_exists/ 1
<a href=http://forums.xisto.com/no_longer_exists/ 2
<a href=http://forums.xisto.com/no_longer_exists/ 3
<a href=http://forums.xisto.com/no_longer_exists/ 4
<a href=http://forums.xisto.com/no_longer_exists/ 5
<a href=http://forums.xisto.com/no_longer_exists/ 6
<a href=http://forums.xisto.com/no_longer_exists/ 7
#

grep is a very powerful and very underutilized tool - "underutilized" in the sense that it's not utilized to its full potential: 99% of the time people only use 1% of its power. I will probably spend some time in a future instalment exploring its capabilities.

 

It is that 1% that I am using here as well: basic filtering using plain (unpatterned) text.
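For anyone who has not used grep before, that basic filtering simply means "print only the lines that contain this text". A tiny standalone illustration (the words are made up):

# printf "apple\nbanana\ncherry\n" | grep an
banana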

 

At this point we are one step away from reaching our objective and, as always in UNIX, there is more than one way of performing the next step. Let me explain.

 

In this particular example, all the links that we are interested in begin at the same offset and have the same length, so one is strongly tempted to leverage this feature and use the cut command. For educational purposes, I will go ahead and show you how cut works, but in real life situations I always use an advanced feature of sed called backreferencing, which I will show you further below, so keep reading.

 

Let's look at how our output lines are structured:

 

<a href=http://forums.xisto.com/no_longer_exists/ 1
12345678901234567890123456789012345678901234567890123456789012345678901234567890
         |         |         |         |         |         |         |
         10        20        30        40        50        60        70

So where's the "beef"?

Well, the "beef" begins at position 9 and ends at position 62.

Let's perform a single test:

 

# echo "<a href=http://forums.xisto.com/no_longer_exists/ 1" | cut -c9-62
http://forums.xisto.com/no_longer_exists/
#

It works as expected, let's do the whole thing in one fell swoop:

 


# cat messyHTML.html | sed 's#<#\
> <#g' | grep images | cut -c9-62
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
#

Excellent !!! Are we done?

Well, yes and no. This was the lazy man's way; let me now show you the right way.

 


# cat messyHTML.html | sed 's#<#\
> <#g' | grep images | sed 's#^.*\(http.*jpg\).*$#\1#'
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
#

Does that look like command line garbage or what?

Take a deep breath, I'm getting ready to explain to you that scary thing at the end:

 

's#^.*\(http.*jpg\).*$#\1#'

But before I do, let us make a quick mental note that, although seemingly vastly more complex, this approach makes absolutely no assumptions whatsoever about where the links begin and where they end.

 

The following is a regular expression pattern that matches the whole line (each and every line in its entirety):

 

^.*\(http.*jpg\).*$

It matches the whole line because it begins with the character "^" and it ends with the character "$".

When used in a regular expression pattern, the character "^" always matches the beginning of the line and the character "$" always matches the end of the line.
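Here is a quick sed illustration of what the two line anchors buy you (made-up strings again):

# echo "de facto standard" | sed 's#^de#XX#'
XX facto standard
# echo "de facto standard" | sed 's#de$#XX#'
de facto standard
# echo "grande" | sed 's#de$#XX#'
granXX

The first command only touches a "de" at the very beginning of the line, and the last two only touch a "de" at the very end, which is why the middle command changes nothing.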

 

Note that I am making use of terms which I have not already explained, terms like "regular expression" and "pattern". If you want a formal definition for these terms, please feel free to google them, then come back to this tutorial. I prefer to define these entities by showing you how they are used in practice.

 

When used in a regular expression pattern, a dot matches any character.

A dot followed by an asterisk matches any string of characters, including the empty string.

This is because, when used in a regular expression pattern, an asterisk matches zero or more occurrences of the character preceding it. For instance, the pattern a* will match any one of the following strings:

- the empty string

- a

- aa

...

- aaaaaaaaa

etc. ... you get the idea
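You can convince yourself of this with sed; in the hypothetical commands below I put the starred character between two fixed letters so the effect is visible, and every input matches because "zero or more b characters" includes none at all:

# echo "ac" | sed 's#ab*c#MATCH#'
MATCH
# echo "abc" | sed 's#ab*c#MATCH#'
MATCH
# echo "abbbbc" | sed 's#ab*c#MATCH#'
MATCH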

 

The regular expression pattern:

 

^.*\(http.*jpg\).*$

makes use of four anchors, two of which we have already discussed (beginning and ending of line).

 

The other two are: "(http" and "jpg)"

Here I am showing the parentheses without their respective preceding backslashes, but keep in mind that in sed's basic regular expression syntax the parentheses must be escaped with backslashes in order to act as grouping operators rather than as literal characters.

 

These two anchors "(http" and "jpg)" mean that whatever text matches the pattern "http.*jpg" inside the parentheses will temporarily be stored in a special location in sed's memory called a backreference. From that moment on, the character string stored in that special location in memory can be referenced by the name of "\1".

sed provides a maximum of nine such backreferences, \1 through \9. Backreferences \2 and above become available as soon as you have more than one set of parentheses.
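To make multiple backreferences concrete, here is a hypothetical example with two sets of parentheses, where \2 and \1 are used to swap the two halves of a name:

# echo "John Smith" | sed 's#\(.*\) \(.*\)#\2, \1#'
Smith, John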

 

It should now be obvious that the command

 

sed 's#^.*\(http.*jpg\).*$#\1#'

entirely replaces each and every line from the standard input with the string we want.

 

One more caveat to pattern matching:

sed's regular expression engine is "greedy", meaning that, should there have been several http.*jpg links on one line, the backreferencing trick would only partially provide the expected results.

Let's give an example using a slightly modified version of the web page:

 

# cat test.html
<html><head><title>Fun Pics</title></head><body><a href=http://forums.xisto.com/no_longer_exists/ 1</a><br><a href=http://forums.xisto.com/no_longer_exists/ 2</a><br><a href=http://forums.xisto.com/no_longer_exists/ 3</a><br><a href=http://forums.xisto.com/no_longer_exists/ 4</a><br><a href=http://forums.xisto.com/no_longer_exists/ 5</a><br><a href=http://forums.xisto.com/no_longer_exists/ 6</a><br><a href=http://forums.xisto.com/no_longer_exists/ 7</a></body></html>
#
#
# cat test.html | grep images | sed 's#^.*\(http.*jpg\).*$#\1#'
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
http://forums.xisto.com/no_longer_exists/
#

In this case, the first and second patterns of ".*" both behave in a greedy manner, each one of them trying to "swallow" the character string:

 

http://forums.xisto.com/no_longer_exists/ 6</a><br><a href=

but precedence is given to the first.
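If you ever do hit this situation, one practical workaround, provided your grep supports the -o (only matching) option as GNU grep does, is to skip the backreference entirely and let grep print every non-overlapping match on its own line, using a character class that cannot run past the end of one link. The URLs below are made up:

# echo "pre http://a/1.jpg mid http://a/2.jpg post" | grep -o 'http[^ ]*jpg'
http://a/1.jpg
http://a/2.jpg

In real HTML you would typically exclude the closing quote or the > character instead of the space, but the idea is the same.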
