Hopefully, the previous Bash crash course chapter provided more than a hint of the utility and power of Bash. This chapter builds on it by introducing several bolt-on technologies that make Bash even more extensible when searching for items and text, or when automating file explorer and file system operations.
By itself, Bash is merely a powerful scripting language, but much of its flexibility comes from being able to "glue" other technologies (tools or languages) together to make the output more useful. In other words, Bash is a base platform, similar to how some car lovers choose a particular platform before making their modifications. Will a modified car do everything, even with enhancements? Certainly not, but the modifications can make it more powerful or useful in specific cases, and the platform at least provides four wheels for movement.
Not only do common scripts contain a series of commands for automation, they often include logic to modify strings such as the following:
- Removing trailing characters
- Replacing sections of words (substrings)
- Searching for strings in files
- Finding files
- Testing file types (directory, file, empty, and so on)
- Performing small calculations
- Limiting the scope of searches or data (filtering)
- Modifying the contents of variables (strings inside of string variables)
This logic that modifies, limits, and even replaces input/output data can be very powerful when you need to execute broad searches for a specific string or when you have copious amounts of data. Terminals chock-full of output or massive data files can be very daunting to explore!
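A few of the operations listed above can be sketched with nothing but Bash built-ins and grep. This is only an illustration: the variable names and sample values are made up for the example and are not part of the recipe that follows.

```bash
#!/bin/bash
# Every name and value here is a made-up example, not part of the recipe.
DIRECTORY="www.packtpub.com/"
TITLE="Packt Publishing"

# Removing trailing characters: % drops the shortest matching suffix
TRIMMED="${DIRECTORY%/}"

# Replacing a substring: ${var/old/new} swaps the first match
REPLACED="${TITLE/Publishing/Pub}"

# Performing a small calculation with arithmetic expansion
TOTAL=$(( 6 * 7 ))

# Limiting data (filtering): keep only the lines that start with "b"
FILTERED=$(printf '%s\n' alpha beta gamma | grep '^b')

# Testing file types inside a throwaway directory
WORKDIR=$(mktemp -d)
touch "${WORKDIR}/empty.txt"
if [ -d "${WORKDIR}" ]; then
    KIND="directory"
fi
if [ -f "${WORKDIR}/empty.txt" ] && [ ! -s "${WORKDIR}/empty.txt" ]; then
    STATE="empty file"
fi
rm -rf "${WORKDIR}"

echo "${TRIMMED} | ${REPLACED} | ${TOTAL} | ${FILTERED} | ${KIND} | ${STATE}"
# Prints: www.packtpub.com | Packt Pub | 42 | beta | directory | empty file
```

Note that the `${var/old/new}` form is a Bash feature rather than plain POSIX shell, so it may not work under other shells.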
However, there is one very important concept that still needs to be discussed, and that is recursive functionality. Recursive functionality can apply to script functions, logic, and even a command operation. For example, you can use grep to recursively crawl an entire directory until no more files remain, or you can recursively execute a function inside of itself until a condition is met (for example, printing a single character at a time within a string):
# e.g. File system
# / (start here)
# /home (oh, we found home)
# /home/user (neat, there is a directory inside it called user)
# /home/user/.. (even better, user has files - let's look in them too)
# /etc/ # We are done with /home and its "children", so let's look in /etc
# ... # Until we are done
Be careful with recursion (especially with functions), as it can be really slow depending on the complexity of the structure (for example, the depth of the file system or the size of the files). Also, if there is a logic error, you can keep executing functions recursively forever!
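The idea of a function calling itself until a condition is met can be sketched as follows. The function name print_chars is hypothetical; it prints one character of a string per line, and the empty-string check is the stop condition that keeps the recursion from running forever.

```bash
#!/bin/bash
# print_chars is a hypothetical helper written for this illustration.
print_chars() {
    local str="$1"

    # Base case: nothing left to print, so stop recursing
    if [ -z "${str}" ]; then
        return 0
    fi

    # Print the first character, then recurse on the remainder
    printf '%s\n' "${str:0:1}"
    print_chars "${str:1}"
}

print_chars "Bash"
```

Removing the base case would reproduce the "recursively forever" problem described above, so it is usually worth writing the stop condition first.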
This chapter is all about limiting data, utilizing it, modifying it, internationalizing it, replacing it, and even searching for it in the first place.
Imagine searching for a four-leaf clover in a big garden. It would be really hard (and it is still really hard for computers). Thankfully, words are not images, and text on a computer is easily searchable depending on the format. The format matters because if your tool cannot understand a given type of text (encoding), then you might have trouble recognizing a pattern or even detecting that there is text at all!
Typically, when you are looking at the console, text files, source code (C, C++, Bash, HTML), spreadsheets, XML, and other types, you are looking at text encoded in ASCII or UTF. ASCII is a commonly used format in the *NIX world on the console. The UTF family of encodings improves upon ASCII and supports a variety of extended characters that were not present in computing originally. It comes in a number of formats such as UTF-8, UTF-16, and UTF-32.
When you hear the words encoding and decoding, think of something similar to encryption and decryption, except that the purpose is not to hide something but to transform data into a form appropriate for the use case, such as transmission, usage with languages, or compression.
ASCII and UTF are not the only types your target data might be in. In various types of files, you may encounter different types of encoding of data. This is a different problem that's specific to your data and will need additional considerations.
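To see why the encoding matters to a search tool, the following sketch stores the same word in three encodings and counts the bytes each one needs. It assumes the iconv utility is installed (it is not among the commands this recipe requires), and the sample word is made up for the example.

```bash
#!/bin/bash
# Assumes iconv is available; the sample word is only an illustration.
# "cafe" with an accented e, written as raw UTF-8 bytes
WORD=$(printf 'caf\xc3\xa9')

# $(( ... )) strips any padding that wc may add around the number
UTF8_BYTES=$(( $(printf '%s' "${WORD}" | wc -c) ))
LATIN1_BYTES=$(( $(printf '%s' "${WORD}" | iconv -f UTF-8 -t ISO-8859-1 | wc -c) ))
UTF16_BYTES=$(( $(printf '%s' "${WORD}" | iconv -f UTF-8 -t UTF-16LE | wc -c) ))

echo "UTF-8:      ${UTF8_BYTES} bytes"
echo "ISO-8859-1: ${LATIN1_BYTES} bytes"
echo "UTF-16LE:   ${UTF16_BYTES} bytes"
```

The four visible characters occupy a different number of bytes in each encoding, so a tool that compares raw bytes can miss a match when the search term and the data are encoded differently.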
In this recipe, we will begin searching for strings and look at a couple of ways to find your own needles in a massive haystack of data. Let's dig in.
Besides having a terminal open (and your favorite text editor, if necessary), we only need a couple of core commands such as grep, ls, mkdir, touch, traceroute, strings, wget, xargs, and find.
Assuming that your user already has the correct permissions for your usage (and is authorized, of course), we will need to generate data to begin searching:
$ cd ~/
$ wget --recursive --no-parent https://www.packtpub.com # Takes a while
$ traceroute packtpub.com > traceroute.txt
$ mkdir -p www.packtpub.com/filedir www.packtpub.com/emptydir
$ touch www.packtpub.com/filedir/empty.txt
$ touch www.packtpub.com/findme.xml; echo "<xml>" > www.packtpub.com/findme.xml
Using the data obtained by recursively crawling the Packt Publishing website, we can see that inside of www.packtpub.com the entire website is available. Wow! We also created some test data directories and files.
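If you want to confirm that the test directories and files look the way you expect, find can report empty and non-empty entries. The sketch below rebuilds the same layout in a throwaway directory so that it runs without the crawl; the SANDBOX variable and the site directory name are assumptions made for the example.

```bash
#!/bin/bash
# SANDBOX and "site" are placeholders standing in for ~/www.packtpub.com.
SANDBOX=$(mktemp -d)
mkdir -p "${SANDBOX}/site/filedir" "${SANDBOX}/site/emptydir"
touch "${SANDBOX}/site/filedir/empty.txt"
echo "<xml>" > "${SANDBOX}/site/findme.xml"

# Directories that contain nothing at all
EMPTY_DIRS=$(cd "${SANDBOX}" && find site -type d -empty)
# Regular files with a size of zero
EMPTY_FILES=$(cd "${SANDBOX}" && find site -type f -empty)
# Regular files that hold some data
FULL_FILES=$(cd "${SANDBOX}" && find site -type f ! -empty)
rm -rf "${SANDBOX}"

echo "Empty directories: ${EMPTY_DIRS}"
echo "Empty files:       ${EMPTY_FILES}"
echo "Non-empty files:   ${FULL_FILES}"
```

The -empty test is an extension found in GNU and BSD find rather than part of POSIX, so it may be missing on other systems.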
- Next, open up a terminal and create the following script, saving it as search.sh:
#!/bin/bash
# Let's find all the files with the string "Packt"
DIRECTORY="www.packtpub.com/"
SEARCH_TERM="Packt"
# Can we use grep?
# Discard only the error messages (such as "Is a directory") with 2>
grep "${SEARCH_TERM}" ~/* > result1.txt 2> /dev/null
# Recursive check
grep -r "${SEARCH_TERM}" "${DIRECTORY}" > result2.txt
# What if we want to check for multiple terms?
grep -r -e "${SEARCH_TERM}" -e "Publishing" "${DIRECTORY}" > result3.txt
# What about find?
# -print0 and xargs -0 keep file names that contain spaces intact
find "${DIRECTORY}" -type f -print0 | xargs -0 grep "${SEARCH_TERM}" > result4.txt
# What about find and looking for the string inside of a specific type of content?
find "${DIRECTORY}" -type f -name "*.xml" ! -name "*.css" -print0 | xargs -0 grep "${SEARCH_TERM}" > result5.txt
# Can this also be achieved with wildcards and subshell?
grep "${SEARCH_TERM}" $(ls -R "${DIRECTORY}"*.{html,txt}) > result6.txt
RES=$?
if [ ${RES} -eq 0 ]; then
    echo "We found results!"
else
    echo "It broke - it shouldn't happen (Packt is everywhere)!"
fi
# Or for bonus points - a personal favorite
history | grep "ls" # This is really handy to find commands you ran yesterday!
# Aaaannnd the lesson is:
echo "We can do a lot with grep!"
exit 0
Notice the use of ~/* in the script. The ~/ refers to our home directory, and the * wildcard allows us to specify anything from that point on. There will be more on the concept of wildcards and regexes later in this chapter.
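The difference between a wildcard and a recursive search can be sketched in a throwaway directory. The shell expands * before grep starts, and only one level deep, whereas grep -r and find both descend into subdirectories. The directory and file names below are invented for the example.

```bash
#!/bin/bash
# All paths are created in a temporary sandbox invented for this example.
SANDBOX=$(mktemp -d)
mkdir -p "${SANDBOX}/sub dir"
echo "Packt at the top" > "${SANDBOX}/top.txt"
echo "Packt further down" > "${SANDBOX}/sub dir/deep file.txt"

# The wildcard matches top.txt and "sub dir", but grep does not descend
TOP_ONLY=$(( $(grep -l "Packt" "${SANDBOX}"/* 2> /dev/null | wc -l) ))

# -r tells grep to walk the whole tree itself
RECURSIVE=$(( $(grep -rl "Packt" "${SANDBOX}" | wc -l) ))

# find hands NUL-separated names to xargs, so spaces in names are safe
WITH_FIND=$(( $(find "${SANDBOX}" -type f -print0 | xargs -0 grep -l "Packt" | wc -l) ))
rm -rf "${SANDBOX}"

echo "wildcard: ${TOP_ONLY}, grep -r: ${RECURSIVE}, find: ${WITH_FIND}"
# Prints: wildcard: 1, grep -r: 2, find: 2
```

The -l option makes grep print only the names of matching files, which keeps the counts easy to compare.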
- If you remain in your home directory (~/) and run the script, the output should be similar to the following:
$ bash search.sh; ls -lah result*.txt
We found results!
We can do a lot with grep!
-rw-rw-r-- 1 rbrash rbrash 0 Nov 14 14:33 result1.txt
-rw-rw-r-- 1 rbrash rbrash 1.2M Nov 14 14:33 result2.txt
-rw-rw-r-- 1 rbrash rbrash 1.2M Nov 14 14:33 result3.txt
-rw-rw-r-- 1 rbrash rbrash 1.2M Nov 14 14:33 result4.txt
-rw-rw-r-- 1 rbrash rbrash 33 Nov 14 14:33 result5.txt
-rw-rw-r-- 1 rbrash rbrash 14K Nov 14 14:33 result6.txt
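The "We found results!" line depends on the exit status of grep, and in the script RES only captures the status of the last grep (the one that writes result6.txt). As a rough sketch of what those status values mean, the following uses a temporary sample file invented for the example:

```bash
#!/bin/bash
# SAMPLE is a temporary file created only for this illustration.
SAMPLE=$(mktemp)
echo "Packt Publishing" > "${SAMPLE}"

# -q suppresses output; only the exit status matters here
grep -q "Packt" "${SAMPLE}"
FOUND=$?      # 0: at least one line matched

grep -q "NoSuchTerm" "${SAMPLE}"
MISSING=$?    # 1: no line matched

grep -q "Packt" "${SAMPLE}.does-not-exist" 2> /dev/null
ERROR=$?      # greater than 1: an error, such as an unreadable file

rm -f "${SAMPLE}"
echo "found=${FOUND} missing=${MISSING} error=${ERROR}"
```

Because $? is overwritten by every command, it has to be saved immediately after the grep you care about, which is what the script does with RES.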
This section is a bit of a doozy because we are leading up to another ...