In this tutorial, we’ll show, with examples you can copy and paste right into your shell, how to use sed to automate mind-numbing tasks. Automating mind-numbing tasks is not only good for your emotional health, it’ll also make you more accurate and productive.
macOS (BSD) disclaimer
If you’re using a Mac, the -i flag requires an argument: either a backup extension (like .bak), which creates backup files that then have to be removed, or an empty string ('') to skip the backup. If you’re using Linux, everything will work as is.
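If you want one command that works on both platforms, a small wrapper is one option. This is only a sketch: sed_inplace is a hypothetical helper name, and it assumes that "sed --version" succeeds on GNU sed and fails on BSD sed.

```shell
# Hypothetical helper: edit files in place, without backups, on GNU or BSD sed.
# Assumption: only GNU sed accepts --version.
sed_inplace() {
  if sed --version >/dev/null 2>&1; then
    sed -i "$@"       # GNU sed: -i needs no argument
  else
    sed -i '' "$@"    # BSD/macOS sed: pass an empty backup suffix
  fi
}

# Try it on a scratch file.
scratch=$(mktemp)
printf 'hello world\n' > "$scratch"
sed_inplace 's/world/sed/' "$scratch"
cat "$scratch"
```

With something like this in your shell profile, the sed -i examples below could be written with sed_inplace instead.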
The Examples
Backup a list of files
Example usage
ls | sed "s/.*/cp & &.bak/" | bash
Reusable script
#!/usr/bin/env bash
sed "s/.*/cp & &.bak/"
Explanation
This example is interesting because it uses sed to generate bash code which is executed by bash.
Imagine the input is a single line containing the text "file.txt"
Then sed finds .* which means the entire input value "file.txt".
It then replaces it with "cp & &.bak".
& is a special value which means the entire match, which here is file.txt.
So the answer is: "cp file.txt file.txt.bak"
If the input is:
1.txt
2.txt
3.txt
The output is
cp 1.txt 1.txt.bak
cp 2.txt 2.txt.bak
cp 3.txt 3.txt.bak
Which is a shell script that we can execute by piping into bash.
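To try this without touching real files, here is a sketch that runs the same command in a throwaway directory. The directory and file names are made up for the demo; the sed command is the one from the example.

```shell
# Work in a throwaway directory so nothing real is touched.
bak_dir=$(mktemp -d)
touch "$bak_dir/1.txt" "$bak_dir/2.txt" "$bak_dir/3.txt"

# Step 1: only generate the script and look at it.
generated=$(cd "$bak_dir" && ls | sed "s/.*/cp & &.bak/")
printf '%s\n' "$generated"

# Step 2: once it looks right, pipe the same text into bash.
(cd "$bak_dir" && printf '%s\n' "$generated" | bash)
ls "$bak_dir"
```

Printing the generated script before running it is a cheap dry run: nothing is copied until the text is piped into bash.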
Why sed is better than bash loops
This could also be done with bash loops and the code is even a bit easier to read:
#!/usr/bin/env bash
# Read file names from stdin and back each one up.
for file in $(cat); do
  cp "${file}" "${file}.bak"
done
So why use sed? Because sed outputs text. Notably, if we remove "| bash" from our program, it outputs a very easy-to-read bash program.
A program that outputs text is incredibly powerful because text is easy to read/write by hand and it’s easy to read/write programs that read/write text. This sounds obvious, but is hugely important and we’ll discuss it in more depth next week.
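Because the intermediate result is plain text, it’s also easy to spot and fix problems in it. For example, the generated cp commands break on file names that contain spaces. A minimal sketch of one fix, assuming the names contain no double quotes, dollar signs, or backticks, is to have sed wrap each name in quotes:

```shell
# A scratch directory with one awkward file name.
space_dir=$(mktemp -d)
touch "$space_dir/my notes.txt"

# Same idea as before, but the generated command quotes both arguments.
quoted=$(cd "$space_dir" && ls | sed 's/.*/cp "&" "&.bak"/')
printf '%s\n' "$quoted"

(cd "$space_dir" && printf '%s\n' "$quoted" | bash)
ls "$space_dir"
```

The first printf shows the line cp "my notes.txt" "my notes.txt.bak", which bash then runs as a single copy.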
find and replace for all files in a directory
Making a variable change in a dynamic programming language? Need to fix every file in the directory? You could use sublime and scroll through the list. Or…
Example Usage
sed -i "s/find/replace/g" *
Reusable Script
#!/usr/bin/env bash
FIND="$1"
REPLACE="$2"
sed -i "s/${FIND}/${REPLACE}/g" *
Explanation
By combining sed with shell file globbing, we can easily run sed to perform find and replace across all files in a directory. That’s a lot more fun than running this. For. Every. File.
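Here is a small sketch you can run in a scratch directory to see the glob version in action. It uses -i.bak (suffix attached, no space), which both GNU and BSD sed accept, and then removes the backups; the file names and contents are made up for the demo.

```shell
# Scratch directory with two small files.
glob_dir=$(mktemp -d)
printf 'find me\nnothing here\n' > "$glob_dir/a.txt"
printf 'find find\n' > "$glob_dir/b.txt"

# The shell expands * to every file in the directory before sed runs.
(cd "$glob_dir" && sed -i.bak "s/find/replace/g" * && rm -- *.bak)

cat "$glob_dir/a.txt" "$glob_dir/b.txt"
```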
find-and-replace (for a list of files)
The above code runs find and replace for all files in a directory. But what about other lists of files? We’ll need something more powerful.
Example Usage
ls | sed "s#.*#sed -i \"s/find/replace/g\" &#" | bash
Reusable script
#!/usr/bin/env bash
FIND="$1"
REPLACE="$2"
sed "s#.*#sed -i \"s/${FIND}/${REPLACE}/g\" &#"
Explanation
This code combines using sed -i to find and replace on a file with using sed to generate bash code as we did in the first example. If the files are (1.txt, 2.txt, 3.txt), FIND="find" and REPLACE="replace" the generated bash is:
sed -i "s/find/replace/g" 1.txt
sed -i "s/find/replace/g" 2.txt
sed -i "s/find/replace/g" 3.txt
What’s nice about this technique is that it works for any list of files arriving on stdin. File globbing, like we used for all files in a directory, is limited in its expressiveness. But as long as we can somehow generate the list of files as text, this program can create a bash script which performs find and replace on exactly those files.
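As an illustration of "any list of files", the sketch below feeds the generator from grep -l, which prints only the names of files that contain a match. The function find_and_replace is a hypothetical stand-in for the reusable script above, and it uses -i.bak instead of -i so the generated commands also run on macOS.

```shell
# Hypothetical function form of the reusable script (uses -i.bak for portability).
find_and_replace() {
  sed "s#.*#sed -i.bak \"s/$1/$2/g\" &#"
}

# Scratch files: only 1.txt and 3.txt contain the word "find".
list_dir=$(mktemp -d)
printf 'find\n'  > "$list_dir/1.txt"
printf 'other\n' > "$list_dir/2.txt"
printf 'find\n'  > "$list_dir/3.txt"

# grep -l produces the list of files; sed turns it into a script.
script=$(cd "$list_dir" && grep -l find *.txt | find_and_replace find replace)
printf '%s\n' "$script"

# Run the generated script.
(cd "$list_dir" && printf '%s\n' "$script" | bash)
```

Only 1.txt and 3.txt appear in the generated script, so 2.txt is never opened or rewritten.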
Trailing space cleaner
Ever get dinged by git for having trailing spaces? And then you have to go back. Edit the file. Go to the line. Delete the whitespace. git add -u. Ugh, no thanks.
Example usage
find -type f | find-and-replace " *$" "" | bash
Reusable core
find-and-replace " *$" ""
Explanation
The key is a good choice of regular expression:
" *$"
This matches all spaces at the end of a line. It’s replaced by "", the empty string, meaning all trailing spaces are deleted. And by using find -type f to select all files in a folder, including files in nested folders, sed will operate on every single file. To top it off, you can make this part of a lint-fix Makefile target and never have to manually clean up trailing whitespace again.
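To see what the regular expression does before pointing it at real files, here is a sketch that runs it on a stream. The second command is a variation that is not part of the original: it swaps the space for the POSIX class [[:space:]] so that trailing tabs are removed too.

```shell
# Strip trailing spaces only, as in the example above.
cleaned=$(printf 'keep me   \nno trailing\n  indented  \n' | sed 's/ *$//')
# Append a | to each line so the line endings are visible.
printf '%s\n' "$cleaned" | sed 's/$/|/'

# Variation (an assumption, not from the original): also strip trailing tabs.
cleaned_tabs=$(printf 'a \t\nb\t \n' | sed 's/[[:space:]]*$//')
printf '%s\n' "$cleaned_tabs" | sed 's/$/|/'
```

Note that leading indentation is left alone, because the pattern is anchored to the end of the line with $.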
No trailing newline fixer
Example Usage
sed -z -i "s/[^\n]$/&\n/" file
Explanation
This uses two tricks: a cute hack to make the end of sed’s line the end of the file and a good choice of regular expression.
First, -z tells sed to split its input on NUL characters instead of newlines. A regular text file contains no NULs, so sed treats the whole file as one line, and $ matches not the end of any line but the end of the whole file.
Second, the regular expression [^\n]$ therefore matches any non-newline character at the end of the file. If the last character of the file isn’t a newline, this script replaces it with "&\n", the last character plus a newline, effectively adding a newline.
If the file already ends with a newline, the expression finds no match and sed does nothing.
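A quick way to convince yourself is to run the command on one file that lacks the final newline and one that already has it, then compare byte counts. This sketch assumes GNU sed, since -z and \n inside a bracket expression are GNU extensions; the file names are made up.

```shell
# Two scratch files: one missing its final newline, one already correct.
nl_dir=$(mktemp -d)
printf 'no newline at end' > "$nl_dir/missing.txt"   # 17 bytes
printf 'already fine\n'    > "$nl_dir/ok.txt"        # 13 bytes

# GNU sed only: append a newline when the file does not end with one.
sed -z -i 's/[^\n]$/&\n/' "$nl_dir/missing.txt" "$nl_dir/ok.txt"

# missing.txt grows by one byte; ok.txt is unchanged.
missing_size=$(wc -c < "$nl_dir/missing.txt" | tr -d ' ')
ok_size=$(wc -c < "$nl_dir/ok.txt" | tr -d ' ')
echo "$missing_size $ok_size"
```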
Copy lines of a file into the clipboard
Working on a program and need to copy paste it into this week’s blog post? Yeah, me too. Using a mouse to select lines of text is tedious. So is navigating a file with a keyboard. This is what I do:
# linux
seq 100 | sed -n 20,30p | xsel --clipboard --input
# macOS
seq 100 | sed -n 20,30p | pbcopy
Explanation
xsel and pbcopy take stdin and write it to the clipboard. By taking the exact output we need and writing it to the clipboard, we can avoid the mind-numbing work of selecting things with a mouse or navigating with a keyboard.
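The interesting part is sed -n with a line range: -n turns off automatic printing and the p command prints only lines 20 through 30. A small sketch of wrapping that in a function follows; print_lines is a hypothetical name, and the numbered scratch file stands in for your source file.

```shell
# Hypothetical helper: print lines START through END of FILE.
print_lines() {
  sed -n "${1},${2}p" "$3"
}

# A numbered scratch file makes it easy to check the range.
range_file=$(mktemp)
seq 100 > "$range_file"

selected=$(print_lines 20 22 "$range_file")
printf '%s\n' "$selected"
```

From there, piping print_lines into xsel --clipboard --input or pbcopy gives the same result as the commands above.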
Facebook friend data cleaning
Data cleaning is incredibly mind-numbing. It’s also hard to automate, so the fact that sed can help with it is especially valuable.
Example Usage
Note that this requires dsv-to-json from the json-toolkit to work properly.
echo 'John Doe
Friend
Software engineer at BigCorp21 mutual friends
Jane Joe
Friend
Miami, Florida7 mutual friends
Ploni Blum
Friend
Director at SuperstoreLives in New York, New York' |
sed \
  -e '/^Friend$/d' \
  -e 's/\([0-9]*\) mutual friends/:friends:\1/' \
  -e 's/Lives in /:location:/' \
  -e '/at/ s/^/job:/' \
  -e '/^[^:]*$/ s/.*/name:&:/' |
xargs -L 2 |
sed 's/: /:/g' |
dsv-to-json : | jq '.'
Reusable Script
sed \
  -e 's/\([0-9]*\) mutual friends/:friends:\1/' \
  -e 's/Lives in /:location:/' \
  -e '/at/ s/^/job:/' \
  -e '/^Friend$/d' \
  -e '/^[^:]*$/ s/.*/name:&:/' |
xargs -L 2 |
sed 's/: /:/g' |
dsv-to-json : | jq '.'
Explanation
While facebook friend page scraping might not be your specific data cleaning problem, this is one you can try at home and showcases general techniques you can use to use sed to clean your own messy data.
If we copy and paste directly from facebook, it looks like
John Doe
Friend
Software engineer at BigCorp21 mutual friends
Jane Joe
Friend
Miami, Florida7 mutual friends
Ploni Blum
Friend
Director at SuperstoreLives in New York, New York
This data is dirty. Numbers are next to letters. Words are connected. We need to clean it up into nice json data. By hand, this would be the ultimate tedious task. With sed, it’s fun. Overall our approach will be to first make the data labeled DSV and then convert it to a json array of arrays. Then we can use jq or python to process the arrays into object/dictionaries with labeled fields.
-e 's/\([0-9]*\) mutual friends/:friends:\1/'
First, if we see any numbers followed by " mutual friends" (the leading space is important), we know it’s a number of friends. So we can use sed to extract the number and preface it with :friends: so the data is somewhat DSV.
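The \( \) pair captures the digits and \1 puts them back in the replacement. Here is a sketch that runs just this one filter on two of the sample lines, so you can see that a line without a friend count passes through untouched:

```shell
# The capture group grabs the digits glued to the end of the job.
friends_line=$(printf '%s\n' 'Software engineer at BigCorp21 mutual friends' |
  sed 's/\([0-9]*\) mutual friends/:friends:\1/')
printf '%s\n' "$friends_line"

# No " mutual friends" on this line, so nothing changes.
no_friends_line=$(printf '%s\n' 'Director at SuperstoreLives in New York, New York' |
  sed 's/\([0-9]*\) mutual friends/:friends:\1/')
printf '%s\n' "$no_friends_line"
```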
-e 's/Lives in /:location:/'
Next, facebook uses "Lives in" as a preface for some addresses. That’s useful; we just need to make it more DSV-friendly.
-e '/at/ s/^/job:/'
Any line containing "at" starts with a job, so let’s add the job: label to the start of the line.
-e '/^Friend$/d'
Some lines are just the word Friend. They’re useless, so let’s delete them.
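Both of these filters put an address (a regular expression between slashes) in front of the command, so the command only runs on matching lines. A minimal sketch on three made-up input lines:

```shell
# /^Friend$/d deletes lines that are exactly "Friend";
# /at/ s/^/job:/ prefixes only the lines that contain "at".
addressed=$(printf 'Friend\nDirector at Superstore\nMiami, Florida\n' |
  sed -e '/^Friend$/d' -e '/at/ s/^/job:/')
printf '%s\n' "$addressed"
```

The Friend line disappears, the job line gains its label, and the location line is left alone because it doesn’t match either address.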
-e '/^[^:]*$/ s/.*/name:&:/'
If we run the script using the sed filters up to this point, we see that the lines which don’t yet contain a : are names, so let’s label them as name. At this point our data looks like:
name:John Doe:
job:Software engineer at BigCorp:friends:21
name:Jane Joe:
Miami, Florida:friends:7
name:Ploni Blum:
job:Director at Superstore:location:New York, New York
Okay, not bad, the data is structured, but records are split across two lines. The first line always ends with a : and the 2nd doesn’t, so we could use sed -z.
sed -z 's/:\n/:/g'
But we won’t. There’s another nice tool for grouping lines: xargs. We’ll use it instead because it’s less hacky. Running "xargs -L 2" gives:
name:John Doe: job:Software engineer at BigCorp:friends:21
name:Jane Joe: Miami, Florida:friends:7
name:Ploni Blum: job:Director at Superstore:location:New York, New York
That space after the colon after the name is looking a little weird, let’s clean it up.
sed 's/: /:/g'
Alright! We’ve got clean DSV data.
name:John Doe:job:Software engineer at BigCorp:friends:21
name:Jane Joe:Miami, Florida:friends:7
name:Ploni Blum:job:Director at Superstore:location:New York, New York
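The grouping and cleanup steps can be tried in isolation. The sketch below joins every two lines with xargs -L 2 and then removes the space after the colon. It also shows paste as an alternative, which is a swapped-in tool rather than part of the original pipeline; it may be the safer choice if your data contains quotes or apostrophes, since xargs gives those special meaning.

```shell
# Four labeled lines, i.e. two records split across two lines each.
sample='name:John Doe:
job:Software engineer at BigCorp:friends:21
name:Jane Joe:
Miami, Florida:friends:7'

# Join every two lines, then drop the space xargs leaves after the colon.
joined=$(printf '%s\n' "$sample" | xargs -L 2 | sed 's/: /:/g')
printf '%s\n' "$joined"

# Alternative (not in the original): paste reads two lines per output row.
joined_paste=$(printf '%s\n' "$sample" | paste -d ' ' - - | sed 's/: /:/g')
printf '%s\n' "$joined_paste"
```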
Now that it’s clean DSV, we can use dsv-to-json to make it json
dsv-to-json : | jq '.'
Which gives us json data.
[
  [
    "name",
    "John Doe",
    "job",
    "Software engineer at BigCorp",
    "friends",
    "21"
  ],
  [
    "name",
    "Jane Joe",
    "Miami, Florida",
    "friends",
    "7"
  ],
  [
    "name",
    "Ploni Blum",
    "job",
    "Director at Superstore",
    "location",
    "New York, New York"
  ]
]
Now to structure the data, we can use a quick python program:
#!/usr/bin/env python3
import json
import sys

KEYS = ["name", "job", "location", "friends"]

def parse(row):
    # Start with every key set to None so missing fields become null in the JSON.
    output = {}
    for key in KEYS:
        output[key] = None
    i = 0
    while i < len(row):
        if row[i] in KEYS and i + 1 < len(row):
            # A label followed by its value.
            output[row[i]] = row[i + 1]
            i += 2
        elif row[i] in KEYS and i + 1 == len(row):
            raise Exception("Key found on last row: {}".format(row))
        elif row[i] not in KEYS and "," in row[i]:
            # An unlabeled field containing a comma is treated as a location.
            output["location"] = row[i]
            i += 1
        else:
            raise Exception("IncompleteIfTree: {}".format(row))
    return output

def L(data):
    return [parse(r) for r in data]

print(json.dumps(L(json.load(sys.stdin))))
The above code is straightforward business logic, so we won’t go over it. Notice, however, that this is an RLW which uses a Faster If Statement. The output from this code is what we really want:
[
  {
    "name": "John Doe",
    "job": "Software engineer at BigCorp",
    "location": null,
    "friends": "21"
  },
  {
    "name": "Jane Joe",
    "job": null,
    "location": "Miami, Florida",
    "friends": "7"
  },
  {
    "name": "Ploni Blum",
    "job": "Director at Superstore",
    "location": "New York, New York",
    "friends": null
  }
]
Conclusion
We’ve seen how, across many domains, sed is a critical tool in automating tedious tasks. Whether it’s cleaning up files, cleaning up data, or generating code, sed’s a must-have for automating these processes. Next week, we’ll dig into a deep idea in fast coding: the importance of textual data. Textual data is incredibly fast to write programs against and there’s a multitude of prewritten tools to make this even faster. If you don’t want to miss this important lesson, just click the Subscribe now button below.
You should never use ls in your scripts. Use find instead.