How to use sed to automate mind-numbing tasks (with examples)

Aug 18, 2020

In this tutorial, we’ll show how you can use sed to automate mind-numbing tasks, with examples that you can copy and paste right into your shell. Automating mind-numbing tasks is not only good for your emotional health, it’ll also make you more accurate and productive.

macOS (BSD) disclaimer

If you’re using a Mac, the -i flag requires an argument: either a backup suffix (like .bak), which creates backup files that then have to be removed, or an empty string ('') for no backup. If you’re using Linux (GNU sed), everything will work as is.
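One way to sidestep the difference, sketched here under the assumption that a temporary .bak file next to the original is acceptable: attach the suffix directly to -i, a form both GNU and BSD sed accept, and delete the backup afterwards. The function name portable_sed_inplace is made up for this example.

```shell
# Usage: portable_sed_inplace SED_SCRIPT FILE
# Runs an in-place edit with an attached backup suffix, then removes the backup.
portable_sed_inplace() {
  sed -i.bak "$1" "$2" && rm -f "$2.bak"
}

# Demo on a throwaway file.
demo_file="$(mktemp)"
printf 'hello world\n' > "$demo_file"
portable_sed_inplace 's/world/sed/' "$demo_file"
cat "$demo_file"   # prints: hello sed
rm -f "$demo_file"
```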

The Examples

Backup a list of files

Example usage

ls | sed "s/.*/cp & &.bak/" | bash

Reusable script

#!/usr/bin/env bash

sed "s/.*/cp & &.bak/"

Explanation

This example is interesting because it uses sed to generate bash code which is executed by bash.

Imagine the input is a single line containing the text "file.txt"

Then sed finds .* which means the entire input value "file.txt".

It then replaces it with "cp & &.bak".

& is a special value which means the entire found match, which is file.txt.

So the answer is: "cp file.txt file.txt.bak"

If the input is:

1.txt
2.txt
3.txt

The output is

cp 1.txt 1.txt.bak
cp 2.txt 2.txt.bak
cp 3.txt 3.txt.bak

Which is a shell script that we can execute by piping into bash.
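A small variation on the same idea, offered as a sketch rather than a replacement: wrapping each & in single quotes keeps file names with spaces intact, and leaving off "| bash" gives a dry run you can read first. The function name generate_backup_commands is made up for this example, and names that themselves contain a single quote would still break it.

```shell
# Turn each input line (a file name) into a quoted cp command.
generate_backup_commands() {
  sed "s/.*/cp '&' '&.bak'/"
}

# Dry run: print the generated script instead of executing it.
printf '%s\n' "1.txt" "my notes.txt" | generate_backup_commands
# prints:
# cp '1.txt' '1.txt.bak'
# cp 'my notes.txt' 'my notes.txt.bak'
```

Once the output looks right, appending "| bash" executes it.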

Why sed is better than bash loops

This could also be done with bash loops and the code is even a bit easier to read:

#!/usr/bin/env bash

for file in $(cat); do
  cp "${file}" "${file}.bak"
done

So why use sed? Because sed outputs text. Notably, if we remove "| bash" from our program, it outputs a very easy-to-read bash program.

A program that outputs text is incredibly powerful because text is easy to read/write by hand and it’s easy to read/write programs that read/write text. This sounds obvious, but is hugely important and we’ll discuss it in more depth next week.

find and replace for all files in a directory

Making a variable change in a dynamic programming language? Need to fix every file in the directory? You could use sublime and scroll through the list. Or…

Example Usage

sed -i "s/find/replace/g" *

Reusable Script

#!/usr/bin/env bash

FIND="$1"
REPLACE="$2"

sed -i "s/${FIND}/${REPLACE}/g" *

Explanation

By combining sed with shell file globbing, we can easily run sed to perform find and replace across all files in a directory. That’s a lot more fun than running this. For. Every. File.

find-and-replace (for a list of files)

The above code runs find and replace for all files in a directory. But what about other lists of files? We’ll need something more powerful.

Example Usage

ls | sed "s#.*#sed -i \"s/find/replace/g\" &#" | bash

Reusable script

#!/usr/bin/env bash

FIND="$1"
REPLACE="$2"

sed "s#.*#sed -i \"s/${FIND}/${REPLACE}/g\" &#"

Explanation

This code combines using sed -i to find and replace on a file with using sed to generate bash code as we did in the first example. If the files are (1.txt, 2.txt, 3.txt), FIND="find" and REPLACE="replace", the generated bash is:

sed -i "s/find/replace/g" 1.txt
sed -i "s/find/replace/g" 2.txt
sed -i "s/find/replace/g" 3.txt

What’s nice about this technique is that it works for any list of files arriving on stdin. File globbing, like we used for all files in a directory, is limited in its expressiveness. But as long as we can somehow generate the list of files as text, this program can create a bash script which performs find and replace on exactly those files.
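To make that concrete, here is a sketch of the same generator written as a shell function (find_and_replace is a hypothetical name, and the file names are made up). Printing the result without "| bash" shows exactly what would run. It assumes the two arguments contain no /, # or & characters, since those would change the meaning of the sed expressions.

```shell
# Hypothetical function form of the find-and-replace script.
# $1 is the pattern to find, $2 is the replacement text.
find_and_replace() {
  sed "s#.*#sed -i \"s/$1/$2/g\" &#"
}

# Any command that prints one file name per line can feed it.
printf '%s\n' notes.md todo.md | find_and_replace "colour" "color"
# prints:
# sed -i "s/colour/color/g" notes.md
# sed -i "s/colour/color/g" todo.md
```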

Trailing space cleaner

Ever get dinged by git for having trailing spaces? And then you have to go back. Edit the file. Go to the line. Delete the whitespace. git add -u. Ugh, no thanks.

Example usage

find . -type f | find-and-replace " *$" "" | bash

Reusable core

find-and-replace " *$" ""

Explanation

The key is a good choice of regular expression

" *$"

This matches all spaces at the end of a line. It’s replaced by "", the empty string, meaning all trailing spaces are deleted. And by using find . -type f to select all files in a folder, including files in nested folders, sed will operate on every single file. To top it off, you can make this part of a lint-fix makefile target and never have to manually clean up trailing whitespace again.
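The expression above only removes spaces. If trailing tabs should go too, one possible variation (a sketch, with a made-up function name) swaps the space for the POSIX character class [[:space:]]:

```shell
# Strip trailing spaces and tabs from every line on stdin.
strip_trailing_whitespace() {
  sed 's/[[:space:]]*$//'
}

# The first line ends in two spaces, the second in a tab.
printf 'one  \ntwo\t\nthree\n' | strip_trailing_whitespace
# prints one, two and three with nothing after each word
```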

No trailing newline fixer

Example Usage

sed -z -i "s/[^\n]$/&\n/" file

Explanation

This uses two tricks: a cute hack to make the end of sed’s line the end of the file and a good choice of regular expression.

First, by using -z on a regular file, $ matches not the end of any line but the end of the whole file.

Second, as such, the regular expression: [^\n]$ matches any non-newline character at the end of the file. If the end of the file isn’t a newline character, this script will replace it with "&\n", the last character plus a newline, effectively adding a newline.

If the end of file is a newline, it doesn’t find a match and sed does nothing.
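The -z option is a GNU sed extension. On a system without it, a different technique can do the same job without sed: check the last byte with tail and append a newline only when needed. This is an alternative sketch, not the method above, and the function name ensure_trailing_newline is made up for this example.

```shell
# Append a newline to the file in $1 only if its last byte is not one.
ensure_trailing_newline() {
  # $(...) strips trailing newlines, so the result is non-empty only when
  # the file has a last byte and that byte is not a newline.
  if [ -n "$(tail -c 1 "$1")" ]; then
    printf '\n' >> "$1"
  fi
}

# Demo on a throwaway file.
demo_file="$(mktemp)"
printf 'no newline at end' > "$demo_file"
ensure_trailing_newline "$demo_file"
ensure_trailing_newline "$demo_file"   # the second call changes nothing
wc -l < "$demo_file"                   # reports one line
rm -f "$demo_file"
```

Empty files are left alone, because there is no last byte to test.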

Copy lines of a file into the clipboard

Working on a program and need to copy paste it into this week’s blog post? Yeah, me too. Using a mouse to select lines of text is tedious. So is navigating a file with a keyboard. This is what I do:

# linux
seq 100 | sed -n 20,30p | xsel --clipboard --input
# macOS
seq 100 | sed -n 20,30p | pbcopy

Explanation

xsel/pbcopy takes stdin and writes it to the clipboard. By taking the exact output we need and writing it to the clipboard, we can avoid the mind numbing work of selecting things with a mouse or navigating with a keyboard.
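If you select line ranges often, the sed part can be wrapped in a tiny helper. The sketch below uses a made-up name, line_range, and adds a q command so sed stops reading once the last wanted line has been printed. Pipe its output into xsel or pbcopy as above.

```shell
# Print lines $1 through $2 of stdin, then quit.
line_range() {
  sed -n "$1,$2p;$2q"
}

seq 100 | line_range 20 22
# prints 20, 21 and 22 on separate lines
```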

Facebook friend data cleaning

Data cleaning is incredibly mind-numbing. However, it’s also hard to automate, so it’s a pleasant surprise that sed can help with it.

Example Usage

Note that this requires dsv-to-json from the json-toolkit to work properly.

echo 'John Doe
Friend
Software engineer at BigCorp21 mutual friends
Jane Joe
Friend
Miami, Florida7 mutual friends
Ploni Blum
Friend
Director at SuperstoreLives in New York, New York' |
sed \
-e '/^Friend$/d' \
-e 's/\([0-9]*\) mutual friends/:friends:\1/' \
-e 's/Lives in /:location:/' \
-e '/at/ s/^/job:/' \
-e '/^[^:]*$/ s/.*/name:&:/' |
xargs -L 2 |
sed 's/: /:/g' |
dsv-to-json : | jq '.'

Reusable Script

sed \
-e 's/\([0-9]*\) mutual friends/:friends:\1/' \
-e 's/Lives in /:location:/' \
-e '/at/ s/^/job:/' \
-e '/^Friend$/d' \
-e '/^[^:]*$/ s/.*/name:&:/' |
xargs -L 2 |
sed 's/: /:/g' |
dsv-to-json : | jq '.'

Explanation

While facebook friend page scraping might not be your specific data cleaning problem, this is one you can try at home, and it showcases general techniques for cleaning your own messy data with sed.

If we copy and paste directly from facebook, it looks like

John Doe
Friend
Software engineer at BigCorp21 mutual friends
Jane Joe
Friend
Miami, Florida7 mutual friends
Ploni Blum
Friend
Director at SuperstoreLives in New York, New York

This data is dirty. Numbers are next to letters. Words are connected. We need to clean it up into nice json data. By hand, this would be the ultimate tedious task. With sed, it’s fun. Overall, our approach will be to first turn the data into labeled DSV and then convert it to a json array of arrays. Then we can use jq or python to process the arrays into objects/dictionaries with labeled fields.

-e 's/\([0-9]*\) mutual friends/:friends:\1/'

First, if we see any numbers followed by " mutual friends" (the leading space is important), we know it’s a number of friends. So we can use sed to extract the number and preface it with :friends: so the data is somewhat DSV.
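To watch the capture group at work on a single line, this sketch wraps that one expression in a function (label_friend_count is a made-up name). The \( \) pair captures the digits and \1 writes them back after the label.

```shell
# Replace "<digits> mutual friends" with ":friends:<digits>".
label_friend_count() {
  sed 's/\([0-9]*\) mutual friends/:friends:\1/'
}

echo 'Miami, Florida7 mutual friends' | label_friend_count
# prints: Miami, Florida:friends:7
```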

-e 's/Lives in /:location:/'

Next, facebook uses "Lives in" as a preface for some addresses. That’s useful; we just need to make it more DSV friendly.

-e '/at/ s/^/job:/'

Any line containing "at" starts with a job, so let’s add the job: label at the start of the line.

-e "/^Friend$/d"

Some lines are just the word Friend. Useless, so let’s delete them.

-e '/^[^:]*$/ s/.*/name:&:/'

If we run the script using the sed filters up to this point, the only lines that don’t contain a : yet are names, so let’s label them as name. At this point our data looks like:

name:John Doe:
job:Software engineer at BigCorp:friends:21
name:Jane Joe:
Miami, Florida:friends:7
name:Ploni Blum:
job:Director at Superstore:location:New York, New York

Okay, not bad, the data is structured, but records are split across two lines. The first line always ends with a : and the 2nd doesn’t, so we could use sed -z.

sed -z 's/:\n/:/g'

But we won’t. There’s another nice tool for grouping lines: xargs. We’ll use it instead because it’s less hacky. Running "xargs -L 2" gives:

name:John Doe: job:Software engineer at BigCorp:friends:21
name:Jane Joe: Miami, Florida:friends:7
name:Ploni Blum: job:Director at Superstore:location:New York, New York

That space after the colon after the name is looking a little weird, let’s clean it up.

sed 's/: /:/g'

Alright! We’ve got clean DSV data.

name:John Doe:job:Software engineer at BigCorp:friends:21
name:Jane Joe:Miami, Florida:friends:7
name:Ploni Blum:job:Director at Superstore:location:New York, New York
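As an aside, the join and the cleanup can also be sketched in sed alone using the N command, which pulls the next line into the pattern space. This is an alternative to the xargs step, not what the script above does, and the function name join_record_pairs is made up. It assumes every record is exactly two lines.

```shell
# Join each pair of lines into one record.
join_record_pairs() {
  # N appends the next input line to the pattern space, separated by a
  # newline; the substitution then removes that newline after the colon.
  sed 'N;s/:\n/:/'
}

printf '%s\n' \
  'name:John Doe:' 'job:Software engineer at BigCorp:friends:21' \
  'name:Jane Joe:' 'Miami, Florida:friends:7' | join_record_pairs
# prints:
# name:John Doe:job:Software engineer at BigCorp:friends:21
# name:Jane Joe:Miami, Florida:friends:7
```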

Now that it’s clean DSV, we can use dsv-to-json to make it json

dsv-to-json : | jq '.'

Which gives us json data.

[
  [
    "name",
    "John Doe",
    "job",
    "Software engineer at BigCorp",
    "friends",
    "21"
  ],
  [
    "name",
    "Jane Joe",
    "Miami, Florida",
    "friends",
    "7"
  ],
  [
    "name",
    "Ploni Blum",
    "job",
    "Director at Superstore",
    "location",
    "New York, New York"
  ]
]

Now to structure the data, we can use a quick python program:

#!/usr/bin/env python3

import json
import sys

KEYS = ["name", "job", "location", "friends"]


def parse(row):
    # Start with every key present so missing fields come out as null in the json.
    output = {key: None for key in KEYS}
    i = 0
    while i < len(row):
        if row[i] in KEYS and i + 1 < len(row):
            # A label followed by its value.
            output[row[i]] = row[i + 1]
            i += 2
        elif row[i] in KEYS and i + 1 == len(row):
            raise Exception("Key found on last row: {}".format(row))
        elif row[i] not in KEYS and "," in row[i]:
            # An unlabeled field containing a comma is treated as a location.
            output["location"] = row[i]
            i += 1
        else:
            raise Exception("IncompleteIfTree: {}".format(row))
    return output


def L(data):
    return [parse(r) for r in data]


print(json.dumps(L(json.load(sys.stdin)), indent=2))

The above code is straightforward business logic, so we won’t go over it. Notice, however, that this is an RLW which uses a Faster If Statement. The output from this code is what we really want:

[
  {
    "name": "John Doe",
    "job": "Software engineer at BigCorp",
    "location": null,
    "friends": "21"
  },
  {
    "name": "Jane Joe",
    "job": null,
    "location": "Miami, Florida",
    "friends": "7"
  },
  {
    "name": "Ploni Blum",
    "job": "Director at Superstore",
    "location": "New York, New York",
    "friends": null
  }
]

Conclusion

We’ve seen how, across many domains, sed is a critical tool in automating tedious tasks. Whether it’s cleaning up files, cleaning up data, or generating code, sed’s a must-have for automating these processes. Next week, we’ll dig into a deep idea in fast coding: the importance of textual data. Textual data is incredibly fast to write programs against and there’s a multitude of prewritten tools to make this even faster. If you don’t want to miss this important lesson, just click the Subscribe now button below.

CodeFaster
