kae3g 9603: Shell Text Processing - grep, sed, awk Mastery

Phase 2: Core Systems & Tools | Week 6 | Reading Time: 18 minutes

What You'll Learn

grep: Pattern matching and searching
sed: Stream editing and text transformation
awk: Pattern scanning and processing language
Combining the power trio in pipelines
Regular expressions essentials
Real-world text processing workflows
When to use which tool

Prerequisites

9550: Command Line - Pipes, redirection
9560: Text Files - Plain text power
9601: Shell Scripting - Shell basics

The Power Trio

Unix philosophy (Essay 9510): "Do one thing well" + "Compose tools"

grep, sed, awk are the text processing champions:

grep: "Global Regular Expression Print" - search and filter
sed: "Stream Editor" - search and replace
awk: Pattern scanning language - extract and compute

Together: Process terabytes of logs, transform data, extract insights!

grep: Search and Filter

Basic Usage

# Search for pattern in file
grep "ERROR" /var/log/app.log

# Search in multiple files
grep "TODO" *.js

# Recursive search
grep -r "function" src/

# Case-insensitive
grep -i "warning" log.txt

Common Options

# -n: Show line numbers
grep -n "ERROR" log.txt
# Output: 42:ERROR: Connection failed

# -v: Invert match (lines NOT matching)
grep -v "DEBUG" log.txt  # Exclude debug lines

# -c: Count matches
grep -c "ERROR" log.txt  # Output: 15

# -l: List files with matches
grep -l "TODO" *.js  # Output: app.js utils.js

# -A, -B, -C: Context lines
grep -A 3 "ERROR" log.txt  # Show 3 lines after
grep -B 2 "ERROR" log.txt  # Show 2 lines before
grep -C 2 "ERROR" log.txt  # Show 2 lines before and after

Regular Expressions

Basic patterns:

# Literal string
grep "hello" file.txt

# Start of line (^)
grep "^ERROR" log.txt  # Lines starting with ERROR

# End of line ($)
grep "failed$" log.txt  # Lines ending with failed

# Any character (.)
grep "h.llo" file.txt  # Matches: hello, hallo, h3llo, etc.

# Zero or more (*)
grep "colou*r" file.txt  # Matches: color, colour

# One or more (\+, needs -E)
grep -E "10+" file.txt  # Matches: 10, 100, 1000, etc.

# Character class
grep "[0-9]" file.txt  # Lines with digits
grep "[A-Z]" file.txt  # Lines with uppercase
grep "[aeiou]" file.txt  # Lines with vowels

# Word boundary (\b, needs -E)
grep -E "\bcat\b" file.txt  # Matches "cat" but not "catch"

Extended Regex (-E)

# Alternation (|)
grep -E "ERROR|FATAL" log.txt  # ERROR or FATAL

# Optional (?)
grep -E "colou?r" file.txt  # color or colour

# Groups
grep -E "(ab)+" file.txt  # ab, abab, ababab, etc.

# Exactly n times {n}
grep -E "[0-9]{3}" file.txt  # Exactly 3 digits

# Range {n,m}
grep -E "[0-9]{3,5}" file.txt  # 3 to 5 digits

Real-World Examples

Extract IP addresses:

grep -Eo "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

Find email addresses:

grep -Eo "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" users.txt

Filter log levels:

# Only ERROR and FATAL
grep -E "^(ERROR|FATAL)" log.txt

# Everything except DEBUG and INFO
grep -Ev "^(DEBUG|INFO)" log.txt

sed: Stream Editing

Basic Substitution

# Replace first occurrence per line
sed 's/old/new/' file.txt

# Replace all occurrences (g flag)
sed 's/old/new/g' file.txt

# Edit file in-place
sed -i 's/old/new/g' file.txt  # Linux
sed -i '' 's/old/new/g' file.txt  # macOS (requires empty string)

Syntax: s/pattern/replacement/flags

Advanced Substitution

# Case-insensitive (I flag)
sed 's/error/ERROR/gI' file.txt

# Replace only on lines matching pattern
sed '/^ERROR/s/foo/bar/g' file.txt

# Delete lines
sed '/DEBUG/d' log.txt  # Delete lines with DEBUG

# Print only matching lines
sed -n '/ERROR/p' log.txt  # Like grep!

# Multiple commands
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Or use semicolon:
sed 's/foo/bar/g; s/baz/qux/g' file.txt

Capture Groups

# \1, \2 reference captured groups
sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\3\/\2\/\1/' dates.txt
# Transform: 2025-10-10 → 10/10/2025

# Swap words
echo "hello world" | sed 's/\(.*\) \(.*\)/\2 \1/'
# Output: world hello

Line Ranges

# Lines 10-20
sed -n '10,20p' file.txt

# From line 5 to end
sed -n '5,$p' file.txt

# Every 3rd line
sed -n '1~3p' file.txt

# Delete first line
sed '1d' file.txt

# Delete last line
sed '$d' file.txt

Real-World Examples

Remove comments:

sed 's/#.*$//' config.txt  # Remove # to end of line
sed '/^$/d' config.txt  # Remove blank lines

Add line numbers:

sed = file.txt | sed 'N;s/\n/\t/'

Extract between markers:

sed -n '/START/,/END/p' file.txt

Config file editing:

# Change port in nginx config
sed -i 's/listen 80/listen 8080/' /etc/nginx/nginx.conf

# Enable a commented setting
sed -i 's/^# *\(max_connections = \)/\1/' config.ini

awk: Pattern Scanning & Processing

Basic Syntax

# Print entire line
awk '{print}' file.txt

# Print specific field ($1 = first, $2 = second, etc.)
awk '{print $1}' file.txt

# Print multiple fields
awk '{print $1, $3}' file.txt

# Field separator (default: whitespace)
awk -F: '{print $1}' /etc/passwd  # Use : as separator

Patterns and Actions

# Pattern { action }
awk '/ERROR/ {print $0}' log.txt  # Like grep

# Only lines > 80 chars
awk 'length > 80' file.txt

# Print line number and content
awk '{print NR, $0}' file.txt

# Lines 10-20
awk 'NR>=10 && NR<=20' file.txt

Built-in Variables

NR    # Current line number
NF    # Number of fields in current line
$0    # Entire line
$1    # First field
$NF   # Last field
FS    # Field separator (input)
OFS   # Output field separator

Examples:

# Print last field
awk '{print $NF}' file.txt

# Print number of fields per line
awk '{print NF}' file.txt

# Swap first and last fields
awk '{temp=$1; $1=$NF; $NF=temp; print}' file.txt

Arithmetic

# Sum column
awk '{sum += $1} END {print sum}' numbers.txt

# Average
awk '{sum += $1; count++} END {print sum/count}' numbers.txt

# Min/Max
awk 'NR==1 {min=$1; max=$1} {if($1<min) min=$1; if($1>max) max=$1} END {print min, max}' numbers.txt

# Print with calculation
awk '{print $1, $2, $1*$2}' data.txt  # Print col1, col2, product

Conditionals

# if/else
awk '{if ($1 > 100) print "high"; else print "low"}' numbers.txt

# Ternary operator
awk '{print ($1 > 100) ? "high" : "low"}' numbers.txt

# Multiple conditions
awk '$1 > 100 && $2 < 50 {print}' data.txt

BEGIN and END

# BEGIN: Runs before processing
# END: Runs after processing

awk 'BEGIN {print "Starting..."} {print $1} END {print "Done!"}' file.txt

# CSV with header
awk 'BEGIN {FS=","; print "Name,Age"} {print $1, $2}' data.csv

# Statistics
awk 'BEGIN {count=0; sum=0} {sum+=$1; count++} END {print "Count:", count, "Average:", sum/count}' numbers.txt

Real-World Examples

Parse access logs:

# Extract IP and status code
awk '{print $1, $9}' access.log

# Count 404 errors
awk '$9 == 404 {count++} END {print count}' access.log

# Top 10 IPs
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

Process CSV:

# Extract columns 1 and 3
awk -F, '{print $1, $3}' data.csv

# Filter rows where column 2 > 100
awk -F, '$2 > 100 {print}' data.csv

# Convert CSV to TSV
awk -F, '{print $1 "\t" $2 "\t" $3}' data.csv
# Or:
awk 'BEGIN {FS=","; OFS="\t"} {print}' data.csv

System monitoring:

# Disk usage over 80%
df -h | awk '$5 > 80 {print $6, $5}'

# Process memory (top 5)
ps aux | awk 'NR>1 {print $6, $11}' | sort -rn | head -5

# Network connections by state
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c

Combining the Power Trio

Pipeline Patterns

grep → sed → awk:

# Extract errors, clean, summarize
grep ERROR log.txt | \
    sed 's/.*ERROR: //' | \
    awk '{count[$0]++} END {for (err in count) print count[err], err}' | \
    sort -rn

Example breakdown:

grep ERROR: Filter error lines
sed 's/.*ERROR: //': Remove everything before "ERROR: "
awk: Count unique errors
sort -rn: Sort by count (reverse numeric)

Real-World Workflows

Apache log analysis:

# Top 10 requested URLs
awk '{print $7}' access.log | \
    sort | \
    uniq -c | \
    sort -rn | \
    head -10

Failed login attempts:

grep "Failed password" /var/log/auth.log | \
    awk '{print $(NF-3)}' | \
    sort | \
    uniq -c | \
    sort -rn

Extract and transform data:

# From: Name: John, Age: 30
# To:   John,30

grep "Name:" data.txt | \
    sed 's/Name: \(.*\), Age: \(.*\)/\1,\2/'

Generate report:

#!/bin/bash

echo "=== System Report ==="
echo

echo "Top 5 Processes by Memory:"
ps aux | awk 'NR>1 {print $6, $11}' | sort -rn | head -5
echo

echo "Disk Usage:"
df -h | awk 'NR>1 && $5+0 > 0 {print $6, $5}'
echo

echo "Error Count (last hour):"
grep "$(date +%Y-%m-%d\ %H)" /var/log/syslog | \
    grep -c ERROR

When to Use Which

grep

Best for:

Finding files with specific content
Filtering log entries
Quick pattern matching
Binary "yes/no" searches

Example: "Which files contain 'TODO'?"

sed

Best for:

Search and replace
Line deletion/insertion
Simple transformations
In-place file editing

Example: "Change all 'http' to 'https'"

awk

Best for:

Column extraction
Arithmetic operations
Structured data processing
Complex logic and state

Example: "What's the average of column 3?"

Combination

Use together for multi-stage pipelines:

grep: Filter relevant lines
sed: Clean/transform
awk: Extract and compute

Try This

Exercise 1: Log Analysis

Given access.log:

192.168.1.1 - - [10/Oct/2025:13:55:36 -0700] "GET /api/users HTTP/1.1" 200 1234
192.168.1.2 - - [10/Oct/2025:13:55:37 -0700] "GET /api/posts HTTP/1.1" 404 567
192.168.1.1 - - [10/Oct/2025:13:55:38 -0700] "POST /api/login HTTP/1.1" 200 890

Tasks:

Extract all IP addresses
Count requests by status code

Find all 404 errors with URLs

# 1. Extract IPs
awk '{print $1}' access.log

# 2. Count by status
awk '{print $9}' access.log | sort | uniq -c

# 3. 404 errors with URLs
awk '$9 == 404 {print $7}' access.log

Exercise 2: Data Transformation

Given users.csv:

John,Doe,30,Engineer
Jane,Smith,25,Designer
Bob,Johnson,35,Manager

Tasks:

Print only first and last names
Find users over 30

Convert to JSON format

# 1. First and last names
awk -F, '{print $1, $2}' users.csv

# 2. Users over 30
awk -F, '$3 > 30 {print}' users.csv

# 3. Convert to JSON
awk -F, 'BEGIN {print "["} {printf "  {\"first\":\"%s\",\"last\":\"%s\",\"age\":%s,\"role\":\"%s\"}", $1,$2,$3,$4; if(NR<3) print ","; else print ""} END {print "]"}' users.csv

Exercise 3: Configuration Management

Given config.txt:

# Database settings
db_host=localhost
db_port=5432
# db_user=admin
db_pass=secret123

# Server settings
server_port=8080

Tasks:

Remove all comments
Extract all uncommented settings

Change db_port to 5433

# 1. Remove comments
sed 's/#.*$//' config.txt | sed '/^$/d'

# 2. Uncommented settings
grep -v "^#" config.txt | grep "="

# 3. Change port
sed 's/db_port=5432/db_port=5433/' config.txt

Best Practices

1. Use Appropriate Tool

# BAD: awk for simple search
awk '/ERROR/ {print}' log.txt

# GOOD: grep for search
grep ERROR log.txt

# BAD: sed for arithmetic
sed ... # complex expression

# GOOD: awk for arithmetic
awk '{sum+=$1} END {print sum}' numbers.txt

2. Quote Regular Expressions

# BAD
grep $pattern file.txt  # Shell may expand

# GOOD
grep "$pattern" file.txt
grep '$1 > 100' data.txt  # Single quotes prevent shell expansion

3. Test on Sample Data

Never run destructive commands on production data first!

# Test first
sed 's/old/new/g' file.txt | head

# Then apply
sed -i 's/old/new/g' file.txt

4. Use -n with sed

# BAD: prints everything twice
sed '/ERROR/p' log.txt

# GOOD: prints only matches
sed -n '/ERROR/p' log.txt

5. Escape Special Characters

Regex special chars: . * [ ] ^ $ \ + ? { } | ( )

# Search for literal dot
grep "\." file.txt

# Search for literal dollar sign
grep "\$" file.txt

Going Deeper

Related Essays

9550: Command Line - Pipes, redirection
9601: Shell Scripting - Scripting basics
9603: Shell Functions - Reusable text processing (Coming Soon!)

External Resources

"sed & awk" - Dale Dougherty (O'Reilly, definitive guide)
Regular-Expressions.info - Comprehensive regex tutorial
awk manual - man awk or GNU awk guide
"The AWK Programming Language" - Aho, Kernighan, Weinberger

Reflection Questions

Why three tools instead of one? (Unix philosophy - simple, composable)
When is awk overkill? (Simple search/replace - use grep/sed)
Can awk replace Python? (For text processing pipelines, often yes! For apps, no.)
Why learn regex? (Universal - works in grep, sed, awk, vim, Python, JavaScript, ...)
How does this relate to functional programming? (Pipelines = function composition! grep | sed | awk = compose(awk, sed, grep))

Summary

The Power Trio:

grep: Search and filter (grep "pattern" file)
sed: Search and replace (sed 's/old/new/g' file)
awk: Extract and compute (awk '{sum+=$1} END {print sum}' file)

grep essentials:

-i: Case-insensitive
-v: Invert match
-n: Line numbers
-r: Recursive
-E: Extended regex

sed essentials:

s/pattern/replacement/g: Substitute
/pattern/d: Delete lines
-n '/pattern/p': Print matches only
-i: In-place editing

awk essentials:

{print $1}: Print first field
$1 > 100 {print}: Conditional action
{sum+=$1} END {print sum}: Accumulate
-F,: Set field separator

When to use:

grep: Filter lines
sed: Transform lines
awk: Process columns

Combine in pipelines:

grep ERROR log.txt | sed 's/.*ERROR: //' | awk '{count[$0]++} END {for(e in count) print count[e], e}' | sort -rn

In the Valley:

Text processing = data flowing through ecosystem
grep, sed, awk = different plant species (complementary niches!)
Pipelines = nutrient cycles (output of one feeds input of next)
Ecological lens: "Diverse tools create resilient workflows—monoculture (one tool) is fragile, polyculture (grep+sed+awk) is robust."

Next: Essay 9603 - Shell Functions & Modularity! We'll learn to write reusable, maintainable shell code!

Navigation:
← Previous: 9601 (Shell Scripting Fundamentals) | Phase 2 Index | Next: 9603 (Shell Functions & Modularity) (Coming Soon!)

Metadata:

Phase: 2 (Core Systems & Tools)
Week: 6 (Shell Scripting)
Prerequisites: 9550, 9560, 9601
Concepts: grep (search), sed (transform), awk (process), regex, pipelines
Next Concepts: Shell functions, modularity, reusable code
Plant Lens: Diverse tools (polyculture), nutrient cycles (pipelines), resilient workflows
Hands-On: 3 exercises (log analysis, data transformation, config management)

View Hidden Docs Index | Return to Main Index