PDF compression is a complex process involving multiple algorithms working together to reduce file size while preserving quality. Understanding these algorithms helps you make informed decisions about compression settings and achieve optimal results for your specific use cases.
How PDF Compression Works
PDF files contain various types of data: text, images, fonts, metadata, and structure information. Each type benefits from different compression strategies. Modern PDF compression applies multiple algorithms selectively to different content types.
PDF Content Types and Compression (rough shares for a typical image-heavy document):
- Text & Fonts: Typically 5-10% of file size (highly compressible)
- Images: 80-90% of file size (biggest optimization target)
- Metadata: 1-2% of file size (minimal impact)
- Page Structure: 3-5% of file size (moderate compression)
Core Compression Algorithms
1. Flate/Deflate (ZIP) Compression
Type: Lossless compression based on LZ77 and Huffman coding
The most common compression method in PDFs, Flate uses the Deflate algorithm, the same one used by ZIP archives and zlib. It works by finding repeated patterns in data and replacing them with shorter references.
How It Works:
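1. Scans data for repeated sequences (LZ77 algorithm)
2. Replaces each repeat with a short back-reference (distance and length) into a sliding window of recently seen data
3. Represents the remaining literal bytes and the back-references as symbols
4. Applies Huffman coding so the most frequent symbols get the shortest codes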
Best for: Text, vector graphics, page structure
Typical compression ratio: 2:1 to 4:1
Quality: Perfect (lossless)
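A minimal sketch of the lossless round trip, using Python's standard zlib module (which implements Deflate). The sample data is made up for illustration and is far more repetitive than a real page stream, so the ratio it shows will be much higher than the typical range quoted above.

```python
import zlib

# Artificial, highly repetitive input, loosely modelled on PDF text operators.
data = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 200

compressed = zlib.compress(data, 9)     # Deflate: LZ77 matching + Huffman coding
restored = zlib.decompress(compressed)

print(len(data), "bytes ->", len(compressed), "bytes")
print("lossless round trip:", restored == data)
```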
2. JPEG Compression
Type: Lossy compression for photographic images
JPEG uses Discrete Cosine Transform (DCT) to compress photographic content. It's highly effective for photos but can introduce artifacts in text or line drawings.
Compression Process:
1. Converts RGB to YCbCr color space (separates brightness from color)
2. Downsamples color channels (human eyes are less sensitive to color detail)
3. Divides image into 8×8 pixel blocks
4. Applies DCT to each block
5. Quantizes coefficients (controlled quality loss)
6. Entropy-codes the quantized coefficients (run-length coding followed by Huffman coding)
Best for: Photographs, complex color images
Typical compression ratio: 10:1 to 50:1
Quality: Adjustable (higher compression = more quality loss)
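The DCT and quantization steps can be sketched in pure Python on a single 8×8 block. This is an illustration under simplifying assumptions, not a JPEG encoder: it uses one uniform quantization step (real JPEG uses an 8×8 table with a different step per frequency), and it skips color conversion, chroma downsampling, and entropy coding. All function names are hypothetical.

```python
import math

N = 8  # JPEG block size

def _scale(k):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def _basis(i, k):
    return math.cos((2 * i + 1) * k * math.pi / (2 * N))

def dct2(block):
    """2-D DCT-II of an 8x8 block of level-shifted samples."""
    return [[_scale(u) * _scale(v) * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                                          for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coeffs):
    """Inverse of dct2."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v] * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coeffs, step):
    # Assumption: one uniform step instead of a per-frequency table.
    return [[round(c / step) for c in row] for row in coeffs]

def dequantize(levels, step):
    return [[v * step for v in row] for row in levels]

# A smooth gradient block, already shifted so values are centred near zero.
gradient = [[8 * x + 2 * y - 35 for y in range(N)] for x in range(N)]
step = 16

levels = quantize(dct2(gradient), step)
rebuilt = idct2(dequantize(levels, step))

nonzero = sum(1 for row in levels for v in row if v != 0)
max_error = max(abs(rebuilt[x][y] - gradient[x][y])
                for x in range(N) for y in range(N))
print("non-zero coefficients:", nonzero, "of", N * N)
print("largest pixel error:", round(max_error, 2))
```

Smooth blocks concentrate their energy in a few low-frequency coefficients, so most quantized values become zero and compress well. A larger step zeroes more coefficients and increases the reconstruction error, which is the quality trade-off described above.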
3. JPEG2000 Compression
Type: Advanced wavelet-based compression (lossy or lossless)
A more sophisticated successor to JPEG, using wavelet transforms instead of DCT. Provides better quality at high compression rates and supports lossless compression.
✓ Advantages:
- Better quality than JPEG at same file size
- No 8×8 blocking artifacts
- Progressive quality rendering
- Supports lossless mode
✗ Disadvantages:
- Less universal support
- Slower encoding/decoding
- Higher CPU requirements
- Not all PDF viewers support it
Best for: High-quality archival, medical images, satellite imagery
Typical compression ratio: 15:1 to 80:1 (lossy), 2:1 to 3:1 (lossless)
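To give a feel for how a wavelet transform can be exactly reversible, here is a small sketch of one level of the integer Haar transform (sometimes called the S-transform). This is a deliberately simpler stand-in: JPEG2000 itself uses the 5/3 wavelet for lossless coding and the 9/7 wavelet for lossy coding, neither of which is implemented here. The function names are hypothetical.

```python
def haar_forward(samples):
    """One level of the reversible integer Haar transform (even-length input)."""
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    averages, details = [], []
    for a, b in zip(samples[0::2], samples[1::2]):
        averages.append((a + b) >> 1)  # floor of the pair's mean
        details.append(a - b)          # difference within the pair
    return averages, details

def haar_inverse(averages, details):
    """Exactly undo haar_forward using only integer arithmetic."""
    samples = []
    for s, d in zip(averages, details):
        a = s + ((d + 1) >> 1)
        samples.extend((a, a - d))
    return samples

row = [100, 102, 101, 99, 180, 182, 181, 179]
averages, details = haar_forward(row)
print(averages)  # [101, 100, 181, 180]
print(details)   # [-2, 2, -2, 2]
print(haar_inverse(averages, details) == row)  # True
```

The averages form a half-resolution version of the row, and the details are small numbers that compress well. Repeating the transform on the averages gives the multi-resolution structure behind progressive rendering.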
4. JBIG2 Compression
Type: Specialized compression for bilevel (black & white) images
Designed specifically for scanned documents, JBIG2 achieves remarkable compression by identifying similar character shapes and storing them only once. A page with repeated letters stores the letter "e" shape once and references it hundreds of times.
Pattern Matching Process:
1. Identifies connected components (letters, symbols)
2. Groups similar shapes together
3. Stores each unique shape once in a dictionary
4. References dictionary entries for repeated shapes
Best for: Scanned documents, text-heavy PDFs, faxes
Typical compression ratio: 20:1 to 100:1
Quality: Lossless or lossy (lossy mode can introduce character substitution errors)
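The symbol-dictionary idea can be sketched as follows. This is a toy model, not a JBIG2 encoder: glyph bitmaps are represented as tuples of strings, segmentation into connected components is assumed to have happened already, and only exact matches are merged (the lossless case). A lossy encoder would also merge shapes that are merely similar, which is where substitution errors come from. All names are hypothetical.

```python
def build_symbol_dictionary(glyphs):
    """Store each distinct bitmap once; return the dictionary and per-glyph references."""
    dictionary = []   # unique bitmaps in first-seen order
    index_of = {}     # bitmap -> position in dictionary
    references = []
    for bitmap in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        references.append(index_of[bitmap])
    return dictionary, references

# Two tiny made-up glyph bitmaps ('#' = black pixel, '.' = white pixel).
GLYPH_E = ("###", "#..", "###", "#..", "###")
GLYPH_L = ("#..", "#..", "#..", "#..", "###")

page = [GLYPH_E, GLYPH_L, GLYPH_E, GLYPH_E, GLYPH_L]
dictionary, references = build_symbol_dictionary(page)

print(len(page), "glyphs on the page,", len(dictionary), "stored shapes")
print(references)  # [0, 1, 0, 0, 1]
```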
5. LZW Compression
Type: Lossless dictionary-based compression
Lempel-Ziv-Welch (LZW) builds a dictionary of patterns as it compresses. While effective, it's less common in modern PDFs due to historical patent issues and the superior performance of Flate compression.
Best for: Legacy content, such as image data that originated in LZW-compressed GIF or TIFF files
Typical compression ratio: 2:1 to 3:1
Status: Legacy algorithm, rarely used in new PDFs
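A textbook version of the dictionary-building idea is sketched below. It returns a plain list of integer codes and omits the details of the PDF LZW filter (variable-width codes packed into bits, plus clear-table and end-of-data codes), so it illustrates the algorithm rather than producing a stream a PDF reader could decode.

```python
def lzw_compress(data):
    """Textbook LZW: returns a list of integer codes for a bytes input."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            codes.append(table[current])   # emit the longest known match
            table[candidate] = next_code   # learn the new pattern
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

def lzw_decompress(codes):
    """Rebuild the same dictionary on the fly and recover the original bytes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return bytes(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
print(len(sample), "bytes ->", len(codes), "codes")
print(lzw_decompress(codes) == sample)
```

Note that the decoder never receives the dictionary; it reconstructs it from the code sequence alone, which is why both sides must add entries in exactly the same order.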
6. CCITT Compression
Type: Lossless compression for fax-like bilevel images
Developed for fax transmission (CCITT Group 3 and Group 4), these algorithms efficiently compress black and white images by encoding runs of white or black pixels.
Best for: Black & white scanned documents, faxes
Typical compression ratio: 10:1 to 30:1
Variants: Group 3 (fax transmission), Group 4 (better compression)
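The run-encoding idea behind these schemes can be sketched as below. This only computes the run lengths for one scan line; the actual standards go further, with Group 3 assigning predefined Huffman codes to the run lengths and Group 4 coding each line relative to the one above it. Neither step is shown, and the function names are hypothetical.

```python
def run_lengths(row):
    """Alternating run lengths for a bilevel row (0 = white, 1 = black).

    By fax convention the first run is white, so a row that starts with a
    black pixel begins with a white run of length 0.
    """
    runs = []
    colour = 0
    count = 0
    for pixel in row:
        if pixel == colour:
            count += 1
        else:
            runs.append(count)
            colour = pixel
            count = 1
    runs.append(count)
    return runs

def expand_runs(runs):
    """Rebuild the pixel row from alternating white/black run lengths."""
    row = []
    colour = 0
    for length in runs:
        row.extend([colour] * length)
        colour ^= 1
    return row

line = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
runs = run_lengths(line)
print(runs)                       # [3, 2, 4, 1]
print(expand_runs(runs) == line)  # True
```

Scanned text pages consist mostly of long white runs, so a handful of run lengths can stand in for thousands of pixels.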
Advanced Optimization Techniques
Image Downsampling
Reduces image resolution to match the intended viewing or printing resolution. A 300 DPI image shown on a 96 DPI screen carries roughly ten times as many pixels as the display can show, which wastes bandwidth and storage.
- Screen Viewing: 150 DPI sufficient
- Office Printing: 200-250 DPI optimal
- Professional Print: 300 DPI required
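The arithmetic behind a downsampling decision is simple enough to sketch. The helper names below are hypothetical; the effective resolution depends on how large the image is placed on the page, not only on its pixel dimensions.

```python
def effective_dpi(pixel_width, placed_width_inches):
    """Resolution of an image at the size it is placed on the page."""
    return pixel_width / placed_width_inches

def downsampled_width(pixel_width, placed_width_inches, target_dpi):
    """Pixel width after downsampling to target_dpi; never upsamples."""
    if effective_dpi(pixel_width, placed_width_inches) <= target_dpi:
        return pixel_width
    return round(placed_width_inches * target_dpi)

# A 2400-pixel-wide image placed 8 inches wide is 300 DPI.
print(effective_dpi(2400, 8))           # 300.0
print(downsampled_width(2400, 8, 150))  # 1200
print(downsampled_width(600, 8, 150))   # 600 (already below the target)
```

Halving the resolution in both directions leaves a quarter of the pixels, which is why downsampling is usually the single largest saving for image-heavy files.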
Font Subsetting
Instead of embedding entire fonts (which can be 100KB-500KB each), embed only the characters actually used in the document.
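Example: If your document uses only "Hello World" in Arial, embed just the eight glyphs it needs (H, e, l, o, W, r, d, and the space) instead of the font's full glyph set, which can run to hundreds or thousands of glyphs.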
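The first step of subsetting, working out which characters a document actually uses, can be sketched in a line of Python. This assumes a one-to-one mapping from characters to glyphs; real subsetting tools work with glyph IDs and also have to handle ligatures, composite glyphs, and font tables, none of which is shown. The function name is hypothetical.

```python
def glyphs_needed(text):
    """Distinct characters used by the text, in code-point order."""
    return sorted(set(text))

needed = glyphs_needed("Hello World")
print(needed)       # [' ', 'H', 'W', 'd', 'e', 'l', 'o', 'r']
print(len(needed))  # 8
```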
Object Stream Compression
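PDF 1.5 and later support object streams, which group multiple PDF objects together and compress them as a unit. In earlier versions only stream data (page content, images, embedded fonts) could be compressed, so dictionaries and other small objects were stored as plain text. Grouping them also lets the compressor exploit redundancy across objects, which compressing each object on its own could not.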
Can reduce file size by an additional 10-20% beyond other compression methods.
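A rough way to see why grouping helps is to compress a set of small, similar objects one at a time and then all together with zlib. The annotation dictionaries below are made up for illustration, and joining them with newlines is only a stand-in for the real object stream layout, so the numbers indicate the trend rather than what a specific PDF would gain.

```python
import zlib

# Forty small, similar dictionaries, invented for this example.
objects = [
    f"<< /Type /Annot /Subtype /Link /Rect [72 {700 - 14 * i} 300 {712 - 14 * i}] >>".encode("ascii")
    for i in range(40)
]

raw = sum(len(obj) for obj in objects)
individually = sum(len(zlib.compress(obj, 9)) for obj in objects)
together = len(zlib.compress(b"\n".join(objects), 9))

print("uncompressed:           ", raw, "bytes")
print("compressed one by one:  ", individually, "bytes")
print("compressed as one unit: ", together, "bytes")
```

Each object is too short to contain much repetition on its own, and every separate zlib stream pays a fixed header and checksum cost. Compressed as one unit, the shared keys and structure are encoded once and referenced afterwards.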
Duplicate Image Elimination
If the same logo appears on every page of a 50-page document, store it once and reference it 50 times rather than storing 50 copies.
Impact: Especially effective for documents with repeated branding, headers, or footers.
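One common way to detect duplicates is to hash each image's bytes and keep a single copy per hash, as sketched below. This only catches byte-identical copies; an optimizer that compares decoded pixels could also merge images that look the same but were encoded differently. The function name and sample data are hypothetical.

```python
import hashlib

def deduplicate(images):
    """Keep one copy of each distinct byte string; return the store and per-use references."""
    store = {}        # digest -> image bytes, stored once
    references = []   # one digest per original occurrence
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        references.append(digest)
    return store, references

logo = b"fake logo bytes " * 500    # placeholder for real image data
photo = b"fake photo bytes " * 800

images = [logo] * 50 + [photo]
store, references = deduplicate(images)

before = sum(len(data) for data in images)
after = sum(len(data) for data in store.values())
print(len(images), "uses,", len(store), "stored images")
print(before, "bytes ->", after, "bytes")
```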
Metadata Removal
Strip out unnecessary metadata: creation software info, edit history, thumbnails, and bookmarks you don't need.
Savings: Typically 1-5% of file size, but can be significant for documents with extensive metadata.
Compression Strategies by Use Case
📧 Email Attachments (Target: <5MB)
- ✓ Aggressive JPEG compression (quality 60-70)
- ✓ Downsample images to 150 DPI
- ✓ Remove metadata and bookmarks
- ✓ Use JBIG2 for scanned pages
- Expected reduction: 50-80%
🖨️ Professional Printing
- ✓ Minimal or no JPEG compression (quality 90+)
- ✓ Keep images at 300 DPI
- ✓ Use lossless Flate for text and vectors
- ✓ Preserve color profiles
- Expected reduction: 10-30%
🌐 Web Publishing
- ✓ Medium JPEG compression (quality 75-85)
- ✓ Downsample to 120-150 DPI
- ✓ Enable "Fast Web View" (linearization)
- ✓ Object stream compression
- Expected reduction: 40-60%
📚 Archival Storage
- ✓ Lossless compression only (Flate, JBIG2 lossless)
- ✓ JPEG2000 lossless for color images
- ✓ Keep original resolution
- ✓ PDF/A compliance
- Expected reduction: 20-40%
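The four strategies above can be captured as presets, sketched here as a small data structure. The class and field names are hypothetical and not tied to any particular PDF tool. The quality and DPI values are picked from within the ranges listed above, while the strip_metadata and linearize flags for cases the lists do not mention are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CompressionPreset:
    jpeg_quality: Optional[int]  # None means lossless image compression only
    image_dpi: Optional[int]     # None means keep the original resolution
    strip_metadata: bool         # assumption where the lists above are silent
    linearize: bool              # "Fast Web View"

PRESETS = {
    "email": CompressionPreset(jpeg_quality=65, image_dpi=150,
                               strip_metadata=True, linearize=False),
    "print": CompressionPreset(jpeg_quality=90, image_dpi=300,
                               strip_metadata=False, linearize=False),
    "web": CompressionPreset(jpeg_quality=80, image_dpi=150,
                             strip_metadata=False, linearize=True),
    "archive": CompressionPreset(jpeg_quality=None, image_dpi=None,
                                 strip_metadata=False, linearize=False),
}

preset = PRESETS["email"]
print(preset.jpeg_quality, preset.image_dpi)  # 65 150
```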
Measuring Compression Effectiveness
Key Metrics:
- Compression Ratio: Original size ÷ Compressed size. A 10MB file compressed to 2MB has a 5:1 ratio.
- Space Savings: (Original - Compressed) ÷ Original × 100%. Example: 80% savings means 5× smaller.
- Quality Loss: Measured by PSNR (Peak Signal-to-Noise Ratio) or SSIM (Structural Similarity Index).
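The first two metrics are one-line formulas, and PSNR is straightforward for 8-bit samples. The sketch below uses hypothetical function names and treats an image as a flat list of sample values; SSIM is more involved and is not shown.

```python
import math

def compression_ratio(original_size, compressed_size):
    return original_size / compressed_size

def space_savings(original_size, compressed_size):
    """Percentage of the original size that was removed."""
    return (original_size - compressed_size) * 100 / original_size

def psnr(reference, degraded, peak=255):
    """Peak signal-to-noise ratio in dB for two equal-length sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10 * math.log10(peak ** 2 / mse)

print(compression_ratio(10, 2))  # 5.0
print(space_savings(10, 2))      # 80.0
print(round(psnr([100, 100], [110, 90]), 2))
```

Higher PSNR means the compressed image is closer to the original; identical images give an infinite value because the error is zero.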
Common Compression Mistakes
❌ Compressing Already-Compressed Images
JPEG images are already compressed. Re-compressing them at a similar quality adds artifacts without much size reduction. Where possible, start from the original uncompressed source, compress it once at the target quality, and re-embed the result.
❌ Using Lossy Compression on Text
JPEG compression on rasterized text or line art blurs sharp edges and adds ringing artifacts, which can make small characters illegible. Always use lossless compression (Flate) for text and line art.
❌ Over-Aggressive Downsampling
Reducing a 300 DPI image to 50 DPI creates pixelated, unprintable results. Match DPI to intended use.
❌ Ignoring Color Space
CMYK images for print should stay CMYK. Converting to RGB and back causes color shifts.
Conclusion
Effective PDF compression requires understanding the algorithms available and matching them to your content and use case. By combining the right algorithms—Flate for text, JPEG for photos, JBIG2 for scans—and applying appropriate optimization techniques like downsampling and font subsetting, you can dramatically reduce file sizes while maintaining the quality your documents require. The key is balancing compression ratio against quality needs based on whether your PDF is destined for screen viewing, printing, archival, or email distribution.