Suffix Array

A suffix array is a sorted array of all suffixes of a given string. It provides space-efficient string matching and analysis capabilities, using significantly less memory than suffix trees while maintaining good performance for most operations. Suffix arrays are particularly useful in bioinformatics, text compression, and string algorithms where memory efficiency is crucial.
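
For intuition, here is a minimal, naive construction sketch (not this library's implementation, which uses faster algorithms): sort the suffix starting positions by the suffixes they denote.

# Naive construction: O(n^2 log n) because each comparison may inspect
# O(n) characters; production implementations achieve O(n log n) or O(n).
def naive_suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

print(naive_suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
# Sorted suffixes: 'a', 'ana', 'anana', 'banana', 'na', 'nana'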

Features

  • Space-efficient representation - linear space O(n) with smaller constants than suffix trees
  • Binary search pattern matching - find patterns in O(m log n) time where m is pattern length
  • Multiple occurrence detection - efficiently find all positions where a pattern occurs
  • LCP array integration - Longest Common Prefix array for advanced string analysis
  • Repeated substring detection - identify all repeated patterns using LCP information
  • Rank operations - quick access to suffix ordering information
  • Longest common substring - find shared substrings between texts
  • Flexible queries - support various string analysis operations with good performance

Time Complexities

Operation            Time Complexity                     Description
Construction         O(n log n) typical, O(n) advanced   Build array for text of length n
Pattern Search       O(m log n)                          Search for pattern of length m
Find Occurrences     O(m log n + k)                      m = pattern length, k = number of occurrences
LCP Array Build      O(n)                                Build Longest Common Prefix array
Get Suffixes         O(n²)                               List all suffixes (output totals O(n²) characters)
Repeated Substrings  O(n) scan plus output size          Find all repeated patterns using the LCP array
Rank Query           O(1)                                Get rank of suffix at position
LCP Query            O(1) with preprocessing             LCP between any two suffixes
Text Length          O(1)                                Get length of original text

Space Complexity: O(n) where n is the length of the input text
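
The O(m log n) search and O(m log n + k) occurrence bounds come from binary searching the sorted suffixes, comparing at most m characters per probe. A minimal sketch over a hypothetical precomputed suffix array sa (illustrative only, not this library's internals):

def find_occurrences_naive(text, sa, pattern):
    """Binary search the sorted suffixes for all starts of pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]  # every suffix in this range starts with pattern

Each of the O(log n) probes slices at most m characters, and copying out the k matches accounts for the + k term.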

Basic Usage

Creating and Building

from algo.data_structs.strings.suffix_array import SuffixArray

# Create empty suffix array
array = SuffixArray()
print(array.is_empty())           # True
print(array.size())               # 0
print(len(array))                 # 0

# Build array with text
array.build("banana")
print(array.is_empty())           # False
print(array.size())               # 6 (number of suffixes)
print(len(array))                 # 6 (length of original text)
print(array.get_text())           # "banana"

# String representations
print(str(array))                 # SuffixArray(text='banana', size=6)
print(repr(array))                # SuffixArray(text_length=6, suffixes=6)

Understanding Suffix Arrays

# Build array and examine its structure
array = SuffixArray()
array.build("abc")

# Get all suffixes in sorted order
suffixes = array.get_suffixes()
print(f"Sorted suffixes: {suffixes}")         # ['abc', 'bc', 'c']

# Get suffix array (starting positions of sorted suffixes)
suffix_array = array.get_suffix_array()
print(f"Suffix array: {suffix_array}")        # [0, 1, 2]

# Verify relationship between suffix array and suffixes
original_text = array.get_text()
for i, start_pos in enumerate(suffix_array):
    suffix_from_position = original_text[start_pos:]
    suffix_from_list = suffixes[i]
    print(f"Position {start_pos}: '{suffix_from_position}' == '{suffix_from_list}'")
    assert suffix_from_position == suffix_from_list

Longest Common Prefix (LCP) Array

array = SuffixArray()
array.build("banana")

# Get LCP array
lcp_array = array.get_lcp_array()
print(f"LCP array: {lcp_array}")              # [1, 3, 0, 0, 2]

# Understand LCP values
suffixes = array.get_suffixes()
print(f"Suffixes: {suffixes}")
# ['a', 'ana', 'anana', 'banana', 'na', 'nana']

# LCP[i] = longest common prefix between suffixes[i] and suffixes[i+1]
for i, lcp_len in enumerate(lcp_array):
    suffix1 = suffixes[i]
    suffix2 = suffixes[i + 1]
    common_prefix = suffix1[:lcp_len] if lcp_len > 0 else ""
    print(f"LCP('{suffix1}', '{suffix2}') = {lcp_len} ('{common_prefix}')")

Text Construction and Manipulation

Single Text Building

array = SuffixArray()

# Build with simple text
array.build("hello")
print(f"Text: '{array.get_text()}'")          # 'hello'
print(f"Suffixes: {array.get_suffixes()}")    # ['ello', 'hello', 'llo', 'lo', 'o']

# Build with repeated patterns
array.build("abcabc")
print(f"Text: '{array.get_text()}'")          # 'abcabc'
print(f"Size: {array.size()}")                # 6

# Build with single character
array.build("x")
print(f"Suffixes: {array.get_suffixes()}")    # ['x']
print(f"Suffix array: {array.get_suffix_array()}")  # [0]
print(f"LCP array: {array.get_lcp_array()}")  # [] (empty for single suffix)

Handling Different Text Types

# Empty text
empty_array = SuffixArray()
empty_array.build("")
print(f"Empty: {empty_array.is_empty()}")     # True
print(f"Suffixes: {empty_array.get_suffixes()}")  # []

# Repeated characters
repeat_array = SuffixArray()
repeat_array.build("aaaa")
print(f"Suffixes: {repeat_array.get_suffixes()}")
# ['a', 'aa', 'aaa', 'aaaa'] (sorted lexicographically)

lcp = repeat_array.get_lcp_array()
print(f"LCP array: {lcp}")                    # [1, 2, 3] (increasing overlap)

# Complex text with patterns
complex_array = SuffixArray()
complex_array.build("abracadabra")
print(f"Text: '{complex_array.get_text()}'")  # 'abracadabra'
print(f"Number of suffixes: {complex_array.size()}")  # 11

Rebuilding Arrays

array = SuffixArray()

# Build with first text
array.build("abc")
print(f"First text: '{array.get_text()}'")    # 'abc'

# Rebuild with different text (replaces previous)
array.build("xyz")
print(f"New text: '{array.get_text()}'")      # 'xyz'
print(f"Previous structure is replaced")

# Build with longer text
array.build("programming")
print(f"Longer text: '{array.get_text()}'")   # 'programming'
print(f"Array size: {array.size()}")          # 11

Pattern Matching

array = SuffixArray()
array.build("banana")

# Search for various patterns
print(f"Search 'an': {array.search('an')}")       # True
print(f"Search 'ana': {array.search('ana')}")     # True
print(f"Search 'ban': {array.search('ban')}")     # True
print(f"Search 'xyz': {array.search('xyz')}")     # False

# Using 'in' operator
print(f"'ana' in array: {'ana' in array}")        # True
print(f"'cab' in array: {'cab' in array}")        # False

# Edge cases
print(f"Search '': {array.search('')}")           # False
print(f"Search 'banana': {array.search('banana')}")    # True (full text)
print(f"Search 'bananas': {array.search('bananas')}")  # False (longer than text)

Finding All Occurrences

array = SuffixArray()
array.build("abracadabra")

# Find all occurrences of patterns
a_positions = array.find_occurrences("a")
print(f"'a' occurs at positions: {sorted(a_positions)}")
# [0, 3, 5, 7, 10] (all positions where 'a' appears)

abr_positions = array.find_occurrences("abr")
print(f"'abr' occurs at positions: {sorted(abr_positions)}")
# [0, 7] ('abr' starts each of the two 'abra' occurrences)

br_positions = array.find_occurrences("br")
print(f"'br' occurs at positions: {sorted(br_positions)}")
# [1, 8] ('br' in 'abr' patterns)

# Non-existent pattern
xyz_positions = array.find_occurrences("xyz")
print(f"'xyz' occurs at positions: {xyz_positions}")
# [] (empty list)

# Overlapping patterns
array.build("aaaa")
aa_positions = array.find_occurrences("aa")
print(f"'aa' in 'aaaa' at: {sorted(aa_positions)}")
# [0, 1, 2] (overlapping occurrences)

Pattern Matching Performance

# Demonstrate binary search efficiency
array = SuffixArray()
text = "abcdef" * 100  # 600 characters
array.build(text)

import time

# Time pattern search
start = time.time()
result = array.search("cdef")
search_time = time.time() - start

print(f"Pattern found: {result}")
print(f"Search time: {search_time:.6f} seconds")

# Time finding all occurrences
start = time.time()
occurrences = array.find_occurrences("abc")
find_time = time.time() - start

print(f"Found {len(occurrences)} occurrences")
print(f"Find time: {find_time:.6f} seconds")

# Verify occurrences
print(f"First few positions: {sorted(occurrences)[:5]}")

Advanced String Analysis

Rank Operations

array = SuffixArray()
array.build("banana")

# Get rank of each suffix (position in sorted order)
text = array.get_text()
for i in range(len(text)):
    suffix = text[i:]
    rank = array.get_rank(i)
    print(f"Suffix at position {i} ('{suffix}') has rank {rank}")

# Verify rank consistency with suffix array
suffix_array = array.get_suffix_array()
for rank, position in enumerate(suffix_array):
    assert array.get_rank(position) == rank
    print(f"✓ Rank {rank} -> Position {position} -> Rank {array.get_rank(position)}")

print("All rank operations consistent!")

LCP Between Arbitrary Positions

array = SuffixArray()
array.build("banana")

# Compute LCP between suffixes at different positions
test_pairs = [(0, 1), (1, 3), (2, 4), (0, 3), (1, 5)]

for i, j in test_pairs:
    suffix_i = array.get_text()[i:]
    suffix_j = array.get_text()[j:]
    lcp_length = array.lcp(i, j)

    # Extract actual common prefix
    common_prefix = ""
    for k in range(min(len(suffix_i), len(suffix_j))):
        if suffix_i[k] == suffix_j[k]:
            common_prefix += suffix_i[k]
        else:
            break

    print(f"LCP(pos {i}, pos {j}): '{suffix_i}' vs '{suffix_j}' = {lcp_length} ('{common_prefix}')")

# Special cases
print(f"LCP(same position): {array.lcp(0, 0)}")  # Full length
print(f"LCP(out of bounds): {array.lcp(-1, 100)}")  # 0

Finding Repeated Substrings

array = SuffixArray()
array.build("abracadabra")

# Find repeated substrings of different minimum lengths
repeated_2 = array.get_repeated_substrings(min_length=2)
print(f"Repeated substrings (≥2 chars): {sorted(repeated_2)}")
# ['ab', 'abr', 'abra', 'br', 'bra', 'ra']

repeated_3 = array.get_repeated_substrings(min_length=3)
print(f"Repeated substrings (≥3 chars): {sorted(repeated_3)}")
# ['abr', 'abra', 'bra']

# Example with highly repetitive text
array.build("ababab")
repeated_ab = array.get_repeated_substrings(min_length=2)
print(f"Repeated in 'ababab': {sorted(repeated_ab)}")
# ['ab', 'aba', 'abab', 'ba', 'bab'] ('baba' occurs only once, so it is not repeated)

# Show how LCP array drives the detection
lcp_array = array.get_lcp_array()
print(f"LCP array for 'ababab': {lcp_array}")
print("LCP values ≥ min_length indicate repeated substrings")

Longest Common Substring

array = SuffixArray()

# Build array with first text
array.build("abcdef")

# Find longest common substring with various texts
lcs1 = array.longest_common_substring("defghi")
print(f"LCS of 'abcdef' and 'defghi': '{lcs1}'")  # 'def'

lcs2 = array.longest_common_substring("xyzabc")
print(f"LCS of 'abcdef' and 'xyzabc': '{lcs2}'")  # 'abc'

lcs3 = array.longest_common_substring("123456")
print(f"LCS of 'abcdef' and '123456': '{lcs3}'")  # '' (no common)

# More complex example
array.build("banana")
lcs4 = array.longest_common_substring("ananas")
print(f"LCS of 'banana' and 'ananas': '{lcs4}'")  # Should find good overlap

# Identical texts
lcs5 = array.longest_common_substring("banana")
print(f"LCS of 'banana' and 'banana': '{lcs5}'")  # 'banana'

Working with Different Text Types

DNA Sequence Analysis

def analyze_dna_with_suffix_array():
    """Analyze DNA sequence using suffix array."""
    array = SuffixArray()

    # Build array with DNA sequence
    dna_sequence = "ATCGATCGATCGAAATTTCGATCGATCG"
    array.build(dna_sequence)

    print(f"DNA Sequence: {dna_sequence}")
    print(f"Sequence length: {len(array)}")

    # Find repeated sequences (important in genomics)
    repeated = array.get_repeated_substrings(min_length=3)
    print(f"Repeated sequences (≥3 bp): {sorted(repeated)}")

    # Look for specific patterns (start/stop codons)
    codons = ["ATG", "TAA", "TGA", "TAG"]
    for codon in codons:
        occurrences = array.find_occurrences(codon)
        if occurrences:
            print(f"Found {codon} at positions: {sorted(occurrences)}")

    # Analyze base composition
    bases = "ATCG"
    for base in bases:
        count = len(array.find_occurrences(base))
        percentage = (count / len(dna_sequence)) * 100
        print(f"Base {base}: {count} occurrences ({percentage:.1f}%)")

    # Find palindromic sequences
    def reverse_complement(seq):
        complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
        return ''.join(complement.get(base, base) for base in reversed(seq))

    print(f"\nSearching for palindromic sequences:")
    for length in range(4, 8):  # 4-7 bp palindromes
        text = array.get_text()
        for i in range(len(text) - length + 1):
            subseq = text[i:i + length]
            rev_comp = reverse_complement(subseq)
            if array.search(rev_comp) and subseq != rev_comp:
                print(f"Palindrome: {subseq} <-> {rev_comp}")

# Run DNA analysis
analyze_dna_with_suffix_array()

Text Processing Applications

def text_frequency_analysis():
    """Demonstrate text analysis using suffix arrays."""

    text = "the quick brown fox jumps over the lazy dog the fox is quick"
    clean_text = text.replace(" ", "").lower()

    array = SuffixArray()
    array.build(clean_text)

    print(f"Analyzing text: '{text}'")
    print(f"Clean text: '{clean_text}'")
    print(f"Text length: {len(array)} characters")

    # Find most common letters
    letter_counts = {}
    for letter in "abcdefghijklmnopqrstuvwxyz":
        occurrences = array.find_occurrences(letter)
        if occurrences:
            letter_counts[letter] = len(occurrences)

    print(f"\nLetter frequencies (top 5):")
    for letter, count in sorted(letter_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        percentage = (count / len(clean_text)) * 100
        print(f"  '{letter}': {count} times ({percentage:.1f}%)")

    # Find repeated patterns
    repeated = array.get_repeated_substrings(min_length=3)
    print(f"\nRepeated patterns (≥3 chars): {sorted(repeated)[:10]}")

    # Analyze specific patterns
    patterns = ["the", "qu", "ck", "ox", "og"]
    print(f"\nPattern analysis:")
    for pattern in patterns:
        # Patterns contain no spaces, so they can be searched
        # directly against the space-stripped text
        occurrences = array.find_occurrences(pattern)
        positions = sorted(occurrences)
        print(f"  '{pattern}': {len(occurrences)} occurrences at {positions}")

# Run text analysis
text_frequency_analysis()

Protein Sequence Analysis

def protein_analysis_with_suffix_array():
    """Analyze protein sequences using suffix arrays."""

    # Example protein sequence (amino acid single-letter codes)
    protein = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

    array = SuffixArray()
    array.build(protein)

    print(f"Protein sequence: {protein}")
    print(f"Length: {len(array)} amino acids")

    # Find repeated domains (important for protein function)
    repeated_domains = array.get_repeated_substrings(min_length=4)
    print(f"Repeated domains (≥4 AA): {sorted(repeated_domains)}")

    # Look for common motifs
    motifs = ["RGD", "PEST", "KDEL", "RR", "KK"]  # Common protein motifs
    for motif in motifs:
        occurrences = array.find_occurrences(motif)
        if occurrences:
            print(f"Found motif {motif} at positions: {sorted(occurrences)}")

    # Analyze amino acid composition
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    composition = {}
    for aa in amino_acids:
        occurrences = array.find_occurrences(aa)
        if occurrences:
            composition[aa] = len(occurrences)

    print(f"\nAmino acid composition (top 5):")
    for aa, count in sorted(composition.items(), key=lambda x: x[1], reverse=True)[:5]:
        percentage = (count / len(protein)) * 100
        print(f"  {aa}: {count} ({percentage:.1f}%)")

    # Find hydrophobic regions (simplified)
    hydrophobic = "AILMFPWYV"
    hydrophobic_patterns = []
    for length in range(4, 8):  # 4-7 AA hydrophobic stretches
        text = array.get_text()
        for i in range(len(text) - length + 1):
            segment = text[i:i + length]
            if all(aa in hydrophobic for aa in segment):
                if array.search(segment):  # Verify it exists
                    hydrophobic_patterns.append((segment, i))

    if hydrophobic_patterns:
        print(f"\nHydrophobic regions found:")
        for pattern, pos in hydrophobic_patterns[:5]:  # Show first 5
            print(f"  '{pattern}' at position {pos}")

# Run protein analysis
protein_analysis_with_suffix_array()

Error Handling and Edge Cases

Empty and Invalid Inputs

# Test with empty inputs
empty_array = SuffixArray()

# Operations on empty array
print(f"Empty array search: {empty_array.search('any')}")            # False
print(f"Empty array occurrences: {empty_array.find_occurrences('a')}")  # []
print(f"Empty array suffixes: {empty_array.get_suffixes()}")         # []
print(f"Empty array LCP: {empty_array.get_lcp_array()}")             # []
print(f"Empty array size: {empty_array.size()}")                     # 0
print(f"Empty array is empty: {empty_array.is_empty()}")             # True

# Build with empty string
empty_array.build("")
print(f"After empty build - is empty: {empty_array.is_empty()}")     # True
print(f"After empty build - text: '{empty_array.get_text()}'")       # ''

# Test edge cases
edge_array = SuffixArray()
edge_array.build("a")

print(f"Single char - search 'a': {edge_array.search('a')}")         # True
print(f"Single char - search 'b': {edge_array.search('b')}")         # False
print(f"Single char - suffixes: {edge_array.get_suffixes()}")        # ['a']
print(f"Single char - LCP array: {edge_array.get_lcp_array()}")      # []

# Test rank operations with error handling
try:
    edge_array.get_rank(-1)
    print("ERROR: Should have raised IndexError")
except IndexError:
    print("✓ Correctly handled negative index")

try:
    edge_array.get_rank(1)  # Out of bounds
    print("ERROR: Should have raised IndexError")
except IndexError:
    print("✓ Correctly handled out-of-bounds index")

Special Characters and Unicode

# Test with special characters
special_array = SuffixArray()
special_text = "hello, world! 123 @#$%"
special_array.build(special_text)

print(f"Special text: '{special_array.get_text()}'")
print(f"Contains 'hello': {special_array.search('hello')}")          # True
print(f"Contains '!': {special_array.search('!')}")                 # True
print(f"Contains '123': {special_array.search('123')}")             # True
print(f"Contains '@#$': {special_array.search('@#$')}")             # True

# Find positions of special characters
special_chars = "!@#$%"
for char in special_chars:
    positions = special_array.find_occurrences(char)
    if positions:
        print(f"'{char}' found at positions: {positions}")

# Test with Unicode characters
unicode_array = SuffixArray()
unicode_text = "café naïve résumé"
unicode_array.build(unicode_text)

print(f"\nUnicode text: '{unicode_array.get_text()}'")
print(f"Contains 'café': {unicode_array.search('café')}")           # True
print(f"Contains 'é': {unicode_array.search('é')}")                # True
print(f"Contains 'ï': {unicode_array.search('ï')}")                # True

# Get Unicode suffixes
unicode_suffixes = unicode_array.get_suffixes()
print(f"Unicode suffixes: {unicode_suffixes}")

# Test repeated Unicode patterns
unicode_array.build("αβγαβγ")
repeated_unicode = unicode_array.get_repeated_substrings(min_length=2)
print(f"Repeated Unicode patterns: {repeated_unicode}")

Large Text Handling

def test_suffix_array_performance():
    """Test suffix array with larger texts."""
    import time

    # Generate test text with patterns
    base_pattern = "abcde"
    large_text = base_pattern * 200  # 1000 characters

    print(f"Testing with text of length: {len(large_text)}")

    # Build array
    start_time = time.time()
    array = SuffixArray()
    array.build(large_text)
    build_time = time.time() - start_time

    print(f"Array built in {build_time:.4f} seconds")
    print(f"Array has {array.size()} suffixes")

    # Test pattern search
    start_time = time.time()
    result = array.search("abcde")
    search_time = time.time() - start_time

    print(f"Pattern search took {search_time:.6f} seconds")
    print(f"Pattern found: {result}")

    # Test finding occurrences
    start_time = time.time()
    occurrences = array.find_occurrences("abc")
    occurrence_time = time.time() - start_time

    print(f"Finding {len(occurrences)} occurrences took {occurrence_time:.6f} seconds")

    # Test LCP array computation
    start_time = time.time()
    lcp_array = array.get_lcp_array()
    lcp_time = time.time() - start_time

    print(f"LCP array computation took {lcp_time:.6f} seconds")
    print(f"LCP array has {len(lcp_array)} entries")

    # Test repeated substring detection
    start_time = time.time()
    repeated = array.get_repeated_substrings(min_length=3)
    repeated_time = time.time() - start_time

    print(f"Finding repeated substrings took {repeated_time:.6f} seconds")
    print(f"Found {len(repeated)} repeated patterns")

# Uncomment to run performance test
# test_suffix_array_performance()

Performance Considerations and Best Practices

Memory Usage Optimization

def analyze_suffix_array_memory_usage():
    """Analyze memory usage patterns of suffix arrays."""

    # Compare arrays with different characteristics

    # Array with no repetition
    unique_array = SuffixArray()
    unique_array.build("abcdefghijk")

    # Array with high repetition
    repeat_array = SuffixArray()
    repeat_array.build("ababababab")

    print("Memory usage comparison:")
    print(f"Unique text array: {unique_array.size()} suffixes for {len(unique_array)} chars")
    print(f"Repetitive text array: {repeat_array.size()} suffixes for {len(repeat_array)} chars")

    # Show LCP array benefits for repetitive text
    unique_lcp = unique_array.get_lcp_array()
    repeat_lcp = repeat_array.get_lcp_array()

    avg_lcp_unique = sum(unique_lcp) / len(unique_lcp) if unique_lcp else 0
    avg_lcp_repeat = sum(repeat_lcp) / len(repeat_lcp) if repeat_lcp else 0

    print(f"Average LCP lengths:")
    print(f"  Unique text: {avg_lcp_unique:.2f}")
    print(f"  Repetitive text: {avg_lcp_repeat:.2f}")
    print(f"  Higher LCP indicates more structure/repetition")

# analyze_suffix_array_memory_usage()

Usage Guidelines

def suffix_array_best_practices():
    """Demonstrate best practices for suffix array usage."""

    print("Suffix Array Best Practices:")
    print("="*50)

    # 1. When to use suffix arrays vs other structures
    print("1. Use suffix arrays when you need:")
    print("   - Memory-efficient string matching")
    print("   - Multiple pattern searches in same text")
    print("   - Good balance of time/space complexity")
    print("   - LCP-based string analysis")

    # 2. Comparison with alternatives
    print("\n2. Suffix arrays vs alternatives:")
    print("   - vs Suffix trees: Less memory, slightly slower queries")
    print("   - vs Hash tables: Better for range queries, substring analysis")
    print("   - vs Naive search: Much faster for multiple queries")

    # 3. Optimal use cases
    array = SuffixArray()
    array.build("example")

    print(f"\n3. Performance characteristics:")
    print(f"   Text length: {len(array)}")
    print(f"   Array size: {array.size()}")
    print(f"   Memory ratio: {array.size()}/{len(array)} = {array.size()/len(array):.2f}")

    # 4. Building strategy
    print(f"\n4. Building strategy:")
    print(f"   - Build once for multiple queries: Excellent")
    print(f"   - Frequent text changes: Consider alternatives")
    print(f"   - Large texts: Monitor memory usage")

    # 5. Query optimization
    print(f"\n5. Query optimization:")
    print(f"   - Use binary search advantage: O(m log n) vs O(mn) naive")
    print(f"   - Batch similar queries together")
    print(f"   - Precompute LCP array for repeated substring analysis")

    return {
        'text_length': len(array),
        'array_size': array.size(),
        'memory_efficiency': array.size() / len(array)
    }

# Example usage
practices_results = suffix_array_best_practices()
print(f"\nEfficiency metrics: {practices_results}")

Practical Applications

Document Search Engine

class DocumentSearchEngine:
    """Simple search engine using suffix arrays."""

    def __init__(self):
        self.documents = {}
        self.arrays = {}

    def add_document(self, doc_id: str, text: str):
        """Add a document to the search index."""
        # Clean text (remove extra spaces, lowercase)
        clean_text = ' '.join(text.lower().split())
        self.documents[doc_id] = clean_text

        # Build suffix array
        array = SuffixArray()
        array.build(clean_text)
        self.arrays[doc_id] = array

    def search(self, query: str, max_results: int = 10) -> list:
        """Search for query across all documents."""
        clean_query = query.lower().strip()
        results = []

        for doc_id, array in self.arrays.items():
            if array.search(clean_query):
                occurrences = array.find_occurrences(clean_query)
                score = len(occurrences)  # Simple scoring based on frequency

                # Extract snippets around matches
                snippets = []
                text = self.documents[doc_id]
                for pos in sorted(occurrences)[:3]:  # Max 3 snippets per document, earliest first
                    start = max(0, pos - 20)
                    end = min(len(text), pos + len(clean_query) + 20)
                    snippet = text[start:end]
                    if start > 0:
                        snippet = "..." + snippet
                    if end < len(text):
                        snippet = snippet + "..."
                    snippets.append(snippet)

                results.append({
                    'document_id': doc_id,
                    'score': score,
                    'occurrences': len(occurrences),
                    'snippets': snippets
                })

        # Sort by score (frequency) and limit results
        results.sort(key=lambda x: x['score'], reverse=True)
        return results[:max_results]

    def get_document_stats(self, doc_id: str) -> dict:
        """Get statistics for a document."""
        if doc_id not in self.arrays:
            return {"error": "Document not found"}

        array = self.arrays[doc_id]
        text = self.documents[doc_id]

        # Find repeated phrases
        repeated = array.get_repeated_substrings(min_length=4)

        # Character frequency
        char_freq = {}
        for char in text:
            if char.isalnum():
                char_freq[char] = char_freq.get(char, 0) + 1

        return {
            'document_id': doc_id,
            'text_length': len(array),
            'unique_suffixes': array.size(),
            'repeated_phrases': len(repeated),
            'top_repeated': sorted(repeated, key=len, reverse=True)[:5],
            'char_frequency': dict(sorted(char_freq.items(), key=lambda x: x[1], reverse=True)[:10])
        }

# Example usage
search_engine = DocumentSearchEngine()

# Add documents
doc1 = "The quick brown fox jumps over the lazy dog. The dog was very lazy."
doc2 = "A quick brown fox leaps gracefully. The fox is very quick and agile."
doc3 = "The lazy cat sleeps all day. Cats are generally lazy animals."

search_engine.add_document("doc1", doc1)
search_engine.add_document("doc2", doc2)  
search_engine.add_document("doc3", doc3)

# Search for queries
queries = ["quick", "lazy", "fox", "dog", "animal"]

for query in queries:
    print(f"\nSearch results for '{query}':")
    results = search_engine.search(query, max_results=3)

    if not results:
        print("  No results found")
    else:
        for i, result in enumerate(results, 1):
            print(f"  {i}. Document {result['document_id']} (score: {result['score']})")
            print(f"     Occurrences: {result['occurrences']}")
            for snippet in result['snippets']:
                print(f"     \"{snippet}\"")

# Get document statistics
print(f"\nDocument statistics:")
for doc_id in ["doc1", "doc2", "doc3"]:
    stats = search_engine.get_document_stats(doc_id)
    print(f"{doc_id}: {stats['text_length']} chars, {stats['repeated_phrases']} repeated phrases")
    print(f"  Top repeated: {stats['top_repeated']}")

Bioinformatics Sequence Comparison

class SequenceComparator:
    """Compare biological sequences using suffix arrays."""

    def __init__(self):
        self.sequences = {}
        self.arrays = {}

    def add_sequence(self, seq_id: str, sequence: str):
        """Add a biological sequence."""
        # Convert to uppercase for consistency
        clean_seq = sequence.upper().strip()
        self.sequences[seq_id] = clean_seq

        # Build suffix array
        array = SuffixArray()
        array.build(clean_seq)
        self.arrays[seq_id] = array

    def find_common_motifs(self, seq1_id: str, seq2_id: str, min_length: int = 3) -> list:
        """Find common motifs between two sequences."""
        if seq1_id not in self.arrays or seq2_id not in self.arrays:
            return []

        array1 = self.arrays[seq1_id]
        seq2 = self.sequences[seq2_id]

        common_motifs = []

        # Find all common substrings
        for i in range(len(seq2)):
            for j in range(i + min_length, len(seq2) + 1):
                motif = seq2[i:j]
                if array1.search(motif):
                    positions1 = array1.find_occurrences(motif)
                    positions2 = [i]  # Position in seq2

                    common_motifs.append({
                        "motif": motif,
                        "length": len(motif),
                        "positions_seq1": positions1,
                        "positions_seq2": positions2,
                        "frequency_seq1": len(positions1)
                    })

        # Remove duplicates and sort by length and frequency
        unique_motifs = {}
        for motif_info in common_motifs:
            motif = motif_info["motif"]
            if motif not in unique_motifs:
                unique_motifs[motif] = motif_info
            elif len(motif_info["positions_seq1"]) > len(unique_motifs[motif]["positions_seq1"]):
                unique_motifs[motif] = motif_info

        return sorted(unique_motifs.values(), 
                     key=lambda x: (x["length"], x["frequency_seq1"]), 
                     reverse=True)

    def analyze_sequence_complexity(self, seq_id: str) -> dict:
        """Analyze complexity and repetitiveness of a sequence."""
        if seq_id not in self.arrays:
            return {"error": "Sequence not found"}

        array = self.arrays[seq_id]
        sequence = self.sequences[seq_id]

        # Find repeated elements
        repeated = array.get_repeated_substrings(min_length=3)

        # Calculate complexity metrics
        total_length = len(sequence)
        unique_chars = len(set(sequence))

        # Analyze LCP array for repetitiveness
        lcp_array = array.get_lcp_array()
        avg_lcp = sum(lcp_array) / len(lcp_array) if lcp_array else 0
        max_lcp = max(lcp_array) if lcp_array else 0

        # Calculate repeat coverage
        repeat_coverage = 0
        for repeat in repeated:
            occurrences = array.find_occurrences(repeat)
            repeat_coverage += len(repeat) * len(occurrences)

        repeat_density = repeat_coverage / total_length if total_length > 0 else 0

        return {
            'sequence_id': seq_id,
            'length': total_length,
            'unique_characters': unique_chars,
            'total_repeats': len(repeated),
            'average_lcp': avg_lcp,
            'max_lcp': max_lcp,
            'repeat_density': repeat_density,
            'complexity_score': unique_chars / total_length if total_length > 0 else 0,
            'top_repeats': sorted(repeated, key=len, reverse=True)[:10]
        }

    def find_longest_common_subsequence_regions(self, seq1_id: str, seq2_id: str) -> dict:
        """Find regions of high similarity between sequences."""
        if seq1_id not in self.arrays or seq2_id not in self.arrays:
            return {"error": "One or both sequences not found"}

        array1 = self.arrays[seq1_id]
        seq2 = self.sequences[seq2_id]

        # Find longest common substring
        lcs = array1.longest_common_substring(seq2)

        # Find all substantial common regions (length >= 5)
        common_regions = []
        for length in range(5, min(len(self.sequences[seq1_id]), len(seq2)) + 1):
            for i in range(len(seq2) - length + 1):
                substring = seq2[i:i + length]
                if array1.search(substring):
                    positions1 = array1.find_occurrences(substring)
                    if positions1:  # Found in seq1
                        common_regions.append({
                            'sequence': substring,
                            'length': length,
                            'seq1_positions': positions1,
                            'seq2_position': i
                        })

        # Filter overlapping regions, keep longest
        filtered_regions = []
        common_regions.sort(key=lambda x: x['length'], reverse=True)

        used_positions = set()
        for region in common_regions:
            seq2_pos = region['seq2_position']
            length = region['length']
            seq2_range = set(range(seq2_pos, seq2_pos + length))

            if not seq2_range.intersection(used_positions):
                filtered_regions.append(region)
                used_positions.update(seq2_range)

        return {
            'sequence1_id': seq1_id,
            'sequence2_id': seq2_id,
            'longest_common_substring': lcs,
            'lcs_length': len(lcs),
            'common_regions': filtered_regions[:10],  # Top 10
            'total_common_regions': len(filtered_regions),
            'similarity_score': len(lcs) / max(len(self.sequences[seq1_id]), len(seq2))
        }

# Example usage with DNA sequences
comparator = SequenceComparator()

# Add DNA sequences
seq1 = "ATCGATCGATCGAAATTTCGATCGATCG"
seq2 = "GCGATCGATCGCCGATTTAATCGATCGC"
seq3 = "ATCGATCGATCGATCGATCGATCG"  # Highly repetitive

comparator.add_sequence("seq1", seq1)
comparator.add_sequence("seq2", seq2)
comparator.add_sequence("seq3", seq3)

# Find common motifs
print("Common motifs between seq1 and seq2:")
motifs = comparator.find_common_motifs("seq1", "seq2", min_length=4)
for motif in motifs[:5]:  # Show top 5
    print(f"  {motif['motif']} (length: {motif['length']}, freq: {motif['frequency_seq1']})")

# Analyze sequence complexity
for seq_id in ["seq1", "seq2", "seq3"]:
    analysis = comparator.analyze_sequence_complexity(seq_id)
    print(f"\nComplexity analysis for {seq_id}:")
    print(f"  Length: {analysis['length']}")
    print(f"  Repeat density: {analysis['repeat_density']:.2%}")
    print(f"  Complexity score: {analysis['complexity_score']:.3f}")
    print(f"  Average LCP: {analysis['average_lcp']:.1f}")

# Find similarity regions
similarity = comparator.find_longest_common_subsequence_regions("seq1", "seq2")
print(f"\nSimilarity between seq1 and seq2:")
print(f"  LCS: '{similarity['longest_common_substring']}' (length: {similarity['lcs_length']})")
print(f"  Similarity score: {similarity['similarity_score']:.3f}")
print(f"  Common regions found: {similarity['total_common_regions']}")

API Reference

Suffix Array Implementation

A suffix array is a sorted array of all suffixes of a given string. It provides space-efficient string matching and analysis capabilities, using significantly less memory than suffix trees while maintaining good performance for most operations.

Key Features:

  • Pattern searching in O(m log n) time where m is pattern length, n is text length
  • Finding all occurrences of a pattern
  • Longest Common Prefix (LCP) array computation
  • Space-efficient representation
  • Binary search-based operations

Time Complexities:

  • Construction: O(n log n) using comparison-based sorting, O(n) with advanced algorithms
  • Pattern search: O(m log n) where m is pattern length
  • Find occurrences: O(m log n + k) where k is number of occurrences
  • LCP computation: O(n)

Space Complexity: O(n) where n is text length
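
A compact example exercising the operations documented below:

array = SuffixArray()
array.build("mississippi")
print(array.search("issi"))                      # True
print(sorted(array.find_occurrences("issi")))    # [1, 4]
print(array.get_suffix_array()[0])               # 10 ('i' is the smallest suffix)
print(len(array.get_lcp_array()))                # 10 (one entry per adjacent pair)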

SuffixArray()

Bases: SuffixArrayInterface

A suffix array implementation using efficient construction and search algorithms.

This implementation provides string matching and analysis operations by maintaining a sorted array of all suffix starting positions.

Initialize an empty suffix array.

__contains__(pattern)

Check if a pattern is in the text using 'in' operator.

__len__()

Return the length of the text.

__repr__()

Return detailed representation for debugging.

__str__()

Return string representation of the suffix array.

build(text)

Build the suffix array for the given text.

Parameters:

Name  Type  Description                      Default
text  str   Text to build suffix array for   required

find_occurrences(pattern)

Find all occurrences of a pattern in the text using binary search.

Parameters:

Name     Type  Description            Default
pattern  str   Pattern to search for  required

Returns:

Type       Description
List[int]  List of starting positions where pattern occurs

get_lcp_array()

Get the Longest Common Prefix (LCP) array.

Returns:

Type       Description
List[int]  LCP array where lcp[i] is the length of the longest common prefix
           between suffixes at positions i and i+1 in the sorted suffix array

get_rank(suffix_index)

Get the rank (position in sorted order) of the suffix starting at given index.

Parameters:

Name          Type  Description                                   Default
suffix_index  int   Starting position of suffix in original text  required

Returns:

Type  Description
int   Rank of the suffix in sorted order

get_repeated_substrings(min_length=2)

Find all repeated substrings of at least the given length using LCP array.

Parameters:

Name        Type  Description                           Default
min_length  int   Minimum length of substrings to find  2

Returns:

Type       Description
List[str]  List of repeated substrings

get_suffix_array()

Get the suffix array.

Returns:

Type       Description
List[int]  List of suffix starting positions in sorted order

get_suffixes()

Get all suffixes in sorted order.

Returns:

Type       Description
List[str]  List of all suffixes sorted lexicographically

get_text()

Get the original text.

Returns:

Type  Description
str   Original text used to build the suffix array

is_empty()

Check if the suffix array is empty.

Returns:

Type  Description
bool  True if array is empty, False otherwise

lcp(i, j)

Compute the longest common prefix between suffixes at positions i and j.

Parameters:

Name  Type  Description                                 Default
i     int   Position of first suffix in original text   required
j     int   Position of second suffix in original text  required

Returns:

Type  Description
int   Length of longest common prefix

longest_common_substring(other_text)

Find the longest common substring between the array's text and another text.

Parameters:

Name        Type  Description           Default
other_text  str   Text to compare with  required

Returns:

Type  Description
str   Longest common substring

search(pattern)

Search for a pattern in the text using binary search.

Parameters:

Name     Type  Description            Default
pattern  str   Pattern to search for  required

Returns:

Type  Description
bool  True if pattern exists, False otherwise

size()

Get the size of the suffix array.

Returns:

Type  Description
int   Number of suffixes in the array