Suffix Array
A suffix array is a sorted array of all suffixes of a given string. It provides space-efficient string matching and analysis capabilities, using significantly less memory than suffix trees while maintaining good performance for most operations. Suffix arrays are particularly useful in bioinformatics, text compression, and string algorithms where memory efficiency is crucial.
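The idea fits in a few lines of plain Python, shown here independent of any library: record every suffix's starting position and sort the positions by the suffix each one starts.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the starting positions of all suffixes of `text`,
    sorted by the lexicographic order of the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

positions = naive_suffix_array("banana")
print(positions)                            # [5, 3, 1, 0, 4, 2]
print(["banana"[i:] for i in positions])
# ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```

Note that production implementations store only the integer positions; materializing the suffixes themselves, as the last line does for illustration, would cost O(n²) characters.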
Features
- Space-efficient representation - linear space O(n) with smaller constants than suffix trees
- Binary search pattern matching - find patterns in O(m log n) time where m is pattern length
- Multiple occurrence detection - efficiently find all positions where a pattern occurs
- LCP array integration - Longest Common Prefix array for advanced string analysis
- Repeated substring detection - identify all repeated patterns using LCP information
- Rank operations - quick access to suffix ordering information
- Longest common substring - find shared substrings between texts
- Flexible queries - support various string analysis operations with good performance
Time Complexities
| Operation | Time Complexity | Description |
|---|---|---|
| Construction | O(n log n) naive, O(n) advanced | Build array for text of length n |
| Pattern Search | O(m log n) | Search for pattern of length m |
| Find Occurrences | O(m log n + k) | m = pattern length, k = number of occurrences |
| LCP Array Build | O(n) | Build Longest Common Prefix array |
| Get Suffixes | O(n²) | List all suffixes (total length) |
| Repeated Substrings | O(n) + output size | Find all repeated patterns via a linear LCP scan |
| Rank Query | O(1) | Get rank of suffix at position |
| LCP Query | O(1) with preprocessing | LCP between any two suffixes |
| Text Length | O(1) | Get length of original text |
Space Complexity: O(n) where n is the length of the input text
Basic Usage
Creating and Building
from algo.data_structs.strings.suffix_array import SuffixArray
# Create empty suffix array
array = SuffixArray()
print(array.is_empty()) # True
print(array.size()) # 0
print(len(array)) # 0
# Build array with text
array.build("banana")
print(array.is_empty()) # False
print(array.size()) # 6 (number of suffixes)
print(len(array)) # 6 (length of original text)
print(array.get_text()) # "banana"
# String representations
print(str(array)) # SuffixArray(text='banana', size=6)
print(repr(array)) # SuffixArray(text_length=6, suffixes=6)
Understanding Suffix Arrays
# Build array and examine its structure
array = SuffixArray()
array.build("abc")
# Get all suffixes in sorted order
suffixes = array.get_suffixes()
print(f"Sorted suffixes: {suffixes}") # ['abc', 'bc', 'c']
# Get suffix array (starting positions of sorted suffixes)
suffix_array = array.get_suffix_array()
print(f"Suffix array: {suffix_array}") # [0, 1, 2]
# Verify relationship between suffix array and suffixes
original_text = array.get_text()
for i, start_pos in enumerate(suffix_array):
suffix_from_position = original_text[start_pos:]
suffix_from_list = suffixes[i]
print(f"Position {start_pos}: '{suffix_from_position}' == '{suffix_from_list}'")
assert suffix_from_position == suffix_from_list
Longest Common Prefix (LCP) Array
array = SuffixArray()
array.build("banana")
# Get LCP array
lcp_array = array.get_lcp_array()
print(f"LCP array: {lcp_array}") # [1, 3, 0, 0, 2]
# Understand LCP values
suffixes = array.get_suffixes()
print(f"Suffixes: {suffixes}")
# ['a', 'ana', 'anana', 'banana', 'na', 'nana']
# LCP[i] = longest common prefix between suffixes[i] and suffixes[i+1]
for i, lcp_len in enumerate(lcp_array):
suffix1 = suffixes[i]
suffix2 = suffixes[i + 1]
common_prefix = suffix1[:lcp_len] if lcp_len > 0 else ""
print(f"LCP('{suffix1}', '{suffix2}') = {lcp_len} ('{common_prefix}')")
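The O(n) LCP-construction bound quoted earlier is usually achieved with Kasai's algorithm; here is a standalone sketch (the library's internals may differ). The key invariant is that when we move from the suffix at position `pos` to the one at `pos + 1`, the match length `h` can drop by at most 1, so the total work is linear.

```python
def kasai_lcp(text: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: LCP array in O(n) given text and suffix array.
    lcp[i] = common-prefix length of the suffixes ranked i and i + 1."""
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * max(n - 1, 0)
    h = 0
    for pos in range(n):
        r = rank[pos]
        if r == n - 1:           # last suffix in sorted order: no successor
            h = 0
            continue
        nxt = sa[r + 1]          # suffix ranked just after this one
        while pos + h < n and nxt + h < n and text[pos + h] == text[nxt + h]:
            h += 1
        lcp[r] = h
        if h:
            h -= 1               # invariant: h drops by at most 1 per step
    return lcp

sa = sorted(range(6), key=lambda i: "banana"[i:])
print(kasai_lcp("banana", sa))   # [1, 3, 0, 0, 2]
```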
Text Construction and Manipulation
Single Text Building
array = SuffixArray()
# Build with simple text
array.build("hello")
print(f"Text: '{array.get_text()}'") # 'hello'
print(f"Suffixes: {array.get_suffixes()}") # ['ello', 'hello', 'llo', 'lo', 'o']
# Build with repeated patterns
array.build("abcabc")
print(f"Text: '{array.get_text()}'") # 'abcabc'
print(f"Size: {array.size()}") # 6
# Build with single character
array.build("x")
print(f"Suffixes: {array.get_suffixes()}") # ['x']
print(f"Suffix array: {array.get_suffix_array()}") # [0]
print(f"LCP array: {array.get_lcp_array()}") # [] (empty for single suffix)
Handling Different Text Types
# Empty text
empty_array = SuffixArray()
empty_array.build("")
print(f"Empty: {empty_array.is_empty()}") # True
print(f"Suffixes: {empty_array.get_suffixes()}") # []
# Repeated characters
repeat_array = SuffixArray()
repeat_array.build("aaaa")
print(f"Suffixes: {repeat_array.get_suffixes()}")
# ['a', 'aa', 'aaa', 'aaaa'] (sorted lexicographically)
lcp = repeat_array.get_lcp_array()
print(f"LCP array: {lcp}") # [1, 2, 3] (increasing overlap)
# Complex text with patterns
complex_array = SuffixArray()
complex_array.build("abracadabra")
print(f"Text: '{complex_array.get_text()}'") # 'abracadabra'
print(f"Number of suffixes: {complex_array.size()}") # 11
Rebuilding Arrays
array = SuffixArray()
# Build with first text
array.build("abc")
print(f"First text: '{array.get_text()}'") # 'abc'
# Rebuild with different text (replaces previous)
array.build("xyz")
print(f"New text: '{array.get_text()}'") # 'xyz'
print("Previous structure is replaced")
# Build with longer text
array.build("programming")
print(f"Longer text: '{array.get_text()}'") # 'programming'
print(f"Array size: {array.size()}") # 11
Pattern Matching
Basic Pattern Search
array = SuffixArray()
array.build("banana")
# Search for various patterns
print(f"Search 'an': {array.search('an')}") # True
print(f"Search 'ana': {array.search('ana')}") # True
print(f"Search 'ban': {array.search('ban')}") # True
print(f"Search 'xyz': {array.search('xyz')}") # False
# Using 'in' operator
print(f"'ana' in array: {'ana' in array}") # True
print(f"'cab' in array: {'cab' in array}") # False
# Edge cases
print(f"Search '': {array.search('')}") # False
print(f"Search 'banana': {array.search('banana')}") # True (full text)
print(f"Search 'bananas': {array.search('bananas')}") # False (longer than text)
Finding All Occurrences
array = SuffixArray()
array.build("abracadabra")
# Find all occurrences of patterns
a_positions = array.find_occurrences("a")
print(f"'a' occurs at positions: {sorted(a_positions)}")
# [0, 3, 5, 7, 10] (all positions where 'a' appears)
abr_positions = array.find_occurrences("abr")
print(f"'abr' occurs at positions: {sorted(abr_positions)}")
# [0, 7] ('abr' starts the text and the second 'abra')
br_positions = array.find_occurrences("br")
print(f"'br' occurs at positions: {sorted(br_positions)}")
# [1, 8] ('br' in 'abr' patterns)
# Non-existent pattern
xyz_positions = array.find_occurrences("xyz")
print(f"'xyz' occurs at positions: {xyz_positions}")
# [] (empty list)
# Overlapping patterns
array.build("aaaa")
aa_positions = array.find_occurrences("aa")
print(f"'aa' in 'aaaa' at: {sorted(aa_positions)}")
# [0, 1, 2] (overlapping occurrences)
Pattern Matching Performance
# Demonstrate binary search efficiency
array = SuffixArray()
text = "abcdef" * 100 # 600 characters
array.build(text)
import time
# Time pattern search
start = time.time()
result = array.search("cdef")
search_time = time.time() - start
print(f"Pattern found: {result}")
print(f"Search time: {search_time:.6f} seconds")
# Time finding all occurrences
start = time.time()
occurrences = array.find_occurrences("abc")
find_time = time.time() - start
print(f"Found {len(occurrences)} occurrences")
print(f"Find time: {find_time:.6f} seconds")
# Verify occurrences
print(f"First few positions: {sorted(occurrences)[:5]}")
Advanced String Analysis
Rank Operations
array = SuffixArray()
array.build("banana")
# Get rank of each suffix (position in sorted order)
text = array.get_text()
for i in range(len(text)):
suffix = text[i:]
rank = array.get_rank(i)
print(f"Suffix at position {i} ('{suffix}') has rank {rank}")
# Verify rank consistency with suffix array
suffix_array = array.get_suffix_array()
for rank, position in enumerate(suffix_array):
assert array.get_rank(position) == rank
print(f"✓ Rank {rank} -> Position {position} -> Rank {array.get_rank(position)}")
print("All rank operations consistent!")
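Since the suffix array maps rank → position, the rank array is simply its inverse permutation, which is exactly what the verification loop checks. A standalone sketch:

```python
def rank_from_suffix_array(sa: list[int]) -> list[int]:
    """Invert the suffix array: rank[pos] = where the suffix
    starting at text position pos lands in sorted order."""
    rank = [0] * len(sa)
    for r, pos in enumerate(sa):
        rank[pos] = r
    return rank

sa = sorted(range(6), key=lambda i: "banana"[i:])   # [5, 3, 1, 0, 4, 2]
print(rank_from_suffix_array(sa))                   # [3, 2, 5, 1, 4, 0]
```

Precomputing this inverse is what makes the library's O(1) rank query possible.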
LCP Between Arbitrary Positions
array = SuffixArray()
array.build("banana")
# Compute LCP between suffixes at different positions
test_pairs = [(0, 1), (1, 3), (2, 4), (0, 3), (1, 5)]
for i, j in test_pairs:
suffix_i = array.get_text()[i:]
suffix_j = array.get_text()[j:]
lcp_length = array.lcp(i, j)
# Extract actual common prefix
common_prefix = ""
for k in range(min(len(suffix_i), len(suffix_j))):
if suffix_i[k] == suffix_j[k]:
common_prefix += suffix_i[k]
else:
break
print(f"LCP(pos {i}, pos {j}): '{suffix_i}' vs '{suffix_j}' = {lcp_length} ('{common_prefix}')")
# Special cases
print(f"LCP(same position): {array.lcp(0, 0)}") # Full length
print(f"LCP(out of bounds): {array.lcp(-1, 100)}") # 0
Finding Repeated Substrings
array = SuffixArray()
array.build("abracadabra")
# Find repeated substrings of different minimum lengths
repeated_2 = array.get_repeated_substrings(min_length=2)
print(f"Repeated substrings (≥2 chars): {sorted(repeated_2)}")
# Should include 'ab', 'br', 'ra', 'abr', 'bra', etc.
repeated_3 = array.get_repeated_substrings(min_length=3)
print(f"Repeated substrings (≥3 chars): {sorted(repeated_3)}")
# Should include 'abr', 'bra', etc.
# Example with highly repetitive text
array.build("ababab")
repeated_ab = array.get_repeated_substrings(min_length=2)
print(f"Repeated in 'ababab': {sorted(repeated_ab)}")
# Should include 'ab', 'ba', 'aba', 'bab', 'abab' ('baba' occurs only once)
# Show how LCP array drives the detection
lcp_array = array.get_lcp_array()
print(f"LCP array for 'ababab': {lcp_array}")
print("LCP values ≥ min_length indicate repeated substrings")
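The LCP-driven detection described above can be sketched without the library: any adjacent pair of sorted suffixes with LCP ≥ min_length repeats every prefix of the shared run. A minimal standalone version, using naive construction and naive pairwise LCP for brevity:

```python
def repeated_substrings(text: str, min_length: int = 2) -> set[str]:
    """Every substring occurring at least twice, read off adjacent
    sorted suffixes: a shared L-char prefix repeats all of its
    prefixes of length min_length..L."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for a, b in zip(sa, sa[1:]):
        # LCP of two suffixes adjacent in sorted order
        l = 0
        while a + l < len(text) and b + l < len(text) and text[a + l] == text[b + l]:
            l += 1
        for length in range(min_length, l + 1):
            found.add(text[a:a + length])
    return found

print(sorted(repeated_substrings("ababab")))
# ['ab', 'aba', 'abab', 'ba', 'bab']
```

This is complete because all suffixes sharing a prefix S form one contiguous block in sorted order, so at least one adjacent pair inside the block has LCP ≥ len(S).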
Longest Common Substring
array = SuffixArray()
# Build array with first text
array.build("abcdef")
# Find longest common substring with various texts
lcs1 = array.longest_common_substring("defghi")
print(f"LCS of 'abcdef' and 'defghi': '{lcs1}'") # 'def'
lcs2 = array.longest_common_substring("xyzabc")
print(f"LCS of 'abcdef' and 'xyzabc': '{lcs2}'") # 'abc'
lcs3 = array.longest_common_substring("123456")
print(f"LCS of 'abcdef' and '123456': '{lcs3}'") # '' (no common)
# More complex example
array.build("banana")
lcs4 = array.longest_common_substring("ananas")
print(f"LCS of 'banana' and 'ananas': '{lcs4}'") # 'anana'
# Identical texts
lcs5 = array.longest_common_substring("banana")
print(f"LCS of 'banana' and 'banana': '{lcs5}'") # 'banana'
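One classic suffix-array approach to this problem (not necessarily what the library does internally): sort the suffixes of `text1 + separator + text2` and take the longest common prefix over adjacent pairs that straddle the two inputs. A sketch, assuming the separator character never occurs in either input:

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """LCS via one suffix array over s1 + sep + s2. The best adjacent
    straddling pair in sorted suffix order carries the answer; since
    sep occurs exactly once, a matched run can never cross it."""
    sep = "\x00"                      # assumption: absent from s1 and s2
    combined = s1 + sep + s2
    sa = sorted(range(len(combined)), key=lambda i: combined[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # keep only pairs where one suffix starts in s1, the other in s2
        if (a < len(s1)) == (b < len(s1)):
            continue
        l = 0
        while (a + l < len(combined) and b + l < len(combined)
               and combined[a + l] == combined[b + l]):
            l += 1
        if l > len(best):
            best = combined[a:a + l]
    return best

print(longest_common_substring("abcdef", "defghi"))   # def
print(longest_common_substring("banana", "ananas"))   # anana
```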
Working with Different Text Types
DNA Sequence Analysis
def analyze_dna_with_suffix_array():
"""Analyze DNA sequence using suffix array."""
array = SuffixArray()
# Build array with DNA sequence
dna_sequence = "ATCGATCGATCGAAATTTCGATCGATCG"
array.build(dna_sequence)
print(f"DNA Sequence: {dna_sequence}")
print(f"Sequence length: {len(array)}")
# Find repeated sequences (important in genomics)
repeated = array.get_repeated_substrings(min_length=3)
print(f"Repeated sequences (≥3 bp): {sorted(repeated)}")
# Look for specific patterns (start/stop codons)
codons = ["ATG", "TAA", "TGA", "TAG"]
for codon in codons:
occurrences = array.find_occurrences(codon)
if occurrences:
print(f"Found {codon} at positions: {sorted(occurrences)}")
# Analyze base composition
bases = "ATCG"
for base in bases:
count = len(array.find_occurrences(base))
percentage = (count / len(dna_sequence)) * 100
print(f"Base {base}: {count} occurrences ({percentage:.1f}%)")
# Find inverted repeats: segments whose reverse complement also occurs
# (a true biological palindrome is a segment equal to its own reverse complement)
def reverse_complement(seq):
complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
return ''.join(complement.get(base, base) for base in reversed(seq))
print(f"\nSearching for inverted repeats:")
for length in range(4, 8): # 4-7 bp segments
text = array.get_text()
for i in range(len(text) - length + 1):
subseq = text[i:i + length]
rev_comp = reverse_complement(subseq)
if array.search(rev_comp) and subseq != rev_comp:
print(f"Inverted repeat: {subseq} <-> {rev_comp}")
# Run DNA analysis
analyze_dna_with_suffix_array()
Text Processing Applications
def text_frequency_analysis():
"""Demonstrate text analysis using suffix arrays."""
text = "the quick brown fox jumps over the lazy dog the fox is quick"
clean_text = text.replace(" ", "").lower()
array = SuffixArray()
array.build(clean_text)
print(f"Analyzing text: '{text}'")
print(f"Clean text: '{clean_text}'")
print(f"Text length: {len(array)} characters")
# Find most common letters
letter_counts = {}
for letter in "abcdefghijklmnopqrstuvwxyz":
occurrences = array.find_occurrences(letter)
if occurrences:
letter_counts[letter] = len(occurrences)
print(f"\nLetter frequencies (top 5):")
for letter, count in sorted(letter_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
percentage = (count / len(clean_text)) * 100
print(f" '{letter}': {count} times ({percentage:.1f}%)")
# Find repeated patterns
repeated = array.get_repeated_substrings(min_length=3)
print(f"\nRepeated patterns (≥3 chars): {sorted(repeated)[:10]}")
# Analyze specific patterns
patterns = ["the", "qu", "ck", "ox", "og"]
print(f"\nPattern analysis:")
for pattern in patterns:
# Remove spaces from pattern for clean text search
clean_pattern = pattern.replace(" ", "")
if clean_pattern:
occurrences = array.find_occurrences(clean_pattern)
positions = sorted(occurrences)
print(f" '{pattern}': {len(occurrences)} occurrences at {positions}")
# Run text analysis
text_frequency_analysis()
Protein Sequence Analysis
def protein_analysis_with_suffix_array():
"""Analyze protein sequences using suffix arrays."""
# Example protein sequence (amino acid single-letter codes)
protein = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
array = SuffixArray()
array.build(protein)
print(f"Protein sequence: {protein}")
print(f"Length: {len(array)} amino acids")
# Find repeated domains (important for protein function)
repeated_domains = array.get_repeated_substrings(min_length=4)
print(f"Repeated domains (≥4 AA): {sorted(repeated_domains)}")
# Look for common motifs
motifs = ["RGD", "PEST", "KDEL", "RR", "KK"] # Common protein motifs
for motif in motifs:
occurrences = array.find_occurrences(motif)
if occurrences:
print(f"Found motif {motif} at positions: {sorted(occurrences)}")
# Analyze amino acid composition
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
composition = {}
for aa in amino_acids:
occurrences = array.find_occurrences(aa)
if occurrences:
composition[aa] = len(occurrences)
print(f"\nAmino acid composition (top 5):")
for aa, count in sorted(composition.items(), key=lambda x: x[1], reverse=True)[:5]:
percentage = (count / len(protein)) * 100
print(f" {aa}: {count} ({percentage:.1f}%)")
# Find hydrophobic regions (simplified)
hydrophobic = "AILMFPWYV"
hydrophobic_patterns = []
for length in range(4, 8): # 4-7 AA hydrophobic stretches
text = array.get_text()
for i in range(len(text) - length + 1):
segment = text[i:i + length]
if all(aa in hydrophobic for aa in segment):
if array.search(segment): # Verify it exists
hydrophobic_patterns.append((segment, i))
if hydrophobic_patterns:
print(f"\nHydrophobic regions found:")
for pattern, pos in hydrophobic_patterns[:5]: # Show first 5
print(f" '{pattern}' at position {pos}")
# Run protein analysis
protein_analysis_with_suffix_array()
Error Handling and Edge Cases
Empty and Invalid Inputs
# Test with empty inputs
empty_array = SuffixArray()
# Operations on empty array
print(f"Empty array search: {empty_array.search('any')}") # False
print(f"Empty array occurrences: {empty_array.find_occurrences('a')}") # []
print(f"Empty array suffixes: {empty_array.get_suffixes()}") # []
print(f"Empty array LCP: {empty_array.get_lcp_array()}") # []
print(f"Empty array size: {empty_array.size()}") # 0
print(f"Empty array is empty: {empty_array.is_empty()}") # True
# Build with empty string
empty_array.build("")
print(f"After empty build - is empty: {empty_array.is_empty()}") # True
print(f"After empty build - text: '{empty_array.get_text()}'") # ''
# Test edge cases
edge_array = SuffixArray()
edge_array.build("a")
print(f"Single char - search 'a': {edge_array.search('a')}") # True
print(f"Single char - search 'b': {edge_array.search('b')}") # False
print(f"Single char - suffixes: {edge_array.get_suffixes()}") # ['a']
print(f"Single char - LCP array: {edge_array.get_lcp_array()}") # []
# Test rank operations with error handling
try:
edge_array.get_rank(-1)
print("ERROR: Should have raised IndexError")
except IndexError:
print("✓ Correctly handled negative index")
try:
edge_array.get_rank(1) # Out of bounds
print("ERROR: Should have raised IndexError")
except IndexError:
print("✓ Correctly handled out-of-bounds index")
Special Characters and Unicode
# Test with special characters
special_array = SuffixArray()
special_text = "hello, world! 123 @#$%"
special_array.build(special_text)
print(f"Special text: '{special_array.get_text()}'")
print(f"Contains 'hello': {special_array.search('hello')}") # True
print(f"Contains '!': {special_array.search('!')}") # True
print(f"Contains '123': {special_array.search('123')}") # True
print(f"Contains '@#$': {special_array.search('@#$')}") # True
# Find positions of special characters
special_chars = "!@#$%"
for char in special_chars:
positions = special_array.find_occurrences(char)
if positions:
print(f"'{char}' found at positions: {positions}")
# Test with Unicode characters
unicode_array = SuffixArray()
unicode_text = "café naïve résumé"
unicode_array.build(unicode_text)
print(f"\nUnicode text: '{unicode_array.get_text()}'")
print(f"Contains 'café': {unicode_array.search('café')}") # True
print(f"Contains 'é': {unicode_array.search('é')}") # True
print(f"Contains 'ï': {unicode_array.search('ï')}") # True
# Get Unicode suffixes
unicode_suffixes = unicode_array.get_suffixes()
print(f"Unicode suffixes: {unicode_suffixes}")
# Test repeated Unicode patterns
unicode_array.build("αβγαβγ")
repeated_unicode = unicode_array.get_repeated_substrings(min_length=2)
print(f"Repeated Unicode patterns: {repeated_unicode}")
Large Text Handling
def test_suffix_array_performance():
"""Test suffix array with larger texts."""
import time
# Generate test text with patterns
base_pattern = "abcde"
large_text = base_pattern * 200 # 1000 characters
print(f"Testing with text of length: {len(large_text)}")
# Build array
start_time = time.time()
array = SuffixArray()
array.build(large_text)
build_time = time.time() - start_time
print(f"Array built in {build_time:.4f} seconds")
print(f"Array has {array.size()} suffixes")
# Test pattern search
start_time = time.time()
result = array.search("abcde")
search_time = time.time() - start_time
print(f"Pattern search took {search_time:.6f} seconds")
print(f"Pattern found: {result}")
# Test finding occurrences
start_time = time.time()
occurrences = array.find_occurrences("abc")
occurrence_time = time.time() - start_time
print(f"Finding {len(occurrences)} occurrences took {occurrence_time:.6f} seconds")
# Test LCP array computation
start_time = time.time()
lcp_array = array.get_lcp_array()
lcp_time = time.time() - start_time
print(f"LCP array computation took {lcp_time:.6f} seconds")
print(f"LCP array has {len(lcp_array)} entries")
# Test repeated substring detection
start_time = time.time()
repeated = array.get_repeated_substrings(min_length=3)
repeated_time = time.time() - start_time
print(f"Finding repeated substrings took {repeated_time:.6f} seconds")
print(f"Found {len(repeated)} repeated patterns")
# Uncomment to run performance test
# test_suffix_array_performance()
Performance Considerations and Best Practices
Memory Usage Optimization
def analyze_suffix_array_memory_usage():
"""Analyze memory usage patterns of suffix arrays."""
# Compare arrays with different characteristics
# Array with no repetition
unique_array = SuffixArray()
unique_array.build("abcdefghijk")
# Array with high repetition
repeat_array = SuffixArray()
repeat_array.build("ababababab")
print("Memory usage comparison:")
print(f"Unique text array: {unique_array.size()} suffixes for {len(unique_array)} chars")
print(f"Repetitive text array: {repeat_array.size()} suffixes for {len(repeat_array)} chars")
# Show LCP array benefits for repetitive text
unique_lcp = unique_array.get_lcp_array()
repeat_lcp = repeat_array.get_lcp_array()
avg_lcp_unique = sum(unique_lcp) / len(unique_lcp) if unique_lcp else 0
avg_lcp_repeat = sum(repeat_lcp) / len(repeat_lcp) if repeat_lcp else 0
print(f"Average LCP lengths:")
print(f" Unique text: {avg_lcp_unique:.2f}")
print(f" Repetitive text: {avg_lcp_repeat:.2f}")
print(f" Higher LCP indicates more structure/repetition")
# analyze_suffix_array_memory_usage()
Usage Guidelines
def suffix_array_best_practices():
"""Demonstrate best practices for suffix array usage."""
print("Suffix Array Best Practices:")
print("="*50)
# 1. When to use suffix arrays vs other structures
print("1. Use suffix arrays when you need:")
print(" - Memory-efficient string matching")
print(" - Multiple pattern searches in same text")
print(" - Good balance of time/space complexity")
print(" - LCP-based string analysis")
# 2. Comparison with alternatives
print("\n2. Suffix arrays vs alternatives:")
print(" - vs Suffix trees: Less memory, slightly slower queries")
print(" - vs Hash tables: Better for range queries, substring analysis")
print(" - vs Naive search: Much faster for multiple queries")
# 3. Optimal use cases
array = SuffixArray()
array.build("example")
print(f"\n3. Performance characteristics:")
print(f" Text length: {len(array)}")
print(f" Array size: {array.size()}")
print(f" Memory ratio: {array.size()}/{len(array)} = {array.size()/len(array):.2f}")
# 4. Building strategy
print(f"\n4. Building strategy:")
print(f" - Build once for multiple queries: Excellent")
print(f" - Frequent text changes: Consider alternatives")
print(f" - Large texts: Monitor memory usage")
# 5. Query optimization
print(f"\n5. Query optimization:")
print(f" - Use binary search advantage: O(m log n) vs O(mn) naive")
print(f" - Batch similar queries together")
print(f" - Precompute LCP array for repeated substring analysis")
return {
'text_length': len(array),
'array_size': array.size(),
'memory_efficiency': array.size() / len(array)
}
# Example usage
practices_results = suffix_array_best_practices()
print(f"\nEfficiency metrics: {practices_results}")
Practical Applications
Document Search Engine
class DocumentSearchEngine:
"""Simple search engine using suffix arrays."""
def __init__(self):
self.documents = {}
self.arrays = {}
def add_document(self, doc_id: str, text: str):
"""Add a document to the search index."""
# Clean text (remove extra spaces, lowercase)
clean_text = ' '.join(text.lower().split())
self.documents[doc_id] = clean_text
# Build suffix array
array = SuffixArray()
array.build(clean_text)
self.arrays[doc_id] = array
def search(self, query: str, max_results: int = 10) -> list:
"""Search for query across all documents."""
clean_query = query.lower().strip()
results = []
for doc_id, array in self.arrays.items():
if array.search(clean_query):
occurrences = array.find_occurrences(clean_query)
score = len(occurrences) # Simple scoring based on frequency
# Extract snippets around matches
snippets = []
text = self.documents[doc_id]
for pos in occurrences[:3]: # Max 3 snippets per document
start = max(0, pos - 20)
end = min(len(text), pos + len(clean_query) + 20)
snippet = text[start:end]
if start > 0:
snippet = "..." + snippet
if end < len(text):
snippet = snippet + "..."
snippets.append(snippet)
results.append({
'document_id': doc_id,
'score': score,
'occurrences': len(occurrences),
'snippets': snippets
})
# Sort by score (frequency) and limit results
results.sort(key=lambda x: x['score'], reverse=True)
return results[:max_results]
def get_document_stats(self, doc_id: str) -> dict:
"""Get statistics for a document."""
if doc_id not in self.arrays:
return {"error": "Document not found"}
array = self.arrays[doc_id]
text = self.documents[doc_id]
# Find repeated phrases
repeated = array.get_repeated_substrings(min_length=4)
# Character frequency
char_freq = {}
for char in text:
if char.isalnum():
char_freq[char] = char_freq.get(char, 0) + 1
return {
'document_id': doc_id,
'text_length': len(array),
'unique_suffixes': array.size(),
'repeated_phrases': len(repeated),
'top_repeated': sorted(repeated, key=len, reverse=True)[:5],
'char_frequency': dict(sorted(char_freq.items(), key=lambda x: x[1], reverse=True)[:10])
}
# Example usage
search_engine = DocumentSearchEngine()
# Add documents
doc1 = "The quick brown fox jumps over the lazy dog. The dog was very lazy."
doc2 = "A quick brown fox leaps gracefully. The fox is very quick and agile."
doc3 = "The lazy cat sleeps all day. Cats are generally lazy animals."
search_engine.add_document("doc1", doc1)
search_engine.add_document("doc2", doc2)
search_engine.add_document("doc3", doc3)
# Search for queries
queries = ["quick", "lazy", "fox", "dog", "animal"]
for query in queries:
print(f"\nSearch results for '{query}':")
results = search_engine.search(query, max_results=3)
if not results:
print(" No results found")
else:
for i, result in enumerate(results, 1):
print(f" {i}. Document {result['document_id']} (score: {result['score']})")
print(f" Occurrences: {result['occurrences']}")
for snippet in result['snippets']:
print(f" \"{snippet}\"")
# Get document statistics
print(f"\nDocument statistics:")
for doc_id in ["doc1", "doc2", "doc3"]:
stats = search_engine.get_document_stats(doc_id)
print(f"{doc_id}: {stats['text_length']} chars, {stats['repeated_phrases']} repeated phrases")
print(f" Top repeated: {stats['top_repeated']}")
Bioinformatics Sequence Comparison
class SequenceComparator:
"""Compare biological sequences using suffix arrays."""
def __init__(self):
self.sequences = {}
self.arrays = {}
def add_sequence(self, seq_id: str, sequence: str):
"""Add a biological sequence."""
# Convert to uppercase for consistency
clean_seq = sequence.upper().strip()
self.sequences[seq_id] = clean_seq
# Build suffix array
array = SuffixArray()
array.build(clean_seq)
self.arrays[seq_id] = array
def find_common_motifs(self, seq1_id: str, seq2_id: str, min_length: int = 3) -> list:
"""Find common motifs between two sequences."""
if seq1_id not in self.arrays or seq2_id not in self.arrays:
return []
array1 = self.arrays[seq1_id]
seq2 = self.sequences[seq2_id]
common_motifs = []
# Find all common substrings
for i in range(len(seq2)):
for j in range(i + min_length, len(seq2) + 1):
motif = seq2[i:j]
if array1.search(motif):
positions1 = array1.find_occurrences(motif)
positions2 = [i] # Position in seq2
common_motifs.append({
"motif": motif,
"length": len(motif),
"positions_seq1": positions1,
"positions_seq2": positions2,
"frequency_seq1": len(positions1)
})
# Remove duplicates and sort by length and frequency
unique_motifs = {}
for motif_info in common_motifs:
motif = motif_info["motif"]
if motif not in unique_motifs:
unique_motifs[motif] = motif_info
elif len(motif_info["positions_seq1"]) > len(unique_motifs[motif]["positions_seq1"]):
unique_motifs[motif] = motif_info
return sorted(unique_motifs.values(),
key=lambda x: (x["length"], x["frequency_seq1"]),
reverse=True)
def analyze_sequence_complexity(self, seq_id: str) -> dict:
"""Analyze complexity and repetitiveness of a sequence."""
if seq_id not in self.arrays:
return {"error": "Sequence not found"}
array = self.arrays[seq_id]
sequence = self.sequences[seq_id]
# Find repeated elements
repeated = array.get_repeated_substrings(min_length=3)
# Calculate complexity metrics
total_length = len(sequence)
unique_chars = len(set(sequence))
# Analyze LCP array for repetitiveness
lcp_array = array.get_lcp_array()
avg_lcp = sum(lcp_array) / len(lcp_array) if lcp_array else 0
max_lcp = max(lcp_array) if lcp_array else 0
# Calculate repeat coverage
repeat_coverage = 0
for repeat in repeated:
occurrences = array.find_occurrences(repeat)
repeat_coverage += len(repeat) * len(occurrences)
repeat_density = repeat_coverage / total_length if total_length > 0 else 0
return {
'sequence_id': seq_id,
'length': total_length,
'unique_characters': unique_chars,
'total_repeats': len(repeated),
'average_lcp': avg_lcp,
'max_lcp': max_lcp,
'repeat_density': repeat_density,
'complexity_score': unique_chars / total_length if total_length > 0 else 0,
'top_repeats': sorted(repeated, key=len, reverse=True)[:10]
}
def find_longest_common_subsequence_regions(self, seq1_id: str, seq2_id: str) -> dict:
"""Find regions of high similarity between sequences."""
if seq1_id not in self.arrays or seq2_id not in self.arrays:
return {"error": "One or both sequences not found"}
array1 = self.arrays[seq1_id]
seq2 = self.sequences[seq2_id]
# Find longest common substring
lcs = array1.longest_common_substring(seq2)
# Find all substantial common regions (length >= 5)
common_regions = []
for length in range(5, min(len(self.sequences[seq1_id]), len(seq2)) + 1):
for i in range(len(seq2) - length + 1):
substring = seq2[i:i + length]
if array1.search(substring):
positions1 = array1.find_occurrences(substring)
if positions1: # Found in seq1
common_regions.append({
'sequence': substring,
'length': length,
'seq1_positions': positions1,
'seq2_position': i
})
# Filter overlapping regions, keep longest
filtered_regions = []
common_regions.sort(key=lambda x: x['length'], reverse=True)
used_positions = set()
for region in common_regions:
seq2_pos = region['seq2_position']
length = region['length']
seq2_range = set(range(seq2_pos, seq2_pos + length))
if not seq2_range.intersection(used_positions):
filtered_regions.append(region)
used_positions.update(seq2_range)
return {
'sequence1_id': seq1_id,
'sequence2_id': seq2_id,
'longest_common_substring': lcs,
'lcs_length': len(lcs),
'common_regions': filtered_regions[:10], # Top 10
'total_common_regions': len(filtered_regions),
'similarity_score': len(lcs) / max(len(self.sequences[seq1_id]), len(seq2))
}
# Example usage with DNA sequences
comparator = SequenceComparator()
# Add DNA sequences
seq1 = "ATCGATCGATCGAAATTTCGATCGATCG"
seq2 = "GCGATCGATCGCCGATTTAATCGATCGC"
seq3 = "ATCGATCGATCGATCGATCGATCG" # Highly repetitive
comparator.add_sequence("seq1", seq1)
comparator.add_sequence("seq2", seq2)
comparator.add_sequence("seq3", seq3)
# Find common motifs
print("Common motifs between seq1 and seq2:")
motifs = comparator.find_common_motifs("seq1", "seq2", min_length=4)
for motif in motifs[:5]: # Show top 5
print(f" {motif['motif']} (length: {motif['length']}, freq: {motif['frequency_seq1']})")
# Analyze sequence complexity
for seq_id in ["seq1", "seq2", "seq3"]:
analysis = comparator.analyze_sequence_complexity(seq_id)
print(f"\nComplexity analysis for {seq_id}:")
print(f" Length: {analysis['length']}")
print(f" Repeat density: {analysis['repeat_density']:.2%}")
print(f" Complexity score: {analysis['complexity_score']:.3f}")
print(f" Average LCP: {analysis['average_lcp']:.1f}")
# Find similarity regions
similarity = comparator.find_longest_common_subsequence_regions("seq1", "seq2")
print(f"\nSimilarity between seq1 and seq2:")
print(f" LCS: '{similarity['longest_common_substring']}' (length: {similarity['lcs_length']})")
print(f" Similarity score: {similarity['similarity_score']:.3f}")
print(f" Common regions found: {similarity['total_common_regions']}")
API Reference
Suffix Array Implementation
Key Features:
- Pattern searching in O(m log n) time where m is pattern length, n is text length
- Finding all occurrences of a pattern
- Longest Common Prefix (LCP) array computation
- Space-efficient representation
- Binary search-based operations
Time Complexities:
- Construction: O(n log n) using comparison-based sorting, O(n) with advanced algorithms
- Pattern search: O(m log n) where m is pattern length
- Find occurrences: O(m log n + k) where k is number of occurrences
- LCP computation: O(n)
Space Complexity: O(n) where n is text length
SuffixArray()
Bases: SuffixArrayInterface
A suffix array implementation using efficient construction and search algorithms.
This implementation provides string matching and analysis operations by maintaining a sorted array of all suffix starting positions.
Initialize an empty suffix array.
__contains__(pattern)
Check if a pattern is in the text using 'in' operator.
__len__()
Return the length of the text.
__repr__()
Return detailed representation for debugging.
__str__()
Return string representation of the suffix array.
build(text)
Build the suffix array for the given text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to build suffix array for | required |
find_occurrences(pattern)
Find all occurrences of a pattern in the text using binary search.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | Pattern to search for | required |
Returns:
| Type | Description |
|---|---|
| List[int] | List of starting positions where pattern occurs |
get_lcp_array()
Get the Longest Common Prefix (LCP) array.
Returns:
| Type | Description |
|---|---|
| List[int] | LCP array where lcp[i] is the length of the longest common prefix between suffixes at positions i and i+1 in the sorted suffix array |
get_rank(suffix_index)
Get the rank (position in sorted order) of the suffix starting at given index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| suffix_index | int | Starting position of suffix in original text | required |
Returns:
| Type | Description |
|---|---|
| int | Rank of the suffix in sorted order |
get_repeated_substrings(min_length=2)
Find all repeated substrings of at least the given length using LCP array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_length | int | Minimum length of substrings to find | 2 |
Returns:
| Type | Description |
|---|---|
| List[str] | List of repeated substrings |
get_suffix_array()
Get the suffix array.
Returns:
| Type | Description |
|---|---|
| List[int] | List of suffix starting positions in sorted order |
get_suffixes()
Get all suffixes in sorted order.
Returns:
| Type | Description |
|---|---|
| List[str] | List of all suffixes sorted lexicographically |
get_text()
Get the original text.
Returns:
| Type | Description |
|---|---|
| str | Original text used to build the suffix array |
is_empty()
Check if the suffix array is empty.
Returns:
| Type | Description |
|---|---|
| bool | True if array is empty, False otherwise |
lcp(i, j)
Compute the longest common prefix between suffixes at positions i and j.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| i | int | Position of first suffix in original text | required |
| j | int | Position of second suffix in original text | required |
Returns:
| Type | Description |
|---|---|
| int | Length of longest common prefix |
longest_common_substring(other_text)
Find the longest common substring between the array's text and another text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other_text | str | Text to compare with | required |
Returns:
| Type | Description |
|---|---|
| str | Longest common substring |
search(pattern)
Search for a pattern in the text using binary search.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | Pattern to search for | required |
Returns:
| Type | Description |
|---|---|
| bool | True if pattern exists, False otherwise |
size()
Get the size of the suffix array.
Returns:
| Type | Description |
|---|---|
| int | Number of suffixes in the array |