Longest Repeated Substring (LRS)
Algorithms for finding the longest substring that appears multiple times in a text.
Overview
The module provides multiple strategies for finding the longest repeated substring:
- Suffix Tree-based LRS: Optimal for complex queries and repeated operations
- Suffix Array-based LRS: Memory efficient using LCP array
- Hybrid approach: Automatically selects best strategy
- Brute force: Simple implementation for small texts
Usage
from algo.algorithms.strings.longest_repeated_substring import *
# Basic usage
lrs = find_longest_repeated_substring("banana") # Returns "ana"
# With positions
substring, positions = find_lrs_with_positions("mississippi")
# Compare strategies
comparison = compare_lrs_strategies("text")
# Find all repeated substrings
substrings = find_all_repeated_substrings("banana", min_length=2)
Performance
- Suffix Tree: O(n) construction, O(n) LRS computation
- Suffix Array: O(n log n) construction, O(n) LRS computation
- Brute Force: O(n³) for comparison
- Memory usage: Suffix Tree > Suffix Array > Brute Force
API Reference
Longest Repeated Substring Algorithms
This module provides high-level longest repeated substring functionality using different underlying data structures and strategies. It demonstrates the practical applications of Suffix Tree and Suffix Array data structures for finding the longest substring that appears multiple times in a single string.
Strategies Implemented:
- Suffix Tree-based LRS (faster for complex queries)
- Suffix Array-based LRS (memory efficient using LCP array)
- Hybrid approach combining multiple strategies
- Brute force approach for small strings
Performance Characteristics:
- Suffix Tree: O(n) construction, O(n) LRS computation where n is text length
- Suffix Array: O(n log n) construction, O(n) LRS computation using LCP array
- Brute Force: O(n³) for comparison
- Memory usage: Suffix Tree > Suffix Array > Brute Force
Use Cases:
- Finding repetitive patterns in DNA sequences
- Code duplication detection
- Text compression algorithms
- Data deduplication
- Pattern analysis in logs
- Finding common phrases in documents
BruteForceLRS
Brute force longest repeated substring implementation.
Best for: Very small strings, educational purposes, verification. Use when: You need a simple, easy-to-understand implementation.
find_lrs(text)
Find the longest repeated substring using brute force approach.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
LRSResult
|
LRSResult containing the longest repeated substring |
HybridLRS()
Hybrid LRS that combines multiple strategies for optimal performance.
Uses different strategies based on input characteristics and performance requirements.
Initialize hybrid LRS with all underlying implementations.
compare_strategies(text)
Compare all strategies and return results from each.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, LRSResult]
|
Dictionary mapping strategy names to their results |
find_lrs(text, strategy=None)
Find the longest repeated substring using optimal strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
strategy
|
Optional[LRSStrategy]
|
Force specific strategy (optional) |
None
|
Returns:
| Type | Description |
|---|---|
LRSResult
|
LRSResult from the selected strategy |
LRSResult(substring, length, positions=list(), occurrences=0, strategy_used='unknown', computation_time=0.0, metadata=dict())
dataclass
Represents a longest repeated substring result with metadata.
__post_init__()
Update fields based on substring if not provided.
__repr__()
Return detailed representation.
__str__()
Return string representation.
LRSStrategy
Bases: Enum
Different LRS computation strategies.
BRUTE_FORCE = 'brute_force'
class-attribute
instance-attribute
Brute force LRS strategy. Best for very small strings and educational purposes.
HYBRID = 'hybrid'
class-attribute
instance-attribute
Hybrid LRS strategy that combines multiple approaches for optimal performance.
SUFFIX_ARRAY = 'suffix_array'
class-attribute
instance-attribute
Suffix Array-based LRS strategy. Best for memory efficiency and one-time computations.
SUFFIX_TREE = 'suffix_tree'
class-attribute
instance-attribute
Suffix Tree-based LRS strategy. Best for complex queries and multiple operations.
SuffixArrayLRS()
Suffix Array-based longest repeated substring implementation.
Best for: Memory-constrained environments, one-time computations. Use when: You need space efficiency and can tolerate slightly slower construction.
Initialize suffix array-based LRS.
find_all_repeated_substrings(text, min_length=2)
Find all repeated substrings using suffix array and LCP array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
Returns:
| Type | Description |
|---|---|
List[LRSResult]
|
List of LRSResult objects for all repeated substrings |
find_lrs(text)
Find the longest repeated substring using suffix array and LCP array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
LRSResult
|
LRSResult containing the longest repeated substring and metadata |
SuffixTreeLRS()
Suffix Tree-based longest repeated substring implementation.
Best for: Complex queries, multiple LRS operations, when memory is not a constraint. Use when: You need fast repeated queries on the same text.
Initialize suffix tree-based LRS.
find_all_repeated_substrings(text, min_length=2)
Find all repeated substrings of at least the given length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
Returns:
| Type | Description |
|---|---|
List[LRSResult]
|
List of LRSResult objects for all repeated substrings |
find_lrs(text)
Find the longest repeated substring using suffix tree.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
LRSResult
|
LRSResult containing the longest repeated substring and metadata |
compare_lrs_strategies(text)
Compare all LRS strategies and return performance metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, Dict]
|
Dictionary with strategy comparisons |
count_repeated_substrings(text, min_length=2, strategy=LRSStrategy.SUFFIX_TREE)
Count the number of unique repeated substrings in a text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
strategy
|
LRSStrategy
|
Strategy to use for computation |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
int
|
Number of unique repeated substrings |
demo_lrs()
Demonstrate different LRS strategies.
find_all_repeated_substrings(text, min_length=2, strategy=LRSStrategy.SUFFIX_TREE)
Find all repeated substrings in a text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
strategy
|
LRSStrategy
|
Strategy to use (SUFFIX_TREE or SUFFIX_ARRAY supported) |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
List[str]
|
List of repeated substrings sorted by length (descending) |
find_longest_repeated_substring(text, strategy=LRSStrategy.SUFFIX_TREE)
Find the longest repeated substring in a text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
strategy
|
LRSStrategy
|
Strategy to use for computation |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
str
|
The longest repeated substring |
find_lrs_with_positions(text, strategy=LRSStrategy.SUFFIX_TREE)
Find LRS and return positions where it occurs in the text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
strategy
|
LRSStrategy
|
Strategy to use for computation |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
Tuple[str, List[int]]
|
Tuple of (substring, positions_in_text) |
get_repetition_statistics(text, min_length=2)
Get comprehensive statistics about repeated substrings in a text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to analyze |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
Returns:
| Type | Description |
|---|---|
Dict[str, any]
|
Dictionary with repetition statistics |