Longest Common Substring (LCS)
Algorithms for finding the longest substring shared between two strings.
Overview
The module provides multiple strategies for computing the longest common substring between two texts:
- Suffix Tree-based LCS: Fastest for complex queries and repeated operations
- Suffix Array-based LCS: Memory efficient using LCP array optimization
- Hybrid approach: Automatically selects optimal strategy based on input
- Brute force: Simple implementation for small strings
Usage
from algo.algorithms.strings.longest_common_substring import *
# Basic usage
lcs = find_longest_common_substring("abcdef", "defghi") # Returns "def"
# With positions
substring, pos1, pos2 = find_lcs_with_positions("banana", "ananas")
# Compare strategies
comparison = compare_lcs_strategies("text1", "text2")
# Find all common substrings
substrings = find_all_common_substrings("abc", "bcd", min_length=1)
Performance
- Suffix Tree: O(n + m) construction, O(n + m) LCS computation
- Suffix Array: O((n + m) log(n + m)) construction, O(n + m) LCS computation
- Brute Force: O(n * m * min(n, m))
- Memory usage: Suffix Tree > Suffix Array > Brute Force
API Reference
Longest Common Substring Algorithms
This module provides high-level longest common substring functionality using different underlying data structures and strategies. It demonstrates the practical applications of Suffix Tree and Suffix Array data structures for finding the longest substring that appears in multiple strings.
Strategies Implemented:
- Suffix Tree-based LCS (faster for complex queries)
- Suffix Array-based LCS (memory efficient)
- Hybrid approach combining multiple strategies
- Brute force approach for small strings
Performance Characteristics:
- Suffix Tree: O(n + m) construction, O(n + m) LCS computation where n, m are text lengths
- Suffix Array: O((n + m) log(n + m)) construction, O(n + m) LCS computation
- Brute Force: O(n * m * min(n, m)) for comparison
- Memory usage: Suffix Tree > Suffix Array > Brute Force
Use Cases:
- Finding similarities between documents
- DNA sequence analysis
- Code plagiarism detection
- Data deduplication
- Version control systems
- Text comparison tools
BruteForceLCS
Brute force longest common substring implementation.
Best for: Very small strings, educational purposes, verification. Use when: You need a simple, easy-to-understand implementation.
find_lcs(text1, text2)
Find the longest common substring using brute force approach.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
LCSResult
|
LCSResult containing the longest common substring |
HybridLCS()
Hybrid LCS that combines multiple strategies for optimal performance.
Uses different strategies based on input characteristics and performance requirements.
Initialize hybrid LCS with all underlying implementations.
compare_strategies(text1, text2)
Compare all strategies and return results from each.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, LCSResult]
|
Dictionary mapping strategy names to their results |
find_lcs(text1, text2, strategy=None)
Find the longest common substring using optimal strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
strategy
|
Optional[LCSStrategy]
|
Force specific strategy (optional) |
None
|
Returns:
| Type | Description |
|---|---|
LCSResult
|
LCSResult from the selected strategy |
LCSResult(substring, length, positions_text1=list(), positions_text2=list(), strategy_used='unknown', computation_time=0.0, metadata=dict())
dataclass
Represents a longest common substring result with metadata.
__post_init__()
Update length based on substring if not provided.
__repr__()
Return detailed representation.
__str__()
Return string representation.
LCSStrategy
Bases: Enum
Different LCS computation strategies.
BRUTE_FORCE = 'brute_force'
class-attribute
instance-attribute
Brute force LCS strategy.
HYBRID = 'hybrid'
class-attribute
instance-attribute
Hybrid LCS strategy that combines multiple approaches for optimal performance.
SUFFIX_ARRAY = 'suffix_array'
class-attribute
instance-attribute
Suffix Array-based LCS strategy.
SUFFIX_TREE = 'suffix_tree'
class-attribute
instance-attribute
Suffix Tree-based LCS strategy.
SuffixArrayLCS()
Suffix Array-based longest common substring implementation.
Best for: Memory-constrained environments, one-time computations. Use when: You need space efficiency and can tolerate slightly slower construction.
Initialize suffix array-based LCS.
find_lcs(text1, text2)
Find the longest common substring between two texts using suffix array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
LCSResult
|
LCSResult containing the longest common substring and metadata |
find_lcs_with_lcp(text1, text2)
Find LCS using generalized suffix array with LCP array optimization.
This method builds a generalized suffix array for both texts and uses the LCP array to efficiently find the longest common substring.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
LCSResult
|
LCSResult with detailed LCP analysis |
SuffixTreeLCS()
Suffix Tree-based longest common substring implementation.
Best for: Complex queries, multiple LCS operations, when memory is not a constraint. Use when: You need fast repeated queries on the same text pairs.
Initialize suffix tree-based LCS.
find_all_common_substrings(text1, text2, min_length=1)
Find all common substrings between two texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
min_length
|
int
|
Minimum length of substrings to consider |
1
|
Returns:
| Type | Description |
|---|---|
List[LCSResult]
|
List of LCSResult objects for all common substrings |
find_lcs(text1, text2)
Find the longest common substring between two texts using suffix tree.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
LCSResult
|
LCSResult containing the longest common substring and metadata |
compare_lcs_strategies(text1, text2)
Compare all LCS strategies and return performance metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
Returns:
| Type | Description |
|---|---|
Dict[str, Dict]
|
Dictionary with strategy comparisons |
demo_lcs()
Demonstrate different LCS strategies.
find_all_common_substrings(text1, text2, min_length=2, strategy=LCSStrategy.SUFFIX_TREE)
Find all common substrings between two texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
min_length
|
int
|
Minimum length of substrings to consider |
2
|
strategy
|
LCSStrategy
|
Strategy to use (only SUFFIX_TREE supported for now) |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
List[str]
|
List of common substrings sorted by length (descending) |
find_lcs_with_positions(text1, text2, strategy=LCSStrategy.SUFFIX_TREE)
Find LCS and return positions where it occurs in both texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
strategy
|
LCSStrategy
|
Strategy to use for computation |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
Tuple[str, List[int], List[int]]
|
Tuple of (substring, positions_in_text1, positions_in_text2) |
find_longest_common_substring(text1, text2, strategy=LCSStrategy.SUFFIX_TREE)
Find the longest common substring between two texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text1
|
str
|
First text |
required |
text2
|
str
|
Second text |
required |
strategy
|
LCSStrategy
|
Strategy to use for computation |
SUFFIX_TREE
|
Returns:
| Type | Description |
|---|---|
str
|
The longest common substring |