Skip to content

Longest Repeated Substring (LRS)

Algorithms for finding the longest substring that appears multiple times in a text.

Overview

The module provides multiple strategies for finding the longest repeated substring:

  • Suffix Tree-based LRS: Optimal for complex queries and repeated operations
  • Suffix Array-based LRS: Memory efficient using LCP array
  • Hybrid approach: Automatically selects best strategy
  • Brute force: Simple implementation for small texts

Usage

from algo.algorithms.strings.longest_repeated_substring import *

# Basic usage
lrs = find_longest_repeated_substring("banana")  # Returns "ana"

# With positions
substring, positions = find_lrs_with_positions("mississippi")

# Compare strategies
comparison = compare_lrs_strategies("text")

# Find all repeated substrings
substrings = find_all_repeated_substrings("banana", min_length=2)

Performance

  • Suffix Tree: O(n) construction, O(n) LRS computation
  • Suffix Array: O(n log n) construction, O(n) LRS computation
  • Brute Force: O(n³) for comparison
  • Memory usage: Suffix Tree > Suffix Array > Brute Force

API Reference

Longest Repeated Substring Algorithms

This module provides high-level longest repeated substring functionality using different underlying data structures and strategies. It demonstrates the practical applications of Suffix Tree and Suffix Array data structures for finding the longest substring that appears multiple times in a single string.

Strategies Implemented:

  1. Suffix Tree-based LRS (faster for complex queries)
  2. Suffix Array-based LRS (memory efficient using LCP array)
  3. Hybrid approach combining multiple strategies
  4. Brute force approach for small strings

Performance Characteristics:

  • Suffix Tree: O(n) construction, O(n) LRS computation where n is text length
  • Suffix Array: O(n log n) construction, O(n) LRS computation using LCP array
  • Brute Force: O(n³) for comparison
  • Memory usage: Suffix Tree > Suffix Array > Brute Force

Use Cases:

  • Finding repetitive patterns in DNA sequences
  • Code duplication detection
  • Text compression algorithms
  • Data deduplication
  • Pattern analysis in logs
  • Finding common phrases in documents

BruteForceLRS

Brute force longest repeated substring implementation.

Best for: Very small strings, educational purposes, verification. Use when: You need a simple, easy-to-understand implementation.

find_lrs(text)

Find the longest repeated substring using brute force approach.

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
LRSResult

LRSResult containing the longest repeated substring

HybridLRS()

Hybrid LRS that combines multiple strategies for optimal performance.

Uses different strategies based on input characteristics and performance requirements.

Initialize hybrid LRS with all underlying implementations.

compare_strategies(text)

Compare all strategies and return results from each.

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
Dict[str, LRSResult]

Dictionary mapping strategy names to their results

find_lrs(text, strategy=None)

Find the longest repeated substring using optimal strategy.

Parameters:

Name Type Description Default
text str

Text to analyze

required
strategy Optional[LRSStrategy]

Force specific strategy (optional)

None

Returns:

Type Description
LRSResult

LRSResult from the selected strategy

LRSResult(substring, length, positions=list(), occurrences=0, strategy_used='unknown', computation_time=0.0, metadata=dict()) dataclass

Represents a longest repeated substring result with metadata.

__post_init__()

Update fields based on substring if not provided.

__repr__()

Return detailed representation.

__str__()

Return string representation.

LRSStrategy

Bases: Enum

Different LRS computation strategies.

BRUTE_FORCE = 'brute_force' class-attribute instance-attribute

Brute force LRS strategy. Best for very small strings and educational purposes.

HYBRID = 'hybrid' class-attribute instance-attribute

Hybrid LRS strategy that combines multiple approaches for optimal performance.

SUFFIX_ARRAY = 'suffix_array' class-attribute instance-attribute

Suffix Array-based LRS strategy. Best for memory efficiency and one-time computations.

SUFFIX_TREE = 'suffix_tree' class-attribute instance-attribute

Suffix Tree-based LRS strategy. Best for complex queries and multiple operations.

SuffixArrayLRS()

Suffix Array-based longest repeated substring implementation.

Best for: Memory-constrained environments, one-time computations. Use when: You need space efficiency and can tolerate slightly slower construction.

Initialize suffix array-based LRS.

find_all_repeated_substrings(text, min_length=2)

Find all repeated substrings using suffix array and LCP array.

Parameters:

Name Type Description Default
text str

Text to analyze

required
min_length int

Minimum length of substrings to consider

2

Returns:

Type Description
List[LRSResult]

List of LRSResult objects for all repeated substrings

find_lrs(text)

Find the longest repeated substring using suffix array and LCP array.

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
LRSResult

LRSResult containing the longest repeated substring and metadata

SuffixTreeLRS()

Suffix Tree-based longest repeated substring implementation.

Best for: Complex queries, multiple LRS operations, when memory is not a constraint. Use when: You need fast repeated queries on the same text.

Initialize suffix tree-based LRS.

find_all_repeated_substrings(text, min_length=2)

Find all repeated substrings of at least the given length.

Parameters:

Name Type Description Default
text str

Text to analyze

required
min_length int

Minimum length of substrings to consider

2

Returns:

Type Description
List[LRSResult]

List of LRSResult objects for all repeated substrings

find_lrs(text)

Find the longest repeated substring using suffix tree.

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
LRSResult

LRSResult containing the longest repeated substring and metadata

compare_lrs_strategies(text)

Compare all LRS strategies and return performance metrics.

Parameters:

Name Type Description Default
text str

Text to analyze

required

Returns:

Type Description
Dict[str, Dict]

Dictionary with strategy comparisons

count_repeated_substrings(text, min_length=2, strategy=LRSStrategy.SUFFIX_TREE)

Count the number of unique repeated substrings in a text.

Parameters:

Name Type Description Default
text str

Text to analyze

required
min_length int

Minimum length of substrings to consider

2
strategy LRSStrategy

Strategy to use for computation

SUFFIX_TREE

Returns:

Type Description
int

Number of unique repeated substrings

demo_lrs()

Demonstrate different LRS strategies.

find_all_repeated_substrings(text, min_length=2, strategy=LRSStrategy.SUFFIX_TREE)

Find all repeated substrings in a text.

Parameters:

Name Type Description Default
text str

Text to analyze

required
min_length int

Minimum length of substrings to consider

2
strategy LRSStrategy

Strategy to use (SUFFIX_TREE or SUFFIX_ARRAY supported)

SUFFIX_TREE

Returns:

Type Description
List[str]

List of repeated substrings sorted by length (descending)

find_longest_repeated_substring(text, strategy=LRSStrategy.SUFFIX_TREE)

Find the longest repeated substring in a text.

Parameters:

Name Type Description Default
text str

Text to analyze

required
strategy LRSStrategy

Strategy to use for computation

SUFFIX_TREE

Returns:

Type Description
str

The longest repeated substring

find_lrs_with_positions(text, strategy=LRSStrategy.SUFFIX_TREE)

Find LRS and return positions where it occurs in the text.

Parameters:

Name Type Description Default
text str

Text to analyze

required
strategy LRSStrategy

Strategy to use for computation

SUFFIX_TREE

Returns:

Type Description
Tuple[str, List[int]]

Tuple of (substring, positions_in_text)

get_repetition_statistics(text, min_length=2)

Get comprehensive statistics about repeated substrings in a text.

Parameters:

Name Type Description Default
text str

Text to analyze

required
min_length int

Minimum length of substrings to consider

2

Returns:

Type Description
Dict[str, any]

Dictionary with repetition statistics