Skip to content

Longest Common Substring (LCS)

Algorithms for finding the longest substring shared between two strings.

Overview

The module provides multiple strategies for computing the longest common substring between two texts:

  • Suffix Tree-based LCS: Fastest for complex queries and repeated operations
  • Suffix Array-based LCS: Memory efficient using LCP array optimization
  • Hybrid approach: Automatically selects optimal strategy based on input
  • Brute force: Simple implementation for small strings

Usage

from algo.algorithms.strings.longest_common_substring import *

# Basic usage
lcs = find_longest_common_substring("abcdef", "defghi")  # Returns "def"

# With positions
substring, pos1, pos2 = find_lcs_with_positions("banana", "ananas")

# Compare strategies
comparison = compare_lcs_strategies("text1", "text2")

# Find all common substrings
substrings = find_all_common_substrings("abc", "bcd", min_length=1)

Performance

  • Suffix Tree: O(n + m) construction, O(n + m) LCS computation
  • Suffix Array: O((n + m) log(n + m)) construction, O(n + m) LCS computation
  • Brute Force: O(n * m * min(n, m))
  • Memory usage: Suffix Tree > Suffix Array > Brute Force

API Reference

Longest Common Substring Algorithms

This module provides high-level longest common substring functionality using different underlying data structures and strategies. It demonstrates the practical applications of Suffix Tree and Suffix Array data structures for finding the longest substring that appears in multiple strings.

Strategies Implemented:

  1. Suffix Tree-based LCS (faster for complex queries)
  2. Suffix Array-based LCS (memory efficient)
  3. Hybrid approach combining multiple strategies
  4. Brute force approach for small strings

Performance Characteristics:

  • Suffix Tree: O(n + m) construction, O(n + m) LCS computation where n, m are text lengths
  • Suffix Array: O((n + m) log(n + m)) construction, O(n + m) LCS computation
  • Brute Force: O(n * m * min(n, m)) for comparison
  • Memory usage: Suffix Tree > Suffix Array > Brute Force

Use Cases:

  • Finding similarities between documents
  • DNA sequence analysis
  • Code plagiarism detection
  • Data deduplication
  • Version control systems
  • Text comparison tools

BruteForceLCS

Brute force longest common substring implementation.

Best for: Very small strings, educational purposes, verification. Use when: You need a simple, easy-to-understand implementation.

find_lcs(text1, text2)

Find the longest common substring using brute force approach.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
LCSResult

LCSResult containing the longest common substring

HybridLCS()

Hybrid LCS that combines multiple strategies for optimal performance.

Uses different strategies based on input characteristics and performance requirements.

Initialize hybrid LCS with all underlying implementations.

compare_strategies(text1, text2)

Compare all strategies and return results from each.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
Dict[str, LCSResult]

Dictionary mapping strategy names to their results

find_lcs(text1, text2, strategy=None)

Find the longest common substring using optimal strategy.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required
strategy Optional[LCSStrategy]

Force specific strategy (optional)

None

Returns:

Type Description
LCSResult

LCSResult from the selected strategy

LCSResult(substring, length, positions_text1=list(), positions_text2=list(), strategy_used='unknown', computation_time=0.0, metadata=dict()) dataclass

Represents a longest common substring result with metadata.

__post_init__()

Update length based on substring if not provided.

__repr__()

Return detailed representation.

__str__()

Return string representation.

LCSStrategy

Bases: Enum

Different LCS computation strategies.

BRUTE_FORCE = 'brute_force' class-attribute instance-attribute

Brute force LCS strategy.

HYBRID = 'hybrid' class-attribute instance-attribute

Hybrid LCS strategy that combines multiple approaches for optimal performance.

SUFFIX_ARRAY = 'suffix_array' class-attribute instance-attribute

Suffix Array-based LCS strategy.

SUFFIX_TREE = 'suffix_tree' class-attribute instance-attribute

Suffix Tree-based LCS strategy.

SuffixArrayLCS()

Suffix Array-based longest common substring implementation.

Best for: Memory-constrained environments, one-time computations. Use when: You need space efficiency and can tolerate slightly slower construction.

Initialize suffix array-based LCS.

find_lcs(text1, text2)

Find the longest common substring between two texts using suffix array.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
LCSResult

LCSResult containing the longest common substring and metadata

find_lcs_with_lcp(text1, text2)

Find LCS using generalized suffix array with LCP array optimization.

This method builds a generalized suffix array for both texts and uses the LCP array to efficiently find the longest common substring.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
LCSResult

LCSResult with detailed LCP analysis

SuffixTreeLCS()

Suffix Tree-based longest common substring implementation.

Best for: Complex queries, multiple LCS operations, when memory is not a constraint. Use when: You need fast repeated queries on the same text pairs.

Initialize suffix tree-based LCS.

find_all_common_substrings(text1, text2, min_length=1)

Find all common substrings between two texts.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required
min_length int

Minimum length of substrings to consider

1

Returns:

Type Description
List[LCSResult]

List of LCSResult objects for all common substrings

find_lcs(text1, text2)

Find the longest common substring between two texts using suffix tree.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
LCSResult

LCSResult containing the longest common substring and metadata

compare_lcs_strategies(text1, text2)

Compare all LCS strategies and return performance metrics.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required

Returns:

Type Description
Dict[str, Dict]

Dictionary with strategy comparisons

demo_lcs()

Demonstrate different LCS strategies.

find_all_common_substrings(text1, text2, min_length=2, strategy=LCSStrategy.SUFFIX_TREE)

Find all common substrings between two texts.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required
min_length int

Minimum length of substrings to consider

2
strategy LCSStrategy

Strategy to use (only SUFFIX_TREE supported for now)

SUFFIX_TREE

Returns:

Type Description
List[str]

List of common substrings sorted by length (descending)

find_lcs_with_positions(text1, text2, strategy=LCSStrategy.SUFFIX_TREE)

Find LCS and return positions where it occurs in both texts.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required
strategy LCSStrategy

Strategy to use for computation

SUFFIX_TREE

Returns:

Type Description
Tuple[str, List[int], List[int]]

Tuple of (substring, positions_in_text1, positions_in_text2)

find_longest_common_substring(text1, text2, strategy=LCSStrategy.SUFFIX_TREE)

Find the longest common substring between two texts.

Parameters:

Name Type Description Default
text1 str

First text

required
text2 str

Second text

required
strategy LCSStrategy

Strategy to use for computation

SUFFIX_TREE

Returns:

Type Description
str

The longest common substring