CYK Algorithm

Introduction

The Cocke-Younger-Kasami (CYK) algorithm is a dynamic programming algorithm used to determine whether a given string can be generated by a given context-free grammar in Chomsky Normal Form (CNF). The algorithm was independently developed by John Cocke, Daniel Younger, and Tadao Kasami in the 1960s.

Chomsky Normal Form

A context-free grammar is in Chomsky Normal Form if all production rules are of the form:

A → BC (where A, B, and C are non-terminals)
A → a (where A is a non-terminal and a is a terminal)
S → ε (where S is the start symbol and ε is the empty string)

Algorithm Overview

The CYK algorithm works by building a triangular table (CYK table) where:

The rows represent the length of the substring being considered (row 1 = length 1)
The columns represent the starting position of the substring (column 1 = start at first symbol)
Each cell (i,j) contains the set of non-terminals that can generate the substring starting at position j with length i (both i and j are 1-based in the mathematical description)

Note on the simulation's storage/visualization: the interactive simulation uses a triangular matrix stored as a 2D array matrix[r][c] with 0-based indices. The relation between the stored indices and the (i,j) notation shown to the user is:

start = c + 1 (start position, 1-based)
length = r - c + 1 (substring length)

So the element for substring with length i starting at position j is stored at matrix[r][c] where c = j - 1 and r = c + (i - 1) = j + i - 2.

Visual orientation used in the simulation: the simulation displays length-1 substrings at the top row and increases length downward. Therefore the full-string cell (n,1) (length n, start 1) appears at the bottom-left corner of the drawn triangular table in the interactive simulation.

Algorithm Steps

Initialization: For each position i in the input string w:
- Fill the first row with non-terminals that can generate the corresponding terminal
Filling the Table: For each cell (i,j):
- Consider all possible ways to split the substring into two parts
- For each split, check if there exists a production rule A → BC where:
  - B can generate the first part
  - C can generate the second part
- If such a rule exists, add A to the cell (i,j)
Acceptance: The string is accepted if the start symbol S is in the cell (n,1), where n is the length of the input string

Time and Space Complexity

Time Complexity: O(n³|G|), where:
- n is the length of the input string
- |G| is the size of the grammar
Space Complexity: O(n²|N|), where:
- n is the length of the input string
- |N| is the number of non-terminals

Applications

The CYK algorithm is used in:

Natural Language Processing
Compiler Design
Syntax Analysis
Grammar Validation
String Matching

Example

Consider the grammar:

S → AB | BC
          A → BA | a
          B → CC | b
          C → AB | a

And the input string "baaba"

The CYK table would be constructed as follows:

First row: Fill with terminals
Subsequent rows: Fill using the production rules
Check if S is in the bottom-left cell (cell (n,1))

The algorithm will determine whether "baaba" can be generated by the given grammar.

Contrast: CYK vs DFA / NFA / Regular Expressions

Problem class: CYK tests membership for context-free grammars (CFGs) in Chomsky Normal Form, which can express nested and hierarchical structures (e.g., matched parentheses). DFA/NFA/regular expressions test membership for regular languages only, which cannot express arbitrary nesting.
How membership is tested:
- DFA: Simulate the automaton by reading the input left-to-right, following deterministic transitions; accept if the final state is accepting. This is a single pass, linear-time procedure (O(n)).
- NFA: Either convert to an equivalent DFA (subset construction) or simulate the NFA directly by maintaining the set of current states (or using ε-closures). Direct simulation runs in O(n · |Q|) in the usual subset-based simulation, where |Q| is the number of NFA states; conversion to a DFA can incur exponential blowup in the number of states in the worst case.
- Regular expressions: Common implementations compile regexes to NFAs (Thompson construction) and then simulate or convert to DFAs. Some engines use backtracking-based algorithms which are convenient but can exhibit exponential-time behavior on crafted inputs. For pure regular-membership testing (without backtracking features), the runtime is effectively linear in the input length after compilation.
- CYK: Uses dynamic programming over all substrings. For an input of length n and grammar size |G|, CYK runs in O(n^3 · |G|) time (cubic in n) because it considers all substring lengths, start positions, and split points.
Space usage and outputs:
- DFA/NFA/regex: Require only O(1) extra space beyond the automaton and current state-set (i.e., O(|Q|) for an NFA simulation). They provide a yes/no membership answer and typically do not produce a parse tree. Regex engines with captures can return substring matches, but that is different from deriving a full syntax tree.
- CYK: Uses O(n^2 · |N|) space to store which non-terminals derive each substring. Crucially, CYK can be extended to recover parse trees (and all parses, in case of ambiguity) by recording derivations while filling the table.
When to use which:
- Use DFA/NFA/regex when the language is regular (lexical patterns, simple token matching). They are fast (linear-time) and memory-light.
- Use CYK when the language is context-free and may require matching nested or recursive constructs that regular languages cannot express; CYK is useful for grammar membership tests and for producing parse information from ambiguous grammars.
Practical notes:
- Many parsing tasks use a combination: lexical analysis uses regexes/DFAs to tokenize input; then a CFG parser (often more efficient than naive CYK, e.g., Earley or LR parsers) is used for syntactic analysis. CYK is conceptually simple and robust for arbitrary CNF grammars but is often replaced by more practical parsers for programming languages because of its cubic-time cost.

Example (high level): to test if the string w is accepted:

DFA: walk transitions for each symbol in w; check final state.
NFA: simulate state-set evolution (or convert to DFA); check if any accepting state is reachable at the end.
Regex: compile to an NFA/DFA or use the engine; check for a match.
CYK: build the CYK table over all substrings of w; check whether start symbol S appears for the whole string.