A theoretical computer science problem that the SOTA LLMs get wrong (as of now)

In my graduate-level complexity theory class, I came across the concept of a P-splinter, an incredibly niche complexity class that is used in a few papers. Intuitively, a P-splinter is a language where a PTIME function can enumerate its elements.

This concept was defined by Lane A. Hemachandra—who is also teaching the course—and his coauthors in their 1989 paper “Polynomial-Time Functions Generate SAT: On P-Splinters.” They established that SAT is a P-bi-splinter. In other words, a single PTIME function can enumerate both \(\text{SAT}\) and \(\overline{\text{SAT}}\).

Precisely, a language \(L\) is a P-splinter if there exists a polynomial-time computable function \(f\) and a valid input string \(x\) such that

\[L = \{ x, f(x), f(f(x)), f(f(f(x))), \ldots \}.\]

Note that the enumeration need not be in lexicographic order.

A language \(L\) is a P-bi-splinter if, for the same function \(f\), there exists another valid input \(y\) such that

\[\overline{L} = \{ y, f(y), f(f(y)), f(f(f(y))), \ldots \}.\]

Hemachandra et al. proved that SAT is a P-bi-splinter. While I do not completely understand their proof—complexity theory is quite foreign to my main research focus—the splinter function is constructed using the following intuition:

  1. Within PTIME, attempt to check if a particular boolean formula is satisfiable. If successful, then enumerate it!
  2. If we fail to terminate in PTIME, then enumerate an easy element of SAT. Save any intermediate work on the hard problem for later.

Interestingly, despite this being an old result, state-of-the-art LLMs get this problem wrong. (For now—by writing this, I am inadvertently helping the next generation of LLMs improve via memorization.) I tested the five models used to create the Humanity's Last Exam benchmark (GPT-4o, Sonnet 3.5, Gemini Pro 1.5, OpenAI-o1, OpenAI-o3). None got it right! That was surprising.

I suspect that since P-splinters is a niche definition, LLMs do not have the capacity to memorize it. As such, LLMs have to resort to reasoning the problem on-the-fly, which they fail at—as most humans would, too.

My prompt was:

A P-splinter is a language L such that a PTIME function f and some valid input x exists, that satisfy
L = {x, f(x), f(f(x)), ... }.
Is SAT a P-splinter? Answer Yes or No.

OpenAI-o1, OpenAI-4o, and Sonnet-3.5 made the mistake of assuming that a language that is a P-splinter must be in P. This is not true, as the P-splinter function for SAT does not enumerate elements of SAT up to length \(n\) in PTIME, nor does it guarantee lexicographic ordering.

Exact Answer: No.
Explanation: A language L is called a P-splinter if there is a single string x and a polynomial-time computable function f such that L is exactly the orbit {x, f(x), f(f(x)), …}. For SAT to be a P-splinter, we would need a polynomial-time algorithm f that, given any SAT formula in the orbit, produces the “next” SAT formula in that same orbit, enumerating precisely all satisfiable formulas (and no unsatisfiable ones). Constructing such an f in polynomial time would effectively solve SAT in polynomial time, which is unknown (and would imply P = NP if it exists). Under standard assumptions (P ≠ NP), SAT is therefore believed not to be a P-splinter.

OpenAI-o3 made the mistake that a P-splinter must be sparse. If true, then SAT would not be a P-splinter, since it is dense. However, this assumption is false.

Exact Answer: No.
Explanation: A P-splinter language is one whose entire set of strings is generated by starting with a fixed input x and iteratively applying a PTIME-computable function f. This process produces a single “chain” of strings that, especially when f is designed to avoid repetition, is extremely sparse—typically only one string per each (or only a few strings per) input length. In contrast, SAT (the Boolean satisfiability problem) is NP-complete and is known to be “dense” in the sense that it has exponentially many different formulas of a given length that could be satisfiable. Thus, SAT cannot be captured by the sparse, single-chain structure that a P-splinter requires.

Gemini just entered an infinite loop of generating the same paragraph over and over and over again, reminiscent of the poem poem poem attack.

Let's consider a function f that takes a formula φ and returns a formula that is satisfiable if and only if φ is satisfiable, and also encodes the fact that φ is in SAT. This would allow us to generate a sequence of formulas that are all in SAT. [repeated ad infinitum]