0. Abstract
- Theorem proving is an important challenge for LLMs.
- Problem: existing LLM-based provers (e.g., LeanDojo) have tried to prove theorems in a fully autonomous mode, without human intuition, and struggle.
- This paper proposes using LLMs as copilots that assist humans in proving theorems → Lean Copilot : a framework for running neural network inference in Lean
- Effects
- Programmers can build various LLM-based proof automation tools that integrate into the workflow of Lean users.
- Using Lean Copilot, the authors build tools that suggest proof steps and complete intermediate proof goals with LLMs.
- Experimental results show the method is effective in assisting humans.
1. Introduction
- Background
- Proof assistants (interactive theorem provers) : software for mathematicians to write formal proofs that can be checked rigorously by computers.
- Recently, there has been increasing interest in using them with machine learning, especially large language models (LLMs), to prove theorems automatically.
- Furthermore, LLMs can make proof assistants easier to use by improving automation.
- Problems
- Existing LLM-based provers aim to prove theorems fully autonomously without human intervention.
- While an autonomous AI mathematician is desirable in the long term, current LLMs often fail to prove theorems that are truly novel or challenging, especially when they come from a different domain than the training data.
- Despite significant progress and open-source projects such as LeanDojo, existing LLM-based provers cannot readily assist humans.
- They are trained and evaluated following standard practices in machine learning but do not integrate into the workflow of proof assistants.
- Proposal
- Instead of proving theorems by itself, AI can also assist human mathematicians in theorem proving.
- Humans can use mathematical intuition and knowledge to provide guidance or critical proof steps,
- whereas LLMs can automate parts of the proof that are more straightforward and tedious for humans.
- Benefits
- This approach is highly viable since it incrementally automates the current workflow of proof assistants.
- Furthermore, it may accelerate the research toward the long-term vision of AI mathematicians. Having LLMs as copilots makes proof assistants easier to use for humans. Therefore, it will improve the quality and coverage of formalized mathematics, which will provide more and better data for developing LLMs for theorem proving.
Lean
: a proof assistant popular among mathematicians
- As shown in Fig. 1, a proof in Lean consists of a sequence of proof steps called tactics (e.g., cases n, dsimp, and suggest_tactics).
- Starting from the theorem as the initial goal, tactics repeatedly transform the current goal into simpler sub-goals, until all goals are solved.
- Users write tactics interactively in an IDE powered by VS Code, which displays the goals in the infoview panel on the right.
- Proof automation tools in Lean can also take the form of tactics (e.g. suggest_tactics in Fig. 1).
- Like regular tactics, they can manipulate the proof goals and display information in the infoview panel.
- Examples
- rw (tactic) : given a hypothesis h : X = Y, rw [h] rewrites X in the goal to Y
- add_zero (theorem) : a proof that a + 0 = a
- rfl (tactic) : closes goals of the form X = X
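The three items above fit into a short Lean 4 proof; a minimal sketch (using `Nat.add_zero`, the core-library form of add_zero):

```lean
example (a b : Nat) (h : a = b) : a + 0 = b := by
  rw [Nat.add_zero]  -- rewrite a + 0 to a using the add_zero theorem
  rw [h]             -- rewrite a to b; the remaining goal b = b is
                     -- then closed automatically (by rfl)
```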
- Lean - Theorem Proving Tutorial
- tactics : an alternative approach to constructing proofs
- tactics are commands, or instructions, that describe how to build such a proof.
- Informally, you might begin a mathematical proof by saying "to prove the forward direction, unfold the definition, apply the previous lemma, and simplify." Just as these are instructions that tell the reader how to find the relevant proof, tactics are instructions that tell Lean how to construct a proof term. They naturally support an incremental style of writing proofs, in which you decompose a proof and work on goals one step at a time.
- "tactic-style" proofs, to contrast with the ways of writing proof terms we have seen so far, which we will call "term-style" proofs. Each style has its own advantages and disadvantages. For example, tactic-style proofs can be harder to read, because they require the reader to predict or guess the results of each instruction. But they can also be shorter and easier to write.
- Moreover, tactics offer a gateway to using Lean's automation, since automated procedures are themselves tactics. (suggest_tactics)
- Examples
- rfl : proves goals of the form X = X that hold by definitional equality (e.g., 2 + 2 = 4)
- cases : If n is a number, then cases n with d will break the goal into two goals, one with n = 0 and the other with n = succ d.
- dsimp : dsimp is similar to simp, except that it only uses definitional equalities.
- suggest_tactics
- Aesop
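A sketch contrasting term-style and tactic-style proofs of the same statement, using `cases` and `rfl` from the list above (`Nat.succ_pos` is assumed from Lean 4's core library):

```lean
-- Term-style: the proof term is written directly.
theorem zero_or_pos (n : Nat) : n = 0 ∨ 0 < n :=
  match n with
  | 0 => Or.inl rfl
  | d + 1 => Or.inr (Nat.succ_pos d)

-- Tactic-style: instructions tell Lean how to build the same term,
-- one goal at a time.
theorem zero_or_pos' (n : Nat) : n = 0 ∨ 0 < n := by
  cases n with
  | zero => left; rfl
  | succ d => right; exact Nat.succ_pos d
```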
Lean Copilot
: a framework for developing LLM-based proof automation in Lean
- core technical challenge : running neural network inference in Lean
- works out of the box
- the model is small and efficient enough to run on most hardware
- the default model is ReProver, from LeanDojo (Fig. 1)
- we convert the model into a platform-independent format, ONNX (Open Neural Network Exchange)
- ONNX : an open-source project that lets deep learning models be moved between different frameworks → an ML model can be invoked and run from various platform environments (Java, JS, C, C#, C++) without environment constraints and return its inference results
- we run it as a shared library through Lean’s foreign function interface (FFI). FFI : since Lean is written partly in Lean itself and partly in C++, it provides efficient interoperability between the two languages (or between Lean and any language that supports a C interface)
- enables programmers to build various LLM-based tools that work seamlessly in Lean
- (1) suggest_tactics : a tactic that uses LLMs to suggest proof steps
- (2) LLM-aesop : a proof search tactic that combines LLM-generated proof steps with aesop (Lean’s existing rule-based proof search)→ LLM-aesop alone (without humans) can prove more theorems than the original aesop.
- → Furthermore, it can better assist humans than aesop, requiring fewer tactics to be entered manually by humans.
- aesop : a proof search tactic for Lean 4
⇒ we introduce a general framework, Lean Copilot, and two tools (suggest_tactics and LLM-aesop) for LLMs to assist humans in proving Lean theorems
2. Lean Copilot for Native Neural Network Inference in Lean
- Lean is also a general-purpose programming language. For LLMs to assist humans in Lean, Lean Copilot provides a general framework for running the inference of LLMs in Lean, and Lean programmers can use it to build various LLM-based applications.
- All popular deep learning frameworks are in Python. However, calling a model in Python suffers from the overhead of inter-process communication and requires users to perform additional setup steps that do not fit into Lean’s conventional workflow. To overcome these issues, Lean Copilot runs LLMs natively in Lean through its foreign function interface (FFI). FFI : a mechanism for programs in one language to call subroutines in another language. Lean is partially implemented in C++ and interoperates efficiently with C++: programmers can declare a function in Lean but implement it in C++; the implementation is compiled into a shared library and linked to Lean dynamically.
- We adopt the ReProver model from LeanDojo. It is based on an encoder-decoder Transformer, ByT5, that maps an input string to an output string. Lean Copilot makes the model runnable in Lean by wrapping it into a C++ function operating on strings, which can be called in Lean through FFI. → We do not perform tokenization since ByT5 is a tokenizer-free model that works directly on UTF-8 bytes
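A minimal sketch of the FFI pattern described above (the function and symbol names here are hypothetical, not Lean Copilot's actual API):

```lean
-- Lean side: declare the function, implemented externally in C++.
@[extern "my_generate"]
opaque generate (goal : String) : String

-- C++ side (compiled into a shared library and linked to Lean):
--   extern "C" lean_object* my_generate(lean_object* goal) {
--     const char* input = lean_string_cstr(goal);
--     // ... run ONNX inference on the UTF-8 bytes of `input` ...
--     return lean_mk_string("<generated tactic>");
--   }
```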
3. Building LLM-based Proof Automation with Lean Copilot
3.1 Generating Tactic Suggestions
When humans prove theorems in Lean, they inspect the current goal to decide the next tactic.
- Tactics do not come from a predefined list; they are more like programs in a domain-specific language (DSL). They can take arbitrary Lean terms as parameters, and simpler tactics can be combined into compound tactics.
- Furthermore, users can extend existing tactics by defining customized tactics. Due to these complexities, producing the right tactic can be challenging even for experienced Lean users.
suggest_tactics : a tool that uses LLMs to generate tactic suggestions
- itself is also a tactic
- When applied, it feeds the current goal into an LLM and displays the generated tactics in the infoview panel (Fig. 1 Right)
- The user can choose whether to accept one of the suggestions (by clicking on it) or use them as inspirations to come up with a new tactic.
- Our frontend for displaying tactics is based on an existing tactic suggestion tool, [llmstep](https://github.com/wellecks/llmstep) : a Lean 4 tactic for suggesting proof steps using a language model
- If a suggestion can directly solve the current goal, it is marked by a party popper emoji
- We use the ReProver model open-sourced by LeanDojo. It is relatively small (299M parameters) and can run efficiently on most hardware on which Lean runs, including laptops without GPUs.
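Usage is a single tactic invocation; a sketch (the import name follows the Lean Copilot repository, assumed):

```lean
import LeanCopilot

example (a b : Nat) : a + b = b + a := by
  suggest_tactics
  -- LLM-generated candidates appear in the infoview panel;
  -- clicking one replaces `suggest_tactics` with the chosen tactic.
```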
3.2 Searching for Proofs with LLM-aesop
suggest_tactics only generates tactics for the current step, without the capability to search for multi-tactic proofs.
→ we combine it with aesop to build an LLM-based proof search tool named LLM-aesop.
- [Aesop](https://github.com/leanprover-community/aesop) : implements best-first search as a Lean tactic and allows users to configure how the search tree gets expanded.
- The search tree consists of goals as nodes. Initially, it has only the original goal as the root node. At each step, aesop picks the most promising unexpanded node, expands it by applying tactics, and adds the resulting nodes as its children. The proof search succeeds when aesop has found a path from the root to goals that can be solved trivially. It may fail because of timeout or when aesop has run out of tactics to try.
In aesop, tactics for expanding nodes are drawn from a set called the rule set. It is configurable by users before proof search but fixed during the search, i.e., the same rule set is used for expanding all nodes, regardless of the proof goal. Therefore, aesop’s performance depends critically on whether the user has configured an effective rule set, which is often problem-dependent. Aesop lacks the flexibility to adaptively decide what tactics to try during proof search.
- LLM-aesop : LLM-aesop augments aesop’s rule set with goal-dependent tactics generated by suggest_tactics. It allows the rule set to be customized for each goal, which makes aesop substantially more flexible. Furthermore, LLM-aesop is a drop-in replacement of aesop: Users can easily switch between LLM-aesop and the original aesop by activating/deactivating the LLM-generated tactics.
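For contrast, a sketch of how aesop's fixed rule set is configured ahead of search (attribute syntax per the aesop repository); LLM-aesop additionally injects goal-dependent tactics from suggest_tactics during the search:

```lean
import Aesop

-- Register a lemma into aesop's rule set before search; the set is
-- then fixed for every node expanded during best-first search.
@[aesop safe apply]
theorem and_intro' (p q : Prop) (hp : p) (hq : q) : p ∧ q := ⟨hp, hq⟩

example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  aesop  -- best-first search over the configured rule set
```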
4. Experiments
We empirically validate the effectiveness of LLM-aesop compared to aesop in two settings:
(1) proving theorems autonomously
(2) assisting humans in theorem proving.
In addition, we compare LLM-aesop with suggest_tactics to demonstrate the benefits of proof search.
- Dataset : 50 theorems randomly selected from “Mathematics in Lean” → their proofs have 5.52 tactics on average
- Setup : each theorem comes with a ground-truth proof consisting of one or multiple tactics.
- To mimic a human user, we enter the ground truth tactics one by one.
- After each tactic, we try to prove the remaining goals using an automated tool: LLM-aesop, aesop, or suggest_tactics.
- For aesop, we use it out of the box, without manually configuring the rule set.
- For suggest_tactics, we say it proves a goal when one of the generated tactic suggestions can prove the goal.
- We record the number of tactics entered manually before the tool succeeds, and the number is zero if it can prove the original theorem fully autonomously without requiring human-entered tactics.
- Results
- LLM-aesop can prove 64% (32 out of 50) of the theorems autonomously, which is significantly higher than aesop and suggest_tactics.
- When used to assist humans, LLM-aesop only requires an average of 1.02 manually-entered tactics, which also compares favorably to aesop (3.62) and suggest_tactics (2.72)
5. Conclusion
- We have introduced Lean Copilot: a framework for running neural network inference in Lean through FFI.
- Using Lean Copilot, we have built LLM-based proof automation for generating tactic suggestions (suggest_tactics) and searching for proofs (LLM-aesop).
- Lean Copilot provides an extendable interface between LLMs and Lean.
→ This work has explored how it enables LLMs to assist Lean users.
⇒ In the future, we hope to see LLM-based proof automation help us formalize mathematics and ultimately enhance LLMs’ capability in mathematical reasoning.
논문 링크 : https://mathai2023.github.io/papers/4.pdf