GRAT – Efficient Formally Verified SAT Solver Certification Toolchain

GRAT is a SAT solver certificate checking tool chain, which is formally verified using the Isabelle/HOL theorem prover. If a certificate is verified, you have strong formal guarantees that the solution produced by the SAT solver was actually correct. GRAT is actually faster than the unverified state-of-the-art tool drat-trim, and significantly faster than the verified LRAT tool.

The tool chain consists of two programs: The unverified certificate generator gratgen converts a DRAT certificate (as output by many modern SAT solvers) to a GRAT certificate, which is then checked against the original formula by the verified gratchk tool.

Features

Strong formal correctness guarantees.
Almost twice as fast as drat-trim thanks to novel heuristics (sepWL, RAT-run). On pure DRUP: Still faster than drat-trim.
Multi-threaded mode for even faster checking.
Small certificates thanks to backward checking and core-first unit propagation.
Memory efficient checking with split certificates.

News

Feb 6, 2023: No default display of invalid deletion clause warnings, as they could flood the logs. Added switch to get them back.
Jan 31, 2023: Ported to Isabelle 2022
Dec 25, 2018: Version 1.3.1. Bugfix in gratgen, now also works if initial clauses have conflict by unit propagation.
March 4, 2018: Version 1.3. Bugfixes and support for binary drat format.
June 28, 2017: Added OpenBSD style license.
June 28, 2017: Included pure DRUP benchmarks generated by Riss6: Still faster than drat-trim.
May 22, 2017: Version 1.2 released. New heuristics, now more than 2 times faster than drat-trim.

Getting started

Download the GRATgen and GRATchk (ML source only) tools and build them as indicated in their README files. The standard GRATchk package contains all Isabelle theories. However, to run the tools you only need the ML source only package.

To check a satisfiability certificate, run

gratchk sat formula.cnf certificate.sat

Here, the formula is a formula in standard DIMACS CNF format, and the certificate is a list of non-contradictory literals that satisfy the formula if assigned to true. The certificate can, for example, be obtained by processing the output of the SAT solver with

sed -re '/^v/!d;s/^v//'

To check an unsatisfiability certificate, run

gratgen formula.cnf drat-cert.drat -l grat-cert.gratl -o grat-cert.gratp [-b] [-j N]
gratchk unsat formula.cnf grat-cert.gratl grat-cert.gratp

Here, drat-cert is a certificate in the standard DRAT format, which is supported by many SAT solvers.(Use option -b for binary DRAT) Note that the DRAT format is downward compatible with DRUP, such that also DRUP certificates can be used. Moreover, if you specify the -j N option, gratgen will run in multi-threaded mode with N threads. For most certificates, this will result in a significant speedup. Good values for N are between 2 and 16, depending on the available hardware.

To check an unsatisfiability certificate in combined proof mode, i.e., lemmas and proofs are stored in the same file, run

gratgen formula.cnf drat-cert.drat -o grat-cert.grat [-j N]
gratchk unsat formula.cnf grat-cert.grat

Combined proof mode yields only a single certificate file, but the certificate checker is slightly slower and needs more memory.

Examples

For getting started quickly, we have compiled a collection of example formulas and certificates.

Benchmarks

We have benchmarked our checker (version 1.2) on a server board with an 22-core Intel XEON Broadwell 2.2GHz CPU and 128 GiB of RAM, and compared it to drat-trim (version: Nov 10, 2016 ,the current version as of May 2017) and incremental LRAT (obtained from the authors in May 2017).

For unsat, we used the problems from the 2016 SAT competition main track. As solvers, we used CryptoMiniSAT and silver medalist Riss6, yielding a total of 238 DRAT certificates. We used CryptoMiniSAT as it seems to be the only solver that generates a significant amount of RAT lemmas.

Using the standard timeout of 20.000 seconds, single-threaded GRAT could verify all certificates, while drat-trim and LRAT timed out on two, and segfaulted on a third one. To verify the remaining 235 certificates, GRAT required 44 hours, while drat-trim required 72 hours and LRAT required 93 hours. With eight threads, GRAT ran out of memory for one certificate. For the remaining 234 certificates, the wall-clock times sum up to only 21 hours.

For the certificates by CryptoMiniSAT, which make extensive use of RAT lemmas, the RAT run optimization shows its efficiency: Single-threaded GRAT needs 17h, while drat-trim need 42h and LRAT need 51h.

For the pure DRUP certificates produced by Riss6, only the sepWL optimization is effective, and GRAT needs 26h, while drat-trim need 30h and LRAT needs 42h.

The raw log files of the unsatisfiable benchmark runs are available as 2017-06-combined_logs.tgz.

We also evaluated the problems that gold medalist Maple solved in the 2017 SAT competition main track. Here, we used GRAT 1.3, the latest available repository version of drat-trim (December 2017), and the incremental LRAT that we obtained from the authors in May 2017 (According to the authors, there have been no updates to LRAT since then). The results are available as 2018-01-logs.tgz.

Memory consumption

The increased speed and formal correctness guarantees come at the price of higher memory consumption. Our tool chain needs roughly three times more memory than drat-trim in single-threaded mode, and memory consumption drastically increases in multi-threaded mode. However, memory consumption seems not to be a general problem. Out of the 347 benchmark problems, gratgen with 8 threads ran out of memory at 128 GiB for only a single certificate.

Satisfiability

Verifying satisfiability is almost straightforward and very efficient. For the 241 satisfiable problems in our benchmark set, the certificates have a size of 566MiB and gratchk verified them in roughly 100 seconds. We provide the raw csv data extracted from the logs: sat.csv Note that the sum of the times there is only about 50 seconds, which is due to small times being below the precision of our time measurement, and thus being rounded down to zero.

Documentation of the code

GRATgen comes with a doxygen generated documentation. For GRATchk, you may want to look at the proof document or the proof outline.

Formal Guarantees

We have formally proved the soundness of GRATchk in the Isabelle/HOL interactive theorem prover. The proof covers the actual implementation of GRATchk, and the semantics of the formula down to the integer array that represents it.

That is, if GRATchk claims that a formula is satisfiable/unsatisfiable, it is very likely that this is actually true.

Trusted Base

The trusted base of a piece of software describes what has to be trusted in order to believe that the produced result is correct. It includes, at least, the hardware, the operating system, and the compilers used to compile the software.

For a SAT solver, it includes the complete implementation of a SAT solver, usually a highly optimized C program. If one uses certificates, the sat solver is replaced by the certificate checker in the trusted base.

For formally verified software, the software itself is not included in the trusted base any more. It is replaced by the trusted base of the used theorem prover and by the correctness specification. This highly decreases the likelihood of bugs in the verified software:

Modern theorem provers like Isabelle/HOL are designed for maximum trustworthiness, reducing their own trusted base to a small and well-tested logical kernel.
It is very unlikely that a bug in the logical kernel coincides with an error in the correctness proof.

We now list, in detail, the trusted code base of our verified checker, i.e., what pieces of software you have to trust in order to believe in correctness of our checker:

The Isabelle/HOL logical inference kernel. This piece of code is considerably small and very well tested. In the last decade, we discovered less than 5 bugs, none of them leading to an inconsistent theorem being proved unnoticed!
The Isabelle/HOL code generator and its Imperative HOL extension. The code generator is merely a syntax transformation from the functional fragment of HOL to the target language, in our case Standard ML. It’s correctness has been proved on paper, and it is well-tested. Again, it is unlikely that an error in code generation will result in the verified checker accepting an invalid certificate. It is much more likely that the checker will just not compile, or fail with an exception at runtime.
The correctness theorem that we proved for the certified checker. In particular, you have to believe that the correctness theorem specifies the desired behavior of the checker. To this end, we provide a specification of a satisfying formula down to the integer array that holds null-terminated clauses of the formula, only using standard list functions. Excluding the definitions of the standard list functions, this specification is only a few lines of Isabelle/HOL source text. See the last section of the proof outline for a detailed explanation.
A small parser that parses the cnf file into an integer array. We tried to write this parser as clear as possible. Moreover, using Standard ML, it is immune to low-level bugs like numeric or buffer overflows.
The small piece of code that interprets the command line, invokes the parser and the generated code of the verifier, and prints the result.

While this list may seem huge, it is worth to list some items that are missing as opposed to an unverified checker written in C:

That there are no buffer overflows of any kind.
Correctness of the used data structures. This includes a map from clause IDs to clauses, from literals to their assignments, the management of RAT candidates, and that all code using these data structures satisfies their (often undocumented) preconditions.
Correctness of RUP and RAT checks per se, as well as unit propagation.
Correctness of the implementation of these concepts using partial assignments.
Correctness of the implementation of overall proof checking, e.g., that all lemmas are checked at the end.

License

GRAT comes under an OpenBSD style license.

Version History

Version 1.0 Initial version of GRAT
Version 1.1
- Summarized deletion items
- Split certificates
Version 1.2
- Separate watchlists (sepWL): More efficient implementation of core-first unit propagation.
- RAT-run heuristics: Reduction of expensive RAT candidate queries.
Version 1.3
- Support for binary DRAT format
- Bugfixes in gratgen
Version 1.3.1
- Bugfixes in gratgen
Version 1.3.2
- Bugfixes in gratgen

Version 1.3.3 (Current)

Bugfixes in gratgen.
- Details: while reading the formula, gratgen will assign unit clauses immediately. This can lead to a conflict clause being detected before the initial unit propagation. In this case, gratgen forgot to mark the relevant clauses as unit, thus not including them into the generated proof. Now fixed.