GRAT is a SAT solver certificate checking tool chain, which is formally verified using the Isabelle/HOL theorem prover. If a certificate is verified, you have strong formal guarantees that the result produced by the SAT solver is actually correct. At the same time, GRAT is faster than the unverified state-of-the-art tool drat-trim, and significantly faster than the verified LRAT tool.

The tool chain consists of two programs: the unverified certificate generator `gratgen` converts a DRAT certificate (as output by many modern SAT solvers) to a GRAT certificate, which is then checked against the original formula by the verified `gratchk` tool.

- Strong formal correctness guarantees.
- Almost twice as fast as drat-trim, thanks to novel heuristics (sepWL, RAT-run); still faster than drat-trim on pure DRUP certificates.
- Multi-threaded mode for even faster checking.
- Small certificates thanks to backward checking and core-first unit propagation.
- Memory efficient checking with split certificates.
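To illustrate the backward-checking idea from the list above, here is a toy sketch (our own simplification in Python, with hypothetical names `propagate` and `backward_check`; it is not gratgen's actual implementation): lemmas are processed last-to-first, only lemmas already marked as core are verified, and the lemmas used by a successful check become core in turn. Lemmas that are never used are never checked, which is what keeps certificates small.

```python
# Toy sketch of backward checking with core marking (our own
# simplification, not gratgen's implementation).

def propagate(clauses, assignment):
    """Naive unit propagation on a set of true literals.
    Returns (conflict_found, indices of clauses that fired)."""
    used, changed = set(), True
    while changed:
        changed = False
        for i, clause in enumerate(clauses):
            if any(lit in assignment for lit in clause):
                continue                       # clause already satisfied
            unassigned = [l for l in clause if -l not in assignment]
            if not unassigned:
                used.add(i)
                return True, used              # clause falsified: conflict
            if len(unassigned) == 1:
                assignment.add(unassigned[0])  # unit clause: force literal
                used.add(i)
                changed = True
    return False, used

def backward_check(formula, proof):
    """proof: list of lemmas, ending with the empty clause [].
    Returns the set of core lemma positions, or None on failure."""
    core = {len(proof) - 1}                    # the final empty clause is core
    for k in range(len(proof) - 1, -1, -1):
        if k not in core:
            continue                           # non-core lemmas are skipped
        active = formula + proof[:k]           # clauses available before lemma k
        conflict, used = propagate(active, {-l for l in proof[k]})
        if not conflict:
            return None                        # lemma k has no RUP proof
        # over-approximation: every lemma that fired becomes core
        core |= {i - len(formula) for i in used if i >= len(formula)}
    return core

formula = [[1, 2], [-1, 2], [1, -2], [-1, -2]]          # unsatisfiable
print(backward_check(formula, [[2], [1, 2], [1], []]))  # {0, 2, 3}: lemma 1 is never checked
```

Real implementations refine this in several ways (watched literals, core-first propagation order, trimming the set of used clauses), but the control structure is the same.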

- June 28, 2017: Added OpenBSD style license.
- June 28, 2017: Included pure DRUP benchmarks generated by Riss6: Still faster than drat-trim.
- May 22, 2017: Version 1.2 released. New heuristics, now more than 2 times faster than drat-trim.

Download the GRATgen and GRATchk tools and build them as indicated in their README files. The standard GRATchk package contains all Isabelle theories; however, to run the tools, the ML-source-only package suffices.

To check a satisfiability certificate, run

`gratchk sat formula.cnf certificate.sat`

Here, formula.cnf is a formula in the standard DIMACS CNF format, and certificate.sat is a list of non-contradictory literals that satisfy the formula when assigned to true. The certificate can, for example, be obtained by processing the output of the SAT solver with

`sed -re '/^v/!d;s/^v//'`
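Conceptually, this check is simple. The following Python sketch (our own illustration with a hypothetical `check_sat_certificate` function, not gratchk's verified code) shows what it amounts to once the clauses have been parsed:

```python
# Illustration only: what checking a SAT certificate amounts to.
# A certificate is a list of literals; it must be non-contradictory
# and must satisfy every clause of the formula.

def check_sat_certificate(clauses, cert):
    assignment = set(cert)
    # non-contradictory: no literal occurs together with its negation
    if any(-lit in assignment for lit in assignment):
        return False
    # every clause must contain at least one true literal
    return all(any(lit in assignment for lit in clause) for clause in clauses)

# (x1 or not x2) and (x2 or x3)
clauses = [[1, -2], [2, 3]]
print(check_sat_certificate(clauses, [1, 2]))   # True
print(check_sat_certificate(clauses, [1, -1]))  # False: contradictory
```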

To check an unsatisfiability certificate, run

```
gratgen formula.cnf drat-cert.drat -l grat-cert.gratl -o grat-cert.gratp [-j N]
gratchk unsat formula.cnf grat-cert.gratl grat-cert.gratp
```

Here, drat-cert.drat is a certificate in the standard DRAT format, which is supported by many SAT solvers. Note that the DRAT format is backward compatible with DRUP, so DRUP certificates can also be used. Moreover, if you specify the `-j N` option, gratgen will run in multi-threaded mode with N threads. For most certificates, this results in a significant speedup. Good values for N are between 2 and 16, depending on the available hardware.

To check an unsatisfiability certificate in combined proof mode, i.e., with lemmas and proofs stored in the same file, run

```
gratgen formula.cnf drat-cert.drat -o grat-cert.grat [-j N]
gratchk unsat formula.cnf grat-cert.grat
```

Combined proof mode yields only a single certificate file, but the certificate checker is slightly slower and needs more memory.

To get started quickly, we have compiled a collection of example formulas and certificates.

We have benchmarked our checker (version 1.2) on a server board with a 22-core Intel Xeon Broadwell 2.2 GHz CPU and 128 GiB of RAM, and compared it to drat-trim (version of Nov 10, 2016, the current version as of May 2017) and incremental LRAT (obtained from the authors in May 2017).

For unsat, we used the problems from the 2016 SAT competition main track. As solvers, we used CryptoMiniSAT and silver medalist Riss6, yielding a total of 238 DRAT certificates. We used CryptoMiniSAT as it seems to be the only solver that generates a significant amount of RAT lemmas.

Using the standard timeout of 20,000 seconds, single-threaded GRAT could verify all certificates, while drat-trim and LRAT timed out on two and segfaulted on a third. To verify the remaining 235 certificates, GRAT required 44 hours, while drat-trim required 72 hours and LRAT required 93 hours. With eight threads, GRAT ran out of memory for one certificate. For the remaining 234 certificates, the wall-clock times sum up to only 21 hours.

For the certificates produced by CryptoMiniSAT, which make extensive use of RAT lemmas, the RAT-run optimization shows its efficiency: single-threaded GRAT needs 17h, while drat-trim needs 42h and LRAT needs 51h.

For the pure DRUP certificates produced by Riss6, only the sepWL optimization is effective; here GRAT needs 26h, while drat-trim needs 30h and LRAT needs 42h.

The raw log files of the unsatisfiable benchmark runs are available as 2017-06-combined_logs.tgz.

The increased speed and formal correctness guarantees come at the price of higher memory consumption. Our tool chain needs roughly twice as much memory as drat-trim in single-threaded mode, and memory consumption increases drastically in multi-threaded mode. However, memory consumption does not seem to be a problem in general: out of the 238 benchmark problems, gratgen with 8 threads ran out of memory at 128 GiB for only a single certificate.

Verifying satisfiability is almost straightforward and very efficient. For the 64 satisfiable problems that CryptoMiniSAT solved at the 2016 SAT competition, the certificates have a total size of 229 MiB, and gratchk verified them in 40 seconds. We provide the raw CSV data extracted from the logs: sat.csv

GRATgen comes with doxygen-generated documentation. For GRATchk, you may want to look at the proof document or the proof outline.

We have formally proved the soundness of GRATchk in the Isabelle/HOL interactive theorem prover. The proof covers the actual implementation of GRATchk, and the semantics of the formula down to the integer array that represents it.

That is, if GRATchk claims that a formula is satisfiable/unsatisfiable, it is very likely that this is actually true.

The trusted base of a piece of software describes what has to be trusted in order to believe that the produced result is correct. It includes, at least, the hardware, the operating system, and the compilers used to compile the software.

For a SAT solver, it includes the complete implementation of the solver, usually a highly optimized C program. If one uses certificates, the SAT solver is replaced by the certificate checker in the trusted base.

For formally verified software, the software itself is no longer included in the trusted base. It is replaced by the trusted base of the theorem prover used and by the correctness specification. This greatly decreases the likelihood of bugs in the verified software:

- Modern theorem provers like Isabelle/HOL are designed for maximum trustworthiness, reducing their own trusted base to a small and well-tested logical kernel.
- It is very unlikely that a bug in the logical kernel coincides with an error in the correctness proof.

We now list, in detail, the trusted code base of our verified checker, i.e., what pieces of software you have to trust in order to believe in correctness of our checker:

The Isabelle/HOL logical inference kernel. This piece of code is quite small and very well tested. In the last decade, fewer than five bugs were discovered, **none of them** leading to an inconsistent theorem being proved unnoticed!

The Isabelle/HOL code generator and its Imperative HOL extension. The code generator is merely a syntax transformation from the functional fragment of HOL to the target language, in our case Standard ML. Its correctness has been proved on paper, and it is well-tested. Again, it is unlikely that an error in code generation will result in the verified checker accepting an invalid certificate; it is much more likely that the checker will just not compile, or fail with an exception at runtime.

The correctness theorem that we proved for the certified checker. In particular, you have to believe that the correctness theorem specifies the desired behavior of the checker. To this end, we provide a specification of what it means to satisfy a formula, down to the integer array that holds the null-terminated clauses of the formula, using only standard list functions. Excluding the definitions of the standard list functions, this specification is only a few lines of Isabelle/HOL source text. See the last section of the proof outline for a detailed explanation.

A small parser that parses the CNF file into an integer array. We tried to write this parser as clearly as possible. Moreover, being written in Standard ML, it is immune to low-level bugs like numeric or buffer overflows.
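To make this representation concrete, here is a sketch (our own Python rendering with hypothetical names `parse_dimacs_to_array` and `satisfies`; it is neither the actual Standard ML parser nor the Isabelle specification text) that parses DIMACS input into a flat, null-terminated integer array and states satisfaction directly over that array:

```python
# Sketch: clauses stored in one flat integer array, each clause
# terminated by 0, mirroring the representation the specification
# is stated against. Illustration only.

def parse_dimacs_to_array(text):
    arr = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in 'cp':  # skip comments and the problem line
            continue
        arr.extend(int(tok) for tok in line.split())
    return arr

def satisfies(assignment, arr):
    """assignment: set of true literals; checks every 0-terminated clause."""
    clause_satisfied = False
    for x in arr:
        if x == 0:                       # end of clause reached
            if not clause_satisfied:
                return False
            clause_satisfied = False
        elif x in assignment:
            clause_satisfied = True
    return True

cnf = "p cnf 3 2\n1 -2 0\n2 3 0\n"
arr = parse_dimacs_to_array(cnf)
print(arr)                     # [1, -2, 0, 2, 3, 0]
print(satisfies({1, 3}, arr))  # True
```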

The small piece of code that interprets the command line, invokes the parser and the generated code of the verifier, and prints the result.

While this list may seem long, it is worth listing some items that are missing from it, compared to an unverified checker written in C:

- That there are no buffer overflows of any kind.
- Correctness of the used data structures. This includes a map from clause IDs to clauses, from literals to their assignments, the management of RAT candidates, and that all code using these data structures satisfies their (often undocumented) preconditions.
- Correctness of RUP and RAT checks per se, as well as unit propagation.
- Correctness of the implementation of these concepts using partial assignments.
- Correctness of the implementation of overall proof checking, e.g., that all lemmas are checked at the end.

GRAT comes under an OpenBSD style license.

- Version 1.0: Initial version of GRAT
- Version 1.1
  - Summarized deletion items
  - Split certificates
- Version 1.2
  - Separate watchlists (sepWL): More efficient implementation of core-first unit propagation.
  - RAT-run heuristics: Reduction of expensive RAT candidate queries.