Introduction

The vision of Feynman1 that a quantum computer can be used to overcome the limitations of classical computers gained momentum with the design of quantum algorithms that outperform their classical counterparts for popular challenging problems like factorization2,3 and searching an unstructured solution space4. Quantum circuits are one of the most popular ways of describing and implementing quantum algorithms. They consist of a series of elementary operations or gates belonging to a universal set, which is dictated by the implementing technology. As in the classical case, circuit synthesis and optimization are a significant part of any quantum computer compilation process. A compiler primarily translates a human-readable input (programming language) into instructions that can be executed directly on hardware. An integral part of this process is quantum circuit synthesis, whose aim is to decompose a unitary operation into a sequence of gates from a universal set. Often, additional constraints are imposed on a synthesis task, such as minimizing a certain resource like qubits, gates (total number), non-Clifford gates, multi-qubit gates, etc. We call algorithms that solve such constrained tasks resource-optimal synthesis algorithms.

Our work primarily focuses on the “Clifford+T” gate set, a popular finite universal fault-tolerant set. Fault-tolerant quantum error correction5,6 is required to control the accumulation of errors due to noise on quantum information, faulty quantum gates, faulty quantum state preparation, faulty measurements, etc. This is especially important for long computations, since otherwise the errors make the likelihood of obtaining a reliable and useful answer negligible. The non-Clifford T gate has known constructions in most error correction schemes, and the cost of implementing it fault-tolerantly exceeds the cost of the Clifford group gates by as much as a factor of a hundred or more7,8. The minimum number of T-gates required to implement certain unitaries is a quantifier of difficulty in many algorithms9,10 that try to classically simulate quantum computation. So, even though alternative fault-tolerance methods such as the completely transversal Clifford+T scheme11 and anyonic quantum computing12 are also being explored, minimization of the number of T gates (or T-count) in quantum circuits remains an important and widely studied goal. It has been argued in refs. 13,14,15 that it is also important to reduce the maximum number of T gates in any circuit path (or T-depth).

The Solovay-Kitaev algorithm16,17 guarantees that given an n-qubit unitary W, we can generate a circuit with a “discrete finite” universal gate set like Clifford+T, such that the unitary U implemented by the circuit is at most a certain distance from W. Here we note that there exist “infinite continuous” universal gate sets like Clifford+Rz(θ), with which we can implement any unitary without any approximation. In this paper we focus on finite universal gate sets, which are more suitable for quantum error correction and fault tolerance. In fact, in quantum computation a set of gates is said to be universal if any quantum operation can be approximated to arbitrary accuracy by a quantum circuit involving only those gates18. A unitary is called exactly implementable by a gate set if there exists a quantum circuit with these gates that implements it (up to some global phase). Otherwise, it is approximately implementable. Accordingly, a synthesis algorithm can be (a) exact when \(U={e}^{i\phi }W\) (ϕ is the global phase); or (b) approximate when d(U,W) ≤ ϵ for some ϵ > 0. Here d(. , .) is a distance metric. For a unitary U that is exactly implementable by the Clifford+T set, its T-count (denoted by \({{{\mathcal{T}}}}(U)\)) is the minimum number of T-gates required to implement it, while its T-depth (denoted by \({{{{\mathcal{T}}}}}_{d}(U)\)) is the minimum T-depth of any circuit that implements it. These definitions can be generalized to approximately implementable unitaries, as described in the section “Preliminaries”. In this paper we give algorithms for the following two problems.

ϵ-T-COUNT: Given an n-qubit unitary W and \(\epsilon \in {{\mathbb{R}}}_{\ge 0}\), determine the T-count of a unitary U such that \({{{\mathcal{T}}}}(U)\le {{{\mathcal{T}}}}({U}^{{\prime} })\), where \(U,{U}^{{\prime} }\) are n-qubit exactly implementable unitaries and \(d(U,W),d({U}^{{\prime} },W)\le \epsilon\).

ϵ-T-DEPTH: Given an n-qubit unitary W and \(\epsilon \in {{\mathbb{R}}}_{\ge 0}\), determine the T-depth of a unitary U such that \({{{{\mathcal{T}}}}}_{d}(U)\le {{{{\mathcal{T}}}}}_{d}({U}^{{\prime} })\), where \(U,{U}^{{\prime} }\) are n-qubit exactly implementable unitaries and \(d(U,W),d({U}^{{\prime} },W)\le \epsilon\).

\({{{\mathcal{T}}}}(U)\) and \({{{{\mathcal{T}}}}}_{d}(U)\) are called the ϵ-T-count (\({{{{\mathcal{T}}}}}_{\epsilon }(W)\)) and ϵ-T-depth (\({{{{\mathcal{T}}}}}_{d\epsilon }(W)\)) of W, respectively. The T-count-optimal and T-depth-optimal circuits of U are called the ϵ-T-count-optimal and ϵ-T-depth-optimal circuits for W, respectively. In this paper, we use the global phase invariant distance (see the “Preliminaries” section) as the distance metric and not the operator norm. This is because the global phase invariant distance ignores the global phase and hence avoids unnecessarily long approximating sequences that achieve a specific global phase. This may be the reason why the bound on the T-count of single-qubit Z-rotations is smaller in ref. 19, which works with this distance, than in refs. 20,21, which work with the operator norm. (More discussion can be found in ref. 22.) This distance is composable22 and has been used to synthesize unitaries in other models like topological quantum computation23,24. It is not hard to see that if ϵ = 0 then we get the problem of synthesizing T-count-optimal and T-depth-optimal circuits for exactly implementable unitaries. In this case, both provable25,26,27 and much more efficient heuristic26,27 algorithms have been developed (see Table 1 for a comparison). We say that an algorithm is provable if its claimed efficiency and correctness or quality of solution (in this case optimality) can be proved by rigorous arguments. An algorithm is heuristic if one or both of these properties are only conjectured to hold.

Table 1 Complexity of some state-of-the-art optimal synthesis algorithms.

Any synthesis algorithm will have complexity at least \(O({2}^{n})\), the input size. Placing a further optimality constraint makes the problem even harder, in fact impractical to solve on a PC beyond a certain value of n. So re-synthesis algorithms have been developed, which take a circuit implementing a unitary and then try to reduce (not minimize) a certain resource (see for example, refs. 28,29,30). In the literature, these algorithms usually do not account for the complexity of synthesizing the input circuit and claim a running time of poly(n). A detailed study of the relative merits and demerits of synthesis and re-synthesis algorithms is beyond the scope of this work. But we would like to point out that the importance of designing better (optimal) synthesis algorithms and of studying their complexity should not be underestimated, not only for theoretical reasons but also for the various applications they can have. Apart from guaranteeing optimality, they can be used as sub-routines in compiling large unitaries22,31,32, to assess the quality of re-synthesis algorithms, etc. For example, the T-depth-optimal synthesis algorithm of ref. 27 was able to generate optimal circuits for standard unitaries like Fredkin, Peres, and Quantum OR, which could not be done by the re-synthesis methods used in ref. 33. In the section “Discussion of implementation results”, we show that we get a much lower T-count for widely used multi-qubit unitaries like controlled rotations, compared to the number of T-gates obtained by first compiling them into single-qubit rotations and then replacing each single-qubit rotation by its T-count-optimal circuit.

To the best of our knowledge, before this paper there was no algorithm to determine the ϵ-T-count or ϵ-T-depth of arbitrary multi-qubit (n ≥ 1) unitaries, under any distance metric. Previous algorithms, such as those in refs. 19,20,21, synthesize ϵ-T-count-optimal circuits for single-qubit Z-rotations. In fact, even if we consider other discrete, finite universal gate sets like Clifford+V or Clifford+CS, there are no algorithms that work for arbitrary multi-qubit unitaries and minimize the non-Clifford gate count/depth. It is not even clear how to modify or generalize the methods introduced in these papers. However, our results not only work for multi-qubit unitaries but can also be applied in these alternate bases, as explained in the next section.

In this paper, we give algorithms that can be used to synthesize (provably) ϵ-T-count-optimal and ϵ-T-depth-optimal circuits. We treat arithmetic operations on the entries of a unitary at unit cost and we do not account for the bit-complexity associated with specifying or manipulating them. Suppose the input n-qubit unitary is W, having size \({2}^{n}\times {2}^{n}\). Then the space complexity of our algorithms, described in “Methods”, is \({{{\rm{poly}}}}({2}^{n})\). The time complexity has an exponential dependence on \({{{{\mathcal{T}}}}}_{\epsilon }(W)\) or \({{{{\mathcal{T}}}}}_{d\epsilon }(W)\) while synthesizing ϵ-T-count or ϵ-T-depth optimal circuit, respectively (Table 1).

For the design and analysis of our algorithm the following results (see the “An exponential time and polynomial space algorithm” section) have been crucial. Suppose E is a unitary that is close to \({\mathbb{I}}\) in the global phase invariant distance, i.e., \(d(E,{\mathbb{I}})\le \epsilon\), and C0 is an n-qubit Clifford operator. Then EC0 behaves almost like the Clifford C0, i.e., it approximately inherits some characteristics from C0. First, expanding both EC0 and C0 in the Pauli basis, we see that the absolute values of the corresponding coefficients (at each Pauli) are almost equal (amplitude test). Second, if we expand \((E{C}_{0})P{(E{C}_{0})}^{{\dagger} }\) in the Pauli basis then the absolute value of the coefficient is almost 1 at \({P}^{{\prime} }\) if \({C}_{0}P{C}_{0}^{{\dagger} }={P}^{{\prime} }\), and nearly zero at other points (conjugation test). These results may be of independent interest and can be used for resource-optimal synthesis in other bases as described below.

Most discrete universal gate sets consist of Clifford gates and one non-Clifford gate. Consider one such set Clifford+A, where A is a non-Clifford gate, and let UA be a unitary exactly implementable by this set. Since the cost of fault-tolerantly implementing the non-Clifford gate is usually higher, we are required to optimize the A-count or A-depth, which are defined analogously to T-count and T-depth. One of the tricks in many resource-optimal synthesis algorithms is to find a nice generating set \({{{\mathcal{G}}}}\) of finite cardinality such that UA can be decomposed as follows.

$${U}_{A}={e}^{i\phi }\left(\mathop{\prod }\limits_{i=m}^{1}{G}_{i}\right){C}_{0}\qquad {G}_{i}\in {{{\mathcal{G}}}},\,{C}_{0}\in {{{{\mathcal{C}}}}}_{n},\,\phi \in \left[0,2\pi \right),\,m={{{\rm{A}}}}-{{{\rm{count}}}}/{{{\rm{A}}}}-{{{\rm{depth}}}}$$
(1)

Each \({G}_{i}\in {{{\mathcal{G}}}}\) has a specific property: A-count 1 or A-depth 1. Then we design a search algorithm to search for products of Gi such that we get \({U}_{A}{C}_{0}^{-1}\) (up to a global phase), or rather such that \({\left({\prod }_{i}{G}_{i}\right)}^{-1}{U}_{A}\) is a Clifford. We know that for any discrete finite universal gate set not all unitaries are exactly implementable. Let VA be one such unitary, and let UA be a unitary such that d(UA, VA) ≤ ϵ and that has the minimum A-count or A-depth among all exactly implementable unitaries within distance ϵ of VA. Then we can perform the amplitude test (Theorem 3.1) and the conjugation test (Theorem 3.3, Corollary 3.1) and get an ϵ-A-count-optimal or ϵ-A-depth-optimal decomposition of VA. So it will be interesting and useful to find such nice generating sets for other bases, as has been found for Clifford+T (T-count25, T-depth27) and Clifford+CS34 (CS-count, only for 2-qubit unitaries). One simple way of constructing \({{{\mathcal{G}}}}\) for count-optimality is to write UA as follows.

$$\begin{array}{l}{U}_{A}\,\,=\,\,{e}^{i\phi }{C}_{1}{A}_{({q}_{1})}{C}_{2}{A}_{({q}_{2})}{C}_{3}\ldots {C}_{m}{A}_{({q}_{m})}{C}_{0}\\ \qquad=\,\,{e}^{i\phi }\left({C}_{1}{A}_{({q}_{1})}{C}_{1}^{{\dagger} }\right)\left({C}_{1}{C}_{2}{A}_{({q}_{2})}{C}_{2}^{{\dagger} }{C}_{1}^{{\dagger} }\right)\ldots\\ \qquad\qquad \left({C}_{1}\ldots {C}_{m}{A}_{({q}_{m})}{C}_{m}^{{\dagger} }\ldots {C}_{1}^{{\dagger} }\right){C}_{1}\ldots {C}_{m}{C}_{0}\\ \qquad=\,\,{e}^{i\phi }\left({C}_{1}{A}_{({q}_{1})}{C}_{1}^{{\dagger} }\right)\left({C}_{2}^{{\prime} }{A}_{({q}_{2})}{C}_{2}^{{\prime} {\dagger} }\right)\ldots \left({C}_{m}^{{\prime} }{A}_{({q}_{m})}{C}_{m}^{{\prime} {\dagger} }\right){C}_{0}^{{\prime} }\end{array}$$

In the above \({A}_{({q}_{i})}\) denotes the A-gate applied on qubit qi, and \({C}_{0},\ldots ,{C}_{m},{C}_{0}^{{\prime} },\ldots ,{C}_{m}^{{\prime} }\in {{{{\mathcal{C}}}}}_{n}\). So \({{{\mathcal{G}}}}=\{C{A}_{({q}_{i})}{C}^{{\dagger} }:C\in {{{{\mathcal{C}}}}}_{n},i\in [n]\}\), which is finite since the Clifford group is finite35,36. It is possible to get more compact sets by exploiting other algebraic properties (for example, ref. 25). A generating set for depth-optimality can be constructed by conjugating products of at most n A-gates on distinct qubits by Cliffords, as has been done in ref. 27.
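The regrouping used above can be checked numerically on a small instance. The following is a minimal sketch, assuming a single qubit, the T gate in the role of A, and an arbitrary illustrative choice of Cliffords (C1 = H, C2 = S, C0 = HS); the helper names are hypothetical and not taken from the paper's implementation.

```python
import cmath

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dagger(A):
    """Conjugate transpose."""
    return [[complex(x).conjugate() for x in col] for col in zip(*A)]

def product(*mats):
    """Left-to-right product of several matrices."""
    out = mats[0]
    for m in mats[1:]:
        out = matmul(out, m)
    return out

def max_abs_diff(A, B):
    """Largest entrywise deviation between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

s = 2 ** -0.5
H = [[s, s], [s, -s]]                            # Hadamard (Clifford)
S = [[1, 0], [0, 1j]]                            # phase gate (Clifford)
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]  # plays the role of A

# Illustrative (assumed) choice of Cliffords C1, C2, C0.
C1, C2, C0 = H, S, matmul(H, S)

# Left-hand side: C1 A C2 A C0 (the global phase is set to 1).
lhs = product(C1, T, C2, T, C0)

# Right-hand side: (C1 A C1^dagger)(C2' A C2'^dagger) C0'
# with C2' = C1 C2 and C0' = C1 C2 C0.
C2p = matmul(C1, C2)
C0p = product(C1, C2, C0)
G1 = product(C1, T, dagger(C1))
G2 = product(C2p, T, dagger(C2p))
rhs = product(G1, G2, C0p)

print(max_abs_diff(lhs, rhs))  # zero up to floating-point rounding
```

Each factor G1, G2 is a Clifford conjugate of a single A-gate, i.e., an element of the generating set, and all remaining Cliffords collect into the single trailing factor.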

From Table 1 we see that for exactly implementable unitaries the provable algorithms of refs. 25,26,27 have an exponential dependence on T-count and T-depth. Significant improvements were achieved in refs. 26,27, where heuristics were designed that led to algorithms with a polynomial dependence on T-count and T-depth. The algorithm in our paper also suffers from an exponential dependence on ϵ-T-count and ϵ-T-depth, which usually have an inverse dependence on ϵ, i.e., \({{{{\mathcal{T}}}}}_{\epsilon },{{{{\mathcal{T}}}}}_{d\epsilon }\propto f(1/\epsilon )\). We conjecture that for approximately implementable unitaries it is not possible to have algorithms with a polynomial dependence on these parameters.

Conjecture 1. It is not possible to have ϵ-T-count or ϵ-T-depth-optimal synthesis algorithms with complexity \({{{\rm{poly}}}}\left({2}^{n},{{{{\mathcal{T}}}}}_{\epsilon }\right)\) or \({{{\rm{poly}}}}\left({2}^{n},{{{{\mathcal{T}}}}}_{d\epsilon }\right)\), respectively.

However, from the improvements in T-count obtained by us (see the “Discussion of implementation results” section), we feel it is important to design efficient multi-qubit resource-optimal synthesis algorithms. In many practical quantum algorithms it is not too difficult to decompose a large unitary into smaller ones. We can apply composability rules (for example, see ref. 22 for global phase invariant distance) and distribute the errors among these small unitaries, such that the overall error remains within the desired bound. The complexity of resource-optimal synthesis algorithms will determine the maximum size of the component unitaries in a decomposition. The larger the components, the better the resource estimates, as evident from the results in our paper (see the “Discussion of implementation results” section). It might be possible to decrease the exponent in the time complexity by applying techniques like meet-in-the-middle25 or nested meet-in-the-middle26,27. It is not hard to see that our algorithm can be parallelized. It will be interesting to investigate if additional tricks can be used. Another interesting question is to find more compact generating sets for other universal gate sets for multi-qubit unitaries.

Much work has been done to synthesize a circuit for any multi-qubit unitary (without provable optimality on any resource)16,17,18,20,37,38,39,40,41. Comparatively little has been done for arbitrary multi-qubit unitaries when additional constraints are imposed, like minimizing the T-count or T-depth. To the best of our knowledge, all previous works on approximately implementable unitaries have been exclusively for single-qubit unitaries, in fact specifically for Rz(θ) gates. They work with either the operator norm20,21 or the global phase invariant distance19. However, a considerable amount of work has been done to synthesize T-count-optimal and T-depth-optimal circuits for exactly implementable multi-qubit unitaries. These include algorithms with provable optimality, like those in refs. 25,26,27, that employ meet-in-the-middle (MITM) and nested MITM techniques, as well as more efficient heuristic algorithms whose optimality depends on some conjecture26,27. A summary of the complexity of some state-of-the-art optimal synthesis algorithms is given in Table 1.

Work has also been done to approximate single-qubit unitaries in the Clifford+V basis42,43,44,45 and to perform a CS-count-optimal (exact) synthesis34 of two-qubit unitaries in the Clifford+CS basis.

Results and discussion

Preliminaries

We write [K] = {1, 2, …, K}. We denote the n × n identity matrix by \({{\mathbb{I}}}_{n}\) or \({\mathbb{I}}\) if the dimension is clear from the context. We denote the set of n-qubit unitaries by \({{{{\mathcal{U}}}}}_{n}\). The size of an n-qubit unitary is N × N where \(N={2}^{n}\). We give a detailed description of the n-qubit Pauli operators (\({{{{\mathcal{P}}}}}_{n}\)), the Clifford group (\({{{{\mathcal{C}}}}}_{n}\)) and the group (\({{{{\mathcal{J}}}}}_{n}\)) generated by the Clifford and T gates in Supplementary Note 1. In this subsection we give some additional definitions and results required for the rest of the paper. We observe the following when expanding a Clifford in the Pauli basis.

Fact 2.1 (ref. 46). If \(C\in {{{{\mathcal{C}}}}}_{n}\) then for each \(P\in {{{{\mathcal{P}}}}}_{n}\) there exists \({r}_{P}\in {\mathbb{C}}\), such that \(C={\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{r}_{P}P\). Further, if \({r}_{P},{r}_{{P}^{{\prime} }}\,\ne\, 0\) for any pair of \(P,{P}^{{\prime} }\), then \(| {r}_{P}| =| {r}_{{P}^{{\prime} }}| =r\), for some \(r\in {\mathbb{R}}\).

Fact 2.2. Let \(Q={\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{q}_{P}P\) be the expansion of a matrix Q in the Pauli basis. Then

$${q}_{P}=Tr\left(QP\right)/N.$$

Further, if Q is a unitary then \({\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{\left\vert {q}_{P}\right\vert }^{2}=1\)

The proof is given in Supplementary Note 1 (Fact 2). Now consider a Clifford C with an expansion as given in Fact 2.1. Let there be M (\(\le {N}^{2}\)) non-zero coefficients in it. Since C is a unitary, we can apply Fact 2.2 and get the following.

$$M{r}^{2}=1\quad \ \Rightarrow \ r=\frac{1}{\sqrt{M}}\ge \frac{1}{N}.$$
(2)
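Facts 2.1 and 2.2 and Eq. (2) can be illustrated on two familiar Cliffords. The sketch below is a hedged, standard-library-only illustration: the function names (`paulis`, `pauli_coefficients`) are hypothetical, and the Hadamard and CNOT gates are chosen here only as examples.

```python
from itertools import product as cartesian

PAULI_1Q = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    """Kronecker (tensor) product of two matrices."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def paulis(n):
    """Yield (label, matrix) for all 4**n n-qubit Pauli operators."""
    for labels in cartesian("IXYZ", repeat=n):
        mat = [[1]]
        for label in labels:
            mat = kron(mat, PAULI_1Q[label])
        yield "".join(labels), mat

def pauli_coefficients(Q):
    """Coefficients q_P = Tr(Q P) / N of Fact 2.2, keyed by Pauli label."""
    N = len(Q)
    n = N.bit_length() - 1
    return {label: trace(matmul(Q, P)) / N for label, P in paulis(n)}

s = 2 ** -0.5
H = [[s, s], [s, -s]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

for name, C in (("H", H), ("CNOT", CNOT)):
    coeffs = pauli_coefficients(C)
    support = {lab: c for lab, c in coeffs.items() if abs(c) > 1e-12}
    M = len(support)
    r = abs(next(iter(support.values())))        # common magnitude (Fact 2.1)
    norm = sum(abs(c) ** 2 for c in coeffs.values())
    # Fact 2.2 gives norm = 1, and Eq. (2) gives M * r**2 = 1.
    print(name, sorted(support), M, r, norm)
```

For the Hadamard gate the support is {X, Z} with M = 2, and for CNOT (control on the first qubit) it is {II, IX, ZI, ZX} with M = 4, so r equals 1/√M in both cases.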

We say that a unitary U is exactly implementable if there exists a Clifford+T circuit that implements it (up to some global phase), else it is approximately implementable. Specifically, we say W is ϵ-approximately implementable if there exists an exactly implementable unitary U such that d(U,W) ≤ ϵ. The Solovay-Kitaev algorithm16,17 guarantees that any unitary is ϵ-approximately implementable, for arbitrary precision ϵ > 0. We denote the set of exactly implementable unitaries by \({{{{\mathcal{J}}}}}_{n}\). In this paper, we use the following distance measure d(. , .), which has been used in previous works like refs. 19,38 (qubit-based computing) and refs. 23,24 (topological quantum computing).

Definition 2.1 (Global phase invariant distance). Given two unitaries \(U,W\in {{{{\mathcal{U}}}}}_{n}\), we define the global phase invariant distance between them as follows.

$$d(U,W)=\sqrt{1-\frac{\left\vert Tr\left({U}^{{\dagger} }W\right)\right\vert }{N}}$$

Composability of this distance with respect to tensor product and multiplication of unitaries, has been derived in ref. 22. This implies that if \(U={\prod }_{i}\left({\otimes }_{j}{U}_{ij}\right)\), \(V={\prod }_{i}\left({\otimes }_{j}{V}_{ij}\right)\) and d(Uij, Vij) ≤ ϵij then we can upper bound d(U, V) as function of ϵij.
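Definition 2.1 translates directly into a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `gpi_distance` is a hypothetical label chosen for this illustration.

```python
import cmath
import math

def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dagger(A):
    """Conjugate transpose."""
    return [[complex(x).conjugate() for x in col] for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def gpi_distance(U, W):
    """Global phase invariant distance d(U, W) of Definition 2.1."""
    N = len(U)
    overlap = abs(trace(matmul(dagger(U), W))) / N
    # Rounding can push the overlap marginally above 1, so clamp before the root.
    return math.sqrt(max(0.0, 1.0 - overlap))

I2 = [[1, 0], [0, 1]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
phase = cmath.exp(0.7j)
T_with_phase = [[phase * x for x in row] for row in T]

print(gpi_distance(T, T_with_phase))  # numerically zero: the global phase is ignored
print(gpi_distance(I2, T))            # equals sqrt(1 - cos(pi/8))
```

The second value follows from |Tr(T)|/2 = |1 + e^{iπ/4}|/2 = cos(π/8).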

Definition 2.2. The T-count of a circuit is the number of T-gates in it. The T-count of an exactly implementable unitary \(U\in {{{{\mathcal{J}}}}}_{n}\), denoted by \({{{\mathcal{T}}}}(U)\), is the minimum number of T-gates required to implement it (up to a global phase).

In ref. 25, the authors proved the following decomposition result, by which any exactly implementable unitary over the Clifford+T set can be written as a product of T-count 1 unitaries.

Theorem 2.1 (Proposition 1 in ref. 25 (re-stated)). For any \(U\in {{{{\mathcal{J}}}}}_{n}\) there exists a phase ϕ ∈ [0, 2π), Cliffords \({C}_{i}\in {{{{\mathcal{C}}}}}_{n}\) and Paulis \({P}_{i}\in {{{{\mathcal{P}}}}}_{n}\setminus \{{\mathbb{I}}\}\) for \(i\in [{{{\mathcal{T}}}}(U)]\) such that

$$U={e}^{i\phi }\left(\mathop{\prod }\limits_{i={{{\mathcal{T}}}}(U)}^{1}R({P}_{i})\right){C}_{0}$$
(3)

where \(R({P}_{i})={C}_{i}{T}_{({q}_{i})}{C}_{i}^{{\dagger} }=\frac{1}{2}(1+{e}^{\frac{i\pi }{4}}){\mathbb{I}}+\frac{1}{2}(1-{e}^{\frac{i\pi }{4}}){C}_{i}{Z}_{({q}_{i})}{C}_{i}^{{\dagger} } =\frac{1}{2}(1+{e}^{\frac{i\pi }{4}}){\mathbb{I}}+\frac{1}{2}(1-{e}^{\frac{i\pi }{4}}){P}_{i}\).

Using Fact 1 in Supplementary Note 1, given Pi and \({Z}_{({q}_{i})}\) we can efficiently compute (a circuit for) Ci such that \({P}_{i}={C}_{i}{Z}_{({q}_{i})}{C}_{i}^{{\dagger} }\). A decomposition of U, as in Theorem 2.1, with the minimum number of T-gates is called a T-count-optimal decomposition of U.
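The closed form of R(P) in Theorem 2.1 can be verified on a single qubit. The sketch below is illustrative only: it checks that R(Z) is the T gate, that conjugating T by the Hadamard gate gives R(X) (since HZH† = X), and that R(P) is unitary. The function name `R` mirrors the notation of the theorem; the remaining helpers are hypothetical.

```python
import cmath

OMEGA = cmath.exp(1j * cmath.pi / 4)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dagger(A):
    return [[complex(x).conjugate() for x in col] for col in zip(*A)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def R(P):
    """R(P) = (1 + e^{i pi/4})/2 * I + (1 - e^{i pi/4})/2 * P for a Pauli P."""
    N = len(P)
    a, b = (1 + OMEGA) / 2, (1 - OMEGA) / 2
    return [[a * (1 if i == j else 0) + b * P[i][j] for j in range(N)]
            for i in range(N)]

s = 2 ** -0.5
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
H = [[s, s], [s, -s]]
T = [[1, 0], [0, OMEGA]]

# R(Z) is the T gate itself.
print(max_abs_diff(R(Z), T))

# Conjugating T by the Clifford H gives R(H Z H^dagger) = R(X).
HTH = matmul(matmul(H, T), dagger(H))
print(max_abs_diff(HTH, R(X)))

# R(P) is unitary.
print(max_abs_diff(matmul(dagger(R(X)), R(X)), I2))
```

All three printed deviations are zero up to floating-point rounding.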

Definition 2.3 (ϵ-T-count of approximately implementable unitaries). Let \(W\in {{{{\mathcal{U}}}}}_{n}\) be an approximately implementable unitary. The ϵ-T-count of W, denoted by \({{{{\mathcal{T}}}}}_{\epsilon }(W)\), is equal to \({{{\mathcal{T}}}}(U)\), the T-count of an exactly implementable unitary \(U\in {{{{\mathcal{J}}}}}_{n}\) such that d(U,W) ≤ ϵ and \({{{\mathcal{T}}}}(U)\le {{{\mathcal{T}}}}({U}^{{\prime} })\) for any \({U}^{{\prime} }\in {{{{\mathcal{J}}}}}_{n}\) within distance ϵ of W, i.e., \(d({U}^{{\prime} },W)\le \epsilon\). We call a T-count-optimal circuit for any such U as the ϵ-T-count-optimal circuit for W.

It is not hard to see that the above definitions are very general and can be applied to any unitary \(W\in {{{{\mathcal{U}}}}}_{n}\), exactly or approximately implementable. If a unitary is exactly implementable then ϵ = 0. In fact, nearly all the following results can be deduced for the special case of exactly implementable unitaries by applying ϵ = 0.
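To make Definition 2.3 concrete, the following is a brute-force sketch for a single qubit. It is not the algorithm of this paper: it simply enumerates all products of t unitaries R(P), P ∈ {X, Y, Z}, followed by each of the 24 single-qubit Cliffords (generated here from H and S, modulo global phase), and returns the smallest t for which some product lies within distance ϵ of W. By Theorem 2.1 this enumeration covers every exactly implementable unitary of T-count at most t. All function names and the cut-off `max_t` are assumptions made for this illustration.

```python
import cmath
import math
from itertools import product as cartesian

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dagger(A):
    return [[complex(x).conjugate() for x in col] for col in zip(*A)]

def gpi_distance(U, W):
    """Global phase invariant distance of Definition 2.1."""
    N = len(U)
    prod = matmul(dagger(U), W)
    overlap = abs(sum(prod[i][i] for i in range(N))) / N
    return math.sqrt(max(0.0, 1.0 - overlap))

s = 2 ** -0.5
OMEGA = cmath.exp(1j * math.pi / 4)
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
H = [[s, s], [s, -s]]
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, OMEGA]]

def R(P):
    """T-count-1 unitary R(P) of Theorem 2.1."""
    a, b = (1 + OMEGA) / 2, (1 - OMEGA) / 2
    return [[a * I2[i][j] + b * P[i][j] for j in range(2)] for i in range(2)]

def single_qubit_cliffords():
    """Closure of {H, S} under multiplication, modulo global phase (24 elements)."""
    group, frontier = [I2], [I2]
    while frontier:
        new = []
        for C in frontier:
            for g in (H, S):
                cand = matmul(g, C)
                if all(gpi_distance(cand, D) > 1e-6 for D in group):
                    group.append(cand)
                    new.append(cand)
        frontier = new
    return group

def eps_t_count(W, eps, max_t=4):
    """Smallest t <= max_t with some prod_i R(P_i) C0 within eps of W, else None."""
    cliffords = single_qubit_cliffords()
    generators = [R(X), R(Y), R(Z)]
    for t in range(max_t + 1):
        for choice in cartesian(generators, repeat=t):
            prod = I2
            for G in choice:
                prod = matmul(prod, G)
            if any(gpi_distance(matmul(prod, C), W) <= eps for C in cliffords):
                return t
    return None

print(eps_t_count(T, 0.1))  # T is farther than 0.1 from every Clifford
print(eps_t_count(T, 0.3))  # but d(I, T) is about 0.276, so a Clifford suffices
```

The running time of this sketch grows as 3^t, which is only workable for very small T-counts; it is meant to illustrate the definition, not to be efficient.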

Discussion of implementation results

We implemented our algorithm \({{{{\mathcal{A}}}}}_{MIN}\) in standard C++17 on an Intel(R) Core(TM) i7-7700K CPU at 4.2GHz, with 8 cores and 48 GB RAM, running FreeBSD 13.1. We compiled the code using clang++ 13.0.0. We used OpenMP47 for parallelization and the Eigen 3 matrix library48 for some of the matrix operations. We applied our algorithm to return the T-count-optimal decomposition of the following unitaries: (i) 1-qubit Rz(θ) and Rk; (ii) 2-qubit controlled-Rz(θ) (cRz(θ)); (iii) 2-qubit controlled Rk (cRk); (iv) 3-qubit double controlled-Rz(θ) (ccRz(θ)); (v) 3-qubit double controlled-Rk (ccRk); (vi) 2-qubit Givens rotation (Givens(θ)); (vii) 2-qubit QFT.

$$(i)\quad {R}_{z}(\theta )=\left[\begin{array}{cc}{e}^{-i\theta /2}&0\\ 0&{e}^{i\theta /2}\end{array}\right];\qquad {R}_{k}=\left[\begin{array}{cc}1&0\\ 0&{e}^{2\pi i/{2}^{k}}\end{array}\right]={e}^{2\pi i/{2}^{k+1}}{R}_{z}\left(\frac{2\pi }{{2}^{k}}\right)$$
$$(ii)\quad c{R}_{z}(\theta )={{{\rm{diagonal}}}}\left(1,1,{e}^{-i\theta /2},{e}^{i\theta /2}\right)$$
$$(iii)\quad c{R}_{k}={{{\rm{diagonal}}}}\left(1,1,1,{e}^{2\pi i/{2}^{k}}\right)$$
$$(iv)\quad cc{R}_{z}(\theta )={{{\rm{diagonal}}}}\left(1,1,1,1,1,1,{e}^{-i\theta /2},{e}^{i\theta /2}\right)$$
$$(v)\quad cc{R}_{k}={{{\rm{diagonal}}}}\left(1,1,1,1,1,1,1,{e}^{2\pi i/{2}^{k}}\right)$$
$$(vi)\quad Givens(\theta )=\left[\begin{array}{cccc}1&0&0&0\\ 0&\cos \theta &-\sin \theta &0\\ 0&\sin \theta &\cos \theta &0\\ 0&0&0&1\end{array}\right]$$

For convenience, we have denoted some diagonal matrices by diagonal(a, b, c, …) which implies that the matrix has elements a, b, c, … along the diagonal and 0 elsewhere. We used Quantum++49 to obtain the unitary from the circuit given in ref. 18 (Supplementary Fig. 1).
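The test unitaries (i) to (vi) are simple to construct explicitly. The sketch below builds them with the standard library only and checks the stated relation between R_k and R_z(2π/2^k) for k from 2 to 11; the function names and the `controls` parameter are hypothetical conveniences introduced for this illustration.

```python
import cmath
import math

def rz(theta):
    return [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]

def rk(k):
    return [[1, 0], [0, cmath.exp(2j * math.pi / 2 ** k)]]

def diagonal(*entries):
    """Matrix with the given entries on the diagonal and 0 elsewhere."""
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]

def controlled_rz(theta, controls=1):
    """cRz (controls=1) or ccRz (controls=2) as defined in (ii) and (iv)."""
    ones = [1] * (2 ** (controls + 1) - 2)
    return diagonal(*ones, cmath.exp(-1j * theta / 2), cmath.exp(1j * theta / 2))

def controlled_rk(k, controls=1):
    """cRk (controls=1) or ccRk (controls=2) as defined in (iii) and (v)."""
    ones = [1] * (2 ** (controls + 1) - 1)
    return diagonal(*ones, cmath.exp(2j * math.pi / 2 ** k))

def givens(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rk_identity_error(k):
    """Largest entrywise deviation from R_k = e^{2 pi i / 2^(k+1)} R_z(2 pi / 2^k)."""
    theta = 2 * math.pi / 2 ** k
    phase = cmath.exp(2j * math.pi / 2 ** (k + 1))
    rhs = [[phase * x for x in row] for row in rz(theta)]
    return max(abs(a - b) for ra, rb in zip(rk(k), rhs) for a, b in zip(ra, rb))

worst = max(rk_identity_error(k) for k in range(2, 12))
print(worst)  # zero up to floating-point rounding
```

Since R_k and R_z(2π/2^k) differ only by a global phase, they have the same ϵ-T-count under the global phase invariant distance.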

Controlled-Rz(θ) gates appear in many important quantum algorithms like the Quantum Fourier Transform (QFT), phase estimation, factorization, order finding, the hidden subgroup problem, Grover’s search18, and quantum simulations50. Givens rotations appear in a number of quantum chemistry applications51. Before our work there was no algorithm for determining the T-count of multi-qubit approximately implementable unitaries. In fact, previous algorithms returned optimal results only for single-qubit Rz(θ). Where possible, other unitaries were decomposed into a sequence of Rz(θ) gates, for which we know the following empirical formula from ref. 19, where the T-count is averaged over θ.

$${{{{\mathcal{T}}}}}^{\epsilon }({R}_{z}(\theta ))=3.067\,{\log }_{2}\left(1/\epsilon \right)-4.322$$
(4)

Then an upper bound was obtained by adding the T-counts of the component unitaries. For example, in Fig. 1 we show two implementations of the cRz(θ) gate that we found in the literature. In Fig. 1a, two Fredkin gates (T-count = 7, ref. 26), one Rz(θ), and an extra ancilla52 are used. In Fig. 1b, the implementation uses two Rz gates. So the upper bounds on the T-count of cRz(θ), averaged over θ, are as follows.

$$\#{{{\rm{T}}}}(c{R}_{z}(\theta ))=3.067\,{\log }_{2}\left(1/\epsilon \right)-4.322+14=3.067\,{\log }_{2}\left(1/\epsilon \right)+9.678\quad [\,{{\mbox{Figure 1a}}}\,]$$
$$\#{{{\rm{T}}}}(c{R}_{z}(\theta ))=2\left(3.067\,{\log }_{2}\left(1/\epsilon \right)-4.322\right)=6.134\,{\log }_{2}\left(1/\epsilon \right)-8.644\quad [\,{{\mbox{Figure 1b}}}\,]$$

The first upper bound is better (gives less T-count) for every ϵ < 0.016, but the implementation uses an extra ancilla. In Fig. 1c and d we show an implementation of cRk and ccRz(θ), respectively. The latter circuit can be used to implement ccRk by replacing the cRz(θ) with cRk. Upper bounds on T-count can be deduced in a similar manner for the respective unitaries.
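The crossover precision between the two bounds can be recomputed from the formulas above. This is a small arithmetic sketch; the constants are those of Eq. (4), the constant 14 in the first bound is taken as two Fredkin gates of T-count 7 each, and the function names are hypothetical.

```python
import math

A, B = 3.067, -4.322   # slope and offset of the empirical formula, Eq. (4)
FREDKIN_T = 7          # T-count of one Fredkin gate (2 * 7 = 14 in the first bound)

def bound_fig_1a(eps):
    """One Rz plus two Fredkin gates (uses an ancilla)."""
    return A * math.log2(1 / eps) + B + 2 * FREDKIN_T

def bound_fig_1b(eps):
    """Two Rz gates."""
    return 2 * (A * math.log2(1 / eps) + B)

# Setting the two bounds equal, with x = log2(1/eps):
#   A x + B + 14 = 2 A x + 2 B   =>   x = (14 - B) / A
crossover_eps = 2 ** (-(2 * FREDKIN_T - B) / A)

print(crossover_eps)                           # close to 0.016
print(bound_fig_1a(0.01), bound_fig_1b(0.01))  # first bound is smaller here
print(bound_fig_1a(0.05), bound_fig_1b(0.05))  # second bound is smaller here
```

Below the crossover the single Rz synthesis dominates the cost, so paying the fixed overhead of 14 T-gates for the Fredkin gates is worthwhile.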

Fig. 1: Implementations of multi-controlled rotations.

cRz(θ) a implemented by using two Fredkin gates and one Rz(θ) gate, where upper qubit is the control, middle qubit is the target, and bottom qubit is the ancilla; and b implemented by two CNOT and two Rz gates. c cRk implemented by two Toffoli gates and one Rk gate. d ccRz(θ) implemented by two Toffoli, one cRz(θ) and one ancilla set to \(\left\vert 0\right\rangle\).

We took \(\theta =\frac{2\pi }{{2}^{k}}\) and varied k from 2 to 11. We were more interested in synthesizing multi-qubit unitaries, since these had not been T-count-optimally synthesized before. On average, it took 48 min to synthesize a 2-qubit unitary with T-count at most 7, and about 5.7 h for a 3-qubit unitary with T-count at most 4. We have synthesized only one 2-qubit unitary with T-count 9. This is the 2-qubit QFT at ϵ = 10−18 and it took more than 4 days. We have not synthesized 2- and 3-qubit unitaries with higher T-count because of time constraints. We made the following observations.

  1. The T-count of controlled rotation gates reduces as we increase the number of controls, at least for many of the angles and precisions tested by us. This is shown in Table 2. The average running time has been stated above.

    Table 2 Comparison of ϵ-T-count of different (controlled) rotation gates for various angles and precision.

  2. The T-count of the 2-qubit QFT is equal to the T-count of R2, as shown in Table 3. In this table we also show the T-count of the 3-qubit QFT for some precisions. The running time for these tests is explicitly mentioned.

    Table 3 ϵ-T-count of 2 and 3 qubit QFT.

  3. The T-count of Givens(θ) is similar to that of cRz(θ), on average. (So we have not shown it separately.)

  4. The T-count of Rz(θ) (and hence Rk) agrees with the results given in ref. 19.

The numerical results of this subsection, together with instructions on how to reproduce them, are available online at https://github.com/vsoftco/approx-t.

Methods

An exponential time and polynomial space algorithm

In this section, we describe an algorithm for determining the ϵ-T-count of an n-qubit unitary \(W\in {{{{\mathcal{U}}}}}_{n}\). This algorithm has space and time complexity \(O\left({2}^{2n}\right)\) and \(O\left({2}^{2n{{{{\mathcal{T}}}}}_{\epsilon }(W)+4n}\right)\), respectively. First, we derive some results that will help us design our algorithm.

Let U be an exactly implementable unitary such that d(U,W) ≤ ϵ. Then

$$d(W,U)=\sqrt{1-\frac{| Tr({W}^{{\dagger} }U)| }{N}}\le \epsilon \ \Rightarrow \ | Tr({W}^{{\dagger} }U)| \ge N(1-{\epsilon }^{2})$$
(5)

Let \(U=\left(\mathop{\prod }\nolimits_{i = t}^{1}R({P}_{i})\right){C}_{0}{e}^{i\phi }\) for some \({C}_{0}\in {{{{\mathcal{C}}}}}_{n}\) and global phase ϕ, and let W = UE, so that \({W}^{{\dagger} }U={E}^{{\dagger} }\). Then from Eq. (5) we have

$$\left\vert Tr\left({E}^{{\dagger} }\right)\right\vert =\left\vert Tr\left(E\right)\right\vert \ge N(1-{\epsilon }^{2}).$$
(6)

The above implies that \(d(E,{\mathbb{I}})\le \epsilon\). We have

$$\begin{array}{l}\left\vert Tr\left({W}^{{\dagger} }\left(\mathop{\prod }\limits_{i=t}^{1}R({P}_{i})\right)\right)\right\vert \,\,=\,\,\left\vert Tr\left({E}^{{\dagger} }{U}^{{\dagger} }\left(\mathop{\prod }\limits_{i=t}^{1}R({P}_{i})\right)\right)\right\vert \\ \qquad\qquad\qquad\qquad\qquad\quad=\,\,\left\vert Tr\left({E}^{{\dagger} }{e}^{-i\phi }{C}_{0}^{{\dagger} }\left(\mathop{\prod }\limits_{i=1}^{t}{R}^{{\dagger} }({P}_{i})\right)\left(\mathop{\prod }\limits_{i=t}^{1}R({P}_{i})\right)\right)\right\vert \\ \qquad\qquad\qquad\qquad\qquad\quad=\,\,\left\vert Tr\left({E}^{{\dagger} }{C}_{0}^{{\dagger} }\right)\right\vert =\left\vert Tr\left(E{C}_{0}\right)\right\vert \end{array}$$

and similarly

$$\left\vert Tr\left({W}^{{\dagger} }\left(\mathop{\prod }\limits_{i=t}^{1}R({P}_{i})\right){P}_{1}\right)\right\vert =\left\vert Tr\left({E}^{{\dagger} }{C}_{0}^{{\dagger} }{P}_{1}\right)\right\vert =\left\vert Tr\left(E{C}_{0}{P}_{1}\right)\right\vert \qquad [{P}_{1}\in {{{{\mathcal{P}}}}}_{n}]$$
(7)

We now study some properties of \(\left\vert Tr(E{C}_{0}{P}_{1})\right\vert\), which will help us check if we have identified a correct \(\mathop{\prod }\nolimits_{i = t}^{1}R({P}_{i})\). For this, we prove the following theorem.

Theorem 3.1. Let \(E\in {{{{\mathcal{U}}}}}_{n}\) be such that \(\left\vert Tr\left(E\right)\right\vert \ge N\left(1-{\epsilon }^{2}\right)\), for some ϵ ≥ 0. \({C}_{0}={\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{r}_{P}P\) is an n-qubit Clifford. If \(\left\vert \left\{P:{r}_{P}\,\,\ne\,\, 0\right\}\right\vert =M\) then

$$\begin{array}{l}\frac{1-{\epsilon }^{2}}{\sqrt{M}}-\sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\le \left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert \le \frac{1}{\sqrt{M}}\\ \qquad +\sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\quad [if\,{r}_{{P}_{1}}\ne \,0]\end{array}$$
(8)
$$and\quad 0\le \left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert \le \sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\quad [if\,{r}_{{P}_{1}}=0]$$
(9)
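The bounds of Theorem 3.1 can be checked numerically on a small example before reading the proof. The sketch below is an illustration under stated assumptions, not part of the paper's implementation: it takes n = 2, C0 = CNOT (for which M = 4), an error unitary E built as a tensor product of two small rotations, and ϵ set to d(E, 𝕀) so that the hypothesis of the theorem holds with equality. All function names are hypothetical.

```python
import math
from itertools import product as cartesian

PAULI_1Q = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def paulis(n):
    """Yield (label, matrix) for all 4**n n-qubit Pauli operators."""
    for labels in cartesian("IXYZ", repeat=n):
        mat = [[1]]
        for label in labels:
            mat = kron(mat, PAULI_1Q[label])
        yield "".join(labels), mat

def rotation(theta, label):
    """exp(-i theta/2 P) = cos(theta/2) I - i sin(theta/2) P for a 1-qubit Pauli P."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    I, P = PAULI_1Q["I"], PAULI_1Q[label]
    return [[c * I[i][j] - 1j * s * P[i][j] for j in range(2)] for i in range(2)]

def check_theorem_3_1(E, C0, tol=1e-12):
    """Return (label, |Tr(E C0 P1)/N|, lower bound, upper bound) for every P1."""
    N = len(E)
    n = N.bit_length() - 1
    eps_sq = 1 - abs(trace(E)) / N          # eps^2 with eps = d(E, I)
    r = {lab: trace(matmul(C0, P)) / N for lab, P in paulis(n)}
    M = sum(1 for c in r.values() if abs(c) > tol)
    slack = math.sqrt(M) * math.sqrt(max(0.0, 2 * eps_sq - eps_sq ** 2))
    EC0 = matmul(E, C0)
    rows = []
    for lab, P1 in paulis(n):
        value = abs(trace(matmul(EC0, P1))) / N
        if abs(r[lab]) > tol:               # Eq. (8)
            lower = (1 - eps_sq) / math.sqrt(M) - slack
            upper = 1 / math.sqrt(M) + slack
        else:                               # Eq. (9)
            lower, upper = 0.0, slack
        rows.append((lab, value, lower, upper))
    return rows

CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
E = kron(rotation(0.02, "Z"), rotation(0.03, "X"))   # small, illustrative error

rows = check_theorem_3_1(E, CNOT)
all_ok = all(lo - 1e-12 <= v <= hi + 1e-12 for _, v, lo, hi in rows)
print(all_ok)
```

In this example the values at the four Paulis in the support of CNOT stay near 1/√M = 0.5, while the remaining twelve stay near zero, which is the separation the amplitude test relies on.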

Proof. Since E is unitary, we can expand it in the Pauli basis as

$$E=\mathop{\sum}\limits_{P\in {{{{\mathcal{P}}}}}_{n}}{e}_{P}P$$
(10)

where \({e}_{P}=Tr\left(EP\right)/N\) and \({\sum }_{P}{\left\vert {e}_{P}\right\vert }^{2}=1\) (Fact 2.2). Thus

$$1\ge | {e}_{{\mathbb{I}}}| =\left\vert Tr\left(E\right)/N\right\vert \ge (1-{\epsilon }^{2})\qquad [\,{{\mbox{From inequality 6}}}\,]$$
(11)
$$\,{{\mbox{and}}}\,\,\mathop{\sum}\limits_{P\ne {\mathbb{I}}}| {e}_{P}{| }^{2}\le 1-{(1-{\epsilon }^{2})}^{2}=2{\epsilon }^{2}-{\epsilon }^{4}$$
(12)
$$\ \Rightarrow \ | {e}_{P}| \le \sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\qquad [\forall P\,\,\ne \,\,{\mathbb{I}}]$$
(13)

Since \({C}_{0}={\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{r}_{P}P\), from Fact 2.1 and Eq. (2), we know that \(| {r}_{P}| =r=\frac{1}{\sqrt{M}}\) or rP = 0. Then

$$\begin{array}{l}E{C}_{0}\,\,=\,\,\mathop{\sum}\limits_{P\in {{{{\mathcal{P}}}}}_{n}}{r}_{P}EP=\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}}\atop{{r}_{P}\,\ne\, 0}}{r}_{P}EP\\ \,{{\mbox{and}}}\,\,E{C}_{0}{P}_{1}={r}_{{P}_{1}}E+\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\,\ne\, 0}}{r}_{P}E{P}_{1}^{{\prime} }\quad \,{{\mbox{where}}}\,\,{P}_{1}\in {{{{\mathcal{P}}}}}_{n}\setminus \{{\mathbb{I}}\}\,{{\mbox{and}}}\,\,P{P}_{1}={P}_{1}^{{\prime} }\,\ne \,{\mathbb{I}}.\end{array}$$

So

$$\begin{array}{l}\left\vert Tr\left(E{C}_{0}{P}_{1}\right)\right\vert \,=\,\left\vert {r}_{{P}_{1}}Tr\left(E\right)+\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}{r}_{P}Tr\left(E{P}_{1}^{{\prime} }\right)\right\vert \\ \qquad\qquad\qquad=\left\vert \left({r}_{{P}_{1}}{e}_{{\mathbb{I}}}+\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}{r}_{P}{e}_{{P}_{1}^{{\prime} }}\right)\right\vert N\quad [\,{{\mbox{Using Fact 2.2}}}\,]\\ \end{array}$$
(14)
$$\begin{array}{l}\le \left\vert {r}_{{P}_{1}}\right\vert \left\vert {e}_{{\mathbb{I}}}\right\vert N+\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}\left\vert {r}_{P}\right\vert \left\vert {e}_{{P}_{1}^{{\prime} }}\right\vert N\quad [\because | {e}_{{\mathbb{I}}}| \le 1]\\ \le \left\vert {r}_{{P}_{1}}\right\vert N+\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}r\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}N\qquad [\,{{\mbox{Using inequality 13}}}\,]\end{array}$$
(15)

From Eq. (14) we can also obtain the following lower bound.

$$\begin{array}{l}\left\vert Tr\left(E{C}_{0}{P}_{1}\right)\right\vert \,\,\ge \,\,\left\vert {r}_{{P}_{1}}\right\vert \left\vert {e}_{{\Bbb{I}}}\right\vert N-\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}\left\vert {r}_{P}\right\vert \left\vert {e}_{{P}_{1}^{{\prime} }}\right\vert N\\ \qquad\qquad\qquad\,\,\ge \,\,\left\vert {r}_{{P}_{1}}\right\vert (1-{\epsilon }^{2})N-\mathop{\sum}\limits_{{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}\atop{{r}_{P}\ne 0}}r\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}N\quad [\,{{\mbox{Inequality 13 and 11}}}\,]\end{array}$$
(16)

Since \(r=\frac{1}{\sqrt{M}}\), we prove the following inequalities.

$$\begin{array}{l}\frac{1-{\epsilon }^{2}}{\sqrt{M}}-\sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\,\,\le \,\,\left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert \le \frac{1}{\sqrt{M}}+\sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\quad [\,{{\mbox{if}}}\,\,{r}_{{P}_{1}}\ne 0]\\ \qquad\qquad\qquad {{\mbox{and}}}\,\quad 0\,\,\le \,\,\left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert \le \sqrt{M}\sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\quad [\,{{\mbox{if}}}\,\,{r}_{{P}_{1}}=0]\end{array}$$

In essence, this theorem says that if E is close to the identity, then the absolute values of the Pauli-basis coefficients of \(E{C}_{0}\) and \({C}_{0}\) are distributed almost identically. In fact, the calculations in the proof of Theorem 3.1 yield a more general theorem.

Theorem 3.2. Let \(E\in {{{{\mathcal{U}}}}}_{n}\) be such that \(\left\vert Tr\left(E\right)\right\vert \ge N\left(1-{\epsilon }^{2}\right)\), for some ϵ ≥ 0. \(Q={\sum }_{P\in {{{{\mathcal{P}}}}}_{n}}{q}_{P}P\) is an n-qubit unitary. Then for each \({P}_{1}\in {{{{\mathcal{P}}}}}_{n}\),

$$\begin{array}{l}(1-{\epsilon }^{2})| {q}_{{P}_{1}}| -\mathop{\sum}\limits_{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}| {q}_{P}| \sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}\le \left\vert Tr\left(EQ{P}_{1}\right)/N\right\vert \le | {q}_{{P}_{1}}|\\ \qquad\qquad\quad + \mathop{\sum}\limits_{P\in {{{{\mathcal{P}}}}}_{n}\setminus \{{P}_{1}\}}| {q}_{P}| \sqrt{2{\epsilon }^{2}-{\epsilon }^{4}}.\end{array}$$
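The bounds of Theorem 3.2 can be checked numerically. The following is a minimal sketch, not part of the original implementation: the helper names (`pauli_basis`, `near_identity`, `random_unitary`, `check_theorem_3_2`) and the choice of random test matrices are assumptions made purely for illustration, and ϵ is taken as the tightest value satisfying \(\left\vert Tr(E)\right\vert \ge N(1-{\epsilon }^{2})\).

```python
import itertools
import numpy as np

SINGLE = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_basis(n):
    """Return all 4**n n-qubit Pauli matrices (identity first)."""
    basis = []
    for label in itertools.product("IXYZ", repeat=n):
        mat = np.eye(1, dtype=complex)
        for c in label:
            mat = np.kron(mat, SINGLE[c])
        basis.append(mat)
    return basis

def near_identity(N, scale, rng):
    """Unitary exp(i*scale*H) for a random Hermitian H (illustrative choice of E)."""
    a = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    vals, vecs = np.linalg.eigh((a + a.conj().T) / 2)
    return vecs @ np.diag(np.exp(1j * scale * vals)) @ vecs.conj().T

def random_unitary(N, rng):
    """Random unitary Q from the QR decomposition of a complex Gaussian matrix."""
    a = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    q, _ = np.linalg.qr(a)
    return q

def check_theorem_3_2(n=2, scale=0.05, seed=0):
    """Return (all bounds hold, eps) for one random pair (E, Q)."""
    rng = np.random.default_rng(seed)
    N = 2 ** n
    basis = pauli_basis(n)
    E, Q = near_identity(N, scale, rng), random_unitary(N, rng)
    eps_sq = max(1.0 - abs(np.trace(E)) / N, 0.0)   # tightest eps^2
    slack = np.sqrt(2 * eps_sq - eps_sq ** 2)        # sqrt(2 eps^2 - eps^4)
    q = np.array([abs(np.trace(Q @ P)) / N for P in basis])  # |q_P|
    ok = True
    for i, P1 in enumerate(basis):
        rest = q.sum() - q[i]                        # sum of |q_P| over P != P1
        value = abs(np.trace(E @ Q @ P1)) / N
        lower = (1 - eps_sq) * q[i] - rest * slack
        upper = q[i] + rest * slack
        ok = ok and bool(lower - 1e-12 <= value <= upper + 1e-12)
    return ok, float(np.sqrt(eps_sq))
```

Calling `check_theorem_3_2()` with different seeds or scales gives a quick sanity check of the inequality for small n.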

So we can define two sets \({{{{\mathcal{S}}}}}_{1}\) and \({{{{\mathcal{S}}}}}_{0}\) as follows.

$${{{{\mathcal{S}}}}}_{1}=\left\{\left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert :{r}_{{P}_{1}}\,\,\ne\,\, 0\right\}$$
(17)
$${{{{\mathcal{S}}}}}_{0}=\left\{\left\vert Tr\left(E{C}_{0}{P}_{1}\right)/N\right\vert :{r}_{{P}_{1}}=0\right\}$$
(18)

From our results so far, it follows that for small enough ϵ (which is usually the case in nearly all applications) the values in \({{{{\mathcal{S}}}}}_{1}\) are nearly equal, while those in \({{{{\mathcal{S}}}}}_{0}\) are nearly 0. Let \(\Delta =\mathop{\max }\nolimits_{{t}_{1}\in {{{{\mathcal{S}}}}}_{1},{t}_{0}\in {{{{\mathcal{S}}}}}_{0}}({t}_{1}-{t}_{0})\). Then to get a positive difference we have the following.

$$\begin{array}{l}\Delta \,\,\ge \,\,\frac{1-{\epsilon }^{2}}{\sqrt{M}}-2\sqrt{M(2{\epsilon }^{2}-{\epsilon }^{4})} >\, 0\\ \quad\, \Rightarrow \, {\epsilon }^{4}-2{\epsilon }^{2}+\frac{1}{1+4{M}^{2}}\,\ge\, 0\end{array}$$

Solving this we obtain the following conditions.

$${\epsilon }^{2}\ge 1+\sqrt{1-\frac{1}{1+4{M}^{2}}}\quad \,{{\mbox{or}}}\,\quad {\epsilon }^{2}\le 1-\sqrt{1-\frac{1}{1+4{M}^{2}}}$$
(19)

Since usually ϵ < 1, we consider the second inequality. Expanding the square root in a series, we obtain

$${\epsilon }^{2}\le \frac{1}{2}\cdot \frac{1}{1+4{M}^{2}}+\frac{1}{2!}\cdot \frac{1}{2\cdot 2}{\left(\frac{1}{1+4{M}^{2}}\right)}^{2}+\frac{1}{3!}\cdot \frac{3}{2\cdot 2\cdot 2}{\left(\frac{1}{1+4{M}^{2}}\right)}^{3}+\ldots$$

Since this bound decreases with M and \(1\le M\le {N}^{2}\), it is largest at M = 1, so in every case we need

$${\epsilon }^{2}\le 0.105572809\, \Rightarrow \ \epsilon \le 0.3249196962.$$
(20)

For all practical purposes, the value of ϵ is much smaller than this. So we can easily distinguish the sets \({{{{\mathcal{S}}}}}_{0}\) and \({{{{\mathcal{S}}}}}_{1}\).
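The threshold in Eq. (19) and the gap bound preceding it are easy to evaluate for a given M. The sketch below is illustrative only; the function names `eps_threshold` and `gap_lower_bound` are introduced here and do not appear in the original text.

```python
import math

def eps_threshold(M):
    """Largest eps allowed by the second inequality of Eq. (19) for a given M."""
    x = 1.0 / (1.0 + 4.0 * M * M)
    return math.sqrt(1.0 - math.sqrt(1.0 - x))

def gap_lower_bound(eps, M):
    """Lower bound (1 - eps^2)/sqrt(M) - 2*sqrt(M*(2 eps^2 - eps^4)) on the gap Delta."""
    s = 2 * eps ** 2 - eps ** 4
    return (1 - eps ** 2) / math.sqrt(M) - 2 * math.sqrt(M * s)

if __name__ == "__main__":
    for M in (1, 2, 4, 16):
        print(M, eps_threshold(M), gap_lower_bound(0.5 * eps_threshold(M), M))
```

For M = 1 the threshold reproduces the constant in Eq. (20); for larger M the admissible ϵ shrinks, and the gap bound vanishes exactly at the threshold.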

Algorithm. Now we are in a position to describe our exhaustive search algorithm, \({{{{\mathcal{A}}}}}_{MIN}\) (Algorithm 1), that determines the ϵ-T-count of a unitary \(W\in {{{{\mathcal{U}}}}}_{n}\). This is an iterative procedure in which, in every iteration, we decide whether \({{{{\mathcal{T}}}}}_{\epsilon }(W)=m\), for increasing values of a variable m.

Algorithm 1: \({{{{\mathcal{A}}}}}_{MIN}\)

Algorithm 2: \({{{{\mathcal{A}}}}}_{DECIDE}\)

Algorithm 3: \({{{{\mathcal{A}}}}}_{CONJ}\)

The main idea for solving the decision version is as follows. Suppose we have to test whether \({{{{\mathcal{T}}}}}_{\epsilon }(W)=m\). From the definitions given in the “Preliminaries” section, we know that if this is true then \(\exists U\in {{{{\mathcal{J}}}}}_{n}\) such that \(d(U,W)\le \epsilon\) and \({{{\mathcal{T}}}}(U)\,=\,m\). Let \(U=\left(\mathop{\prod }\nolimits_{i = m}^{1}{U}_{i}\right){C}_{0}\) where \({C}_{0}\in {{{{\mathcal{C}}}}}_{n}\) and \({U}_{i}\in \{R(P):P\,\ne\, {\mathbb{I}}\}\). Let \(\widetilde{U}=\mathop{\prod }\nolimits_{i = m}^{1}{U}_{i}\). To test whether we have guessed the correct \(\widetilde{U}\), we apply the results deduced in the previous section. Specifically, we calculate \({W}^{{\prime} }={W}^{{\dagger} }\widetilde{U}\) and then the set \({{{{\mathcal{S}}}}}_{c}=\left\{| Tr({W}^{{\prime} }P)/N| :P\in {{{{\mathcal{P}}}}}_{n}\right\}\) of coefficients. We then check whether it splits into two distinguishable subsets \({{{{\mathcal{S}}}}}_{0}\) and \({{{{\mathcal{S}}}}}_{1}\), as in Eqs. (17) and (18), for some \(1\le M\le {N}^{2}\). Further details are given in Algorithm 2. We call this the amplitude test. A unitary that passes this test has the form EQ, where Q is a unitary. The test effectively filters on the approximate values of the coefficients of Q in the Pauli basis (Theorem 3.2): after passing it, Q is a unitary whose coefficients have equal or nearly equal absolute values at some points and are zero or nearly zero at the others. To ensure that Q is a Clifford, i.e., \({W}^{{\prime} }={E}^{{\dagger} }{C}_{0}\) for some \({C}_{0}\in {{{{\mathcal{C}}}}}_{n}\), we perform the conjugation test (Algorithm 3) for further verification.
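The amplitude test can be sketched as follows. The body of Algorithm 2 is not reproduced in this section, so this is only an illustration under stated assumptions: the acceptance windows are taken directly from the bounds in Eqs. (8) and (9), and the names `pauli_basis` and `amplitude_test` are hypothetical.

```python
import itertools
import numpy as np

SINGLE = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_basis(n):
    """Return (labels, matrices) for all 4**n n-qubit Paulis."""
    labels, mats = [], []
    for label in itertools.product("IXYZ", repeat=n):
        mat = np.eye(1, dtype=complex)
        for c in label:
            mat = np.kron(mat, SINGLE[c])
        labels.append("".join(label))
        mats.append(mat)
    return labels, mats

def amplitude_test(W_prime, eps):
    """Return (M, support labels) if the coefficients split into S_1 and S_0, else None."""
    N = W_prime.shape[0]
    n = int(round(np.log2(N)))
    labels, mats = pauli_basis(n)
    coeffs = np.array([abs(np.trace(W_prime @ P)) / N for P in mats])
    order = np.argsort(-coeffs)                      # largest coefficients first
    s = np.sqrt(2 * eps ** 2 - eps ** 4)
    for M in range(1, N * N + 1):
        root = np.sqrt(M)
        low = (1 - eps ** 2) / root - root * s       # lower end of the S_1 window, Eq. (8)
        high = 1 / root + root * s                   # upper end of the S_1 window, Eq. (8)
        zero_high = root * s                         # upper end of the S_0 window, Eq. (9)
        if zero_high >= low:
            continue                                 # windows overlap: no decision for this M
        top, rest = coeffs[order[:M]], coeffs[order[M:]]
        if np.all((top >= low) & (top <= high)) and np.all(rest <= zero_high):
            return M, sorted(labels[i] for i in order[:M])
    return None
```

For example, a slightly perturbed Hadamard gate should be accepted with M = 2 and support {X, Z}, whereas a perturbed T gate, whose coefficients on I and Z have different magnitudes, should be rejected.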

Theorem 3.3. Let \(E,Q\in {{{{\mathcal{U}}}}}_{n}\) be such that \(d(E,{\mathbb{I}})\le \epsilon\), and let \({P}^{{\prime}}\in {{{{\mathcal{P}}}}}_{n}\setminus \{{\mathbb{I}}\}\) be such that \(Q{P}^{{\prime}}{Q}^{{\dagger}}={\sum}_{P}{\alpha}_{P}P\), where \({\alpha}_{P}\in {\mathbb{C}}\). Then for each \({P}^{{\prime}{\prime}}\in {{{{\mathcal{P}}}}}_{n}\),

$$\begin{array}{l}\min \left\{\right.0,| {\alpha }_{{P}^{{\prime}{\prime} }}| (1-4{\epsilon }^{2}+2{\epsilon }^{4})\\-2\epsilon \mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \left.\right\}\, \le \,\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)\right\vert /N\\ \qquad\qquad\qquad\quad\le \max \{| {\alpha }_{{P}^{{\prime}{\prime} }}| +2\epsilon \mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| ,1\}\end{array}$$

Proof. We have the following.

$$\begin{array}{l}{E}^{{\dagger} }(Q{P}^{{\prime} }{Q}^{{\dagger} })E\,=\,\mathop{\sum}\limits_{\hat{P}}| {e}_{\hat{P}}{| }^{2}\hat{P}\left(\mathop{\sum}\limits_{P}{\alpha }_{P}P\right)\hat{P}+\mathop{\sum}\limits_{\hat{P}\ne \tilde{P}}\overline{{e}_{\hat{P}}}{e}_{\tilde{P}}\hat{P}\left(\mathop{\sum}\limits_{P}{\alpha }_{P}P\right)\tilde{P}\\ \qquad \qquad \qquad\,=\,| {e}_{{\mathbb{I}}}{| }^{2}\mathop{\sum}\limits_{P}{\alpha }_{P}P+\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}},P}| {e}_{\hat{P}}{| }^{2}{\alpha }_{P}\hat{P}P\hat{P}+\mathop{\sum}\limits_{\hat{P}\ne \tilde{P},P}\overline{{e}_{\hat{P}}}{e}_{\tilde{P}}{\alpha }_{P}\hat{P}P\tilde{P}\end{array}$$
(21)

Multiplication by \({P}^{{\prime}{\prime} }\) gives us the following.

$$\begin{array}{l}({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\,=\,| {e}_{{\mathbb{I}}}{| }^{2}{\alpha }_{{P}^{{\prime}{\prime} }}{\mathbb{I}}+| {e}_{{\mathbb{I}}}{| }^{2}\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}{\alpha }_{P}P{P}^{{\prime}{\prime} }+\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}}}| {e}_{\hat{P}}{| }^{2}{\alpha }_{{P}^{{\prime}{\prime} }}(\pm {\mathbb{I}})\\ \qquad\qquad\qquad\qquad +\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}},P\ne {P}^{{\prime}{\prime} }}| {e}_{\hat{P}}{| }^{2}{\alpha }_{P}\hat{P}P\hat{P}{P}^{{\prime}{\prime} }+\mathop{\sum}\limits_{\hat{P}\ne \tilde{P}}\overline{{e}_{\hat{P}}}{e}_{\tilde{P}}{\alpha }_{{P}^{{\prime}{\prime} }}\hat{P}{P}^{{\prime}{\prime} }\tilde{P}{P}^{{\prime}{\prime} }\\ \qquad\qquad\qquad\qquad +\mathop{\sum}\limits_{\hat{P}\ne \tilde{P},P\ne {P}^{{\prime}{\prime} }}\overline{{e}_{\hat{P}}}{e}_{\tilde{P}}{\alpha }_{P}\hat{P}P\tilde{P}{P}^{{\prime}{\prime} }\end{array}$$

So,

$$\begin{array}{l}\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)/N\right\vert \,\le \,| {\alpha }_{{P}^{{\prime}{\prime} }}| \mathop{\sum}\limits_{\hat{P}}| {e}_{\hat{P}}{| }^{2}\\ \qquad \qquad\qquad\qquad\qquad\qquad +\mathop{\sum}\limits_{\hat{P}\ne \tilde{P},P\ne {P}^{{\prime}{\prime} }}| \overline{{e}_{\hat{P}}}{e}_{\tilde{P}}{\alpha }_{P}| \left\vert Tr(\hat{P}P\tilde{P}{P}^{{\prime}{\prime} })/N\right\vert \\ \qquad \qquad\qquad\qquad\qquad\qquad =\,| {\alpha }_{{P}^{{\prime}{\prime} }}| +\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \mathop{\sum}\limits_{\hat{P}\ne \tilde{P}}| \overline{{e}_{\hat{P}}}{e}_{\tilde{P}}| \left\vert Tr(\hat{P}P\tilde{P}{P}^{{\prime}{\prime} })/N\right\vert \qquad [\,{{\mbox{Fact}}}\,\,2.2]\end{array}$$

Given \({P}^{{\prime}{\prime} },P,\hat{P}\), we can have \(\hat{P}P\tilde{P}{P}^{{\prime}{\prime} }=\pm {\mathbb{I}}\) for one particular value of \(\tilde{P}\).

$$\begin{array}{l}\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)/N\right\vert \le | {\alpha }_{{P}^{{\prime}{\prime} }}| +\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \mathop{\sum}\limits_{\hat{P}}| \overline{{e}_{\hat{P}}}| | {e}_{\hat{{P}^{{\prime} }}}| \\ \qquad [\hat{{P}^{{\prime} }}\,\ne\, \hat{P}\,\,{{\mbox{is such that}}}\,\,\hat{P}P\hat{{P}^{{\prime} }}{P}^{{\prime}{\prime} }=\pm \,{\mathbb{I}}]\end{array}$$

Let \({\mathbb{I}}P\hat{{P}_{0}^{{\prime} }}{P}^{{\prime}{\prime} }=\pm {\mathbb{I}}\) and \(\hat{{P}_{0}}P{\mathbb{I}}{P}^{{\prime}{\prime} }=\pm {\mathbb{I}}\), for some Paulis \(\hat{{P}_{0}^{{\prime} }},\hat{{P}_{0}}\in {{{{\mathcal{P}}}}}_{n}\setminus \{{\mathbb{I}}\}\). Then we can write

$$\begin{array}{l}\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)/N\right\vert \,\le \,| {\alpha }_{{P}^{{\prime}{\prime} }}| +\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \left(| {e}_{{\Bbb{I}}}| | {e}_{\hat{{P}_{0}^{{\prime} }}}| +| {e}_{\hat{{P}_{0}}}| | {e}_{{\mathbb{I}}}| +\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}},\hat{{P}_{0}}}| {e}_{\hat{P}}| | {e}_{\hat{{P}^{{\prime} }}}| \right)\\ \qquad\qquad\qquad\qquad\qquad\qquad\,\le \,| {\alpha }_{{P}^{{\prime}{\prime} }}| +\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \left(| {e}_{\hat{{P}_{0}^{{\prime} }}}| +| {e}_{\hat{{P}_{0}}}| +\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}},\hat{{P}_{0}}}| {e}_{\hat{P}}| | {e}_{\hat{{P}^{{\prime} }}}| \right)\quad [\,{{\mbox{Equation 11}}}]\end{array}$$

In Supplementary Note 2 we show that \({\sum }_{\hat{P}\ne {\mathbb{I}},\hat{{P}_{0}}}| {e}_{\hat{P}}| | {e}_{\hat{{P}^{{\prime} }}}| \le (2{\epsilon }^{2}-{\epsilon }^{4})\). We observe that in the above inequality we have taken \(| {e}_{{\mathbb{I}}}| \le 1\), but if \(| {e}_{{\mathbb{I}}}| =1\) then \({e}_{P}=0\) for every \(P\,\ne \, {\mathbb{I}}\), since \({\sum }_{P}{\left\vert {e}_{P}\right\vert }^{2}=1\). So the sum within the bracket is non-zero only if \(| {e}_{{\mathbb{I}}}| < 1\). To maximize \(| {e}_{\hat{{P}_{0}^{{\prime} }}}| +| {e}_{\hat{{P}_{0}}}|\) subject to Eq. (12), consider the optimization problem in these two variables only; it is not difficult to see that the maximum occurs when they have the same value. That is, \(| {e}_{\hat{{P}_{0}}}| +| {e}_{\hat{{P}_{0}^{{\prime} }}}| \le 2\sqrt{\frac{2{\epsilon }^{2}-{\epsilon }^{4}}{2}}\approx 2\epsilon\). Ignoring higher order terms in ϵ, we can write the following.

$$\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)/N\right\vert \le | {\alpha }_{{P}^{{\prime}{\prime} }}| +2\epsilon \mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}|$$

We also have the following lower bound using similar reasoning as above.

$$\begin{array}{l}\left\vert Tr\left(({E}^{{\dagger} }Q{P}^{{\prime} }{Q}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)/N\right\vert \,\ge \,| {\alpha }_{{P}^{{\prime}{\prime} }}| \left(| {e}_{{\mathbb{I}}}{| }^{2}-\mathop{\sum}\limits_{\hat{P}\ne {\mathbb{I}}}| {e}_{\hat{P}}{| }^{2}\right)\\\qquad\qquad\qquad\qquad\qquad\qquad-\mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \mathop{\sum}\limits_{\hat{P}\ne \tilde{P}}| \overline{{e}_{\hat{P}}}{e}_{\tilde{P}}| \left\vert Tr(\hat{P}P\tilde{P}{P}^{{\prime}{\prime} })/N\right\vert \\ \qquad\qquad\qquad\qquad\qquad\qquad \ge \,| {\alpha }_{{P}^{{\prime}{\prime} }}| (1-4{\epsilon }^{2}+2{\epsilon }^{4})-2\epsilon \mathop{\sum}\limits_{P\ne {P}^{{\prime}{\prime} }}| {\alpha }_{P}| \end{array}$$

And we have the following corollary.

Corollary 3.1. Let \({C}_{0}\in {{{{\mathcal{C}}}}}_{n}\) and \({P}^{{\prime} }\in {{{{\mathcal{P}}}}}_{n}\) such that \({C}_{0}{P}^{{\prime} }{C}_{0}^{{\dagger} }=\tilde{P}\in {{{{\mathcal{P}}}}}_{n}\). If \(E\in {{{{\mathcal{U}}}}}_{n}\) such that \(d(E,{\Bbb{I}})\le \epsilon\), then

$$\begin{array}{l}(1-4{\epsilon }^{2}+2{\epsilon }^{4})\,\,\le \,\,\left\vert Tr\left(({E}^{{\dagger} }{C}_{0}{P}^{{\prime} }{C}_{0}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)\right\vert /N\le 1\qquad if\,{P}^{{\prime}{\prime} }=\tilde{P}\\ \qquad\qquad\qquad 0\,\,\le \,\,\left\vert Tr\left(({E}^{{\dagger} }{C}_{0}{P}^{{\prime} }{C}_{0}^{{\dagger} }E){P}^{{\prime}{\prime} }\right)\right\vert /N\le 2\epsilon \qquad else.\end{array}$$

The above theorem and corollary essentially say that \(E{C}_{0}\) (or \({E}^{{\dagger} }{C}_{0}\)) approximately inherits the conjugation property of \({C}_{0}\). For each \({P}^{{\prime} }\in {{{{\mathcal{P}}}}}_{n}\), if we expand \({C}_{0}{P}^{{\prime} }{C}_{0}^{{\dagger} }\) in the Pauli basis then the absolute value of the coefficients is 1 at one point and 0 at the rest. If we expand \(E{C}_{0}{P}^{{\prime} }{C}_{0}^{{\dagger} }{E}^{{\dagger} }\) in the Pauli basis then one of the coefficients (in absolute value) will be almost 1 and the rest will be almost 0. From Theorem 3.3, this pattern will fail for at least one Pauli \({P}^{{\prime}{\prime}{\prime} }\in {{{{\mathcal{P}}}}}_{n}\) if we have EQ, where \(Q\,\notin \, {{{{\mathcal{C}}}}}_{n}\): if we expand \(EQ{P}^{{\prime}{\prime}{\prime} }{Q}^{{\dagger} }{E}^{{\dagger} }\) or \({E}^{{\dagger} }Q{P}^{{\prime}{\prime}{\prime} }{Q}^{{\dagger} }E\) in the Pauli basis then the spike in the amplitudes will be at two or more points. Also, we observe that \(2\epsilon < 1-4{\epsilon }^{2}+2{\epsilon }^{4}\) for any ϵ ≤ 0.31. Thus there exists a distinguishable gap between the two cases of Corollary 3.1. For all practical purposes ϵ is much less than this value.
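A conjugation test along these lines can be sketched as follows. The body of Algorithm 3 is not reproduced in this section, so this is only an illustration that applies the two windows of Corollary 3.1 directly; the names `pauli_basis` and `conjugation_test` are hypothetical, and the early exit on the first failing Pauli is an assumption about how such a check would be organised.

```python
import itertools
import numpy as np

SINGLE = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_basis(n):
    """Return all 4**n n-qubit Pauli matrices (identity first)."""
    basis = []
    for label in itertools.product("IXYZ", repeat=n):
        mat = np.eye(1, dtype=complex)
        for c in label:
            mat = np.kron(mat, SINGLE[c])
        basis.append(mat)
    return basis

def conjugation_test(W_prime, eps):
    """True if W' P W'^dagger has exactly one spike and otherwise tiny coefficients, for every P != I."""
    N = W_prime.shape[0]
    n = int(round(np.log2(N)))
    basis = pauli_basis(n)
    spike = 1 - 4 * eps ** 2 + 2 * eps ** 4      # lower end of the "almost 1" window
    noise = 2 * eps                              # upper end of the "almost 0" window
    if noise >= spike:
        raise ValueError("eps too large: the two windows of Corollary 3.1 overlap")
    for P_out in basis[1:]:                      # skip the identity
        A = W_prime @ P_out @ W_prime.conj().T
        coeffs = np.array([abs(np.trace(A @ P_in)) / N for P_in in basis])
        n_spikes = np.count_nonzero(coeffs >= spike)
        n_small = np.count_nonzero(coeffs <= noise)
        if n_spikes != 1 or n_small != len(basis) - 1:
            return False                         # pattern broken: not (close to) a Clifford
    return True
```

With \({W}^{{\prime} }={E}^{{\dagger} }Q\), a perturbed Hadamard passes this check, while a perturbed T gate fails at the Pauli X, whose conjugate has two coefficients of magnitude about \(1/\sqrt{2}\).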

Synthesizing T-count-optimal circuits

So far we have been able to determine \({{{{\mathcal{T}}}}}_{\epsilon }(W)\) for any \(W\in {{{{\mathcal{U}}}}}_{n}\). We now describe how to synthesize an ϵ-T-count-optimal circuit for W using the above algorithms. It is easy to see that \({{{{\mathcal{A}}}}}_{DECIDE}\) can return a sequence \(\{{U}_{m},\ldots ,{U}_{1}\}\) of unitaries such that \(U=\left(\mathop{\prod }\nolimits_{i = {{{\mathcal{T}}}}(U)}^{1}{U}_{i}\right){C}_{0}{e}^{i\phi }\) (for some \({C}_{0}\in {{{{\mathcal{C}}}}}_{n}\)) and \({{{\mathcal{T}}}}(U)={{{{\mathcal{T}}}}}_{\epsilon }(W)\). We can efficiently construct circuits for each \({U}_{i}\in \{R(P):P\,\ne \, {\mathbb{I}}\}\) using Fact 1 of Supplementary Note 1. So what remains is to determine \({C}_{0}\). We can then efficiently construct a circuit for it, for example, by using the results in ref. 53.

If W = UE then at step 3 of Algorithm 2 we calculate \({W}^{{\prime} }={W}^{{\dagger} }\widetilde{U}={e}^{-i\phi }{E}^{{\dagger} }{C}_{0}^{{\dagger} }\), where \(\widetilde{U}=\mathop{\prod }\nolimits_{i = m}^{1}{U}_{i}\). From Algorithm 2 we can also obtain the following information: (1) the set \({{{{\mathcal{S}}}}}_{1}\), as defined in Eq. (17), and (2) \(M=| {{{{\mathcal{S}}}}}_{1}|\). Thus we can calculate \(r=\frac{1}{\sqrt{M}}\) and from step 4 we can actually calculate the set \(\widetilde{{{{{\mathcal{S}}}}}_{1}}=\left\{({t}_{P},P):| {t}_{P}| =| Tr({W}^{{\prime} }P)/N| \in {{{{\mathcal{S}}}}}_{1}\right\}\). From Eq. (14) we can say that for small enough ϵ (say \(\ll \frac{1}{M}\)) we have

$$\frac{Tr\left({E}^{{\dagger} }{C}_{0}^{{\dagger} }{P}_{1}\right)}{Tr\left({E}^{{\dagger} }{C}_{0}^{{\dagger} }{P}_{2}\right)}\approx \frac{\overline{{r}_{{P}_{1}}}}{\overline{{r}_{{P}_{2}}}}$$
(22)

We perform the following steps.

  1. Calculate \({a}_{P}=\frac{{t}_{P}}{{t}_{{\mathbb{I}}}}=\frac{Tr({W}^{{\prime} }P)}{Tr({W}^{{\prime} })}\), where \(({t}_{P},P)\in \widetilde{{{{{\mathcal{S}}}}}_{1}}\) (or equivalently \(| {t}_{P}| \in {{{{\mathcal{S}}}}}_{1}\)). We must remember that \(({t}_{{\mathbb{I}}},{\mathbb{I}})\in \widetilde{{{{{\mathcal{S}}}}}_{1}}\).

    As explained above, from Eq. (14), \({a}_{P}\approx \frac{\overline{{r}_{P}}}{\overline{{r}_{{\Bbb{I}}}}}\). From Fact 2.1 we know that \(\frac{| \overline{{r}_{P}}| }{| \overline{{r}_{{\Bbb{I}}}}| }=\frac{| {r}_{P}| }{| {r}_{{\Bbb{I}}}| }=1\). So we adjust the fractions such that their absolute value is 1. For small enough ϵ this adjustment is small, and so with a slight abuse of notation we use the same symbols for the adjusted values.

  2. Select \(c,d\in {\mathbb{R}}\) such that \({c}^{2}+{d}^{2}={r}^{2}\). Let \(\widetilde{{r}_{{\mathbb{I}}}}=c+di\). Then we claim that the Clifford \(\widetilde{{C}_{0}}=\widetilde{{r}_{{\mathbb{I}}}}{\sum }_{P:{r}_{P}\ne 0}\overline{{a}_{P}}P\) is sufficient for our purpose.

It is not hard to see that \(\widetilde{{C}_{0}}={e}^{i\phi {\prime} }{C}_{0}\) for some \(\phi {\prime} \in [0,2\pi )\). Thus if \({U}^{{\prime} }=\left(\mathop{\prod }\nolimits_{i = m}^{1}{U}_{i}\right)\widetilde{{C}_{0}}\), then \({{{\mathcal{T}}}}({U}^{{\prime} })={{{\mathcal{T}}}}(U)\) and \(d({U}^{{\prime} },W)\le \epsilon\).
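The two steps above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the function name `recover_clifford` is hypothetical, the support of \({{{{\mathcal{S}}}}}_{1}\) is assumed to be supplied as a list of Pauli labels, and the free choice in step 2 is fixed arbitrarily to c = r, d = 0.

```python
import itertools
import numpy as np

SINGLE = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_matrix(label):
    """Matrix of the n-qubit Pauli given by a string over 'IXYZ'."""
    mat = np.eye(1, dtype=complex)
    for c in label:
        mat = np.kron(mat, SINGLE[c])
    return mat

def recover_clifford(W_prime, support):
    """Rebuild C_0 up to a global phase from W' ~ e^{-i phi} E^dagger C_0^dagger.

    `support` lists the Pauli labels P with r_P != 0 and must contain the identity.
    """
    N = W_prime.shape[0]
    n = int(round(np.log2(N)))
    identity = "I" * n
    if identity not in support:
        raise ValueError("the procedure needs the identity in the support")
    r = 1.0 / np.sqrt(len(support))              # r = 1/sqrt(M)
    t_id = np.trace(W_prime) / N                 # t_I
    C = np.zeros((N, N), dtype=complex)
    for label in support:
        P = pauli_matrix(label)
        a = (np.trace(W_prime @ P) / N) / t_id   # step 1: a_P = t_P / t_I
        a = a / abs(a)                           # adjust to absolute value 1
        C += np.conj(a) * P
    return r * C                                 # step 2 with r_tilde = r + 0i
```

One way to check the result is to compare it with the known Clifford through the overlap \(\left\vert Tr({\widetilde{{C}_{0}}}^{{\dagger} }{C}_{0})\right\vert /N\), which should be close to 1 when the two agree up to a global phase.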

Time complexity: worst case analysis

We first analyse the time complexity of \({{{{\mathcal{A}}}}}_{CONJ}\). The outer and inner loops at steps 2 and 6, respectively, can run at most \({N}^{2}\cdot {N}^{2}={N}^{4}\) times, where \(N={2}^{n}\). At step 7, multiplication of four N × N matrices takes \(O({N}^{2})\) time and calculating the trace takes N time steps. So the overall complexity at step 7 is dominated by \(O({N}^{2})\). We note that at step 13 we do not need to calculate the product and trace again. In the worst case every iteration of the loops at steps 2 and 6 is executed, incurring an overall time complexity of \(O({N}^{6})\).

Now we analyse the time complexity of \({{{{\mathcal{A}}}}}_{DECIDE}\), when testing for a particular T-count m. The algorithm loops over all possible products of m unitaries \({U}_{j}\), each of which is an R(P) in the case of T-count-decision. Since there are \({N}^{2}-1\) non-identity Paulis P, this loop runs at most \({N}^{2m}\) times. In each iteration we do m matrix multiplications at steps 2 and 3, which has time complexity \(O(m{N}^{2})\). At step 4 we make a list of \({N}^{2}\) real numbers. Each is obtained by multiplying two N × N matrices and then taking the trace, so the time complexity of making this list is \(O({N}^{4})\). Sorting this list takes time \(O(n{N}^{2})\). The inner loop 5–13 runs \({N}^{2}\) times. Each of the list elements is checked, so step 8 has complexity \(O({N}^{2})\). Now suppose that within the inner loop the conjugation test is called \({k}_{1}\) times. Then the loop 5–13 incurs a complexity \(O({k}_{1}{N}^{6}+({N}^{2}-{k}_{1}){N}^{2})\) when \({k}_{1} > 0\), and \(O({N}^{4})\) otherwise. Let k be the number of outer loops (steps 2–14) for which the conjugation test is invoked in the inner loop 5–13, and let \({k}_{1}\) be the maximum number of times this test is called within any 5–13 inner loop. Then the overall complexity of \({{{{\mathcal{A}}}}}_{DECIDE}\) is \(O(({N}^{2m}-k)(m{N}^{2}+{N}^{4}+n{N}^{2}+{N}^{4})+k(m{N}^{2}+{N}^{4}+n{N}^{2}+{k}_{1}{N}^{6}+({N}^{2}-{k}_{1}){N}^{2}))\in O(({N}^{2m}-k){N}^{4}+k({k}_{1}{N}^{6}+({N}^{2}-{k}_{1}){N}^{2}))\), assuming \(m < {N}^{2}\).

The conjugation test is invoked only if a unitary passes the amplitude test. We assume that non-Clifford unitaries with equal amplitudes occur infrequently enough that \(k{k}_{1} < {N}^{2m}-k\). (We did check this in our implementations.) Thus \({{{{\mathcal{A}}}}}_{DECIDE}\) has a complexity of \(O({N}^{2m+4})\), for one particular value of m. Hence, the overall algorithm \({{{{\mathcal{A}}}}}_{MIN}\) has a time complexity \(O({N}^{2{{{{\mathcal{T}}}}}_{\epsilon }(W)+4})\in O({2}^{2n{{{{\mathcal{T}}}}}_{\epsilon }(W)+4n})\), with the given assumption.
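As a rough illustration of how this bound scales (the values n = 2 and m = 3 are chosen here purely for illustration and are not taken from the experiments), with \(N={2}^{n}=4\) the bound on the number of m-length products and the bound on the total cost evaluate to

$${N}^{2m}={4}^{6}=4096,\qquad {N}^{2m+4}={4}^{10}=1{,}048{,}576={2}^{2nm+4n}={2}^{20}.$$

Each additional T gate multiplies both quantities by \({N}^{2}=16\).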

Time complexity: practical considerations

  1. It is not hard to see that if \([{P}_{1},{P}_{2}]={P}_{1}{P}_{2}-{P}_{2}{P}_{1}=0\) then \(R({P}_{1})R({P}_{2})=(\alpha {\mathbb{I}}+\beta {P}_{1})(\alpha {\mathbb{I}}+\beta {P}_{2})=R({P}_{2})R({P}_{1})\), where \(\alpha =\frac{1}{2}\left(1+{e}^{i\pi /4}\right)\) and \(\beta =\frac{1}{2}\left(1-{e}^{i\pi /4}\right)\). Thus

    $$\left(\mathop{\prod}\limits_{i}R({P}_{i})\right)R({P}_{1})R({P}_{2})\left(\mathop{\prod}\limits_{j}R({P}_{j})\right)=\left(\mathop{\prod}\limits_{i}R({P}_{i})\right)R({P}_{2})R({P}_{1})\left(\mathop{\prod}\limits_{j}R({P}_{j})\right).$$

    So in step 2 of Algorithm 2 we need not loop over all m-length products of R(P). It is easy to check whether two n-qubit Paulis commute: they do so if and only if there is an even number of positions where the respective 1-qubit Paulis are non-identity and different. We need not perform any actual matrix multiplication. This can speed up the actual running time substantially. For example, for the unitaries considered by us, we obtained a speed-up of 5–10 times. In fact, it may be possible to show that the asymptotic complexity also decreases. One can work out more such symmetries in order to prune the search space.

  2. We have already assumed that few non-Cliffords pass the amplitude test. Even if such a unitary is tested in \({{{{\mathcal{A}}}}}_{CONJ}\), we need not run all \({N}^{4}\) loop iterations. As soon as there are 2 spikes for any outer-loop Pauli \({P}_{out}\), the program exits (step 10). If a non-spike is “far enough” from 0, the program also exits (step 14). So in most cases testing a non-Clifford with equal amplitudes takes less time. If the unitary is a Clifford then all \({N}^{4}\) iterations have to run, but this implies that \({{{{\mathcal{A}}}}}_{DECIDE}\) has obtained the T-count.

  3. Most of the matrix multiplications, especially those by Paulis, involve sparse matrices, so here the run-time complexity is lower. In step 2 of Algorithm 2 one has to repeatedly multiply a unitary U by \(R(P)=\alpha {\mathbb{I}}+\beta P\). Since P is sparse, with exactly one non-zero entry in each row and column, the product UP merely permutes and rescales the columns of U, and \(UR(P)=\alpha U+\beta UP\) can then be formed entrywise. This can reduce the practical running time.
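Items 1 and 3 above can be illustrated with a short sketch. The function names (`paulis_commute`, `R`, `apply_R`) and the representation of Paulis as strings over `IXYZ` are assumptions made here for illustration; they are not taken from the authors' code.

```python
import itertools
import numpy as np

SINGLE = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}
ALPHA = 0.5 * (1 + np.exp(1j * np.pi / 4))
BETA = 0.5 * (1 - np.exp(1j * np.pi / 4))

def pauli_matrix(label):
    """Matrix of the n-qubit Pauli given by a string over 'IXYZ'."""
    mat = np.eye(1, dtype=complex)
    for c in label:
        mat = np.kron(mat, SINGLE[c])
    return mat

def paulis_commute(p, q):
    """Commutation check on labels only: count positions with different non-identity factors."""
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 0

def R(label):
    """Dense R(P) = alpha*I + beta*P."""
    P = pauli_matrix(label)
    return ALPHA * np.eye(P.shape[0]) + BETA * P

def apply_R(U, label):
    """U @ R(P) without a dense matrix product, using one non-zero entry per column of P."""
    P = pauli_matrix(label)
    cols = np.arange(P.shape[0])
    rows = np.abs(P).argmax(axis=0)      # row of the non-zero entry in each column of P
    UP = U[:, rows] * P[rows, cols]      # column j of U@P is U[:, rows[j]] * P[rows[j], j]
    return ALPHA * U + BETA * UP
```

In an enumeration over products of R(P), `paulis_commute` could be used to skip orderings that differ only by swapping adjacent commuting factors, and `apply_R` keeps each extension of a product at \(O({N}^{2})\) arithmetic operations once the non-zero pattern of P is known.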

Space complexity

The input to our algorithm is an N × N unitary, with space complexity \(O({N}^{2})\). In step 1 of \({{{{\mathcal{A}}}}}_{DECIDE}\), we can store the single-qubit Paulis and calculate R(P) whenever required. We require \(O({N}^{2})\) space to perform matrix multiplication of two N × N matrices. In \({{{{\mathcal{A}}}}}_{DECIDE}\), we either store an N × N matrix or a list of \({N}^{2}\) real numbers (step 4). Even in \({{{{\mathcal{A}}}}}_{CONJ}\) we store one N × N matrix. Hence the overall space complexity is \(O({N}^{2})\in O({2}^{2n})\), without storing R(P). This increases the running time because we have to calculate the n-qubit Paulis and R(P) repeatedly, but the asymptotic time complexity remains unchanged.

If we store the n-qubit Paulis or R(P), then we require \(O({N}^{4})\) space. This factor dominates and the overall space complexity is \(O({N}^{4})\in O({2}^{4n})\). In this approach, the actual running time reduces but the asymptotic time complexity remains the same.

T-depth-optimal synthesis

Algorithms 1 (\({{{{\mathcal{A}}}}}_{MIN}\)) and 2 (\({{{{\mathcal{A}}}}}_{DECIDE}\)) can be used for T-depth-optimal synthesis of any multi-qubit unitary, since we know there is a finite generating set27 such that any exactly implementable unitary can be written as a product of elements from this set and a Clifford. We first give some definitions.

Definition 3.1 (T-depth of circuits). Suppose the unitary U implemented by a circuit is written as a product \(U={U}_{m}{U}_{m-1}\ldots {U}_{1}\) such that each \({U}_{i}\) can be implemented by a circuit in which all the gates can act in parallel or simultaneously. We say \({U}_{i}\) has depth 1 and m is the depth of the circuit. The T-depth of a circuit is the number of unitaries \({U}_{i}\) where the \({{{\rm{T}}}}/{{{{\rm{T}}}}}^{{\dagger} }\) gate is the only non-Clifford gate and all the \({{{\rm{T}}}}/{{{{\rm{T}}}}}^{{\dagger} }\) gates can act in parallel. (The remaining Clifford gates within each \({U}_{i}\) may not act in parallel.)

Definition 3.2 (T-depth of exactly implementable unitaries). The T-depth or min-T-depth of an exactly synthesizable unitary \(U\in {{{{\mathcal{J}}}}}_{n}\), denoted by \({{{{\mathcal{T}}}}}_{d}(U)\), is the minimum T-depth of a Clifford+T circuit that implements it (up to a global phase).

In ref. 27 a subset, \({{\mathbb{V}}}_{n}\subset \{{\prod }_{i\in [n]}C{\overline{T}}_{(i)}{C}^{{\dagger} },C\in {{{{\mathcal{C}}}}}_{n},\overline{T}\in \{{{{\rm{T}}}},{{{{\rm{T}}}}}^{{\dagger} },{\mathbb{I}}\}\}\), of T-depth-1 unitaries has been defined. It has been shown that \(| {{\mathbb{V}}}_{n}| \le n\cdot {2}^{5.6n}\) and any T-depth-1 unitary \({U}_{1}\in {{{{\mathcal{J}}}}}_{n}\) can be written as

$${U}_{1}={e}^{i\phi }\left(\mathop{\prod}\limits_{i\ge 1}{V}_{i}\right){C}_{0}\qquad \,{{\mbox{where}}}\,\,{V}_{i}\in {{\mathbb{V}}}_{n}\,\,{{\mbox{and}}}\,\,{C}_{0}\in {{{{\mathcal{C}}}}}_{n}$$
(23)

We call each \({V}_{i}\), with T-depth 1, a (parallel) block; it can be written as a product of \(R(P)\) or \({R}^{{\dagger} }(P)\), where \(P\in \pm\, {{{{\mathcal{P}}}}}_{n}\). It is possible to multiply consecutive T-depth-1 unitaries to get another T-depth-1 unitary (conditions given in ref. 27). Thus \({{\mathbb{V}}}_{n}\) can be regarded as a generating set (modulo Clifford) for the set of T-depth-1 unitaries, and hence for the complete group \({{{{\mathcal{J}}}}}_{n}\). A decomposition which has the minimum number of T-depth-1 unitaries is called a T-depth-optimal decomposition. A circuit implementing \(U\in {{{{\mathcal{J}}}}}_{n}\) with the minimum T-depth is called a T-depth-optimal circuit.

Definition 3.3 (ϵ-T-depth of approximately implementable unitaries). The ϵ-T-depth of an approximately implementable unitary \(W\in {{{{\mathcal{U}}}}}_{n}\), denoted by \({{{{\mathcal{T}}}}}_{d\epsilon }(W)\), is equal to \({{{{\mathcal{T}}}}}_{d}(U)\), the T-depth of an exactly implementable unitary \(U\in {{{{\mathcal{J}}}}}_{n}\) such that d(U,W) ≤ ϵ and \({{{{\mathcal{T}}}}}_{d}(U)\le {{{{\mathcal{T}}}}}_{d}({U}^{{\prime} })\) for any \({U}^{{\prime} }\in {{{{\mathcal{J}}}}}_{n}\) and \(d({U}^{{\prime} },W)\le \epsilon\).

We call a T-count-optimal (or T-depth-optimal) circuit for any such U as the ϵ-T-count-optimal (or ϵ-T-depth-optimal, respectively) circuit for W.

Modification of \({{{{\mathcal{A}}}}}_{MIN}\)

Since the set \({{\mathbb{V}}}_{n}\) is finite, it is not hard to see that algorithms \({{{{\mathcal{A}}}}}_{MIN}\) and \({{{{\mathcal{A}}}}}_{DECIDE}\) can be applied to find a T-depth-optimal decomposition of any unitary \(W\in {{{{\mathcal{U}}}}}_{n}\). Replace step 1 of Algorithm 2 by \({{{\mathcal{S}}}}\leftarrow {{\mathbb{V}}}_{n}\). Suppose W = UE for some exactly implementable unitary U such that \({{{{\mathcal{T}}}}}_{d\epsilon }(W)={{{{\mathcal{T}}}}}_{d}(U)\) and d(W,U) ≤ ϵ. Then we can decompose U as in Eq. (23). If we have guessed the correct \({V}_{1},\ldots ,{V}_{d}\) then after multiplying \({\prod }_{i}{V}_{i}\) with W we are left with \(E{C}_{0}\). Now the amplitude test and conjugation test can be applied to check whether we have the correct guess. We have said before that it is possible to multiply consecutive \({V}_{i}\) such that the product has T-depth 1. In that case the ϵ-T-depth is less than d. So to find the minimum possible T-depth we might have to iterate more than \({{{{\mathcal{T}}}}}_{d\epsilon }(W)\) times.

Time complexity

The time complexity of the conjugation test \({{{{\mathcal{A}}}}}_{CONJ}\) is the same as before. The analysis of the complexity of \({{{{\mathcal{A}}}}}_{DECIDE}\) is also similar, but here \(| {{{\mathcal{S}}}}| =| {{\Bbb{V}}}_{n}|\), so if we take all possible \(m^{\prime}\)-length products at step 2, then the number of iterations of the outer loop 2–14 is at most \(n{2}^{5.6nm^{\prime} }\). The complexity of the remaining steps is the same, so the overall complexity of \({{{{\mathcal{A}}}}}_{DECIDE}\) is \(O(n{2}^{5.6nm^{\prime} +4n})\). We explained before that it is possible to combine two or more consecutive unitaries \({V}_{i}\in {{\mathbb{V}}}_{n}\) into a single T-depth-1 unitary. Thus this procedure gives a T-depth \(m\le m^{\prime}\). We do not know how large the difference \(m^{\prime} -m\) can be.

Alternatively, we can do a pessimistic analysis of \({{{{\mathcal{A}}}}}_{DECIDE}\). This algorithm is essentially an exhaustive search procedure to test for a certain T-depth m. Suppose that in step 2 we make sure that we have a T-depth-m unitary \(\tilde{U}\), i.e., it is not possible to combine any further. Essentially, this means \(\tilde{U}\) is from the set of T-depth-1 unitaries modulo Clifford. Now there can be at most \(O({4}^{{n}^{2}})\) of these. This is because there can be at most \(O({4}^{{n}^{2}})\) n-length products of R(P). This is a naive bound and more explanations can be found in ref. 27. So this time the outer loop can occur at most \(O({4}^{{n}^{2}m})\) times. Arguing in the same way, the complexity of \({{{{\mathcal{A}}}}}_{DECIDE}\) is \(O({2}^{2{n}^{2}m+4n})\), and hence the complexity of \({{{{\mathcal{A}}}}}_{MIN}\) is \(O({2}^{2{n}^{2}{{{{\mathcal{T}}}}}_{d\epsilon }(W)+4n})\).
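The exponent in the last bound follows from combining the loop count with the \(O({N}^{4})\) cost per iteration used in the T-count analysis, where \(N={2}^{n}\). Spelled out as a short worked step:

$$O\left({4}^{{n}^{2}m}\right)\cdot O\left({N}^{4}\right)=O\left({2}^{2{n}^{2}m}\cdot {2}^{4n}\right)=O\left({2}^{2{n}^{2}m+4n}\right).$$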

Space complexity

In step 1 of \({{{{\mathcal{A}}}}}_{DECIDE}\) we can store \({{\mathbb{V}}}_{n}\) in a symbolic way: for example, for each \({V}_{i}={\prod }_{j}R({P}_{j})\in {{\mathbb{V}}}_{n}\), we simply store the Paulis in the product. We can then calculate the required matrices whenever necessary by taking products of the corresponding R(P). In all other steps we store an N × N matrix, taking at most \(O({N}^{2})\in O({2}^{2n})\) space. Thus the space complexity is \(O(n{2}^{5.6n})\). As explained before, this approach leads to a longer running time, without affecting the asymptotic time complexity.

In this paper, we do not implement our algorithm to determine the ϵ-T-depth. For small enough ϵ, we can use the procedure outlined in the “An exponential time and polynomial space algorithm” section to synthesize a T-depth-optimal circuit.