Data storage in DNA with fewer synthesis cycles using composite DNA letters

Anavy, Leon; Vaknin, Inbal; Atar, Orna; Amit, Roee; Yakhini, Zohar

doi:10.1038/s41587-019-0240-x

Article
Published: 09 September 2019

Nature Biotechnology volume 37, pages 1229–1236 (2019)

An Author Correction to this article was published on 16 September 2019

This article has been updated

Abstract

The density and long-term stability of DNA make it an appealing storage medium, particularly for long-term data archiving. Existing DNA storage technologies involve the synthesis and sequencing of multiple nominally identical molecules in parallel, resulting in information redundancy. We report the development of encoding and decoding methods that exploit this redundancy using composite DNA letters. A composite DNA letter is a representation of a position in a sequence that consists of a mixture of all four DNA nucleotides in a predetermined ratio. Our methods encode data using fewer synthesis cycles. We encode 6.4 MB into composite DNA, with distinguishable composition medians, using 20% fewer synthesis cycles per unit of data, as compared to previous reports. We also simulate encoding with larger composite alphabets, with distinguishable composition deciles, to show that 75% fewer synthesis cycles are potentially sufficient. We describe applicable error-correcting codes and inference methods, and investigate error patterns in the context of composite DNA letters.
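The notion of a composite letter can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: it assumes a hypothetical "resolution" parameter `k`, so that a letter is any integer ratio of A, C, G and T summing to `k`, and it infers a letter from simulated reads by picking the nearest ratio in L1 distance. The function names, the full resolution-`k` alphabet and the inference rule are all assumptions made for illustration. The idealized cycle saving it computes ignores error-correction overhead and is not meant to reproduce the figures reported above.

```python
import math
import random
from collections import Counter
from itertools import product

BASES = "ACGT"

def composite_alphabet(k):
    """Enumerate all (a, c, g, t) integer mixing ratios that sum to k.

    Hypothetical construction: k = 1 gives the four standard letters.
    """
    return [r for r in product(range(k + 1), repeat=4) if sum(r) == k]

def bits_per_letter(k):
    """Information carried by one position, i.e. by one synthesis cycle."""
    return math.log2(len(composite_alphabet(k)))

def cycle_saving(k):
    """Idealized fraction of cycles saved versus 2 bits per standard letter."""
    return 1.0 - 2.0 / bits_per_letter(k)

def sample_reads(letter, n_reads, rng):
    """Simulate sequencing: each read shows one base drawn by the mixing ratio."""
    return rng.choices(BASES, weights=letter, k=n_reads)

def infer_letter(reads, alphabet):
    """Return the alphabet letter whose ratio is closest (L1) to observed frequencies."""
    counts = Counter(reads)
    n = len(reads)
    freqs = [counts[b] / n for b in BASES]

    def distance(letter):
        total = sum(letter)
        return sum(abs(f - x / total) for f, x in zip(freqs, letter))

    return min(alphabet, key=distance)

if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = composite_alphabet(2)
    reads = sample_reads((1, 1, 0, 0), 200, rng)  # equal mixture of A and C
    print(len(alphabet), infer_letter(reads, alphabet))
```

With `k = 1` the alphabet reduces to the four standard nucleotides at 2 bits per cycle, so the saving is zero. Larger `k` packs more bits into each cycle but requires more reads per position to tell neighbouring ratios apart, which is the trade-off between synthesis cost and sequencing depth that the abstract alludes to.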

**Fig. 1: Encoding a binary message using standard and composite DNA.**

**Fig. 2: Encoding pipeline of a large-scale composite DNA-based data storage.**

**Fig. 3: Performance of a large-scale composite DNA-based storage system.**

**Fig. 4: Analysis of higher-resolution composite alphabets using large-scale experiments.**

**Fig. 5: Data storage systems based on large composite alphabets.**

Data availability

All raw sequencing data are available from the European Nucleotide Archive (ENA) under accession PRJEB32427. This includes sequencing of the large-scale experiment described in Figs. 2–4, sequencing of the experiment with large alphabets described in Fig. 5 and sequencing of the error analysis experiment described in Fig. 5. All other data are available within the article or its supplementary information.

Code availability

All original software code included in this study is available online. Alteration of the previously published DNA fountain code to support composite DNA is available from https://github.com/leon-anavy/dna-fountain. Code used for Reed–Solomon error correction (altered from previously published code) is available from https://github.com/leon-anavy/Reed-Solomon. Custom code used for the analyses presented in this study is available from https://github.com/leon-anavy/composite-DNA.

Change history

16 September 2019
An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Acknowledgements

We thank T. Katz-Ezov and T. Hashimshony from the Technion Genome Center for advice and assistance with oligonucleotide design and sequencing experiments. We also thank P. Weiss from Twist Bioscience for technical support and assistance with DNA synthesis. Finally, we thank the Yakhini and Amit research groups for valuable comments and discussions. L. Anavy is supported by the Adams Fellowships Program of the Israel Academy of Sciences and Humanities. This project received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement 664918 (MRG-Grammar).

Author information

Authors and Affiliations

Computer Science Department, Technion – Israel Institute of Technology, Haifa, Israel
Leon Anavy & Zohar Yakhini
Faculty of Biotechnology and Food Engineering, Technion – Israel Institute of Technology, Haifa, Israel
Inbal Vaknin, Orna Atar & Roee Amit
School of Computer Science, Herzliya Interdisciplinary Center, Herzliya, Israel
Zohar Yakhini

Contributions

L.A. and Z.Y. initiated and designed the coding and algorithmic approach. L.A. developed the software and performed data analysis. I.V. and O.A. performed the experiments. L.A., R.A. and Z.Y. wrote the manuscript. R.A. and Z.Y. supervised the study.

Corresponding authors

Correspondence to Leon Anavy or Zohar Yakhini.

Ethics declarations

Competing interests

L.A., Z.Y. and R.A. are inventors on a patent application for the method described in this article. The initial filing was assigned United States Provisional Patent Application No. 62/674,114. The remaining authors declare no competing financial interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–17 and Supplementary Note

Reporting Summary

Supplementary Table 1

Physical density calculations of composite DNA storage. This includes the large-scale experiment and the dilution experiment.

Supplementary Table 2

Logical density calculations of composite DNA storage. This includes all the experiments, theoretical encodings and simulation experiments.

Supplementary Table 3

Oligonucleotide design for large-alphabet experiments and error analysis.

Supplementary Table 4

Oligonucleotide design for the large-scale composite DNA storage.

Supplementary Table 5

Oligonucleotide design for the simulations of large composite alphabet DNA storage.

Supplementary Table 6

Simulation results of large composite alphabet DNA storage.

About this article

Cite this article

Anavy, L., Vaknin, I., Atar, O. et al. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat Biotechnol 37, 1229–1236 (2019). https://doi.org/10.1038/s41587-019-0240-x

Received: 26 September 2018
Accepted: 25 July 2019
Issue Date: October 2019