
DNA-based data storage is a cutting-edge technology that offers exceptional information density and longevity. However, writing, storing, and reading data in DNA are prone to noise and errors at every stage, so achieving reliable storage at a reasonable cost requires advanced error-correction methods. Traditional error-correction techniques, which primarily address substitution and erasure errors, fall short due to the unique characteristics of the DNA storage medium. This has led to the emergence of new coding challenges and to the need for enhancements of existing techniques. These challenges include developing codes and fundamental bounds for handling insertions, deletions, and substitutions; reconstructing sequences; addressing duplication errors; designing constrained codes; and more. Moreover, ensuring data privacy, a critical requirement for any storage technology, has not been significantly explored in the context of DNA-based data storage.
This Special Issue encourages the research community working on topics related to DNA-based data storage to advance the mathematical foundations of error-correction in DNA storage systems.
This paper presents constructions of DNA codes that satisfy biological and combinatorial constraints for DNA-based data storage systems. We introduce an algorithm that generates DNA blocks containing sequences meeting the required constraints for DNA codes. The constructed DNA sequences satisfy biological constraints: balanced GC-content, avoidance of secondary structures, and prevention of homopolymer runs. They simultaneously satisfy combinatorial constraints that ensure differences among DNA sequences and their reverses and reverse complements. The DNA codes incorporate error correction through minimum Hamming distance requirements. We establish a bijective mapping between algebraic structures and DNA sequences, enabling the construction of DNA codes with specified characteristics. Using this framework, we construct DNA codes based on error-correcting codes, including Simplex and Reed-Muller codes. These constructions ensure that the DNA sequences avoid secondary structures and homopolymer runs exceeding length three, both of which cause errors in DNA storage systems, and concatenated sequences retain these properties. The codes achieve non-vanishing code rates and minimum Hamming distances for large sequence lengths, demonstrating their viability for DNA-based data storage systems.
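As a minimal illustration of the biological constraints above, the following Python sketch checks a sequence for near-balanced GC-content and bounded homopolymer runs, and computes the reverse complement that appears in the combinatorial constraints. The run limit of three follows the abstract; the GC thresholds and all function names are illustrative assumptions, not the paper's construction.

```python
from collections import Counter

def satisfies_biological_constraints(seq, gc_low=0.45, gc_high=0.55, max_run=3):
    """Check near-balanced GC-content and the absence of homopolymer runs
    longer than max_run.  Thresholds are assumptions for illustration."""
    counts = Counter(seq)
    gc = (counts["G"] + counts["C"]) / len(seq)
    if not (gc_low <= gc <= gc_high):
        return False
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:
            return False
    return True

def reverse_complement(seq):
    """Reverse complement, used in the combinatorial distance constraints."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[c] for c in reversed(seq))

print(satisfies_biological_constraints("ACGTACGT"))  # balanced GC, short runs
print(satisfies_biological_constraints("AAAAGCGC"))  # homopolymer run of 4
print(reverse_complement("ACGT"))                    # ACGT is self-reverse-complementary
```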
DNA-based data storage systems face practical challenges due to the high cost of DNA synthesis. A strategy to address the problem entails encoding data via topological modifications of the DNA sugar-phosphate backbone. The DNA Punchcards system, which introduces nicks (cuts) in the DNA backbone, encodes only one bit per nicking site, limiting density. We propose DNA Tails, a storage paradigm that encodes nonbinary symbols at nicking sites by growing enzymatically synthesized single-stranded DNA of varied lengths. The average tail lengths encode multiple information bits and are controlled via a staggered nicking-tail extension process. We demonstrate the feasibility of this encoding approach experimentally and identify common sources of errors, such as calibration errors and stumped tail growth errors. To mitigate calibration errors, we use rank modulation proposed for flash memory. To correct stumped tail growth errors, we introduce a new family of rank modulation codes that can correct “stuck-at” errors. Our analytical results include constructions for order-optimal-redundancy permutation codes and accompanying encoding and decoding algorithms.
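The appeal of rank modulation here is that information lives in the relative order of the average tail lengths rather than their absolute values, so any monotone miscalibration of measured lengths leaves the decoded permutation unchanged. A toy sketch under that assumption (the lengths and names are illustrative; the paper's codes additionally correct stuck-at errors):

```python
def lengths_to_permutation(lengths):
    """Decode measured tail lengths into their rank permutation."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    ranks = [0] * len(lengths)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return tuple(ranks)

true_lengths = [30, 10, 20, 40]                      # encodes (2, 0, 1, 3)
miscalibrated = [x * 1.7 + 5 for x in true_lengths]  # monotone distortion

print(lengths_to_permutation(true_lengths))
print(lengths_to_permutation(miscalibrated))  # same permutation survives
```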
The number of zeros and the number of ones in a binary string are referred to as the composition of the string, and the prefix-suffix compositions of a string form the multiset of the compositions of its prefixes and suffixes of all possible lengths. In this work, we present binary codes of length n in which every codeword can be efficiently reconstructed from its erroneous prefix-suffix compositions with at most t composition errors. All our constructions have decoding complexity polynomial in n, and the best of them has constant rate and can correct t = Θ(n) errors. In comparison, no prior construction can efficiently correct t = Θ(n) arbitrary composition errors. Additionally, we propose a method of encoding h arbitrary strings of the same length so that they can be reconstructed from the multiset union of their error-free prefix-suffix compositions, at the expense of an h-fold coding overhead. In contrast, existing methods can only recover h distinct strings, albeit with code rate asymptotically equal to 1/h. Building on top of the proposed method, we also present a coding scheme that enables efficient recovery of h strings from their erroneous prefix-suffix compositions with t = Θ(n) errors.
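For concreteness, the prefix-suffix compositions of a binary string can be computed directly (a small Python sketch; the function names are our own):

```python
from collections import Counter

def composition(s):
    """Composition of a binary string: (number of zeros, number of ones)."""
    return (s.count("0"), s.count("1"))

def prefix_suffix_compositions(s):
    """Multiset of compositions of all prefixes and suffixes of s
    (lengths 1..n each, giving 2n compositions in total)."""
    n = len(s)
    comps = [composition(s[:k]) for k in range(1, n + 1)]
    comps += [composition(s[n - k:]) for k in range(1, n + 1)]
    return Counter(comps)

print(prefix_suffix_compositions("0110"))
```

For "0110", both the length-2 prefix "01" and the length-2 suffix "10" have composition (1, 1), so that composition appears with multiplicity two in the multiset.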
This paper studies two problems motivated by the recent approach of composite DNA, which takes advantage of the fact that DNA synthesis generates a huge number of copies of every synthesized strand. Under this paradigm, each composite symbol stores not a single nucleotide but a mixture of the four DNA nucleotides. The first problem concerns the expected number of strand reads required to decode a composite strand or a group of composite strands. In the second problem, our goal is to study how to carefully choose a fixed number of mixtures of the DNA nucleotides so that the decoding probability under the maximum-likelihood decoder is maximized.
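A toy sketch of maximum-likelihood decoding of a single composite symbol from its reads, under an assumed small candidate set of mixtures (the candidate vectors and names are illustrative, not from the paper):

```python
import math
from collections import Counter

# Each candidate composite symbol is a probability vector over {A, C, G, T};
# reads are modeled as i.i.d. draws from the true mixture, and the decoder
# picks the candidate maximizing the log-likelihood of the observed counts.
CANDIDATES = {
    "pure_A":  {"A": 1.0,  "C": 0.0,  "G": 0.0,  "T": 0.0},
    "A_C_mix": {"A": 0.5,  "C": 0.5,  "G": 0.0,  "T": 0.0},
    "uniform": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def ml_decode(reads):
    counts = Counter(reads)
    best, best_ll = None, -math.inf
    for name, p in CANDIDATES.items():
        ll = 0.0
        for nt, c in counts.items():
            if p[nt] == 0.0:
                ll = -math.inf   # candidate cannot produce this nucleotide
                break
            ll += c * math.log(p[nt])
        if ll > best_ll:
            best, best_ll = name, ll
    return best

print(ml_decode("AACACCACAC"))  # roughly half A, half C
```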
Synchronization errors, arising from both synthesis and sequencing noise, present a fundamental challenge in DNA-based data storage systems. These errors are often modeled as insertion-deletion-substitution (IDS) channels, for which maximum-likelihood decoding is computationally expensive. In this work, we propose a data-driven approach based on neural polar decoders (NPDs) to design decoders with reduced complexity for channels with synchronization errors. The proposed architecture enables decoding over IDS channels with reduced complexity O(AN log N), where A is a tunable parameter independent of the channel. NPDs require only sample access to the channel and can be trained without an explicit channel model. Additionally, NPDs provide mutual information (MI) estimates that can be used to optimize input distributions and code design. We demonstrate the effectiveness of NPDs on both synthetic deletion and IDS channels. For deletion channels, we show that NPDs achieve near-optimal decoding performance and accurate MI estimation, with significantly lower complexity than trellis-based decoders. We also provide numerical estimates of the channel capacity for the deletion channel. We extend our evaluation to realistic DNA storage settings, including channels with multiple noisy reads and real-world Nanopore sequencing data. Our results show that NPDs match or surpass the performance of existing methods while using significantly fewer parameters than the state-of-the-art. These findings highlight the promise of NPDs for robust and efficient decoding in DNA data storage systems.
In this paper, we study error exponents for a class of DNA storage codes based on index-based concatenated coding in which the number of reads performed can be variable. That is, the decoder can sequentially perform reads and choose whether to output the final decision or take more reads, and we are interested in minimizing the average number of reads performed rather than a fixed pre-specified value. We show that this flexibility leads to a considerable reduction in the error probability compared to a fixed number of reads, not only in the constants of the error exponent but also in the scaling laws. This is shown via an achievability result for a suitably designed protocol, and in certain parameter regimes we additionally establish a matching converse that holds for all protocols within a broader class of this type.
Emerging DNA storage technologies use composite DNA letters, where information is represented by a probability vector, leading to higher information density and lower synthesis costs. However, this approach faces the problem of information leakage when DNA vessels are shared among untrusted vendors. This paper introduces an asymptotic ramp secret sharing scheme (ARSSS) for secret information storage using composite DNA letters. This scheme, inspired by secret sharing methods over finite fields and enhanced with a modified matrix-vector multiplication operation for probability vectors, achieves asymptotic information-theoretic data security for a large alphabet size. Moreover, it reduces the number of reading operations on DNA samples compared to traditional schemes, and therefore lowers the complexity and the cost of DNA-based secret sharing. We further explore the construction of the scheme, starting with a proof of the existence of a suitable generator, followed by practical examples. Finally, we demonstrate efficient constructions that support large information sizes by utilizing multiple vessels for each secret share rather than a single vessel.
The central problem in sequence reconstruction is to find the minimum number of distinct channel outputs required to uniquely reconstruct the transmitted sequence. According to Levenshtein’s work in 2001, this number is determined by the size of the maximum intersection between the error balls of any two distinct input sequences of the channel. In this work, we study the sequence reconstruction problem for the q-ary single-deletion single-substitution channel for any fixed integer q ≥ 2. First, we prove that if two q-ary sequences of length n have Hamming distance d ≥ 2, then the intersection size of their error balls is upper bounded by 2qn − 3q − 2 − δ_{q,2}, where δ_{i,j} is the Kronecker delta, and this bound is achievable. Next, we prove that if two q-ary sequences have Hamming distance d ≥ 3 and Levenshtein distance d_L ≥ 2, then the intersection size of their error balls is upper bounded by 3q + 11, and we show that the gap between this bound and the tight bound is at most 2.
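For small lengths, such error balls and their intersections can be enumerated by brute force. The sketch below assumes the ball of a string is everything reachable by one deletion followed by at most one substitution; the paper's contribution is bounding the maximum intersection size analytically for all n and q.

```python
def error_ball(s, q=2):
    """All q-ary strings (over '0'..str(q-1)) reachable from s by one
    deletion followed by at most one substitution.  Brute force: small n only."""
    alphabet = [str(a) for a in range(q)]
    after_del = {s[:i] + s[i + 1:] for i in range(len(s))}
    ball = set()
    for t in after_del:
        ball.add(t)                              # deletion, no substitution
        for i in range(len(t)):
            for a in alphabet:
                ball.add(t[:i] + a + t[i + 1:])  # deletion + one substitution
    return ball

bx = error_ball("01100")
by = error_ball("10011")
print(len(bx), len(by), len(bx & by))  # ball sizes and intersection size
```

For example, deleting the last symbol of "01100" gives "0110", which therefore lies in its error ball, while "1111" does not: it is at Hamming distance at least 2 from every single-deletion result of "01100".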
Recent advancements in DNA storage show that composite DNA letters can significantly enhance storage capacity. We model this process as a multinomial channel and propose an optimization algorithm to determine its capacity-achieving input distribution (CAID) for an arbitrary number of output reads. Our empirical results match a scaling law showing that the support size grows exponentially with capacity. In addition, we introduce a limited-support optimization algorithm that optimizes the input distribution under a restricted support size, making it more feasible for real-world DNA storage systems. We also extend our model to account for noise and study its effect on capacity and input design.
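As a sketch of how a capacity-achieving input distribution over a fixed support can be computed, the following runs Blahut-Arimoto iterations on a tiny multinomial channel, reduced here to a binomial one: inputs are assumed composite levels over a two-letter alphabet, and the output is the count of one letter among n reads. The levels, read count, and iteration count are illustrative assumptions; the paper's algorithm additionally optimizes the support itself.

```python
import math

def binom_pmf(k, n, p):
    """P(k successes in n reads) when the composite level is p."""
    return math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

inputs = [0.0, 0.25, 0.5, 0.75, 1.0]   # assumed candidate composite levels
n_reads = 5
W = [[binom_pmf(k, n_reads, p) for k in range(n_reads + 1)] for p in inputs]

q = [1 / len(inputs)] * len(inputs)     # input distribution, start uniform
for _ in range(200):                    # Blahut-Arimoto iterations
    out = [sum(q[i] * W[i][k] for i in range(len(inputs)))
           for k in range(n_reads + 1)]
    # d[i] = KL divergence between row i of the channel and the output law
    d = [sum(W[i][k] * math.log(W[i][k] / out[k])
             for k in range(n_reads + 1) if W[i][k] > 0)
         for i in range(len(inputs))]
    w = [q[i] * math.exp(d[i]) for i in range(len(inputs))]
    s = sum(w)
    q = [x / s for x in w]

capacity = sum(q[i] * d[i] for i in range(len(inputs))) / math.log(2)
print(f"capacity over this support: about {capacity:.3f} bits per symbol")
```

The pure levels 0 and 1 alone already give one noiseless bit, so the converged value lies strictly between 1 and log2(5) bits.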
This paper studies achievable rates of nanopore-based DNA storage when nanopore signals are decoded using a tractable channel model that does not rely on a basecalling algorithm. Specifically, the noisy nanopore channel (NNC) with the Scrappie pore model generates average output levels via i.i.d. geometric sample duplications corrupted by i.i.d. Gaussian noise (NNC-Scrappie). Simplified message passing algorithms are derived for efficient soft decoding of nanopore signals using NNC-Scrappie. Previously, evaluation of this channel model was limited by the lack of DNA storage datasets with nanopore signals included. This is solved by deriving an achievable rate based on the dynamic time-warping (DTW) algorithm that can be applied to genomic sequencing datasets subject to constraints that make the resulting rate applicable to DNA storage. Using a publicly available dataset from Oxford Nanopore Technologies (ONT), it is demonstrated that coding over multiple DNA strands of 100 bases in length and decoding with the NNC-Scrappie decoder can achieve rates of at least 0.64–1.18 bits per base, depending on the channel quality of the nanopore chosen in the sequencing device for each channel use, and 0.96 bits per base on average assuming uniformly chosen nanopores. These rates are pessimistic since they only apply to single reads and do not include calibration of the pore model to specific nanopores.
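A minimal simulator of the duplication-plus-noise structure of this channel model: each input level is repeated a geometrically distributed number of times (the random dwell of the pore on a k-mer) and each sample is corrupted by Gaussian noise. The level values, duplication parameter, and noise scale below are assumptions for illustration, not the Scrappie pore model's.

```python
import random

random.seed(1)

def nnc_output(levels, p=0.5, sigma=0.1):
    """Pass a sequence of ideal current levels through a toy NNC:
    Geometric(p) duplications per level, plus i.i.d. Gaussian noise."""
    out = []
    for lvl in levels:
        dups = 1
        while random.random() > p:   # Geometric(p) supported on {1, 2, ...}
            dups += 1
        out.extend(lvl + random.gauss(0, sigma) for _ in range(dups))
    return out

signal = nnc_output([0.0, 1.0, -0.5])
print(len(signal), [round(x, 2) for x in signal])
```

Because duplication counts are random, the output length varies from read to read, which is exactly the synchronization problem that the message passing decoders above must resolve.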
In this paper, we consider a recent channel model of a nanopore sequencer proposed by McBain, Viterbo, and Saunderson (2024), termed the noisy nanopore channel (NNC). In essence, an NNC is a duplication channel with structured Markov inputs that is corrupted by memoryless noise. We first discuss a (tight) lower bound on the capacity of the NNC in the absence of random noise. Next, we present lower and upper bounds on the channel capacity of general noisy nanopore channels. We then consider two interesting regimes of operation of an NNC: first, where the memory of the input process is large and the random noise introduces erasures, and second, where the rate of measurements of the electric current (also called the sampling rate) is high. For these regimes, we show that it is possible to achieve information rates close to the noise-free capacity using low-complexity encoding and decoding schemes. In particular, our decoder for the regime of high sampling rates makes use of a change-point detection procedure, a subroutine of immediate relevance for practitioners.
We consider a molecular channel in which messages are encoded in the frequencies of objects in a pool, and whose output at reading time is a noisy version of the input frequencies, obtained by sampling with replacement from the pool. Motivated by recent DNA storage techniques, we focus on the regime in which the input resolution is unlimited. We propose two error probability bounds for this channel; the first bound is based on a random coding analysis of the error probability of the maximum-likelihood decoder, and the second bound is derived by code expurgation techniques. We deduce an achievable bound on the capacity of this channel and compare it to achievable bounds under limited input resolution, as well as to a converse bound.
In DNA sequencing, we often need to infer an unknown sequence from a collection of its corrupted copies. No copy faithfully reflects the original, owing to DNA fragmentation, point mutations, and measurement errors, so the theoretical guarantee of unique reconstruction is of concern. This motivated the study of sequence reconstruction problems three decades ago. Recently, synthetic DNA has been regarded as an ultra-dense data storage medium, and sequence reconstruction is a crucial step toward reliable and efficient data readout. In this survey, we summarize two main types of problems, reconstruction from subsequences and reconstruction from substrings, in both combinatorial and probabilistic settings. We also discuss codes and algorithms that may assist the future development of DNA-based data storage systems.
As a potential implementation of data storage using DNA molecules, multiple strands of DNA are stored unordered in a liquid container. When the data are needed, an array of DNA readers samples the strands with replacement, producing a Poisson-distributed number of noisy reads for each strand. The primary challenge is to design an algorithm that reconstructs the data from these unsorted, repetitive, and noisy reads. In this paper, we lay down a capacity-achieving rateless code along each strand to encode its index; we then lay down a capacity-achieving block code at the same position across all strands to protect the data. Together, these codes weave a low-complexity storage scheme that attains the fundamental storage limit of this channel. This improves upon the previous work of Weinberger and Merhav, which established this limit and attained it using high-complexity random codes. Our scheme also differs from other concatenation-based implementations of DNA data storage in that, instead of decoding the inner codes first and passing the results to the outer code, our decoder alternates between the rateless codes and the block codes.
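The Poisson read statistics above can be seen in a small simulation: drawing N = cM reads with replacement from M strands makes each per-strand read count approximately Poisson(c), so a fraction of roughly e^{-c} of the strands is never read at all, which is why every strand carries an index and the data needs erasure protection across strands. The parameters below are illustrative.

```python
import random

random.seed(0)
M, c = 10_000, 3          # number of strands, average coverage (assumed)
N = c * M                 # total reads, drawn with replacement
hits = [0] * M
for _ in range(N):
    hits[random.randrange(M)] += 1

unread = sum(1 for h in hits if h == 0)
print(f"fraction of strands never read: {unread / M:.3f}")  # close to e^-3, about 0.05
```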