
DNA Sequencing Data Compression

Project ID: 6425-2-22
Year: 2023
Student/s: Raz Rippa, Dan Pardo
Supervisor/s: Yotam Gershon

Genomic sequencing refers to the process of determining the order of nucleotides (A, C, G, and T) in a DNA molecule. This information is used to understand the genetic makeup of an organism and can have a wide range of applications, including medical diagnosis and treatment, genetic research, and the development of new agricultural products.
DNA reads are short sequences of DNA generated by DNA sequencing technologies. These reads are typically around 100-150 base pairs in length and are produced by breaking the DNA molecule into smaller fragments that can be sequenced individually. The resulting reads are then assembled to reconstruct the original DNA sequence.
DNA reads can be generated using a variety of sequencing technologies, such as shotgun sequencing, in which each read's location within the sequence is generally unknown.
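
As a rough illustration of the shotgun setting described above, the following C++ sketch draws fixed-length reads from uniformly random positions of a genome string and discards the positions. The function name `sample_reads` and its parameters are hypothetical and are not taken from the project; real sequencers also introduce errors and variable read lengths, which this sketch ignores.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not part of the project): draw `num_reads` reads of
// length `read_len` from uniformly random start positions in `genome`.
// The start positions are not returned, mirroring the fact that a read's
// location within the sequence is generally unknown.
std::vector<std::string> sample_reads(const std::string& genome,
                                      std::size_t read_len,
                                      std::size_t num_reads,
                                      unsigned seed) {
    std::vector<std::string> reads;
    // No read can be drawn if the genome is shorter than one read.
    if (read_len == 0 || genome.size() < read_len) {
        return reads;
    }
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> start(0, genome.size() - read_len);
    reads.reserve(num_reads);
    for (std::size_t i = 0; i < num_reads; ++i) {
        reads.push_back(genome.substr(start(rng), read_len));
    }
    return reads;
}
```

For example, `sample_reads(genome, 100, 1000, 42u)` would return 1000 error-free reads of 100 bases each, assuming `genome` holds at least 100 bases.
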
The quality and accuracy of DNA reads is an important factor in the overall quality of the final genome assembly. High-quality read data is characterized by a low error rate and high coverage, meaning that each region of the genome has been sequenced multiple times to ensure accuracy. High coverage, however, produces a large volume of read data, so assembling the sequence from the reads requires high computational effort. This computational load can be reduced by compressing the read data before assembly.
Read-compression methods fall into two main categories, reference-free and reference-based, which differ in whether they use an almost identical reference genome, shared by the encoder and the decoder, to compress the data. Reference-based methods are typically more efficient than reference-free methods, as the reference genome can be used to identify and eliminate redundant information in the data. In certain scenarios, however, genome reads are generated at a remote location, such as a physician's office, and then sent to a central location, such as a cloud database, for further processing. In such cases, the expense of reference-based methods may be too high for the edge node to bear, due to the storage and alignment requirements of the long reference sequences.
As an alternative, recent work proposed a new compression scheme in which the encoder needs to neither store nor process any reference, while still benefiting from a reference available at the central node that decodes the reads. In this project we aim to implement this compression scheme efficiently in C++ and to develop an easy-to-use simulation environment for examining the scheme.
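
To make the storage and alignment cost of conventional reference-based compression concrete, the C++ sketch below encodes a read as an alignment position plus a list of substitutions relative to the reference. It is only a minimal illustration of the conventional approach, not the scheme implemented in this project. The names `EncodedRead`, `encode_read`, and `decode_read` are hypothetical, and the sketch assumes substitution errors only (no insertions or deletions) and uses a brute-force search where practical tools use indexed aligners.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: where the read aligns and how it differs there.
struct EncodedRead {
    std::size_t position = 0;  // best start position in the reference
    std::size_t length = 0;    // read length
    // (offset within the read, base observed in the read)
    std::vector<std::pair<std::size_t, char>> mismatches;
};

// Encoder side: needs the full reference and scans every start position,
// keeping the first one with the fewest substitutions.
EncodedRead encode_read(const std::string& reference, const std::string& read) {
    if (reference.size() < read.size()) {
        throw std::invalid_argument("reference shorter than read");
    }
    EncodedRead best;
    best.length = read.size();
    std::size_t best_dist = read.size() + 1;
    for (std::size_t start = 0; start + read.size() <= reference.size(); ++start) {
        std::size_t dist = 0;
        // Stop counting once this start cannot beat the current best.
        for (std::size_t i = 0; i < read.size() && dist < best_dist; ++i) {
            if (reference[start + i] != read[i]) ++dist;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best.position = start;
        }
    }
    for (std::size_t i = 0; i < read.size(); ++i) {
        if (reference[best.position + i] != read[i]) {
            best.mismatches.emplace_back(i, read[i]);
        }
    }
    return best;
}

// Decoder side: copy the aligned window and re-apply the substitutions.
std::string decode_read(const std::string& reference, const EncodedRead& enc) {
    std::string read = reference.substr(enc.position, enc.length);
    for (const auto& m : enc.mismatches) {
        read[m.first] = m.second;
    }
    return read;
}
```

In this sketch both `encode_read` and `decode_read` take the reference as an argument, which is exactly the requirement that becomes expensive at an edge node. The scheme targeted by this project removes that requirement on the encoder side.
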

Poster for DNA Sequencing Data Compression