BWA-MEME: BWA-MEM emulated with a machine learning approach

Paper: BWA-MEME: BWA-MEM emulated with a machine learning approach

Summary

BWA-MEM, a widely used alignment tool, typically encounters bottlenecks in the seeding phase of its seed-and-extend alignment paradigm. The paper propose BWA-MEME that emulates BWA MEM algorithm with a novel contraction-based approach in seeding. This approach uses learned-index to enhance the longest-common-prefix (LCP) search efficiency between the query sequence and the suffix array, which represents the reference genome.

BWA-MEME uses Contraction-based approach.

Existing works FM-index and ERT index uses a elongation-based approach for exact match search. Those methods elongates the exact match between query and reference genome, which requires random memory access during index lookup. BWA-MEME takes a contraction-based approach:

Finds LCP between query and suffix array using learned-index. Predicting the position of LCP is done using learned-index. Binary search is performed from the predicted position to find LCP. The region for binary search is determined by the max error is provided from the learned-index model (which is pre-calculated at the training stage).
Finds seeds nearby the LCP. (the positions of seeds in suffix array always includes LCP)

Seed is SMEM which is defined by:

exact match between query and reference genome that cannot be extended in either direction without mismatch
the number of exact matches to the reference genome should be larger than the given input threshold N.

FM-index or ERT keeps track of the number of exact match while elogating the exact match. However, BWA-MEME first finds the longest exact match and finds a shorter exact match that adhere to the definition of seeds. BWA-MEME performs seeding in zig-zag which is similar to the proposed approach in ERT, however, BWA-MEME shows the left-extension points (LEP) used in ERT are not required [ref].

Better memory efficiency with contraction-based approach

The LCP’s position is efficiently predicted using a learned-index, typically requiring only one or two instances of random memory access.
The memory access pattern becomes more predictable, with a binary search initiated from the predicted position and seeds searched near the LCP, either through linear or exponential methods.

Putting together, contraction-based approach opens up various ways of improving memory-efficiency of exact match search in BWA-MEME (including data-locality in index and memory prefetching)

Components of BWA-MEME for further improvment in memory-efficiency

Optimization of the RMI (Recursive Model Index) structure tailored for the reference genome.
Improved data locality, achieved by integrating suffix characters directly next to suffix array position values.
Efficient reuse of exact match search results, reducing redundant computations.

Runtime index building

When disk I/O is limited, loading large index into memory can take significant time. To address this issue, BWA-MEME implemented an runtime index building that only loads suffix array and builds the other indices at start-up. The speed are equivalent to disk read of speed 3-5 GB per second.

Paper: BWA-MEME: BWA-MEM emulated with a machine learning approach#

Summary#

BWA-MEME uses Contraction-based approach.#

Better memory efficiency with contraction-based approach#

Components of BWA-MEME for further improvment in memory-efficiency#

Runtime index building#