Paper: Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Summary

Software that maps long reads to the reference genome.

Minimap adopts the idea of sketch like MHAP but takes minimizers as a reduced representation instead; it stores k-mers in a hash table like BLAT and MHAP but also uses sorting extensively like DALIGNER. In addition, minimap is designed not only as a read overlapper but also as a read-to-genome and genome-to-genome mapper.

Algorithms explained (Seeding and Indexing)

A-1. MinimizerSketch(s, w, k): How minimizer is obtained

Algorithm 1

minimizer is the smallest K-mer hash value within window size w. Single K-mer (minimizer) is a representative of w+k-1 length string, also a seed.

For each character in input string s

  1. Calculates the hash value of both K-mer and reverse-complement K-mer. Choose the one with the smaller hash value.
    • Strand number is stored along with the K-mer hash value, therefore, if strand is ambiguous (ambiguous when hash values of s and RC of s are identical), the K-mer hash value is discarded.
    • Minimizer is compared with the minimum K-mer hash value between the one from the original strand and the one from the RC strand. By using two K-mer hash values, reads are aligned to the reference regardless of the strand the reads align to.
  2. All calculated K-mer hash value from 1. is appended to circular queue.
    • Minimap2 uses a circular queue (variable buf in minimap2/sketch.c) to track the minimum K-mer hash value.
    • (position in input string s, reference_id, strand number (0 or 1)) are also stored in the queue along with the hash value.
  3. The minimum k-mer hash value is updated and appended to output minimizer in 2 cases.
    1. New K-mer hash value from 1 is smaller than current minimum K-mer hash value
    2. Minimum K-mer hash value is not within window size w anymore.

A-2. InvertibleHash(x, p): Hash function used for K-mer

Reference: https://naml.us/post/inverse-of-a-hash-function/

Thomas Wang’s integer hash functions

Algorithm 2

  • To avoid oversampling poly-A (K-mer hash of poly-A as minimizer), minimap uses an invertible Hash function. As a result, poly-A which might have been 0 hash value with naive 2-bit encoding, results in non-zero hash value. Minimizer is the minimum K-mer hash value within window size w, hence, poly-A are less likely to be selected as minimizer which diversifies the minimizers and avoid oversampling poly-A as minimizer.
  • Key properties include avalanche (changing any input bit changes about half of the output bits) and invertibility.

A-3. Index(T, w, k): Indexing the reference genome

Given a multiple strings T.

  1. For each reference string t in T, collect the minimizers (A-1).
  2. Sort the minimizer array.
    • Sorted with the (1: minimizer, 2:position) (K-mer hash value of minimizers).
    • Values are the reference index, position in the reference, strand, number of minimizers.
    • Using a bucket reduces the memory size and increases the cache hit ratio.
  3. Minimizers’ positions (pos in mininimizer array) are stored in buckets (hash table index).
    • The bucket uses b bit for hash value instead of 2K bit minimizer.
    • The bucket stores the start position and number of minimizers.
  4. The actual position can be obtained by lookup in 1) bucket hash table and 2) minimizer-position array.

A-4. Map(H, q, w, k, e): Mapping query to the reference genome.

Given a query q. Follow the seed-and-extend paradigm.

  1. Get minimizers of query q. (minimizers are seeds)
  2. Sort minimizers
  3. Do Chaining (find seeds that are colinear), Seed extension (Needleman–Wunsch algorithm, Global alignment).

Chaining and Seed extension will be covered in other Paper Reading post.