Paper: A multi-task convolutional deep neural network for variant calling in single molecule sequencing

Summary

Clairvoyante is a CNN based DNN which provides 4 kinds of predictions (1. Variant type 2. zygosity 3. alternative allele 4. indel length ).

Contributions.

  1. Better accuracy and less computation in germline variant calling (in single molecule sequencing).
  2. Novel input encoding method

Notes.

  • Due to class imbalance, it is hard to detect long indel. No support for indel bp>4
  • Uses a fixed-window size of 33

Algorithms explained

A-1. Constructing Input for the model

def GenerateTensor(args, ctgName, alns, center, refSeq):
    '''
    flankingBaseNum = 16 , matrixNum = 4
    '''
    # Initialize 33x4x4 array
    alnCode = [0] * ( (2*param.flankingBaseNum+1) * 4 * param.matrixNum )
    depth = [0] * ((2 * param.flankingBaseNum + 1))
    for aln in alns:
        for refPos, queryAdv, refBase, queryBase in aln:
            if str(refBase) not in "ACGT-":
                continue
            if str(queryBase) not in "ACGT-":
                continue
            if refPos - center >= -(param.flankingBaseNum+1) and refPos - center < param.flankingBaseNum:
                offset = refPos - center + (param.flankingBaseNum+1)
                if queryBase != "-":
                    if refBase != "-": # querybase matches to reference
                        depth[offset] = depth[offset] + 1
                        # stripe2: 16, stripe1: 4
                        alnCode[stripe2*offset + stripe1*base2num[refBase] + 0] += 1.0
                        # second tensor encodes the inserted sequences
                        alnCode[stripe2*offset + stripe1*base2num[queryBase] + 1] += 1.0
                        # third tensor encodes the deleted sequences
                        alnCode[stripe2*offset + stripe1*base2num[refBase] + 2] += 1.0
                        # fourth tensor encodes alternative alleles
                        alnCode[stripe2*offset + stripe1*base2num[queryBase] + 3] += 1.0
                    elif refBase == "-": # ins in querybase
                        idx = min(offset+queryAdv, 2*param.flankingBaseNum+1-1)
                        alnCode[stripe2*idx + stripe1*base2num[queryBase] + 1] += 1.0
                    else:
                      print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
                elif queryBase == "-":
                    if refBase != "-": # deletion in querybase
                        alnCode[stripe2*offset + stripe1*base2num[refBase] + 2] += 1.0
                    else: # not possible
                        print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
                else: # not possible
                    print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
    # 
    newRefPos = center - (0 if args.refStart == None else (args.refStart - 1))
    if (newRefPos - (param.flankingBaseNum+1) >= 0) and depth[param.flankingBaseNum] >= args.minCoverage:
        outputLine = "%s %d %s %s" % (ctgName, center, refSeq[newRefPos-(param.flankingBaseNum+1):newRefPos+param.flankingBaseNum], " ".join("%0.1f" % x for x in alnCode))
        return outputLine
    else:
        return None    

The result of each encoded inputs (SNP, Insertion, Deletion, Reference) looks like below image

Encoded Input