Paper: A multi-task convolutional deep neural network for variant calling in single molecule sequencing
Summary
Clairvoyante is a CNN based DNN which provides 4 kinds of predictions (1. Variant type 2. zygosity 3. alternative allele 4. indel length ).
Contributions.
- Better accuracy and less computation in germline variant calling (in single molecule sequencing).
- Novel input encoding method
Notes.
- Due to class imbalance, it is hard to detect long indel. No support for indel bp>4
- Uses a fixed-window size of 33
Algorithms explained
A-1. Constructing Input for the model
def GenerateTensor(args, ctgName, alns, center, refSeq):
'''
flankingBaseNum = 16 , matrixNum = 4
'''
# Initialize 33x4x4 array
alnCode = [0] * ( (2*param.flankingBaseNum+1) * 4 * param.matrixNum )
depth = [0] * ((2 * param.flankingBaseNum + 1))
for aln in alns:
for refPos, queryAdv, refBase, queryBase in aln:
if str(refBase) not in "ACGT-":
continue
if str(queryBase) not in "ACGT-":
continue
if refPos - center >= -(param.flankingBaseNum+1) and refPos - center < param.flankingBaseNum:
offset = refPos - center + (param.flankingBaseNum+1)
if queryBase != "-":
if refBase != "-": # querybase matches to reference
depth[offset] = depth[offset] + 1
# stripe2: 16, stripe1: 4
alnCode[stripe2*offset + stripe1*base2num[refBase] + 0] += 1.0
# second tensor encodes the inserted sequences
alnCode[stripe2*offset + stripe1*base2num[queryBase] + 1] += 1.0
# third tensor encodes the deleted sequences
alnCode[stripe2*offset + stripe1*base2num[refBase] + 2] += 1.0
# fourth tensor encodes alternative alleles
alnCode[stripe2*offset + stripe1*base2num[queryBase] + 3] += 1.0
elif refBase == "-": # ins in querybase
idx = min(offset+queryAdv, 2*param.flankingBaseNum+1-1)
alnCode[stripe2*idx + stripe1*base2num[queryBase] + 1] += 1.0
else:
print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
elif queryBase == "-":
if refBase != "-": # deletion in querybase
alnCode[stripe2*offset + stripe1*base2num[refBase] + 2] += 1.0
else: # not possible
print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
else: # not possible
print >> sys.stderr, "Should not reach here: %s, %s" % (refBase, queryBase)
#
newRefPos = center - (0 if args.refStart == None else (args.refStart - 1))
if (newRefPos - (param.flankingBaseNum+1) >= 0) and depth[param.flankingBaseNum] >= args.minCoverage:
outputLine = "%s %d %s %s" % (ctgName, center, refSeq[newRefPos-(param.flankingBaseNum+1):newRefPos+param.flankingBaseNum], " ".join("%0.1f" % x for x in alnCode))
return outputLine
else:
return None