Xiaoqiu (See-ow-chew) Huang, Professor
Department of Computer Science, Iowa State University
226 Atanasoff Hall, Ames, IA 50011
EM: xqhuang AT iastate DOT edu Phone: 515-294-2432 Fax: 515-294-0258
Google Scholar Citations:
Introduction to Data Structures: Com S 228
Computational Techniques for Genome Assembly and Analysis:
Xiaoqiu Huang, with his collaborators, has developed algorithms and software in four areas of
sequence alignment, sequence assembly, gene finding and tree construction.
He is also interested in understanding the prevalence and implications of
gene flow in the evolution of asexual filamentous fungal pathogens (see blow).
In sequence alignment, a linear-space algorithm was developed for computing
k best local alignments between two sequences (Huang and Miller 1991),
significantly reducing the quadratic space complexity of the
Michael Waterman, the Smith-Waterman local alignment algorithm was developed to
address the weakness of the
Needleman-Wunsch global alignment algorithm in comparing homologous
genomic segments with conserved regions (such as exons) separated by divergent regions (such as introns).
Although the Huang-Miller algorithm could be used to find many pairs of conserved
regions (in separate, unordered alignments) between two genomic segments,
a linear-space global alignment algorithm was designed to compute
one alignment of genomic segments with all pairs of conserved
regions in similarity blocks and all pairs of divergent regions in difference blocks
(in the natural order), precisely showing the boundary between
similarity and difference blocks (Huang and Chao 2003).
An appropriate substitution matrix (e.g. at the maximum-likelihood evolutionary distance)
was selected in alignment of protein sequences to suit their conservation level (Huang 2008).
Multiple sets of alignment parameters with different levels of stringency
were dynamically selected to suit various kinds of sequence regions
with different levels of conservation (Huang and Brutlag 2007).
In sequence assembly, Huang was encouraged by
Ross Overbeek in 1990 to develop
a contig assembly program (CAP).
As one of the first sequence assembly programs based on the overlap-layout-alignment approach,
CAP was special in using a maximum similarity algorithm,
instead of a minimum distance algorithm, for computing overlaps between sequences,
in generating contigs in a decreasing order of overlap quality without the need
to determine the orientation of sequences before the layout stage, and in using
a special alignment algorithm for generating a multiple alignment of sequences in a contig (Huang 1992).
Paired-end reads and base quality values were used in the well-known CAP3 program (Huang and Madan 1999).
A whole-genome assembly program named PCAP was developed to handle million of
Sanger sequences in assembly of large animal genomes (Huang et al. 2003).
PCAP was used in several large whole-genome assembly projects
such as chimpanzee (Nature 437: 69-87), chicken (Nature 432: 695-716),
and platypus (Nature 453: 175-183).
The de novo feature of the PCAP chimpanzee genome assembly
was useful in a study
(BMC Genomics 7: 15).
A unique feature of PCAP was the introduction of a superword array in finding
overlaps between sequences (Huang et al. 2006).
Unlike suffix arrays,
superword arrays allow base mismatches (Huang 2017).
In gene finding, a linear-space DNA-protein alignment algorithm
was developed to find exon-intron structures in a genomic sequence (Huang and Zhang 1996).
This algorithm, along with a linear-space DNA-cDNA alignment algorithm,
was combined with fast sequence comparison methods into a sequence
annotation package named AAT (Huang et al. 1997).
AAT was used to find gene structures in the first plant genome project (Nature 402: 761-768).
Genes and other functional elements are often located in C+G rich regions of
the genome, which can be found with a local content algorithm (Huang 1994).
A useful feature of this algorithm is that the lengths of regions found
are solely determined by their contents, not by any user-provided window size.
In tree construction,
a new formulation of phylogenetic reconstruction based on the maximum sequence similarity
was described (Huang and Vingron 2009).
A hierarchical phylogeny construction algorithm for managing a large number of
taxa was proposed (Das and Huang 2019).
Horizontal theory of molecular evolution
After examining highly variable genomic regions in several asexual
fungal pathogens such as Verticillium dahliae (Huang 2014),
Fusarium virguliforme (Huang et al. 2016), and
Fusarium oxysporum (Huang 2019), Huang (in 2019) proposed
the horizontal theory of molecular evolution, suggesting that
in a major eukaryotic lineage such as fungi,
most of the variation within and between recently-emerged asexual
pathogenic species are due to horizontal transfer of genetic variants
that are beneficial under certain conditions,
maintain genetic structure, or generate genetic variation (structural variants
as well as nucleotide substitutions).
Horizontal transfer refers to the movement of genetic information between genomes
other than by the vertical transmission of DNA from parent to offspring.
This occurs on a regular and frequent basis (ongoing) when pathogens carry supernumerary
chromosomes flanked by complementary subtelomeres that are host-specific
and contain helicase-like genes. Supernumerary chromosomes are enriched in
transposons and pathogenicity genes
associated with heterokaryotic incompatibility (het) genes.
Core chromosomes are flanked by the same type of complementary subtelomeres
as supernumerary chromosomes, allowing the exchange of genes between
core and supernumerary chromosomes by homologous recombination over
their highy similar subtelomeres.
Pathogenicity genes in core chromosomes are maintained through
their association with het genes, because a loss of
a region containing pathogenicity and het genes
can result in incompatible het interactions triggering
local and surrounding cell death.
Horizontal theory concerning structural variants complements
the neutral theory of molecular evolution concerning nucleotide substitutions
by providing alternative solutions to
Huang X, Miller W. (1991)
A Time-Efficient, Linear-Space Local Similarity Algorithm.
Advances in Applied Mathematics 12:337-357.
Huang X. (1992)
A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps.
Genomics 14: 18-25.
Huang X. (1994)
An Algorithm for Identifying Regions of a DNA Sequence that Satisfy a Content Requirement.
Bioinformatics 10: 219-225.
Huang X, Zhang J. (1996)
Methods for Comparing a DNA Sequence with a Protein Sequence.
Bioinformatics 12: 497-506.
Huang X, Adams MD, Zhou H, Kerlavage AR. (1997)
A Tool for Analyzing and Annotating Genomic Sequences.
Genomics 46: 37-45.
Huang X, Madan A. (1999)
CAP3: A DNA Sequence Assembly Program.
Genome Research 9: 868-877.
Huang X, Chao K-M. (2003)
A Generalized Global Alignment Algorithm.
Bioinformatics 19: 228-233.
Huang X, Wang J, Aluru S, Yang S-P, Hillier L. (2003)
PCAP: A Whole-Genome Assembly Program.
Genome Research 13: 2164-2170.
Ye L, Huang X. (2005)
MAP2: Multiple Alignment of Syntenic Genomic Sequences.
Nucleic Acids Research 33: 162-170.
Huang X, Yang S-P, Chinwalla A, Hillier L,
Minx P, Mardis E, Wilson R. (2006)
Application of a Superword Array in Genome Assembly.
Nucleic Acids Research 34: 201-205.
Huang X, Brutlag DL. (2007)
Dynamic Use of Multiple Parameter Sets in Sequence Alignment.
Nucleic Acids Research 35: 678-686.
Huang X. (2008)
Sequence Alignment with an Appropriate Substitution Matrix.
Journal of Computational Biology 15: 129-138.
Huang X, Vingron M. (2009)
Maximum Similarity: A New Formulation of Phylogenetic Reconstruction
Journal of Computational Biology 16: 887-896.
Huang X. (2014)
Horizontal Transfer Generates Genetic Variation in an Asexual Pathogen.
PeerJ 2: e650.
Huang X, Das A, Sahu BB, Srivastava SK, Leandro LF,
O'Donnell K, Bhattacharyya MK. (2016)
Identification of Highly Variable Supernumerary Chromosome Segments in an Asexual Pathogen.
PLoS ONE 11: e0158183.
Huang X. (2017)
Sequence Assembly. In Keith JM. (ed.)
Bioinformatics - Volume 1: Data, Sequence Analysis, and Evolution.
Humana Press, Second Edition, 35-45.
Das A, Huang X. (2019)
HPC: Hierarchical Phylogeny Construction.
PLoS ONE 14: e0221357.
Huang X. (2019)
Host-specific subtelomere: Genomic architecture of pathogen emergence in asexual filamentous fungi.
We have developed a number of
for analysis of DNA and protein sequences.
The programs below are used by scientists around the world in their research.
SIM: A local sequence alignment program,
MAP: A multiple sequence alignment program,
CAP3 and PCAP: Sequence and genome assembly programs
AAT: A set of programs for finding genes in DNA sequences