(569a) Mutual Information to Inform Protein Library Design | AIChE

Authors 

Markou, G. C. - Presenter, University of Minnesota
Sarkar, C. A., University of Minnesota

Introduction:

Directed evolution has been widely used
to develop new or improved proteins for research, diagnostic, and therapeutic
applications. “You get what you select for” is the defining principle for
success; however, this expression ignores the importance of the initial input library,
which constrains the potential biomolecules that can be selected. Designed
libraries must balance variability in positions that modify functionality
against potentially disruptive mutations. Moreover, increasing variability at a
single position is met with diminishing returns [1], and combinatorial
library diversity can easily dwarf practical limitations in sampling
capabilities. We explore the use of information theory to inform library design
and better optimize this balance. Specifically, we use mutual information (MI)
as a tool to infer interdependence between sequence positions based on their
entropies [2], thus providing a means of objectively defining positions
that provide specific functionality. In this study, we seek to determine if MI
can provide insights for library design that a traditional design approach (e.g.,
one based on sequence alignment) would neglect, and we experimentally compare
the results of these design approaches.

Materials and Methods:

For this study, libraries were
designed based on the leucine-rich repeat (LRR) protein motif. These proteins mediate
numerous protein-protein interactions in organisms throughout all orders of
life, and thus are attractive as a class of therapeutics or nutraceuticals to
improve human health. Additionally, their short, repetitive sequence provides a
simple proof-of-principle model to test the capabilities of MI in library
design. MI is calculated as

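$$\mathrm{MI}(i,j) = \sum_{a,b} P(a_i, b_j)\,\log\frac{P(a_i, b_j)}{P(a_i)\,P(b_j)},$$

where $a_i$ represents residue $a$ at position $i$ and $b_j$ represents residue $b$ at position $j$. MI benefits from large sample size, so sequence data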
of 4,930 LRR repeats were obtained from the SMART database [3]. Multiple
sequence alignments were submitted to MISTIC [4], a web tool for
analyzing protein MI. These results were used to cluster amino acid positions
based on their roles in defining protein structure or target binding.
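
As a minimal sketch of the MI calculation above, the following Python estimates MI between two columns of an alignment from observed residue frequencies. The function name, the toy alignments, and the choice of base-2 logarithm are assumptions made for illustration; this is raw MI only, and dedicated tools typically apply further corrections (for example for sequence redundancy or background signal) that are omitted here.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Raw MI (in bits) between 0-based columns i and j of an alignment.

    `alignment` is a list of equal-length aligned sequences (strings).
    Probabilities are estimated as plain observed frequencies.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n          # joint frequency P(a_i, b_j)
        p_a = counts_i[a] / n     # marginal frequency P(a_i)
        p_b = counts_j[b] / n     # marginal frequency P(b_j)
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical two-column alignments, invented for illustration only.
toy_coupled = ["AG", "AG", "LR", "LR"]       # column 1 is fully determined by column 0
toy_independent = ["AG", "AR", "LG", "LR"]   # columns vary independently

mi_coupled = mutual_information(toy_coupled, 0, 1)
mi_independent = mutual_information(toy_independent, 0, 1)
```

In the coupled toy alignment the two columns carry one bit of shared information, while the independent one carries none, which is the contrast the network in Figure 1 relies on. With only a handful of sequences such estimates are noisy, which is why a large alignment is preferred.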

Based on these results, three distinct LRR libraries were subsequently designed
for selections with ribosome display [5], a powerful cell-free method
for directed evolution that can accommodate experimental libraries of up to 10^13
sequences. Version HVP was fully sampled and contained high amino acid
variability in the binding paratope only. Version MI was also fully sampled,
but it had moderate amino acid variability in the binding paratope and light
variability distributed in non-binding residues, which were selected using the
MI results. Finally, Version HVP+MI was a superset of Versions HVP and MI, with
the same highly variable binding residues from HVP and the light, distributed
variability from MI. This final library had increased theoretical diversity,
but suffered from the practical limitation of not being fully sampled
experimentally, underscoring the balance that must be struck in library design.
The libraries were synthesized and validated for expected variability via
Sanger sequencing of randomly selected clones.
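
The trade-off between theoretical diversity and experimental sampling can be sketched numerically. In the Python below, the per-position residue counts are hypothetical and do not correspond to the actual HVP, MI, or HVP+MI designs; only the 10^13 capacity figure comes from the text. The coverage estimate assumes equally likely variants sampled independently, which real libraries only approximate.

```python
import math

# Upper bound on ribosome display library size quoted in the text.
CAPACITY = 10**13


def theoretical_diversity(residues_per_position):
    """Number of distinct sequences: product of allowed residues per varied position."""
    return math.prod(residues_per_position)


def fits_capacity(residues_per_position, capacity=CAPACITY):
    """Crude check: does the theoretical diversity fit within the display capacity?"""
    return theoretical_diversity(residues_per_position) <= capacity


def expected_coverage(n_sampled, diversity):
    """Expected fraction of variants seen at least once (Poisson approximation)."""
    return 1.0 - math.exp(-n_sampled / diversity)


# Hypothetical designs, invented for illustration only.
paratope_only = [20] * 8                       # 8 fully randomized positions
paratope_plus_dispersed = [20] * 8 + [4] * 6   # plus 6 lightly varied positions

d_small = theoretical_diversity(paratope_only)            # 20**8
d_large = theoretical_diversity(paratope_plus_dispersed)  # 20**8 * 4**6
coverage_large = expected_coverage(CAPACITY, d_large)
```

Under these made-up numbers, adding six lightly varied positions is enough to push the design past the display capacity, so only a fraction of the larger library could ever be screened.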

The libraries were selected against a model target protein, chosen for its ease
of handling and existing natural LRR binders, using ribosome display. Two
successive rounds of selection against the target were performed. The output of
each round is an RT-PCR product, so enrichment of binders against the target
protein (relative to a negative control) is easily visualized via DNA gel
electrophoresis. In this context, a low negative signal is expected when an
enriched library is panned against an unrelated target.

Following the two selection rounds,
the libraries were recovered using gel extraction and subsequently cloned into
a sequencing vector. A set of clones from each enriched library was randomly chosen
and sequenced. The sequences were analyzed using ClustalX2 and Microsoft Excel
to study library enrichment. Interactive Tree of Life was used to create a
phylogenetic tree of the recovered sequences [6]. Selected individual clones
were then subjected to one final round of ribosome display to check clone
specificity. Selection stringency was increased in this analytical round, and competitive
selection with unbound target was used as the background to demonstrate binder
functionality.

Results and Discussion:

Figure 1 depicts a network visualization of MI among
amino acids in the studied LRRs. As expected, highly conserved positions cluster
into the core LRR motif, specifying the protein structure. One variable cluster
represents the binding domain, since some of these positions are generally
recognized as contributing to target recognition. A second variable cluster,
termed the back-end, is likewise revealed. Also of note is residue 13, which
retains high variability but does not share much MI with other residues,
suggesting it evolves independently and should not be diversified in any
library design. These results were used to design the three described 2-repeat
LRR libraries. Naïve library sequencing confirmed that the desired variability
in the three libraries was attained, so each was subjected to two rounds of
ribosome display selection against the same target protein. The DNA gel in
Figure 2 depicts the RT-PCR products from the second selection round for binding
to either the desired target protein or a negative control. The high
signal-to-background ratio is suggestive of successful enrichment for binders
in all three libraries.

Sequences of randomly selected
clones from the enriched libraries were analyzed to understand the molecular
nature of binder enrichment. Positions 8, 10, and 11 in the paratope showed
heavy bias, indicating their importance in binding and consistent with the
initial MI results. All three libraries showed bias towards G10/R10 and
G11/R11, suggesting a similar pathway towards binder evolution. The enrichment
of G10 and G11 is particularly interesting, as glycine appears to coevolve
with a bulky amino acid at nearby positions to accommodate steric effects. The
data set is too small to conduct further MI analysis on the enriched libraries,
but this result highlights the key insights that coevolution analysis provides
both before and after selections. Additionally, the dispersed variable positions
in libraries MI and HVP+MI were enriched away from the consensus residues fixed in library HVP.
This enrichment would not be captured in a traditional library design that restricts
diversity to only the paratope.
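
A positional bias analysis of this kind can be sketched as a per-position residue frequency count over the sequenced clones. The Python below is an illustrative assumption rather than the analysis pipeline used in the study (which relied on ClustalX2 and Microsoft Excel); the function name and the clone fragments are invented.

```python
from collections import Counter


def position_frequencies(sequences, position):
    """Residue frequencies at a 1-based position across aligned clone sequences."""
    counts = Counter(seq[position - 1] for seq in sequences)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


# Hypothetical 11-residue fragments, invented for illustration only.
# Positions 8, 10, and 11 are the ones that vary.
clones = [
    "AAAAAAAWAGG",
    "AAAAAAAFARR",
    "AAAAAAAWAGR",
    "AAAAAAAYAGG",
]

freq_pos10 = position_frequencies(clones, 10)
freq_pos11 = position_frequencies(clones, 11)
```

Comparing such frequencies against those of the naive library would indicate which positions were biased by selection rather than by library construction.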

A phylogenetic tree of the selected
sequences is shown in Figure 3. Only the Version MI library exhibited strong convergence,
with two distinct sequences in particular appearing several times (sequence
redundancy is quantified by the width of the horizontal line at the end of a
branch). Since these two sequences were overrepresented, they were individually
tested in one final round of single-clone ribosome display. With competitive
selection as the background in this stringent round, the results in Figure 4
confirm that both sequences appear to be functional binders. Only the Version MI
library, which was MI-informed and fully sampled, achieved this convergence
under the selection conditions of this study.
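
Sequence redundancy of the kind shown in Figure 3 can be tallied directly from the list of recovered clones. The following Python is a minimal sketch with invented placeholder sequence labels and a hypothetical threshold; it is not the tree-building procedure used in the study.

```python
from collections import Counter


def overrepresented(sequences, min_count=2):
    """Return (sequence, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(sequences)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


# Hypothetical clone identities, invented for illustration only.
recovered = ["SEQ_A", "SEQ_B", "SEQ_A", "SEQ_C", "SEQ_B", "SEQ_A"]

repeated = overrepresented(recovered)
```

Sequences that recur across independently picked clones are natural candidates for single-clone follow-up, as was done for the two overrepresented Version MI sequences.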

Conclusions:

The use of MI in library design
reinforces decisions from a traditional sequence information design approach,
but it also provides new insights into residue positions that may or may not contribute
to library function. Since initial library design is critical to successful
biomolecule evolution but necessarily limited in its complexity, MI can serve
as a useful quantitative tool to inform optimal design given these practical
constraints. For a model LRR library, these results led to an alternative
library design. Our study showed that a fully sampled library with MI-informed
variability (Version MI) performed better than designs that
either restricted variability to the paratope (Version HVP) or increased variability
but prevented full sampling (Version HVP+MI). The fully sampled, MI-informed
library converged most rapidly to functional binders in our experimental study.

Acknowledgements:

This work was supported by a 3M
Fellowship, University of Minnesota Interdisciplinary Doctoral Fellowship, NIFA
Predoctoral Fellowship (2017-67011-26204), and National Science Foundation grant
(CBET1055231).

References:

[1] Shohei Koide and Sachdev S. Sidhu. ACS Chem Biol. 2009. 4(5): 325-334.
[2] Louise C. Martin. Bioinformatics. 2005. 21(22): 4116-4124.
[3] Ivica Letunic et al. Nucleic Acids Res. 2012. 40(D1): D302-D305.
[4] Franco L. Simonetti et al. Nucleic Acids Res. 2013. 41: W8-W14.
[5] Jozef Hanes and Andreas Plückthun. Proc. Natl. Acad. Sci. USA. 1997. 94: 4937-4942.
[6] Ivica Letunic and Peer Bork. Nucleic Acids Res. 2016. 44(W1): W242-W245.

Figure 1. MI network of LRRs. Nodes denote amino
acid position in each repeat, and coloring represents position conservation.
Edges depict high MI between positions. These data are used to cluster
positions into the core LRR motif, binding residues, and back-end positions. A
cartoon representation of a single LRR repeat is provided for context.

Figure 2. Second-round ribosome display results
(RT-PCR outputs from the round) for each library version binding either the
target protein (+) or a dummy target (-).

Figure 3. Phylogenetic tree of sequences from all
three enriched libraries. A convergent sequence cluster is highlighted, and two
of these sequences are denoted for their dramatic over-representation in the
enriched Version MI library.

Figure 4. Final ribosome display results for the
two isolated Version MI sequences binding the target protein either with (+) or
without (-) competitive selection in a stringent selection round.