Protein sequencing presents different challenges from nucleic acid sequencing, so proteomics has yet to benefit as much as genomics from the next-generation sequencing revolution. However, the ability to sequence proteins at a level comparable to nucleic acid sequencing would be tremendously beneficial. Accordingly, scientists look to high-throughput nucleic acid sequencing for inspiration on how to improve existing protein sequencing techniques or develop new ones.1
Jeff Nivala uses nanopores to read proteins with single-amino acid sensitivity.
Jeff Hagen
Jeff Nivala, a molecular engineer at the University of Washington, sees nanopore technology as the way forward for single-molecule protein sequencing and beyond. In this interview, Nivala describes his new technique, in which he uses the enzyme ClpX to unfold and ratchet long protein strands through nanopores, allowing them to be read with single-amino acid sensitivity.2
What sparked your interest in using nanopore technology for protein sequencing?
The big breakthrough for nucleic acid sequencing using nanopores was the discovery of a motor protein that can ratchet the strand nucleotide-by-nucleotide through the pore. At the start of my graduate studies, I tried to find a similar motor for proteins. Fortunately, around the same time, a study came out in Cell that characterized how the unfoldase ClpX worked at the single-molecule level.3 The detail presented in this study let me imagine how I could transfer this motor protein over and apply it to nanopore protein sequencing, and I was able to put two and two together.4
What is the biggest difference between using nanopore technology for protein sequencing versus nucleic acid sequencing?
Protein sequencing is a lot more challenging. Nucleic acids have uniform negatively charged backbones, which means that electrophoretic force is enough to move nucleic acids through a nanopore. Proteins are heterogeneously charged, so they do not behave as nicely and the signals become noisier. There is also a lot more complexity when working with 20 amino acids compared to four nucleotides, not to mention tertiary structures, folded domains, and so on.
How difficult is it to ratchet a protein through a nanopore?
ClpX can handle synthetic proteins fairly easily because they are not folded. Natural proteins with fully folded domains can be harder to resolve because they have to be unfolded before passing through the nanopore. Proteins can also refold on the trans side of the pore after passing through. In another study using this technology, we actually have to unfold a protein twice: once before it can go through the nanopore via electrophoretic force and once before it can come back up. There is still a lot to be discovered about how well a motor works with a given protein.
How difficult is it to distinguish between different amino acids?
The signal that we observe comes from sensing a sliding window of around 20 amino acids at a time as they pass through the pore. This makes it much more difficult to detect single amino acid differences. However, the longer the sequence, the higher the odds that it will generate distinct signal elements. Putting these individual elements together creates a collective unique signature, which lets us adopt a fingerprint-based approach and use a signature to identify a protein.
These sequence differences can be very subtle, making it hard to apply traditional statistics or other analysis methods. This is where machine learning comes in. We are training machine learning programs to extract the differences in signal between different proteins, learn which features are associated with which amino acids, and map how the surrounding amino acids contribute to the observed signal at a given position.
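The sliding-window and fingerprint ideas in this answer can be illustrated with a small toy model. The sketch below is not the lab's method or model: it assumes, purely for illustration, that each amino acid contributes a made-up numeric level, that the signal at each position is the mean level of the roughly 20 residues in the window, and that identification is a nearest-reference match on mean squared difference. All names (`RESIDUE_LEVEL`, `simulate_trace`, `identify`) and all numbers are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 20  # approximate window size mentioned in the interview

# Hypothetical per-residue signal levels (arbitrary values, not measured data).
_level_rng = random.Random(0)
RESIDUE_LEVEL = {aa: _level_rng.uniform(0.0, 1.0) for aa in AMINO_ACIDS}


def simulate_trace(sequence, window=WINDOW):
    """Toy signal: one value per window position, the mean level in the window."""
    levels = [RESIDUE_LEVEL[aa] for aa in sequence]
    return [
        sum(levels[i:i + window]) / window
        for i in range(len(levels) - window + 1)
    ]


def add_noise(trace, sigma, rng):
    """Add Gaussian noise to every point of a trace."""
    return [x + rng.gauss(0.0, sigma) for x in trace]


def distance(a, b):
    """Mean squared difference between two equal-length traces."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def identify(observed, references):
    """Return the name of the reference trace closest to the observed trace."""
    return min(references, key=lambda name: distance(observed, references[name]))


# Build a small library of random 120-residue sequences (stand-ins for proteins).
rng = random.Random(1)
proteins = {
    f"protein_{k}": "".join(rng.choice(AMINO_ACIDS) for _ in range(120))
    for k in range(5)
}
references = {name: simulate_trace(seq) for name, seq in proteins.items()}

# A noisy read of one protein is matched against the whole library.
observed = add_noise(references["protein_3"], sigma=0.005, rng=rng)
best_match = identify(observed, references)
print(best_match)
```

In this toy model a single substitution changes at most 20 consecutive trace values, each by only one twentieth of the difference between the two residue levels, which is one way to picture why single-residue differences are subtle while a long sequence still yields a distinctive overall signature.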
What are your short- and long-term goals for this technique?
We are increasing the size of our datasets by looking at more complex amino acid sequences so that we can train better models. Most of our data right now comes from synthetic proteins, so we are expanding our dataset by adding more natural molecules. Ultimately, the goal is to have a model that can recognize any arbitrary protein in the human proteome.
As we go down that path, we anticipate that new challenges will appear. For example, will we need an improved motor protein for certain types of protein sequences? Will a different pore size make our method more sensitive, given that our current nanopore was optimized for DNA sequences? There are going to be so many improvements that make this technique exponentially better in the future.
This interview has been condensed and edited for clarity.