Digging into the genetic recipe book

ChEM-H faculty fellow Polly Fordyce

Jun 5 2018

Polly Fordyce is an assistant professor of genetics and bioengineering at Stanford University and a faculty fellow of Stanford ChEM-H. The Fordyce Lab develops novel tools to study molecular interactions. In a recent paper published in the Proceedings of the National Academy of Sciences, Fordyce unravels part of the mystery of how our body expresses different genes at different times in order to develop properly and adapt to environmental changes.



Q: What is the major puzzle you are trying to solve? 

PF: I think of our genome as a recipe book. Every cell has the information to make every recipe. Yet, somehow, your eyes know to just make the “eye recipe” and your liver knows just to make the “liver recipe.” And the instructions for the cell about which recipes to make—which genes to turn on and off—are encoded in sequences within the genome called regulatory sequences. We are trying to decipher these regulatory sequences. 

If you give me the coding sequence for any protein in the genome, I can tell you exactly what protein will be made. But, for regulatory sequences, I can’t tell you when that set of instructions will turn a gene on or off. This is the central puzzle we’re trying to solve.

Q: What was the main goal this study was trying to address?

PF: Transcription factors are proteins that will bind particular regulatory sequences to turn genes on and off. Our main goal was to determine how a transcription factor figures out where, within the approximately 3 billion base pairs of the human genome, it’s supposed to bind. Currently, we cannot predict the binding of transcription factors with high accuracy. 

Q: What did we know already about the binding behavior of transcription factors?

PF: We’re actually accumulating a lot of information about transcription factor binding. Researchers have done many experiments where they expose a transcription factor to millions and millions of possible DNA sequences to determine which ones the transcription factor likes to bind. They then have looked for common patterns among these sequences—called “consensus sites”—and have catalogued them. We use consensus sites to predict where transcription factors will bind, but they aren’t perfectly accurate for predicting binding.

Q: What were the specific research questions of this study?  

PF: Since the consensus site doesn’t fully predict transcription factor behavior, other variables must be at play. A very specific question we were trying to ask was whether the nucleotides on either side of the consensus site might be providing additional information to help transcription factors determine where to bind. 

Q: What did you discover from this study?

PF: We measured binding of transcription factors to one million sequences that all had exactly the same consensus site, but had different sequences 5 base pairs before and after. We were able to count molecules with really high accuracy, which showed us that changing these sequences on either side had surprisingly large effects on the overall affinity, sometimes as large as the effects of changing the consensus sequence itself!

The other main innovation was that we compared a neural network model, which is a complex model with many, many free parameters, against two much simpler models to see how well each explained the data.

Q: Did the simple models come close to what you saw with the neural network’s model?

PF: Neural networks are really good at picking out patterns from a ton of data, but you can’t ask them, “How do you get there? What are you learning?” It’s kind of like this black box that just figures things out. But we found that the simple model could explain all of the measured energies just as well as the neural network could. This gives us really great evidence that the simple model is a tool that is both interpretable and accurate.

Q: What is the significance of this study? 

PF: One reason is precision medicine. When someone sequences your genome, they will usually only sequence the coding regions, the genes that actually make protein, because they want to know if you have a mutation that alters the function of an essential protein. But most mutations (the differences between your genome and my genome) are found in the regulatory sequences of DNA, not in the coding regions or even in the consensus sites. As of now, we can’t predict whether mutations in these regions are likely to have functional effects. A given mutation might make no difference, or it might make you very different from me. This work is trying to figure out a way to assign significance to these differences so that, in the future, we can predict which mutations in regulatory sequences we need to pay attention to.

Q: Where do we go from here?

PF: I think this is a novel technique for studying transcription factor binding, and we’d like to apply it to other transcription factors to learn how similar transcription factors know to bind distinct regulatory sites. We’re also interested in adding more complexity to account for other variables that affect transcription, such as whether the DNA is accessible to the transcription factor or not.

Q: Why were you interested in studying this in the first place?

PF: In college, I majored in physics and biology. I was really interested in finding some place where physics and biology intersect, and for me, that place is macromolecular interactions. Thinking about a protein that binds DNA, or a protein that binds another protein, is such an interesting problem because it’s where the laws of physics determine whether or not two things come together. And disruption of that process can lead to catastrophic consequences on the level of cells or entire people. A transcription factor is supposed to find one DNA sequence from the entire genome. It’s got millions and millions of wrong spots, and we want to understand how it finds the right one.

Jaynelle Gao is a Stanford graduate student in Epidemiology and Clinical Research.