BCB 444/544X

Lab 9  - Revised 10/23/05

Gene Prediction

 

Objectives

 

  1. Use genome browsers to examine types of evidence for genes
  2. Learn about the resources available for gene prediction
  3. Practice using gene prediction and promoter prediction software

 

Introduction

 

Drena will provide this in lab!

 

Exercises

 

Required questions are in red.  

 

Turn in the answers to the questions by emailing them to terrible@iastate.edu by noon on Monday.

 

 

Exercise 1   Genome Browsers

 

We will first use some genome browsers to see what kinds of information are available to help us predict genes.  Go to the UCSC Genome Browser and take a look at what is available.  The input boxes at the top of the page allow you to choose which region of which genome you want to look at.  For our purposes, we can accept the default values (which should display a region of human chromosome 7).  Click on the submit button to get started. 

 

Take a look at everything this server allows you to do.  The top of the page has navigation controls that let you move upstream and downstream, as well as zoom in or out, or jump directly to a position.  Below that, there is a picture of the entire chromosome with a red line showing the region you are currently looking at. The main box contains a graphical representation of a huge amount of information including sequence markers, known genes, ESTs, conservation, SNPs, etc.  The rest of the page lists available tracks. 

 

Since we are working on gene prediction, we should see what gene prediction tracks are available and display them.  Click on the hide all button to clear all selected tracks.  Then scroll down the page and select some of the gene prediction tracks.  Play around with the different display options (dense, squish, pack, full) to see which options you like the best.  Notice the differences in gene predictions by the different methods. 

 

Next go to PlantGDB and browse the Arabidopsis genome.  This site was created and is maintained by Volker Brendel and his group here at Iowa State.  The browser at this site shows less information, but it is displayed very nicely.  Everything is color coded so that you can tell at a glance what evidence there is for a gene in the region.  The color key on the left side of the page shows what the different color boxes mean.  The labels are all self-explanatory except for the UCA.  UCA stands for User Contributed Annotation.  One of the features of PlantGDB is that users can look at the EST, mRNA, cDNA, and outside evidence and contribute annotations to the genome project. 

 

 

 

Exercise 2   Gene Prediction

 

The human uroporphyrinogen decarboxylase gene (URO-D, U30787) is used in this exercise.  An SP1 transcription factor binding site, a TATA box, and 10 exons on the forward strand have been annotated in the 4514 bp sequence. 

 

1. Go to the EBI database and download the URO-D sequence, both in FASTA format (for use in the exercises below) and in the default format. 

 

http://www.ebi.ac.uk/embl

           

http://www.ebi.ac.uk/cgi-bin/emblfetch

(you can download sequence in several formats from here)
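
If you would like to check your FASTA download from a script, the following is a minimal sketch, assuming a plain FASTA file with one or more records. The function name read_fasta and the toy record are made up for illustration; the toy sequence is NOT the real URO-D sequence.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    # Join the per-line pieces of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input standing in for the downloaded file (hypothetical record).
example = [">toy_record example only", "ACGTAC", "GTTA"]
records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

To run this on your own download, pass an open file handle instead of the toy list, for example read_fasta(open("urod.fasta")), where the file name is whatever you chose when saving. For U30787 the reported length should agree with the 4514 bp given above.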

 

2. Use GeneID   http://www1.imim.es/geneid.html

 

to predict splice sites and START and STOP codons in the sequence.

Identify the real sites among the predictions.

Do they tend to show higher scores?
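
To see why a predictor reports so many candidate sites, it can help to count the bare sequence signals yourself. The sketch below is NOT how GeneID works (GeneID scores sites with trained models); it only lists every GT, AG, ATG, and stop codon on the forward strand of a made-up toy sequence. The function name find_signals is hypothetical.

```python
def find_signals(seq):
    """Return 0-based positions of minimal sequence signals on the forward strand."""
    seq = seq.upper()
    signals = {"donor_GT": [], "acceptor_AG": [], "start_ATG": [], "stop": []}
    # Dinucleotides: almost every intron starts with GT and ends with AG.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair == "GT":
            signals["donor_GT"].append(i)
        elif pair == "AG":
            signals["acceptor_AG"].append(i)
    # Trinucleotides: candidate start and stop codons in any frame.
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon == "ATG":
            signals["start_ATG"].append(i)
        elif codon in ("TAA", "TAG", "TGA"):
            signals["stop"].append(i)
    return signals


toy = "CCATGGTAAGTCCAGGATAG"  # made-up sequence, not URO-D
signals = find_signals(toy)
for name, positions in signals.items():
    print(name, positions)
```

Even this 20 bp toy sequence contains several candidates of each type, which is why the scores attached to each predicted site matter when you look for the real ones.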

 

3.  Now, use GeneID to predict all possible exons.

Compare the exon predictions with the real exons.

Why is the initial exon not included in the final gene assembly?
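
When comparing predicted exons with the annotated ones, it may be easier to tabulate the comparison than to do it by eye. The following is a minimal sketch, assuming you have typed both sets in as (start, end) pairs in the same 1-based coordinates. The function name compare_exons and all coordinates shown are made up; substitute the values from the U30787 annotation and from your GeneID output.

```python
def compare_exons(predicted, annotated):
    """Classify predicted exons against annotated ones.

    Both arguments are lists of (start, end) tuples, 1-based and inclusive.
    """
    def overlap(x, y):
        return x[0] <= y[1] and y[0] <= x[1]

    annotated_set = set(annotated)
    # Both boundaries agree with an annotated exon.
    exact = [p for p in predicted if p in annotated_set]
    # Overlaps an annotated exon but at least one boundary differs.
    partial = [p for p in predicted
               if p not in annotated_set and any(overlap(p, a) for a in annotated)]
    # Overlaps no annotated exon at all.
    wrong = [p for p in predicted if not any(overlap(p, a) for a in annotated)]
    # Annotated exons that no prediction touches.
    missed = [a for a in annotated if not any(overlap(p, a) for p in predicted)]
    return {"exact": exact, "partial": partial, "wrong": wrong, "missed": missed}


annotated = [(100, 180), (300, 420), (600, 700)]  # hypothetical annotation
predicted = [(300, 420), (590, 700), (900, 950)]  # hypothetical predictions
result = compare_exons(predicted, annotated)
for category, exons in result.items():
    print(category, exons)
```

In this toy example the first annotated exon ends up in the "missed" list, which is the kind of pattern the question above asks you to look for.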

 

4.  The initial exon is not detected by ab initio methods or homology searches.

(What does ab initio mean, in the context of gene prediction methods?)

Explain this observation.

 

5.  Use

 

GENSCAN   http://genes.mit.edu/GENSCAN.html

& 

FGENESH  http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind

 

with parameters from other species (try a plant, a non-vertebrate animal, and a yeast) to predict genes in the URO-D sequence. 

 

Discuss the results.

 

Now repeat these predictions using the appropriate parameters (i.e., those for human). 

 

How much improvement do you observe?

 

6.  Locate the region in the Drosophila genome that encodes the URO-D gene and use GeneID, GENSCAN and FGENESH with human parameters to make the predictions. 

 

Compare with the predictions using the Drosophila parameters. 

What differences can be noted?

 

 


Exercise 3   Promoter Prediction

 

The promoter region of the human obese gene (leptin, U43589) includes 3 annotated regulatory elements:  an SP1 site, a cEBP box, and a TATA box. The sequence can be downloaded from the EBI database.

 

1. Go to the TRANSFAC database http://www.gene-regulation.com/pub/databases.html#transfac

 

and obtain the matrix representing the TATA box. You may be presented with several potential TATA motifs. 

 

Find the one motif that is bound by TATA binding factor (TBP or TBF) and save the header information for this motif. 

Carefully read the comments of the record. 

 

How many sites were used to build this matrix?

 

2. Repeat the above process for SP1 and cEBP. 

How many sites were aligned to build their matrices?

 

Is there any relationship between the quality of the predictions and the number of collected binding sites?
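
To make the link between a matrix and the sites used to build it concrete, here is a minimal sketch of one common way to turn aligned sites into a log-odds weight matrix and scan a sequence with it. It is NOT the TRANSFAC or MATCH scoring scheme. The aligned sites, the pseudocount of 1, and the uniform background of 0.25 are all assumptions chosen for illustration, and the function names build_pwm and scan are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # log2 of (smoothed frequency / background frequency) for each base
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; return (0-based start, score) pairs."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


# Made-up aligned sites (NOT taken from TRANSFAC).
sites = ["TATAAA", "TATAAG", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
hits = scan("GGCTATAAAGGC", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

With only four sites, the pseudocount contributes as many counts to each column as the data do, so the scores are heavily smoothed. Try building the matrix from more (or fewer) sites and see how the separation between the best window and the rest changes.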

 

3. Access the program MATCH which can be used to scan sequences for potential transcription factor binding sites.

 http://www.gene-regulation.com/pub/programs.html#match

 

If you haven't registered before, you will need to do so to access MATCH, but registration is free.

http://www.gene-regulation.com/register

 

Scan the promoter sequence using the full collection of vertebrate matrices. 

Identify the real binding sites in the output.

 

To do this, you need the coordinates of the real binding sites. The 3 annotated elements are listed below in GFF format; positions are sequence coordinates in which the transcription start site (TSS) is at position 1000.

 

 U43589   SP1       904  909       SP1     # GGGCGG

 U43589   CEBP      947  956       cEBP    # GTTGCGCAAG

 U43589   TATA      972  977       TATA    # TATAAG
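
If you want to check program output against these coordinates without doing it by hand, the following is a minimal sketch. It parses the three lines above, reports each site as an offset from the TSS (computed here simply as position minus 1000, which is an assumed convention), and tests whether a predicted hit overlaps an annotated site. The predicted hits and the names parse_sites and overlaps are hypothetical; replace the hits with the positions reported by MATCH.

```python
ANNOTATION = """\
U43589   SP1       904  909       SP1     # GGGCGG
U43589   CEBP      947  956       cEBP    # GTTGCGCAAG
U43589   TATA      972  977       TATA    # TATAAG
"""

TSS = 1000  # position of the transcription start site, as stated in the lab


def parse_sites(text):
    """Parse the annotation lines into a list of dicts."""
    sites = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields, _, motif = line.partition("#")
        seqname, feature, start, end, label = fields.split()
        sites.append({
            "name": label,
            "start": int(start),
            "end": int(end),
            "motif": motif.strip(),
        })
    return sites


def overlaps(site, hit_start, hit_end):
    """True if the hit interval shares at least one position with the site."""
    return hit_start <= site["end"] and site["start"] <= hit_end


sites = parse_sites(ANNOTATION)
for site in sites:
    # Sanity check: the coordinate span should match the motif length.
    assert site["end"] - site["start"] + 1 == len(site["motif"])
    print(site["name"], site["start"] - TSS, site["end"] - TSS)

# Hypothetical predicted hits as (name, start, end); NOT real MATCH output.
predicted_hits = [("hit_A", 970, 979), ("hit_B", 500, 510)]
confirmed = [
    (name, site["name"])
    for name, start, end in predicted_hits
    for site in sites
    if overlaps(site, start, end)
]
print(confirmed)
```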

 

 

4.  Repeat #3 above, using the program MatInspector

http://www.genomatix.de/cgi-bin/matinspector/matinspector.pl

 

The link provided for MatInspector in the original version of this lab didn't work. You probably Googled it and discovered that you must also register for free access to this software. It may take a while to get the password back (via email). If you do not get a response from MatInspector (or don't feel like waiting for one), just choose another promoter prediction program from your textbook (or from the excellent optional review Drena recommended in lecture; see the PPTs from Friday) and try it out. Answer Question 5 based on the program you chose instead of MatInspector.

 

 

5.  Which program do you like better and why?

 

6.   Use BLAST2SEQ at NCBI to align the human and mouse promoters (U43589 & U36238) and obtain a graphical output.  Set a very restrictive mismatch penalty (-5) and a neutral gap extension penalty (0) to recover short very conserved stretches of genomic sequence. 

 

Compare the alignment blocks with the annotations. 

Are the real binding sites conserved in these promoters?

 

If you actually try to set both the gap initiation and gap extension penalties to 0, you should get an error message.  Play around with the mismatch and gap penalty settings and examine the results.

Do any of these settings allow you to detect the "real" TF binding sites noted in 3.3 for the human promoter?
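
To get a feel for why a harsh mismatch penalty recovers only short, highly conserved blocks, consider the sketch below. It is NOT the BLAST algorithm: it scores a single ungapped diagonal of two equal-length, made-up sequences and reports the best-scoring segment. The match reward of +1 and the function name best_ungapped_segment are assumptions for illustration.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-5):
    """Highest-scoring ungapped local segment when a and b are aligned at offset 0.

    Returns (score, start, end) with 0-based, end-exclusive positions.
    """
    best = (0, 0, 0)
    score = 0
    start = 0
    for i, (x, y) in enumerate(zip(a, b)):
        score += match if x == y else mismatch
        if score <= 0:
            # The running segment is no longer worth extending; restart after i.
            score = 0
            start = i + 1
        elif score > best[0]:
            best = (score, start, i + 1)
    return best


human_toy = "ACGTTATAAGCCGT"  # made-up, not U43589
mouse_toy = "ACGATATAAGCTGT"  # made-up, not U36238

strict = best_ungapped_segment(human_toy, mouse_toy, mismatch=-5)
lenient = best_ungapped_segment(human_toy, mouse_toy, mismatch=-1)
print(strict, human_toy[strict[1]:strict[2]])
print(lenient, human_toy[lenient[1]:lenient[2]])
```

With a penalty of -5, a single mismatch ends the segment, so only the perfectly conserved core is reported. With -1, the segment extends across both mismatches and covers the whole toy sequence.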

 

 

7.  Now, repeat #6 using the promoter region of URO-D homologs from as many species as possible.

 

Search for conserved elements, using MEME and MAST. http://meme.sdsc.edu/meme/intro.html

 

Try this with CLUSTAL if you like.

 

Can conserved elements be identified across all the promoter sequences?

Do they correspond to the known binding motifs?