California State University Monterey Bay - ESSP 241L "Bio II Lab"
   

Introductory Bioinformatics Lab

 for Introduction to Cell and Molecular Biology
 Henrik Kibak - Fall 2004

 

 
Bioinformatics is emerging as a hugely important field affecting all areas of biology.  While bioinformatics is formally the application of computer technologies to biological sciences - ranging from automated analysis of microarrays containing thousands of individual experiments to the development of browser tools for looking at whole genomes - students in all areas of biology need to be familiar with software tools developed by bioinformaticians to accomplish routine tasks in biology.
 
Skills developed in this lab:
  • Use of National Center for Biotechnology Information (NCBI) databases
  • Retrieval of sequences from NCBI
  • Alignment of homologous protein sequences using ClustalX
  • Identification and visualization of evolutionarily conserved structures in proteins
  • Using ClustalX output to prepare phylogenetic networks (trees)
  • Testing evolutionary hypotheses
 
 

First we will look at the taxonomic position of Euglena using Cytochrome C as a demonstration exercise. 

You will then have the tools to answer the question: "Are whales and dolphins a sister group to Artiodactyls (even-toed ungulates)?  Or should they be placed within the Artiodactyls as a sister group to Hippopotami?" You will answer that question during next week's lab (Reading HERE).


Photo courtesy of The Euglenoid Project at Rutgers University. More on the Protist genus Euglena, including fantastic images!

Algae are Protists with chloroplasts. However, Euglena is a protist genus where some species have chloroplasts and others don't. So, are they more closely related to animals or plants?!!

To answer that question we will use the resources of the National Center for Biotechnology Information (NCBI).

 

 



It is impossible to provide a reasonable guide to even a small section of this tremendous resource... you will have to explore it yourself... Most of the instructions will be given in lab.  If you miss lab, you will have to work with a classmate to capture some of the steps.

For an example, look up "Mirounga"

As you can see, there is a vast amount of information cataloged even for this monachine phocid...

Try clicking on "PubMed Central: free, full text journal articles."

Here, for example, you will find an important article that should be read by all ESSP majors.

"Sequential megafaunal collapse in the North Pacific Ocean: An ongoing legacy of industrial whaling?"


To see what is available for Euglena let's enter that instead of Mirounga. Go ahead and refine the search a bit by clicking "Protein" and adding the search modifier for "organism"  like this:

Euglena [orgn]

That should reduce the number of hits a bit. Adding "cytochrome c" with quotes like this should help a lot:

Euglena [orgn] "cytochrome c"

Finally, if you add the search modifier for "protein" like this:

Euglena [orgn] "cytochrome c" [prot]

 ...it should knock it down to about three hits. These include the Cytochrome C sequences for Euglena viridis and Euglena gracilis, which were obtained many years ago by direct protein sequencing, and a more recent one with no information on how it was obtained.
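The three refinement steps above can also be written out as plain strings. The sketch below is only an illustration: the function names are made up for this handout, and the E-utilities address is an assumption about how the same search could be run outside the browser (in lab we simply type the terms into the NCBI search box). Nothing is sent over the network; the code only builds the text.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities search service (an assumption; not used in the lab itself).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(organism, phrase=None, as_protein_name=False):
    """Compose a search string in the style typed into the NCBI search box."""
    query = f"{organism} [orgn]"
    if phrase:
        # Quotes keep the words together as one phrase.
        query += f' "{phrase}"'
        if as_protein_name:
            # [prot] limits the phrase to the protein name field.
            query += " [prot]"
    return query


def esearch_url(query, database="protein"):
    """Return the URL that would run the query against one NCBI database."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": query})


# The three refinement steps from the text, from broadest to narrowest.
for query in (
    build_query("Euglena"),
    build_query("Euglena", "cytochrome c"),
    build_query("Euglena", "cytochrome c", as_protein_name=True),
):
    print(query)
    print("  ", esearch_url(query))
```

Each added modifier narrows the search, which is why the number of hits drops at every step.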

Create a folder called "Seqs" somewhere on your hard drive where you can find it again (Perhaps write down the pathname).

Save the first Euglena viridis sequence to that folder as a web page called "Cyt_c_Eug_vir.html"

Now erase your previous search terms and try typing in "Cytochrome C" in quotes... what results do you get when you search?

Click on "Protein" if you aren't already in the Protein database.

You should see "Page 1" of at least "1,567 pages" of results!!!  A bit more than Mirounga...

To refine the search try adding [prot] after the "Cytochrome C" - that should get it down to only 25 pages of results (!).

Finally try adding "mammalia" to the search terms as in the example below:

 

 

What do you see?  You should see that the results have been narrowed to 56 items (2004) on 3 pages. Click on the first one if it is P68096.

The sequences are available in a variety of formats which are selected via the "Display" button. The sequences can also be sent to "text" for printing or saved in a file. Copying and pasting into Notepad also works. There is also information associated with structure, taxonomy, other genes and publications, etc.

In order to save time I have downloaded five sequences for us to use in this exercise.  Follow the steps below the sequences.

 

The Cytochrome C sequences we will use (in FASTA format):

>Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA

>Euglena GI|117985:1-102 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD

>Hippo gi|65451 Cytochrome c [Hippopotamus amphibius]
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWG
EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE

>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK
GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK


>Rice gi|218249 Cytochrome C [Oryza sativa (japonica cultivar-group)]
MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA
NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS
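FASTA is a very simple format: a line starting with ">" names a sequence, and every following line up to the next ">" is part of that sequence. As a rough sketch of how a program reads such a file (the function names here are invented for illustration, and ClustalX does its own parsing), the Euglena record above can be read like this:

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # The first word after ">" is the short name; the rest is free description.
    name, _, description = header.partition(" ")
    return name, description.strip(), "".join(chunks)


example = """\
>Euglena GI|117985:1-102 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD
"""

for name, description, sequence in parse_fasta(example):
    print(name, "-", description, "-", len(sequence), "residues")
```

This is also why a short, recognizable first word after the ">" matters: it is the label that ends up on the alignment and on the tree.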

Preparing sequences for comparison by aligning them using ClustalX

  1. If you haven't already done so, create a folder called "Seqs" somewhere on your hard drive where you can find it again (Perhaps write down the pathname).

  2. Copy & Paste all five sequences above into Notepad.  It will help a lot later if you insert a reasonable name in the space behind the ">" as illustrated below.

            
  3. Save the file as "all_five.txt" or something similar, maybe "AraEugHipAnoOry.txt" IN THE FOLDER YOU HAVE JUST CREATED CALLED "Seqs"

  4. Now click on this link: Clustal and save this executable file to your "Seqs" folder so you have it if you need it in the future.

  5. Click on Clustal again, this time OPEN it.

  6. Go to "File" ---> "Load Sequences" and open your "all_five.txt".  Unfortunately this usually involves quite a bit of navigation on our lab computers... that's why I wrote "where you can find it again" in step 1 above.

  7. Notice the order of the names of the sequences on the left.  Use the slider to inspect the entire length.  Choose "Colors" ---> "Black & White" for a less cluttered image.  Then select "Alignment" ---> "Do Complete Alignment"

  8. Note that although there are some machine errors, the program does a fairly good job of aligning the sequences. The program has also automatically generated a file that can be opened with Notepad and edited.  Check for "all_five.aln" in your "Seqs" folder.



    Preparing phylogenetic trees based on the sequence comparison

  9. We want the program also to compare the aligned sequences for us and see how different they are from each other.  The more "different" they are, the less related they should be, and the more "distant" they should appear on a phylogenetic tree.  The program first finds the two most related sequences then adds the next most related "neighbor" sequence.  It calculates a difference score and outputs a little file of brackets and numbers that show the relationships and degree of relationship in the form of "branch lengths."

  10. Choose "Trees" ---> "Draw N-J Tree"
                     

  11. The output file will be called "all_five.ph" and look something like this:

              (
              (
              (
              Arabidopsis:0.04909,
              Rice:0.04912)
              :0.14420,
              Euglena:0.31160)
              :0.09117,
              Hippo:0.06395,
              Mosquito:0.10109);
              

  12. We will use a second program to generate a graphic representing the "tree" or network of relationships between the sequences. 
      • First save the program DrawTree to your "Seqs" folder.
      • Then save "fontfile" there as well by right-clicking on this fontfile link and "save target as".
      • Finally rename the output file "all_five.ph" --->  "intree"

  13. Go to your "Seqs" folder and RUN DrawTree FROM INSIDE THE "Seqs" FOLDER... and answer "Y" to run with the default settings.
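Step 9 above talks about a "difference score" between aligned sequences. The simplest such score is the fraction of aligned positions at which two sequences differ. The sketch below computes it for a made-up toy alignment; the sequences, names and functions are invented for illustration. It is not the calculation ClustalX performs and it is not the full neighbor-joining procedure, only the idea of finding the least different pair first.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of aligned positions at which two sequences differ.

    Columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def distance_matrix(aligned):
    """Map each unordered pair of names to its p-distance."""
    names = sorted(aligned)
    return {
        (first, second): p_distance(aligned[first], aligned[second])
        for i, first in enumerate(names)
        for second in names[i + 1:]
    }


def closest_pair(aligned):
    """Return the pair of names with the smallest p-distance."""
    matrix = distance_matrix(aligned)
    return min(matrix, key=matrix.get)


# Toy aligned fragments, made up for this example (not real data).
toy_alignment = {
    "SeqA": "GDVEKGKKIF",
    "SeqB": "GDVEKGKKLF",
    "SeqC": "GDAERG-KLF",
}

for pair, distance in distance_matrix(toy_alignment).items():
    print(pair, round(distance, 3))
print("least different pair:", closest_pair(toy_alignment))
```

The larger the score, the more "different" the two sequences are, and the farther apart they should end up on the tree.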


So Euglena is slightly more closely related to animals than plants...  Are you convinced?
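One way to probe that question is to add up the branch lengths along the route between Euglena and each of the other four sequences. The sketch below does this for the example "all_five.ph" text shown in step 11. The parser and function names are invented for this handout and handle only simple trees like this one; DrawTree does not report these sums, and branch-length totals are just one informal way of reading a tree (the branching order matters too).

```python
import re

# The example tree from step 11; your own "all_five.ph" may differ.
PH_TEXT = """
(
(
(
Arabidopsis:0.04909,
Rice:0.04912)
:0.14420,
Euglena:0.31160)
:0.09117,
Hippo:0.06395,
Mosquito:0.10109);
"""

PUNCTUATION = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse bracket notation into nested (name, length, children) tuples."""
    tokens = re.findall(r"[(),;]|[^(),;]+", "".join(text.split()))
    position = 0

    def parse_node():
        nonlocal position
        children = []
        if tokens[position] == "(":
            position += 1  # step past "("
            children.append(parse_node())
            while tokens[position] == ",":
                position += 1
                children.append(parse_node())
            if tokens[position] != ")":
                raise ValueError("expected ')'")
            position += 1
        label = ""
        if position < len(tokens) and tokens[position] not in PUNCTUATION:
            label = tokens[position]  # e.g. "Hippo:0.06395" or ":0.14420"
            position += 1
        name, _, length = label.partition(":")
        return name, float(length) if length else 0.0, children

    return parse_node()


def leaf_paths(node, address=(), branches=()):
    """Map each leaf name to the branches between the top of the tree and it."""
    name, _, children = node
    if not children:
        return {name: branches}
    paths = {}
    for index, child in enumerate(children):
        child_address = address + (index,)
        branch = (child_address, child[1])  # (where the branch is, its length)
        paths.update(leaf_paths(child, child_address, branches + (branch,)))
    return paths


def path_distance(paths, leaf_a, leaf_b):
    """Sum the lengths of the branches on the route between two leaves."""
    unshared = set(paths[leaf_a]) ^ set(paths[leaf_b])
    return sum(length for _, length in unshared)


paths = leaf_paths(parse_newick(PH_TEXT))
for other in ("Arabidopsis", "Rice", "Hippo", "Mosquito"):
    print("Euglena to", other, round(path_distance(paths, "Euglena", other), 5))
```

Compare the four totals, then look at where Euglena actually branches off in your DrawTree image, before deciding how convinced you are.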


"Are whales and dolphins a sister group to Artiodactyls (even-toed ungulates)?  Or should they be placed within the Artiodactyls as a sister group to Hippopotami?" You will answer that question during next week's lab.

 

Pancreatic Ribonuclease sequences for your project:


 You may, upon consultation with me, choose a different project for your lab... some of you may choose to work with fish, insects or plants... Or, perhaps the most challenging and interesting of all, comparing whales, seals, bears, weasels... However, be aware that it will take you extra time since you will have to find your own sequences to compare and confirm that they are what you think they are... not always easy for beginners.
 

 Your write-up should consist of:

  • A hypothesis, for example, "Whales and dolphins form a sister group with Hippopotami within the Artiodactyls."
  • A background paragraph explaining the controversy or question you are attempting to resolve.
  • The protein (or RNA) you chose to use for the analysis and why. For example, what is Pancreatic Ribonuclease?  Has it been used before for phylogenetic analysis? Why choose it for this taxonomic group? Hint: why wouldn't you use Pancreatic Ribonuclease to answer the Euglena question above? If possible, include a 3D image of your protein (see below).
  • The species you chose and why. Please prepare a figure showing the FASTA format collection of sequences using 8 pt. Courier font. Also include a photo of your organism.
  • A figure showing the alignment you produced. The caption should describe how the alignment was prepared. In the discussion, you should point out the highly conserved regions... can you speculate on why they are conserved?
  • A figure showing the resulting tree.  The caption should describe how the figure was prepared.
  • In the discussion conclude whether your hypothesis was supported or not, and provide suggestions for additional comparisons along with your reasoning.
  • Literature cited. 
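For the alignment figure and the discussion of conserved regions, ClustalX marks fully conserved columns with a "*" under the alignment. If you want to list those columns yourself, the sketch below shows one way to do it. The toy alignment and the function names are made up for illustration; paste in your own aligned sequences (all the same length, with "-" for gaps).

```python
def conserved_columns(aligned, gap="-"):
    """Return 0-based column indices where every sequence has the same residue."""
    sequences = list(aligned.values())
    if not sequences:
        return []
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("aligned sequences must have the same length")
    columns = []
    for index in range(width):
        residues = {seq[index] for seq in sequences}
        if len(residues) == 1 and gap not in residues:
            columns.append(index)
    return columns


def conserved_runs(columns, min_length=3):
    """Group consecutive column indices into (start, end) runs, end inclusive."""
    runs = []
    start = previous = None
    for index in columns:
        if previous is not None and index == previous + 1:
            previous = index
            continue
        if start is not None and previous - start + 1 >= min_length:
            runs.append((start, previous))
        start = previous = index
    if start is not None and previous - start + 1 >= min_length:
        runs.append((start, previous))
    return runs


# Toy alignment, made up for this example (not real data).
toy_alignment = {
    "SeqA": "MKCAQCHTVE-GAG",
    "SeqB": "MRCAQCHTVEKGGK",
    "SeqC": "-KCAQCHSSQKGVN",
}

columns = conserved_columns(toy_alignment)
print("conserved columns:", columns)
print("runs of 3 or more:", conserved_runs(columns))
```

Long runs of identical residues are good candidates to discuss in your write-up: ask what that part of the protein does.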

Due December 17, 2004.

 

 Another interesting dataset to try... Remember, you can pick and choose your own... no need to run them all.

>Rhinoceros (white) ATP7A [Ceratotherium simum]
IVYQPHLITVQEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSYTSDSTVTFI
VDGMHCKSCVSNIESALSTLQYISSIVVSLENRSAIVKYNASLVTPETLRKAIEAVSPGQYRVNITSEVE
STSNSPSSSSLQKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLSNGNGTVEYD
PLLTSPETLRKAIED

>Horse ATP7A [Equus caballus]
IVYQPHLITVEEIKKQIEAAGFPAFIKKQPKFLKLGAIDIERLKNTPVKSSERPQQRSPSCTNDSAVTFI
VDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAIVKYNASLVTPETLRKAIEAISPGQYRVSFPSEVE
STSNSPSGSSLHKIPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGNGTVEYD
PLLTSPETLRKAIED

>Hippopotamus ATP7A [Hippopotamus amphibius]
IVYQPHLITAEEIKKQIEAVGFPAFIRKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVVFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSAVVKYNASLVTPETLRKAIETMSPGQYKVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANSKGTVEYD
PLLTSPETLREAIED

>Elephant (African) ATP7A [Loxodonta africana]
IIYQPHLITAEEIKKQIEAVGFSAFIKKQPKYLTLGAIDVERLKNTPVRYSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSATVKYNASLVTPETLRKAIEAVSPGQYSVSITSDVE
STPSSPFSSYHQQIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISEKAGVKSIRVSLANSSGVIEYD
PLLNSPETLREAIEN

>Whale (Humpback) ATP7A [Megaptera novaeangliae]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLRLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIESALSTLQYVSSVVVSLENRSATVKYNASLVTPETLRKAIEAISPGQYRVSSTSEIE
STSNSPSSSSLQKSPLNIVSQPLTQETVINIDGMTCNSCVQSIEGVISKKAGVKSIRVSLANGKGTVEYD
PLLTSPETLREAIED

>Okapia (Giraffe family) ATP7A [Okapia johnstoni]
VVYQPHLITAEEIKKQIEAVGFTAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSSTSNSTVIFT
IDGMHCKSCVSNIESALSTFQHISSVVVSLENKSAIVKYNANLVTPEALRKAIEAISQGQYRVSTASDVG
STSNSPSSSSLQKSPLNVVSQPLTQETVINIDGMTCNSCVQSIEGVLSKKAGVKSVQVSLANGKGTVEYD
PLLTSPETLREAIED

>Pig (note "X" at 2nd to last residue) ATP7A [Sus scrofa]
YQPHLITVEEIKKQIEAVGFPVFIKKQPKYLKLGAIDIERLKNTPVKSLEGSPQRSTSYTNNSTVIFIID
GMHCKSCVSNIESALSTLQYVSSIVVSLENRTAIVKYNASLVTPETLRKAIEDISPGQYRVTSTSDIECT
SNSPSSSSLQKSPLNIVSQPLTQEAVINIDGMTCNSCVQSIEGVISKKPGVKYIRISLANGKGTVEYDPL
LTSPETLREXI

>Manatee (Caribbean) ATP7A [Trichechus manatus]
IVYQPHLITVEEIKKQIEAVGFSVFIKKQPKYLTLGAIDIERLKNTPVRSSEGSEQRSPSYTNDSTATFI
INGMHCKSCVSNIESALSTLQYVSSIAISLENRSANVKYNASLVTPETLRKTIEAISPGQYSVSITSDAE
STPSSPSSSYHQKIPLNIVSQPLTQETVINIGGMTCNSCVQSIEGVISKKAGVKSIQVSLVNSSGIIEYD
PLLNSPETLREAIEN

>Dolphin (Bottle-nosed) ATP7A [Tursiops truncatus]
VVYQPHLITAEEIKKQIEAVGFPAFIKKQPKYLKLGAIDIERLKNTPVKSSEGSQQRSPSYTNNSTVIFI
IDGMHCKSCVSNIENALSTLQYVSSVVVSLENRTATVXXKASLVTPETLRKAIEAISPGQYRVSSTNEIE
STSNSPSSSSLQKSPLSIVSQPLTQETGINIDGMTCNSCVQSIEGVILKKAGVKSIRVSLANGKGIVEYD
PLLTCPETLREAIED
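Note that the Pig header warns about an "X", and the Dolphin sequence also contains "XX". An "X" conventionally stands for a residue that could not be determined, so it is worth knowing where these fall before you interpret differences at those positions. A small sketch for finding them (the function name is invented for illustration) using the last line of the Pig sequence:

```python
def ambiguous_positions(sequence, symbol="X"):
    """Return 1-based positions of the ambiguity symbol in a protein sequence."""
    return [
        position
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue == symbol
    ]


# Last line of the Pig ATP7A sequence above.
pig_fragment = "LTSPETLREXI"
print(ambiguous_positions(pig_fragment))
```

Positions are counted from 1 within whatever string you pass in, so paste the whole sequence if you want positions relative to the full protein fragment.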



 

If you are desperate to produce a nicer image of your tree than a screen shot will provide, follow the steps below:

  1. When you have your tree visible in the "DrawTree" program, select "File" ---> "Plot"

  2. The program will seem to have crashed and closed, but you will find that it created a PostScript file called "plotfile" that can be read by "Adobe Acrobat Distiller."

  3. Right-click on "plotfile" and choose open with Acrobat Distiller.

  4. Acrobat Distiller will work for a few seconds and automatically generate a file called "plotfile.PDF"

 

 

  5. Close Acrobat Distiller and double-click on "plotfile.PDF"... it should open in Adobe Acrobat.

  6. If you like what you see, choose "File" ---> "Save As" and select "PNG Files (*.png)"

  7. When you open Microsoft WORD, you will be able to "Insert" ---> "Picture" ---> "From File" and move your tree around in your write-up document.

 

 

Viewing three dimensional structures of proteins and their sequences.

Some proteins have had their structures determined by X-ray crystallography or Nuclear Magnetic Resonance.  This is an arduous but rewarding endeavor and especially important for understanding enzyme mechanisms or for drug discovery.

 

Images: Cytochrome c - fully oxidized; Cytochrome c - fully reduced
 
  1. Return to the NCBI and this time select the "Structure" database with "Cytochrome C" Equus as your query.
 

 

  2. Scroll down until you see 1HRC. It may be on the second or third page.  If you don't find it, click here.  It should open as a Cn3D rotatable image.  If for some reason it doesn't, you may need to install the Cn3D browser plugin. This is a link to version 4.1 for the Windows operating system.  As updated versions are developed, they will be available at the NCBI Cn3D website. Macintosh and Unix versions are available there also.

  3. Under "Style" ---> "Options" ---> "Settings" the display can be radically altered.
 



© Henrik Kibak 2004