Wednesday 10 July 2019

SCIN082 Quiz No. 1 (10 July 2019)


GACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTTCCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTAGAAATTTGTACAGAGATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATGAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAGCTGAGACAACATCTGTTGAGGTGGGGACTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGA


1. What is the name of the protein encoded by this DNA sequence?
2. From which organism was this sequence derived?
3. When was the sequence loaded into the database?
4. What is the accession number of the sequence?
5. What is the percentage identity of the matched sequence to the query sequence?
6. What is the Expect (E) value of the result?


Tuesday 18 July 2017

WHAT IS BIOINFORMATICS?
(Molecular) bioinformatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

Broadly speaking, bioinformatics, or computational biology, is the application of computer science, statistics, and mathematics to problems in biology. Computational biology spans a wide range of fields within biology, including genomics/genetics, biophysics, cell biology, biochemistry, and evolution. Likewise, it makes use of tools and techniques from many different quantitative fields, including algorithm design, machine learning, Bayesian and frequentist statistics, and statistical physics.

What kinds of problems do computational biologists work on?

Much of computational biology is concerned with the analysis of molecular data, such as biosequences (DNA, RNA, or protein sequences), three-dimensional protein structures, gene expression data, or molecular biological networks (metabolic pathways, protein-protein interaction networks, or gene regulatory networks). A wide variety of problems can be addressed using these data, such as the identification of disease-causing genes, the reconstruction of the evolutionary histories of species, and the unlocking of the complex regulatory codes that turn genes on and off. Computational biology can also be concerned with non-molecular data, such as clinical or ecological data.

What are the differences between computational biology and bioinformatics?

The terms computational biology and bioinformatics are often used interchangeably. However, computational biology sometimes connotes the development of algorithms, mathematical models, and methods for statistical inference, while bioinformatics is more associated with the development of software tools, databases, and visualization methods.

For your Classwork No. 1

GACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTTCCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTAGAAATTTGTACAGAGATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAACTCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATGTGGGTGATGCATATTTTTCAGTTCCCTTAGATGAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAGCTGAGACAACATCTGTTGAGGTGGGGACTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGA


1. What is the name of the protein encoded by this DNA sequence?
2. From which organism was this sequence derived?
3. When was the sequence loaded into the database?
4. What is the accession number of the sequence?
5. What is the percentage identity of the matched sequence to the query sequence?
6. What is the Expect (E) value of the result?
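
Questions 5 and 6 refer to two numbers reported for every BLAST hit. As a rough sketch of what they mean: percentage identity is the share of alignment columns in which both sequences carry the same residue, and the E-value can be approximated from the bit score S' as E = m x n x 2^(-S'), where m is the query length and n is the total database length. The function names and the toy alignment below are illustrative assumptions, not BLAST's own code, and BLAST itself uses corrected ("effective") lengths, so its reported values will differ somewhat.

```python
def percent_identity(aligned_query: str, aligned_hit: str) -> float:
    """Percentage of alignment columns where both sequences match.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, h in zip(aligned_query, aligned_hit)
        if q == h and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Toy alignment (made up for illustration): 9 identical columns out of 10.
print(percent_identity("GACCTACACC", "GACCTTCACC"))  # 90.0
# Hypothetical bit score and lengths, only to show the call.
print(expect_value(bit_score=40.0, query_len=100, db_len=1_000_000))
```

A higher bit score or a smaller database gives a smaller E-value, which is why the same hit can look more or less significant depending on the database searched.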

Monday 17 July 2017

Sequence Databases

The NCBI Sequence Database

All published genome sequences are available over the internet, because scientific journals require that any published DNA, RNA or protein sequence be deposited in a public database. The main resources for storing and distributing sequence data are three large databases: the NCBI database (www.ncbi.nlm.nih.gov/), the European Molecular Biology Laboratory (EMBL) database (www.ebi.ac.uk/embl/), and the DNA Database of Japan (DDBJ) database (www.ddbj.nig.ac.jp/). These databases collect all publicly available DNA, RNA and protein sequence data and make them available for free. They exchange data nightly, so they contain essentially the same data.

In this chapter we will discuss the NCBI database. Note however that it contains essentially the same data as in the EMBL/DDBJ databases.
Sequences in the NCBI Sequence Database (or EMBL/DDBJ) are identified by an accession number. This is a unique number that is only associated with one sequence. For example, the accession number NC_001477 is for the DEN-1 Dengue virus genome sequence. The accession number is what identifies the sequence. It is reported in scientific papers describing that sequence.
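
Because the accession number identifies exactly one sequence, it can also be used to retrieve that record programmatically. The sketch below builds a request URL for NCBI's E-utilities "efetch" service; the function name and the default parameter choices (GenBank flat-file output from the nuccore database) are assumptions made for this example, and the download step is left commented out because it needs network access.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build an E-utilities efetch URL for a single accession number."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_url("NC_001477"))

# To actually download the record (needs network access):
# from urllib.request import urlopen
# with urlopen(efetch_url("NC_001477")) as response:
#     record_text = response.read().decode()
```

Passing rettype="fasta" instead of "gb" asks for the bare sequence in FASTA format rather than the full annotated record.
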
As well as the sequence itself, for each sequence the NCBI database (or EMBL/DDBJ databases) also stores some additional annotation data, such as the name of the species it comes from, references to publications describing that sequence, etc. Some of this annotation data was added by the person who determined the sequence and submitted it to the NCBI database, while some may have been added later by a human curator working for NCBI.
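
In the GenBank flat-file view of a record, each annotation is stored under a keyword such as DEFINITION or ORGANISM: the keyword sits in the first 12 columns, and the value follows, possibly continued on indented lines. The sketch below reads such fields from a record held as text. The toy record is made up for illustration (it is not a real NCBI entry), and the function name is an assumption of this example.

```python
# Toy GenBank-style record, invented for illustration only.
TOY_RECORD = "\n".join([
    "LOCUS       TOY00001       12 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Toy record for illustration, complete",
    "            sequence.",
    "ACCESSION   TOY00001",
    "SOURCE      synthetic construct",
    "  ORGANISM  synthetic construct",
    "            other sequences; artificial sequences.",
    "ORIGIN",
    "        1 gacctacacc tg",
    "//",
])


def read_field(record_text: str, keyword: str) -> list:
    """Return the value lines of one keyword (first line plus continuations)."""
    lines = []
    collecting = False
    for line in record_text.splitlines():
        if line[:12].strip() == keyword:
            lines.append(line[12:].strip())
            collecting = True
        elif collecting and line.startswith(" " * 12):
            lines.append(line.strip())  # continuation line
        elif collecting:
            break  # the next keyword ends the field
    return lines


print(" ".join(read_field(TOY_RECORD, "DEFINITION")))
print(read_field(TOY_RECORD, "ORGANISM")[0])  # first line is the organism name
```

For ORGANISM, the first value line is the organism name and the continuation lines hold its taxonomic lineage, which is why only the first element is printed.
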
The NCBI database contains several sub-databases, the most important of which are:
  • the NCBI Nucleotide database: contains DNA and RNA sequences
  • the NCBI Protein database: contains protein sequences
  • EST: contains ESTs (expressed sequence tags), which are short sequences derived from mRNAs
  • the NCBI Genome database: contains DNA sequences for whole genomes
  • PubMed: contains data on scientific publications
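
Each sub-database can also be searched separately through NCBI's E-utilities "esearch" service. The sketch below builds a URL that asks only for the number of records matching an organism name; the mapping from the sub-database names above to E-utilities identifiers covers just four of them (EST is left out on purpose), and the function name is an assumption of this example. The service answers with a small XML document whose Count element holds the number.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# E-utilities identifiers for some of the sub-databases listed above.
DB_NAMES = {
    "Nucleotide": "nuccore",
    "Protein": "protein",
    "Genome": "genome",
    "PubMed": "pubmed",
}


def organism_count_url(sub_database: str, organism: str) -> str:
    """Build an esearch URL that returns only the number of matching records."""
    params = {
        "db": DB_NAMES[sub_database],
        "term": f"{organism}[Organism]",  # restrict the search to the organism field
        "rettype": "count",
    }
    return f"{ESEARCH}?{urlencode(params)}"


print(organism_count_url("Nucleotide", "Dengue virus 1"))
```

The same search can be done by hand on the NCBI website by choosing the sub-database from the drop-down menu and typing the organism name followed by [Organism] in the search box.
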

Classwork 2

Q1. What information about the rabies virus sequence (NCBI accession NC_001542) can you obtain from its annotations in the NCBI Sequence Database?
What does it say in the DEFINITION and ORGANISM fields of its NCBI record? Note: rabies virus is the virus responsible for rabies, which is classified by the WHO as a neglected tropical disease.
Q2. How many nucleotide sequences are there from the bacterium Chlamydia trachomatis in the NCBI Sequence Database?
Note: the bacterium Chlamydia trachomatis is responsible for causing trachoma, which is classified by the WHO as a neglected tropical disease.

Tuesday 13 September 2016

Open Reading Frames

In bioinformatics we look for potential gene-coding sequences, known as open reading frames (ORFs). Entrez has a tool called ORF Finder (now you know why I like Entrez :-)
http://www.ncbi.nlm.nih.gov/projects/gorf/ Let's use this tool to find out what regions will code for a gene in this sequence.
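
To make the idea concrete, here is a minimal sketch of an ORF search over all six reading frames (three on each strand). It assumes that an ORF starts at ATG and ends at the first in-frame stop codon of the standard genetic code; ORF Finder itself offers more options, so its output may differ. The function names and the short toy sequence are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_len: int = 30):
    """Yield (strand, frame, start, end, length) for each ATG...stop ORF.

    start and end are 1-based positions on the strand that was scanned;
    length is in nucleotides and includes the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i  # first start codon of a new ORF
                elif codon in STOP_CODONS and start is not None:
                    length = i + 3 - start
                    if length >= min_len:
                        yield strand, frame + 1, start + 1, i + 3, length
                    start = None


# Toy sequence (made up): one short ORF, ATG AAA TTT TAG, in frame 3.
for orf in find_orfs("CCATGAAATTTTAGGG", min_len=9):
    print(orf)  # ('+', 3, 3, 14, 12)
```

When a sequence contains a single gene, the longest ORF is usually the first candidate to check, but that is a rule of thumb rather than proof.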

For your Classwork
ACTTTGCAGGCAGCGGCGGCCGGGGCGGAGCGGGATCGAGCCCTCGCCGCGGCCTGCCAGTCATGGGCCCGCGCCGCCGCCGCCGCCTGCCTCCCGGGCCACGCGGGCCGTGAGCGCCATGGCCGTAGCCCCCGCGGGCGGCCAGCACGCGCCAGCGCTGGAGGCCCTGCTCGGGGCGGGCGCGTTGCGGCTGCTCGACTCCTCGCAGATCGTCATCATCTCCACCGCGCCCGATGTCGGCGCCCCGCAGCTCCCCGCCGCGCCGCCCACTGGCCCTCGCGATTCTGACGTGCTGCTCTTCGCCACGCCGCAGGCGCCCCGACCCGCGCCTAGTGCACCGCGCCCGGCTCTCGGCCGCCCGCCGGTGAAACGGAGGCTGGATCTGGAGACTGACCATCAGTACCTCGCTGGTAGCAGTGGGCCATTCCGGGGCAGAGGCCGCCACCCAGGGAAAGGTGTGAAATCTCCGGGGGAGAAGTCACGCTATGAAACCTCACTAAATCTGACCACCAAACGCTTCTTGGAGCTGCTGAGCCGCTCAGCTGACGGTGTCGTTGACCTGAACTGGGCAGCTGAGGTGCTGAAGGTGCAGAAACGGCGCATCTATGACATCACCAATGTCCTGGAGGGCATCCAGCTCATTGCCAAGAAGTCCAAGAATCATATCCAGTGGCTAGGCAGCCACACCATGGTGGGGATTGGTAAGCGGCTTGAAGGCCTGACCCAGGACCTGCAGCAACTGCAGGAGAGTGAGCAGCAGCTGGATCACCTGATGCACATCTGTACCACACAGCTGCAACTGCTTTCGGAGGACTCCGACACCCAGCGCCTGGCCTATGTGACCTGCCAGGACCTTCGCAGCATTGCAGACCCTGCAGAACAGATGGTCATAGTGATCAAGGCCCCTCCTGAGACCCAACTACAAGCTGTGGATTCTTCAGAGACATTTCAGATCTCCCTTAAGAGCAAACAAGGCCCCATTGATGTTTTCCTGTGCCCGGAGGAGAGTGCAGACGGGATTAGCCCTGGGAAGACCTCATGCCAGGAGACATCCTCTGGGGAGGACCGGACTGCAGACTCTGGCCCAGCAGGGCCTCCACCATCACCTCCCTCCACATCCCCAGCCTTGGATCCCAGTCAATCCCTGTTGGGCCTGGAGCAAGAAGCAGTATTGCCACGGATGGGCCACCTGAGGGTCCCTATGGAAGAGGACCAACTGTCACCACTGGTGGCTGCTGACTCACTCCTGGAGCATGTTAAAGAAGACTTCTCTGGGCTCCTCCCTGGGGAGTTCATCAGCCTCTCCCCACCCCACGAGGCCCTTGACTATCACTTTGGTCTCGAGGAGGGTGAGGGCATTAGAGATCTCTTTGACTGTGACTTTGGGGACCTGACCCCTCTGGATTTCTGACAGAAGCCTAGGGATTCAGGGTGTCTGGAGATGCCCACCTGTCTGCAGCTTTGGAGCCTCCTGCCCTGGGCCATCCTTCCTGCCTCATTGGAATAGCACGATCCATACCCTCTGTCCCAATAGCTTCTAGCTCTGGGGTTTGGTTGCTGCCACATTGAGCAGACCAAAATGGGAAGGATGTTGTACAGTGTGTGTGCATGCACCCCACACTGCGCACTGTGTGCCTGGGGTGTGTGTCTGAGTGTGTGTGTGTGTGTGTGTGTGTGAGTGTGTGTGTGTGTGTGTGTGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATGTATGTGTATGTGCACGTGTGCCCGGGAATGAAGGTGAACACATCTGTATGTGTGCTGCAGACACATCCTGGTGTGTCCACATGTGTGCATGGATCCATGTGTGCGCATTGGGGTGGGGGTGGGCTCTAACTGCACTTTTGGTGTCCTTGCTGCAGGGGCCCTGTGAGGCCCAGGGTGGCTGCCTGCTTTCAGAATCCTGTGTGTCAGCCAGGCCGGGTGGTACAGCTTGCCTGGCTGGGTTTGCAGGGCAG
CAAGAGCACTGCTTAAAAGTTTTCCGATCGAAGCTTTAATGGAGCGTTTATTTATTTATCGAGGCCTCTGGCAAGCCTGGGGGGATAAGCAAAGGGTGGGGGGCATGGGTGATACCTTAAGTCCCTGTTCTCTGAAGCAAGGGCAGGATCCCTACCCAAGAGTTGCTGAGGCCCAAGCAGTTTATTTATTGGGAAAGGGAGAGGGAGACAGACTGACAGCCATGGATGGGCTGGAGAAACAGTCCCTTTGTACCAGTACTCCAGCCGCATGTATCCAGGGGATCTGAGATGGGGAGGGTACGTGAGGGCCTTGGCTGACTGCGGCCAGGAGGGGTGGGTATGCGTCCTTCCTATGGCTGGAGTGCTCCTCTGCTGTCCTCCCCACCCTCCAGTCTGCACTTTGATTTGTTTCCTAACAGTTCTGTTCCCTCCTGCTTTGATTTTAATAAATGTTTTGATG

1. Find the ORF regions
2. Which region is most probably the gene coding region if this sequence contains only a single gene?
3. What is the length of this most probable gene?
4. The gene will encode a protein molecule. How long will this protein molecule be?
5. How many Methionines are encoded in the gene-containing region?
6. In which position is the stop codon found?
7. What is the name of this most probable gene?
8. How did you determine the name of the best-matching gene? Report the E-value, total score, number of gaps and % identity of your best hit.
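
For questions 3 to 5, recall that every codon is three nucleotides long and that the stop codon does not encode an amino acid, so an ORF of L nucleotides (stop codon included) encodes a protein of L/3 - 1 residues. The sketch below translates an ORF with the standard genetic code and counts its methionines; the function name and the toy ORF are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate an ORF codon by codon, stopping at the first stop codon.

    Assumes the sequence contains only A, C, G and T.
    """
    orf = orf.upper()
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_orf = "ATGAAATTTATGTAG"        # made-up ORF: ATG AAA TTT ATG TAG
protein = translate(toy_orf)
print(protein)                      # MKFM
print(len(toy_orf) // 3 - 1)        # 4 residues, matching len(protein)
print(protein.count("M"))           # 2 methionines
```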

EXERCISE 1 [ORF VERSUS GENES]

1.     In our original sequenced DNA, the following ORFs were predicted:

(a)  What is an ORF and how does it differ from a gene-coding region?                        (2)


(b)  Which ORF is most probably the gene-coding region? Circle it.                   (1)
(c)  How long is this open reading frame?_________________                          (1)
(d)  In which frame was this most probable ORF found?______________        (1)
(e)  Predict the length of the protein that would be coded by this ORF_________(1)

2.     The 3D structure of the resulting gene product looks like this. Describe the protein.                   (7)

Saturday 10 September 2016

Protein Structure Databases

The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies. As a member of the wwPDB, the RCSB PDB curates and annotates PDB data according to agreed-upon standards.
Use the RCSB PDB to perform simple and advanced searches based on annotations relating to sequence, structure and function, and to visualize, download, and analyze molecules.
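
A downloaded PDB entry is a plain-text file in which each atom occupies one fixed-width ATOM line, so simple facts about a structure can be read directly from the columns. The sketch below counts the residues in each chain. The download URL pattern, the function names and the toy ATOM lines (with invented coordinates) are assumptions of this example; 4HHB is used only as an example entry ID.

```python
from collections import defaultdict

PDB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id: str) -> str:
    """Build the download URL for a PDB-format entry (assumed URL pattern)."""
    return PDB_DOWNLOAD.format(pdb_id=pdb_id.upper())


def residues_per_chain(pdb_text: str) -> dict:
    """Count distinct residues per chain from the ATOM records."""
    seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            res_name = line[17:20]  # columns 18-20: residue name
            chain_id = line[21]     # column 22: chain identifier
            res_seq = line[22:27]   # columns 23-27: residue number + insertion code
            seen[chain_id].add((res_seq, res_name))
    return {chain: len(residues) for chain, residues in seen.items()}


# Toy ATOM lines with invented coordinates, for illustration only.
TOY_PDB = "\n".join([
    "ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      11.000  10.000  10.000  1.00 20.00           C",
    "ATOM      3  N   LYS A   2      12.000  10.000  10.000  1.00 20.00           N",
    "ATOM      4  N   GLY B   1      13.000  10.000  10.000  1.00 20.00           N",
])

print(pdb_url("4hhb"))
print(residues_per_chain(TOY_PDB))  # {'A': 2, 'B': 1}
```

The number of chains and the number of residues in each are a useful starting point when describing a structure, before looking at secondary structure and ligands in the RCSB viewer.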

Choose a protein relevant to your current research focus and view its structural components on the RCSB PDB site: http://www.rcsb.org/pdb/home/home.do
Have fun
For your class-work: fully describe the protein structure

Friday 9 September 2016

Functional Analysis of Proteins

Today, let's open ExPASy and use PROSITE to look at the functional characterization of this protein sequence:

MVQRWLYSTNAKDIAVLYFMLAIFSGMAGTAMSLIIRLELAAPGSQYLHGNSQLFNVLVVGHAVLMIFCAPFRLIYHCIEVLIDKHISVYSINENFTVSFWFWLLVVTYMVFRYVNHMAYPVGANSTGTMACHKSAGVKQPAQGKNCPMARLTNSCKECLGFSLTPSHLGIVIHAYVLEEEVHELTKNESLALSKSWHLEGCTSSNGKLRNTGLSERGNPGDNGVFMVPKFNLNKVRYFSTLSKLNARKEDSLAYLTKINTTDFSELNKLMENNHNKTETINTRILKLMSDIRMLLIAYNKIKSKKGNMSKGSNNITLDGINISYLNKLSKDINTNMFKFSPVRRVEIPKTSGGFRPLSVGNPREKIVQESMRMMLEIIYNNSFSYYSHGFRPNLSCLTAIIQCKNYMQYCNWFIKVDLNKCFDTIPHNMLINVLNERIKDKGFMDLLYKLLRAGYVDKNNNYHNTTLGIPQGSVVSPILCNIFLDKLDKYLENKFENEFNTGNMSNRGRNPIYNSLSSKIYRCKLLSEKLKLIRLRDHYQRNMGSDKSFKRAYFVRYADDIIIGVMGSHNDCKNILNDINNFLKENLGMSINMDKSVIKHSKEGVSFLGYDVKVTPWEKRPYRMIKKGDNFIRVRHHTSLVVNAPIRSIVMKLNKHGYCSHGILGKPRGVGRLIHEEMKTILMHYLAVGRGIMNYYRLATNFTTLRGRITYILFYSCCLTLARKFKLNTVKKVILKFGKVLVDPHSKVSFSIDDFKIRHKMNMTDSNYTPDEILDRYKYMLPRSLSLFSGICQICGSKHDLEVHHVRTLNNAANKIKDDYLLGRMIKMNRKQITICKTCHFKVHQGKYNGPGL

Click on: http://www.expasy.ch/
and open PROSITE
Look at the following:
0. Domain structure of the protein
1. Clustal format (first 3 sequences)
   • Retrieve the sequence logo from the alignment (for 15 amino acids)
2. Taxonomic tree view of all Swiss-Prot/TrEMBL entries matching our protein
3. Retrieve a list of all Swiss-Prot/TrEMBL entries matching our protein
4. Scan Swiss-Prot/TrEMBL entries against our protein
5. View ligand-binding statistics for our protein
6. Click on the sequence ID and retrieve the sequence logo from the alignment
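
Many PROSITE entries are short patterns, and scanning a sequence with one works much like a regular-expression search. The sketch below converts a PROSITE pattern into a Python regular expression and reports where it matches. It uses the N-glycosylation site pattern N-{P}-[ST]-{P} as the example; the function names and the toy sequence are assumptions, and the converter handles only the common syntax elements (not, for instance, '>' inside square brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
        element = element.replace("x", ".")                     # x    -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_sites(sequence: str, pattern: str) -> list:
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}"))       # N[^P][ST][^P]
print(find_sites("AANGTAA", "N-{P}-[ST]-{P}"))  # [(3, 'NGTA')]
```

A pattern this short matches by chance in many proteins, so a hit should be read as a candidate site to check rather than as evidence of function.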

For your classwork, here is your sequence
MLDQQTINIIKATVPVLKEHGVTITTTFYKNLFAKHPEVRPLFDMGRQESLEQPKALAMT
VLAAAQNIENLPAILPAVKKIAVKHCQAGVAAAHYPIVGQELLGAIKEVLGDAATDDILD
AWGKAYGVIADVFIQVEADLYAQAVE