Introduction to Bioinformatics

Sequence Alignment

Protein Motifs and Domain Prediction

1. What is the length of a motif, in terms of amino acid residues?

a) 30-60

b) 10-20

c) 70-90

d) 1-10

Answer: b

Explanation: A typical motif is 10-20 amino acids long. An example is the Zn-finger motif, which is seen in transcription factors. Motifs are also referred to as super secondary structures.

2. On average, what is the length of a typical domain?

a) About 100 residues

b) About 300 residues

c) About 500 residues

d) About 900 residues

Answer: a

Explanation: The predicted optimal number of residues, which corresponds to the maximum free energy of unfolding, is about 100. This agrees with a statistical analysis derived from experimental structures. For a chain that is too short, the change in enthalpy from internal interactions is not favorable enough for folding because of the limited number of inter-residue contacts. A chain that is too long is also unfavorable for a single domain.

3. Which of the following is false about the ‘loop’ structure in proteins?

a) They connect helices and sheets

b) They are more tolerant of mutations

c) They are more flexible and can adopt multiple conformations

d) They are never the components of active sites

Answer: d

Explanation: Loops are frequently components of active sites because they are flexible and are situated on the surface of the structure. In addition, they vary in length and 3-D configuration, which gives them even more chances of being part of active sites.

4. Which of the following common structural motifs is described wrongly?

a) β-hairpin – adjacent antiparallel strands

b) Greek key – 4 adjacent antiparallel strands

c) β-α-β – 2 parallel strands connected by helix

d) β-α-β – 2 antiparallel strands connected by helix

Answer: d

Explanation: In the β-α-β motif, two adjacent parallel β strands are connected by an α helix running from the C-terminus of strand 1 to the N-terminus of strand 2. Most protein structures that contain parallel beta-sheets are built up from combinations of such β-α-β motifs.

5. Which of the following least describes Long Loop β-hairpins?

a) They are often referred to as a ‘random coil’ conformation

b) Generally they are referred to as the β-meander super secondary structure

c) Loop looks similar to the Greek Letter Ω

d) Wide range of conformations with very specific sequence preferences

Answer: d

Explanation: Long loop β-hairpins adopt a wide range of conformations with no particular sequence preferences. As the name ‘meander’ suggests, their conformation is also quite unspecified. In addition, long loop β-hairpins are a special case of Ω loops, which explains a lot about their structural preferences.

6. Motifs that can form the α/β horseshoe conformation are rich in which amino acid residue?

a) Proline

b) Arginine

c) Valine

d) Leucine

Answer: d

Explanation: With a specific pattern of leucine residues, the strands form a curved sheet with helices on the outside. Leucine-rich repeats (LRRs) are 20-29-residue sequence motifs present in a number of proteins with diverse functions. The primary function of these motifs appears to be to provide a versatile structural framework for the formation of protein-protein interactions.

7. Which of the following wrongly describes protein domains?

a) They are made up of one secondary structure

b) Defined as independently foldable units

c) They are stable structures as compared to motifs

d) They are separated by linker regions

Answer: a

Explanation: Protein domains are made up of two or more motifs, i.e. combinations of secondary structure elements, that form stable, folded 3-D structures. They are conserved parts of the protein sequence and can evolve, function, and exist independently of the rest of the protein chain.

8. The helix-loop-helix structural motif is contained in all of the following except

a) Scleraxis

b) Neurogenins

c) Transcription Factor 4

d) Leucine zipper

Answer: d

Explanation: The leucine zipper is associated with gene regulation and contains an alpha helix with a leucine at every 7th amino acid. The rest of the options belong to the helix-loop-helix family, one of the largest families of dimerizing transcription factors.

9. Which of the following is not the function of Short Linear Motifs?

a) Irreversible cleavage of the peptide at the SLiM

b) Reversible cleavage of the peptide at the SLiM

c) Moiety addition at targeted sites on SLiM

d) Structural modifications of the peptide backbone

Answer: b

Explanation: Short Linear Motifs are short stretches of protein sequence that mediate protein-protein interaction. SLiMs can act as recognition sites for endopeptidases, resulting in irreversible cleavage of the peptide at the SLiM, so reversible cleavage is not one of their functions.

10. In the zinc finger, which residues in this sequence motif form ligands to a zinc ion?

a) Cysteine and histidine

b) Cysteine and arginine

c) Histidine and proline

d) Histidine and arginine

Answer: a

Explanation: In the zinc finger, which is found in a widely varying family of DNA-binding proteins, cysteine and histidine residues in this sequence motif form ligands to a zinc ion whose coordination is essential to stabilize the tertiary structure.

Motif and Domain Databases Using Regular Expressions

This set of Bioinformatics Interview Questions and Answers focuses on “Motif and Domain Databases Using Regular Expressions”.

1. While scanning for similarities in motifs, how do regular expression techniques work?

a) It represents a sequence family by a string of characters and further compares them

b) An algorithm similar to dynamic programming is used

c) Dot matrix analysis is used in this type of sequence analysis

d) Matrix analysis methods are used in this type

Answer: a

Explanation: In regular expression techniques, a sequence family is represented by a string of characters, and pattern matching has a true or false outcome. In other words, if the pattern described by the regular expression is found in a string of letters, the answer is true.

2. Which of the following best defines regular expressions?

a) They are made up of terms, operators and modifiers

b) They describe string or set of strings to find matching patterns

c) They are strictly restricted to alignment and corresponding score

d) They consist of a set of rules for the connotations of various amino acid residues

Answer: b

Explanation: Regular expressions are a powerful notational algebra that describes a string or set of strings in order to find matching patterns. Pattern matching has a true or false outcome. It is true that they are made up of terms, operators and modifiers, but these are components used in the matching process rather than a definition.

3. In regular expressions, which of the following patterns is wrongly matched with its significance?

a) [ ] – Or

b) { } – Not

c) ( ) – Repeats

d) Z – Any

Answer: d

Explanation: Each symbol has its own significance in the regular expression system. For example, [GA] means ‘G or A’, {V,P} means not P or V, and x(4) means xxxx. Likewise, x (not Z) denotes any residue.
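The symbols above map fairly directly onto ordinary regular expression syntax. The following is a minimal sketch, assuming a PROSITE-style pattern written with the symbols used in these questions; `prosite_to_regex` is a hypothetical helper written for illustration, not a function from any library, and it handles only the symbols listed here.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative helper (an assumption, not a library function); it covers
    only the symbols discussed in the questions above.
    """
    pattern = pattern.rstrip(".")            # a trailing '.' ends the pattern
    parts = []
    for element in pattern.split("-"):       # '-' separates positions
        prefix = suffix = ""
        if element.startswith("<"):          # '<' anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # '>' anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (4) or (2,3).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # x: any residue
            core = "."
        elif core.startswith("{"):           # {V,P}: any residue except V or P
            core = "[^" + core[1:-1].replace(",", "") + "]"
        elif core.startswith("["):           # [GA]: G or A
            core = core.replace(",", "")
        if repeat:                           # (4) -> {4}, (2,3) -> {2,3}
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

# Usage: convert a pattern, then test a sequence for a true/false match.
regex = prosite_to_regex("[GA]-{V,P}-x(4)-C.")
found = re.search(regex, "MGLAAAACK") is not None
```

The sequence `MGLAAAACK` is made up for the example. The conversion keeps the true or false character of regular expression matching: the search either finds the pattern or it does not.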

4. In terminologies related to regular expressions which of the following is false about terms and operators?

a) Terms are strings or substrings

b) Operators combine terms and expressions

c) Operators do not have precedence

d) Operators have precedence like arithmetic operators

Answer: c

Explanation: For harmonious, efficient and error-free functioning of the matching process, operators have precedence, which sets the priority of the operations carried out during matching.

5. In regular expressions, which of the following patterns is wrongly matched with its significance?

a) ‘-’ – separator

b) < – N-terminal

c) > – C-terminal

d) ‘>>’ – end

Answer: d

Explanation: Each symbol has its own significance in the regular expression system. For example, x(2,3) means x-x or x-x-x. Similarly, ‘.’ (not ‘>>’) marks the end of the pattern.

6. Emotif uses which databases for alignment of sequences?

a) BLOCKS and PRINTS databases

b) PROSITE

c) BLOCKS

d) PRINTS

Answer: a

Explanation: Emotif is a motif database that uses multiple sequence alignments from both the BLOCKS and PRINTS databases, with an alignment collection much larger than PROSITE. It identifies patterns by allowing fuzzy matching of regular expressions. Therefore, it produces fewer false negatives than PROSITE.

7. While analysing motif sequences, what is the major disadvantageous feature of PROSITE?

a) The database constructs profiles to complement some of the sequence patterns

b) The functional information of these patterns is primarily based on published literature

c) Some of the sequence patterns are too short to be specific

d) Lack of specificity about probability and variation and relation between them

Answer: c

Explanation: The major pitfall of the PROSITE patterns is that some of the sequence patterns are too short to be specific. The rest of the options are advantages. The problem with these short sequence patterns is that a resulting match is very likely to be the result of random events. Overall, PROSITE has a greater than 20% error rate. Thus, either a match or a non-match in PROSITE should be treated with caution.

8. Which of the following is not a characteristic of Fuzzy or approximate matches in regular expression?

a) This method is able to include more variant forms of a motif with a conserved function

b) The rule of matching is based on observations, not actual assumptions

c) With the more relaxed matching, there is an increase in the noise level and false positives

d) The rule of matching is based on assumptions, not actual observations

Answer: b

Explanation: The rule of matching is based on assumptions not actual observations in Fuzzy or approximate matches in regular expression. This provides more permissive matching by allowing more flexible matching of residues of similar biochemical properties. For example, if an original alignment only contains phenylalanine at a particular position, fuzzy matching allows other aromatic residues (including unobserved tyrosine and tryptophan) in a sequence to match with the expression.
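The phenylalanine example can be sketched in code. This is a minimal illustration, assuming a small hand-made table of residue classes (`SIMILAR` is hypothetical; real tools define their own groupings) and an alignment given as a list of columns, each column being the residues observed at that position.

```python
import re

# Hypothetical residue classes for illustration only.
SIMILAR = {
    "F": "FYW", "Y": "FYW", "W": "FYW",                    # aromatic
    "D": "DE", "E": "DE",                                  # acidic
    "K": "KRH", "R": "KRH", "H": "KRH",                    # basic
    "I": "ILVM", "L": "ILVM", "V": "ILVM", "M": "ILVM",    # hydrophobic
}

def exact_pattern(columns):
    """Regex that allows only the residues actually observed in each column."""
    return "".join("[" + "".join(sorted(set(col))) + "]" for col in columns)

def fuzzy_pattern(columns):
    """Regex that widens each observed residue to its whole (assumed) class."""
    parts = []
    for col in columns:
        allowed = set()
        for residue in col:
            # Residues without a class entry are kept as they are.
            allowed.update(SIMILAR.get(residue, residue))
        parts.append("[" + "".join(sorted(allowed)) + "]")
    return "".join(parts)

# Only phenylalanine was observed at the second position of this toy alignment.
columns = ["C", "F", "K"]
query = "ACYK"  # carries tyrosine where the alignment has phenylalanine
exact_hit = re.search(exact_pattern(columns), query) is not None
fuzzy_hit = re.search(fuzzy_pattern(columns), query) is not None
```

With these assumptions the exact pattern rejects the query while the fuzzy pattern accepts it. The same relaxation that rescues this variant also raises the chance of false positives.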

9. Which of the following is not a characteristic of exact matches in regular expression?

a) There must be a strict match of sequence patterns

b) Any variations in the query sequence from the predefined patterns are not allowed

c) Provide more permissive matching by allowing more flexible matching of residues of similar biochemical properties

d) Searching a motif database using this approach results in either a match or non-match

Answer: c

Explanation: In this type of matching, there has to be a strict match of sequence patterns. This way of searching has a good chance of missing truly relevant motifs that have slight variations, thus generating false-negative results. As new sequences of a motif accumulate, the rigid regular expression tends to become obsolete if it is not updated regularly to reflect the changes.

10. What does the representation R.L.[EQD] mean?

a) An arginine- Amino acid- Leucine- Amino acid- Either Aspartic acid, glutamic acid or glutamine

b) An arginine- Leucine- Either Aspartic acid, glutamic acid or glutamine

c) An arginine- Leucine- Amino acid- Either Aspartic acid, glutamic acid or glutamine

d) An arginine- Leucine- Aspartic acid and glutamic acid and glutamine

Answer: a

Explanation: This is an example of the PEXEL motif. Here, each ‘.’ represents any amino acid, as mentioned in the answer, and [ ] means ‘or’, i.e. any one of the listed residues is present at the given position.
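The pattern R.L.[EQD] is already valid syntax for Python's `re` module, where `.` matches any single character and `[EQD]` matches one of E, Q or D. A minimal sketch follows; the function name `find_motif` and the test sequences are made up for illustration.

```python
import re

# R, any residue, L, any residue, then one of E, Q or D.
PEXEL = re.compile(r"R.L.[EQD]")

def find_motif(sequence):
    """Return (start index, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in PEXEL.finditer(sequence)]

hits = find_motif("MKRALSEQW")   # contains R-A-L-S-E starting at index 2
misses = find_motif("MKRLE")     # R-L-E lacks the two 'any residue' positions
```

The second sequence corresponds to option b, which omits the two unspecified positions and therefore does not satisfy the pattern.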

Motif and Domain Databases Using Statistical Models

This set of Bioinformatics Questions and Answers for Freshers focuses on “Motif and Domain Databases Using Statistical Models”.

1. Which of the following is not an advantage of statistical model methods in analysing protein motifs?

a) Sequence information is preserved from a multiple sequence alignment and expressed with probabilistic models

b) Statistical models allow partial matches and compensate for unobserved sequence patterns using pseudo-counts

c) Statistical models have stronger predictive power than the regular expression based approach, even when they are derived from a limited set of sequences

d) The comparative flexibility is less in case of these methods when compared to regular expressions methods

Answer: d

Explanation: The major limitation of regular expressions is that this method does not take into account sequence probability information about the multiple alignment from which it is modeled, making them less flexible. If a regular expression is derived from an incomplete sequence set, it has less predictive power because many more sequences with the same type of motifs are not represented. Unlike regular expressions, position-specific scoring matrices (PSSMs), profiles, and HMMs preserve the sequence information from a multiple sequence alignment and express it with probabilistic models.

2. For motif scanning, which of the following programs or databases provides regulatory sites curated from scientific literature?

a) ENSEMBL

b) ORegAnno

c) MAST

d) Clover

Answer: b

Explanation: ORegAnno is a database of regulatory sites curated from the scientific literature. Clover identifies overrepresented motifs in sequences, whereas MAST allows users to scan different databases for matches to motifs. ENSEMBL is an online genomic sequence repository that also includes online tools for data mining as well as BLAST searches.

3. Which of the following is not an advantageous feature or algorithm of the database PRINTS?

a) This program breaks down a motif into even smaller non-overlapping units called ‘fingerprints’, which are represented by unweighted PSSMs

b) To define a motif, at least a majority of fingerprints are required to match with a query sequence

c) A query that has simultaneous high-scoring matches to a majority of fingerprints belonging to a motif is a good indication of containing the functional motif

d) The difficulty of recognizing short motifs when they reach the size of single fingerprints

Answer: d

Explanation: PRINTS is a protein fingerprint database containing ungapped, manually curated alignments corresponding to the most conserved regions among related sequences. The drawbacks of PRINTS are the difficulty of recognizing short motifs when they reach the size of single fingerprints, and a relatively small database, which restricts detection of many motifs.

4. In which of the following multipurpose packages is the Gibbs sampling algorithm used?

a) Consensus

b) BEST

c) AlignACE

d) PhyloCon

Answer: c

Explanation: The Gibbs sampling algorithm can identify multiple motifs in a sequence set using an iterative masking procedure. It is used in AlignACE, whereas BEST is a suite of four motif discovery tools integrated in a graphical user interface. Also, the Consensus program finds motifs in a set of unaligned sequences, and PhyloCon builds on this framework by modeling conservation across orthologous genes from multiple species.

5. Which of the following is untrue in case of the database BLOCKS?

a) The alignments are automatically generated using the same data sets used for deriving the BLOSUM matrices

b) The derived ungapped alignments, called ‘blocks’, are usually longer than motifs and are subsequently converted to PSSMs

c) A weighting scheme and pseudo counts are subsequently applied to the PSSMs to account for underrepresented and unobserved residues in alignments

d) The functional annotation of blocks is not consistent with that for the motifs

Answer: d

Explanation: BLOCKS is a database that uses multiple alignments derived from the most conserved, ungapped regions of homologous protein sequences. Because blocks often encompass motifs, the functional annotation of blocks is thus consistent with that for the motifs. A query sequence can be used to align with pre-computed profiles in the database to select the highest scored matches. Because of the use of the weighting scheme, the signal-to-noise ratio is improved relative to PRINTS.
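To make the idea of converting an ungapped block into a PSSM with pseudocounts concrete, here is a minimal sketch. It is not the actual BLOCKS procedure: it assumes a uniform background frequency, adds the same pseudocount to every residue, and leaves out sequence weighting entirely. The names `build_pssm` and `score_segment` and the three-sequence block are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(block, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from an ungapped block.

    Assumptions: uniform background frequencies, one flat pseudocount per
    residue, and no sequence weighting.
    """
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in an ungapped block must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in block)
        # The pseudocount keeps unobserved residues from scoring minus infinity.
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm

def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

block = ["CFK", "CFR", "CYK"]
pssm = build_pssm(block)
observed = score_segment(pssm, "CFK")
unobserved = score_segment(pssm, "WWW")
```

Unlike a regular expression, the matrix gives a partial match a graded score: residues seen often in a column score positively, and residues never seen receive a finite negative score instead of an outright rejection.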

6. Which of the following is false in case of the database Pfam and its algorithm?

a) Each motif or domain is represented by an HMM profile generated from the seed alignment of a number of conserved homologous proteins

b) Since the probability scoring mechanism is more complex in HMM than in a profile-based approach, the use of HMM yields further increases in sensitivity of the database matches

c) Pfam-B only contains sequence families not covered in Pfam-A

d) The functional annotation of motifs in Pfam-A is often related to that in UNIPROT

Answer: d

Explanation: Pfam is a database with protein domain alignments derived from sequences in SWISSPROT and TrEMBL. The Pfam database is composed of two parts, Pfam-A and Pfam-B. Pfam-A involves manual alignments and Pfam-B, automatic alignment in a way similar to ProDom. The functional annotation of motifs in Pfam-A is often related to that in PROSITE. Because of the automatic nature, Pfam-B has a much larger coverage but is also more error prone because some HMMs are generated from unrelated sequences.

7. Which of the following is false in case of the database SMART and its algorithm?

a) Contains HMM profiles constructed from manually refined protein domain alignments

b) Alignments in the database are built based on tertiary structures whenever available or based on PSI-BLAST profiles

c) Alignments are further checked but not refined by human annotators before HMM profile construction

d) SMART stands for Simple Modular Architecture Research Tool

Answer: c

Explanation: Alignments are further checked and refined by human annotators before HMM profile construction. Protein functions are also manually curated. Thus, the database may be of better quality than Pfam, with more extensive functional annotations. Compared to Pfam, the SMART database contains an independent collection of HMMs, with emphasis on signaling, extracellular, and chromatin-associated motifs and domains. Sequence searching in this database produces a graphical output of domains with well-annotated information with respect to cellular localization, functional sites, super-family, and tertiary structure.

8. Which of the following is false in case of the database InterPro and its algorithm?

a) InterPro is an integrated pattern database designed to unify multiple databases for protein domains and functional sites

b) This database integrates information from PROSITE, Pfam, PRINTS, ProDom, and SMART databases

c) Only overlapping motifs and domains in a protein sequence derived by all five databases are included

d) All the motifs and domains in a protein sequence derived by all five databases are included

Answer: d

Explanation: Only overlapping motifs and domains in a protein sequence derived by all five databases are included in the database. The InterPro entries use a combination of regular expressions, fingerprints, profiles, and HMMs in pattern matching. However, an InterPro search does not obviate the need to search other databases because of its unique criteria of motif inclusion and thus may have lower sensitivity than exhaustive searches in individual databases. A popular feature of this database is a graphical output that summarizes motif matches and has links to more detailed information.

9. Which of the following is false in case of CDART and its algorithm?

a) CDART is a domain search program that combines the results from RPS-BLAST, SMART, and Pfam

b) The program is now an integral part of the regular BLAST search function

c) CDART is a substitute for individual database searches

d) It stands for Conserved Domain Architecture

Answer: c

Explanation: CDART is a domain search program that combines the results from various database searches. As with InterPro, CDART is not a substitute for individual database searches because it often misses certain features that can be found in SMART and Pfam.

10. Point out the wrong or irrelevant mathematical method in motif analysis.

a) Enumeration

b) Probabilistic Optimization

c) Deterministic Optimization

d) Literature mining

Answer: d

Explanation: All the rest of the options are indeed valid and proven mathematical methods that contain efficient algorithms for finding motifs in protein sequences. Literature mining is not a mathematical algorithm or tool as such to be used in identifying motifs. But it is definitely a part of research when it comes to finding the function of various protein sequences.

Protein Family Databases

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Protein Family Databases”.

1. Which of the following statements about COG is incorrect regarding its features?

a) Currently, there are 4,873 clusters in the COG databases derived from unicellular organisms

b) It is constructed by comparing protein sequences encoded in forty-three completely sequenced genomes, which are mainly from prokaryotes, representing thirty major phylogenetic lineages

c) The interface for sequence searching in the COG database is the COGnitor program, which is based on gapped BLAST

d) It is a protein family database based on structural classification

Answer: d

Explanation: COG, which stands for Cluster of Orthologous Groups, is a protein family database based on phylogenetic classification. Because orthologous proteins shared by three or more lineages are considered to have descended through a vertical evolutionary scenario, if the function of one of the members is known, the function of the other members can be assigned.

2. Which of the following statements about ProtoNet is incorrect regarding its features?

a) Protein relatedness is defined by the P-values from the BLAST alignments

b) The most closely related sequences are grouped into the lowest level clusters

c) More distant protein groups are merged into higher levels of clusters

d) The outcome of this cluster merging is a tree-like structure of functional categories

Answer: a

Explanation: ProtoNet is a database of clusters of homologous proteins similar to COG. Protein relatedness is defined by the E-values, not the P-values, from the BLAST alignments. The database further provides gene ontology information for the protein clusters at each level, as well as keywords from InterPro domains for functional prediction.

3. Pfam is available at four locations around the world. Which of the following is not one of them?

a) UK

b) Sweden

c) US

d) Japan

Answer: d

Explanation: Pfam is available at four locations around the world, each providing a core set of functionality for accessing each family. They are the US, the UK, Sweden and France. Documentation on the content and use of Pfam is available via the web.

4. Which of the following is not a member database of InterPro?

a) SCOP

b) HAMAP

c) PANTHER

d) Pfam

Answer: a

Explanation: The signatures from InterPro come from 11 member databases viz. CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, TIGRFAMs.

5. Which of the following statements about SCOP is incorrect regarding its features?

a) Proteins with the same shapes but having little sequence or functional similarity are placed in different super families, and are assumed to have only a very distant common ancestor

b) Proteins having the same shape and some similarity of sequence and/or function are placed in ‘families’, and are assumed to have a closer common ancestor

c) SCOP was created in 1994 in the Centre for Protein Engineering and the University College London

d) It aims to determine the evolutionary relationship between proteins

Answer: c

Explanation: SCOP, Structural Classification of Proteins, was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England.

6. What is the source of protein structures in SCOP and CATH?

a) Uniprot

b) Protein Data Bank

c) Ensembl

d) InterPro

Answer: b

Explanation: The source of protein structures in SCOP and CATH is the PDB (Protein Data Bank), the primary archive of experimentally determined structures. SCOP and CATH are secondary databases that classify the structures deposited in the PDB. UniProt, by contrast, is a protein sequence database.

7. Which of the following statements about SUPERFAMILY database is incorrect regarding its features?

a) Sequences can be submitted in raw or FASTA format

b) Sequences must be submitted in FASTA format only

c) It searches the database using a superfamily, family, or species name plus a sequence, SCOP, PDB or HMM IDs

d) It has generated GO annotations for evolutionarily closed domains and distant domains

Answer: b

Explanation: SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP super families. Sequences can be submitted in raw or FASTA format, not in FASTA format only. They can be amino acids, a fixed frame nucleotide sequence, or all frames of a submitted nucleotide sequence. Up to 1000 sequences can be run at a time.

8. Which of the following statements about PRINTS and ProDom databases is incorrect regarding its features?

a) PRINTS is a compendium of protein fingerprints

b) Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space

c) Current versions of ProDom are built using a novel procedure based on recursive BLAST searches

d) ProDom domain database consists of an automatic compilation of homologous domains

Answer: c

Explanation: Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches and not just BLAST searches. And PRINTS is indeed a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt.

9. Which of the following statements about CATH-Gene3D and HAMAP databases is incorrect regarding its features?

a) CATH-Gene3D describes protein families and domain architectures in complete genomes

b) In CATH-Gene3D the functional annotation is provided to proteins from single resource

c) HAMAP profiles are manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies

d) HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes

Answer: b

Explanation: In CATH-Gene3D, protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. Functional annotation is provided to proteins from multiple resources. Functional prediction and analysis of domain architectures is available at the website.

10. Which of the following statements about PANTHER and TIGRFAMs databases is incorrect regarding its features?

a) TIGRFAMs provides a tool for identifying functionally related proteins based on sequence homology

b) TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation

c) Hidden Markov models (HMMs) are not used in PANTHER

d) PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise

Answer: c

Explanation: In PANTHER, the subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (human-curated molecular function and biological process classifications and pathway diagrams), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences.

Global Sequence Alignment

1. When did Needleman and Wunsch first describe the algorithm for global alignment?

a) 1899

b) 1970

c) 1930

d) 1950

Answer: b

Explanation: Needleman and Wunsch were among the first to describe a dynamic programming algorithm for global sequence alignment. In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences.

2. Which of the following does not describe dynamic programming?

a) The approach compares every pair of characters in the two sequences and generates an alignment, which is the best or optimal

b) Global alignment algorithm is based on this method

c) Local alignment algorithm is based on this method

d) The method can be useful in aligning protein sequences to protein sequences only

Answer: d

Explanation: The method can also be useful in aligning nucleotide sequences to protein sequences, so it is not restricted to protein-to-protein alignment. These programs first perform pairwise alignment on each pair of sequences. Then, they perform local rearrangements on these results in order to optimize overlaps between multiple sequences.

3. Which of the following is not an advantage of Needleman-Wunsch algorithm?

a) New algorithmic improvements as well as increasing computer capacity make it possible to align a query sequence against a large DB in a few minutes

b) Similar sequence region is of same order and orientation

c) This does not help in determining evolutionary relationship

d) If you have 2 genes that are already understood as closely related, then this type of algorithm can be used to understand them in further detail

Answer: c

Explanation: The Needleman-Wunsch algorithm is used when two genes are already understood to be closely related, and it can be used to understand them in further detail. This is quite helpful in finding orthologs, paralogs and homologs in evolutionary studies.

4. Which of the following is not a disadvantage of Needleman-Wunsch algorithm?

a) This method is comparatively slow

b) There is a need of intensive memory

c) This cannot be applied on genome sized sequences

d) This method can be applied to even large sized sequences

Answer: d

Explanation: This method cannot be applied to genome-sized sequences. However, it is indeed useful in determining similarities and evolutionary relationships.

5. Which of the following does not describe global alignment algorithm?

a) In initialization step, the first row and first column are subject to gap penalty

b) Score can be negative

c) In trace back step, beginning is with the cell at the lower right of the matrix and it ends at top left cell

d) First row and first column are set to zero

Answer: d

Explanation: The initialization and scoring systems of the Smith–Waterman and Needleman-Wunsch algorithms are quite different. In global alignment, the first row and first column are subject to the gap penalty and are not set to 0.
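The difference in initialization can be shown side by side. This is a minimal sketch, assuming a linear gap penalty of -4 (a value chosen for illustration, not taken from the questions); the function names are made up.

```python
def init_global(m, n, gap=-4):
    """Needleman-Wunsch: the first row and column accumulate the gap penalty."""
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        matrix[i][0] = i * gap   # i leading gaps in the second sequence
    for j in range(1, n + 1):
        matrix[0][j] = j * gap   # j leading gaps in the first sequence
    return matrix

def init_local(m, n):
    """Smith-Waterman: the first row and column stay at zero."""
    return [[0] * (n + 1) for _ in range(m + 1)]

global_matrix = init_global(2, 3)
local_matrix = init_local(2, 3)
```

In the global case the top row runs 0, -4, -8, -12, so an alignment that starts with gaps pays for them. In the local case every border cell is 0, which lets an alignment start anywhere at no cost.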

6. Which of the following does not describe PAM matrices?

a) These matrices are used in optimal alignment scoring

b) It stands for Point Altered Mutations

c) It stands for Point Accepted Mutations

d) It was first developed by Margaret Dayhoff

Answer: b

Explanation: PAM stands for Point Accepted Mutations. PAM matrices are calculated by observing the differences in closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per 100 amino acid residues, i.e. 1% change and 99% remains as such.

7. Which of the following is untrue regarding the scoring system used in dynamic programming?

a) If the residues are same in both the sequences the match score is assumed as +5 which is added to the diagonally positioned cell of the current cell

b) If the residues are not same, the mismatch score is assumed as -3

c) If the residues are not same, the mismatch score is assumed as 3

d) The score should be added to the diagonally positioned cell of the current cell View Answer

Answer: c

Explanation: If the residues are not the same, the mismatch score is assumed to be -3, and it has to be negative. These scores are not fixed and can be user defined, but the mismatch score and gap penalty should be negative values.

8. Which of the following does not describe global alignment algorithm?

a) Score can be negative in this method

b) It is based on dynamic programming technique

c) For two sequences of length m and n, the matrix to be defined should be of dimensions m+1 and n+1

d) For two sequences of length m and n, the matrix to be defined should be of dimensions m and n

View Answer

Answer: d

Explanation: For two sequences of length m and n, the matrix should have dimensions m+1 and n+1, so that there is a margin for adding the score along the diagonal. The corresponding score is then accumulated cell by cell to the end of the matrix.

9. Which of the following does not describe global alignment algorithm?

a) It attempts to align every residue in every sequence

b) It is most useful when the aligning sequences are similar and of roughly the same size

c) It is useful when the aligning sequences are dissimilar

d) It can use Needleman-Wunsch algorithm View Answer

Answer: c

Explanation: Performing global alignment is most useful when the aligning sequences are similar and of roughly the same size. This is most useful to find the similarities among the organisms that are roughly connected on the timeline.

10. Which of the following is wrong in case of substitution matrices?

a) They determine likelihood of homology between two sequences

b) They use system where substitutions that are more likely should get a higher score

c) They use system where substitutions that are less likely should get a lower score

d) BLOSUM-X type uses logarithmic identity to find similarity View Answer

Answer: d

Explanation: The X in BLOSUM-X refers to a percentage identity rather than a logarithmic identity: the matrix is derived from blocks of aligned sequences clustered at X% identity, e.g. BLOSUM62 uses a 62% threshold. These matrices are popular in bioinformatics due to their speed and accuracy.

Local Sequence Alignment

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Local Sequence Alignment”.

1. When did Smith–Waterman first describe the algorithm for local alignment?

a) 1950

b) 1970

c) 1981

d) 1925

View Answer

Answer: c

Explanation: The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two nucleic acid or protein sequences.

2. Which of the following does not describe local alignment?

a) A local alignment aligns a substring of the query sequence to a substring of the target sequence

b) A local alignment is defined by maximizing the alignment score, so that deleting a column from either end would reduce the score, and adding further columns at either end would also reduce the score

c) Local alignments have terminal gaps

d) The substrings to be examined may be all of one or both sequences; if all of both are included then the local alignment is also global

View Answer

Answer: c

Explanation: Local alignments never have terminal gaps, because a higher score could be obtained by deleting the gaps (which always have negative scores, i.e. penalties). In case of global alignment there are terminal gaps while analyzing.

3. Which of the following does not describe local alignment algorithm?

a) Score can be negative

b) Negative score is set to 0

c) First row and first column are set to 0 in initialization step

d) In traceback step, beginning is with the highest score, it ends when 0 is encountered View Answer

Answer: a

Explanation: The score cannot be negative in local alignment. When any element has a score lower than zero, it means that the sequences up to this position have no similarities; this element is then set to zero to eliminate the influence of the previous alignment. In this way, the calculation can continue to find an alignment at any position afterwards.
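The zero floor and the traceback rule from the options above can be sketched as follows. This is a simplified Smith–Waterman with a linear gap penalty; the function name and the scoring values (+5, -3, -4) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=5, mismatch=-3, gap=-4):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Any negative value is replaced by 0, so a new alignment can start here.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback: start at the highest-scoring cell and stop when a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Only the shared region is reported, and the returned alignment has no terminal gaps because traceback ends as soon as the score drops to zero.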

4. Local alignments are more used when

a) There are totally similar and equal length sequences

b) Dissimilar sequences are suspected to contain regions of similarity

c) Similar sequence motif with larger sequence context

d) Partially similar, different length and conserved region containing sequences View Answer

Answer: a

Explanation: Totally similar, equal-length sequences are better suited to global alignment, which attempts to align the entire length of both sequences, unlike local alignment, where partially similar sequences are analyzed.

5. Which of the following does not describe BLOSUM matrices?

a) It stands for BLOcks SUbstitution Matrix

b) It was developed by Henikoff and Henikoff

c) The year it was developed was 1992

d) These matrices are logarithmic identity values View Answer

Answer: d

Explanation: The number in a BLOSUM matrix name is a percentage identity, not a logarithmic identity value. BLOSUM62, for example, is derived from blocks of sequences clustered at 62% identity.

6. Which of the following is untrue regarding the gap penalty used in dynamic programming?

a) Gap penalty is subtracted for each gap that has been introduced

b) Gap penalty is added for each gap that has been introduced

c) The gap score defines a penalty given to alignment when we have insertion or deletion

d) Gap open and gap extension has been introduced when there are continuous gaps (five or more)

View Answer

Answer: b

Explanation: Dynamic programming algorithms use gap penalties to maximize the biological meaning of an alignment. The gap open penalty is always applied at the start of a gap, and each position that extends the gap is given a gap extension penalty, which is smaller than the open penalty. Typical values are –12 for gap opening and –4 for gap extension.

7. Among the following which one is not the approach to the local alignment?

a) Smith–Waterman algorithm

b) K-tuple method

c) Words method

d) Needleman-Wunsch algorithm View Answer

Answer: d

Explanation: Local alignment follows two broad approaches: the Smith–Waterman algorithm, and word methods, also known as k-tuple methods, which are implemented in the well-known FASTA and BLAST families of programs.

8. Which of the following does not describe k-tuple methods?

a) k-tuple methods are best known for their implementation in the database search tools FASTA and the BLAST family

b) They are also known as words methods

c) They are basically heuristic methods to find local alignment

d) They are useful in small scale databases View Answer

Answer: d

Explanation: k-tuple or word methods are especially useful in large-scale database searches where a large proportion of stored sequences will have essentially no significant match with the query sequence. They are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than Smith-Waterman algorithm.
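The seeding idea behind word methods can be sketched in a few lines: index every k-letter word of the target, then look up each word of the query. This is only the word lookup stage, under the assumption of exact word matches; the function names are hypothetical, and the real FASTA and BLAST programs add scoring, extension and statistics on top of it.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map each k-letter word to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def shared_word_hits(query, target, k=3):
    """Return (query position, target position) pairs of identical words."""
    target_index = index_words(target, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for t_pos in target_index.get(word, []):
            hits.append((q_pos, t_pos))
    return hits


print(shared_word_hits("ACGTT", "TTACGA"))  # the shared word is "ACG"
print(shared_word_hits("AAAA", "CCCC"))     # no shared word, so no seed
```

An empty hit list means there is nothing to extend, which is why a word method can miss an alignment that an exhaustive dynamic programming search would still report.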

9. Which of the following does not describe BLAST?

a) It stands for Basic Local Alignment Search Tool

b) It uses word matching like FASTA

c) It is one of the tools of the NCBI

d) Even if no words are similar, there is an alignment to be considered View Answer

Answer: d

Explanation: If no words are similar, there is no alignment, i.e. it will not find matches for very short sequences. It is, however, considerably accurate compared to other tools and hence quite popular.

10. Which of the following is untrue regarding BLAST and FASTA?

a) FASTA is faster than BLAST

b) FASTA is the most accurate

c) BLAST has limited choices of databases

d) FASTA is more sensitive for DNA-DNA comparisons View Answer

Answer: a

Explanation: BLAST is faster than FASTA and most other tools. The speed and relatively good accuracy of BLAST are the key reasons why it is the most popular bioinformatics search tool.

Motif Discovery in Unaligned Sequences

This set of Bioinformatics Interview Questions and Answers for freshers focuses on “Motif Discovery in Unaligned Sequences”.

1. For what type of sequences is Gibbs sampling used?

a) Closely related sequences

b) Distantly related sequences

c) Distantly related sequences that share common motifs

d) Closely related sequences that share common motifs View Answer

Answer: c

Explanation: Often, distantly related sequences that share common motifs cannot be readily aligned. For example, the sequences for the helix-turn-helix motif in transcription factors can be subtly different enough that traditional multiple sequence alignment approaches fail to generate a satisfactory answer. For detecting such subtle motifs, more sophisticated algorithms such as expectation maximization (EM) and Gibbs sampling are used.

2. Which of the following is untrue about Expectation Maximization (EM) method?

a) It is used to find hidden motifs

b) The method works by first making a random or guessed alignment of the sequences to generate a trial PSSM

c) The trial PSSM is used to compare with each sequence individually

d) The log odds scores of the PSSM are modified at the end of the process View Answer

Answer: d

Explanation: The log odds scores of the PSSM are modified in each iteration to maximize the alignment of the matrix to each sequence. During the iterations, the sequence pattern for the conserved motifs is gradually “recruited” to the PSSM.
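One round of the idea above can be sketched as: build a trial PSSM of log odds scores from a guessed set of motif segments, then compare it with a sequence to find the best-matching window. This is not the full EM procedure; the uniform background frequencies, the pseudocount of 1 and the function names are all assumptions made for illustration.

```python
import math


def build_pssm(segments, alphabet="ACGT", pseudocount=1.0):
    """Log odds PSSM (base 2) from equal-length segments, uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(segments[0])):
        counts = {ch: pseudocount for ch in alphabet}
        for seg in segments:
            counts[seg[col]] += 1
        total = sum(counts.values())
        pssm.append({ch: math.log2((counts[ch] / total) / background)
                     for ch in alphabet})
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the window that best matches the PSSM."""
    width = len(pssm)
    scored = [(sum(pssm[i][sequence[start + i]] for i in range(width)), start)
              for start in range(len(sequence) - width + 1)]
    return max(scored)


trial = build_pssm(["TATA", "TATA", "TACA"])   # guessed initial alignment
print(best_window(trial, "GGTATAGG"))          # best match starts at index 2
```

In an iterative scheme, the best windows found in each sequence would be used to rebuild the PSSM, and the cycle would repeat until the scores stop changing.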

3. Which of the following is true about Expectation Maximization (EM) method?

a) The log odds scores of the PSSM are modified at the end of the process

b) The procedure stops prematurely if the scores reach convergence

c) The final result is not sensitive to the initial alignment

d) Local optimum is an advantage of EM method View Answer

Answer: b

Explanation: The final result is sensitive to the initial alignment. The local optimum is a drawback of the EM method: the procedure stops prematurely if the scores reach convergence at a local optimum.

4. MEME stands for

a) Multiple Expectation Maximization for Motif Elicitation

b) Multiple Expectation Maximization for Motif Extraction

c) Mega Expectation Maximization for Motif Elicitation

d) Micro Expectation Maximization for Motif Extraction View Answer

Answer: a

Explanation: Multiple Expectation Maximization for Motif Elicitation (MEME) is a web-based program that uses the EM algorithm to find motifs in either DNA or protein sequences. It uses a modified EM algorithm to avoid the local minimum problem.

5. In the web-based program MEME, the computation is a ______ step procedure.

a) one

b) two

c) three

d) four

View Answer

Answer: b

Explanation: In constructing a probability matrix, it allows multiple starting alignments and does not assume that there are motifs in every sequence. Also, the computation is a two-step procedure which includes generation of sequence motif and finding highest score.

6. Gibbs is a web-based program that uses the Gibbs sampling approach to look for ______ gap-free segments for either DNA or protein sequences.

a) short, partially conserved

b) long, partially conserved

c) long, conserved

d) short, not conserved View Answer

Answer: a

Explanation: Gibbs uses the Gibbs sampling approach to look for short, partially conserved gap-free segments in either DNA or protein sequences. To ensure accuracy, more than twenty sequences of exactly the same length should be used.

7. A multiple sequence alignment or a motif is often represented by a graphic representation called a ______

a) logo

b) motto

c) algorithm

d) algo View Answer

Answer: a

Explanation: In a logo, each position consists of stacked letters representing the residues appearing in a particular column of a multiple alignment. This graphic representation is called a logo.

8. The overall height of a logo position reflects how conserved the position is, and the ______ of each letter in a position reflects the ______ of the residue in the alignment.

a) height, relative frequency

b) width, relative frequency

c) height, amplitude

d) width, amplitude View Answer

Answer: a

Explanation: The overall height expresses the extent of conservation of the position, and the height of each letter shows the relative frequency of that particular residue. Amplitude is an irrelevant option in this case.
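The relationship between conservation and letter height can be made concrete with a small sketch. It uses the common information-content convention (maximum entropy minus observed entropy, in bits) and omits any small-sample correction; the function name and the DNA alphabet size of 4 are assumptions.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=4):
    """Return (overall stack height in bits, height of each letter)."""
    counts = Counter(column)
    total = len(column)
    freqs = {res: n / total for res, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy   # overall height
    # Each letter's height is its relative frequency times the overall height.
    return information, {res: p * information for res, p in freqs.items()}


print(logo_heights("AAAA"))   # fully conserved column: one tall letter
print(logo_heights("AAAC"))   # mostly A: a shorter stack, A taller than C
print(logo_heights("ACGT"))   # no conservation: the stack height is zero
```

A conserved column therefore shows few residues drawn large, while a variable column shows many residues drawn small.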

9. Conserved positions have ______ residues and bigger symbols.

a) fewer

b) more

c) maximum

d) minimum View Answer

Answer: a

Explanation: Conserved positions have fewer residues and bigger symbols, whereas less conserved positions have a more heterogeneous mixture of smaller symbols stacked together. The options maximum and minimum do not fit this comparison. In general, a sequence logo provides a clearer description of a consensus sequence.

10. ______ is an interactive program for generating sequence logos.

a) EMBOSS

b) WebLogo

c) LOGOLY

d) BLAST View Answer

Answer: b

Explanation: In WebLogo, a user needs to enter the sequence alignment in FASTA format to allow the program to compute the logos. A graphic file is returned to the user as a result.

Dot Matrix Sequence Comparison

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Dot Matrix Sequence Comparison”.

1. Which of the following is not a software for dot plot analysis?

a) SIMMI

b) DOTLET

c) DOTMATCHER

d) LALIGN View Answer

Answer: a

Explanation: Various programs are currently available for dot plot interpretation. Among these, SIM is used for this kind of alignment through the dot-plot method; SIMMI is a wrong form of that name.

2. Programs for dot plot analysis perform several tasks. Which one of the following is not performed by them?

a) Gap open penalty

b) Gap extend penalty

c) Expectation threshold

d) Change or mutate residues View Answer

Answer: d

Explanation: The gap penalties mentioned above are used to determine the score of the aligned sequences. Changing residues is not part of dot plot analysis, as a number of other programs exist for that purpose, and the main objective is to find the score of the alignment.

3. For palindromic sequences, what is the structure of the dot plot?

a) 2 intersecting diagonal lines at the midpoint

b) one diagonal

c) Two parallel diagonals

d) No diagonal View Answer

Answer: a

Explanation: For perfectly aligned sequences, the dot plot forms a diagonal. For palindromic sequences, i.e. sequences that are symmetrical about their midpoint, there are 2 intersecting diagonals on the plot.

4. For significantly aligning sequences what is the resulting structure on the plot?

a) Intercrossing lines

b) Crosses everywhere

c) Vertical lines

d) A diagonal and lines parallel to diagonal View Answer

Answer: d

Explanation: If the sequences align, a significantly bold diagonal is visible on the plot. If the alignment is a bit imperfect, the diagonal is also broken up to an extent and forms small lines parallel to it.

5. When was this method first described?

a) 1959

b) 1966

c) 1970

d) 1982

View Answer

Answer: c

Explanation: This method was first described in 1970. Briefly, this method involves constructing a matrix with one of the sequences to be compared running horizontally across the bottom, and the other running vertically along the left-hand side.
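The matrix construction described above can be sketched as a small text-based dot plot. The function name, the `window` and `threshold` parameters and the choice to print the first row at the top (rather than with one sequence along the bottom) are assumptions for illustration.

```python
def dot_matrix(seq_x, seq_y, window=1, threshold=1):
    """Return one string per seq_y position; '*' marks a matching window."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            # Count identical residues inside the window starting at (i, j).
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


for line in dot_matrix("GATTACA", "GATTACA"):
    print(line)
```

Comparing a sequence with itself gives an unbroken main diagonal, and the off-diagonal marks come from residues that recur elsewhere in the sequence. Raising `window` and `threshold` removes most of these isolated random matches.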

6. Who were the inventors of this method?

a) Smith-Waterman

b) Margaret Preston

c) Gibbs and McIntyre

d) Needleman-Wunsch View Answer

Answer: c

Explanation: The first computer-aided sequence comparison is called “dot-matrix analysis” or simply dot-plot. The first published account of this method is by Gibbs and McIntyre (1970, The diagram, a method for comparing sequences. Eur. J. Biochem. 16: 1-11).

7. Which of the following is true for EMBOSS Dottup?

a) Allows you to specify threshold

b) Doesn’t allow you to specify threshold

c) Doesn’t allow you to specify window size

d) If all cells in the window are identity, it colours in some specific cells in the window View Answer

Answer: b

Explanation: The EMBOSS Dottup doesn’t allow you to specify threshold but allows you to specify window size. Also, if all cells in the window are identity, it colors in all the cells in the window.

8. Isolated dots that are not on the diagonal represent exact matches.

a) True

b) False View Answer

Answer: b

Explanation: Those isolated dots represent random matches. The dots on the diagonal represent the perfect alignment and the dots with vertical and horizontal shifts show insertions and deletions.

9. Vertical frame shifts show ______ while the horizontal ones show ______

a) insertion, insertion

b) insertion, deletion

c) deletion, deletion

d) deletion, insertion View Answer

Answer: b

Explanation: Deletion and insertion of nucleotides is quite common in alignment process. The dot plot easily represents them with vertical and horizontal shifts. And the mutations are totally out of the diagonal zone.

10. Dot plot of repeating elements would be small crosses on plot.

a) True

b) False View Answer

Answer: b

Explanation: Repeating elements are represented as parallel lines in a repetitive manner. The better the repetition, the clearer the parallel lines. Intersecting lines, in contrast, indicate palindromic sequences.

Dynamic Programming Algorithm for Sequence Alignment

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Dynamic Programming Algorithm for Sequence Alignment”.

1. Use of the dynamic programming method requires a scoring system for the comparison of symbol pairs, and a scheme for GAP penalties.

a) True

b) False View Answer

Answer: a

Explanation: Once those parameters have been set, the resulting alignment for two sequences should always be the same. Hence, the use of the dynamic programming method requires a scoring system for the comparison of symbol pairs (nucleotides for DNA sequences and amino acids for protein sequences), and a scheme for insertion/deletion (GAP) penalties.

2. After the derivation, the outputs of dynamic programming are ratios called even scores.

a) True

b) False View Answer

Answer: b

Explanation: After the derivation, the outputs of dynamic programming are ratios called odds scores. The ratios are transformed to logarithms of odds scores, called log odds scores, so that the scores of sequential pairs may be added to reflect the overall odds of a real alignment versus a chance alignment. This is done in the Dayhoff PAM250 and BLOSUM62 matrices.

3. The matrices PAM250 and BLOSUM62 contain

a) positive and negative values

b) positive values only

c) negative values only

d) neither positive nor negative values, just the percentage View Answer

Answer: a

Explanation: These matrices contain positive and negative values, reflecting the likelihood of each amino acid substitution in related proteins. Using these tables, an alignment of a sequential set of amino acid pairs with no gaps receives an overall score that is the sum of the positive and negative log odds scores for each individual amino acid pair in the alignment.
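The summation described above can be sketched with a lookup table. The values in `TOY_SCORES` are hypothetical and are not taken from PAM250 or BLOSUM62; the function names are also assumptions.

```python
# Hypothetical log odds values for a three-letter toy alphabet;
# these are illustrative and are NOT taken from PAM250 or BLOSUM62.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("A", "L"): -1, ("A", "W"): -3, ("L", "W"): -2,
}


def pair_score(x, y, table=TOY_SCORES):
    """Look up a symmetric pair score, whichever order the pair is stored in."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def ungapped_alignment_score(seq_a, seq_b, table=TOY_SCORES):
    """Sum the positive and negative pair scores of a gap-free alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(x, y, table) for x, y in zip(seq_a, seq_b))


print(ungapped_alignment_score("ALW", "AWW"))  # 4 + (-2) + 11
```

Because the table holds log odds values, adding them corresponds to multiplying the underlying odds for each aligned pair.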

4. The higher is the score in the alignment,

a) the more significant is the alignment

b) the less it resembles alignments in related proteins

c) the less significant is the alignment

d) the less it aligns with the related protein sequence View Answer

Answer: a

Explanation: In the scoring system, the higher this score, the more significant is the alignment, or the more it resembles alignments in related proteins. Also, the score given for gaps in aligned sequences is negative, because such misaligned regions should be uncommon in sequences of related proteins. Such a score will reduce the score obtained from an adjacent, matching region upstream in the sequences.

5. Gaps are added to the alignment because it

a) increases the matching of identical amino acids at subsequent portions in the alignment

b) increases the matching of dissimilar amino acids at subsequent portions in the alignment

c) reduces the overall score

d) enhances the area of the sequences View Answer

Answer: a

Explanation: In alignment process, gaps are added to the alignment in a manner that increases the matching of identical or similar amino acids at subsequent portions in the alignment. Ideally, when two similar protein sequences are aligned, the alignment should have long regions of identical or related amino acid pairs and very few gaps. As the sequences become more distant, more mismatched amino acid pairs and gaps should appear.

6. Which of the following is not a description of dynamic programming algorithm?

a) A method of sequence alignment

b) A method that can take gaps into account

c) A method that requires a manageable number of comparisons

d) This method doesn’t provide an optimal (highest scoring) alignment View Answer

Answer: d

Explanation: The method of sequence alignment by dynamic programming provides an optimal (highest scoring) alignment as an output. The quality of the alignment between two sequences is calculated using a scoring system that favors the matching of related or identical amino acids and penalizes for poorly matched amino acids and gaps.

7. Which of the following is not a site on internet for alignment of sequence pairs?

a) BLASTX

b) BLASTN

c) SIM

d) BCM Search Launcher View Answer

Answer: a

Explanation: BLASTP is used under BLAST 2 sequence alignment. The BLAST algorithm, normally used for database similarity searches, can also be used to align two sequences. SIM is a local similarity program for finding alternative alignments.

8. Dayhoff PAM matrices are based on an evolutionary model of protein change, whereas BLOSUM matrices are designed to identify members of the same family.

a) True

b) False View Answer

Answer: a

Explanation: There are a very large number of amino acid scoring matrices in use, some much more popular than others, and these scoring matrices are designed for different purposes. Some, such as the Dayhoff PAM matrices, are based on an evolutionary model of protein change, whereas others, such as the BLOSUM matrices, are designed to identify members of the same family. Alignments between DNA sequences require similar kinds of considerations.

9. A feature of the dynamic programming algorithm is that the alignments obtained depend on the choice of a scoring system for comparing character pairs and penalty scores for gaps.

a) True

b) False View Answer

Answer: a

Explanation: For an algorithm, the output depends on the choice of a scoring system. For protein sequences, the simplest system of comparison is one based on identity. A match in an alignment is only scored if the two aligned amino acids are identical. However, one can also examine related protein sequences that can be aligned easily and find which amino acids are commonly substituted for each other.

10. Which of the following is untrue regarding dynamic programming algorithm?

a) The method compares every pair of characters in the two sequences and generates an alignment

b) The output alignment will include matched and mismatched characters and gaps in the two sequences that are positioned so that the number of matches between identical or related characters is the maximum possible

c) The dynamic programming algorithm provides a reliable computational method for aligning DNA and protein sequences

d) This doesn’t allow making evolutionary predictions on the basis of sequence alignments

View Answer

Answer: d

Explanation: Optimal alignments provide useful information to biologists concerning sequence relationships by giving the best possible information as to which characters in a sequence should be in the same column in an alignment, and which are insertions in one of the sequences (or deletions on the other). This information is important for making functional, structural, and evolutionary predictions on the basis of sequence alignments.

Use of Scoring Matrices and Gap Penalties in Sequence Alignments

This set of Bioinformatics Questions and Answers for Experienced people focuses on “Use of Scoring Matrices and Gap Penalties in Sequence Alignments”.

1. In scoring matrices, for convenience, odds scores are converted to log odds scores.

a) True

b) False View Answer

Answer: a

Explanation: The odds scores are converted to log odds scores so that the values for amino acid pairs in an alignment may be summed to obtain the log odds score of the alignment. In this case, the logarithms are calculated to the base 2 and multiplied by 2 to give values designated as half-bits (a bit is the unit of an odds score that has been converted to a logarithm to the base 2). A value of 4 indicates that the amino acid alignment is 2^(4/2) = 4, i.e. four-fold, more likely than expected by chance.
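The half-bit conversion above is simple arithmetic, sketched here in both directions. The function names are hypothetical; rounding to the nearest integer reflects how matrix entries are usually stored.

```python
import math


def half_bit_score(odds):
    """Convert an odds ratio to a log odds score in half-bits."""
    return round(2 * math.log2(odds))


def odds_from_half_bits(score):
    """Convert a half-bit score back to an odds ratio."""
    return 2 ** (score / 2)


print(half_bit_score(4))        # odds of 4 to 1 give a score of 4 half-bits
print(odds_from_half_bits(4))   # 2 ** (4 / 2) = 4.0, i.e. four-fold
# Adding log odds scores corresponds to multiplying the odds:
print(odds_from_half_bits(2 + 2) == odds_from_half_bits(2) * odds_from_half_bits(2))
```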

2. Which of the following doesn’t describe PAM matrices?

a) This family of matrices lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution

b) There is presently no other type of scoring matrix that is based on such sound evolutionary principles as are these matrices

c) Even though they were originally based on a relatively small data set, the PAM matrices remain a useful tool for sequence alignment

d) It stands for Percent Altered Mutation View Answer

Answer: d

Explanation: PAM stands for Percent Accepted Mutation. In this, each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreased sequence similarity as genes encoding the same protein diverge with increased evolutionary time.

3. The assumption in this evolutionary model is that the amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances.

a) True

b) False View Answer

Answer: a

Explanation: The BLOSUM matrices are based on scoring substitutions found over a range of evolutionary periods and reveal that substitutions are not always as predicted by the PAM model. The purpose of assumption in this evolutionary model is to make predictions.

4. Which of the following is untrue about the modification of PAM matrices?

a) At one time, the PAM250 scoring matrix was modified in an attempt to improve the alignment obtained

b) All scores for matching a particular amino acid were normalized to the same mean and standard deviation, and all amino acid identities were given the same score to provide an equal contribution for each amino acid in a sequence alignment

c) This took place in 1976

d) These modifications were included as the default matrices for the GCG sequence alignment programs in versions 8 and earlier and are optional in later versions View Answer

Answer: c

Explanation: This modification was made in 1986 by Gribskov and Burgess. However, the modified matrices are not recommended because they will not give an optimal alignment that is in accordance with the evolutionary model.

5. The Dayhoff model of protein evolution is not a Markov process.

a) True

b) False View Answer

Answer: b

Explanation: The Dayhoff Model of Protein Evolution as Used in PAM Matrices is a Markov process. In Analysis of the Dayhoff Model, each amino acid site in a protein can change at any time to any of the other 20 amino acids with probabilities given by the PAM table, and the changes that occur at each site are independent of the amino acids found at other sites in the protein and depend only on the current amino acid at the site.
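Because each step of a Markov process depends only on the current state, probabilities for longer evolutionary distances follow from repeated multiplication of the one-step matrix. The sketch below uses a hypothetical three-letter alphabet with about 1% change per step; the values in `toy_pam1` are invented for illustration and are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, steps):
    """Apply the one-step transition matrix `steps` times."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: each row sums to 1, with 99% unchanged.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the diagonal entries fall well below 0.99 while each row still sums to 1, which mirrors how a matrix for a long evolutionary distance is extrapolated from one for a short distance.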

6. Which of the following is true regarding the assumptions in the method of constructing the Dayhoff scoring matrix?

a) it is assumed that each amino acid position is equally mutable

b) it is assumed that each amino acid position is not equally mutable

c) it is assumed that each amino acid position is not mutable at all

d) sites do not vary in their degree of mutability View Answer

Answer: a

Explanation: In this process, first, it is assumed that each amino acid position is equally mutable, whereas, in fact, sites vary considerably in their degree of mutability. Mutagenesis hot spots are well known in molecular genetics, and variations in mutability of different amino acid sites in proteins are well known.

7. The more conserved amino acids in similar proteins from different species are ones that play an essential role in structure and function and the less conserved are in sites that can vary without having a significant effect on function.

a) True

b) False View Answer

Answer: a

Explanation: There are many factors that influence both the location and types of amino acid changes that occur in proteins. Wilbur (1985) has tested the Markov model of evolution and has shown that it can be valid if certain changes are made in the way that the PAM matrices are calculated.

8. A gap opening penalty for any gap (g) and a gap extension penalty for each element in the gap (r) are most often used, to give a total gap score wx, according to the equation

a) wx – rx = -g

b) wx = g – rx

c) wx = g + rx

d) wx + g + rx = 0 View Answer

Answer: c

Explanation: wx = g + rx is the equation, where x is the length of the gap. In some formulations of the gap penalty, the equation wx = g + r(x – 1) is used, so the gap extension penalty is not added to the gap opening penalty until the gap size is 2.
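Both formulations of the gap score can be compared with a short sketch. The function name and the `extend_from_second` flag are hypothetical; the default values of -12 for gap opening and -4 for gap extension are only example figures.

```python
def gap_score(x, g=-12, r=-4, extend_from_second=False):
    """Total score w_x for a gap of length x (x >= 1).

    extend_from_second=False uses w_x = g + r*x.
    extend_from_second=True  uses w_x = g + r*(x - 1).
    """
    if x < 1:
        raise ValueError("gap length must be at least 1")
    extensions = x - 1 if extend_from_second else x
    return g + r * extensions


for length in (1, 2, 3):
    print(length, gap_score(length), gap_score(length, extend_from_second=True))
```

Under the second formulation a gap of length 1 costs only the opening penalty, so the two schemes differ by exactly one extension penalty for every gap.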

9. In the GCG and FASTA program suites, the scoring matrix itself is formatted in a way that includes default

a) gap additions

b) alignment scores

c) score penalties

d) gap penalties View Answer

Answer: d

Explanation: These program suites include default gap penalties. When deciding gap penalties for local alignment programs, a consideration is that the penalties should be large enough to provide a local alignment of the sequences.

10. In case of the varying alignment, penalizing gaps heavily might occur. Then the best scoring local alignment between the sequences will be one that optimizes the score between matches and mismatches, without any gaps.

a) True

b) False View Answer

Answer: a

Explanation: If both mismatches and gaps are heavily penalized, the resulting alignment will also be a local alignment that contains the longest region of exact matches. In the above two cases, the alignment score of the highest-scoring local alignment will increase as the logarithm of the length of the sequences. Under these same conditions, the score of the corresponding global alignment between the sequences will be negative.

Assessing the Significance of Sequence Alignments

This set of Bioinformatics Interview Questions and Answers for Experienced people focuses on “Assessing the Significance of Sequence Alignments”.

1. Analysis of the alignment scores of random sequences reveals that the scores follow a distribution that differs from the normal distribution, called the

a) Gumbel equal value distribution

b) Gumbel extreme value distribution

c) Gumbel end value distribution

d) Gumbel distribution View Answer

Answer: b

Explanation: Originally, the significance of sequence alignment scores was evaluated on the basis of the assumption that alignment scores followed a normal statistical distribution. If sequences are randomly generated in a computer by a Monte Carlo or sequence shuffling method, as in generating a sequence by picking marbles representing four bases or 20 amino acids out of a bag, the distribution may look normal at first glance. But on further analysis the above result was obtained.
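The practical use of the extreme value distribution is to ask how likely a score at least as high as the observed one would be by chance. A minimal sketch of the Gumbel tail probability follows; the function name is hypothetical, and the location `u` and scale parameter `lam` are assumed values that would normally be estimated from scores of random or shuffled sequences.

```python
import math


def gumbel_tail_probability(x, u, lam):
    """P(S >= x) for an extreme value distribution with location u, scale lam."""
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))


# Hypothetical parameters, chosen only to show the shape of the tail.
for score in (10, 20, 30, 40):
    print(score, gumbel_tail_probability(score, u=10, lam=0.3))
```

The probability falls off much more slowly on the high-score side than a normal distribution with a similar centre would suggest, so assuming normality overstates the significance of a high alignment score.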

2. The statistical analysis of alignment scores is much better understood for ______ than for ______

a) global alignments, local alignments

b) local alignments, global alignments

c) global alignments, any other alignment method

d) Needleman-Wunsch alignment, Smith-Waterman alignment

Answer: b

Explanation: Smith-Waterman alignment algorithm and the scoring system used to produce a local alignment are designed to reveal regions of closely matching sequence with a positive alignment score. In random or unrelated sequence alignments, these regions are rarely found. Hence, their presence in real sequence alignments is significant, and the probability of their occurring by chance alignment of unrelated sequences can be readily calculated.
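
The local scoring idea described above can be illustrated with a minimal Python sketch of the Smith-Waterman recurrence. The function name and the scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions, not parameters taken from the text. The key point is that no cell score is allowed to fall below zero, so unrelated sequences give low scores while closely matching regions stand out with a positive score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Scoring values are illustrative assumptions; gaps use a simple
    linear penalty rather than an affine one.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 floor lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # shared region ACGT
print(smith_waterman_score("AAAA", "TTTT"))          # nothing in common
```

With these assumed parameters, the shared ACGT region alone determines the score of the first pair, and the second pair scores zero.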

3. When random or unrelated sequences are compared using a global alignment method, they can have ______, reflecting the tendency of the global algorithm to match as many characters as possible.

a) very low scores

b) very high scores

c) moderate scores

d) low scores

Answer: b

Explanation: The significance of the scores of global alignments is more difficult to determine. Using the Needleman-Wunsch algorithm and a suitable scoring system, there are many ways to produce a global alignment between any pair of sequences, and the scores of many different alignments may be quite similar. Hence, the scores obtained might be unusually high.

4. Which of the following is not related to the Needleman-Wunsch alignment algorithm?

a) Global alignment programs use this algorithm

b) The output is a positive number

c) Small changes in the scoring system can produce a different alignment

d) Changes in the scoring system can produce the same alignment

Answer: d

Explanation: In general, global alignment programs use the Needleman-Wunsch alignment algorithm and a scoring system that scores the average match of an aligned nucleotide or amino acid pair as a positive number. Hence, the score of the alignment of random or unrelated sequences grows proportionally to the length of the sequences. In addition, there are many possible different global alignments depending on the scoring system chosen, and small changes in the scoring system can produce a different alignment.
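
As a rough illustration of global scoring, the following Python sketch computes a Needleman-Wunsch score. The function name and the scoring values (match 1, mismatch -1, linear gap -1) are assumptions chosen for the example. Unlike a local method, every residue of both sequences must be accounted for, so the boundary cells carry gap penalties and scores are free to go negative.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b.

    Scoring values are illustrative assumptions with a linear gap penalty.
    """
    # First row: aligning a prefix of b against nothing costs one gap each.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
print(needleman_wunsch_score("AAAA", "TTTT"))
```

Changing the assumed `mismatch` or `gap` values and rerunning is a quick way to see how sensitive a global score is to the scoring system.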

5. Waterman, in 1989, provided a set of means and standard deviations of global alignment scores between random DNA sequences, using mismatch and gap penalties that produce a linear increase in score with ______, a distinguishing feature of global alignments.

a) alignment score

b) sequence score

c) sequence length

d) scoring system

Answer: c

Explanation: In the algorithm provided by Waterman, the score of the alignment of random or unrelated sequences grows proportionally to the length of the sequences. However, these values are of limited use because they are based on a simple gap scoring system.

6. Who suggested that the global alignment scores between unrelated protein sequences followed the extreme value distribution, similar to local alignment scores? And when?

a) Abagyan and Batalov, in 1981

b) Chvátal and Lipman, in 1984

c) Abagyan and Batalov, in 1997

d) Chvátal and Sankoff, in 1995

Answer: c

Explanation: Abagyan and Batalov, in 1997, suggested the above observation. However, since the scoring system that they used favored local alignments, the alignments they produced may have been local rather than global. Unfortunately, there is no equivalent theory on which to base an analysis of global alignment scores as there is for local alignment scores.

7. ______ analyzed the distribution of scores among 100 vertebrate nucleic acid sequences and compared these scores with randomized sequences prepared in different ways.

a) Lipman, in 1984

b) Batalov, in 1964

c) Waterman, in 1987

d) Lipman, in 1967

Answer: a

Explanation: When the randomized sequences were prepared by shuffling the sequence to conserve base composition, as was done by Dayhoff and others, the standard deviation was approximately one-third less than that of the distribution of scores of the natural sequences. Thus, natural sequences are more variable than randomized ones, and using such randomized sequences for a significance test may lead to an overestimation of the significance.

8. If the random sequences were prepared in a way that maintained the local base composition by producing them from overlapping fragments of sequence, the distribution of scores has a ______ standard deviation that is closer to the distribution of the natural sequences.

a) lowest

b) higher

c) lower

d) moderate

Answer: c

Explanation: The conclusion from the above is that the presence of conserved local patterns can influence the score in statistical tests such that an alignment can appear to be more significant than it actually is. Although this study was done using the Smith-Waterman algorithm with nucleic acids, the same cautionary note applies for other types of alignments.

9. The GCG alignment programs have a RANDOMIZATION option, which shuffles the second sequence and calculates similarity scores between the unshuffled sequence and each of the shuffled copies.

a) True

b) False

Answer: a

Explanation: If the new similarity scores are significantly smaller than the real alignment score, the alignment is considered significant. This analysis is only useful for providing a rough approximation of the significance of an alignment score and can easily be misleading.

10. Dayhoff, 1978-1983, devised a second method for testing the relatedness of two protein sequences that can accommodate some local variation. For which of the following is this method not useful?

a) For finding repeated regions within a sequence

b) For finding similar regions that are in a different order in two sequences

c) For finding small conserved region such as an active site

d) For finding huge regions within sequences

Answer: d

Explanation: As used in a computer program called RELATE (Dayhoff 1978), all possible segments of a given length of one sequence are compared with all segments of the same length from another. An alignment score using a scoring matrix is obtained for each comparison to give a score distribution among all of the segments. A segment comparison score in standard deviation units is calculated as the value for the real sequences minus the average value for random sequences, divided by the standard deviation of the scores from the random sequences.
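
The segment comparison described above can be sketched in a few lines of Python. This is not the RELATE program: the helper names are hypothetical, and a plain identity count stands in for the scoring matrix that RELATE uses. The sketch only shows the two ideas in the explanation, namely all-against-all comparison of fixed-length segments and conversion of a real score into standard deviation units.

```python
from statistics import mean, stdev


def segment_scores(seq_a, seq_b, length):
    """Score every segment of seq_a against every segment of seq_b.

    Assumption: the score is the number of identical positions; a real
    implementation would look up each pair in a scoring matrix.
    """
    scores = []
    for i in range(len(seq_a) - length + 1):
        for j in range(len(seq_b) - length + 1):
            seg_a = seq_a[i:i + length]
            seg_b = seq_b[j:j + length]
            scores.append(sum(x == y for x, y in zip(seg_a, seg_b)))
    return scores


def sd_units(real_value, random_values):
    """(real value - mean of random values) / SD of random values."""
    return (real_value - mean(random_values)) / stdev(random_values)


print(segment_scores("ABCD", "ABCE", 3))
print(sd_units(20, [8, 10, 12]))
```

In practice `random_values` would come from many comparisons of shuffled sequences, not from three made-up numbers as in this toy call.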

Sequence Homology versus Sequence Similarity and Identity

1. Which of the following is incorrect regarding pair wise sequence alignment?

a) The most fundamental process in this type of comparison is sequence alignment

b) It is an important first step toward structural and functional analysis of newly determined sequences.

c) This is the process by which sequences are compared by searching for common character patterns and establishing residue–residue correspondence among related sequences

d) It is the process of aligning multiple sequences

Answer: d

Explanation: Pair wise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching and multiple sequence alignment. As new biological sequences are being generated at exponential rates, sequence comparison is becoming increasingly important to draw functional and evolutionary inference of a new protein with proteins already existing in the database.

2. Which of the following is incorrect about evolution?

a) The macromolecules can be considered molecular fossils that encode the history of millions of years of evolution

b) The building blocks of these biological macromolecules, nucleotide bases and amino acids, form linear sequences that determine the primary structure of the molecules

c) DNA and proteins are products of evolution

d) The molecular sequences barely undergo changes

Answer: d

Explanation: During this time period, the molecular sequences undergo random changes, some of which are selected during the process of evolution. As the selected sequences gradually accumulate mutations and diverge over time, traces of evolution may still remain in certain portions of the sequences to allow identification of the common ancestry.

3. The presence of evolutionary traces is because some of the residues that perform key functional and structural roles tend to be preserved by natural selection; other residues that may be less crucial for structure and function tend to mutate more frequently.

a) True

b) False

Answer: a

Explanation: The residues that perform key functional and structural roles tend to be preserved by natural selection. For example, active site residues of an enzyme family tend to be conserved because they are responsible for catalytic functions. Therefore, by comparing sequences through alignment, patterns of conservation and variation can be identified.

4. The degree of sequence variation in the alignment reveals evolutionary relatedness of different sequences, whereas the conservation between sequences reflects the changes that have occurred during evolution in the form of substitutions, insertions, and deletions.

a) True

b) False

Answer: b

Explanation: The degree of sequence conservation in the alignment reveals evolutionary relatedness of different sequences, whereas the variation between sequences reflects the changes that have occurred during evolution in the form of substitutions, insertions, and deletions. Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences. When a sequence alignment reveals significant similarity among a group of sequences, they can be considered as belonging to the same family.

5. If the two sequences share significant similarity, it is extremely ______ that the extensive similarity between the two sequences has been acquired randomly, meaning that the two sequences must have derived from a common evolutionary origin.

a) unlikely

b) possible

c) likely

d) relevant

Answer: a

Explanation: Sequence alignment provides inference for the relatedness of two sequences under study. Regions that are aligned but not identical represent residue substitutions; regions where residues from one sequence correspond to nothing in the other represent insertions or deletions that have taken place on one of the sequences during evolution.

6. Sometimes, it is also possible that two sequences have derived from a common ancestor, but may have diverged to such an extent that the common ancestral relationships are not recognizable at the sequence level.

a) True

b) False

Answer: a

Explanation: There are examples of such paralogous genes that have distinct functions but similar origin. In that case, the distant evolutionary relationships have to be detected using other methods.

7. Which of the following is incorrect regarding sequence homology?

a) Two sequences can have a homologous relationship even if they do not have a common origin

b) It is an important concept in sequence analysis

c) When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship

d) When two sequences are descended from a common evolutionary origin, they are said to share homology

Answer: a

Explanation: Homologous relationships are more certain when the sequences have a common evolutionary origin. A related but different term is sequence similarity, which is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity.

8. Sequence similarity can be quantified using ______, whereas homology is a ______ statement.

a) percentages, quantitative

b) percentages, qualitative

c) ratios, qualitative

d) ratios, quantitative

Answer: b

Explanation: Similarity is a direct result of observation from the sequence alignment. For example, one may say that two sequences share 40% similarity. It is incorrect to say that the two sequences share 40% homology. They are either homologous or nonhomologous.

9. Shorter sequences require higher cutoffs for inferring homologous relationships than longer sequences.

a) True

b) False

Answer: a

Explanation: Consider determining a homology relationship between two protein sequences. If, for example, both sequences are aligned over their full length of 100 residues, an identity of 30% or higher can be safely regarded as indicating close homology. If their identity level falls between 20% and 30%, determination of homologous relationships becomes less certain.

10. Sequence similarity and sequence identity are synonymous for nucleotide sequences and protein sequences as well.

a) True

b) False

Answer: b

Explanation: Sequence similarity and sequence identity are synonymous for nucleotide sequences. For protein sequences, however, the two concepts are very different. In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences. Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other.
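
The distinction between identity and similarity for proteins can be made concrete with a short Python sketch. The residue groups below are a hypothetical, simplified grouping by physicochemical character chosen for illustration; real programs usually judge similarity from a substitution matrix. Percentages are taken over all alignment columns, which is also an assumption, since other denominators are in use.

```python
# Hypothetical groups of physicochemically similar residues (illustration only).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned
    sequences of equal length, where '-' marks a gap."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns count toward the length only
        if x == y:
            identical += 1
            similar += 1  # identical residues are also similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    columns = len(aligned_a)
    return 100 * identical / columns, 100 * similar / columns


print(identity_and_similarity("ILKD", "IVRE"))
```

In this toy alignment only the first column is identical, yet every column pairs residues from the same assumed group, so identity and similarity differ widely.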

Methods

1. The overall goal of pair wise sequence alignment is to find the best pairing of two sequences, such that there is maximum correspondence among residues.

a) True

b) False

Answer: a

Explanation: The goal of pair wise sequence alignment is to find the best pairing. To achieve this goal, one sequence needs to be shifted relative to the other to find the position where maximum matches are found. There are two different alignment strategies that are often used: global alignment and local alignment.

2. In local alignment, the two sequences to be aligned cannot be of unequal lengths.

a) True

b) False

Answer: b

Explanation: The two sequences to be aligned can be of different lengths. This approach is more appropriate for aligning divergent biological sequences containing only modules that are similar, which are referred to as domains or motifs. This approach can be used for aligning more divergent sequences with the goal of searching for conserved patterns in DNA or protein sequences.

3. Alignment algorithms, both global and local, are fundamentally similar and only differ in the optimization strategy used in aligning similar residues.

a) True

b) False

Answer: a

Explanation: Both types of algorithms can be based on one of the three methods: the dot matrix method, the dynamic programming method, and the word method. The word method is used in fast database similarity searching.

4. In a dot matrix, two sequences to be compared are written in the ______ of the matrix.

a) horizontal and vertical axes

b) 2 parallel horizontal axes

c) 2 parallel vertical axes

d) horizontal axis (one preceding another)

Answer: a

Explanation: The comparison is done by scanning each residue of one sequence for similarity with all residues in the other sequence. If a residue match is found, a dot is placed within the graph. Otherwise, the matrix positions are left blank.
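
A minimal Python sketch of this comparison is shown below. The function names are illustrative; the matrix is simply a grid of booleans with one sequence along the horizontal axis and the other along the vertical axis, and a dot is recorded wherever two residues match.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y (vertical axis), columns follow seq_x (horizontal axis).
    A cell is True where the two residues are identical."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(matrix):
    """Draw matches as '*' and leave non-matching positions blank."""
    return "\n".join("".join("*" if cell else " " for cell in row) for row in matrix)


print(render(dot_matrix("GATTACA", "GATTACA")))
```

Plotting a sequence against itself, as in the usage line, gives a full main diagonal plus scattered off-diagonal dots from residues that recur by chance.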

5. When the two sequences have substantial regions of similarity, many dots line up to form contiguous ______ lines.

a) crossings on

b) horizontal

c) diagonal

d) vertical

Answer: c

Explanation: The dots line up to form contiguous diagonal lines, which reveal the sequence alignment. If there are interruptions in the middle of a diagonal line, they indicate insertions or deletions. Parallel diagonal lines within the matrix represent repetitive regions of the sequences.

6. A problem exists when comparing ______ sequences using the dot matrix method, namely, the ______.

a) small, amplification

b) large, amplification

c) small, high noise level

d) large, high noise level

Answer: d

Explanation: In most dot plots, dots are plotted all over the graph obscuring identification of the true alignment. For DNA sequences, the problem is particularly acute because there are only four possible characters in DNA and each residue therefore has a one-in-four chance of matching a residue in another sequence.

7. If the selected window size is too long, sensitivity of the alignment is lost.

a) True

b) False

Answer: a

Explanation: Dots are only placed when a stretch of residues equal to the window size from one sequence matches completely with a stretch of another sequence. This method has been shown to be effective in reducing the noise level. The window is also called a tuple, the size of which can be manipulated so that a clear pattern of sequence match can be plotted. However, if the selected window size is too long, sensitivity of the alignment is lost.
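
The window filter can be sketched in Python as follows. The function name is hypothetical, and the sketch requires an exact match over the whole window, as the explanation describes; it returns the coordinates of the dots rather than drawing them.

```python
def windowed_dots(seq_x, seq_y, window):
    """Return (row, column) positions where a stretch of `window` residues
    starting at seq_y[row] exactly matches the stretch starting at seq_x[column]."""
    dots = []
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            if seq_y[i:i + window] == seq_x[j:j + window]:
                dots.append((i, j))
    return dots


seq = "ACGTACGT"
print(len(windowed_dots(seq, seq, 1)))  # every single-residue match is a dot
print(len(windowed_dots(seq, seq, 4)))  # only 4-residue exact matches remain
```

Raising the window from 1 to 4 removes most chance matches, while the dots for the repeated ACGT unit remain. A window longer than the true matching stretch would remove those as well, which is the loss of sensitivity mentioned above.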

8. A sequence can be aligned with itself to identify internal repeat elements.

a) True

b) False

Answer: a

Explanation: In the self comparison, there is a main diagonal for perfect matching of each residue. If repeats are present, short parallel lines are observed above and below the main diagonal.

9. Self complementarity of DNA sequences cannot be identified using a dot plot.

a) True

b) False

Answer: b

Explanation: Self complementarity of DNA sequences, also called inverted repeats, can be identified using a dot plot; an example is the sequences that form the stems of a hairpin structure. In this case, a DNA sequence is compared with its reverse-complemented sequence. Parallel diagonals represent the inverted repeats.
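
A small Python sketch of this comparison follows. The function names and the example sequence are made up for illustration. The sequence is compared with its own reverse complement using an exact-match window, and each reported position marks a stretch that could pair with another part of the same strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def self_complementary_dots(dna, window):
    """Positions (i, j) where dna[i:i+window] equals the stretch of the
    reverse complement starting at j."""
    rc = reverse_complement(dna)
    return [
        (i, j)
        for i in range(len(dna) - window + 1)
        for j in range(len(rc) - window + 1)
        if dna[i:i + window] == rc[j:j + window]
    ]


hairpin = "GGGCAAAAGCCC"  # hypothetical stem-loop: GGGC ... GCCC
print(reverse_complement(hairpin))
print(self_complementary_dots(hairpin, 4))
```

For this toy hairpin, the two reported positions correspond to the two arms of the stem, GGGC and GCCC, which are reverse complements of each other.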

10. Which of the following is untrue about dot plot method and its applications?

a) This method gives a direct visual statement of the relationship between two sequences

b) One of its advantages is identification of sequence repeat regions based on the presence of parallel diagonals of the same size vertically or horizontally in the matrix

c) It is not useful in identifying chromosomal repeats

d) The method can be used in identifying nucleic acid secondary structures through detecting self-complementarity of a sequence

Answer: c

Explanation: It is useful in identifying chromosomal repeats and in comparing gene order conservation between two closely related genomes. The dot matrix method gives a direct visual statement of the relationship between two sequences and helps easy identification of the regions of greatest similarities. The method thus has some applications in genomics.

Statistical Significance of Sequence Alignment

1. The truly statistically significant sequence alignment will be able to provide evidence of homology between the sequences involved.

a) True

b) False

Answer: a

Explanation: When given a sequence alignment showing a certain degree of similarity, it is often important to determine whether the observed sequence alignment can occur by random chance or the alignment is indeed statistically sound. When a statistically significant sequence alignment is under consideration, it will be able to provide evidence of homology between the sequences involved.

2. By calculating alignment scores of a large number of ______ sequence pairs, a distribution model of the ______ sequence scores can be derived.

a) related, randomized

b) unrelated, randomized

c) unrelated, unrandomized

d) related, unrandomized

Answer: b

Explanation: Solving the statistical significance problem requires a statistical test of the alignment scores of two unrelated sequences of the same length. From the distribution, a statistical test can be performed based on the number of standard deviations from the average score.

3. Many studies have demonstrated that the distribution of similarity scores assumes a peculiar shape that resembles a highly skewed normal distribution with a long tail on one side. The distribution matches the ______.

a) Gumbel elective value distribution

b) Gumbel extreme void distribution

c) Gumbel end value distribution

d) Gumbel extreme value distribution

Answer: d

Explanation: The distribution pattern mentioned matches the Gumbel extreme value distribution, for which a mathematical expression is available. This means that, given a sequence similarity value, the statistical significance can be accurately estimated by using the mathematical formula for the extreme value distribution.
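
For reference, one common way of writing that expression is given below. The symbols are assumptions introduced here, not notation from the text: S is the alignment score, x the observed score, u the location (mode) of the distribution, and λ its scale parameter.

```latex
P(S \ge x) = 1 - \exp\!\left(-e^{-\lambda (x - u)}\right)
```

The farther x lies above u, the smaller this probability becomes, which matches the long right-hand tail described in the question.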

4. Which of the following is a part of the statistical test of sequences?

a) An optimal alignment between two chosen sequences is obtained at the end

b) Unrelated sequences of the same length are then generated through a randomization process

c) Unrelated sequences of different lengths are then generated through a randomization process

d) Related sequences of the same length are then generated through a randomization process

Answer: b

Explanation: Unrelated sequences of the same length are then generated through a randomization process in which one of the two sequences is randomly shuffled. And the next step is that a new alignment score is computed for the shuffled sequence pair.

5. The statistical test involves a randomization process in which one of the two given sequences is randomly shuffled.

a) True

b) False

Answer: a

Explanation: After the mentioned step, the alignment score for the shuffled sequence pair is computed. More such scores are then obtained through repeated shuffling.

6. What is used to generate parameters for the extreme value distribution?

a) The pool of alignment scores from the shuffled sequences

b) A single score of a shuffled sequence

c) The pool of alignment scores from the unshuffled sequences

d) The basic optimal score computed at the beginning of the test

Answer: a

Explanation: Maximum scores are obtained through repeated shuffling. Then the pool of alignment scores from the shuffled sequences is used to generate parameters for the extreme distribution. The original alignment score is then compared against the distribution of random alignments to determine whether the score is beyond random chance.
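
The whole test can be sketched in Python under several stated assumptions. The local scorer uses illustrative values (match 2, mismatch -1, linear gap -2). The extreme value parameters are estimated from the mean and standard deviation of the shuffled scores by the method of moments, which is a simple stand-in and not necessarily how any particular program fits the distribution. All function and variable names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (illustrative scoring, linear gaps)."""
    prev, best = [0] * (len(b) + 1), 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_test(seq_a, seq_b, n_shuffles=200, seed=0):
    """Return (real score, shuffled scores, estimated P-value)."""
    rng = random.Random(seed)
    real = sw_score(seq_a, seq_b)
    residues = list(seq_b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys order
        scores.append(sw_score(seq_a, "".join(residues)))
    sigma = stdev(scores)
    if sigma == 0:
        raise ValueError("shuffled scores do not vary; cannot fit a distribution")
    # Method-of-moments estimates for a Gumbel distribution (assumption).
    lam = math.pi / (sigma * math.sqrt(6))
    u = mean(scores) - 0.5772156649 / lam  # Euler-Mascheroni constant
    p_value = 1 - math.exp(-math.exp(-lam * (real - u)))
    return real, scores, p_value


seq = "ACGTTGCAAGCTTAGCCGATAGCTAGGCTA"
real, scores, p = shuffle_test(seq, seq, n_shuffles=100)
print(real, max(scores), p)
```

Aligning a sequence with itself, as in the usage lines, gives a real score that no shuffled copy can exceed, so the estimated P-value sits in the extreme tail of the fitted distribution.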

7. If the score is located in the extreme margin of the distribution, that means that the alignment between the two sequences is ______ due to random chance and is thus considered ______.

a) unlikely, significant

b) unlikely, insignificant

c) unlikely, insignificant

d) very likely, significant

Answer: a

Explanation: The extreme margin of the distribution denotes the likeliness and thus significance. A P-value is given to indicate the probability that the original alignment is due to random chance.

8. It is not known whether the Gumbel distribution applies equally well to gapped alignments.

a) True

b) False

Answer: a

Explanation: The statistics in the test were derived from ungapped local sequence alignments. Hence, it is not known whether the Gumbel distribution applies equally well to gapped alignments. However, for all practical purposes, it is reasonable to assume that scores for gapped alignments essentially fit the same distribution. A frequently used software program for assessing statistical significance of a pairwise alignment is the PRSS program.

9. Which of the following is untrue about the PRSS program?

a) It stands for Probability of Random Shuffles

b) It is a web-based program that can be used to evaluate the statistical significance of DNA or protein sequence alignment

c) It first aligns two sequences using the Needleman-Wunsch algorithm and calculates the score

d) It holds one sequence in its original form and randomizes the order of residues in the other sequence.

Answer: c

Explanation: It first aligns two sequences using the Smith–Waterman algorithm and calculates the score. The shuffled sequence is realigned with the unshuffled sequence. The resulting alignment score is recorded. This process is iterated many (normally 1,000) times to help generate data for fitting the Gumbel distribution.

10. The major disadvantage of the PRSS program is that it doesn’t allow partial shuffling.

a) True

b) False

Answer: b

Explanation: The major feature of the program is that it allows partial shuffling. For example, shuffling can be restricted to residues within a local window of 25–40, whereas the residues outside the window remain unchanged.

3. Questions & Answers on Multiple Sequence Alignment

Exhaustive Algorithms

1. Related sequences are identified through the database similarity searching and as the process generates multiple matching sequence pairs, it is often necessary to convert the numerous pair wise alignments into a single alignment.

a) True

b) False

Answer: a

Explanation: A natural extension of pair wise alignment is multiple sequence alignment, which is to align multiple related sequences to achieve optimal matching of the sequences. Related sequences are identified through the database similarity searching. As the process generates multiple matching sequence pairs, it is often necessary to convert the numerous pair wise alignments into a single alignment, which arranges sequences in such a way that evolutionarily equivalent positions across all sequences are matched.

2. There is a unique advantage of multiple sequence alignment because it reveals more biological information than many pair wise alignments can.

a) True

b) False

Answer: a

Explanation: It is truly an advantage of multiple sequence alignment. For example, it allows the identification of conserved sequence patterns and motifs in the whole sequence family, which are not obvious to detect by comparing only two sequences.

3. Which of the following cannot be related to multiple sequence alignment?

a) Many conserved and functionally critical amino acid residues can be identified in a protein multiple alignment

b) Multiple sequence alignment is also an essential prerequisite to carrying out phylogenetic analysis of sequence families and prediction of protein secondary and tertiary structures

c) Multiple sequence alignment also has applications in designing degenerate polymerase chain reaction (PCR) primers based on multiple related sequences

d) This method does not contribute much to degenerate polymerase chain reaction (PCR) primers creation

Answer: d

Explanation: In practice, heuristic approaches are most often used. Multiple sequence alignment has applications in designing degenerate PCR primers based on multiple related sequences.

4. The scoring function for multiple sequence alignment is based on the concept of sum of pairs (SP).

a) True

b) False

Answer: a

Explanation: Multiple sequence alignment is to arrange sequences in such a way that a maximum number of residues from each sequence are matched up according to a particular scoring function and is based on the concept of sum of pairs (SP). As the name suggests, it is the sum of the scores of all possible pairs of sequences in a multiple alignment based on a particular scoring matrix.

5. Which of the following is not considered while calculating the SP scores?

a) All possible pair wise matches

b) All possible mismatches

c) All possible gap costs

d) Number of gap penalties

Answer: d

Explanation: In calculating the SP scores, each column is scored by summing the scores for all possible pair wise matches, mismatches and gap costs. The score of the entire alignment is the sum of all of the column scores. Option d is therefore irrelevant here.
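
The column-by-column calculation can be sketched in Python as follows. The scoring values (match 1, mismatch -1, gap -2) are illustrative assumptions standing in for a real scoring matrix, and scoring a gap paired with another gap as zero is a further assumption, since conventions differ.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows.

    Assumptions: simple match/mismatch values instead of a scoring matrix,
    and a gap aligned with a gap contributes nothing.
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):  # every pair of rows in this column
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


print(sp_score(["AC-", "ACG", "A-G"]))
```

In the usage example, the first column contributes three matching pairs, while each of the other two columns contributes one match and two gap costs.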

6. Given a multiple alignment of three sequences, the sum of scores is calculated as the sum of the dissimilarity scores of every pair of sequences at each position.

a) True

b) False

Answer: b

Explanation: Given a multiple alignment of three sequences, the sum of scores is calculated as the sum of the similarity scores of every pair of sequences at each position. The scoring is based on the BLOSUM62 matrix. If the total score for the alignment is 5, the alignment is 2^5 = 32 times more likely to occur among homologous sequences than by random chance.

7. There are two approaches viz. exhaustive and heuristic approaches used in multiple sequence alignment.

a) True

b) False

Answer: a

Explanation: The exhaustive alignment method involves examining all possible aligned positions simultaneously. Similar to dynamic programming in pair wise alignment, which involves the use of a two-dimensional matrix to search for an optimal alignment, to use dynamic programming for multiple sequence alignment, extra dimensions are needed to take all possible ways of sequence matching into consideration.

8. In a multidimensional search matrix, for aligning N sequences, an (N+2)-dimensional matrix is needed to be filled with alignment scores.

a) True

b) False

Answer: b

Explanation: In a multidimensional search matrix, for aligning N sequences, an N-dimensional matrix is needed to be filled with alignment scores. For instance, for three sequences, a three-dimensional matrix is required to account for all possible alignment scores. Back-tracking is applied through the three-dimensional matrix to find the highest scored path that represents the optimal alignment.

9. As the amount of computational time and memory space required increases exponentially with the number of sequences, it makes the multidimensional search matrix method computationally prohibitive to use for a large data set.

a) True

b) False

Answer: a

Explanation: This is indeed the drawback of that method. For this reason, full dynamic programming is limited to small datasets of less than ten short sequences. For the same reason, few multiple alignment programs employing this “brute force” approach are publicly available.
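
A back-of-the-envelope Python sketch shows how quickly the matrix grows. It assumes the usual dynamic programming layout in which each dimension has one more cell than the sequence length, to hold the empty prefix; the function name is illustrative.

```python
def dp_cells(lengths):
    """Number of cells in a full dynamic programming matrix for sequences
    of the given lengths, assuming (length + 1) cells per dimension."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


for count in (2, 3, 5, 10):
    print(count, dp_cells([100] * count))
```

Each added sequence of 100 residues multiplies the number of cells by 101, which is the exponential growth that makes the full method impractical for more than a handful of sequences.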

10. Which of the following is untrue about DCA?

a) It stands for Divide-and-Conquer Alignment

b) It works by breaking each of the sequences into two smaller sections

c) The breaking points during the process are determined based on regional similarity of the sequences

d) If the sections are not short enough, further divisions are restricted as well

Answer: d

Explanation: DCA is a web-based program that is in fact semi-exhaustive because certain steps of computation are reduced to heuristics. If the sections are not short enough, further divisions are carried out. When the lengths of the sequences reach a predefined threshold, dynamic programming is applied for aligning each set of subsequences. The resulting short alignments are joined together head to tail to yield a multiple alignment of the entire length of all sequences.

Heuristic Algorithms

1. Which of the following is untrue regarding Progressive Alignment Method?

a) Progressive alignment depends on the stepwise assembly of multiple alignment and is heuristic in nature

b) It speeds up the alignment of multiple sequences through a multistep process

c) It first conducts pair wise alignments for each possible pair of sequences using the Needleman–Wunsch global alignment method and records these similarity scores from the pair wise comparisons

d) Its drawback is it slows down the alignment of multiple sequences through a single step process

Answer: d

Explanation: Progressive alignment speeds up the alignment of multiple sequences through a multistep process. Further, the scores can either be percent identity or similarity scores based on a particular substitution matrix. Both scores correlate with the evolutionary distances between sequences.

2. Clustal is a progressive multiple alignment program available either as a stand-alone or on-line program.

a) True

b) False

Answer: a

Explanation: Probably the most well-known progressive alignment program is Clustal. The stand-alone program, which runs on UNIX and Macintosh, has two variants, Clustal W and Clustal X. The W version provides a simple text-based interface and the X version provides a more user-friendly graphical interface.

3. Which of the following is untrue regarding the progressive alignment method?

a) The program also applies a weighting scheme to increase the reliability of aligning divergent sequences (sequences with less than 25% identity)

b) This is done by down-weighting redundant and closely related groups of sequences in the alignment by a certain factor

c) This scheme is useful in enhancing similar sequences from dominating the alignment

d) This scheme is useful in preventing similar sequences from dominating the alignment View Answer

Answer: c

Explanation: This scheme is useful in preventing, not enhancing, the domination of the alignment by similar sequences. Further, the weight factor for each sequence is determined by its branch length on the guide tree. The branch lengths are normalized by how many times sequences share a basal branch from the root of the tree.

4. Which of the following is not a drawback of the progressive alignment method?

a) The progressive alignment method is not suitable for comparing sequences of different lengths because it is a global alignment–based method

b) Because this method uses affine gap penalties, long gaps are not allowed, and, in some cases, this may limit the accuracy of the method

c) Because this method uses affine gap penalties, long gaps are allowed, and, in some cases, this may limit the accuracy of the method

d) The final alignment result is also influenced by the order of sequence addition View Answer

Answer: c

Explanation: Another major limitation is the “greedy” nature of the algorithm: it depends on the initial pair wise alignments. Once gaps are introduced in the early steps of alignment, they are fixed, so the final alignment could be far from optimal. The problem can be more glaring when dealing with divergent sequences.

5. Which of the following is untrue regarding T-Coffee?

a) It stands for Tree-based Consistency Objective Function for alignment Evaluation

b) It performs progressive sequence alignments as in Clustal.

c) The global pair wise alignment is not performed using the Clustal program.

d) The local pair wise alignment is generated by the Lalign program, from which the top ten scored alignments are selected

View Answer

Answer: c

Explanation: The global pair wise alignment is performed using the Clustal program. The main difference is that, in processing a query, T-Coffee performs both global and local pair wise alignment for all possible pairs involved. The collection of local and global sequence alignments is pooled to form a library. The consistency of the alignments is evaluated.

6. Which of the following is untrue about iterative approach?

a) The iterative approach is based on the idea that an optimal solution can be found by repeatedly modifying existing suboptimal solutions

b) Because the order of the sequences used for alignment is different in each iteration, this method may alleviate the “greedy” problem of the progressive strategy

c) This method is also heuristic in nature and does not have guarantees for finding the optimal alignment

d) This method is not based on heuristic methods View Answer

Answer: d

Explanation: This method is based on heuristic methods. The procedure starts by producing a low-quality alignment and gradually improves it by iterative realignment through well-defined procedures until no more improvements in the alignment scores can be achieved.

7. Which of the following is untrue about PRRN?

a) PRRN is a web-based program that uses a double nested iterative strategy for multiple alignment

b) It performs multiple alignments through two sets of iterations: inner iteration and outer iteration

c) In the outer iteration, an initial random alignment is generated that is used to derive a UPGMA tree

d) In the inner iteration, the sequences are randomly divided into multiple groups View Answer

Answer: d

Explanation: In the inner iteration, the sequences are randomly divided into two groups. Randomized alignment is used for each group in the initial cycle, after which the alignment positions in each group are fixed. The two groups, each treated as a single sequence, are then aligned to each other using global dynamic programming. The process is repeated through many cycles until the total SP score no longer increases. At this point, the resulting alignment is used to construct a new UPGMA tree.

8. The major drawback of the progressive and iterative alignment strategies is that they are largely global alignment based and may therefore fail to recognize conserved domains and motifs among highly divergent sequences of varying lengths.

a) True

b) False View Answer

Answer: a

Explanation: For such divergent sequences that share only regional similarities, a local alignment based approach has to be used. The strategy identifies a block of ungapped alignment shared by all the sequences, hence, the block-based local alignment strategy.

9. Which of the following is untrue about DIALIGN2?

a) It is a web based program designed to detect local similarities

b) It is designed to detect global similarities

c) It does not apply gap penalties and thus is not sensitive to long gaps

d) The method breaks each of the sequences down to smaller segments and performs all possible pair wise alignments between the segments

View Answer

Answer: b

Explanation: High-scoring segments, called blocks, among different sequences are then compiled in a progressive manner to assemble a full multiple alignment. It places emphasis on block-to-block comparison rather than residue-to-residue comparison. The sequence regions between the blocks are left unaligned. The program has been shown to be especially suitable for aligning divergent sequences with only local similarity.

10. Match-Box compares segments of some of the nine residues of possible Pair wise alignments.

a) True

b) False View Answer

Answer: b

Explanation: Match-Box compares segments of every nine residues of all possible pair wise alignments. It is a web-based server that also aims to identify conserved blocks (or boxes) among sequences. If the similarity of particular segments is above a certain threshold across all sequences, they are used as an anchor to assemble multiple alignments; residues between blocks are unaligned.

Needleman–Wunsch Algorithm

1. Which of the following is not the objective to perform sequence comparison?

a) To observe patterns of conservation

b) To find the common motifs present in both sequences

c) To study the physical properties of molecules

d) To study evolutionary relationships View Answer

Answer: c

Explanation: Sequence comparison is required to assess whether it is likely that two sequences evolved from the same sequence. It is also carried out to find which sequences in a database are similar to the sequence at hand.

2. A dotplot is a visual and qualitative technique, whereas sequence alignment is an exact and quantitative measure of similarity.

a) True

b) False View Answer

Answer: a

Explanation: Sequence alignment is an exact and quantitative measure of similarity. It involves construction of the best alignment between the sequences and assessment of the similarity from that alignment.

3. The global sequence alignment is suitable when the two sequences are of dissimilar length, with a negligible degree of similarity throughout.

a) True

b) False View Answer

Answer: b

Explanation: The global sequence alignment is suitable when the two sequences are of similar length, with a significant degree of similarity throughout. It gives the best alignment over the entire length of two sequences.

4. The alignment score is the sum of substitution scores and gap penalties in this type of algorithm.

a) True

b) False View Answer

Answer: a

Explanation: The score of an alignment is the sum of the substitution scores for the aligned residue pairs and the penalties for the gaps it contains.

5. The substitution matrices are rarely used in this type of matching.

a) True

b) False View Answer

Answer: b

Explanation: Substitution matrices are very commonly used in this type of matching. The residue substitution costs can be expressed concisely as an N x N matrix, where N is 4 for DNA (4 nucleotides) and 20 for proteins (20 amino acid residues).
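
As an illustration of the N x N idea for DNA (N = 4), the sketch below builds a simple match/mismatch matrix and uses it to score an ungapped pairing. The scores of +1 and -1 and the function names are assumptions chosen for the example, not values taken from any published matrix.

```python
def make_dna_matrix(match: int = 1, mismatch: int = -1) -> dict:
    """Build a 4 x 4 substitution matrix keyed by (base, base) pairs."""
    bases = "ACGT"
    return {(x, y): (match if x == y else mismatch) for x in bases for y in bases}


def score_ungapped(seq_a: str, seq_b: str, matrix: dict) -> int:
    """Sum the substitution scores of two equal-length sequences, position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))
```

A protein matrix has the same shape with 20 x 20 entries, typically filled from PAM or BLOSUM tables rather than a single match/mismatch pair.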

6. Which of the following is untrue about Protein substitution matrices?

a) They are significantly more complex than DNA scoring matrices

b) They have the N x N matrices of the amino acids

c) Protein substitution matrices have quite important role in evolutionary studies

d) They are significantly less complex than DNA scoring matrices View Answer

Answer: d

Explanation: Protein substitution matrices are significantly more complex than DNA scoring matrices. Proteins are composed of twenty amino acids, and physico-chemical properties of individual amino acids vary considerably. A protein substitution matrix can be based on any property of amino acids: size, polarity, charge, hydrophobicity.

7. In Needleman-Wunsch algorithm, the gaps are scored -2.

a) True

b) False View Answer

Answer: b

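Explanation: The Needleman-Wunsch algorithm does not prescribe a gap score of -2. The gap penalty is a parameter of the scoring scheme, and in the simplest scheme gaps are ignored, that is, the gap penalty is zero. A gap corresponds to an insertion or a deletion of a residue.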

8. The number of possible global alignments between two sequences of length N is

View Answer

Answer: b

Explanation: By the total number of permutations and combinations, option b gives the accurate number of possible global alignments between two sequences of length N. For two sequences of 250 residues this is about 10^149.
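
The order of magnitude quoted above can be checked numerically. The sketch below assumes the commonly cited count of C(2N, N) alignments, with the Stirling estimate 2^(2N) / sqrt(pi N); whether this is exactly the expression given in option b is an assumption, since the options are not reproduced here.

```python
import math


def exact_count(n: int) -> int:
    """Binomial coefficient C(2N, N), assumed here as the alignment count."""
    return math.comb(2 * n, n)


def stirling_log10(n: int) -> float:
    """log10 of the estimate 2^(2N) / sqrt(pi * N)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)
```

For N = 250 both functions give a value between 10^149 and 10^150, which is why exhaustive enumeration is impractical and dynamic programming is used instead.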

9. Which of the following is untrue about Needleman-Wunsch algorithm?

a) It is an example of dynamic programming

b) Basic idea here is to build up the best alignment by using optimal alignments of larger sub sequences

c) It was first used by Saul Needleman and Christian Wunsch

d) It was first used in 1970 View Answer

Answer: b

Explanation: The basic idea of the Needleman-Wunsch algorithm is to build up the best alignment by using optimal alignments of smaller subsequences. It is based on dynamic programming, a discipline invented by Richard Bellman in 1953.

10. There are two types of matrices involved in the Needleman-Wunsch algorithm: score matrices and traceback matrices.

a) True

b) False View Answer

Answer: a

Explanation: The Needleman-Wunsch algorithm consists of three steps where these matrices play their role as follows:

1. Initialization of the score matrix

2. Calculation of scores and filling the traceback matrix

3. Deducing the alignment from the traceback matrix
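
The three steps listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the match, mismatch, and gap scores are assumed example values (pass `gap=0` to leave gaps unpenalized), and ties are resolved in favour of the diagonal move.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    trace = [[None] * cols for _ in range(rows)]

    # Step 1: initialize the first row and column of the score matrix.
    for i in range(1, rows):
        score[i][0] = i * gap
        trace[i][0] = "up"
    for j in range(1, cols):
        score[0][j] = j * gap
        trace[0][j] = "left"

    # Step 2: fill the score matrix and record each best move in the traceback matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            trace[i][j] = "diag" if best == diag else ("up" if best == up else "left")

    # Step 3: deduce the alignment by walking the traceback matrix from the corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Keeping the traceback in its own matrix, rather than recomputing the moves from the scores, mirrors the score matrix and traceback matrix distinction made in the question.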

Progressive Methods of Multiple Sequence Alignment

1. Progressive alignment methods use the dynamic programming method to build a MSA starting with the most related sequences and then progressively adding less related sequences or groups of sequences to the initial alignment

a) True

b) False View Answer

Answer: a

Explanation: The progressive alignment methods use the dynamic programming method. Relationships among the sequences are modeled by an evolutionary tree in which the outer branches or leaves are the sequences. The tree is based on pair-wise comparisons of the sequences using one of the phylogenetic methods.

2. Progenitor sequences represented by the ______ branches of the tree are derived by alignment of the ______ sequences.

a) outer, outermost

b) inner, outermost

c) inner, innermost

d) outer, innermost View Answer

Answer: b

Explanation: Progenitor sequences represented by the inner branches of the tree are derived by alignment of the outermost sequences. These inner branches will have uncertainties where positions in the outermost sequences are dissimilar.

3. CLUSTALW is a more recent version of CLUSTAL with the W standing for

a) weakening

b) winding

c) weighting

d) wiping View Answer

Answer: c

Explanation: The W in CLUSTALW stands for ‘weighting’ to represent the ability of the program to provide weights to the sequence and program parameters. CLUSTAL has been around for more than 10 years and lots of improvements in the program have been made.

4. The CLUSTALX provides a graphic interface.

a) True

b) False View Answer

Answer: a

Explanation: Two examples of programs that use progressive methods are CLUSTALW and the Genetics Computer Group program PILEUP. CLUSTALX provides a graphic interface.

These changes provide more realistic alignments that should reflect the evolutionary changes in the aligned sequences and the more appropriate distribution of gaps between conserved domains.

5. Which of the following is untrue about CLUSTAL program?

a) CLUSTAL performs a global-multiple sequence alignment by a different method than MSA (Multiple Sequence Alignment)

b) The initial heuristic alignment obtained by MSA is calculated in a different way

c) The initial step includes performing pair-wise alignments of all of the sequences

d) The intermediate step includes using the alignment scores to produce a phylogenetic tree

View Answer

Answer: b

Explanation: The initial heuristic alignment obtained by MSA is calculated the same way, although CLUSTAL performs a global multiple sequence alignment by a different method than MSA (Multiple Sequence Alignment). Options c and d describe the first two steps; the last step is aligning the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.

6. The initial alignments used to produce the guide tree may be obtained by various methods. Which of the following is not one of them?

a) Fast k-tuple

b) pattern-finding approach similar

c) FASTA

d) Faster, full dynamic programming method View Answer

Answer: d

Explanation: The methods used may be a fast k-tuple or pattern-finding approach similar to FASTA, which is useful for many sequences, or the full dynamic programming method. Option d is incorrect because the full dynamic programming method is slower than the other methods listed.

7. The scoring of gaps in a MSA (Multiple Sequence Alignment) has to be performed in a different manner from scoring gaps in a pair-wise alignment

a) True

b) False View Answer

Answer: a

Explanation: As more sequences are added to a profile of an existing MSA, gaps accumulate and influence the alignment of further sequences. CLUSTALW calculates gaps in a novel way designed to place them between conserved domains.

8. Like other alignment programs, CLUSTAL uses a null score for opening a gap in a sequence alignment and a penalty for extending the gap by one residue.

a) True

b) False View Answer

Answer: b

Explanation: CLUSTAL uses a penalty for opening a gap in a sequence alignment and an additional penalty for extending the gap by one residue. These penalties are user-defined. Gaps found in the initial alignments remain fixed. New gaps introduced as more sequences are added also receive this same gap penalty, even when they occur within an existing gap, but the gap penalties for an alignment are then modified according to the average match value in the substitution matrix, the percent identity between the sequences, and the sequence lengths.
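
The opening-plus-extension scheme described above is often called an affine gap penalty. The sketch below assumes one common convention, in which the first gapped position costs the opening penalty and each further position costs the extension penalty. The default values are illustrative assumptions only, and the later adjustment of penalties that the explanation mentions is not modeled.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for a single gap of the given length under an affine scheme."""
    if length <= 0:
        return 0.0
    # assumption: the opening penalty covers the first gapped position
    return gap_open + gap_extend * (length - 1)
```

Under these assumed values, one gap of five residues costs 12.0, whereas five separate one-residue gaps cost 50.0, so the scheme favours a few contiguous gaps over many scattered ones.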

9. Which of the following is untrue about PILEUP program?

a) It is the MSA program that is a part of the Genetics Computer Group package of sequence analysis programs

b) It is owned since 1997 by Oxford Molecular Group, and is widely used due to the popularity and availability of this package

c) It uses a method for MSA that is polar opposite to CLUSTALW

d) The sequences are aligned pair-wise using the Needleman-Wunsch dynamic programming algorithm

View Answer

Answer: c

Explanation: PILEUP uses a method for MSA that is very similar to CLUSTALW. The sequences are aligned pair-wise using the Needleman-Wunsch dynamic programming algorithm, and the scores are used to produce a tree by the unweighted pair-group method using arithmetic averages (UPGMA). The resulting tree is then used to guide the alignment of the most closely related sequences and groups of sequences. The resulting alignment is a global alignment produced by the Needleman-Wunsch algorithm.
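
The tree-building step (unweighted pair-group method using arithmetic averages) can be sketched as below. This is a simplified illustration that returns only the branching order as nested tuples and ignores branch lengths. The function name, the input format, and the tie-breaking rule (first closest pair found) are assumptions for the example and do not describe PILEUP's actual code.

```python
def upgma(names, matrix):
    """Cluster labels into a nested-tuple tree from a symmetric distance matrix."""
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    dist = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (i < j).
        i, j = min(
            ((p, q) for p in range(n) for q in range(p + 1, n)),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster:
        # average of the two old distances, weighted by cluster size.
        new_row = [
            (dist[i][k] * size_i + dist[j][k] * size_j) / (size_i + size_j)
            for k in keep
        ]
        # Rebuild the matrix without rows/columns i and j, then append the merged cluster.
        dist = [[dist[r][c] for c in keep] for r in keep]
        for row, value in zip(dist, new_row):
            row.append(value)
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]
```

In a progressive aligner the innermost pairs of such a tree would be aligned first, and more distant sequences or groups added afterwards.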

10. The resulting tree is then used to guide the alignment of the most closely related sequences and groups of sequences. The resulting alignment is a global alignment produced by the Needleman-Wunsch algorithm.

a) True

b) False View Answer

Answer: a

Explanation: The very first sequences to be aligned are the most closely related on the sequence tree. If these sequences align very well, there will be few errors in the initial alignments.

However, the more distantly related these sequences, the more errors will be made, and these errors will be propagated to the MSA. There is no simple way to circumvent this problem. A second problem with the progressive alignment method is the choice of suitable scoring matrices and gap penalties that apply to the set of sequences.

Iterative Methods of Multiple Sequence Alignment

1. Iterative methods include repeatedly realigning subgroups of the sequences and then aligning these subgroups into a local alignment of all of the sequences.

a) True

b) False View Answer

Answer: b

Explanation: Subgroups are aligned into a global alignment of all of the sequences. The objective is to improve the overall alignment score, such as a sum of pairs score. Selection of these groups may be based on the ordering of the sequences on a phylogenetic tree predicted in a manner similar to that of progressive alignment, separation of one or two of the sequences from the rest, or a random selection of the groups.
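
A sum of pairs score of the kind mentioned above can be sketched as follows. This is a minimal illustration under assumed scoring values (match +1, mismatch -1, residue-versus-gap -2, gap-versus-gap 0), using a similarity convention in which higher is better. Real programs typically use a substitution matrix and affine gap penalties, and some formulations treat the score as a cost to be minimized.

```python
from itertools import combinations


def sum_of_pairs(alignment, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the pairwise scores of every column of an alignment (list of equal-length rows)."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: a gap paired with a gap scores nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

An iterative method would recompute a score like this after each realignment of a subgroup and keep the change only if the score improves.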

2. Which of the following is incorrect regarding PRRP?

a) The program PRRP uses iterative methods to produce an alignment

b) An initial pair-wise alignment is made to predict a tree

c) Only one cycle is performed

d) The whole process is repeated until there is no further increase in the alignment score View Answer

Answer: c

Explanation: As mentioned, an initial pair-wise alignment is made to predict a tree, and the tree is used to produce weights for making alignments in the same manner as MSA, except that the sequences are analyzed for the presence of aligned regions that include gaps rather than being globally aligned, and these regions are iteratively recalculated to improve the alignment score. The best scoring alignment is then used in a new cycle of calculations to predict a new tree, new weights, and new alignments.

3. In the program DIALIGN, pairs of sequences are aligned to locate aligned regions that do not include gaps, much like continuous diagonals in a dot matrix plot.

a) True

b) False View Answer

Answer: a

Explanation: The program DIALIGN finds an alignment by a different iterative method. Pairs of sequences are aligned to locate aligned regions that do not include gaps, much like continuous diagonals in a dot matrix plot. Diagonals of various lengths are identified.

A consistent collection of weighted diagonals that provides an alignment which is a maximum sum of weights is then found.
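
The idea of ungapped diagonals in a dot matrix can be sketched as below. This simplified example reports only runs of identical residues of at least a chosen length. The function name and the `min_length` parameter are assumptions, and the weighting and consistency steps described above are not modeled.

```python
def find_diagonals(a: str, b: str, min_length: int = 3):
    """Return (start_in_a, start_in_b, length) for each run of identical residues on a diagonal."""
    diagonals = []
    # Each diagonal of the dot matrix is identified by offset = j - i.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_length:
                    diagonals.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:
            diagonals.append((i - run, j - run, run))
    return diagonals
```

Each tuple returned corresponds to one continuous diagonal segment in a dot plot of the two sequences.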

4. The Genetic Algorithm method has been recently adapted for MSA (Multiple Sequence Alignment) by Corpet (1998).

a) True

b) False View Answer

Answer: b

Explanation: The genetic algorithm is a general type of machine-learning algorithm that has no direct relationship to biology and that was invented by computer scientists. The method has been recently adapted for MSA (Multiple Sequence Alignment) by Notredame and Higgins (1996) in a computer program package called SAGA (Sequence Alignment by Genetic Algorithm).

5. An approach for obtaining a higher-scoring MSA (Multiple Sequence Alignment) by rearranging an existing alignment uses a probability approach called simulated annealing.

a) True

b) False View Answer

Answer: a

Explanation: The program MSASA (Multiple Sequence Alignment by Simulated Annealing) starts with a heuristic MSA (Multiple Sequence Alignment). Further, it changes the alignment by following an algorithm designed to identify changes that increase the alignment score.

6. The first step in Genetic Algorithm is arranging the sequences to be aligned in rows

a) True

b) False View Answer

Answer: a

Explanation: The sequences to be aligned are written in rows, as on a page, except that they are made to overlap by a random amount of sequence, up to 50 residues long for sequences about 200 in length. The ends are then padded with gaps. A typical population of 100 of these MSAs is made, although other numbers may be set.

7. The second step in the Genetic Algorithm comprises scoring the 100 initial MSAs by the sum of pairs method.

a) True

b) False View Answer

Answer: a

Explanation: The 100 initial MSAs are scored by the sum of pairs method, except that both natural and quasi-natural gap-scoring schemes are used. Recall that the best SSP score for a MSA is the minimum one and the one that is closest to the sum of the pair-wise sequence alignment. Standard amino acid scoring matrices and gap opening and extension penalties are used.

8. In Genetic Algorithm, in the mutation process

a) sequence is changed

b) gaps are not inserted

c) sequence is not changed

d) gaps are not rearranged View Answer

Answer: c

Explanation: In the mutation process, the sequence is not changed (else it would no longer be an alignment), but gaps are inserted and rearranged in an attempt to create a better-scoring MSA. In the gap insertion process, the sequences in a given MSA are divided into two groups based on an estimated phylogenetic tree, and gaps of random length are inserted into random positions in the alignment.

9. The HMM is a statistical model that considers few combinations of matches and gaps to generate an alignment of a set of sequences.

a) True

b) False View Answer

Answer: b

Explanation: The HMM is a statistical model that considers all possible combinations of matches, mismatches, and gaps to generate an alignment of a set of sequences. A localized region of similarity, including insertions and deletions, may also be modeled by an HMM.

10. Which of the following is not true about iterative methods?

a) Genetic Algorithm is a method used under this approach

b) Hidden Markov Models are used for Multiple Sequence Alignment

c) The objective is to improve the overall alignment score

d) MultAlin recalculates global scores View Answer

Answer: d

Explanation: MultAlin (Corpet 1988) recalculates pair-wise scores during the production of a progressive alignment. In addition, it uses these scores to recalculate the tree, which is then used to refine the alignment in an effort to improve the score.

Localized Alignments in Sequences

1. Which of the following is not among the methods for finding localized sequence similarity?

a) Profile Analysis

b) Block Analysis

c) Extraction of Blocks from a Global or Local MSA

d) Pattern blocking View Answer

Answer: d

Explanation: Pattern Searching is the correct name of the method for finding localized sequence similarity. This type of analysis was performed on groups of related proteins, and the amino acid patterns that were located may be found in the Prosite catalog.

2. Profiles are found by performing the ______ MSA of a group of sequences and then removing the ______ regions in the alignment into a smaller MSA.

a) local, more highly conserved

b) global, low conserved

c) global, more highly conserved

d) local, low conserved View Answer

Answer: c

Explanation: Profiles are found by performing the global MSA of a group of sequences and then removing the more highly conserved regions in the alignment into a smaller MSA. A scoring matrix for the MSA, called a profile, is then made. The profile is composed of columns much like a mini-MSA and may include matches, mismatches, insertions, and deletions.

3. The program Profilemake can be used to produce a profile from a MSA

a) True

b) False View Answer

Answer: a

Explanation: A version of the Profilesearch program, which performs a database search for matches to a profile, is available at the University of Pittsburgh Supercomputer Center. A special grant application may be needed to use this facility. Profile-generating programs are available by FTP and are included in the Genetics Computer Group suite of programs.

4. Which of the following is untrue regarding the block analysis method?

a) Blocks represent a conserved region in the MSA

b) Blocks differ from profiles in lacking insert and delete positions in the sequences

c) Every column includes only matches and mismatches

d) Blocks may be made by searching for a section of an MSA alignment that is low conserved

View Answer

Answer: d

Explanation: Like profiles, blocks may be made by searching for a section of an MSA alignment that is highly conserved. However, aligned regions may also be found by searching each sequence in turn for similar patterns of the same length. These patterns may include a region with one or a few matching characters followed by a short spacer region of unmatched characters and then by another set of a few matching characters, and so on, until the sequences start to be different.

5. Block analysis methods use substitution matrices such as the PAM and BLOSUM matrices to score matches.

a) True

b) False View Answer

Answer: b

Explanation: These methods do not use substitution matrices such as the PAM and BLOSUM matrices to score matches. Rather, they are based on finding exact matches that have the same spacing in at least some of the input sequences, and that may be repeated in a given sequence.

6. In the method of extraction of blocks from a global or local MSA, a global MSA of related protein sequences usually includes regions that have been aligned without gaps in any of the sequences.

a) True

b) False View Answer

Answer: a

Explanation: These ungapped patterns may be extracted from these aligned regions and used to produce blocks. Blocks found in this manner are only as good as the MSA from which they are derived. A global MSA of related protein sequences usually includes regions that have been aligned without gaps in any of the sequences.

7. Which of the following is not true regarding the BLOCKS?

a) Blocks of width 10–55 are extracted from a protein MSA

b) The protein MSA is up to 400 sequences

c) The program doesn’t accept manually reformatted MSAs

d) The program accepts FASTA format View Answer

Answer: c

Explanation: The program accepts FASTA, CLUSTAL, or MSF formats, or manually reformatted MSAs. Several types of analyses may be performed with such extracted blocks. The BLOCKS server primarily generates blocks from unaligned sequences. The eMOTIFs server similarly extracts motifs from MSAs in several MSA formats and provides a formatter for additional MSA formats.

8. The pattern searching type of analysis was performed on groups of related proteins, and the amino acid patterns that were located may be found in the Prosite catalog.

a) True

b) False View Answer

Answer: a

Explanation: The Prosite catalog groups proteins that have similar biochemical functions on the basis of amino acid patterns such as those in the active site. Subsequently, these families were searched for amino acid patterns by the MOTIF program (Smith et al. 1990), which finds patterns of the type aa1 d1 aa2 d2 aa3, where aa1, aa2, and aa3 are conserved amino acids and d1 and d2 are stretches of intervening sequence up to 24 amino acids long.
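
A spaced pattern of the form aa1 d1 aa2 d2 aa3 can be illustrated with a regular expression. The sketch below is not the MOTIF program. It is a hypothetical helper that only checks where one given pattern occurs in one sequence, with d1 and d2 supplied as fixed spacer lengths.

```python
import re


def find_spaced_pattern(sequence: str, aa1: str, d1: int, aa2: str, d2: int, aa3: str):
    """Return the 0-based start positions where aa1, d1 residues, aa2, d2 residues, aa3 occur."""
    body = (
        re.escape(aa1) + ".{%d}" % d1
        + re.escape(aa2) + ".{%d}" % d2
        + re.escape(aa3)
    )
    # A lookahead is used so that overlapping occurrences are also reported.
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(sequence)]
```

For example, searching `MKCAAHGGGDLL` for C, two spacer residues, H, three spacer residues, D reports a single occurrence starting at the C.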

9. Although MOTIF program is used successfully for making the BLOCKS database, it is limited in the pattern sizes that can be found.

a) True

b) False View Answer

Answer: a

Explanation: The MOTIF program distinguishes true motifs from random background patterns by requiring that motifs occur in a number of the input sequences and tend not to be internally repeated in any one sequence. As the length of the motif increases, there are many possible combinations of patterns of a given length where only a few characters match.

10. Which of the following is not true regarding the BLOCKS?

a) The BLOCKS server can extract a conserved, ungapped region from a MSA to produce a sequence block

b) The server can also find blocks in a set of unaligned, input sequences and maintains a large database of blocks based on an analysis of proteins in the Prosite catalog

c) Blocks are found by the Protomat program

d) The program MOTIF doesn’t locate spaced patterns View Answer

Answer: d

Explanation: Blocks are found in two steps. First, the program MOTIF is used to locate spaced patterns. Second, the best and most consistent patterns found in step 1 are passed to the program MOTOMAT, which merges overlapping triplets and extends them, orders the resulting blocks, and chooses those that are in the largest subset of sequences.

Statistical Methods for Aiding Alignment

1. The Expectation Maximization algorithm has been used to identify conserved domains in unaligned proteins only.

a) True

b) False View Answer

Answer: b

Explanation: This algorithm has been used to identify both conserved domains in unaligned proteins and protein-binding sites in unaligned DNA sequences (Lawrence and Reilly 1990), including sites that may include gaps (Cardon and Stormo 1992). The input is a set of sequences that are expected to have a common sequence pattern, which may not be easily recognizable by eye.

2. Which of the following is untrue regarding Expectation Maximization algorithm?

a) An initial guess is made as to the location and size of the site of interest in each of the sequences, and these parts of the sequence are aligned

b) The alignment provides an estimate of the base or amino acid composition of each column in the site

c) The column-by-column composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences

d) The row-by-column composition of the site already available is used to estimate the probability

View Answer

Answer: d

Explanation: The EM algorithm then consists of two steps, which are repeated consecutively. In step 1, the expectation step, the column-by-column composition of the site already available is used to estimate the probability of finding the site at any position in each of the sequences.

These probabilities are used in turn to provide new information as to the expected base or amino acid distribution for each column in the site.

3. Out of the two repeated steps in EM algorithm, the step 2 is

a) the maximization step

b) the minimization step

c) the optimization step

d) the normalization step View Answer

Answer: a

Explanation: In step 2, the maximization step, the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set. Step 1 is then repeated using these new counts. The cycle is repeated until the algorithm converges on a solution and does not change with further cycles. At that time, the best location of the site in each sequence and the best estimate of the residue composition of each column in the site will be available.

4. In the EM algorithm, as an example, suppose that there are 10 DNA sequences having very little similarity with each other, each about 100 nucleotides long and thought to contain a binding site near the middle 20 residues, based on biochemical and genetic evidence. The following steps would be used by the EM algorithm to find the most probable location of the binding sites in each of the ______ sequences.

a) 30

b) 10

c) 25

d) 20

View Answer

Answer: b

Explanation: When examining the EM program MEME, the size and number of binding sites, the location in each sequence, and whether or not the site is present in each sequence do not necessarily have to be known. For the present example, the following steps would be used by the EM algorithm to find the most probable location of the binding sites in each of the 10 sequences.

5. In the initial step of EM algorithm, the 20-residue-long binding motif patterns in each sequence are aligned as an initial guess of the motif.

a) True

b) False View Answer

Answer: a

Explanation: The base composition of each column in the aligned patterns is then determined. The composition of the flanking sequence on each side of the site provides the surrounding base or amino acid composition for comparison. Each sequence is assumed to be the same length and to be aligned by the ends.

6. In the intermediate steps of EM algorithm, the number of each base in each column is determined and then converted to fractions.

a) True

b) False View Answer

Answer: a

Explanation: For example, if there are four Gs in the first column of the 10 sequences, then the frequency of G in the first column of the site is fSG = 4/10 = 0.4. This procedure is repeated for each base and each column.
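The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the MEME implementation; the function name column_frequencies and the ten 3-base sites are hypothetical, chosen so that four sites start with G as in the explanation.

```python
from collections import Counter

def column_frequencies(sites, alphabet="ACGT"):
    """Return one {base: fraction} dict per column of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for col in range(width):
        # Count each base in this column, then convert counts to fractions.
        counts = Counter(site[col] for site in sites)
        freqs.append({base: counts[base] / len(sites) for base in alphabet})
    return freqs

# Ten hypothetical aligned sites; four of them have G in the first column.
sites = ["GAT", "GCT", "GAT", "GTT", "AAT", "CAT", "TAT", "AAC", "CAT", "TAT"]
freqs = column_frequencies(sites)
print(freqs[0]["G"])  # 4 of 10 sites -> 0.4
```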

7. For the 100-residue DNA sequence example, there are ____ possible starting sites for a 20-residue-long site.

a) 30

b) 21

c) 81

d) 60

View Answer

Answer: c

Explanation: For the 100-residue DNA sequence example, there are 100 – 20 + 1 = 81 possible starting sites for a 20-residue-long site. The first site begins at position 1 and ends at position 20, and the last begins at position 81 and ends at position 100 (there is not enough sequence for a 20-residue-long site beyond position 81).
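The count of starting sites follows from the sequence length and the site length. A small sketch (the helper name start_positions is an assumption made for illustration) enumerates the 1-based start positions directly:

```python
def start_positions(seq_len, site_len):
    """1-based start positions of every window of site_len in a sequence of seq_len."""
    if site_len > seq_len:
        return []
    return list(range(1, seq_len - site_len + 2))

starts = start_positions(100, 20)
print(len(starts), starts[0], starts[-1])  # 81 1 81
```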

8. An alternative method is to produce an odds scoring matrix calculated by dividing each base frequency by the background frequency of that base.

a) True

b) False View Answer

Answer: a

Explanation: In this method, the probability of each location is then found by multiplying the odds scores from each column. An even simpler method is to use log odds scores in the matrix. The column scores are then simply added. In this case, the log odds scores must be converted to odds scores before position probabilities are calculated.
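As a rough sketch of the odds and log odds calculations, the following Python assumes a two-column motif, a uniform background of 0.25 per base, and nonzero column frequencies (so that logarithms are defined). All names and numbers are hypothetical.

```python
import math

def odds_matrix(col_freqs, background):
    """Divide each column frequency by the background frequency of that base."""
    return [{b: col[b] / background[b] for b in background} for col in col_freqs]

def site_odds(window, odds):
    """Multiply the odds scores of the bases found in each column."""
    score = 1.0
    for base, column in zip(window, odds):
        score *= column[base]
    return score

def log_odds_score(window, odds):
    """Add log2 odds scores instead of multiplying odds scores."""
    return sum(math.log2(column[base]) for base, column in zip(window, odds))

def position_probabilities(seq, odds):
    """Normalize the odds of every possible site start so they sum to 1."""
    width = len(odds)
    scores = [site_odds(seq[i:i + width], odds) for i in range(len(seq) - width + 1)]
    total = sum(scores)
    return [s / total for s in scores]

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
col_freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
odds = odds_matrix(col_freqs, background)
probs = position_probabilities("TAGC", odds)
print(probs)  # the window "AG" (start index 1) gets most of the probability
```

Converting a summed log odds score back with 2 ** score gives the same value as the multiplied odds score, which is why the conversion is needed before position probabilities are calculated.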

9. Which of the following about MEME is untrue?

a) It is a Web resource for performing local MSAs (Multiple Sequence Alignment) by the expectation maximization method

b) It stands for Multiple EM for Motif Elicitation

c) It was developed at the University of California at San Diego Supercomputing Center

d) The Web page has multiple versions for searching blocks by an EM algorithm View Answer

Answer: d

Explanation: The Web page has two versions of MEME, ParaMEME, a Web program that searches for blocks by an EM algorithm, and a similar program, MetaMEME, which searches for profiles using HMMs. The Motif Alignment and Search Tool (MAST) is also provided for searching through databases for matches to motifs.

10. Which of the following about the Gibbs sampler is untrue?

a) It is a statistical method for finding motifs in sequences

b) It is dissimilar to the principle of the EM method

c) It searches for the statistically most probable motifs

d) It can find the optimal width and number of given motifs in each sequence View Answer

Answer: b

Explanation: The Gibbs sampler is another statistical method for finding motifs in sequences. The method is similar in principle to the EM method described above, but the algorithm is different. A combinatorial approach of the Gibbs sampler and MOTIF may be used to make blocks at the BLOCKS Web site.

Position – Specific Scoring Matrices

1. Analysis of MSAs for conserved blocks of sequence leads to production of the position-specific scoring matrix.

a) True

b) False View Answer

Answer: a

Explanation: The analysis of MSAs (Multiple Sequence Alignment) for conserved blocks of sequence leads to production of the position-specific scoring matrix, or PSSM. The PSSM may be used to search a sequence to obtain the most probable location or locations of the motif represented by the PSSM. Alternatively, the PSSM may be used to search an entire database to identify additional sequences that also have the same motif.

2. The quality and quantity of information provided by the PSSM also varies for ____ in the motif.

a) each row

b) each column

c) rows and columns

d) neither the rows nor the columns View Answer

Answer: b

Explanation: The quality and quantity of information provided by the PSSM also varies for each column in the motif, and this variation profoundly influences the matches found with sequences. This situation can be accurately described by information theory, and the results can be displayed by a colored graph called a sequence logo.
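A common way to express the per-column information that a sequence logo displays is the maximum possible entropy minus the observed Shannon entropy of the column. The sketch below assumes that definition without any small-sample correction; the function name and example columns are hypothetical.

```python
import math

def column_information(freqs):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - entropy

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(column_information(conserved))  # 2.0 bits: fully conserved DNA column
print(column_information(uniform))    # 0.0 bits: no information
```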

3. Two considerations arise in trying to tune the PSSM so that it adequately represents the training sequences. Which of the following is not their description?

a) If a given column in 20 sequences has only isoleucine, it is not very likely that a different amino acid will be found in other sequences with that motif because the residue is probably important for function

b) If a given column in 20 sequences has only isoleucine, it is very likely that a different amino acid will be found in other sequences with that motif because the residue is probably important for function

c) If the number of sequences with the found motif is large and reasonably diverse, the sequences represent a good statistical sampling of all sequences that are ever likely to be found with that same motif

d) Another column in the motif from the 20 sequences may have several amino acids, and some amino acids may not be represented at all

View Answer

Answer: b

Explanation: The PSSM is constructed by a simple logarithmic transformation of a matrix giving the frequency of each amino acid in the motif. Even more variation may be expected at that position in other sequences, although the more abundant amino acids already found in that column would probably be favored.

4. If a good sampling of sequences is ____, the number of sequences is ____, and the motif structure is ____, it should, in principle, be possible to obtain frequencies highly representative of the same motif in other sequences also.

a) available, sufficiently large, not too complex

b) unavailable, sufficiently large, not too complex

c) unavailable, sufficiently small, not too complex

d) available, sufficiently large, too complex View Answer

Answer: a

Explanation: The more abundant amino acids already found in a column would probably be favored. Thus, if a good sampling of sequences is available, the number of sequences is sufficiently large, and the motif structure is not too complex, it should, in principle, be possible to obtain frequencies highly representative of the same motif in other sequences also (Henikoff and Henikoff 1996).

5. If the data set is ____, then unless the motif has ____ amino acids in each column, the column frequencies in the motif may not be highly representative of all other occurrences of the motif.

a) small, distinct

b) small, almost identical

c) large, almost identical

d) large, distinct View Answer

Answer: b

Explanation: The number of sequences for producing the motif may be small, highly diverse, or complex, giving rise to a second level of consideration. If the data set is small, then unless the motif has almost identical amino acids in each column, the column frequencies in the motif may not be highly representative of all other occurrences of the motif. In such cases, it is desirable to improve the estimates of the amino acid frequencies by adding extra amino acid counts, called pseudocounts, to obtain a more reasonable distribution of amino acid frequencies in the column.

6. Even if many pseudocounts are added in comparison to real sequence counts, the amino acid frequencies will not have any effect or influence.

a) True

b) False View Answer

Answer: b

Explanation: Knowing how many counts to add is a difficult but fortunately solvable problem. On the one hand, if too many pseudocounts are added in comparison to real sequence counts, the pseudocounts will become the dominant influence in the amino acid frequencies and searches using the motif will not work. On the other hand, if there are relatively few real counts, many amino acid variations may not be present because of the small sample of sequences.
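The effect of the number of pseudocounts can be illustrated with one simple scheme: distribute a total of B pseudocounts according to background frequencies, giving f = (n + B*p) / (N + B). This is only one of several possible schemes, and the uniform background, the value of B, and the function name below are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def smoothed_frequencies(counts, background, pseudo_total):
    """Mix real counts with pseudo_total pseudocounts spread by background frequency."""
    n = sum(counts.values())
    return {
        aa: (counts.get(aa, 0) + pseudo_total * p) / (n + pseudo_total)
        for aa, p in background.items()
    }

# Hypothetical uniform background; real amino acid backgrounds are not uniform.
background = {aa: 1 / 20 for aa in AMINO_ACIDS}
counts = {"I": 20}  # a column with only isoleucine in 20 sequences

few = smoothed_frequencies(counts, background, pseudo_total=1)
many = smoothed_frequencies(counts, background, pseudo_total=1000)
print(round(few["I"], 3), round(many["I"], 3))
```

With one pseudocount the column is still dominated by isoleucine, while with 1000 pseudocounts the background dominates and the real counts barely matter.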

7. Which of the following is not a feature of editors and formatters?

a) provision for displaying the sequence on a color monitor with residue colors to aid in a clear visual representation of the alignment

b) recognition of the multiple sequence format that was output by the MSA (Multiple Sequence Alignment) program

c) maintenance of the alignment in a suitable format when the editing is completed

d) disallowing shading conserved residues in the alignment View Answer

Answer: d

Explanation: In addition to these, a suitable windows interface that allows use of the mouse to add, delete, or move sequence, followed by an updated display of the alignment, is a feature. Other types of editing are also commonly performed on MSAs (Multiple Sequence Alignments), for example, shading conserved residues in the alignment.

8. GDE (Genetic Data Environment) provides a general interface on UNIX machines for sequence analysis, sequence alignment editing, and display.

a) True

b) False View Answer

Answer: a

Explanation: It is available from several anonymous FTP sites. This interface requires communication with a host UNIX machine running the Genetics Computer Group software. Interface with MS-DOS or Macintosh is possible if the computer is equipped with the appropriate X-Windows client software.

9. MACAW is a local multiple sequence alignment program only.

a) True

b) False View Answer

Answer: b

Explanation: MACAW is both a local multiple sequence alignment program and a sequence editing tool. Given a set of sequences, the program finds ungapped blocks in the sequences and gives their statistical significance. Later versions of the program find blocks by one of three user-chosen methods.

10. Two commonly encountered examples are the Genetics Computer Group’s MSF format and the CLUSTALW ALN format.

a) True

b) False View Answer

Answer: a

Explanation: Because these formats follow a precise outline, one may be readily converted to another by computer programs. READSEQ by D.G. Gilbert at Indiana University at Bloomington is one such program.

4. Questions on Database Similarity Searching

Heuristic Database Searching

1. A main application of pairwise alignment is retrieving biological sequences in databases based on similarity.

a) True

b) False View Answer

Answer: a

Explanation: This process involves submission of a query sequence and performing a pairwise comparison of the query sequence with all individual sequences in a database. Thus, database similarity searching is pairwise alignment on a large scale. This type of searching is one of the most effective ways to assign putative functions to newly determined sequences.

2. Dynamic programming method is the fastest and most practical method.

a) True

b) False View Answer

Answer: b

Explanation: The dynamic programming method is slow and impractical to use in most cases. Special search methods are needed to speed up the computational process of sequence comparison.

3. Which of the following is not one of the requirements for implementing algorithms for sequence database searching?

a) Size of the dataset

b) Sensitivity

c) Specificity

d) Speed View Answer

Answer: a

Explanation: There are unique requirements for implementing algorithms for sequence database searching, of which the latter three (sensitivity, specificity, and speed) play an important role. Speed can vary with the size of the database, but dataset size is not itself a requirement. Achieving all three at once is nearly impossible.

4. Sensitivity refers to the ability to find as many correct hits as possible.

a) True

b) False View Answer

Answer: a

Explanation: Among the unique requirements for implementing algorithms for sequence database searching, the first criterion is sensitivity, which refers to the ability to find as many correct hits as possible. It is measured by the extent of inclusion of correctly identified sequence members of the same family. These correct hits are considered ‘true positives’ in the database searching exercise.

5. The specificity refers to the ability to include incorrect hits.

a) True

b) False View Answer

Answer: b

Explanation: In heuristic database searching methods, the second criterion is selectivity, also called specificity, which refers to the ability to exclude incorrect hits. These incorrect hits are unrelated sequences mistakenly identified in database searching and are considered ‘false positives.’

6. In heuristic methods, speed doesn’t vary with the size of database.

a) True

b) False View Answer

Answer: b

Explanation: The speed is the time it takes to get results from database searches. Depending on the size of the database, speed sometimes can be a primary concern in the search methods.

7. An increase in sensitivity is associated with _ in selectivity.

a) no specific change

b) increase

c) decrease

d) exponential increase View Answer

Answer: c

Explanation: Ideally, one wants to have the greatest sensitivity, selectivity, and speed in database searches. However, satisfying all three requirements is difficult in reality. What generally happens is that an increase in sensitivity is associated with a decrease in selectivity. A very inclusive search tends to include many false positives. Similarly, an improvement in speed often comes at the cost of lowered sensitivity and selectivity. A compromise between the three criteria often has to be made.
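The trade-off between sensitivity and selectivity can be made concrete with a toy example. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), and uses a made-up list of database hits with scores and known relatedness.

```python
def evaluate(hits, threshold):
    """Return (sensitivity, specificity) when hits scoring >= threshold are reported."""
    tp = sum(1 for score, related in hits if score >= threshold and related)
    fn = sum(1 for score, related in hits if score < threshold and related)
    fp = sum(1 for score, related in hits if score >= threshold and not related)
    tn = sum(1 for score, related in hits if score < threshold and not related)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical (score, is_true_homolog) pairs for eight database sequences.
hits = [(95, True), (80, True), (60, True), (55, False),
        (40, True), (35, False), (20, False), (10, False)]

strict = evaluate(hits, threshold=70)
inclusive = evaluate(hits, threshold=30)
print(strict)     # (0.5, 1.0): misses homologs but reports no false positives
print(inclusive)  # (1.0, 0.5): finds every homolog but admits false positives
```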

8. Which of the following is incorrect?

a) Smith–Waterman algorithm is the fastest

b) Smith–Waterman algorithm is comparatively slower method

c) To speed up comparison, heuristic methods are used

d) Heuristic algorithms perform faster searches View Answer

Answer: a

Explanation: Searching a large database using the dynamic programming methods, such as the Smith–Waterman algorithm, although accurate and reliable, is too slow and impractical when computational resources are limited. To speed up the comparison, heuristic methods have to be used. The heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming.

9. Currently, there are two major heuristic algorithms for performing database searches: BLAST and FASTA.

a) True

b) False View Answer

Answer: a

Explanation: These methods are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster than dynamic programming. The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists. Both programs can provide a reasonably good indication of sequence similarity by identifying similar sequence segments.

10. Which of the following is incorrect about the ‘word’ method?

a) Both BLAST and FASTA use a heuristic word method

b) Word method is used for fast pairwise sequence alignment in BLAST and FASTA

c) The basic assumption is that two related sequences must have at least one word in common

d) The basic assumption is that two related sequences must have zero words in common View Answer

Answer: d

Explanation: This is the third method of pairwise sequence alignment. It works by finding short stretches of identical or nearly identical letters in two sequences. These short strings of characters are called words, which are similar to the windows used in the dot matrix method. The basic assumption is that two related sequences must have at least one word in common. By first identifying word matches, a longer alignment can be obtained by extending similarity regions from the words. Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.
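The core of the word method, finding identical words shared by two sequences, can be sketched with a lookup table. This is a simplified illustration restricted to exact matches, not the BLAST or FASTA code; the function names and example sequences are hypothetical.

```python
from collections import defaultdict

def word_index(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_words(a, b, k):
    """Return (position in a, position in b, word) for every identical word."""
    index = word_index(a, k)
    matches = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        for i in index.get(word, []):
            matches.append((i, j, word))
    return matches

matches = shared_words("ACGTACGA", "TTACGT", k=4)
print(matches)  # [(3, 1, 'TACG'), (0, 2, 'ACGT')]
```

Each reported pair of positions is a candidate starting point from which a longer alignment could be extended.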

Basic Local Alignment Search Tool (BLAST)

1. The BLAST program was developed in ____.

a) 1992

b) 1995

c) 1990

d) 1991

View Answer

Answer: c

Explanation: The BLAST program was developed by Stephen Altschul of NCBI in 1990 and has since become one of the most popular programs for sequence analysis. BLAST uses heuristics to align a query sequence with all sequences in a database.

2. In sequence alignment by BLAST, each word from the query sequence is typically ____ residues for protein sequences and ____ residues for DNA sequences.

a) ten, eleven

b) three, three

c) three, eleven

d) three, ten View Answer

Answer: c

Explanation: The first step is to create a list of words from the query sequence. Each word is typically three residues for protein sequences and eleven residues for DNA sequences. The list includes every possible word extracted from the query sequence. This step is also called seeding.
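Extracting every possible word from a query is a sliding-window operation. A minimal sketch, with a hypothetical helper name and a made-up six-residue protein query:

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

words = query_words("MKVLAT", word_size=3)
print(words)  # ['MKV', 'KVL', 'VLA', 'LAT']
```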

3. In sequence alignment by BLAST, the second step is to search a sequence database for the occurrence of these words.

a) True

b) False View Answer

Answer: a

Explanation: This step is to identify database sequences containing the matching words. The matching of the words is scored by a given substitution matrix. A word is considered a match if it is above a threshold.

4. The final step involves pairwise alignment by extending from the words in both directions while counting the alignment score using the same substitution matrix.

a) True

b) False View Answer

Answer: a

Explanation: The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA). The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP). In the original version of BLAST, the highest scored HSPs are presented as the final report. They are also called maximum scoring pairs.
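The ungapped extension step can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it assumes a match/mismatch score of +1/-1 instead of a substitution matrix, and a small hypothetical drop threshold of 2 rather than the values quoted above. Extension in each direction stops once the running score falls more than the drop threshold below the best score seen, and only the best-scoring part is kept.

```python
def extend(a, b, i, j, step, drop, match=1, mismatch=-1):
    """Extend from (i, j) in direction step; return (best score gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:
            break  # score dropped too far below the best seen so far
        i += step
        j += step
    return best, best_len

def ungapped_hsp(a, b, i, j, k, drop=2, match=1, mismatch=-1):
    """Extend the k-long word match at a[i:i+k] / b[j:j+k] in both directions."""
    seed = sum(match if a[i + t] == b[j + t] else mismatch for t in range(k))
    right, r_len = extend(a, b, i + k, j + k, 1, drop, match, mismatch)
    left, l_len = extend(a, b, i - 1, j - 1, -1, drop, match, mismatch)
    length = l_len + k + r_len
    start_a, start_b = i - l_len, j - l_len
    return seed + left + right, a[start_a:start_a + length], b[start_b:start_b + length]

# The word "CGT" matches at position 3 in both hypothetical sequences.
hsp = ungapped_hsp("GGACGTACCT", "TTACGTACAA", i=3, j=3, k=3)
print(hsp)  # (6, 'ACGTAC', 'ACGTAC')
```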

5. A recent improvement in the implementation of BLAST is the ability to provide gapped alignment.

a) True

b) False View Answer

Answer: a

Explanation: In gapped BLAST, the highest scored segment is chosen to be extended in both directions using dynamic programming where gaps may be introduced. The extension continues if the alignment score is above a certain threshold; otherwise it is terminated. However, the overall score is allowed to drop below the threshold only if it is temporary and rises again to attain above threshold values. Final trimming of terminal regions is needed before producing a report of the final alignment.

6. Which of the following is not a variant of BLAST?

a) BLASTN

b) BLASTP

c) BLASTX

d) TBLASTNX View Answer

Answer: d

Explanation: BLAST is a family of programs that includes BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX; TBLASTNX is not among them. BLASTN queries nucleotide sequences with a nucleotide sequence database. For protein-level comparisons, the alignment scoring is based by default on the BLOSUM62 matrix.

7. BLASTX uses protein sequences as queries to search against a protein sequence database.

a) True

b) False View Answer

Answer: b

Explanation: BLASTP, and not BLASTX, uses protein sequences as queries to search against a protein sequence database. BLASTX uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database.

8. TBLASTX queries protein sequences to a nucleotide sequence database with the sequences translated in all six reading frames.

a) True

b) False View Answer

Answer: b

Explanation: TBLASTN queries protein sequences to a nucleotide sequence database with the sequences translated in all six reading frames. TBLASTX uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all the sequences translated in six frames. In addition, there is also a bl2seq program that performs local alignment of two user-provided input sequences. The graphical output includes horizontal bars and a diagonal in a two-dimensional diagram showing the overall extent of matching between the two sequences.
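Translation in all six reading frames means three offsets on the forward strand and three on the reverse complement. The sketch below assumes the standard genetic code and upper-case A, C, G, T input only; the function names and the example sequence are hypothetical, and this is not how the BLAST programs themselves are implemented.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Frames 1-3 on the forward strand, then frames 1-3 on the reverse strand."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]

frames = six_frames("ATGGCCTAA")
print(frames)  # ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```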

9. Which of the following is not correct about BLAST?

a) The BLAST web server has been designed in such a way as to simplify the task of program selection.

b) The programs are organized based on the type of query sequences

c) The programs are organized based on the type of nucleotide sequences, or nucleotide sequence to be translated

d) BLAST is not based on heuristic searching methods View Answer

Answer: d

Explanation: BLAST and FASTA are based on heuristic searching methods. In addition, programs for special purposes are grouped separately; for example, bl2seq, immunoglobulin BLAST, and VecScreen, a program for removing contaminating vector sequences.

10. If one is looking for protein homologs encoded in newly sequenced genomes, one may use TBLASTN, which translates nucleotide database sequences in all six open reading frames.

a) True

b) False View Answer

Answer: a

Explanation: This may help to identify protein coding genes that have not yet been annotated. If a DNA sequence is to be used as the query, a protein-level comparison can be done with TBLASTX. However, both programs are very computationally intensive and the search process can be very slow.

Comparison of FASTA and BLAST

1. In FASTA, for a Z-score > 15, the match can be considered extremely ____ with ____ of a homologous relationship.

a) insignificant, uncertainty

b) significant, uncertainty

c) significant, certainty

d) insignificant, certainty View Answer

Answer: c

Explanation: If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs. If Z < 5, their relationship is described as less certain.

2. BLAST uses a ____ to find matching words, whereas FASTA identifies identical matching words using the ____.

a) substitution matrix, hashing procedure

b) substitution matrix, blocks

c) hashing procedure, substitution matrix

d) ktups, substitution matrix View Answer

Answer: a

Explanation: BLAST and FASTA have been shown to perform almost equally well in regular database searching; however, there are some notable differences between the two approaches. The major difference is in the seeding step: BLAST uses a substitution matrix to find matching words, whereas FASTA identifies identical matching words using the hashing procedure.

3. Which of the following is not a benefit or fact of FASTA over BLAST?

a) FASTA scans smaller window sizes

b) It gives more sensitive results

c) It gives less sensitive results

d) It gives results with a better coverage rate for homologs View Answer

Answer: c

Explanation: By default, FASTA scans smaller window sizes. Thus, it gives more sensitive results than BLAST, with a better coverage rate for homologs. However, it is usually slower than BLAST.

4. The use of low-complexity masking in the BLAST procedure means that it may have higher specificity than FASTA because potential false positives are reduced.

a) True

b) False View Answer

Answer: a

Explanation: In addition to the given statement, BLAST sometimes gives multiple best-scoring alignments from the same sequence. FASTA returns only one final alignment.

5. Which of the following is not a benefit of BLAST?

a) Handling of gaps

b) Speed

c) More sensitive

d) Statistical rigor View Answer

Answer: a

Explanation: In addition, the user-friendly interface of BLAST is also one of its benefits. However, the original BLAST does not handle gaps well; in that case, gapped BLAST is better.

6. BLAST might not find matches for very short sequences.

a) True

b) False View Answer

Answer: a

Explanation: In BLAST, similarity matching of words is involved. If no words are found similar, then no alignment is detected and hence it might not find matches for very short sequences.

7. BLAST often produces several short HSPs rather than a single aligned region.

a) True

b) False View Answer

Answer: a

Explanation: The results of the word matching and attempts to extend the alignment are segments. They are called HSPs (High-Scoring Segment Pairs). BLAST often produces several short HSPs rather than a single aligned region.

8. FASTA is derived from logic of the dot plot.

a) True

b) False View Answer

Answer: a

Explanation: Because of this, it computes best diagonals from all frames of alignment. The method looks for exact matches between words in query and test sequence.

9. The gapped portion in the diagonals represents matches in FASTA.

a) True

b) False View Answer

Answer: b

Explanation: The diagonal’s nature indicates the matching of the sequences. After all diagonals are found, it tries to join diagonals by adding gaps. Further, it computes alignments in regions of best diagonals.

10. The initiation of FASTA format has the ____ symbol.

a) >

b) <

c) /

d) *

View Answer

Answer: a

Explanation: The format is simple and is used by almost all programs. The header line has > at the beginning. There are no specific requirements for line length, characters, etc.
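Because the format is so simple, a reader for it fits in a few lines. The following is a minimal sketch (the function name and the two example records are hypothetical): a line starting with > begins a new record, and all following lines up to the next header are joined into one sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may have any length
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 hypothetical protein
MKVLAT
GGA
>seq2
ACGT
"""
records = parse_fasta(example)
print(records)  # [('seq1 hypothetical protein', 'MKVLATGGA'), ('seq2', 'ACGT')]
```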

Database Searching with the Smith – Waterman Method

1. The rigorous dynamic programming method is normally not used for database searching, because it is slow and computationally expensive.

a) True

b) False View Answer

Answer: a

Explanation: Heuristics such as BLAST and FASTA are developed for faster speed. However, the heuristic methods are limited in sensitivity and are not guaranteed to find the optimal alignment. They often fail to find alignment for distantly related sequences.

2. FASTA and BLAST are ____ but ____ for larger datasets.

a) faster, more sensitive

b) faster, less sensitive

c) slower, less sensitive

d) slower, more sensitive View Answer

Answer: b

Explanation: Empirical tests have indeed shown that the exhaustive method produces superior results over the heuristic methods like BLAST and FASTA. But heuristic methods are faster and more practical for assessing larger datasets, at the cost of comparatively lower sensitivity.

3. ScanPS is a web-based program that implements a modified version of the Needleman-Wunsch algorithm.

a) True

b) False View Answer

Answer: b

Explanation: ScanPS (Scan Protein Sequence) is a web-based program that implements a modified version of the Smith–Waterman algorithm optimized for parallel processing. The major feature is that the program allows iterative searching similar to PSI-BLAST, which builds profiles from one round of search results and uses them for the second round of database searching. Full dynamic programming is used in each cycle for added sensitivity.

4. ParAlign is a web-based server that uses parallel processors to perform exhaustive sequence comparisons using either a parallelized version of the Smith–Waterman algorithm or a heuristic program for further speed gains.

a) True

b) False View Answer

Answer: a

Explanation: The heuristic subprogram first finds exact ungapped alignments and uses them as anchors for extension into gapped alignments by combining the scores of several diagonals in the alignment matrix. The search speed of ParAlign approaches that of BLAST, but with higher sensitivity.

5. In the Needleman–Wunsch algorithm, in the initialization step, the ____ row and ____ column are subject to gap penalty.

a) first, first

b) first, second

c) second, First

d) first, last View Answer

Answer: a

Explanation: In the Smith–Waterman algorithm, the first row and first column are set to 0. In the Needleman–Wunsch algorithm, the first row and first column are subject to gap penalty.

6. Local sequence alignments are necessary for many cases out of which one is repeats.

a) True

b) False View Answer

Answer: a

Explanation: It can also be used for modular organization of genes and proteins (exons, domains, etc.) Also it can be used in cases of sequences diverged so that similarity was retained, or can be detected, just in some sub-regions.

7. In the SW algorithm, to align two sequences of lengths m and n, ____ time is required.

a) O(mn)

b) O(m²n)

c) O(m²n³)

d) O(mn²) View Answer

Answer: b

Explanation: The Smith–Waterman algorithm is quite demanding of time. In its original formulation, aligning two sequences of lengths m and n requires O(m²n) time. Later optimizations (such as Gotoh's affine gap-penalty scheme) reduced this to O(mn) calculation steps.

8. One of the challenges in SWA is obtaining correct alignments in regions of low similarity between distantly related biological sequences.

a) True

b) False View Answer

Answer: a

Explanation: It is because mutations have added too much ‘noise’ over evolutionary time to allow for a meaningful comparison of those regions. Local alignment avoids such regions altogether and focuses on those with a positive score, i.e. those with an evolutionarily conserved signal of similarity.

9. Score can be negative in Smith–Waterman algorithm.

a) True

b) False View Answer

Answer: b

Explanation: In the Smith–Waterman algorithm, any negative score is set to 0. In the Needleman–Wunsch algorithm, the score can be negative. Also, in the Smith–Waterman algorithm, the traceback step begins with the highest score and ends when 0 is encountered.
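The three properties mentioned for Smith–Waterman (first row and column set to 0, negative scores replaced by 0, traceback from the highest score until 0 is reached) can be seen in a compact sketch. It assumes a linear gap penalty and simple match/mismatch scores (+2, -1, -2), which are illustrative choices rather than recommended parameters, and it runs in O(mn) time for this simplified scoring.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative values are replaced by 0, so no cell is ever negative.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback: start at the highest score and stop when a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

result = smith_waterman("TTACGTAA", "GGACGTCC")
print(result)  # (8, 'ACGT', 'ACGT')
```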

10. The function of the scoring matrix is to conduct one-to-one comparisons between all components in two sequences and record the optimal alignment results.

a) True

b) False View Answer

Answer: a

Explanation: The scoring process reflects the concept of dynamic programming. The final optimal alignment is found by iteratively expanding the growing optimal alignment.

Questions & Answers on Structural Bioinformatics

Protein Structure Basics

1. The building blocks of proteins are ____ naturally occurring amino acids, small molecules that contain a free amino group (NH2) and a free carboxyl group (COOH).

a) ten

b) twenty

c) nine

d) nineteen View Answer

Answer: b

Explanation: Both of these groups are linked to a central carbon (Cα), which is attached to hydrogen and a side chain group (R). Amino acids differ only by the side chain R group. The chemical reactivities of the R groups determine the specific properties of the amino acids. Amino acids can be grouped into several categories based on the chemical and physical properties of the side chains, such as size and affinity for water.

2. Within the hydrophobic set of amino acids, they can be further divided into aliphatic and aromatic.

a) True

b) False View Answer

Answer: a

Explanation: Aliphatic side chains are linear hydrocarbon chains and aromatic side chains are cyclic rings. Within the hydrophilic set, amino acids can be subdivided into polar and charged. Charged amino acids can be either positively charged (basic) or negatively charged (acidic).

3. ____, the smallest amino acid, has a hydrogen atom as the R group.

a) valine

b) proline

c) Glycine

d) threonine View Answer

Answer: c

Explanation: Of particular interest within the twenty amino acids are glycine and proline. Glycine, the smallest amino acid, has a hydrogen atom as the R group. It can therefore adopt more flexible conformations that are not possible for other amino acids. Proline is on the other extreme of flexibility. Its side chain forms a bond with its own backbone amino group, causing it to be cyclic. The cyclic conformation makes it very rigid, unable to occupy many of the main chain conformations adopted by other amino acids.

4. The peptide formation involves two amino acids covalently joined together between the carboxyl group of one amino acid and the amino group of another.

a) True

b) False View Answer

Answer: a

Explanation: This reaction is a condensation reaction involving removal of elements of water from the two molecules. The resulting product is called a dipeptide. The newly formed covalent bond connecting the two amino acids is called a peptide bond. Once an amino acid is incorporated into a peptide, it becomes an amino acid residue. Multiple amino acids can be joined together to form a longer chain of amino acid polymer.

5. A linear polymer of more than fifty amino acid residues is referred to as a ____.

a) dipeptide

b) oligopeptide

c) peptide

d) polypeptide View Answer

Answer: d

Explanation: A polypeptide, also called a protein, has a well-defined three-dimensional arrangement. On the other hand, a polymer with fewer than fifty residues is usually called a peptide without a well-defined three-dimensional structure. The residues in a peptide or polypeptide are numbered beginning with the residue containing the amino group, referred to as the N-terminus, and ending with the residue containing the carboxyl group, known as the C-terminus.

6. Which of the following is not correct?

a) The rigid double bond structure forces atoms associated with the peptide bond to lie in the same plane, called the dipeptide plane

b) A peptide bond is actually a partial double bond owing to shared electrons between O=C–N atoms

c) Because of the planar nature of the peptide bond and the size of the R groups, there are considerable restrictions on the rotational freedom by the two bonded pairs of atoms around the peptide bond

d) The angle of rotation about the bond is referred to as the dihedral angle (also called the torsional angle)


Answer: a

Explanation: The rigid double bond structure forces atoms associated with the peptide bond to lie in the same plane, called the peptide plane. For a peptide unit, the atoms linked to the peptide bond can be moved to a certain extent by the rotation of two bonds flanking the peptide bond.

7. Which of the following is not correct about the stabilizing forces?

a) Protein structures from secondary to quaternary are maintained by noncovalent forces

b) They include electrostatic interactions but not van der Waals forces and hydrogen bonding

c) Electrostatic interactions are a significant stabilizing force in a protein structure

d) Electrostatic interactions occur when excess negative charges in one region are neutralized by positive charges in another region


Answer: b

Explanation: The stabilizing forces include electrostatic interactions, van der Waals forces, and hydrogen bonding. Electrostatic interactions occur when excess negative charges in one region are neutralized by positive charges in another region; the result is the formation of salt bridges between oppositely charged residues. Electrostatic interactions can function within a relatively long range (15 Å). Hydrogen bonds are a particular type of electrostatic interaction, similar to dipole–dipole interactions, involving hydrogen from one residue and oxygen from another. Hydrogen bonds can occur between main chain atoms as well as side chain atoms.

8. Which of the following is not correct about the α-Helices?

a) An α-helix has a main chain backbone conformation that resembles a corkscrew

b) Nearly all known α-helices are right handed, exhibiting a leftward spiral form

c) Nearly all known α-helices are right handed, exhibiting a rightward spiral form

d) In a right-handed helix, there are 3.6 amino acids per helical turn

Answer: b

Explanation: The structure is stabilized by hydrogen bonds formed between the main chain atoms of residues i and i + 4. The hydrogen bonds are nearly parallel with the helical axis. The average φ and ψ angles are −60° and −45°, respectively, and are distributed in a narrowly defined region in the lower left region of a Ramachandran plot.

9. Which of the following is not correct about the β-sheet?

a) A β-sheet is a fully extended configuration built up from several spatially adjacent regions of a polypeptide chain

b) Each region involved in forming the β-sheet is a β-strand

c) The β-strand conformation is pleated with main chain backbone zigzagging and side chains positioned on same sides of the sheet

d) β-Strands are stabilized by hydrogen bonds between residues of adjacent strands

Answer: c

Explanation: The β-strand conformation is pleated with main chain backbone zigzagging and side chains positioned alternately on opposite sides of the sheet. β-strands near the surface of the protein tend to show an alternating pattern of hydrophobic and hydrophilic regions, whereas strands buried at the core of a protein are nearly all hydrophobic. The β-strands can run in the same direction to form a parallel sheet or can run every other chain in reverse orientation to form an antiparallel sheet, or a mixture of both.

10. Which of the following is not correct about the Coils and Loops?

a) They are regular structures

b) They are irregular structures

c) The loops are often characterized by sharp turns or hairpin-like structures

d) If the connecting regions are completely irregular, they belong to random coils

Answer: a

Explanation: Residues in the loop or coil regions tend to be charged and polar and located on the surface of the protein structure. They are often the evolutionarily variable regions where mutations, deletions, and insertions frequently occur. They can be functionally significant because these locations are often the active sites of proteins.

11. Globular proteins are usually insoluble.

a) True

b) False

Answer: b

Explanation: Globular proteins are usually soluble and surrounded by water molecules. They tend to have an overall compact structure of spherical shape with polar or hydrophilic residues on the surface and hydrophobic residues in the core. Such an arrangement is energetically favorable because it minimizes contacts with water by hydrophobic residues in the core and maximizes interactions with water by surface polar and charged residues. Common examples of globular proteins are enzymes, myoglobins, cytokines, and protein hormones.

12. Which of the following is not correct about the Integral Membrane Proteins?

a) Membrane proteins exist in lipid bilayers of cell membranes

b) The exterior of the proteins spanning the membrane must be very hydrophobic to be stable

c) The exterior of the proteins spanning the membrane must be very hydrophilic to be stable

d) Most typical transmembrane segments are α-helices

Answer: c

Explanation: Because they are surrounded by lipids, the exterior of the proteins spanning the membrane must be very hydrophobic to be stable. Occasionally, in some bacterial periplasmic membrane proteins, the transmembrane segments are composed of β-strands. The loops connecting these segments sometimes lie in the aqueous phase, in which case they can be entirely hydrophilic.

13. Which of the following is not correct about the X-ray Crystallography?

a) In x-ray protein crystallography, proteins need to be grown into large crystals in which their positions are fixed in a repeated, ordered fashion

b) The protein crystals are illuminated with an intense x-ray beam

c) The x-rays are deflected by the electron clouds surrounding the atoms in the crystal producing a regular pattern of diffraction

d) The protein crystals are illuminated with an intense infrared beam

Answer: d

Explanation: The diffraction pattern is composed of thousands of tiny spots recorded on an x-ray film. The diffraction pattern can be converted into an electron density map using a mathematical procedure known as Fourier transform. Interpreting a three-dimensional structure from two-dimensional electron density maps requires solving the phases in the diffraction data.

14. Which of the following is not correct about the NMR?

a) It stands for Nuclear Magnetic Resonance

b) NMR spectroscopy detects spinning patterns of atomic nuclei in an electric field

c) NMR spectroscopy detects spinning patterns of atomic nuclei in a magnetic field

d) Protein samples are labeled with stable isotopes such as 13C and 15N

Answer: b

Explanation: Radiofrequency radiation is used to induce transitions between nuclear spin states in a magnetic field. Interactions between spinning isotope pairs produce radio signal peaks that correlate with the distances between them. By interpreting the signals observed using NMR, proximity between atoms can be determined.

15. One can search a structure in PDB using the four-letter code or keywords related to its annotation.

a) True

b) False

Answer: a

Explanation: Each entry is given a unique code, PDB id, consisting of four characters of either letters A to Z or digits 0 to 9 such as 1LYZ and 4RCR. The identified structure can be viewed directly online or downloaded to a local computer for analysis.
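The four-character format described above can be checked with a short pattern match. The following is a minimal sketch in Python; the function name looks_like_pdb_id is hypothetical, and it encodes only the rule stated here (four characters, each a letter A to Z or a digit 0 to 9), not any further convention the PDB itself may apply.

```python
import re

# Four characters, each an uppercase letter or a digit, per the description above.
_PDB_ID_PATTERN = re.compile(r"[A-Z0-9]{4}")


def looks_like_pdb_id(code: str) -> bool:
    """Return True if `code` matches the four-character format (case-insensitive)."""
    return _PDB_ID_PATTERN.fullmatch(code.strip().upper()) is not None


for candidate in ["1LYZ", "4rcr", "LYZ", "1LYZ5"]:
    print(candidate, looks_like_pdb_id(candidate))
```

A check like this only filters out malformed input before a search; it says nothing about whether an entry with that code actually exists.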

Protein Structural Visualization

1. The main feature of computer visualization programs is interactivity, which allows users to visually manipulate the structural images through a graphical user interface.

a) True

b) False

Answer: a

Explanation: At the touch of a mouse button, a user can move, rotate, and zoom an atomic model on a computer screen in real time, or examine any portion of the structure in great detail, as well as draw it in various forms in different colors. Further manipulations can include changing the conformation of a structure by protein modeling or matching a ligand to an enzyme active site through docking exercises.

2. A Protein Data Bank (PDB) data file for a protein structure contains only x and z coordinates of atoms.

a) True

b) False

Answer: b

Explanation: Because a Protein Data Bank (PDB) data file for a protein structure contains only x, y, and z coordinates of atoms, the most basic requirement for a visualization program is to build connectivity between atoms to make a view of a molecule. The visualization program should also be able to produce molecular structures in different styles, which include wire frames, balls and sticks, space-filling spheres, and ribbons.
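The step of building connectivity from bare coordinates can be sketched as follows. This is a simplified illustration, not the method of any particular viewer: it reads x, y, and z from the fixed columns of ATOM/HETATM records and treats any two atoms closer than a cutoff as bonded. The 1.9 Å cutoff, the helper make_atom_line, and the toy atoms are assumptions made for the example.

```python
import math


def parse_atom_coordinates(pdb_lines):
    """Extract (x, y, z) tuples from ATOM/HETATM records using fixed columns."""
    coords = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Columns 31-38, 39-46 and 47-54 hold x, y and z in the PDB layout.
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def infer_bonds(coords, max_bond_length=1.9):
    """Pair up atoms closer than `max_bond_length` angstroms (illustrative cutoff)."""
    bonds = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= max_bond_length:
                bonds.append((i, j))
    return bonds


def make_atom_line(serial, name, x, y, z):
    """Build a toy ATOM record with the coordinates in the expected columns."""
    return f"ATOM  {serial:5d} {name:<4s} ALA A   1    {x:8.3f}{y:8.3f}{z:8.3f}"


# Hypothetical three-atom backbone fragment (N, CA, C); not taken from a real entry.
toy_lines = [
    make_atom_line(1, "N", 0.0, 0.0, 0.0),
    make_atom_line(2, "CA", 1.46, 0.0, 0.0),
    make_atom_line(3, "C", 2.0, 1.4, 0.0),
]
print(infer_bonds(parse_atom_coordinates(toy_lines)))
```

A single distance cutoff is the crudest possible rule; a real program would more likely use element-specific radii or residue templates.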

3. A wire-frame diagram is a line drawing representing bonds between atoms.

a) True

b) False

Answer: a

Explanation: The wire frame is the simplest form of model representation. It is useful for localizing positions of specific residues in a protein structure, or for displaying a skeletal form of a structure when Cα atoms of each residue are connected.

4. Balls and sticks are solid spheres and rods, representing atoms and bonds, respectively.

a) True

b) False

Answer: a

Explanation: These diagrams can also be used to represent the backbone of a structure. In a space-filling representation, each atom is described using large solid spheres with radii corresponding to the van der Waals radii of the atoms.

5. Ribbon diagrams use cylinders or spiral ribbons to represent α-helices and broad, flat arrows to represent β-strands.

a) True

b) False

Answer: a

Explanation: This type of representation is very attractive in that it allows easy identification of secondary structure elements and gives a clear view of the overall topology of the structure. The resulting images are also visually appealing.

6. Different representation styles can be used in combination to highlight a certain feature of a structure while deemphasizing the structures surrounding it.

a) True

b) False

Answer: a

Explanation: For example, a cofactor of an enzyme can be shown as space-filling spheres while the rest of the protein structure is shown as wire frames or ribbons. Several widely used, freely available software programs exist for molecular graphics.

7. Which of the following is wrong about Swiss-PDB Viewer?

a) It is a structure viewer for multiple platforms

b) It is a structure viewer for single platforms

c) It is essentially a Swiss-Army knife for structure visualization and modeling

d) It is capable of structure visualization, analysis, and homology modeling

Answer: b

Explanation: It is essentially a Swiss-Army knife for structure visualization and modeling because it incorporates so many functions in a small shareware program. It allows display of multiple structures at the same time in different styles, by charge distribution, or by surface accessibility. It can measure distances, angles, and even mutate residues. In addition, it can calculate molecular surface, electrostatic potential, Ramachandran plot, and so on. The homology modeling part includes energy minimization and loop modeling.

8. Which of the following is an incorrect statement?

a) WebMol is a web-based program built based on a modified RasMol code and thus shares many similarities with RasMol

b) WebMol is a web-based program that is totally different from RasMol

c) Chime is a plug-in for web browsers

d) Chime is not a standalone program and has to be invoked in a web browser

Answer: b

Explanation: Chime is also derived from RasMol and allows interactive display of graphics of protein structures inside a web browser. WebMol runs directly in a browser of any type as an applet and is able to display simple line-drawing models of protein structures. It also has a feature for interactively displaying Ramachandran plots for structure model evaluation.

9. Which of the following is an incorrect statement?

a) Molscript is a UNIX program capable of generating wire-frame styles

b) Molscript is capable of generating space-filling styles

c) Molscript is not capable of generating ball-and-stick styles

d) In particular, secondary structure elements can be drawn with solid spirals and arrows representing α-helices and β-strands, respectively


Answer: c

Explanation: Visually appealing images can be generated that are of publication quality. The drawback is that the program is command-line–based and not very user friendly. A modified UNIX program called Bobscript is available with enhanced features.

10. Ribbons, another UNIX program similar to Molscript, generates ribbon diagrams depicting protein secondary structures.

a) True

b) False

Answer: a

Explanation: Aesthetically appealing images can be produced that are of publication quality. However, the program, which is also command-line-based, is extremely difficult to use.

Protein Structure Comparison

1. Which of the following is incorrect about protein structure comparison?

a) The comparative approach is important in finding remote protein homologs

b) Protein structures have a much higher degree of conservation than the sequences

c) Protein structures have a much lesser degree of conservation than the sequences

d) Proteins can share common structures even without sequence similarity

Answer: c

Explanation: Structure comparison is one of the fundamental techniques in protein structure analysis. Structure comparison can often reveal distant evolutionary relationships between proteins, which is not feasible using the sequence-based alignment approach alone. In addition, protein structure comparison is a prerequisite for protein structural classification into different fold classes.

2. The intermolecular approach is normally applied to relatively ______ structures.

a) distinctive

b) dissimilar

c) similar

d) different

Answer: c

Explanation: The algorithmic approaches to comparing protein geometric properties can be divided into three categories: the first superposes protein structures by minimizing intermolecular distances; the second relies on measuring intramolecular distances of a structure; and the third includes algorithms that combine both intermolecular and intramolecular approaches.

3. Which of the following is incorrect about intermolecular approach?

a) This procedure starts with identifying equivalent residues or atoms

b) After residue–residue correspondence is established, one of the structures is moved laterally and vertically toward the other structure to allow the two structures to be in the same location

c) The structures are rotated relative to each other around the three-dimensional axes

d) The rotation doesn’t depend on the intermolecular distance

Answer: d

Explanation: The rotation continues until the shortest intermolecular distance is reached. At this point, an optimal superimposition of the two structures is reached. After superimposition, equivalent residue pairs can be identified, which helps to quantitate the fitting between the two structures.

4. The root mean square deviation (RMSD), without the size dependency correction is


Answer: a

Explanation: An important measurement of the structure fit during superposition is the distance between equivalent positions on the protein structures. This requires using a least square-fitting function called root mean square deviation (RMSD), which is the square root of the averaged sum of the squared differences of the atomic distances. Here D is the distance between coordinate data points and N is the total number of corresponding residue pairs.
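Written out, the definition above is RMSD = sqrt((D1^2 + D2^2 + ... + DN^2) / N). A minimal sketch of that calculation in Python follows; it assumes the two structures have already been superimposed and that the coordinate lists are in equivalent-residue order. The function name and the toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation over N equivalent atom pairs.

    D_i is the distance between the i-th pair; RMSD = sqrt(sum(D_i^2) / N).
    No superposition is performed here; the inputs are used as given.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: every equivalent pair is exactly 1 unit apart.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(rmsd(structure_a, structure_b))
```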

5. The root mean square deviation (RMSD), with the size dependency correction is


Answer: d

Explanation: In practice, only the distances between Cα carbons of corresponding residues are measured. The goal of structural comparison is to achieve a minimum RMSD. However, the problem with RMSD is that it depends on the size of the proteins being compared. For the same degree of sequence identity, large proteins tend to have higher RMSD values than small proteins when an optimal alignment is reached. Recently, a logarithmic factor has been proposed to correct this size-dependency problem. This new measure is called RMSD100.
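The size correction can be illustrated with a short sketch. The formula used here, RMSD100 = RMSD / (1 + ln(sqrt(N / 100))), which is the same as RMSD / (−1.3 + 0.5 ln N), is an assumption on our part, since the explanation above names the measure without giving its form. N is the number of corresponding residue pairs.

```python
import math


def rmsd100(rmsd_value, n_pairs):
    """Size-normalized RMSD: RMSD / (1 + ln(sqrt(N / 100))).

    The formula is an assumption (it is not given in the text above). The
    denominator is only positive for N above roughly 14 residue pairs.
    """
    denominator = 1.0 + math.log(math.sqrt(n_pairs / 100.0))
    if denominator <= 0:
        raise ValueError("n_pairs is too small for this correction")
    return rmsd_value / denominator


for n in (50, 100, 400):
    print(n, round(rmsd100(2.0, n), 3))
```

With this form, a 100-residue comparison is left unchanged, a larger protein has its RMSD scaled down, and a smaller one has it scaled up, which is the behaviour the correction is meant to provide.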

6. The intramolecular approach does not depend on sequence similarity between the proteins to be compared.

a) True

b) False

Answer: a

Explanation: The intramolecular approach relies on structural internal statistics and therefore does not depend on sequence similarity between the proteins to be compared. In addition, this method does not generate a physical superposition of structures, but instead provides a quantitative evaluation of the structural similarity between corresponding residue pairs.

7. Which of the following is incorrect about the intramolecular approach?

a) The method works by generating a distance matrix between residues of the same protein

b) It generates a string between residues of the same protein

c) In comparing two protein structures, the distance matrices from the two structures are moved relative to each other to achieve maximum overlaps

d) By overlaying two distance matrices, similar intramolecular distance patterns representing similar structure folding regions can be identified


Answer: b

Explanation: For ease of comparison, each matrix is decomposed into smaller submatrices consisting of hexapeptide fragments. To maximize the similarity regions between two structures, a Monte Carlo procedure is used. By reducing three-dimensional information into two-dimensional information, this strategy identifies overall structural resemblances and common structure cores.
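The first two steps of the intramolecular approach, building a distance matrix for one structure and cutting it into hexapeptide submatrices, can be sketched as below. The Monte Carlo optimization is not shown. The function names and the toy coordinates (eight points spaced 3.8 Å apart on a straight line) are assumptions made for illustration.

```python
import math


def distance_matrix(ca_coords):
    """All-against-all distances between Cα atoms of a single structure."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)] for i in range(n)]


def submatrix(matrix, start_i, start_j, size=6):
    """Cut a size x size block (hexapeptide against hexapeptide by default)."""
    return [row[start_j:start_j + size] for row in matrix[start_i:start_i + size]]


# Hypothetical Cα trace: eight residues on a straight line, 3.8 Å apart.
toy_ca = [(3.8 * i, 0.0, 0.0) for i in range(8)]
matrix = distance_matrix(toy_ca)
block = submatrix(matrix, 0, 2)
print(len(block), len(block[0]), round(matrix[0][7], 1))
```

Because the matrix holds only distances within one protein, it is independent of how the structure is positioned or rotated, which is why no physical superposition is needed.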

8. Which of the following is incorrect about Multiple Structure Alignment?

a) The alignment strategy is different than the Clustal sequence alignment using a progressive approach

b) All structures are first compared in a pairwise fashion

c) A distance matrix is developed based on structure similarity scores such as RMSD

d) The aligned structures create a median structure that allows other structures to be progressively added for comparison based on the hierarchy described in the guide tree

Answer: a

Explanation: In addition to pairwise alignment, a number of algorithms can also perform multiple structure alignment. The alignment strategy is similar to the Clustal sequence alignment using a progressive approach. When all the structures in the set are added, this eventually creates a multiple structure alignment.

9. Which of the following is incorrect about SSAP?

a) It is a web server that uses an intramolecular distance–based method

b) Matrices are built based on the Cβ distances of all residue pairs

c) Dynamic programming approach is not used here

d) Dynamic programming approach is used

Answer: c

Explanation: When comparing two different matrices, a dynamic programming approach is used to find the path of residue positions with optimal scores. The dynamic programming is applied at two levels, one at a lower level in which all residue pairs between the proteins are compared and another at an upper level in which subsequently identified equivalent residue pairs are processed to refine the matching positions. An SSAP score is reported for structural similarity. A score above 70 indicates a good structural similarity.

10. VAST is a web server that performs alignment using intramolecular approaches only

a) True

b) False

Answer: b

Explanation: VAST (Vector Alignment Search Tool) is a web server that performs alignment using both the inter- and intramolecular approaches. The superposition is based on information of directionality of secondary structural elements (represented as vectors). Optimal alignment between two structures is defined by the highest degree of vector matches.

Protein Structure Classification

1. The classification results from both systems, SCOP and CATH are quite dissimilar.

a) True

b) False

Answer: b

Explanation: Due to the differences in classification criteria, one might expect that there would be huge differences in classification results. In fact, the classification results from both systems are quite similar. Exhaustive analysis has shown that the results from the two systems agree about 80% of the time. In other words, only about 20% of the structure fold assignments are different.

2. The first step in structure classification is to remove redundancy from databases.

a) True

b) False

Answer: a

Explanation: Among the tens of thousands of entries in PDB, the majority of the structures are redundant as they correspond to structures solved at different resolutions, or associated with different ligands or with single-residue mutations. The redundancy can be removed by selecting representatives through a sequence alignment–based approach.

3. The second step in structure classification is to separate structurally distinct domains within a structure.

a) True

b) False

Answer: a

Explanation: Because some proteins are composed of multiple domains, they must be subdivided before a sensible structural comparison can be carried out. This domain identification and separation can be done either manually or based on special algorithms for domain recognition.

4. The last step in structure classification involves grouping proteins/domains of similar structures.

a) True

b) False

Answer: a

Explanation: Once multidomain proteins are split into separate domains, structure comparison can be conducted at the domain level, either through manual inspection, or automated structural alignment, or a combination of both. This step involves grouping proteins/domains of similar structures and clustering them based on different levels of resemblance in secondary structure composition and arrangement of the secondary structures in space.

5. Which of the following is untrue about SCOP?

a) It is a database for comparing and classifying protein structures

b) It is constructed almost entirely based on manual examination of protein structures

c) The proteins are grouped into hierarchies of classes, folds, superfamilies, and families

d) The SCOP families consist of proteins having low sequence identity (>30%)

Answer: d

Explanation: The SCOP families consist of proteins having high sequence identity (>30%). Thus, the proteins within a family clearly share close evolutionary relationships and normally have the same functionality. The protein structures at this level are also extremely similar.

6. Members within the ______ fold ______ have evolutionary relationships.

a) same, always

b) same, do not always

c) one, always

d) different, do not

Answer: b

Explanation: Folds consist of superfamilies with a common core structure, which is determined manually. This level describes similar overall secondary structures with similar orientation and connectivity between them. Members within the same fold do not always have evolutionary relationships. Some of the shared core structure may be a result of analogy. Classes consist of folds with similar core structures.

7. In CATH, structural domain separation is carried out by

a) manual comparison only

b) computer programs only

c) human expertise only

d) a combined effort of a human expert and computer programs

Answer: d

Explanation: CATH classifies proteins based on the automatic structural alignment program SSAP as well as manual comparison. Structural domain separation is also carried out as a combined effort of a human expert and computer programs. Individual domain structures are classified at five major levels: class, architecture, fold/topology, homologous superfamily, and homologous family.

8. Which of the following is untrue about SCOP and CATH?

a) The definition for class in CATH is quite dissimilar to that in SCOP

b) The definition for class in CATH is based on secondary structure content

c) Architecture is a unique level in CATH, intermediate between fold and class

d) The definition for class in CATH is similar to that in SCOP

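Answer: a

Explanation: The definition for class in CATH is similar to that in SCOP and is based on secondary structure content, so the statement that the two definitions are quite dissimilar is the untrue one. Architecture is a level unique to CATH, intermediate between fold and class. The topology level is equivalent to the fold level in SCOP, which describes overall orientation of secondary structures and takes into account the sequence connectivity between the secondary structure elements. The homologous superfamily and homologous family levels are equivalent to the superfamily and family levels in SCOP with similar evolutionary definitions, respectively.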

9. SCOP is based on manual comparison of structures by human experts with no quantitative criteria to group proteins.

a) entirely

b) almost entirely

c) not

d) partially

Answer: b

Explanation: It is argued that this approach offers some flexibility in recognizing distant structural relatives, because human brains may be more adept at recognizing slightly dissimilar structures that essentially have the same architecture. However, this reliance on human expertise also renders the method subjective. The exact boundaries between levels and groups are sometimes arbitrary.

10. CATH is a combination of manual curation and automated procedure, which makes the process less subjective

a) True

b) False

Answer: a

Explanation: For example, in defining domains, CATH first relies on the consensus of three different algorithms to recognize domains. When the computer programs disagree, human intervention will take place. In addition, the extra Architecture level in CATH makes the structure classification more continuous. The drawback of the system is that the fixed thresholds in structural comparison may make assignment less accurate.

6. Questions on Secondary Structure Prediction

Protein Secondary Structure Prediction for Globular Proteins

1. The formation of ______ is determined by ______ interactions, whereas the formation of ______ is strongly influenced by ______ interactions.

a) α-helices, long-range, α-helices, short-range

b) α-helices, long-range, β-strands, short-range

c) α-helices, short-range, β-strands, long-range

d) β-strands, short-range, β-strands, long-range

Answer: c

Explanation: Protein secondary structure prediction with high accuracy is not a trivial task. It remained a very difficult problem for decades. This is because protein secondary structure elements are context dependent. The formation of α-helices is determined by short-range interactions, whereas the formation of β-strands is strongly influenced by long-range interactions. Prediction for long-range interactions is theoretically difficult. After more than three decades of effort, prediction accuracies have only been improved from about 50% to about 75%.

2. The secondary structure prediction methods can be either ab initio based, which make use of single sequence information only, or homology based, which make use of multiple sequence alignment information.

a) True

b) False

Answer: a

Explanation: The ab initio methods, which belong to early generation methods, predict secondary structures based on statistical calculations of the residues of a single query sequence. The homology-based methods do not rely on statistics of residues of a single sequence, but on common secondary structural patterns conserved among multiple homologous sequences.

3. Which of the following is untrue regarding Ab Initio–Based Methods?

a) This type of method predicts the secondary structure based on a single query sequence

b) This type of method predicts the secondary structure based on a multiple query sequence

c) It measures the relative propensity of each amino acid belonging to a certain secondary structure element

d) The propensity scores are derived from known crystal structures

Answer: b

Explanation: Examples of ab initio prediction are the Chou–Fasman and Garnier, Osguthorpe, Robson (GOR) methods. The ab initio methods were developed in the 1970s when protein structural data were very limited. The statistics derived from the limited data sets can therefore be rather inaccurate. However, the methods are simple enough that they are often used to illustrate the basics of secondary structure prediction.

4. The Chou–Fasman algorithm determines the propensity or intrinsic tendency of each residue to be in the helix, strand, and β-turn conformation using observed frequencies found in protein crystal structures.

a) True

b) False

Answer: a

Explanation: It determines the propensity or intrinsic tendency of each residue using observed frequencies found in protein crystal structures (conformational values for coils are not considered). For example, it is known that alanine, glutamic acid, and methionine are commonly found in α-helices, whereas glycine and proline are much less likely to be found in such structures.
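The idea of a propensity derived from observed frequencies can be sketched as follows. This is an illustration of the general calculation, not the published Chou–Fasman parameters: the propensity of a residue for a state is taken as the fraction of that residue found in the state divided by the fraction of all residues found in the state. The toy sequence and state string are invented for the example and carry no statistical weight.

```python
from collections import Counter


def propensities(sequence, states, target_state):
    """Propensity of each residue type for `target_state`.

    P = (fraction of that residue found in the state) /
        (fraction of all residues found in the state)
    Values above 1 mean the residue favors the state in this data set.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and state string must have equal length")
    overall_fraction = states.count(target_state) / len(states)
    if overall_fraction == 0:
        raise ValueError("target state does not occur in the data")
    total_by_residue = Counter(sequence)
    in_state_by_residue = Counter(
        residue for residue, state in zip(sequence, states) if state == target_state
    )
    return {
        residue: (in_state_by_residue[residue] / count) / overall_fraction
        for residue, count in total_by_residue.items()
    }


# Invented data: H marks helix positions, C marks everything else.
toy_sequence = "AAEAGPGAAE"
toy_states = "HHHHCCCHHH"
for residue, value in sorted(propensities(toy_sequence, toy_states, "H").items()):
    print(residue, round(value, 2))
```

In this toy set alanine and glutamic acid score above 1 for helix while glycine and proline score 0, mirroring the qualitative trend described above.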

5. The GOR method is based on the “propensity” of each residue to be in one of the two conformational states, helix (H), strand(E).

a) True

b) False

Answer: b

Explanation: The GOR method is based on the “propensity” of each residue to be in one of the four conformational states, helix (H), strand (E), turn (T), and coil (C). However, instead of using the propensity value from a single residue to predict a conformational state, it takes short-range interactions of neighboring residues into account.

6. Which of the following is untrue regarding Chou–Fasman and GOR methods?

a) Both are the first-generation methods

b) They were developed in the 1970s

c) They suffer from the fact that the prediction rules are somewhat arbitrary

d) They are based on single sequence statistics with clear relation to known protein-folding theories


Answer: d

Explanation: They are based on single sequence statistics without clear relation to known protein-folding theories. The predictions solely rely on local sequence information and fail to take into account long range interactions. A Chou-Fasman–based prediction does not even consider the short-range environmental information.

7. Which of the following is untrue regarding Homology-Based Methods?

a) The third generation of algorithms was developed in the late 1990s

b) They were developed by making use of evolutionary information

c) This type of method uses the ab initio secondary structure prediction of individual sequences only

d) This type of method combines the ab initio secondary structure prediction of individual sequences and alignment information from multiple homologous sequences (>35% identity)


Answer: c

Explanation: The idea behind this approach is that close protein homologs should adopt the same secondary and tertiary structure. When each individual sequence is predicted for secondary structure using a method similar to the GOR method, errors and variations may occur. However, evolutionary conservation dictates that there should be no major variations for their secondary structure elements.

8. Because residues in the same aligned position are assumed to have the same secondary structure, any inconsistencies or errors in prediction of individual sequences can be corrected using a majority rule.

a) True

b) False

Answer: a

Explanation: By aligning multiple sequences, information of positional conservation is revealed. This homology-based method has helped improve the prediction accuracy by another 10% over the second-generation methods.
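The majority rule can be sketched in a few lines. The example assumes each homologous sequence has already been given a per-residue prediction (H, E, or C) and that the prediction strings have been padded with "-" to follow the alignment; the function name and the three toy predictions are hypothetical.

```python
from collections import Counter


def consensus_by_majority(aligned_predictions):
    """Column-wise majority vote over equal-length predicted state strings."""
    if len({len(p) for p in aligned_predictions}) != 1:
        raise ValueError("all predictions must have the same aligned length")
    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(state for state in column if state != "-")  # ignore gaps
        # On a tie, most_common keeps the state that was seen first in the column.
        consensus.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(consensus)


# Invented predictions for three aligned homologs.
predictions = [
    "HHHHCCEEEE",
    "HHHCCCEEEE",
    "HHHHCCEECE",
]
print(consensus_by_majority(predictions))
```

The isolated disagreements in the second and third predictions are outvoted, which is the error-correcting effect described above.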

9. Which of the following is untrue regarding Prediction with Neural Networks?

a) The third-generation prediction algorithms extensively apply sophisticated neural networks

b) It is used to analyze substitution patterns in multiple sequence alignments

c) It is not a machine learning process

d) It requires a structure of multiple layers of interconnected variables or nodes

Answer: c

Explanation: A neural network is a machine learning process that requires a structure of multiple layers of interconnected variables or nodes. In secondary structure prediction, the input is an amino acid sequence and the output is the probability of a residue to adopt a particular structure.

10. Which of the following is untrue regarding Prediction with Neural Networks?

a) It has to be first trained by sequences with known structures

b) Between input and output are many connected hidden layers

c) Between the connected hidden layers the machine learning takes place to adjust the mathematical weights of internal connections

d) It doesn’t have to be first trained by sequences with known structures View Answer

Answer: d

Explanation: The neural network has to be first trained by sequences with known structures so it can recognize the amino acid patterns and their relationships with known structures. During this process, the weight functions in hidden layers are optimized so they can relate input to output correctly. When the sufficiently trained network processes an unknown sequence, it applies the rules learned in training to recognize particular structural patterns.

11. Which of the following is untrue regarding Prediction with Neural Networks?

a) When multiple sequence alignments and neural networks are combined, the result is further improved accuracy

b) A neural network is trained by a single sequence

c) A neural network is trained by a sequence profile derived from the multiple sequence alignment

d) When the sufficiently trained network processes an unknown sequence, it applies the rules learned in training to recognize particular structural patterns View Answer

Answer: b

Explanation: A neural network is trained not by a single sequence but by a sequence profile derived from the multiple sequence alignment. This combined approach has been shown to improve the accuracy to above 75%, which is a breakthrough in secondary structure prediction. The improvement mainly comes from enhanced secondary structure signals through consensus drawing. Several frequently used third-generation prediction algorithms are available as web servers.

12. Which of the following is untrue regarding PHD?

a) It stands for Profile network from Heidelberg

b) It is a web-based program that uses a neural network only

c) It first performs a BLASTP of the query sequence against a non redundant protein sequence database

d) In initial steps it finds a set of homologous sequences, which are aligned with the MAXHOM program (a weighted dynamic programming algorithm performing global alignment)

View Answer

Answer: b

Explanation: PHD is a web-based program that combines a neural network with multiple sequence alignment. After the initial steps, the resulting alignment in the form of a profile is fed into a neural network that contains three hidden layers. The first hidden layer makes a raw prediction based on the multiple sequence alignment by sliding a window of thirteen positions.

Secondary Structure Prediction for Transmembrane Proteins

1. Which of the following is untrue regarding the transmembrane proteins?

a) Constitute up to 30% of all cellular proteins

b) They are responsible for performing a wide variety of important functions in a cell, such as signal transduction, cross-membrane transport, and energy conversion

c) The membrane proteins are also of tremendous biomedical importance

d) They are not drug targets or receptors View Answer

Answer: d

Explanation: The membrane proteins are also of tremendous biomedical importance, as they often serve as drug targets for pharmaceutical development. There are two types of integral membrane proteins: α-helical type and β-barrel type. Most transmembrane proteins contain solely α-helices, which are found in the cytoplasmic membrane. A few membrane proteins consist of β-strands forming a β-barrel topology, a cylindrical structure composed of antiparallel β-sheets.

2. The structures of this group of proteins, however, are comparatively much more difficult to resolve by either x-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

a) True

b) False View Answer

Answer: a

Explanation: For this group of proteins, prediction of the transmembrane secondary structural elements and their organization is particularly important. Fortunately, the prediction process is somewhat easier because of the hydrophobic environment of the lipid bilayers, which restricts the transmembrane segments to be hydrophobic as well.

3. Which of the following is untrue regarding Prediction of Helical Membrane Proteins?

a) For membrane proteins consisting of transmembrane α–helices, these transmembrane helices are predominantly hydrophobic with a specific distribution of positively charged residues

b) The α-helices generally run perpendicular to the membrane plane

c) The α-helices generally run parallel to the membrane plane

d) The α-helices have an average length between seventeen and twenty-five residues View Answer

Answer: c

Explanation: The hydrophobic helices are normally separated by hydrophilic loops with average lengths of fewer than sixty residues. The residues bordering the transmembrane spans are more positively charged. Another feature indicative of the presence of transmembrane segments is that residues at the cytosolic side near the hydrophobic anchor are more positively charged than those at the lumenal or periplasmic side. This is known as the positive-inside rule.
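The positive-inside rule can be illustrated with a short sketch. It assumes the boundaries of one transmembrane helix are already known and simply compares the number of lysine (K) and arginine (R) residues in the two flanking regions; the flank length of 15 residues and the function names are assumptions chosen for this example, not values from a specific program.

```python
def count_positive(segment):
    """Count positively charged residues (Lys, Arg) in a sequence segment."""
    return sum(segment.count(aa) for aa in "KR")

def predict_cytosolic_side(seq, helix_start, helix_end, flank=15):
    """Guess which flank of a helix (seq[helix_start:helix_end]) faces the cytosol.

    Returns "N" or "C" for the flank carrying more positive charges,
    or "undetermined" when both flanks carry the same number.
    """
    n_flank = seq[max(0, helix_start - flank):helix_start]
    c_flank = seq[helix_end:helix_end + flank]
    n_pos, c_pos = count_positive(n_flank), count_positive(c_flank)
    if n_pos > c_pos:
        return "N"
    if c_pos > n_pos:
        return "C"
    return "undetermined"

# Toy sequence: positively charged N-terminal flank, 20-residue hydrophobic span.
toy = "MKRKRS" + "L" * 20 + "DESTG"
print(predict_cytosolic_side(toy, 6, 26))
```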

4. The early algorithms based their prediction on hydrophobicity scales.

a) True

b) False View Answer

Answer: a

Explanation: A number of algorithms for identifying transmembrane helices have been developed, where the early algorithms based their prediction on hydrophobicity scales. They typically scan a window of seventeen to twenty-five residues and assign membrane spans based on hydrophobicity scores. Some are also able to determine the orientation of the membrane helices based on the positive-inside rule.
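A hydrophobicity window scan of the kind described above might look like the following sketch. It uses the Kyte-Doolittle hydropathy scale; the window length of 19 and the cutoff of 1.6 are assumptions taken as illustrative defaults (the text only states a window of seventeen to twenty-five residues), and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the twenty standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_scores(seq, window=19):
    """Average hydropathy of every full window along the sequence."""
    return [
        sum(KD[aa] for aa in seq[start:start + window]) / window
        for start in range(len(seq) - window + 1)
    ]

def candidate_spans(seq, window=19, threshold=1.6):
    """Return (start, end) index pairs of windows scoring at or above the cutoff."""
    return [
        (i, i + window)
        for i, score in enumerate(window_scores(seq, window))
        if score >= threshold
    ]

toy = "DDDD" + "L" * 19 + "KKKK"
print(candidate_spans(toy))
```

Overlapping windows above the cutoff would normally be merged into one predicted span; that step is left out to keep the sketch short. Non-standard residue codes raise a `KeyError` here.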

5. Predictions solely based on hydrophobicity profiles have the lowest error rates.

a) True

b) False View Answer

Answer: b

Explanation: Predictions solely based on hydrophobicity profiles have high error rates. As with the third-generation predictions for globular proteins, applying evolutionary information with the help of neural networks or HMMs can improve the prediction accuracy significantly.

6. The presence of ______ signal peptides can significantly compromise the prediction ______ because the programs tend to confuse hydrophobic signal peptides with membrane helices.

a) hydrophobic, accuracy

b) hydrophobic, error

c) hydrophilic, accuracy

d) hydrophilic, error View Answer

Answer: a

Explanation: Predicting transmembrane helices is relatively easy. The accuracy of some of the best prediction programs, such as TMHMM or HMMTOP, can exceed 70%. To minimize errors, the presence of signal peptides can be detected using a number of specialized programs and then manually excluded.

7. Which of the following is untrue regarding TMHMM?

a) It is a web-based program based on an HMM algorithm

b) It is trained to recognize transmembrane helical patterns

c) It is not trained to recognize transmembrane helical patterns

d) When a query sequence is scanned, the probability of having an α-helical domain is given

View Answer

Answer: c

Explanation: It is trained to recognize transmembrane helical patterns based on a training set of 160 well-characterized helical membrane proteins. The orientation of the α-helices is predicted based on the positive-inside rule. The prediction output returns the number of transmembrane helices, the boundaries of the helices, and a graphical representation of the helices. This program can also be used to simply distinguish between globular proteins and membrane proteins.

8. Which of the following is untrue regarding Phobius?

a) It is a web-based program designed to overcome false positives caused by the presence of signal peptides

b) The program incorporates distinct HMM models for signal peptides only

c) The program incorporates distinct HMM models for signal peptides as well as transmembrane helices

d) After distinguishing the putative signal peptides from the rest of the query sequence, prediction is made on the remainder of the sequence

View Answer

Answer: b

Explanation: In addition to the given data, it has been shown that the prediction accuracy can be significantly improved compared to TMHMM (94% by Phobius compared to 70% by TMHMM). In addition to the normal prediction mode, the user can also define certain sequence regions as signal peptides or other nonmembrane sequences based on external knowledge.

9. Which of the following is true regarding Prediction of β-Barrel Membrane Proteins?

a) For membrane proteins with β-strands only, the β-strands forming the transmembrane segment are amphipathic in nature

b) For membrane proteins with β-strands only, the β-strands forming the transmembrane segment are only hydrophilic in nature

c) For membrane proteins with β-strands only, the β-strands forming the transmembrane segment are only hydrophobic in nature

d) They contain six to nine residues View Answer

Answer: a

Explanation: As stated, for membrane proteins with β-strands only, the β-strands forming the transmembrane segment are amphipathic in nature. They contain ten to twenty-two residues with every second residue being hydrophobic and facing the lipid bilayers whereas the other residues facing the pore of the β-barrel are more hydrophilic.

10. Scanning a sequence by hydrophobicity does not reveal transmembrane β-strands.

a) True

b) False View Answer

Answer: a

Explanation: These programs for predicting transmembrane α-helices are not applicable for this unique type of membrane proteins. To predict the β-barrel type of membrane proteins, a small number of algorithms have been made available based on neural networks and related techniques.

Coiled Coil Prediction

1. Coiled coils are superhelical structures involving two or more interacting α-helices from the same or different proteins.

a) True

b) False View Answer

Answer: a

Explanation: The individual α-helices twist and wind around each other to form a coiled bundle structure. The coiled coil conformation is important in facilitating inter- or intraprotein interactions. Proteins possessing these structural domains are often involved in transcription regulation or in the maintenance of cytoskeletal integrity.

2. Which of the following is true regarding Coiled coil?

a) They have an integral repeat of twenty residues

b) They have an integral repeat of seven residues

c) They have an integral repeat of thirty residues

d) The sequence periodicity doesn’t contribute in designing algorithms to predict the structural domain.

View Answer

Answer: b

Explanation: Coiled coils have an integral repeat of seven residues (heptads) which assume a side-chain packing geometry at facing residues. For every seven residues, the first and fourth are hydrophobic, facing the helical interface; the others are hydrophilic and exposed to the solvent.

The sequence periodicity forms the basis for designing algorithms to predict this important structural domain.
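The heptad periodicity can be turned into a very simple score, sketched below. The idea is to try each of the seven possible registers and measure how often the first and fourth positions of each heptad hold a hydrophobic residue. The hydrophobic residue set and the function names are assumptions for this illustration; real predictors such as Coils use probability matrices rather than this binary test.

```python
# Residues treated as hydrophobic in this sketch (an illustrative choice).
HYDROPHOBIC = set("AILMFVWY")

def heptad_score(seq, offset):
    """Fraction of first/fourth heptad positions that are hydrophobic.

    `offset` is the index of a residue assumed to sit at the first heptad position.
    """
    hits = total = 0
    for i, aa in enumerate(seq):
        position = (i - offset) % 7
        if position in (0, 3):  # first and fourth residue of each heptad
            total += 1
            hits += aa in HYDROPHOBIC
    return hits / total if total else 0.0

def best_register(seq):
    """Return the offset (0-6) whose heptad register scores highest."""
    return max(range(7), key=lambda off: heptad_score(seq, off))

toy = "LKELEDK" * 4  # idealized heptads with Leu at positions one and four
print(best_register(toy), heptad_score(toy, best_register(toy)))
```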

3. Which of the following is untrue regarding Coils?

a) It is a web-based program that detects coiled coil regions in proteins

b) It scans a window of fourteen, twenty-one, or twenty-eight residues

c) It scans a window of fourteen or twenty-one residues only

d) It compares the sequence to a probability matrix compiled from known parallel two-stranded coiled coils.

View Answer

Answer: c

Explanation: By comparing the similarity scores, the program calculates the probability of the sequence adopting a coiled coil conformation. The program is accurate for solvent-exposed, left-handed coiled coils, but less sensitive for other types of coiled coil structures, such as buried or right-handed coiled coils.

4. In Multicoil, the scoring matrix is constructed based on a database of known three-stranded coiled coils only.

a) True

b) False View Answer

Answer: b

Explanation: Multicoil is a web-based program for predicting coiled coils. The scoring matrix is constructed based on a database of known two-stranded and three-stranded coiled coils. The program is more conservative than Coils. It has been recently used in several genome-wide studies to screen for protein–protein interactions mediated by coiled coil domains.

5. Leucine zipper domains are a special type of coiled coils found in transcription regulatory proteins which contain two antiparallel α-helices held together by hydrophobic interactions of leucine residues.

a) True

b) False View Answer

Answer: a

Explanation: The heptad repeat pattern is L-X(6)-L-X(6)-L-X(6)-L. This repeat pattern alone can sometimes allow detection of the domain, albeit with high rates of false positives. The reason for the high false-positive rates is that the pattern can match sequence regions that do not adopt a coiled coil conformation. To address this problem, algorithms have been developed that take into account both leucine repeats and coiled coil conformation to give accurate prediction.
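The L-X(6)-L-X(6)-L-X(6)-L pattern translates directly into a regular expression. The sketch below uses a lookahead so that overlapping matches are also reported; the function name is an assumption for this example. As the explanation notes, a pattern match alone is prone to false positives, so a hit is only a candidate.

```python
import re

# Leucine, six arbitrary residues, repeated three times, then a final leucine.
# The lookahead wrapper lets overlapping occurrences be found.
LEUCINE_REPEAT = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")

def find_leucine_repeats(seq):
    """Return (start_index, matched_segment) for every leucine repeat candidate."""
    return [(m.start(), m.group(1)) for m in LEUCINE_REPEAT.finditer(seq)]

toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_repeats(toy))
```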

6. Which of the following is untrue regarding PSIPRED?

a) It is a web-based program that predicts protein secondary structures

b) It uses a combination of evolutionary information and neural networks

c) It uses a combination of evolutionary information only

d) The multiple sequence alignment is derived from a PSI-BLAST database search View Answer

Answer: c

Explanation: A profile is extracted from the multiple sequence alignment generated from three rounds of automated PSI-BLAST. The profile is then used as input for a neural network prediction similar to that in PHD, but without the jury layer. To achieve higher accuracy, a unique filtering algorithm is implemented to filter out unrelated PSI-BLAST hits during profile construction.

7. Prof is not similar to PHD.

a) True

b) False View Answer

Answer: b

Explanation: Prof stands for Protein forecasting. It is an algorithm that combines PSI-BLAST profiles and a multistaged neural network, similar to that in PHD. In addition, it uses a linear discriminant function to discriminate between the three states.

8. Jpred combines the analysis results from six prediction algorithms, including PHD, PREDATOR, DSC, NNSSP, Jnet, and ZPred.

a) True

b) False View Answer

Answer: a

Explanation: The query sequence is first used to search databases with PSI-BLAST for three iterations. Redundant sequence hits are removed. The resulting sequence homologs are used to build a multiple alignment from which a profile is extracted. The profile information is submitted to the six prediction programs. If there is sufficient agreement among the prediction programs, the majority prediction is taken as the structure.

7. Questions & Answers on Protein Tertiary Structure Prediction

Ab Initio Protein Structural Prediction & Homology Modeling

1. Which of the following is untrue about homology modeling?

a) Homology modeling predicts protein structures based on sequence homology with known structures

b) It is also known as comparative modeling

c) The principle behind it is that if two proteins share a high enough sequence similarity, they are likely to have very similar three-dimensional structures

d) It doesn’t involve the evolutionary distances anywhere View Answer

Answer: d

Explanation: As the name suggests, homology modeling predicts protein structures based on sequence homology with known structures. Homology modeling produces an all-atom model based on alignment with template proteins.

2. Which of the following is untrue about template Selection Step?

a) The first step in protein structural modeling is to select appropriate structural templates

b) This forms the foundation for rest of the modeling process

c) There is no use of heuristic alignment search programs

d) The template selection involves searching the Protein Data Bank (PDB) for homologous proteins with determined structures

View Answer

Answer: c

Explanation: The search can be performed using a heuristic pairwise alignment search program such as BLAST or FASTA. However, the use of dynamic programming based search programs such as SSEARCH or ScanPS can result in more sensitive search results. The relatively small size of the structural database means that the search time using the exhaustive method is still within reasonable limits, while giving a more sensitive result to ensure the best possible similarity hits.

3. Which of the following is untrue about Sequence Alignment Step?

a) Once the structure with the highest sequence similarity is identified as a template, the full-length sequences of the template and target proteins need to be realigned using refined alignment algorithms to obtain optimal alignment

b) The realignment is the most critical step in homology modeling

c) The realignment directly affects the quality of the final model

d) Errors made in the alignment step can be corrected in the following modeling steps View Answer

Answer: d

Explanation: Incorrect alignment at this stage leads to incorrect designation of homologous residues and therefore to incorrect structural models. Errors made in the alignment step cannot be corrected in the following modeling steps. Therefore, the best possible multiple alignment algorithms, such as Praline and T-Coffee, should be used for this purpose.

4. Which of the following is untrue about Backbone Model Building Step?

a) Once optimal alignment is achieved, residues in the aligned regions of the target protein can assume a similar structure as the template proteins

b) Coordinates of the corresponding residues of the template proteins can be simply copied onto the target protein

c) If the two residues differ, everything other than the backbone atoms can be copied

d) If the two aligned residues are identical, coordinates of the side chain atoms are copied along with the main chain atoms

View Answer

Answer: c

Explanation: Options a and b mean the same. If the two residues differ, only the backbone atoms can be copied. The side chain atoms are rebuilt in a subsequent procedure.

In backbone modeling, it is simplest to use only one template structure. The structure with the best quality and highest resolution is normally chosen if multiple options are available.
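The copying rule for aligned residues can be written as a small sketch. It assumes each residue's atoms are stored as a dictionary from atom name to (x, y, z) coordinates and that the backbone is identified by the atom names N, CA, C, and O; the data layout and function name are assumptions for illustration, not the format of any particular modeling package.

```python
BACKBONE_ATOMS = ("N", "CA", "C", "O")

def copy_coordinates(target_residue, template_residue, template_atoms):
    """Copy template coordinates onto an aligned target residue.

    Identical residues inherit all atoms; differing residues inherit only the
    backbone, leaving the side chain to be rebuilt in a later step.
    """
    if target_residue == template_residue:
        return dict(template_atoms)
    return {
        name: xyz
        for name, xyz in template_atoms.items()
        if name in BACKBONE_ATOMS
    }

# Hypothetical coordinates for a template serine residue.
serine = {
    "N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0),
    "O": (1.3, 2.4, 0.0), "CB": (2.0, -0.8, 1.2), "OG": (3.4, -0.8, 1.3),
}
print(sorted(copy_coordinates("SER", "SER", serine)))
print(sorted(copy_coordinates("ALA", "SER", serine)))
```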

5. Which of the following is untrue about Loop Modeling Step?

a) In the sequence alignment for modeling, there are often regions caused by insertions and deletions producing gaps in sequence alignment

b) In the sequence alignment for modeling, there are no regions producing gaps in sequence alignment

c) The gaps cannot be directly modeled

d) Loop modeling is required for closing the gaps View Answer

Answer: b

Explanation: Closing the gaps requires loop modeling, which is a very difficult problem in homology modeling and is also a major source of error. Loop modeling can be considered a mini–protein modeling problem by itself. Unfortunately, there are no mature methods available that can model loops reliably. Currently, there are two main techniques used to approach the problem: the database searching method and the ab initio method.

6. The procedure begins by measuring the orientation and distance of the anchor regions in the stems and searching PDB for segments of the same length that also match the above endpoint conformation.

a) True

b) False View Answer

Answer: a

Explanation: Usually, many different alternative segments that fit the endpoints of the stems are available. The best loop can be selected based on sequence similarity as well as minimal steric clashes with the neighboring parts of the structure. The conformation of the best matching fragments is then copied onto the anchoring points of the stems.

7. Which of the following is untrue about specialized programs for loop modeling?

a) PETRA is a web server that models loops using the database approach

b) FREAD is a web server that models loops using the database approach

c) CODA is a web server that uses a consensus method based on the prediction results from FREAD and PETRA

d) For loops of three to eight residues, CODA uses consensus conformation of both methods

View Answer

Answer: a

Explanation: PETRA is a web server that uses the ab initio method to model loops. For nine to thirty residues, CODA uses FREAD prediction only.

8. In the Side Chain Refinement step, a side chain can be built by searching every possible conformation at every torsion angle of the side chain to select the one that has the lowest interaction energy with neighboring atoms.

a) True

b) False View Answer

Answer: a

Explanation: However, this approach is computationally prohibitive in most cases. In fact, most current side chain prediction programs use the concept of rotamers, which are favored side chain torsion angles extracted from known protein crystal structures. In prediction of side chain conformation, only the possible rotamers with the lowest interaction energy with nearby atoms are selected.
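The rotamer idea reduces the search to a handful of candidate torsion angles. The sketch below picks, from a small rotamer list, the one with the lowest energy under a caller-supplied energy function. The three chi1 angles and the `toy_energy` function are purely hypothetical stand-ins; an actual program would use a rotamer library derived from crystal structures and a physical interaction energy.

```python
import math

def select_rotamer(rotamers, energy_fn):
    """Return the rotamer with the lowest energy according to energy_fn."""
    return min(rotamers, key=energy_fn)

def toy_energy(chi1_deg, clash_angle_deg=90.0):
    """Hypothetical energy that is highest when chi1 points at a neighboring atom."""
    return 1.0 + math.cos(math.radians(chi1_deg - clash_angle_deg))

# Illustrative chi1 candidates instead of a full 360-degree torsion scan.
candidate_chi1 = [-60.0, 60.0, 180.0]
print(select_rotamer(candidate_chi1, toy_energy))
```

Evaluating three candidates per torsion angle instead of every possible angle is what makes the approach computationally tractable.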

9. In the step of Model Refinement Using Energy Function, the structural irregularities can be corrected by applying the energy minimization procedure on the entire model, which moves the atoms in such a way that the overall conformation has the lowest energy potential.

a) True

b) False View Answer

Answer: a

Explanation: In the loop modeling and side chain modeling steps, potential energy calculations are applied to improve the model. However, this does not guarantee that the entire raw homology model is free of structural irregularities such as unfavorable bond angles, bond lengths, or close atomic contacts. Therefore, this step is used. The goal of energy minimization is to relieve steric collisions and strains without significantly altering the overall structure.

10. Energy minimization has to be used with caution because excessive energy minimization often moves residues away from their correct positions.

a) True

b) False View Answer

Answer: a

Explanation: Only limited energy minimization is recommended (a few hundred iterations) to remove major errors, such as short bond distances and close atomic clashes. Key conserved residues and those involved in cofactor binding have to be restrained if necessary during the process.

11. Which of the following is untrue about Ab initio prediction?

a) The limited knowledge of protein folding forms the basis of ab initio prediction

b) The ab initio prediction method attempts to produce all-atom protein models based on sequence information alone without the aid of known protein structures

c) The ab initio prediction method attempts to produce all-atom protein models based on sequence information alone with some aid of known protein structures

d) The perceived advantage of this method is that predictions are not restricted by known folds and that novel protein folds can be identified

View Answer

Answer: c

Explanation: Alongside the advantages, because the physicochemical laws governing protein folding are not yet well understood, the energy functions used in the ab initio prediction are, at present, rather inaccurate. The folding problem remains one of the greatest challenges in bioinformatics today.

12. The prediction programs are thus designed using the energy minimization principle.

a) True

b) False View Answer

Answer: a

Explanation: Current ab initio algorithms are not yet able to accurately simulate the protein folding process. They work by using some type of heuristics. Because the native state of a protein structure is near energy minimum, the prediction programs are thus designed using the energy minimization principle.

13. Searching for a fold with the absolute minimum energy may not be valid in reality.

a) True

b) False View Answer

Answer: a

Explanation: These algorithms search for every possible conformation to find the one with the lowest global energy. However, searching for a fold with the absolute minimum energy may not be valid in reality. This contributes to one of the fundamental flaws of this approach. In addition, searching for all possible structural conformations is not yet computationally feasible.

14. Rosetta is a web server that predicts protein three-dimensional conformations using the ab initio method.

a) True

b) False View Answer

Answer: a

Explanation: This in fact relies on a “mini-threading” method. The method first breaks down the query sequence into many very short segments (three to nine residues) and predicts the secondary structure of the small segments using a Hidden Markov model–based program, HMMSTR.

15. In Rosetta, the segments with assigned ______ structures are subsequently assembled into a ______-dimensional configuration.

a) primary, three

b) secondary, three

c) secondary, two

d) primary, three View Answer

Answer: b

Explanation: Through random combinations of the fragments, a large number of models are built and their overall energy potentials calculated. The conformation with the lowest global free energy is chosen as the best model.

Threading and Fold Recognition

1. There are a large number of protein folds available, compared to millions of protein sequences

a) True

b) False View Answer

Answer: b

Explanation: There are only a small number of protein folds available (<1,000), compared to millions of protein sequences. This means that protein structures tend to be more conserved than protein sequences. Consequently, many proteins can share a similar fold even in the absence of sequence similarities. This allowed the development of computational methods to predict protein structures beyond sequence similarities.

2. Threading or structural fold recognition predicts the structural fold of an unknown protein sequence by fitting the sequence into a structural database and selecting the best-fitting fold.

a) True

b) False View Answer

Answer: a

Explanation: Determining whether a protein sequence adopts a known three-dimensional structural fold relies on threading and fold recognition methods. The comparison emphasizes matching of secondary structures, which are most evolutionarily conserved. Therefore, this approach can identify structurally similar proteins even without detectable sequence similarity.

3. The algorithms used here can be classified into two categories, pairwise energy based and profile based.

a) True

b) False View Answer

Answer: a

Explanation: The pairwise energy–based method was originally referred to as threading and the profile-based method was originally defined as fold recognition. However, the two terms are now often used interchangeably without distinction in the literature.

4. In the pairwise energy based method, a protein sequence is searched for in a structural fold database to find the best matching structural fold using ______ criteria.

a) sequence-based

b) structure-based

c) energy-based

d) residue-based View Answer

Answer: c

Explanation: The detailed procedure involves aligning the query sequence with each structural fold in a fold library. The alignment is performed essentially at the sequence profile level using dynamic programming or heuristic approaches. Local alignment is often adjusted to get lower energy and thus better fitting. The adjustment can be achieved using algorithms such as double-dynamic programming.

5. The next step in the pairwise energy based method is to build a crude model for the target sequence by replacing aligned residues in the template structure with the corresponding residues in the query.

a) True

b) False View Answer

Answer: a

Explanation: After the mentioned step, the last step is to calculate the energy terms of the raw model, which include pairwise residue interaction energy, solvation energy, and hydrophobic energy. Finally, the models are ranked based on the energy terms to find the lowest energy fold that corresponds to the structurally most compatible fold.
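The final ranking step can be sketched as follows. Each candidate fold is assumed to come with already computed energy terms (pairwise residue interaction, solvation, and hydrophobic); the sketch only sums them and sorts. The fold names and numeric values are hypothetical and chosen solely to show the ordering; computing the terms themselves is outside the scope of this example.

```python
def rank_folds(models):
    """Sort candidate folds by total energy, lowest (most compatible) first.

    `models` maps a fold name to a dictionary of its energy terms.
    """
    totals = {name: sum(terms.values()) for name, terms in models.items()}
    return sorted(totals.items(), key=lambda item: item[1])

# Hypothetical energy terms for two raw models of the same query.
raw_models = {
    "fold_A": {"pairwise": -12.0, "solvation": -3.5, "hydrophobic": -2.0},
    "fold_B": {"pairwise": -8.0, "solvation": -1.0, "hydrophobic": -0.5},
}
for name, energy in rank_folds(raw_models):
    print(name, energy)
```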

6. Which of the following is untrue about profile method?

a) A profile is constructed for a group of related protein structures

b) The propensity of amino acids is not considered in this method

c) Statistical information from these aligned residues is then used to construct a profile

d) The structural profile is generated by superimposition of the structures to expose corresponding residues

View Answer

Answer: b

Explanation: The profile contains scores that describe the propensity of each of the twenty amino acid residues to be at each profile position. The profile scores contain information for secondary structural types, the degree of solvent exposure, polarity, and hydrophobicity of the amino acids. To predict the structural fold of an unknown query sequence, the query sequence is first predicted for its secondary structure, solvent accessibility, and polarity.

7. Because threading and fold recognition detect structural homologs ______ relying on sequence similarities, they have been shown to be ______ than PSI-BLAST in finding distant evolutionary relationships.

a) without completely, far more sensitive

b) completely, far more sensitive

c) completely, less sensitive

d) without completely, less sensitive View Answer

Answer: a

Explanation: In many cases, they can identify more than twice as many distant homologs as PSI-BLAST. However, this high sensitivity can also be their weakness because high sensitivity is often associated with low specificity. The predictions resulting from threading and fold recognition often come with very high rates of false positives. Therefore, much caution is required in accepting the prediction results.

8. Which of the following is untrue about threading and fold recognition?

a) It assesses the compatibility of an amino acid sequence with a known structure in a fold library

b) If the protein fold to be predicted does not exist in the fold library, the method won’t necessarily fail

c) If the protein fold to be predicted does not exist in the fold library, the method will fail

d) Threading and fold recognition do not generate fully refined atomic models for the query sequences

View Answer

Answer: b

Explanation: A disadvantage compared to homology modeling lies in the fact that threading and fold recognition do not generate fully refined atomic models for the query sequences. This is because accurate alignment between distant homologs is difficult to achieve. Instead, threading and fold recognition procedures only provide a rough approximation of the overall topology of the native structure.

9. Which of the following is untrue about 3D-PSSM?

a) It is a web-based program that employs the structural profile method to identify protein folds

b) The profiles for each protein superfamily are constructed by combining multiple smaller profiles

c) A protein structural superfamily doesn’t have sequence-based PSI-BLAST profile

d) In initial steps, protein structures in a superfamily based on the SCOP classification are superimposed

View Answer

Answer: c

Explanation: First, protein structures in a superfamily based on the SCOP classification are superimposed and are used to construct a structural profile by incorporating secondary structures and solvent accessibility information for corresponding residues. In addition, each member in a protein structural superfamily has its own sequence-based PSI-BLAST profile computed. These sequence profiles are used in combination with the structure profile to form a large superfamily profile in which each position contains both sequence and structural information.

10. Which of the following is true about GenTHREADER?

a) It is a web-based program that uses a hybrid of the profile and pairwise energy methods

b) It is a web-based program that uses profile methods only

c) It is a web-based program that uses pairwise energy methods only

d) The initial step is quite dissimilar to 3D-PSSM View Answer

Answer: a

Explanation: The initial step is similar to 3D-PSSM; the query protein sequence is subject to three rounds of PSI-BLAST. The resulting multiple sequence hits are used to generate a profile. Its secondary structure is predicted using PSIPRED. Both are used as input for threading computation based on a pairwise energy potential method. The threading results are evaluated using neural networks that combine energy potentials, sequence alignment scores, and length information to create a single score representing the relationship between the query and template proteins.

8. Questions on RNA Structure Prediction

Types of RNA Structures

1. RNA structures can be experimentally determined using

a) x-ray crystallography techniques only

b) NMR techniques only

c) x-ray crystallography or NMR techniques

d) gel electrophoresis View Answer

Answer: c

Explanation: RNA structures can be determined experimentally by x-ray crystallography or NMR. However, both approaches are extremely time-consuming and expensive, so computational prediction has become an attractive alternative. Option d is not relevant to determining the structure of RNA.

2. Which of the following is not a form of RNA?

a) mRNA

b) tRNA

c) qRNA

d) rRNA View Answer

Answer: c

Explanation: It is known that RNA is a carrier of genetic information and exists in three main forms. They are messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA). Their main roles are as follows: mRNA is responsible for directing protein synthesis; rRNA provides structural scaffolding within ribosomes; and tRNA serves as a carrier of amino acids for polypeptide synthesis.

3. Unlike DNA, which is mainly double stranded, RNA is double stranded, although an RNA molecule can self-hybridize at certain regions to form partial double-stranded structures.

a) True

b) False View Answer

Answer: b

Explanation: RNA is single stranded, although an RNA molecule can self-hybridize at certain regions to form partial double-stranded structures. Generally, mRNA is more or less linear and non-structured, whereas rRNA and tRNA can only function by forming particular secondary and tertiary structures.

4. Knowledge of RNA structures is not particularly important when it comes to studying their functions.

a) True

b) False View Answer

Answer: b

Explanation: Knowledge of the structures of these molecules is particularly important for understanding their functions. Difficulties in experimental determination of RNA structures make theoretical prediction a very desirable approach. In fact, computation-based analysis is a main tool in RNA-based drug design in the pharmaceutical industry. In addition, knowledge of the secondary structures of rRNA is key for RNA-based phylogenetic analysis.

5. Which of the following is not a base of RNA?

a) Thymine (T)

b) Adenine (A)

c) Cytosine (C)

d) Guanine (G) View Answer

Answer: a

Explanation: RNA structures can be described at three levels as in proteins: primary, secondary, and tertiary. The primary structure is the linear sequence of RNA, consisting of four bases, adenine (A), cytosine (C), guanine (G), and uracil (U).

6. Base pairing, in RNA, is A–G and U–C.

a) True

b) False View Answer

Answer: b

Explanation: The secondary structure refers to the planar representation that contains base-paired regions among single-stranded regions. The base pairing is mainly composed of traditional Watson–Crick base pairing, which is A–U and G–C.

7. In addition to the canonical base pairing, there often exists non-canonical base pairing such as ______ and ______ base pairing.

a) G, U

b) G, C

c) U, C

d) A, C View Answer

Answer: a

Explanation: There often exists non-canonical base pairing such as G and U base pairing. The G–U base pair is less stable and normally occurs within a double-strand helix surrounded by Watson–Crick base pairs. Finally, the tertiary structure is the three-dimensional arrangement of bases of the RNA molecule.
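The pairing rules from questions 6 and 7 can be written down as a small lookup. This is only an illustrative sketch; the names WATSON_CRICK, WOBBLE, and pair_type are made up for this example and do not come from any particular program.

```python
# Canonical Watson-Crick pairs (A-U, G-C) and the non-canonical G-U wobble pair.
# All names in this block are hypothetical, chosen for illustration only.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(x, y):
    """Classify two RNA bases as 'canonical', 'wobble', or None (no pair)."""
    pair = (x.upper(), y.upper())
    if pair in WATSON_CRICK:
        return "canonical"
    if pair in WOBBLE:
        return "wobble"
    return None
```

For example, pair_type("A", "U") gives "canonical", pair_type("G", "U") gives "wobble", and pair_type("A", "G") gives None.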

8. Four main subtypes of secondary structures can be identified. They are hairpin loops, bulge loops, interior loops, and multi-branch loops.

a) True

b) False View Answer

Answer: a

Explanation: Because the RNA tertiary structure is very difficult to predict, attention has been mainly focused on secondary structure prediction. Based on the arrangement of helical base pairing in secondary structures, the mentioned four main subtypes of secondary structures can be identified.

9. The ______ refers to a structure with two ends of a single-stranded region (loop) connecting a base-paired region (stem).

a) helical junctions

b) hairpin loop

c) bulge loop

d) interior loop View Answer

Answer: b

Explanation: The bulge loop refers to a single-stranded region connecting two adjacent base-paired segments so that it “bubbles” out in the middle of a double helix on one side. The multi-branch loop is also called a helical junction.

10. The ______ refers to two single-stranded regions on opposite strands connecting two adjacent base-paired segments.

a) hairpin loop

b) interior loop

c) pseudoknot loop

d) helical junctions View Answer

Answer: b

Explanation: In addition to the traditional secondary structural elements, base pairing between loops of different secondary structural elements can result in a higher level of structures such as pseudoknots, kissing hairpins, and hairpin–bulge contact. A pseudoknot loop refers to base pairing formed between loop residues within a hairpin loop and residues outside the hairpin loop.

RNA Secondary Structure Prediction Methods

1. At present, there are essentially two types of methods for RNA structure prediction. One is the minimum free energy approach and the second is the comparative approach.

a) True

b) False View Answer

Answer: a

Explanation: One is based on the calculation of the minimum free energy of the stable structure derived from a single RNA sequence. This can be considered an ab initio approach. The second is a comparative approach which infers structures based on an evolutionary comparison of multiple related RNA sequences.

2. Ab initio approach makes structural predictions based on

a) a single RNA sequence

b) comparing RNA sequences

c) evolutionary basis

d) pure phylogenetics View Answer

Answer: a

Explanation: The rationale behind this method is that the structure of an RNA molecule is solely determined by its sequence. Thus, algorithms can be designed to search for a stable RNA structure with the lowest free energy.

3. In the ab initio approach, generally, when a base pairing is formed, the energy of the molecule is ______ because of attractive interactions between the two strands.

a) lowered

b) increased

c) multiplied

d) kept stable View Answer

Answer: a

Explanation: Here, the algorithms can be designed to search for a stable RNA structure with the lowest free energy. Thus, to search for a most stable structure, ab initio programs are designed to search for a structure with the maximum number of base pairs.

4. In ab initio methods, free energy can be calculated based on parameters empirically derived for small molecules.

a) True

b) False View Answer

Answer: a

Explanation: G–C base pairs are more stable than A–U base pairs, which are more stable than G–U base pairs. It is also known that base-pair formation is not an independent event.

5. The energy necessary to form individual base pairs is not quite affected by adjacent base pairs.

a) True

b) False View Answer

Answer: b

Explanation: The energy necessary to form individual base pairs is influenced by adjacent base pairs through helical stacking forces. This is known as co-operativity in helix formation.

6. The attractive interactions lead to ______ energy.

a) increased

b) higher

c) lower

d) no change in View Answer

Answer: c

Explanation: If a base pair is next to other base pairs, the base pairs tend to stabilize each other through attractive stacking interactions between aromatic rings of the base pairs. The attractive interactions lead to even lower energy. Parameters for calculating the co-operativity of the base-pair formation have been determined and can be used for structure prediction.

7. If the base pair is adjacent to loops or bulges, the neighboring loops and bulges tend to ______ the base-pair formation.

a) have no change on

b) decrease the energy

c) stabilize

d) destabilize View Answer

Answer: d

Explanation: This is because there is a loss of entropy when the ends of the helical structure are constrained by unpaired loop residues. The destabilizing force to a helical structure also depends on the types of loops nearby.

8. The scoring scheme based on the combined stabilizing and destabilizing interactions forms the foundation of the ab initio RNA secondary structure prediction method.

a) True

b) False View Answer

Answer: a

Explanation: Parameters for calculating different destabilizing energies have also been determined and can be used as penalties for secondary structure calculations. This method works by first finding all possible base-pairing patterns from a sequence and then calculating the total energy of a potential secondary structure by taking into account all the adjacent stabilizing and destabilizing forces.
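The combined scoring idea can be sketched as a sum of negative (stabilizing) terms for base pairs and positive (destabilizing) penalties for loops. The numbers below are placeholders invented for this example, not measured thermodynamic parameters, and the sketch assigns one term per base pair rather than modeling stacking between neighboring pairs. The names PAIR_ENERGY, LOOP_PENALTY, structure_energy, and most_stable are hypothetical.

```python
# Placeholder energies in kcal/mol, chosen ONLY for illustration.
# They are NOT experimentally derived parameters; they merely preserve
# the ordering G-C more stable than A-U more stable than G-U.
PAIR_ENERGY = {"CG": -3.0, "AU": -2.0, "GU": -1.0}               # stabilizing
LOOP_PENALTY = {"hairpin": 4.0, "bulge": 3.0, "interior": 2.0}   # destabilizing


def structure_energy(paired_bases, loops):
    """Sum stabilizing pair terms and destabilizing loop penalties."""
    energy = 0.0
    for x, y in paired_bases:
        # Sort the two letters so ("G", "C") and ("C", "G") share the key "CG".
        energy += PAIR_ENERGY["".join(sorted((x, y)))]
    for loop in loops:
        energy += LOOP_PENALTY[loop]
    return energy


def most_stable(candidates):
    """Return the name of the candidate structure with the lowest energy.

    candidates maps a name to a (paired_bases, loops) tuple.
    """
    return min(candidates, key=lambda name: structure_energy(*candidates[name]))
```

With these placeholder values, a stem of one G–C and one A–U pair closed by a hairpin loop scores -3.0 - 2.0 + 4.0 = -1.0, and most_stable picks whichever candidate has the lowest total.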

9. Ab initio methods are energetically least favorable.

a) True

b) False View Answer

Answer: b

Explanation: Ab initio methods search for the energetically most favorable structure. If there are multiple alternative secondary structures, the method finds the conformation with the lowest energy.

10. The dot matrix method and the dynamic programming method can be used in detecting self-complementary regions of a sequence.

a) True

b) False View Answer

Answer: a

Explanation: A simple dot matrix can find all possible base-pairing patterns of an RNA sequence when the sequence is compared with itself. In this case, dots are placed in the matrix to represent matching complementary bases instead of identical ones.
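A minimal sketch of such a self-comparison dot matrix follows. It marks a cell whenever the row base and the column base are Watson–Crick complementary; the function names dot_matrix and render are made up for this example.

```python
# Watson-Crick complementary pairs used to place dots.
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def dot_matrix(seq):
    """Compare an RNA sequence with itself; True marks complementary bases."""
    n = len(seq)
    return [[(seq[i], seq[j]) in COMPLEMENT for j in range(n)] for i in range(n)]


def render(matrix):
    """Draw the matrix as text, using '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)
```

For the short sequence GAUC, render(dot_matrix("GAUC")) draws the dots along the diagonal that runs from upper right to lower left, which is the pattern a self-complementary region produces.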

Ab Initio Approach

1. In a dot matrix in ab initio methods, the diagonals ______ to the main diagonal represent regions that can self-hybridize.

a) parallel

b) cutting in random fashion

c) perpendicular

d) adjacent parallel View Answer

Answer: c

Explanation: The diagonals perpendicular to the main diagonal represent regions that can self hybridize to form double-stranded structure with traditional A–U and G–C base pairs. In reality, the pattern detection in a dot matrix is often obscured by high noise levels.

2. In dot matrix in ab initio methods, one way to reduce the noise in the matrix is to select an appropriate window size of a maximum number of contiguous base matches.

a) True

b) False View Answer

Answer: b

Explanation: One way to reduce the noise in the matrix is to select an appropriate window size, that is, a minimum number of contiguous base matches. Normally, a window size of four consecutive base matches is used. If the dot plot reveals more than one feasible structure, the lowest energy one is chosen.
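The window filter can be sketched as a search for runs of consecutive complementary bases of at least a minimum length. This is a simplified illustration: find_stems is a hypothetical helper, only Watson–Crick pairs are counted, and no minimum hairpin loop size is enforced.

```python
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def find_stems(seq, window=4):
    """Return (i, j, length) for each maximal run where seq[i + k] pairs
    with seq[j - k], keeping only runs of at least `window` base matches."""
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip cells that merely continue a run starting further out.
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in COMPLEMENT:
                continue
            length = 0
            while (i + length < j - length
                   and (seq[i + length], seq[j - length]) in COMPLEMENT):
                length += 1
            if length >= window:
                stems.append((i, j, length))
    return stems
```

With the default window of four, the sequence GGGGAAAACCCC yields a single stem starting at positions 0 and 11; lowering the window to three lets shorter, noisier runs through as well.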

3. In ab initio methods, the use of a dot plot can be effective in finding a ______ secondary structure in a ______ molecule.

a) multiple, large

b) single, large

c) single, small

d) multiple, small View Answer

Answer: c

Explanation: The use of a dot plot can be effective in finding a single secondary structure in a small molecule. However, if a large molecule contains multiple secondary structure segments, choosing a combination that is energetically most stable among a large number of possibilities can be a daunting task.

4. In ab initio methods, a quantitative approach such as dynamic programming can be used to assemble a final structure with optimal base-paired regions.

a) True

b) False View Answer

Answer: a

Explanation: In this approach, an RNA sequence is compared with itself. A scoring scheme is applied to fill the matrix with match scores based on Watson–Crick base complementarity.

5. In dynamic programming in ab initio methods, often, ______ base pairing and energy terms of the base pairing ______ incorporated into the scoring process.

a) G-C, are

b) G–U, are not

c) G–U, are also

d) G-C, are not View Answer

Answer: c

Explanation: Although the traditional structure comprises A–U and G–C base pairs, G–U base pairing is also incorporated into the scoring process. A path with the maximal score within a scoring matrix after taking into account the entire sequence information represents the most probable secondary structure form.

6. The dynamic programming method produces ______ structure with ______ score.

a) one, single best

b) multiple, single best

c) multiple, multiple

d) single, multiple View Answer

Answer: a

Explanation: However, this is potentially a drawback of this approach. In reality an RNA may exist in multiple alternative forms with near minimum energy but not necessarily the one with maximum base pairs.

7. The problem of dynamic programming selecting one single structure can be addressed by adding a probability distribution function, known as the ______, which calculates a mathematical distribution of probable base pairs in a thermodynamic equilibrium.

a) partition function

b) division function

c) increment function

d) fold function View Answer

Answer: a

Explanation: This function helps to select a number of suboptimal structures within a certain energy range. Mfold and RNAfold are two well-known programs using the ab initio prediction method.
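To make the idea of a partition function concrete, the sketch below weights each structure in a small, explicitly listed ensemble by exp(-E/RT) and sums the weights of the structures that contain each base pair. This is an assumption-laden toy: real programs compute the partition function by dynamic programming over all structures rather than by enumeration, and the names RT and pair_probabilities are hypothetical.

```python
import math

# R = 1.987e-3 kcal/(mol K), T = 310.15 K (about 37 degrees C); RT is about 0.616.
RT = 1.987e-3 * 310.15


def pair_probabilities(ensemble):
    """ensemble is a list of (free_energy_kcal_per_mol, set_of_base_pairs).

    Returns a dict mapping each base pair to its equilibrium probability.
    """
    weights = [math.exp(-energy / RT) for energy, _ in ensemble]
    z = sum(weights)  # the partition function of this toy ensemble
    probabilities = {}
    for weight, (_, pairs) in zip(weights, ensemble):
        for pair in pairs:
            probabilities[pair] = probabilities.get(pair, 0.0) + weight / z
    return probabilities
```

If two structures have equal energy and only one of them contains a given pair, that pair comes out with probability 0.5, while a pair shared by both has probability 1.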

8. Which of the following is correct about Mfold?

a) It uses the dynamic programming only

b) It uses the thermodynamic calculations only

c) It uses both dynamic programming and thermodynamic calculations

d) It doesn’t take into account the stability of the secondary structures View Answer

Answer: c

Explanation: Mfold is a web-based program for RNA secondary structure prediction. It combines dynamic programming and thermodynamic calculations for identifying the most stable secondary structures with the lowest energy. It also produces dot plots coupled with energy terms. This method is reliable for short sequences, but becomes less accurate as the sequence length increases.

9. Like Mfold, RNAfold only examines the energy terms of the optimal alignment in a dot plot.

a) True

b) False View Answer

Answer: b

Explanation: RNAfold is one of the web programs in the Vienna package. Unlike Mfold, which only examines the energy terms of the optimal alignment in a dot plot, RNAfold extends the sequence alignment to the vicinity of the optimal diagonals to calculate thermodynamic stability of alternative structures.

10. Which of the following about RNAfold is incorrect?

a) It extends the sequence alignment to the vicinity of the optimal diagonals to calculate thermodynamic stability of alternative structures

b) It incorporates a partition function

c) It doesn’t necessarily use a partition function

d) It aims to select a number of statistically most probable structures in one of its steps View Answer

Answer: c

Explanation: Based on both thermodynamic calculations and the partition function, a number of alternative structures that may be suboptimal are provided. The collection of the predicted structures may provide a better estimate of plausible foldings of an RNA molecule than the predictions by Mfold.

Comparative Approach

1. In ab initio methods for RNA structure prediction, the prediction results from RNAfold are not always guaranteed to be better than those predicted by Mfold.

a) True

b) False View Answer

Answer: a

Explanation: Because of the much larger number of secondary structures to be computed, a more simplified energy rule has to be used to increase computational speed. Thus, the prediction results are not always guaranteed to be better than those predicted by Mfold.

2. The comparative approach uses basic RNA structure based predictions to infer a consensus structure.

a) True

b) False View Answer

Answer: b

Explanation: The comparative approach uses multiple evolutionarily related RNA sequences to infer a consensus structure. This approach is based on the assumption that RNA sequences that are deemed to be homologous fold into the same secondary structure.

3. To distinguish the conserved secondary structure among multiple related RNA sequences, a concept of “covariation” is used.

a) True

b) False View Answer

Answer: a

Explanation: It is known that RNA functional motifs are structurally conserved. To maintain the secondary structures while the homologous sequences evolve, a mutation occurring in one position that is responsible for base pairing should be compensated for by a mutation in the corresponding base-pairing position so as to maintain base pairing and the stability of the secondary structure.

4. ______ of covariation can be ______ to the RNA structure and functions.

a) Any lack, deleterious

b) Any lack, benign

c) Any abundance, deleterious

d) Any inadequacy, advantageous View Answer

Answer: a

Explanation: Based on this rule, algorithms can be written to search for the covariation patterns after a set of homologous RNA sequences are properly aligned. The detected correlated substitutions help to determine conserved base pairing in a secondary structure.
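A search for covariation patterns can be sketched as a scan over pairs of alignment columns, keeping those where every sequence can still form a base pair but the identity of the pair changes between sequences. The helper covarying_columns and its min_pair_types parameter are assumptions made for this example; published programs use more elaborate statistics.

```python
# Pairs accepted as compatible: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covarying_columns(alignment, min_pair_types=2):
    """Return column pairs (i, j) where all aligned sequences can base-pair
    and at least `min_pair_types` different pairs are observed."""
    length = len(alignment[0])
    hits = []
    for i in range(length):
        for j in range(i + 1, length):
            observed = {(seq[i], seq[j]) for seq in alignment}
            if observed <= PAIRS and len(observed) >= min_pair_types:
                hits.append((i, j))
    return hits
```

For the toy alignment GAAAC, CAAAG, AAAAU, columns 0 and 4 hold G–C, C–G, and A–U respectively, so (0, 4) is reported as a compensated, and therefore likely conserved, base pair.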

5. An aspect of the comparative method is to select a ______ structure through consensus drawing.

a) relatively distinct

b) remote

c) common

d) least abundant View Answer

Answer: c

Explanation: Predicting secondary structures for each individual sequence may produce errors. By comparing all predicted structures of a group of aligned RNA sequences and drawing a consensus, the commonly adopted structure can be selected; many other possible structures can be eliminated in the process.

6. The comparative-based algorithms can be further divided into two categories based on the type of input data.

a) True

b) False View Answer

Answer: a

Explanation: The comparative-based algorithms can be further divided as mentioned. One requires predefined alignment and the other does not.

7. The type of algorithm that requires predefined alignment requires the user to provide a ______ alignment as input.

a) not necessarily an alignment

b) multiple only

c) pairwise or multiple

d) pair wise only View Answer

Answer: c

Explanation: As the name suggests, this type of algorithm does require a predefined alignment, so option ‘a’ is irrelevant. The sequence alignment can be obtained using standard alignment programs such as T-Coffee, PRRN, or Clustal. Based on the alignment input, the prediction programs compute structurally consistent mutational patterns such as covariation and derive a consensus structure common for all the sequences.

8. The type of algorithm that ______ predefined alignment is ______ for reasonably conserved sequences.

a) doesn’t require, more successful

b) requires, less successful

c) doesn’t require, relatively successful

d) requires, relatively successful View Answer

Answer: d

Explanation: The requirement for using this type of program is an appropriate set of homologous sequences that have to be similar enough to allow accurate alignment, but divergent enough to allow covariations to be detected. If this condition is not met, correct structures cannot be inferred.

9. The method that requires predefined alignment also depends on the quality of the input alignment. If there ______ errors in the alignment, covariation signals ______ detected.

a) are, will be

b) are, will not be

c) are not, will not be

d) are not, possibly will not be View Answer

Answer: b

Explanation: If there are errors in the alignment, covariation signals will not be detected. The selection of one single consensus structure is also a drawback because alternative and evolutionarily unconserved structures are not predicted. RNAalifold is an example of this type of program based on predefined aligned sequences.

10. Which of the following is true about RNAalifold?

a) Dynamic programming is not involved

b) Minimum free energy method is not used

c) Only minimum free energy is used

d) Covariation information is taken into consideration View Answer

Answer: d

Explanation: It is a program in the Vienna package. It uses a multiple sequence alignment as input to analyze covariation patterns on the sequences. A scoring matrix is created that combines minimum free energy and covariation information. Dynamic programming is used to select the structure that has the minimum energy for the whole set of aligned RNA sequences.

Performance Evaluation

1. Rigorously evaluating the performance of RNA prediction programs has traditionally been hindered by the dearth of three-dimensional structural information for RNA.

a) True

b) False View Answer

Answer: a

Explanation: The availability of recently solved crystal structures of the entire ribosome provides a wealth of structural details relating to diverse types of RNA molecules. The high resolution structural information can then be used as a benchmark for evaluating state-of-the-art RNA structure prediction programs in all categories.

2. If prediction accuracy can be represented using a ______, the ______ programs score roughly 20% to 60% depending on the length of the sequences.

a) multiple parameter, ab initio–based

b) single parameter, ab initio–based

c) multiple parameter, comparative–based

d) single parameter, comparative–based View Answer

Answer: b

Explanation: As mentioned, the scores depend on the length of the sequences. Generally speaking, the programs perform better for shorter RNA sequences than for longer ones.

3. For ______ RNA sequences, such as tRNA, some programs may be able to produce ______% accuracy.

a) small, 70

b) small, 40

c) large, 90

d) large, 75 View Answer

Answer: a

Explanation: The number of the percentage may vary but the qualitative idea is that for small RNA sequences, some programs may produce better accuracy. The major limitation for performance gains of this category appears to be dependence on energy parameters alone, which may not be sufficient to distinguish different structural possibilities of the same molecule.

4. The pre-alignment-independent programs fare ______ for predicting long sequences.

a) slight better

b) much better

c) a bit worse

d) much worse View Answer

Answer: d

Explanation: For small RNA sequences such as tRNA, both subtypes can achieve very high accuracy (up to 100%). This illustrates that the comparative approach is consistently more accurate than the ab initio one.

5. Based on recent benchmark comparisons, the comparative-type algorithms can reach an accuracy range of 20% to 80%.

a) True

b) False View Answer

Answer: a

Explanation: The results depend on whether a program is pre-alignment dependent or not. Most of the superior performance comes from pre-alignment-dependent programs such as RNAalifold.

6. In the comparative approach to RNA structure prediction, algorithms that do not use pre-alignment align multiple input sequences and infer a consensus structure.

a) True

b) False View Answer

Answer: a

Explanation: The alignment is produced using dynamic programming with a scoring scheme that incorporates sequence similarity as well as energy terms. Because the full dynamic programming for multiple alignment is computationally too demanding, currently available programs limit the input to two sequences.

7. In the comparative approach to RNA structure prediction, Foldalign is a web-based program for RNA alignment only.

a) True

b) False View Answer

Answer: b

Explanation: Foldalign is a web-based program for RNA alignment and structure prediction. The user provides a pair of unaligned sequences.

8. In comparative approach to RNA structure prediction, the Foldalign program doesn’t use the covariation information.

a) True

b) False View Answer

Answer: b

Explanation: The program uses a combination of Clustal and dynamic programming with a scoring scheme that includes covariation information to construct the alignment. A commonly conserved structure for both sequences is subsequently derived based on the alignment. To reduce computational complexity, the program ignores multi-branch loops and is only suitable for handling short RNA sequences.

9. In the comparative approach to RNA structure prediction, Dynalign is a ______ program.

a) Windows based

b) Fedora

c) UNIX

d) iOS based View Answer

Answer: c

Explanation: Dynalign is a UNIX program whose source code is free to download. Here, the user again provides two input sequences. The program calculates the possible secondary structures of each using a method similar to Mfold.

10. In the comparative approach to RNA structure prediction, in the Dynalign program, by comparing ______ from each sequence, a ______ structure common to both sequences is selected that serves as the basis for sequence alignment.

a) multiple alternative structures, lowest energy

b) single structure, lowest energy

c) single structure, highest energy

d) multiple alternative structures, highest energy View Answer

Answer: a

Explanation: The unique feature of this program is that it does not require sequence similarity and therefore can handle very divergent sequences. However, because of the computational complexity, the program only predicts small RNA sequences such as tRNA with reasonable accuracy.

Limitations of Prediction

1. Which of the following is incorrect about the RNA structure prediction?

a) Given the sequence, it provides an ab initio prediction of secondary structure

b) From the many possible choices of complementary sequences that can potentially base-pair, the compatible sets that provide the highest energy molecules are chosen

c) Structures with energies almost as stable as the most stable one may also be produced

d) Regions whose predictions are the most reliable can be identified from such an analysis

View Answer

Answer: b

Explanation: Stable structures are the ones with least or relatively quite low energy. Sequence variations found in related sequences may also be used to predict which base pairs are likely to be found in each of the molecules. One variation of RNA structure prediction methods will predict a set of sequences that are able to form a particular structure.

2. A type of RNA secondary structure prediction method takes into account patterns of base-pairing that are conserved during evolution of a given class of RNA molecules.

a) True

b) False View Answer

Answer: a

Explanation: Sequence positions that base-pair are found to vary at the same time during evolution of RNA molecules so that structural integrity is maintained. For example, if two positions G and C form a base pair in a given type of molecule, then sequences that have C and G reversed, or A and U or U and A at the corresponding positions, would be considered reasonable matches.

3. RNA secondary structure is composed primarily of triple-stranded RNA regions formed by folding the single-stranded molecule back twice on itself.

a) True

b) False View Answer

Answer: b

Explanation: RNA secondary structure is composed primarily of double-stranded RNA regions formed by folding the single-stranded molecule back on itself. To produce such double-stranded regions, a run of bases downstream in the RNA sequence must be complementary to another upstream run so that Watson–Crick base-pairing between the complementary nucleotides G/C and A/U (analogous to the G/C and A/T base pairs in DNA) can occur.

4. ______ wobble pairs may be produced in these double-stranded regions.

a) A/A

b) A/U

c) G/C

d) G/U

View Answer

Answer: d

Explanation: As in DNA, the G/C base pairs contribute the greatest energetic stability to the molecule, with A/U base pairs contributing less stability than G/C, and G/U wobble base pairs contributing the least. From the RNA structures that have been solved, these base pairs and a number of additional ones have been identified.

5. In predicting RNA secondary structure, some simplifying assumptions are usually made, such as: the ______ structure is similar to the ______.

a) most likely, energetically most unstable structure

b) most unlikely, energetically most stable structure

c) most likely, energetically most stable structure

d) least likely, energetically most stable structure View Answer

Answer: c

Explanation: The assumption here is that the most predicted or likely structure has to be similar to the energetically most stable structure. The abundance is not taken into consideration in this.

6. The second assumption in predicting RNA secondary structure is that the energy associated with any position in the structure is ______ influenced by local sequence and structure.

a) only

b) not at all

c) partially

d) never View Answer

Answer: a

Explanation: The energy associated with a particular base pair in a double-stranded region is assumed to be influenced only by the previous base pair and not by the base pairs farther down the double-stranded region or anywhere else in the structure. These energies can be reliably estimated by experimentation with small, synthetic RNA oligonucleotides, and the estimates have more recently been improved to include sequence dependence.

7. The third assumption in predicting RNA secondary structure is that the structure is assumed to be formed by ______ of the chain back on itself in a manner that ______.

a) crossing, produces knots

b) crossing, does not produce any knots

c) folding, produces knots

d) folding, does not produce any knots View Answer

Answer: d

Explanation: The best way of representing this requirement is to draw the sequence in a circular form. The paired bases are then joined by arcs. If the total structure with all predicted base pairs is to be free of knots, none of the arcs must cross.
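The no-crossing-arcs requirement translates directly into a check on pairs of base pairs: two arcs (i, j) and (k, l) cross exactly when i < k < j < l. A minimal sketch, with has_crossing as a hypothetical helper name:

```python
def has_crossing(pairs):
    """Return True if any two base pairs interleave as i < k < j < l,
    which is what crossing arcs look like in the circular drawing."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False
```

Nested pairs such as (0, 9) and (1, 8), or side-by-side pairs such as (0, 3) and (4, 7), do not cross; (0, 5) and (3, 8) do, and would indicate a knotted arrangement that this assumption excludes.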

8. Martinez (1984) made a list of possible double-stranded regions, and these regions were then given weights in proportion to their equilibrium constants, calculated by

a) the Boltzmann function [ exp (-∆G/RT2) ].

b) the Boltzmann function [ exp (-∆G/RT) ].

c) the Boltzmann function [ exp (-∆G/RT -T) ].

d) the Boltzmann function [ exp (∆G/RT) ]. View Answer

Answer: b

Explanation: Here, ∆G is the free energy of the region, R is the gas constant, and T is the temperature. The RNA molecule is folded by a Monte Carlo method in which one initial region is chosen at random from a weighted pool, similar to the method used in Gibbs sampling.
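The weighting and the random draw can be sketched as follows. This is not the original program, only an illustration of weighting by exp(-∆G/RT) and sampling in proportion to the weights; boltzmann_weight, choose_region, the units (kcal/mol), and the default temperature are all assumptions made for the example.

```python
import math
import random

R = 1.987e-3  # gas constant in kcal/(mol K)


def boltzmann_weight(delta_g, temperature=310.15):
    """Weight exp(-dG/RT) for a region with free energy delta_g (kcal/mol)."""
    return math.exp(-delta_g / (R * temperature))


def choose_region(regions, rng=random):
    """Pick one region name at random, with probability proportional to its
    Boltzmann weight. `regions` maps a region name to its free energy."""
    names = list(regions)
    weights = [boltzmann_weight(regions[name]) for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

A region with a more negative ∆G gets a larger weight and is therefore drawn more often, but less stable regions still have a nonzero chance of being chosen.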

9. In 1971, the energy associated with regions of secondary structure was first estimated by extrapolation from studies with small molecules, and an attempt was then made to predict which configurations of larger molecules were the most energetically stable.

a) True

b) False View Answer

Answer: a

Explanation: Energy estimates included the stabilizing energy associated with stacking base pairs in a double-stranded region and the destabilizing influence of regions that were not paired. Pipas and McMahon (1975) developed computer programs that listed all possible helical regions in tRNA sequences.

10. Nussinov and Jacobson (1980) were the first to design a precise and efficient algorithm for predicting secondary structure.

a) True

b) False View Answer

Answer: a

Explanation: The algorithm generates two scoring matrices. The first, M(i,j), keeps track of the maximum number of base pairs that can be formed in any interval i to j in the sequence. The second, K(i,j), keeps track of the base position k that is paired with j.
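A compact sketch in the spirit of this two-matrix description is given below. It is not the published code: the min_loop parameter (minimum number of unpaired bases enclosed by a pair), the use of -1 to mean "j is unpaired", and the function names are assumptions made for the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=0):
    """Fill M (maximum base pairs in interval i..j) and K (position k paired
    with j in the best solution for i..j, or -1 if j is left unpaired)."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    K = [[-1] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best, partner = M[i][j - 1], -1          # case: j unpaired
            for k in range(i, j - min_loop):          # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = M[i][k - 1] if k > i else 0
                    inner = M[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    if left + 1 + inner > best:
                        best, partner = left + 1 + inner, k
            M[i][j], K[i][j] = best, partner
    return M, K


def traceback(K, n):
    """Recover one optimal set of base pairs from K for a sequence of length n."""
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        k = K[i][j]
        if k == -1:
            stack.append((i, j - 1))
        else:
            pairs.append((k, j))
            stack.append((i, k - 1))
            stack.append((k + 1, j - 1))
    return sorted(pairs)
```

Calling M, K = nussinov(seq) and then traceback(K, len(seq)) returns one structure with the maximum number of pairs. Because the method maximizes pair count only, it ignores the energy differences between G–C, A–U, and G–U pairs discussed earlier.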

Minimum Free-Energy Method & Stochastic Context-Free Grammars

1. Which of the following is incorrect about the prediction of RNA secondary structure?

a) Every base is first compared to every other base by a type of analysis very similar to the dot matrix analysis

b) A row of matches in the RNA matrix indicates a succession of complementary nucleotides that can potentially form a double-stranded region

c) A row of matches in the RNA matrix indicates a failure of complementary nucleotides that can potentially form a double-stranded region

d) The sequence is listed across the top and down the side of the page, and G/C, A/U, and G/U base pairs are scored

View Answer

Answer: c

Explanation: The energy of each predicted structure is estimated by the nearest-neighbor rule: the negative base-stacking energies for each pair of bases in double-stranded regions are summed, and the estimated positive energies of destabilizing regions such as loops at the end of hairpins, bulges within hairpins, internal bulges, and other unpaired regions are added.

2. Through a single scoring matrix, evaluation of all the different possible configurations is done.

a) True

b) False View Answer

Answer: b

Explanation: To evaluate all the different possible configurations and to find the most energetically favorable, several types of scoring matrices are used. The complementary regions are evaluated by a dynamic programming algorithm to predict the most energetically stable molecule. The method is similar to the dynamic programming method used for sequence alignment.

3. The object is to find a diagonal row of matches that goes from upper left to lower right.

a) True

b) False View Answer

Answer: b

Explanation: The object is to find a diagonal row of matches that goes from upper right to lower left. In general, each matrix value is obtained by considering the minimum energy values, obtained by all previous complementary pairs, decreased by the stacking energy of any additional complementary base pairs or increased by the destabilizing energy associated with non-complementary bases.

4. The increase depends on the type and length of loop that is introduced by the non-complementary base pair, whether internal loop, bulge loop, or hairpin loop.

a) True

b) False View Answer

Answer: a

Explanation: This comparison of all possible matches and energy values is continued until all nucleotides have been compared. There is a pattern followed in comparing bases within the RNA molecule.

5. The sequence is listed down the first column of base comparisons’ table and free energy calculations’ table in the 5’→3’ orientation.

a) True

b) False View Answer

Answer: a

Explanation: The first four bases of the sequence are also listed in the first row of the tables in the 5’→3’ direction. Several complementary base pairs between the first and last four bases that could lead to secondary structure are shown in the tables.

6. A general theory for modeling strings of symbols, such as bases in DNA sequences, has been developed by linguists. There is a hierarchy of these so-called transformational grammars that deal with situations of increasing complexity.

a) True

b) False View Answer

Answer: a

Explanation: The application of these grammars to sequence analysis has been extensively discussed elsewhere. The context-free grammar is suitable for finding groups of symbols in different parts of the input sequence that thus are not in the same context.

7. _____ regions in sequences, such as those in RNA that will form secondary structures, are an example of such context-free sequences.

a) non-interlocking

b) non-Complementary

c) complementary

d) non-compatible View Answer

Answer: c

Explanation: Stochastic context-free grammars (SCFG) introduce uncertainty into the definition of such regions. It allows them to use alternative symbols as found in the evolution of RNA molecules.

8. The use of SCFGs in RNA secondary structure production analysis is in fact very similar to that of the covariance model, with the grammatical productions resembling the nodes in the ordered binary tree.

a) True

b) False View Answer

Answer: a

Explanation: As with hidden Markov models, the probability distribution of each production must be derived by training with known sequences. The algorithms used for training the SCFG and for aligning a sequence with the SCFG are somewhat different from those used with hidden Markov models, and the time and memory requirements are greater.

9. In a SCFG, each production of a non-terminal symbol has an associated probability for giving rise to the resulting product, and there are a set of productions, each giving a different result.

a) True

b) False View Answer

Answer: a

Explanation: For example, the production S1 →C S2 G could also be represented by 15 other base-pair combinations, and each of these has a corresponding probability. Thus, each production can be considered to be represented by a probability distribution over the possible outcomes.
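The idea that one production stands for a probability distribution over 16 base-pair outcomes can be sketched as follows. This is a toy illustration: the weights (10 for Watson-Crick pairs, 3 for G/U wobble pairs, 1 for everything else) are assumptions chosen for the example, whereas in a real SCFG they would be estimated by training on known sequences. The function names are hypothetical.

```python
import itertools
import random

BASES = "ACGU"
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_production_distribution():
    """Return P(left, right) for a production of the form S1 -> left S2 right."""
    weights = {}
    for left, right in itertools.product(BASES, repeat=2):
        if (left, right) in WATSON_CRICK:
            weights[(left, right)] = 10.0   # assumed weight, not trained
        elif (left, right) in WOBBLE:
            weights[(left, right)] = 3.0    # assumed weight, not trained
        else:
            weights[(left, right)] = 1.0    # assumed weight, not trained
    total = sum(weights.values())
    # Normalize so the 16 outcomes form a probability distribution.
    return {pair: w / total for pair, w in weights.items()}


def sample_pair(dist, rng):
    """Draw one (left, right) outcome according to the distribution."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]
```

The production S1 → C S2 G from the explanation corresponds to the single entry `("C", "G")` of this table; the other 15 entries are the alternative outcomes.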

10. The application of SCFGs to RNA secondary structure analysis is very similar in form to the probabilistic covariance models.

a) True

b) False View Answer

Answer: a

Explanation: For RNA, the symbols of the alphabet are A, C, G, and U. The context-free grammar establishes a set of rules called productions for generating the sequence from the alphabet, in this case an RNA molecule with sections that can base-pair and others that cannot base-pair.

MFOLD and the Use of Energy Plots

1. Originally, the FOLD program of _____ predicted _____ having the minimum free energy.

a) M. Zuker, two structures

b) M. Wunsch, two structures

c) M. Zuker, only one

d) M. Wunsch, only one View Answer

Answer: c

Explanation: However, changes in a single nucleotide can result in drastic changes in the predicted structure. A later version, called MFOLD, has improved prediction of non-base paired interactions and predicts several structures having energies close to the minimum free energy.

2. To find suboptimal structures, the dynamic programming method was modified.

a) True

b) False View Answer

Answer: a

Explanation: This was done to evaluate parts of a new scoring matrix in which the sequence is represented in two tandem copies on both the vertical and horizontal axes. The regions from i = 1 to n and j = 1 to n are used to calculate an energy V (i,j) for the best structure that includes an i,j base pair; this is called the included region.

3. In the programs related to FOLD, the second region, the _____ region, is used to calculate the energy of the _____ structure.

a) included, most likely

b) included, best

c) excluded, least likely

d) excluded, best View Answer

Answer: d

Explanation: A second region, the excluded region, is used to calculate the energy of the best structure that includes i,j but is not derived from the structure at i+1, j-1. After certain corrections are made, the difference between the included and excluded values is the most energetic structure that includes the base pair i, j.

4. In the programs related to FOLD, all complementary base pairs can be sampled in a fashion to determine which are present in a suboptimal structure.

a) True

b) False View Answer

Answer: a

Explanation: All complementary base pairs can be sampled in a fashion to determine which are within a certain range of the optimal one. An energy dot plot is produced showing the locations of alternative base pairs that produce the most stable or suboptimally stable structures.

5. The program may be instructed to find structures within a certain percentage of the minimum free energy.

a) True

b) False View Answer

Answer: a

Explanation: Parameter d provides a measure of similarity between two structures. When MFOLD is established on a suitable local host machine, the window is interactive, and clicking a part of the display will lead to program output of the corresponding structure.

6. Three scores, Pnum (i), Hnum (i,j), and Ssum, have been derived to assist with a determination of the reliability of a secondary structure prediction for a particular base i or a base pair i,j.

Pnum _____

Which of the following is not a correct blank?

a) is the total number of energy dots regardless of color in the ith row and ith column of the energy dot plot

b) is the total number of energy dots considering the color in the ith row and ith column of the energy dot plot

c) represents in an unfiltered dot plot

d) represents the number of base pairs View Answer

Answer: b

Explanation: It represents the number of base pairs that the ith base can form with all other base pairs in structures within the defined energy range. The lower this value, the better defined or “well determined” the local structure because there are few competitive foldings.

7. Three scores, Pnum (i), Hnum (i,j), and Ssum, have been derived to assist with a determination of the reliability of a secondary structure prediction for a particular base i or a base pair i,j.

Hnum _____

Which of the following is not a correct blank?

a) is the sum of Pnum(i) and Pnum(j) -1

b) is the sum of Pnum(i) and Pnum(j) + 1

c) is the total number of dots in the ith row and jth column

d) represents the total number of base pairs with the ith or jth base in the predicted structures

View Answer

Answer: b

Explanation: The Hnum for a double stranded region is the average Hnum value for the base pairs in that helix. The lower this number, the more well determined the double-stranded region.

8. Three scores, Pnum (i), Hnum (i,j), and Ssum, have been derived to assist with a determination of the reliability of a secondary structure prediction for a particular base i or a base pair i,j.

Ssum _____

Which of the following is not a correct blank?

a) is also called ss-count

b) is the number of foldings in which base i is single-stranded divided by m, the number of foldings

c) is the number of foldings in which base i is single-stranded multiplied by m, the number of foldings

d) gives the probability that base i is single-stranded View Answer

Answer: c

Explanation: If Ssum is approximately 1, then base i is probably in a single-stranded region, and if Ssum is approximately 0, then base i is probably not in such a region. This reliability information has been used to annotate output files of MFOLD and other RNA display programs.
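The three reliability scores from questions 6 to 8 can be sketched from a list of predicted foldings. This is a minimal illustration under stated assumptions: each folding is given as a set of 0-based (i, j) base pairs, the energy dot plot is taken to be the union of those pairs, and the function name `reliability_scores` is hypothetical rather than part of MFOLD.

```python
def reliability_scores(foldings, n):
    """Return (pnum, hnum, ss_count) for n bases and a list of foldings."""
    # Every distinct base pair seen in any folding is one dot in the plot.
    dots = set()
    for pairs in foldings:
        dots.update((min(i, j), max(i, j)) for i, j in pairs)

    # Pnum(i): number of dots in the ith row and ith column.
    pnum = [0] * n
    for i, j in dots:
        pnum[i] += 1
        pnum[j] += 1

    # Hnum(i, j) = Pnum(i) + Pnum(j) - 1 (the dot (i, j) itself is counted once).
    hnum = {(i, j): pnum[i] + pnum[j] - 1 for i, j in dots}

    # ss-count: fraction of the m foldings in which base b is single-stranded.
    m = len(foldings)
    ss_count = []
    for b in range(n):
        unpaired = sum(1 for pairs in foldings if all(b not in p for p in pairs))
        ss_count.append(unpaired / m)
    return pnum, hnum, ss_count
```

A base with a low `pnum` value and a pair with a low `hnum` value have few competing alternatives, which matches the "well determined" reading given in the explanations above.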

9. A limitation of the Zuker method and other methods (Nakaya et al. 1995) for computing suboptimal RNA structures is that they do not compute all the structures within a given energy range of the minimum free-energy structure.

a) True

b) False View Answer

Answer: a

Explanation: For example, no alternative structures are produced that have the absence of base pairs in the best structure, and, if two substructures are joined by a stretch of unpaired bases, no structures are produced that are suboptimal for both structures. These factors limit the number of alternative structures predicted compared to known variations based on sequence variations in tRNAs.

10. These limitations of the Zuker method and other methods have been largely overcome by using an algorithm originally described by Waterman and Byers (1985) for finding sequence alignments within a certain range of the optimal one by modifications of the trace-back procedure used in dynamic programming.

a) True

b) False View Answer

Answer: a

Explanation: This method efficiently calculates a large number of alternative structures, up to a very large number, within a given energy range of the minimum free-energy structure. The method has been used to demonstrate that natural tRNA sequences can form many alternative structures which are close to the minimum free-energy structure and that base modification plays a major role in this energetic stability.

Searching Genomes for RNA & RNA Structure Modeling Applications

1. _____ molecules can simply be identified based on their sequence similarity with already-known sequences.

a) Larger, less conserved

b) Larger, highly conserved

c) smaller, highly conserved

d) shorter, highly conserved View Answer

Answer: b

Explanation: For smaller sequences with more sequence variation, this method does not work. A number of methods for finding small RNA genes have been described and are available on the Web. A major problem with these methods in searches of large genomes is that a small false positive rate becomes quite unacceptable because there are so many false positives to check out.

2. One of the first methods used to find tRNA genes was to search for sequences that are complementary and can fold into a knot like the three found in tRNAs.

a) True

b) False View Answer

Answer: b

Explanation: One of the first methods used to find tRNA genes was to search for sequences that are self-complementary and can fold into a hairpin like the three found in tRNAs (Staden 1980). It was through these regions of self-complementarity that tRNA genes could first be found.
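A search for self-complementary regions that could fold into a hairpin can be sketched with a simple scan. This is an illustrative sketch only, not the Staden (1980) method or tRNAscan: the function names, the fixed stem length, the loop-length range, and the requirement of perfect complementarity are all assumptions made for the example, and the input is treated as a DNA string.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(s):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def find_hairpins(seq, stem=4, min_loop=3, max_loop=8):
    """Return (start, loop_length) for every perfect stem of length `stem`."""
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        # The right arm of the stem must be the reverse complement of the left arm.
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            start = i + stem + loop
            if start + stem > len(seq):
                break
            if seq[start:start + stem] == target:
                hits.append((i, loop))
    return hits
```

For a sequence such as `"TTGGACAAAAAGTCCTT"`, the left arm `GGAC` at position 2 and the right arm `GTCC` are reverse complements separated by a five-base loop. A practical tRNA finder would also tolerate mismatches and check for several such arms at the expected spacing.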

3. Fichant and Burks (1991) described a program, tRNAscan, that searches a genomic sequence with a sliding window searching simultaneously for matches to a set of invariant bases and conserved self-complementary regions in tRNAs with an accuracy of 97.5%.

a) True

b) False View Answer

Answer: a

Explanation: A method for finding the RNA polymerase III transcriptional control regions of tRNA genes, using a scoring matrix derived from known control regions, has also been described and is likewise very accurate. Finally, Lowe and Eddy (1997) have devised a search algorithm, tRNAscan-SE, that uses a combination of three methods to find tRNA genes in genomic sequences: tRNAscan, the Pavesi algorithm, and the COVELS program based on sequence covariance analysis (Eddy and Durbin 1994). This method is reportedly 99–100% accurate with an extremely low rate of false positives.

4. The probabilistic model was used to identify small nucleolar (sno) RNAs in the yeast genome that methylate ribosomal RNA.

a) True

b) False View Answer

Answer: a

Explanation: The model is not used to search genomic sequences directly. Instead, a list of candidate sequences is first found by searching for patterns that match the sequences in the model (Lowe and Eddy 1999).

5. The probability model mentioned above was a hybrid combination of HMMs and SCFGs trained on sno RNAs.

a) True

b) False View Answer

Answer: a

Explanation: These RNAs vary sufficiently in sequence and structure that they are not found by straightforward similarity searches. The RNAs found were shown to be sno RNAs by insertional mutagenesis.

6. Which of the following is untrue regarding RNAstructure?

a) RNAstructure 4.6 is a Windows implementation of the Zuker algorithm

b) It includes additional options for other folding algorithms and incorporation of experimental data

c) The authors of RNAstructure collaborate very closely with the Turner laboratory and keep the most up-to-date thermodynamic parameters

d) The OligoWalk program cannot be used for siRNA design View Answer

Answer: d

Explanation: Two unique ways of incorporating experimental data in RNA folding are Dynalign and chemical modification. The Dynalign program computes the lowest free-energy sequence alignment and secondary structure common to two RNA sequences.

7. Which of the following is untrue about Vienna RNA Websuite?

a) It introduced the Wuchty algorithm, developed applications of the McCaskill algorithm

b) It also offers a wide variety of algorithms and functions

c) The Wuchty algorithm generates a small but complete set of suboptimal structures

d) The Wuchty algorithm computes some possible tertiary structures within a narrow free-energy range

View Answer

Answer: d

Explanation: The Wuchty algorithm computes all possible secondary structures within a narrow free-energy range. The Wuchty algorithm generates a small but complete set of suboptimal structures that may include some very different secondary structures but also very many highly similar structures. However, structures containing more than one suboptimal region may occur in the Wuchty set of structures but would be absent if the Zuker method for sampling suboptimal structures were used.

8. Which of the following is untrue about the Sfold algorithm?

a) It uses a unique algorithm to aid in the design of siRNA

b) The algorithm combines thermodynamic stabilities, calculations of target accessibility, and empirical rules

c) The website offers specialized programs for the design of siRNA, antisense RNA, trans-cleaving RNA, and mRNA-microRNA interactions

d) The website doesn’t offer a general program for statistically sampling suboptimal RNA structures

View Answer

Answer: d

Explanation: The algorithm uses a partition function calculation and then groups suboptimal structures by similarity. The centroid structure is the most-representative structure that is closest in similarity to all the other structures.

9. If the centroid structure is different from the minimum free-energy structure, the centroid structure is often closer to the phylogenetic prediction and contains fewer base pairs, or fewer false-positive base pair predictions, than the minimum free-energy prediction.

a) True

b) False View Answer

Answer: a

Explanation: The point is to show a structure that represents a group of structures rather than a single predicted structure. Many long RNA sequences, such as viral genomes or mRNA, may not have a single structure but instead have a dynamic structure that has some conserved features but also varies and changes, and these many conformations may all exist simultaneously in the cell.

10. The ILM program uses an iterative loop matching algorithm to maximize base pairs and allows pseudoknots to form by allowing base pairs to be added or removed in successive rounds.

a) True

b) False View Answer

Answer: a

Explanation: The Nussinov algorithm, or maximum loop matching algorithm, is the basic framework for generating a structure with the most possible base pairs. The base pairs are ranked using both thermodynamic parameters and covariation data for aligned sequences. ILM requires the RnaViz program to visualize the RNA secondary structure with pseudoknots.

9. Questions & Answers on Genome Mapping, Assembly and Comparison

Genome Mapping

1. Which of the following is untrue about the genome mapping?

a) It doesn’t lead to an understanding of genome structure

b) It involves identifying relative locations of genes

c) It involves identifying traits

d) It involves identifying mutations View Answer

Answer: a

Explanation: The first step to understanding a genome structure is through genome mapping, which is a process of identifying relative locations of genes, mutations or traits on a chromosome. A low-resolution approach to mapping genomes is to describe the order and relative distances of genetic markers on a chromosome.

2. Genetic markers are _____ portions of a _____ whose inheritance patterns can be followed.

a) unidentifiable, genes

b) unidentifiable, chromosome

c) identifiable, chromosome

d) identifiable, genes View Answer

Answer: c

Explanation: For many eukaryotes, genetic markers represent morphologic phenotypes. In addition to genetic linkage maps, there are also other types of genome maps such as physical maps and cytologic maps, which describe genomes at different levels of resolution.

3. Genetic linkage maps, also called genetic maps, identify the relative positions of genetic markers on a chromosome and are based on how frequently the markers are inherited together.

a) True

b) False View Answer

Answer: a

Explanation: The rationale behind genetic mapping is that the closer the two genetic markers are, the more likely it is that they are inherited together and are not separated in a genetic crossing event. The distance between the two genetic markers is measured in centiMorgans (cM), which is the frequency of recombination of genetic markers.

4. One centiMorgan is defined as _____ percent of the total recombination events.

a) one

b) ten

c) 0.1

d) 0.01

View Answer

Answer: a

Explanation: One centiMorgan is one percent of the total recombination events when separation of the two genetic markers is observed in a genetic crossing experiment. One centiMorgan is approximately 1 Mb in humans and 0.5 Mb in Drosophila.
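The arithmetic behind this definition can be sketched in a few lines. The function names are hypothetical; the Mb-per-cM figures are the rough values quoted in the explanation above and should be read as approximations, since the true ratio varies along a chromosome.

```python
# Approximate physical size of one centiMorgan, as quoted in the explanation.
MB_PER_CM = {"human": 1.0, "drosophila": 0.5}


def map_distance_cm(recombinants, total_offspring):
    """Genetic distance in cM: recombinant offspring as a percentage of all offspring."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinants / total_offspring


def approx_physical_mb(distance_cm, organism):
    """Rough physical distance implied by a genetic distance."""
    return distance_cm * MB_PER_CM[organism]
```

For example, 10 recombinants among 1,000 offspring give a recombination frequency of 1%, that is, a distance of 1 cM between the two markers.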

5. Physical maps are maps of locations of identifiable landmarks on a genomic DNA _____ inheritance patterns.

a) remotely related to

b) related to

c) regardless of

d) associated with View Answer

Answer: c

Explanation: The distance between genetic markers is measured directly as kilobases (Kb) or megabases (Mb). Because the distance is expressed in physical units, it is more accurate and reliable than centiMorgans used in genetic maps.

6. Physical maps are constructed by using a chromosome walking technique.

a) True

b) False View Answer

Answer: a

Explanation: It uses a number of radiolabeled probes to hybridize to a library of DNA clone fragments. By identifying overlapping clones probed by common probes, a relative order of the cloned fragments can be established.

7. Which of the following is untrue about cytologic maps?

a) They cannot be directly observed under microscope

b) They refer to banding patterns

c) They can be viewed on stained chromosomes

d) They can be directly observed under microscope View Answer

Answer: a

Explanation: Cytologic maps refer to banding patterns seen on stained chromosomes, which can be directly observed under a microscope. The observable light and dark bands are the visually distinct markers on a chromosome.

8. Cytologic maps can be considered to be of _____ resolution and hence somewhat _____ physical maps.

a) very high, inaccurate

b) very low, accurate

c) very high, accurate

d) very low, inaccurate View Answer

Answer: d

Explanation: The banding patterns, however, are not always constant and are subject to change depending on the extent of chromosomal contraction. Thus, cytologic maps can be considered to be of very low resolution and hence somewhat inaccurate physical maps. The distance between two bands is expressed in relative units (Dustin units).

9. In medical applications, the ultimate goal of gene mapping is to clone disease genes.

a) True

b) False View Answer

Answer: a

Explanation: Once the gene is cloned, the determination of DNA sequence is possible. Further, the study of target protein is carried out.

10. One of the fundamental events that occur in meiosis is crossing over in which homologous chromosomes exchange segments causing a reshuffling of genes.

a) True

b) False View Answer

Answer: a

Explanation: If genes are far apart on the same chromosome, it is likely that recombination occurs. Conversely, if they are very close together, they are more likely to be transmitted as a block.

Genome Sequencing

1. The _____ resolution genome map is the genomic DNA sequence that can be considered as a type of _____ map describing a genome at the single base-pair level.

a) highest, physical

b) lowest, physical

c) highest, cytological

d) lowest, cytological View Answer

Answer: a

Explanation: Cytological maps have quite low resolution, when compared to physical maps. They can be viewed under microscopes as well.

2. Which of the following is untrue about DNA sequencing?

a) It is now routinely carried out using the Sanger method

b) This doesn’t make use of DNA polymerases

c) This involves synthesis of DNA chains of varying length

d) The DNA synthesis is stopped by adding dideoxynucleotides View Answer

Answer: b

Explanation: DNA polymerases are used to synthesize DNA chains. The dideoxynucleotides are labeled with fluorescent dyes, which terminate the DNA synthesis at positions containing all four bases, resulting in nested fragments that vary in length by a single base. When the labeled DNA is subjected to electrophoresis, the banding patterns in the gel reveal the DNA sequence.

3. In DNA sequencing, the fluorescent traces of the DNA sequences are read by a computer program that assigns bases for each peak in a chromatogram.

a) True

b) False View Answer

Answer: a

Explanation: This process is called base calling. Automated base calling may generate errors and human intervention is often required to correct the sequence calls.

4. The shotgun approach _____ sequences clones from _____ of cloned DNA.

a) randomly, one end

b) randomly, both ends

c) specifically, both ends

d) specifically, one end View Answer

Answer: b

Explanation: There are two major strategies for whole genome sequencing: the shotgun approach and the hierarchical approach. The shotgun approach generates a large number of sequenced DNA fragments. The number of random fragments has to be very large, so large that the DNA fragments overlap sufficiently to cover the entire genome.

5. The shotgun approach does not require knowledge of physical mapping of the clone fragments, but rather a robust computer assembly program to join the pieces of random fragments into a single, whole-genome sequence.

a) True

b) False View Answer

Answer: a

Explanation: Generally, the genome has to be redundantly sequenced in such a way that the overall length of the fragments covers the entire genome multiple times. This is designed to minimize sequencing errors and ensure correct assembly of a contiguous sequence. Overlapping sequences with an overall length of six to ten times the genome size are normally obtained for this purpose.
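The coverage figure mentioned here is simple arithmetic: total sequenced bases divided by genome size. The sketch below illustrates it; the function names are hypothetical, and the read length used in the example (500 bases) and the genome size are assumed values, not data from a real project.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average fold coverage: total bases sequenced divided by genome length."""
    return n_reads * read_length / genome_size


def reads_needed(genome_size, read_length, target_coverage):
    """Smallest number of reads whose combined length reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)
```

For a 1 Mb genome and 500-base reads, `reads_needed(1_000_000, 500, 8)` gives the number of reads required for eightfold coverage, which lies within the six- to tenfold range mentioned in the explanation.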

6. Despite the multiple coverage, sometimes certain genomic regions remain unsequenced, mainly owing to cloning difficulties.

a) True

b) False View Answer

Answer: a

Explanation: In such cases, the remaining gap sequences can be obtained by extending sequences from regions of known genomic sequence using a more traditional PCR technique, which requires custom primers and performs genome walking in a stepwise fashion. This step of genome sequencing is also known as finishing, and it is followed by computational assembly of all the sequence data into a final complete genome.

7. The hierarchical genome sequencing approach is _____.

a) entirely dissimilar to the shotgun approach

b) dissimilar to the shotgun approach

c) similar to the shotgun approach, but on a larger scale

d) similar to the shotgun approach, but on a smaller scale View Answer

Answer: d

Explanation: In this approach, the chromosomes are initially mapped using the physical mapping strategy. Longer fragments of genomic DNA (100 to 300 kb) are obtained and cloned into a high-capacity bacterial vector called a bacterial artificial chromosome (BAC).

8. In the hierarchical genome sequencing approach, based on the results of _____ mapping, _____ of the BAC clones on a chromosome can be determined.

a) physical, the locations and orders

b) physical, only the locations

c) cytological, only the locations

d) physical, only the orders View Answer

Answer: a

Explanation: By successively sequencing adjacent BAC clone fragments, the entire genome can be covered. The complete sequence of each individual BAC clone can be obtained using the shotgun approach. Overlapping BAC clones are subsequently assembled into an entire genome sequence.

9. The hierarchical approach is _____ and _____ than the shotgun approach because it involves an initial clone-based physical mapping step.

a) slower, less costly

b) faster, more costly

c) faster, less costly

d) slower, more costly View Answer

Answer: d

Explanation: During the era of human genome sequencing, there was a heated debate on the merits of each of the two strategies. Although the hierarchical approach is slower and more costly, once the map is generated, assembly of the whole genome becomes relatively easy and less error prone.

10. The whole genome shotgun approach can produce a draft sequence very rapidly because it is based on the direct sequencing approach.

a) True

b) False View Answer

Answer: a

Explanation: However, it is computationally very demanding to assemble the short random fragments. Although the approach has been successfully employed in sequencing small microbial genomes, for a complex eukaryotic genome that contains high levels of repetitive sequences, such as the human genome, the full shotgun approach becomes less accurate and tends to leave more “holes” in the final assembled sequence than the hierarchical approach.

Current genome sequencing of large organisms often uses a combination of both approaches.

Genome Sequence Assembly

1. The major challenges in genome assembly are sequence errors, contamination by bacterial vectors, and repetitive sequence regions.

a) True

b) False View Answer

Answer: a

Explanation: Sequence errors can often be corrected by drawing a consensus from an alignment of multiple overlapped sequences. Bacterial vector sequences can be removed using filtering programs prior to assembly. To overcome the problem of sequence repeats, programs such as RepeatMasker can be used to detect and mask repeats. Additional constraints on the sequence reads can be applied to avoid misassembly caused by repeat sequences.

2. When a sequence is generated from _____ ends of a single clone, the distance between the two opposing fragments of a clone is fixed to _____, meaning that they are always separated by a distance defined by a _____ length (normally 1,000 to 9,000 bases).

a) both, an uncertain range, clone

b) one, an uncertain range, clone

c) both, a certain range, clone

d) both, a certain range, gene View Answer

Answer: c

Explanation: A commonly used constraint to avoid errors caused by sequence repeats is the so-called forward–reverse constraint. When the constraint is applied, even when one of the fragments has a perfect match with a repetitive element outside the range, it cannot be moved to that location to cause misassembly.
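The effect of the forward–reverse constraint can be sketched as a filter on candidate placements. This is an illustration only: the function names and the coordinate representation (one start position per read) are assumptions, and the default limits simply reuse the 1,000 to 9,000 base range quoted in the question.

```python
def satisfies_constraint(forward_pos, reverse_pos, min_sep=1000, max_sep=9000):
    """True if the two end reads of a clone lie within the expected clone length."""
    return min_sep <= abs(reverse_pos - forward_pos) <= max_sep


def filter_placements(forward_pos, reverse_candidates, min_sep=1000, max_sep=9000):
    """Keep only candidate positions of the reverse read that respect the constraint."""
    return [
        pos for pos in reverse_candidates
        if satisfies_constraint(forward_pos, pos, min_sep, max_sep)
    ]
```

If the reverse read matches equally well at position 12,500 and at a repeat copy at position 480,000 while the forward read sits at 10,000, only the first placement survives the filter, so the repeat copy cannot pull the read into the wrong location.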

3. Which of the following is untrue about base calling and assembly programs?

a) The first step toward genome assembly includes deriving base calls

b) The first step toward genome assembly includes assigning associated quality scores

c) One of the steps is to assemble the sequence reads into contiguous sequences

d) There is no identification of overlaps between sequence fragments View Answer

Answer: d

Explanation: One of the steps includes identifying overlaps between sequence fragments, assigning the order of the fragments and deriving a consensus of an overall sequence.

Assembling all shotgun fragments into a full genome is a computationally very challenging step. There are a variety of programs available for processing the raw sequence data.

4. Which of the following is incorrect?

a) Initial DNA sequencing reactions generate short sequence reads from DNA clones

b) To assemble a whole genome sequence, these short fragments are joined to form larger fragments

c) The average length of the reads is about 50 bases

d) A number of overlapping contigs can be further merged to form scaffolds View Answer

Answer: c

Explanation: The average length of the reads is about 500 bases. To assemble a whole genome sequence, these short fragments are joined to form larger fragments after removing overlaps.

These longer, merged sequences are termed contigs, which are usually 5,000 to 10,000 bases long. A number of overlapping contigs can be further merged to form scaffolds (30,000–50,000 bases, also called supercontigs), which are unidirectionally oriented along a physical map of a chromosome.
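Joining short reads into a longer contig by removing overlaps can be sketched with a greedy merge. This is a toy illustration, not the algorithm of any particular assembler: it assumes error-free reads on the same strand, exact suffix-prefix overlaps, and a hypothetical minimum overlap parameter `min_len`.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: the rest stay as separate contigs
        # Join the two reads, keeping the overlapping region only once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

Given the reads `ATGGCGT`, `GCGTACC`, and `TACCTGA`, the sketch merges them into one contig. Real data require alignment that tolerates sequencing errors and handling of repeats, which is why a greedy exact-match merge like this one is only a teaching device.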

5. Which of the following is incorrect about Phred?

a) It is a UNIX program

b) It doesn’t give a probability score in output

c) It is used for base calling

d) It uses a Fourier analysis to resolve fluorescence traces and predict actual peak locations of bases

View Answer

Answer: b

Explanation: It also gives a probability score for each base call that may be attributable to error. The commonly accepted score threshold is twenty, which corresponds to a 1% chance of error.

The higher the score, the better the quality of the sequence reads. If the score value falls below the threshold, human intervention is required.
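The link between a quality score of twenty and a 1% chance of error follows from the Phred definition Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below illustrates the conversion; the function names are hypothetical and are not part of the Phred program.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def low_quality_positions(scores, threshold=20):
    """Indices of base calls whose score falls below the threshold."""
    return [i for i, q in enumerate(scores) if q < threshold]
```

A score of 20 corresponds to an error probability of 0.01, a score of 30 to 0.001, and so on; `low_quality_positions` lists the calls that would need review under the threshold described above.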

6. Which of the following is incorrect about Phrap?

a) It aligns individual fragments in a pairwise fashion using the Smith–Waterman algorithm

b) It doesn’t take input from Phred

c) It is used for sequence assembly

d) It is a UNIX program View Answer

Answer: b

Explanation: It takes Phred base-call files with quality scores as input and aligns individual fragments in a pairwise fashion using the Smith–Waterman algorithm. The base quality information is taken into account during the pairwise alignment. After all the pairwise sequence similarity is identified, the program performs assembly by progressively merging sequence pairs with decreasing similarity scores while removing overlapped regions. Consensus contigs are derived after joining all possible overlapped reads.

7. VecScreen is primarily aimed at sequence assembly.

a) True

b) False View Answer

Answer: b

Explanation: VecScreen is a web-based program that helps detect contaminating bacterial vector sequences. It scans an input nucleotide sequence and compares it with a database of known vector sequences by using the BLAST program.

8. Which of the following is incorrect about EULER?

a) It is an assembly algorithm

b) It uses an Eulerian Superpath approach, which is a polynomial algorithm

c) In this approach, a sequence fragment is broken down to tuples of five nucleotides

d) The tuples are distributed in a diagram with numerous nodes that are all interconnected

View Answer

Answer: c

Explanation: The tuples are converted to binary vectors in the nodes. By using a Viterbi algorithm, the shortest path among the vectors can be found, which is the best way to connect the tuples into a full sequence. Because this approach does not directly rely on detecting overlaps, it may be advantageous in assembling sequences with repeat motifs.

9. TIGR Assembler is a UNIX program from TIGR for assembly of large shotgun sequence fragments.

a) True

b) False View Answer

Answer: a

Explanation: It treats the sequence input as clean reads without consideration of the sequence quality. A main feature of the program is the application of the forward–reverse constraints to avoid misassembly caused by sequence repeats. The sequence alignment in the assembly stage is performed using the Smith–Waterman algorithm.

10. Which of the following is incorrect about ARACHNE?

a) It accepts base calls with associated quality scores assigned by Phred as input

b) It is a free UNIX program

c) It is for the assembly of whole-genome shotgun reads

d) It doesn’t involve heuristic approach View Answer

Answer: d

Explanation: Its unique features include using a heuristic approach similar to FASTA to align overlapping fragments, evaluating alignments using statistical scores, correcting sequencing errors based on multiple sequence alignment, and using forward–reverse constraints. It accepts base calls with associated quality scores assigned by Phred as input and produces scaffolds or a fully assembled genome.

Genome Annotation

1. The genome annotation process involves two steps: gene prediction and functional assignment.

a) True

b) False View Answer

Answer: a

Explanation: Before the assembled sequence is deposited into a database, it has to be analyzed for useful biological features. The genome annotation process provides comments for the features.

2. Which of the following is incorrect regarding gene annotation?

a) The gene annotation of the human genome employs a combination of theoretical prediction and experimental verification

b) Gene structures are first predicted by ab initio exon prediction programs

c) The predicted genes are compared with experimentally determined cDNA and EST sequences

d) The pairwise alignment programs are not involved View Answer

Answer: d

Explanation: The predictions are verified by BLAST searches against a sequence database. The predicted genes are further compared with experimentally determined cDNA and EST sequences using pairwise alignment programs such as GeneWise, Spidey, SIM4, and EST2Genome.

3. Which of the following is incorrect regarding gene ontology?

a) It exists because there is a need to standardize protein functional descriptions

b) It uses a limited vocabulary to describe molecular functions

c) Biological processes are not described though

d) The cellular components are described using limited vocabulary View Answer

Answer: c

Explanation: The controlled vocabulary is organized such that a protein function is linked to the cellular function through a hierarchy of descriptions with increasing specificity. The top of the hierarchy provides an overall picture of the functional class, whereas the lower level in the hierarchy specifies more precisely the functional role. This way, protein functionality can be defined in a standardized and unambiguous way.

4. Which of the following is incorrect regarding gene ontology?

a) There is standardization of the names and activities

b) There is no standardization of associated pathways

c) It provides consistency in describing overall protein functions

d) It facilitates grouping of proteins of related functions View Answer

Answer: b

Explanation: A GO description of a protein provides three sets of information: biological process, cellular component, and molecular function, each of which uses a unique set of non-overlapping vocabularies. The standardization of the names, activities, and associated pathways provides consistency in describing overall protein functions.

5. Which of the following is incorrect regarding Automated Genome Annotation?

a) It exists because of the need to develop fast and automated methods to annotate the genomic sequences

b) The automated approach relies on homology detection

c) The automated approach doesn’t rely on heuristic sequence similarity searching

d) Automation brings speed in gene annotation process View Answer

Answer: c

Explanation: The automated approach relies on homology detection, which is essentially heuristic sequence similarity searching. If a newly sequenced gene or its gene product has significant matches with a database sequence beyond a certain threshold, a transfer of functional assignment takes place. In addition to sequence matching at the full length, detection of conserved motifs often offers additional functional clues.

6. Conserved functional sites can be identified by profile and hidden Markov model-based motif and domain search tools such as SMART and InterPro.

a) True

b) False View Answer

Answer: a

Explanation: Detecting remote homologs typically involves combined searches of protein motifs and domains and prediction for secondary and tertiary structures. The prediction can also be performed using structure-based approaches such as threading and fold recognition.

7. The remote homology detection helps to shed light on the possible functions of the proteins that previously have no functional information at all.

a) True

b) False View Answer

Answer: a

Explanation: The bioinformatic analysis can spur an important advance in knowledge in many cases. Some hypothetical proteins, because of their novel structural folds, still cannot be predicted even with the advanced bioinformatics approaches and remain challenges for both experimental and computational work.

8. Which of the following is incorrect regarding genome economy?

a) It is a phenomenon of synthesizing more proteins from fewer genes

b) This is a major strategy that eukaryotic organisms use to achieve a myriad of genotypic diversities only

c) This is a major strategy that eukaryotic organisms use to achieve a myriad of phenotypic diversities

d) There are numerous underlying genetic mechanisms to help account for genome economy

View Answer

Answer: b

Explanation: A major mechanism responsible for the protein diversity is alternative splicing, which refers to the splicing event that joins different exons from a single gene to form different transcripts. A related mechanism, known as exon shuffling, which joins exons from different genes to generate more transcripts, is also common in eukaryotes. It is known that, in humans, about two thirds of the genes exhibit alternative splicing and exon shuffling during expression, generating 90% of the total proteins.

9. In some circumstances, one mRNA transcript can lead to the translation of more than one protein.

a) True

b) False View Answer

Answer: a

Explanation: For example, human dentin phosphoprotein and dentin sialoprotein are proteins involved in tooth formation. An mRNA transcript that includes coding regions from both proteins is translated into a precursor protein that is cleaved to produce two different mature proteins.

10. Which of the following is incorrect regarding GeneQuiz?

a) It is a web server for protein DNA annotation

b) It is a web server for protein sequence annotation

c) It compares a query sequence against databases using BLAST and FASTA to identify homologs with high similarities

d) It performs domain analysis using the PROSITE and Blocks databases View Answer

Answer: a

Explanation: It performs domain analysis using the PROSITE and Blocks databases as well as analysis of secondary structures and super-secondary structures that includes prediction of coiled coils and transmembrane helices. Multiple search and analysis results are compiled to produce a summary of protein function with an assigned confidence level (clear, tentative, marginal, and negligible).

Comparative Genomics

1. Which of the following is untrue about comparative genomics?

a) It is comparison of whole genomes from different organisms

b) It includes comparison of gene number, gene location, and gene content from these genomes

c) It provides insights into the mechanism of genome evolution and gene transfer among genomes

d) It doesn’t help to reveal the extent of conservation among genomes View Answer

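Answer: d

Explanation: Comparative genomics does help to reveal the extent of conservation among genomes, which in turn provides insights into the mechanism of genome evolution and gene transfer among genomes. It helps to understand the pattern of acquisition of foreign genes through lateral gene transfer. It also helps to reveal the core set of genes common among different genomes, which should correspond to the genes that are crucial for survival. This knowledge can be potentially useful in future metabolic pathway engineering.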

2. Which of the following is untrue about Whole Genome Alignment?

a) This helps to reveal the presence of conserved functional elements

b) It doesn’t help to understand sequence conservation between genomes

c) It can be accomplished through direct genome comparison or genome alignment

d) The alignment at the genome level is fundamentally no different from the basic sequence alignment

View Answer

Answer: b

Explanation: Regular alignment programs tend to be error prone and inefficient when dealing with long stretches of DNA containing hundreds or thousands of genes. Another challenge of genome alignment is effective visualization of alignment results. Because it is obviously difficult to sift through and make sense of the extremely large alignments, a graphical representation is a must for interpretation of the result.

3. Which of the following is untrue about LAGAN?

a) It stands for Limited Area Global Alignment of Nucleotides

b) It is a web-based program designed for pairwise alignment of small fragments of genomes only

c) It first finds anchors between two genomic sequences using an algorithm that identifies short, exactly matching words

d) Regions that have high density of words are selected as anchors View Answer

Answer: b

Explanation: LAGAN is a web-based program designed for pairwise alignment of large genomes. The unique feature of this program is that it is able to take into account degeneracy of the genetic codes and is therefore able to handle more distantly related genomes.

4. A minimal genome is a ______ set of genes required for maintaining a free-living cellular organism.

a) maximum

b) maximal

c) highest number of set of

d) minimal View Answer

Answer: d

Explanation: Finding minimal genomes helps provide an understanding of genes constituting key metabolic pathways, which are critical for a cell’s survival. This analysis involves identification of orthologous genes shared between a number of divergent genomes.

5. CoreGenes is a web-based program that determines a ______ set of genes based on comparison of ______ small genomes.

a) vast, four

b) core, fifteen

c) core, four

d) vast, fifteen View Answer

Answer: c

Explanation: The user supplies NCBI accession numbers for the genomes of interest. The program performs an iterative BLAST comparison to find orthologous genes by using one genome as a reference and another as a query. This pairwise comparison is performed for all four genomes. As a result, the common genes are compiled as a core set of genes from the genomes.
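Once orthologs have been assigned, compiling the core set amounts to intersecting the gene sets of all genomes. The sketch below shows only that final step. The iterative BLAST comparison is not reproduced, and the genome names and ortholog group identifiers are hypothetical.

```python
def core_gene_set(genomes):
    """Return the ortholog groups that are present in every genome.

    genomes maps a genome name to the set of ortholog group IDs found in it.
    """
    gene_sets = [set(groups) for groups in genomes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical ortholog assignments for four small genomes.
genomes = {
    "genome_A": {"og1", "og2", "og3"},
    "genome_B": {"og1", "og2"},
    "genome_C": {"og1", "og2", "og4"},
    "genome_D": {"og1", "og2", "og5"},
}
print(sorted(core_gene_set(genomes)))
```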

6. Which of the following is untrue about Lateral gene transfer?

a) It is also known as vertical gene transfer

b) There is exchange of genetic materials between species

c) It mainly occurs among prokaryotic organisms when foreign genes are acquired through mechanisms

d) One of its examples is transformation View Answer

Answer: a

Explanation: Lateral gene transfer, also known as horizontal gene transfer, is defined as the exchange of genetic materials between species in a way that is incongruent with the commonly accepted vertical evolutionary pathway. Examples are transformation (direct uptake of foreign DNA from the environment), conjugation (gene uptake through mating behavior), and transduction (gene uptake mediated by infecting viruses). The transmission of genes between organisms can occur relatively recently or as a more ancient event.

7. A way to discern lateral gene transfer is through phylogenetic analysis, referred to as an ‘among-genome’ approach, which can be used to discover ______.

a) recent lateral gene transfer events but almost negligible ancient events

b) recent lateral gene transfer events

c) ancient lateral gene transfer events

d) both recent and ancient lateral gene transfer events View Answer

Answer: d

Explanation: Abnormal groupings in phylogenetic trees are often interpreted as the possibility of lateral gene transfer events. There are some basic tools for identifying genomic regions that may be a result of lateral gene transfer events using the within-genome approach, namely, ACT, Swaap.

8. Within-Genome Approach is to identify regions within a genome with unusual compositions.

a) True

b) False View Answer

Answer: a

Explanation: Single or oligonucleotide statistics, such as G–C composition, codon bias, and oligonucleotide frequencies are used. Unusual nucleotide statistics in certain genomic regions versus the rest of the genome may help to identify “foreign” genes in a genome. A commonly used parameter is GC skew ((G − C)/(G + C)), which is compositional bias for G in a DNA sequence and is a commonly used indicator for newly acquired genetic elements.
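The compositional statistics mentioned above are simple to compute. The following sketch applies the GC skew formula (G − C)/(G + C) from the explanation, together with plain G–C content, over sliding windows. The window and step sizes are arbitrary illustrative choices, and the helper names are not taken from ACT or Swaap.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """GC skew, (G - C) / (G + C); zero when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def sliding_windows(seq, window, step):
    """Yield (start, subsequence) pairs for windows of fixed length."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


# Toy sequence: a G-rich stretch followed by a C-rich stretch.
genome = "GGGAGGTGGA" + "CCCTCCACCT"
for start, chunk in sliding_windows(genome, window=10, step=5):
    print(start, round(gc_content(chunk), 2), round(gc_skew(chunk), 2))
```

Windows whose values depart sharply from the genome-wide average are the candidates for closer inspection as possibly acquired regions.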

9. Which of the following is untrue about Gene Order Comparison?

a) When the order of a number of linked genes is conserved between genomes, it is called synteny

b) Generally, gene order is much more conserved compared with gene sequences.

c) Generally, gene order is much less conserved compared with gene sequences.

d) It is in fact rarely observed among divergent species. View Answer

Answer: b

Explanation: Gene order conservation is in fact rarely observed among divergent species. Therefore, comparison of syntenic relationships is normally carried out between relatively close lineages. However, if syntenic relationships for certain genes are indeed observed among divergent prokaryotes, they often provide important clues to functional relationships of the genes of interest.

10. Genes involved in the same metabolic pathway tend to be clustered among phylogenetically diverse organisms.

a) True

b) False View Answer

Answer: a

Explanation: The preservation of the gene order is a result of the selective pressure to allow the genes to be co-regulated and function as an operon. Furthermore, the synteny of genes from divergent groups often associates with physical interactions of the encoded gene products.

10. Questions on Functional Genomics & Proteomics

Sequence – Based Approaches

1. Which of the following is untrue regarding expressed sequence tags (ESTs)?

a) One of the high throughput approaches to genome-wide profiling of gene expression is sequencing ESTs

b) They are short sequences obtained from cDNA clones

c) They serve as short identifiers of full-length genes

d) They are typically in the range of 800 to 900 nucleotides in length View Answer

Answer: d

Explanation: ESTs are typically in the range of 200 to 400 nucleotides in length, obtained from either the 5’ end or the 3’ end of cDNA inserts. Libraries of cDNA clones are prepared through reverse transcription of isolated mRNA populations by using oligo(dT) primers that hybridize with the poly(A) tail of mRNAs, followed by ligation of the cDNAs to cloning vectors.

2. To generate EST data, clones in the cDNA library are randomly selected for sequencing from either end of the inserts.

a) True

b) False View Answer

Answer: a

Explanation: The EST data are able to provide a rough estimate of genes that are actively expressed in a genome under a particular physiological condition. This is because the frequencies for particular ESTs reflect the abundance of the corresponding mRNA in a cell, which corresponds to the levels of gene expression at that condition. Another potential benefit of EST sampling is that, by randomly sequencing cDNA clones, it is possible to discover new genes.

3. Which of the following is untrue regarding the drawbacks of expressed sequence tags (ESTs)?

a) They are often of low quality because they are automatically generated without verification

b) Many bases are ambiguously determined, represented by N’s

c) Frame shift errors and artifactual stop codons are some common errors

d) Despite all the errors, translation of the sequences is smooth View Answer

Answer: d

Explanation: Common errors also include frameshift errors and artifactual stop codons, resulting in failures of translating the sequences. In addition, there is often contamination by vector sequence, introns (from unspliced RNAs), ribosomal RNA (rRNA), and mitochondrial RNA, among others. ESTs represent only partial sequences of genes.

4. It has been estimated that up to 11% of cDNA clones may be chimeric.

a) True

b) False View Answer

Answer: a

Explanation: A problem of ESTs is the presence of chimeric clones owing to cloning artifacts in library construction, in which more than one transcript is ligated in a clone, resulting in the 5’ end of a sequence representing one gene and the 3’ end another gene. Another fundamental problem with EST profiling is that it predominantly represents highly expressed, abundant transcripts. Weakly expressed genes are hardly found in an EST sequencing survey.

5. Which of the following is untrue regarding expressed sequence tags (ESTs)?

a) EST libraries can be easily generated from various cell lines, tissues, organs, and at various developmental stages

b) Although individual ESTs are prone to error, an entire collection of ESTs contains valuable information

c) Identification of cDNA clone is difficult

d) ESTs can also facilitate the unique identification of a gene from a cDNA library View Answer

Answer: c

Explanation: A short tag can lead to a cDNA clone, so identification of the clone is not difficult. Often, after consolidation of multiple EST sequences, a full-length cDNA can be derived. By searching a non-redundant EST collection, one can identify potential genes of interest.

6. GenBank has a special EST database, dbEST that contains EST collections for a large number of organisms.

a) True

b) False View Answer

Answer: a

Explanation: The rapid accumulation of EST sequences has prompted the establishment of public and private databases to archive the data. The mentioned database is regularly updated to reflect the progress of various EST sequencing projects. Each newly submitted EST sequence is subject to a database search. If a strong similarity to a known gene is found, it is annotated accordingly.

7. Which of the following is untrue regarding EST Index Construction?

a) The goal of the EST databases is to organize and consolidate the largely redundant EST data

b) The process includes a preprocessing step that masks repeats

c) There is no screening of vector contaminants

d) The goal of the EST databases is to improve the quality of the sequence information so the data can be used to extract full-length cDNAs

View Answer

Answer: c

Explanation: The process includes a preprocessing step that removes vector contaminants and masks repeats. VecScreen can be used to screen out bacterial vector sequences. This is followed by a clustering step that associates EST sequences with unique genes.

8. Which of the following is untrue regarding UniGene?

a) It is an NCBI EST cluster database.

b) Overlapping EST sequences are computationally processed to represent a single expressed gene.

c) Each cluster is a set of overlapping EST sequences

d) The overlapping EST sequences are computationally processed to represent a set of expressed genes

View Answer

Answer: d

Explanation: The database is constructed based on combined information from dbEST, the GenBank mRNA database, and “electronically spliced” genomic DNA. Only ESTs with 3’ poly-A ends are clustered to minimize the problem of chimerism. The resulting 3’ EST sequences provide a more unique representation of the transcripts.

9. Which of the following is untrue regarding TIGR Gene Indices?

a) It is an EST database that uses the same type of clustering method as UniGene

b) It is an EST database that uses a different clustering method from UniGene

c) It compiles data from dbEST, GenBank mRNA and genomic DNA data, and TIGR’s own sequence database

d) Sequences are only clustered if they are more than 95% identical for over a forty-nucleotide region in pairwise comparisons

View Answer

Answer: a

Explanation: BLAST and FASTA are used to identify sequence overlaps. In the sequence assembly stage, both TIGR Assembler and CAP3 are used to construct contigs, producing a so-called tentative consensus (TC). To prevent chimerism, transcripts are clustered only if they match fully with known genes.

10. Which of the following is untrue regarding SAGE?

a) It stands for Serial analysis of gene expression

b) It is another high throughput, sequence-based approach for global gene expression profile analysis

c) It stands for Squared analysis of gene expression

d) Unlike EST sampling, SAGE is more quantitative in determining mRNA expression in a cell

View Answer

Answer: c

Explanation: In this method, short fragments of DNA (usually 15 base pairs [bp]) are excised from cDNA sequences and used as unique markers of the gene transcripts. The sequence fragments are termed tags. They are subsequently concatenated (linked together), cloned, and sequenced.

Microarray-Based Approaches

1. Which of the following is incorrect about a microarray?

a) It is a slide attached with a high-density array of immobilized DNA oligomers representing the entire genome of the species under study

b) Array of immobilized DNA oligomers cannot be cDNAs

c) Each oligomer is spotted on the slide and serves as a probe for binding to a unique complementary cDNA

d) It is the most commonly used global gene expression profiling method View Answer

Answer: b

Explanation: The entire cDNA population, labeled with fluorescent dyes or radioisotopes, is allowed to hybridize with the oligo probes on the chip. The amount of fluorescent or radiolabels at each spot position reflects the amount of corresponding mRNA in the cell. Using this analysis, patterns of global gene expression in a cell can be examined.

2. Which of the following is incorrect about oligonucleotide design in a microarray?

a) DNA microarrays are generated by fixing oligonucleotides onto a solid support

b) The oligonucleotide array slide represents thousands of preselected genes from an organism

c) The length of oligonucleotides is typically in the range of twenty-five to seventy bases long

d) The oligonucleotides don’t react with cDNA samples View Answer

Answer: d

Explanation: The oligonucleotides are called probes, and they hybridize to labeled cDNA samples. Shorter oligo probes tend to be more specific in hybridization because they are better at discriminating perfect complementary sequences from sequences containing mismatches. However, longer oligos can be more sensitive in binding cDNAs.

3. Which of the following is incorrect about Data Collection?

a) The two-color microarray uses multiple dyes at times

b) The most common type of microarray protocol is the two-color microarray

c) The cDNAs are obtained by extracting total RNA or mRNA from tissues or cells and incorporating fluorescent dyes in the DNA strands during the cDNA biosynthesis

d) The expression of genes is measured via the signals from cDNAs hybridizing with the specific oligonucleotide probes on the microarray

View Answer

Answer: a

Explanation: The most common type of microarray protocol is the two-color microarray, which involves labeling one set of cDNA from an experimental condition with one dye (Cy5, red fluorescence) and another set of cDNA from a reference condition (the controls) with another dye (Cy3, green fluorescence). When the two differently labeled cDNA samples are mixed in equal quantity and allowed to hybridize with the DNA probes on the chips, gene expression patterns of both samples can be measured simultaneously.
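Two-color measurements are commonly summarized per spot as the base-2 logarithm of the Cy5/Cy3 intensity ratio. The explanation does not name this convention, so the sketch below is an assumption for illustration, and the gene names and intensities are made up.

```python
import math


def log2_ratio(cy5, cy3):
    """Log2 of experimental (Cy5, red) over reference (Cy3, green) intensity."""
    return math.log2(cy5 / cy3)


# Hypothetical background-corrected spot intensities: (Cy5, Cy3).
spots = {
    "geneA": (2000.0, 500.0),
    "geneB": (300.0, 1200.0),
    "geneC": (800.0, 800.0),
}

for gene, (cy5, cy3) in spots.items():
    ratio = log2_ratio(cy5, cy3)
    if ratio > 0:
        call = "higher in experimental sample"
    elif ratio < 0:
        call = "higher in reference sample"
    else:
        call = "unchanged"
    print(gene, ratio, call)
```

On the log scale, a fourfold increase and a fourfold decrease are symmetric (+2 and -2), which makes induced and repressed genes easier to compare.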

4. In the analysis of microarray data, if replicated datasets are available, rigorous statistical tests such as the t-test and analysis of variance (ANOVA) can be performed to test the null hypothesis that a given data point is not significantly different from the mean of the data distribution.

a) True

b) False View Answer

Answer: a

Explanation: For such tests, it is common to use a P-value cutoff of .05, which means a confidence level of 95% to distinguish the data groups. This level also corresponds to a gene expression level with two standard deviations from the mean of distribution.
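The link between a P-value cutoff of .05 and roughly two standard deviations can be checked numerically. The sketch assumes a normal distribution, which the explanation does not state explicitly, and uses only the Python standard library.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def two_sided_coverage(z):
    """Fraction of a normal distribution within z standard deviations of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


def two_sided_critical_value(alpha):
    """Number of standard deviations that leaves a total tail area of alpha."""
    return standard_normal.inv_cdf(1 - alpha / 2)


print(two_sided_coverage(2.0))         # a little over 95%
print(two_sided_critical_value(0.05))  # a little under 2
```

Under that assumption, two standard deviations cover slightly more than 95% of the distribution, so the two descriptions of the cutoff agree to a good approximation.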

5. Which of the following is incorrect about Classification of microarray data?

a) For microarray data, clustering analysis identifies coexpressed and coregulated genes

b) For microarray data, clustering analysis identifies coexpressed but not coregulated genes

c) For microarray data, clustering analysis identifies coregulated but not coexpressed genes

d) Genes within a category have more similarity in expression than genes from different categories.

View Answer

Answer: a

Explanation: When genes are co-regulated, they normally reflect related functionality. Through gene clustering, functions of previously uncharacterized genes may be discovered. Clustering methods include hierarchical clustering and partitioning clustering (e.g., k-means, self-organizing maps [SOMs]).

6. A supervised analysis refers to classification of data into a set of predefined categories. For example, depending on the purpose of the experiment, the data can be classified into predefined ‘diseased’ or ‘normal’ categories.

a) True

b) False View Answer

Answer: a

Explanation: Similarly, an unsupervised analysis does not assume predefined categories, but identifies data categories according to actual similarity patterns. The unsupervised analysis is also called clustering, which is to group patterns into clusters of genes with correlated profiles.

7. Which of the following is incorrect about Hierarchical Clustering?

a) The tree-branching pattern illustrates a higher degree of relationship between related gene groups

b) It is not similar to the distance phylogenetic tree-building method

c) It produces a treelike structure that represents a hierarchy or relative relatedness of data groups

d) In the tree leaves, similar gene expression profiles are placed more closely together than dissimilar gene expression profiles

View Answer

Answer: b

Explanation: A hierarchical clustering method is in principle similar to the distance phylogenetic tree-building method. When genes with similar expression profiles are grouped in such a way, functions for unknown genes can often be inferred. Hierarchical clustering uses the agglomerative approach that works in much the same way as the UPGMA method, in which the most similar data pairs are joined first to form a cluster.
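The agglomerative procedure can be sketched in a few lines. The example uses Euclidean distance and average linkage (the UPGMA-style rule), both chosen here for illustration, and the gene names and expression profiles are made up.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(profiles):
    """Join the closest clusters repeatedly; return merges in the order made."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles across three conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [5.0, 1.0, 0.5],
    "g4": [5.2, 0.9, 0.4],
}
merges = agglomerative_clustering(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The order of the merges and their distances are exactly the information a dendrogram draws: early, short-distance joins sit near the leaves, and the final join forms the root.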

8. Which of the following is incorrect about k-Means Clustering?

a) k-means clustering produces a dendrogram

b) It classifies data through a single step partition

c) It is a divisive approach

d) In this method, data are partitioned into k-clusters, which are prespecified at the outset

View Answer

Answer: a

Explanation: In contrast to hierarchical clustering algorithms, k-means clustering does not produce a dendrogram, but instead classifies data through a single step partition. The value of k is normally randomly set but can be adjusted if results are found to be unsatisfactory.
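A minimal k-means sketch makes the contrast with hierarchical clustering concrete: the output is a flat assignment of points to k clusters, not a tree. The iterative refinement shown is the commonly used Lloyd procedure, which goes slightly beyond what the explanation describes. The starting centroids are passed in explicitly so the run is reproducible, and the data points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, initial_centroids, max_iterations=100):
    """Partition points into len(initial_centroids) clusters."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the partition has converged
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = mean_point(members)
    return assignment, centroids


# Hypothetical two-condition expression values for four genes, with k = 2.
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
assignment, centroids = k_means(points, [points[0], points[2]])
print(assignment)
print(centroids)
```

If the resulting partition looks unsatisfactory, the run can be repeated with a different k or different starting centroids, as the explanation suggests.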

9. Which of the following is incorrect about Self-Organizing Maps?

a) Clustering by SOMs is in principle similar to the k-means method

b) It doesn’t involve neural networks

c) The data points are initially assigned to the nodes at random

d) It starts by defining a number of nodes View Answer

Answer: b

Explanation: This pattern recognition algorithm employs neural networks. The distances between the input data points and the centroids are calculated. The data points are successively adjusted among the nodes, and their distances to the centroids are recalculated. After many iterations, a stabilized clustering pattern is reached with the minimum distances of the data points to the centroids. The difference between SOM and k-means is that, in SOM, the nodes are not treated as isolated entities, but as connected to other nodes.

10. TIGR TM4 is a suite of multiplatform programs for analyzing microarray data.

a) True

b) False View Answer

Answer: a

Explanation: This comprehensive package includes four interlinked programs: TIGR Spotfinder (for image analysis), MIDAS (for data normalization), MeV (for clustering analysis and visualization), and MADAM (for data management). The package provides different data normalization schemes and clustering options. Other similar clustering programs are EPCLUST and SOTA.

Comparison of SAGE and DNA Microarrays

1. Which of the following is untrue about SAGE?

a) This approach is much more efficient than the EST analysis

b) This approach is quite less efficient than the EST analysis

c) It uses a short nucleotide tag to define a gene transcript

d) It allows sequencing of multiple tags in a single clone View Answer

Answer: b

Explanation: If an average clone has a size of 700 bp, it can contain up to 50 sequence tags (15 bp each), which means that the SAGE method can be at least fifty times more efficient than the brute force EST sequencing and counting. Therefore, the SAGE analysis has a better chance of detecting weakly expressed genes.

2. Which of the following is untrue about SAGE?

a) Sequencing is the most costly and time-consuming step

b) Here, sequencing is economical but time-consuming step

c) Sequencing is economical but time-reducing step

d) It is difficult to know how many tags need to be sequenced to get a good coverage of the entire transcriptome

View Answer

Answer: a

Explanation: The number of tags that need to be sequenced is generally determined on a case-by-case basis. As a rule of thumb, 10,000 clones representing approximately 500,000 tags from each sample are sequenced. The scale and cost of the sequencing required for SAGE analysis are prohibitive for most laboratories. Only large sequencing centers can afford to carry out SAGE analysis routinely.

3. Which of the following is untrue about the drawbacks of SAGE?

a) One or two sequencing errors in the tag sequence can lead to ambiguous or erroneous tag identification

b) Correctly sequenced SAGE tag sometimes may correspond to several genes or no gene at all

c) Correctly sequenced SAGE tag always corresponds to several genes

d) The drawback with this approach is the sensitivity to sequencing errors View Answer

Answer: c

Explanation: To improve the sensitivity and specificity of SAGE detection, the lengths of the tags need to be increased for the technique. There are some comprehensive software tools for SAGE analysis viz. SAGEmap, SAGExProfiler.

4. SAGEmap is a SAGE database created by NCBI.

a) True

b) False View Answer

Answer: a

Explanation: Given a cDNA sequence, one can search SAGE libraries for possible SAGE tags and perform “virtual” Northern blots that indicate the relative abundance of a tag in a SAGE library. Each output is hyperlinked to a particular UniGene entry with sequence annotation.

5. SAGExProfiler doesn’t provide information about overexpressed or silenced genes

a) True

b) False View Answer

Answer: b

Explanation: It is a web-based program that allows a “virtual subtraction” of an expression profile of one library (e.g., normal tissue) from another (e.g., diseased tissue). Comparison of the two libraries can provide information about overexpressed or silenced genes in normal versus diseased tissues.
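The idea of a “virtual subtraction” can be sketched as a comparison of normalized tag counts between two libraries. This is not SAGExProfiler's actual procedure: the tags-per-million normalization is an assumption made for illustration, and the tag sequences and counts are made up.

```python
from collections import Counter


def tags_per_million(tag_counts):
    """Scale raw tag counts so that libraries of different size are comparable."""
    total = sum(tag_counts.values())
    return {tag: count * 1_000_000 / total for tag, count in tag_counts.items()}


def virtual_subtraction(library_a, library_b):
    """Per-tag difference in normalized abundance, library_a minus library_b."""
    norm_a = tags_per_million(library_a)
    norm_b = tags_per_million(library_b)
    all_tags = set(norm_a) | set(norm_b)
    return {tag: norm_a.get(tag, 0.0) - norm_b.get(tag, 0.0) for tag in all_tags}


# Hypothetical 15-bp tags and their raw counts in two libraries.
diseased = Counter({"AAAAAAAAAAAAAAA": 30, "CCCCCCCCCCCCCCC": 10})
normal = Counter({"AAAAAAAAAAAAAAA": 10, "CCCCCCCCCCCCCCC": 10,
                  "GGGGGGGGGGGGGGG": 20})

difference = virtual_subtraction(diseased, normal)
for tag, delta in sorted(difference.items(), key=lambda item: -item[1]):
    print(tag, delta)
```

Tags with large positive differences are candidates for overexpression in the first library, while tags that appear only in the second library point to genes that may be silenced in the first.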

6. Which of the following is untrue about SAGE Genie?

a) It is an NCBI web-based program

b) It allows matching of experimentally obtained SAGE tags to known genes

c) It provides an interface for visualizing human gene expression

d) It doesn’t filter out linker sequences from experimentally obtained SAGE tags View Answer

Answer: d

Explanation: It has a filtering function that filters out linker sequences from experimentally obtained SAGE tags and allows expression pattern comparison between normal and diseased human tissues. The data output can be presented using subprograms such as the Anatomic Viewer, Digital Northern, and Digital Gene Expression Display.

7. Which of the following is an incorrect statement?

a) SAGE and DNA microarrays are both high throughput techniques that determine global mRNA expression levels

b) Studies have indicated that the gene expression measurements from these methods are highly inconsistent with each other

c) SAGE does not require prior knowledge of the transcript sequence

d) DNA microarray experiments can only detect the genes spotted on the microarray

Answer: b

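Explanation: Studies have indicated that gene expression measurements from the two methods are largely consistent with each other. Because SAGE is able to measure all the mRNA expressed in a sample, it has the potential to allow discovery of new, as yet unknown gene transcripts.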

8. DNA microarrays measure “absolute” mRNA expression levels without arbitrary reference standards, whereas SAGE indicates the relative expression levels.

a) True

b) False

Answer: b

Explanation: SAGE measures “absolute” mRNA expression levels without arbitrary reference standards, whereas DNA microarrays indicate the relative expression levels. Therefore, SAGE expression data are more comparable across experimental conditions and platforms. This makes public SAGE databases more informative by allowing comparison of data from reference conditions with various experimental treatments.

9. The PCR amplification step involved in the SAGE procedure means that it requires a large quantity of sample mRNA.

a) True

b) False

Answer: b

Explanation: The PCR amplification step involved in the SAGE procedure means that it requires only a minute quantity of sample mRNA. This compares favorably to the requirement for a much larger quantity of mRNA for microarray experiments, which may be impossible to obtain under certain circumstances.

10. Which of the following is an incorrect statement?

a) Collecting a SAGE library is very labor intensive and expensive

b) Collecting a SAGE library is quite economical

c) SAGE is not suitable for rapid screening of cells

d) Gene identification from SAGE data is also more cumbersome

Answer: b

Explanation: Collecting a SAGE library is very labor intensive and expensive, so it is not economical. Gene identification from SAGE data is also more cumbersome because the mRNA tags have to be extracted, compiled, and identified computationally, whereas in DNA microarrays, the identities of the probes are already known. In SAGE, comparison of gene expression profiles to discover differentially expressed genes and co-expressed genes is performed manually, whereas for microarrays, there are a large number of software algorithms to automate the process.

Technology of Protein Expression Analysis

1. The classic protein separation methods involve two-dimensional gel electrophoresis followed by gel image analysis.

a) True

b) False

Answer: a

Explanation: Further characterization involves determination of amino acid composition, peptide mass fingerprints, and sequences using mass spectrometry (MS). Finally, database searching is needed for protein identification.

2. Which of the following is incorrect regarding 2D-PAGE?

a) It stands for Two-dimensional polyacrylamide gel electrophoresis

b) It separates proteins by charge only

c) The gel is run in one direction in a pH gradient under a non-denaturing condition

d) It works to separate proteins by isoelectric points (pI)

Answer: b

Explanation: It is a high-resolution technique that separates proteins by charge and mass. The gel is first run in a pH gradient to separate proteins by isoelectric points (pI) and then in an orthogonal dimension under a denaturing condition to separate proteins by molecular weights (MW). This is followed by staining, usually silver staining, which is very sensitive, to reveal the position of all proteins. The result is a two-dimensional gel map; each spot on the map corresponds to a single protein being expressed.

3. Which of the following is incorrect regarding 2D-PAGE?

a) Not all proteins can be separated by this method or stained properly

b) The stained gel can be scanned and digitized for image analysis

c) Membrane proteins are largely hydrophilic and readily solubilized

d) One of the challenges of this technique is the separation of membrane proteins

Answer: c

Explanation: Membrane proteins are largely hydrophobic and not readily solubilized. They tend to aggregate in the aqueous medium of a two-dimensional gel. To overcome this problem, membrane proteins can be fractionated using specialized protocols and then electrophoresed using optimized buffers containing zwitterionic detergents. Subfractionation can be carried out to separate nuclear, cytosol, cytoskeletal, and other subcellular fractions to boost the concentrations of rare proteins and to reveal subcellular localizations of the proteins.

4. Comparing two-dimensional gel images from various experiments can sometimes pose a challenge because the gels, unlike DNA microarrays, may shrink or warp.

a) True

b) False

Answer: a

Explanation: This requires the software programs to be able to stretch or maneuver one of the gels relative to the other to find a common geometry. When the reference spots are aligned properly, the rest of the spots can be subsequently compared automatically.

5. Which of the following is incorrect regarding Mass Spectrometry Protein Identification?

a) The proteolysis doesn’t generate a pattern according to molecular weight

b) Proteins can be identified and characterized using MS

c) The proteins from a two dimensional gel system are first digested in situ with a protease

d) Protein spots of interest are excised from the two-dimensional gel

Answer: a

Explanation: The proteolysis generates a unique pattern of peptide fragments of various MWs, which is termed a peptide fingerprint. The fragments can be analyzed with MS, a high-resolution technique for determining molecular masses. Currently, electrospray ionization MS and matrix-assisted laser desorption ionization (MALDI) MS are commonly used.

6. Electrospray ionization MS and matrix-assisted laser desorption ionization (MALDI) MS only differ in the ionization procedure used.

a) True

b) False

Answer: a

Explanation: In MALDI-MS, for example, the peptides are charged with positive ions and forced through an analyzing tube with a magnetic field. Peptides are analyzed in the gas phase. Because smaller peptides are deflected more than larger ones in a magnetic field, the peptide fragments can be separated according to molecular mass and charges. A detector generates a spectrum that displays ion intensity as a function of the mass-to-charge ratio.

7. Which of the following is incorrect regarding the Protein Identification through Database Searching?

a) MS characterization of proteins is highly dependent on bioinformatic analysis

b) Bioinformatics programs can be used to search for the identity of a protein in a database of theoretically digested proteins

c) Even in reality, the protease digestion is always perfect in MS

d) The purpose of the database search is to find exact or nearly exact matches

Answer: c

Explanation: In reality, protease digestion is rarely perfect, often generating partially digested products as a result of missed cuts at expected cutting sites. Peptides resulting from MALDI-MS are also charged, which increases their mass slightly.
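Both effects can be illustrated with a small theoretical digest. The sketch below assumes trypsin as the protease (cleavage after K or R, except before P), uses standard monoisotopic residue masses, and treats the charge as one added proton per unit charge. The function names and the example sequence are hypothetical and do not come from any particular search tool.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # mass added per positive charge


def cleavage_fragments(seq):
    """Split after every K or R that is not followed by P (assumed trypsin rule)."""
    fragments, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments


def digest(seq, missed_cleavages=1):
    """List peptides, allowing up to `missed_cleavages` skipped cut sites."""
    frags = cleavage_fragments(seq)
    peptides = []
    for i in range(len(frags)):
        last = min(i + missed_cleavages, len(frags) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(frags[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(mass, charge=1):
    """Mass-to-charge ratio of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


# Hypothetical sequence: the R before P is not a cleavage site.
peptides = digest("GAKLSRPEK", missed_cleavages=1)
masses = {p: round(mz(peptide_mass(p)), 4) for p in peptides}
```

A database search program would compare measured values against a list like `masses`, which is why it has to allow for missed cleavages and the added charge when looking for nearly exact matches.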

8. ExPASy is a comprehensive proteomics web server with a suite of programs for searching peptide information from the SWISS-PROT and TrEMBL databases.

a) True

b) False

Answer: a

Explanation: There are twelve database search tools in this server dedicated to protein identification based on MS data. For example, the AACompIdent program identifies proteins based on pI, MW, and amino acid composition and compares these values with theoretical compositions of all proteins in SWISS-PROT/TrEMBL.

9. Which of the following is incorrect regarding Mascot and ProFound?

a) ProFound is a web server with a set of interconnected programs

b) ProFound searches a protein sequence database using MS fingerprinting information

c) Bayesian algorithm is not involved in ProFound

d) Mascot is a web server that identifies proteins based on peptide mass fingerprints, sequence entries, or raw MS/MS data from one or more peptides

Answer: c

Explanation: In ProFound, a Bayesian algorithm is used. It ranks the database matches according to the probability of database sequences producing the peptide mass fingerprints.

10. Which of the following is incorrect regarding Differential In-Gel Electrophoresis?

a) Proteins are mixed together before electrophoresis on a two-dimensional gel

b) Differentially expressed proteins in both conditions can’t be visualized in the same gel

c) Differences in protein expression patterns can be detected in a similar way as in fluorescent-labeled DNA microarrays

d) Proteins from experimental and control samples are labeled with differently colored fluorescent dyes

Answer: b

Explanation: Differentially expressed proteins in both conditions can be co-separated and visualized in the same gel. Compared to regular 2D-PAGE, the process reduces the noise and improves the reproducibility and sensitivity of detection. In principle, it resembles the two-color DNA microarray analysis. The drawbacks of this approach are that different proteins take up fluorescent tags to different extents and that some proteins labeled with the fluorophores may become less soluble and precipitate before electrophoresis.

Posttranslational Modification

1. Which of the following is a wrong statement?

a) To assume biological activity, many nascent polypeptides have to be covalently modified before or after the folding process

b) In eukaryotic cells most modifications take place in the endoplasmic reticulum and the Golgi apparatus

c) The modifications in eukaryotic cells include proteolytic cleavage; formation of disulfide bonds; addition of phosphoryl, methyl, acetyl, or other groups onto certain amino acid residues

d) The modifications in eukaryotic cells do not include attachment of oligosaccharides or prosthetic groups to create mature proteins

Answer: d

Explanation: Posttranslational modifications have a great impact on protein function by altering the size, hydrophobicity and overall conformation of the proteins. The modifications can directly influence protein–protein interactions and distribution of proteins to different subcellular locations.

2. Which of the following is wrong about AutoMotif?

a) It is a web server predicting protein sequence motifs

b) It doesn’t use SVM approach

c) In this process, the query sequence is chopped up into a number of overlapping fragments

d) The overlapping fragments from the query sequence are fed into different kernels (similar to nodes)

Answer: b

Explanation: AutoMotif uses an SVM approach. A hyperplane, which has been trained to recognize known protein sequence motifs, separates the kernels into different classes. Each separation is compared with known motif classes, most of which are related to posttranslational modification. The best match with a known class defines the functional motif.

3. It is important to use bioinformatics tools to predict sites for posttranslational modifications based on specific protein sequences. However, prediction of such modifications can often be difficult because of the short lengths of the sequence motifs associated with certain modifications.

a) True

b) False

Answer: a

Explanation: This often leads to many false-positive identifications. One such example is the known consensus motif for protein phosphorylation, [ST]-x-[RK]. Such a short motif can be found multiple times in almost every protein sequence. Most of the predictions based on this sequence motif alone are likely to be wrong, producing very high rates of false-positives.
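A quick scan shows how often such a short pattern turns up. The sketch below translates the consensus [ST]-x-[RK] into a regular expression and uses a lookahead so that overlapping hits are reported. The 30-residue sequence is made up for illustration, and `find_motif_sites` is a hypothetical helper name.

```python
import re

# [ST]-x-[RK]: serine or threonine, any residue, then arginine or lysine.
# The lookahead (?=...) lets overlapping matches be found.
PHOSPHO_MOTIF = re.compile(r"(?=([ST].[RK]))")


def find_motif_sites(sequence):
    """Return (1-based position, matched tripeptide) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in PHOSPHO_MOTIF.finditer(sequence)]


# Hypothetical sequence, not a real protein.
seq = "MSTKRLSAKETPRGSLKDTSRKAASGKLTV"
hits = find_motif_sites(seq)
for position, tripeptide in hits:
    print(position, tripeptide)
```

Every hit is only a candidate site. The pattern says nothing about whether a kinase can actually reach or recognize the residue, which is why predictions from the motif alone produce so many false positives.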

4. To minimize false-positive results, a statistical learning process called support vector machine (SVM) can be used to increase the specificity of prediction.

a) True

b) False

Answer: a

Explanation: This is a data classification method similar to the linear or quadratic discriminant analysis. In this method, the data are projected in a three-dimensional space or even a multidimensional space.

5. In a statistical learning process called support vector machine (SVM), a hyperplane is

a) a linear or nonlinear mathematical function

b) nonlinear mathematical function

c) linear mathematical function

d) exponential mathematical function

Answer: a

Explanation: It is used to best separate true signals from noise. The algorithm has more environmental variables included that may be required for the enzyme modification. After training the algorithm with sufficient structural features, it is able to correctly recognize many posttranslational modification patterns.
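For the linear case, the hyperplane is just a weighted sum of the input features plus an offset, and the sign of that sum decides the class. The sketch below shows only this decision step. The weights, bias, and feature vectors are hypothetical values chosen for illustration; a real SVM would learn them from training data, and a nonlinear SVM would replace the dot product with a kernel function.

```python
def decision_value(weights, bias, features):
    """Signed score of a point relative to the hyperplane w.x + b = 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Label a point by the side of the hyperplane it falls on."""
    return "signal" if decision_value(weights, bias, features) >= 0 else "noise"


# Hypothetical trained parameters for a three-feature problem.
weights = [1.5, -2.0, 0.5]
bias = -0.25

label_a = classify(weights, bias, [1.0, 0.2, 0.4])  # score is positive
label_b = classify(weights, bias, [0.1, 0.8, 0.3])  # score is negative
```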

6. A disulfide bridge is a unique type of _____ modification in which _____ bonds are formed between cysteine residues.

a) posttranslational, covalent

b) translational, covalent

c) translational, ionic

d) posttranslational, ionic

Answer: a

Explanation: Disulfide bonds are important for maintaining the stability of certain types of proteins. Disulfide prediction is the prediction of the pairing potential or bonding states of cysteines in a protein.

7. Accurate prediction of _____ bonds may also help to predict the _____-dimensional structure of the protein of interest.

a) nitrogen, two

b) nitrogen, three

c) disulfide, three

d) oxygen, three

Answer: c

Explanation: This problem can be tackled by using profiles constructed from multiple sequence alignment. It can also be tackled by using residue contact potentials calculated based on the local sequence environment.

8. Only advanced neural networks are used to discern long-distance pairwise interactions among cysteine residues.

a) True

b) False

Answer: b

Explanation: Advanced neural networks, SVM, or hidden Markov model (HMM) algorithms are often used to discern long-distance pairwise interactions among cysteine residues. Cysteine is one of the publicly available programs specialized in disulfide prediction.

9. Cysteine doesn’t make predictions by building profiles.

a) True

b) False

Answer: b

Explanation: Cysteine is a web server that predicts the disulfide bonding states of cysteine residues in a protein sequence by building profiles based on multiple sequence alignment information. A recursive neural network ranks the candidate residues for disulfide formation.

10. ExPASy contains a number of programs to determine posttranslational modifications based on MS molecular mass data.

a) True

b) False

Answer: a

Explanation: FindMod is a subprogram that uses experimentally determined peptide fingerprint information to compare the masses of the peptide fragments with those of theoretical peptides. If a difference is found, it predicts a particular type of modification based on a set of predefined rules. It can predict twenty-eight types of modifications, including methylation, phosphorylation, lipidation, and sulfation.
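The mass-difference idea can be sketched as a lookup against known mass shifts. This is not FindMod's actual rule set: the table below holds only three well-known monoisotopic shifts, and the function name, tolerance, and example masses are hypothetical.

```python
# Monoisotopic mass shifts in daltons for a few common modifications.
MOD_MASS_SHIFT = {
    "methylation": 14.01565,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}


def guess_modification(observed_mass, theoretical_mass, tolerance=0.01):
    """Return modifications whose mass shift explains the observed difference."""
    delta = observed_mass - theoretical_mass
    return [name for name, shift in MOD_MASS_SHIFT.items()
            if abs(delta - shift) <= tolerance]


# Hypothetical peptide: theoretical mass 1080.50 Da, observed about 80 Da heavier.
candidates = guess_modification(1080.50 + 79.96633, 1080.50)
unmodified = guess_modification(1000.0, 1000.0)
```

Some shifts are nearly identical (sulfation and phosphorylation differ by less than 0.01 Da), so a mass difference alone cannot always tell them apart; this is one reason such tools add sequence-based rules on top of the mass comparison.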

Protein Sorting

1. Which of the following is an incorrect statement about the terminologies related to protein sorting?

a) Subcellular localization is an integral part of protein functionality

b) Many proteins exhibit functions only after being transported to certain compartments of the cell

c) All the proteins exhibit functions after being transported to certain compartments of the cell

d) Protein sorting is also known as protein targeting

Answer: c

Explanation: The study of the mechanism of protein trafficking and subcellular localization is the field of protein sorting, which has become one of the central themes in modern cell biology.

Identifying protein subcellular localization is an important aspect of functional annotation, because knowing the cellular localization of a protein often helps to narrow down its putative functions.

2. For many eukaryotic proteins, newly synthesized protein precursors have to be transported to specific membrane-bound compartments and be proteolytically processed to become functional

a) True

b) False

Answer: a

Explanation: These compartments include chloroplasts, mitochondria, the nucleus, and peroxisomes. To carry out protein translocation, unique peptide signals have to be present in the nascent proteins, which function as “zip codes” that direct the proteins to each of these compartments.

3. Once the proteins are translocated within the organelles, protease cleavage takes place to remove the signal sequences and generate mature proteins

a) True

b) False

Answer: a

Explanation: This is an example of posttranslational modification. Even in prokaryotes, proteins can be targeted to the inner or outer membranes, the periplasmic space between these membranes, or the extracellular space. The sorting of these proteins is similar to that in eukaryotes and relies on the presence of signal peptides.

4. The signal sequences have a _____ consensus but contain some specific features. They all have a _____ core region preceded by one or more positively charged residues.

a) weak, hydrophilic

b) weak, hydrophobic

c) strong, hydrophilic

d) strong, hydrophobic

Answer: b

Explanation: However, the length and sequence of the signal sequences vary tremendously. Peptides targeting mitochondria, for example, are located in the N-terminal region.

5. The mitochondrial signal sequences are typically _____ residues long, rich in _____ charged residues such as arginines as well as hydroxyl residues such as serines and threonines, but devoid of _____ charged residues.

a) 28 to 80, positively, negatively

b) 300 to 800, negatively, positively

c) 28 to 80, negatively, positively

d) 300 to 500, positively, negatively

Answer: a

Explanation: They have the tendency to form amphiphilic α-helices. These targeting sequences are cleaved once the precursor proteins are inside the mitochondria.

6. Chloroplast localization signals are also located in the _____-terminus and are about 25 to 100 residues in length, containing very few _____ charged residues but many hydroxylated residues such as serine.

a) N, negatively

b) C, negatively

c) C, positively

d) N, positively

Answer: a

Explanation: Chloroplast localization signals are also called transit peptides. An interesting feature of the proteins targeted for the chloroplasts is that the transit signals are bipartite.

7. Chloroplast localization signals consist of two adjacent signal peptides, one for targeting the proteins to the stroma portion of the chloroplast before being cleaved and the other for targeting the remaining portion of the proteins to the thylakoids.

a) True

b) False

Answer: a

Explanation: Localization signals targeting to the nucleus are variable in length (seven to forty-one residues) and are found in the internal region of the proteins. They typically consist of one or two stretches of basic residues with a consensus motif K(K/R)X(K/R). Nuclear signal sequences are not cleaved after protein transport.
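The consensus K(K/R)X(K/R) translates directly into a regular expression. The sketch below scans a short test sequence built around the well-known SV40 large T-antigen signal PKKKRKV; the flanking residues and the helper name `find_nls_candidates` are hypothetical.

```python
import re

# K(K/R)X(K/R): lysine, then K or R, then any residue, then K or R.
# The lookahead reports overlapping matches as separate candidates.
NLS_MOTIF = re.compile(r"(?=(K[KR].[KR]))")


def find_nls_candidates(sequence):
    """Return (1-based position, matched tetrapeptide) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in NLS_MOTIF.finditer(sequence)]


# SV40 large T-antigen signal PKKKRKV with made-up flanking residues.
nls_hits = find_nls_candidates("MAPKKKRKVEDS")
```

Because the signal is internal and not cleaved, the whole sequence has to be scanned rather than just the N-terminus, and a match is only a candidate that needs further evidence.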

8. Which of the following is an incorrect statement about SignalP?

a) It only uses neural networks

b) It only uses HMMs

c) It is a web-based program that predicts subcellular localization signals

d) It uses both neural networks and HMMs

Answer: b

Explanation: The neural network algorithm combines two different scores, one for recognizing signal peptides and the other for protease cleavage sites. The HMM-based analysis discriminates between signal peptides and the N-terminal transmembrane anchor segments required for insertion of the protein into the membrane.

9. Which of the following is not one of the training sets in SignalP?

a) Prokaryotes

b) Eukaryotes

c) Gram-positive bacteria

d) Gram-negative bacteria

Answer: a

Explanation: This distinction is necessary because there are significant differences in the characteristics of the signal peptides from these organisms. Therefore, appropriate datasets need to be selected before analyzing the sequence. The program predicts both the signal peptides and the protease cleavage sites of the query sequence.

10. TargetP is a neural network-based program, similar to SignalP.

a) True

b) False

Answer: a

Explanation: It predicts the subcellular locations of eukaryotic proteins based on their N-terminal amino acid sequence only. It uses analysis output from SignalP and feeds it into a decision neural network, which makes a final choice regarding the target compartment.

Protein Interactions

1. Which of the following is untrue regarding the classic yeast two-hybrid method?

a) It is used for the detection of Protein interactions

b) Method that relies on the interaction of “bait” and “prey” proteins in molecular constructs in yeast

c) DNA-binding domain and a trans-activation domain don’t necessarily interact

d) In this strategy, a two-domain transcriptional activator is employed as a helper for determining protein–protein interactions

Answer: c

Explanation: The two domains which are a DNA-binding domain and a trans-activation domain normally interact to activate transcription. However, molecular constructs are made such that each of the two domains is covalently attached to each of the two candidate proteins (bait and prey).

2. If the bait and prey proteins _____, they bring the DNA-binding and trans-activation domains in such close proximity that they reconstitute the function of the transcription activator, turning _____ the expression of a reporter gene as a result.

Which of the following is not the correct pair of blanks?

a) physically interact, on

b) do not interact, on

c) do not interact, off

d) stop interacting, off

Answer: b

Explanation: Molecular constructs are made such that each of the two domains is covalently attached to each of the two candidate proteins. If the two candidate proteins do not interact, the reporter gene expression remains switched off.

3. Which of the following is untrue regarding the classic yeast two-hybrid method?

a) Protein–protein interaction networks of yeast and a small number of other species have been subsequently determined using this method

b) This technique is a high throughput approach

c) Each bait and prey construct has to be prepared individually to map interactions between all proteins

d) It has been systematically applied to study interactions at the whole proteome level

Answer: b

Explanation: This technique is essentially a low throughput approach. A major flaw in this method is that it is an indirect approach to probe protein–protein interaction and has a tendency to generate false positives (spurious interactions) and false negatives (undetected interactions). It has been estimated from proteome-wide characterizations that the rate of false positives can be as high as 50%.

4. An alternative approach to determining protein–protein interactions is to use a large-scale affinity purification technique that involves attaching fusion tags to proteins and purifying the associated protein complexes in an affinity chromatography column.

a) True

b) False

Answer: a

Explanation: The purified proteins are then analyzed by gel electrophoresis followed by MS for identification of the interacting components.

The protein microarray systems mentioned above also provide a high throughput alternative for studying protein–protein interactions.

5. Which of the following is untrue regarding predicting interactions based on domain fusion?

a) It is based on gene fusion events

b) This way of predicting protein–protein interactions is called the “Rosetta stone” method

c) A fused protein often reveals relationships between its domain components

d) A fused protein doesn’t necessarily reveal relationships between its domain components

Answer: d

Explanation: The rationale goes like this: if A and B exist as interacting domains in a fusion protein in one proteome, the gene encoding the protein is a fusion gene. Their homologous gene sequences A and B existing separately in another genome most likely encode proteins interacting to perform a common function. Conversely, if ancestral genes A and B encode interacting proteins, they may have a tendency to be fused together in other genomes during evolution to enhance their effectiveness.
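The rationale can be turned into a simple search. In the sketch below each protein is represented only by the domain families it contains, which is a strong simplification of real homology searches. All gene, protein, and domain names are hypothetical, and `predict_interactions` is an illustrative helper rather than the implementation of any published tool.

```python
from itertools import combinations


def predict_interactions(query, reference):
    """Predict interacting pairs in `query` from fused proteins in `reference`.

    Both arguments map a protein name to the list of its domain families.
    """
    # Index the query proteins by the domains they carry.
    by_domain = {}
    for protein, domains in query.items():
        for domain in domains:
            by_domain.setdefault(domain, set()).add(protein)

    predictions = set()
    for fused, domains in reference.items():
        # Every pair of distinct domains in a fused protein is a "Rosetta stone".
        for dom_a, dom_b in combinations(sorted(set(domains)), 2):
            for prot_a in by_domain.get(dom_a, ()):
                for prot_b in by_domain.get(dom_b, ()):
                    if prot_a != prot_b:
                        pair = tuple(sorted((prot_a, prot_b)))
                        predictions.add((pair, fused))
    return predictions


# Genome 1 has separate genes; genome 2 has the A and B domains fused.
genome1 = {"geneA": ["domA"], "geneB": ["domB"], "geneC": ["domC"]}
genome2 = {"fusedAB": ["domA", "domB"]}

predicted = predict_interactions(genome1, genome2)
```

Here `geneA` and `geneB` are predicted to interact because their domains appear together in `fusedAB`, while `geneC` has no fused counterpart and is left out.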

6. When the two domains are located in two different proteins, to preserve the same functionality, their close proximity and interaction have to be preserved as well.

a) True

b) False

Answer: a

Explanation: In this method, by studying gene/protein fusion events, protein–protein interactions can be predicted. This prediction rule has proven to be rather reliable and has since been successfully applied to a large number of proteins from both prokaryotes and eukaryotes.

7. The justification behind the Rosetta stone method is that when two domains are fused in a single protein, they have to be in _____ proximity to perform a common function.

a) distant

b) close

c) extremely distant

d) extremely close

Answer: d

Explanation: When the two domains are located in two different proteins, to preserve the same functionality, their close proximity and interaction have to be preserved as well. Therefore, by studying gene/protein fusion events, protein–protein interactions can be predicted.

8. In predicting interactions based on gene neighbors, if a certain gene linkage is found to be indeed conserved across divergent genomes, it can be used as a strong indicator of formation of an operon that encodes proteins that are functionally and even physically coupled.

a) True

b) False

Answer: a

Explanation: This rule of predicting protein–protein interactions holds up for most prokaryotic genomes. For eukaryotic genomes, gene order may be a less potent predictor of protein interactions than a tight co-regulation for gene expression.

9. Which of the following is untrue regarding predicting interactions based on phylogenetic information?

a) Proteins do not operate as a complex

b) This method detects the co-presence or co-absence of orthologs across a number of genomes

c) Protein interactions can be predicted using phylogenetic profiles

d) Phylogenetic profiles are defined as patterns of gene pairs that are concurrently present or absent across genomes

Answer: a

Explanation: The logic behind the co-occurrence approach is that proteins normally operate as a complex. If one of the components of the complex is lost, it results in the failure of the entire complex. Under the selective pressure, the rest of the nonfunctional interacting partners in the complex are also lost during evolution because they have become functionally unnecessary.
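A phylogenetic profile can be written as a vector of 1s and 0s, one entry per genome, marking whether an ortholog is present. The sketch below pairs up proteins whose profiles match; the protein names, the five-genome profiles, and the function names are hypothetical, and exact matching is only the simplest possible similarity rule.

```python
from itertools import combinations


def mismatches(profile_a, profile_b):
    """Count genomes in which one protein is present and the other absent."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def co_occurring_pairs(profiles, max_mismatches=0):
    """Return protein pairs whose presence/absence patterns agree."""
    return [(a, b) for a, b in combinations(sorted(profiles), 2)
            if mismatches(profiles[a], profiles[b]) <= max_mismatches]


# Presence (1) or absence (0) of an ortholog in five hypothetical genomes.
profiles = {
    "P1": (1, 0, 1, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 1, 0, 1),
}

linked = co_occurring_pairs(profiles)
```

P1 and P2 are always present or absent together, so they are predicted to be functionally linked, while P3 follows a different pattern.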

10. Which of the following is untrue regarding STRING?

a) Search Tool for the Retrieval of Interacting Genes/Proteins

b) Functional associations include only the direct protein-protein interactions

c) It is based on combined evidence of gene linkage, gene fusion and phylogenetic profiles

d) It is a web server that predicts gene and protein functional associations

Answer: b

Explanation: Functional associations include both direct and indirect protein-protein interactions. Indirect interactions can mean enzymes in the same pathway sharing a common substrate or proteins regulating each other in the genetic pathway.

11. Questions & Answers on Molecular Phylogenetics

Phylogenetics Basics

1. Phylogenetics is the study of the evolutionary history of living organisms using treelike diagrams to represent pedigrees of these organisms.

a) True

b) False

Answer: a

Explanation: Tree branching patterns representing the evolutionary divergence are referred to as phylogeny. Phylogenetics can be studied in various ways. It is often studied using fossil records, which contain morphological information about ancestors of current species and the timeline of divergence.

2. The descriptions of morphological traits are often _____, which are due to _____.

a) ambiguous, multiple genetic factors

b) lucid, more than one genetic factors

c) clear, multiple genetic factors

d) ambiguous, one or two genetic factors

Answer: a

Explanation: Thus, using fossil records to determine phylogenetic relationships can often be biased. For microorganisms, fossils are essentially nonexistent, which makes it impossible to study phylogeny with this approach.

3. Which of the following is incorrect regarding the advantages of Molecular data for phylogenetics study?

a) They are more numerous than fossil records

b) They are easier to obtain as compared to fossil records

c) Sampling bias is involved

d) More clear-cut and robust phylogenetic trees can be constructed with the molecular data

Answer: c

Explanation: There is no sampling bias involved, which helps to mend the gaps in real fossil records. Therefore, they have become favorite and sometimes the only information available for researchers to reconstruct evolutionary history. The advent of the genomic era with tremendous amounts of molecular sequence data has led to the rapid development of molecular phylogenetics.

4. To use molecular data to reconstruct evolutionary history requires making a number of reasonable assumptions. Which of the following is incorrect about it?

a) The molecular sequences used in phylogenetic construction are homologous

b) The molecular sequences used in phylogenetic construction share a common origin

c) Phylogenetic divergence cannot be bifurcating

d) Parent branch splits into two daughter branches at any given point.

Answer: c

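Explanation: Options c and d contradict each other. Phylogenetic divergence is assumed to be strictly bifurcating, meaning that a parent branch splits into two daughter branches at any given point. Another assumption in phylogenetics is that each position in a sequence evolved independently. The variability among sequences is also assumed to be sufficiently informative for constructing unambiguous phylogenetic trees.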

5. Building phylogenetic tree involves bifurcation and multifurcation.

a) True

b) False

Answer: a

Explanation: Sometimes, a branch point on a phylogenetic tree may have more than two descendants, resulting in a multifurcating node. Multifurcation is normally a result of insufficient evidence to fully resolve the tree or a result of an evolutionary process known as radiation.

6. Which of the following is incorrect regarding the terminologies of phylogenetics?

a) The connecting point where two adjacent branches join is called a node

b) Node represents an inferred ancestor of extant taxa

c) At the tips of the branches are long lost species or sequences

d) The lines in the tree are called branches

Answer: c

Explanation: At the tips of the branches are present-day species or sequences known as taxa (the singular form is taxon) or operational taxonomic units. The bifurcating point at the very bottom of the tree is the root node, which represents the common ancestor of all members of the tree.

7. Which of the following is incorrect regarding the terminologies of phylogenetics?

a) A group of taxa descended from a single common ancestor is defined as a clade or monophyletic group

b) In a monophyletic group, two taxa share a unique common ancestor shared by other taxa as well

c) Lineage is often synonymous with a tree branch leading to a defined monophyletic group

d) When a number of taxa share more than one closest common ancestors, they do not fit the definition of a clade. In this case, they are referred to as paraphyletic

Answer: b

Explanation: In a monophyletic group, two taxa share a unique common ancestor not shared by any other taxa. They are also referred to as sister taxa to each other. The branch path depicting an ancestor–descendant relationship on a tree is called a lineage.

8. Which of the following is incorrect regarding the terminologies of phylogenetics?

a) The branching pattern in a tree is called tree topology

b) When all branches bifurcate on a phylogenetic tree, it is referred to as dichotomy

c) In case of dichotomy, each ancestor divides and gives rise to multiple descendants

d) An unrooted phylogenetic tree does not assume knowledge of a common ancestor View Answer

Answer: c

Explanation: In a dichotomy, each ancestor divides and gives rise to exactly two descendants. Sometimes, however, a branch point on a phylogenetic tree may have more than two descendants, resulting in a multifurcating node. A phylogeny with multifurcating branches is called a polytomy. A polytomy can be a result of either an ancestral taxon giving rise to more than two immediate descendants simultaneously during evolution, a process known as radiation, or an unresolved phylogeny in which the exact order of bifurcations cannot be determined precisely.

9. Because there is no indication of which node represents an ancestor, there is no direction of an evolutionary path in an unrooted tree.

a) True

b) False View Answer

Answer: a

Explanation: To define the direction of an evolution path, a tree must be rooted. In a rooted tree, all the sequences under study have a common ancestor or root node from which a unique evolutionary path leads to all other nodes.

10. Molecular clock is an assumption by which molecular sequences evolve at varying rates.

a) True

b) False View Answer

Answer: b

Explanation: Molecular clock is an assumption by which molecular sequences evolve at constant rates so that the amount of accumulated mutations is proportional to evolutionary time. Based on this hypothesis, branch lengths on a tree can be used to estimate divergence time. This assumption of uniformity of evolutionary rates, however, rarely holds true in reality.

Gene Phylogeny Versus Species Phylogeny

1. A gene phylogeny only describes the evolution of a particular gene or encoded protein.

a) True

b) False View Answer

Answer: a

Explanation: One of the objectives of building phylogenetic trees based on molecular sequences is to reconstruct the evolutionary history of the species involved. However, strictly speaking, a gene phylogeny (phylogeny inferred from a gene or protein sequence) only describes the evolution of that particular gene or encoded protein.

2. Evolution of a particular sequence ______ correlate with the evolutionary path of the species.

a) does not

b) always

c) does not necessarily

d) invariably View Answer

Answer: c

Explanation: The sequence may evolve more or less rapidly than other genes in the genome or may have a different evolutionary history from the rest of the genome owing to horizontal gene transfer events. Thus, the evolution of a particular sequence does not necessarily correlate with the evolutionary path of the species.

3. The species evolution is the ______ of evolution by ______ in a genome.

a) combined result, multiple genes

b) result, single genes

c) result, sole genes

d) distinct results, single gene View Answer

Answer: a

Explanation: The species evolution is the combined result of evolution by multiple genes in a genome. In a species tree, the branching point at an internal node represents the speciation event whereas, in a gene tree, the internal node indicates a gene duplication event. The two events may or may not coincide.

4. To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed

a) True

b) False View Answer

Answer: a

Explanation: To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed to give an overall assessment of the species evolution.

[Figure: Phylogenetic trees drawn as cladograms (top) and phylograms (bottom).]

5. It is often desirable to define the root of a tree. There are two ways to define the root of a tree. One is to use an outgroup, which

a) is a sequence that is homologous to the sequences under consideration

b) is separated from those sequences at an early evolutionary time

c) is generally determined from independent sources of information

d) is generally determined from similar or related sources of information View Answer

Answer: d

Explanation: Options c and d contradict each other, and c is the correct description: an outgroup is generally determined from independent sources of information. For example, a bird sequence can be used as a root for the phylogenetic analysis of mammals based on multiple lines of evidence that indicate that birds branched off prior to all mammalian taxa in the ingroup. Outgroups are required to be distinct from the ingroup sequences, but not too distant from the ingroup.

6. Which of the following is incorrect statement about the Kimura model?

a) It is a model to correct evolutionary distances and is a more sophisticated model

b) In this, the mutation rates for transitions and transversion are assumed to be different

c) According to this model, transitions occur more frequently than transversions

d) According to this model, transversions occur more frequently than transitions View Answer

Answer: d

Explanation: In the Kimura model, transitions are assumed to occur more frequently than transversions, which provides a more realistic estimate of evolutionary distances. The Kimura model uses the following formula: dAB = −(1/2) ln(1 − 2pti − ptv) − (1/4) ln(1 − 2ptv). Here, dAB is the evolutionary distance between sequences A and B, pti is the observed frequency of transitions, and ptv is the observed frequency of transversions.
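
The Kimura formula quoted above can be evaluated with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and it assumes two aligned DNA sequences of equal length with no gaps or ambiguous bases.

```python
import math

PURINES = {"A", "G"}  # C and T are treated as pyrimidines

def observed_frequencies(seq_a, seq_b):
    """Return (p_ti, p_tv): observed transition and transversion frequencies.

    Assumes gap-free aligned sequences of equal length (an assumption of
    this sketch, not a requirement stated in the quiz).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y:
            continue
        # purine<->purine or pyrimidine<->pyrimidine is a transition
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n

def kimura_distance(p_ti, p_tv):
    """dAB = -(1/2) ln(1 - 2*p_ti - p_tv) - (1/4) ln(1 - 2*p_tv)."""
    a = 1.0 - 2.0 * p_ti - p_tv
    b = 1.0 - 2.0 * p_tv
    if a <= 0 or b <= 0:
        raise ValueError("sites are saturated; the distance is undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)
```

For example, `kimura_distance(0.1, 0.05)` gives a corrected distance of about 0.17, which is larger than the observed difference of 0.15 because hidden substitutions are accounted for.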

7. Which of the following is incorrect statement about Choosing Substitution Models?

a) There is one substitution at a particular position, in divergent sequences

b) The evolutionary divergence is beyond the ability of the statistical models to correct

c) The statistical models used to correct homoplasy are called substitution models or evolutionary models

d) For constructing DNA phylogenies, there are nucleotide substitution models available View Answer

Answer: a

Explanation: The caveat of using these models is that if there are too many multiple substitutions at a particular position, which is often true for very divergent sequences, the position may become saturated. This means that the evolutionary divergence is beyond the ability of the statistical models to correct. In this case, true evolutionary distances cannot be derived.

Therefore, only reasonably similar sequences are to be used in phylogenetic comparisons.

8. The second step in phylogenetic analysis is to construct sequence alignment. This is probably the most critical step in the procedure because it establishes positional correspondence in evolution.

a) True

b) False View Answer

Answer: a

Explanation: Incorrect alignment leads to systematic errors in the final tree or even a completely wrong tree. For that reason, it is essential that the sequences are correctly aligned. Multiple state-of-the-art alignment programs such as T-Coffee should be used. The alignment results from multiple sources should be inspected and compared carefully to identify the most reasonable one. Automatic sequence alignments almost always contain errors and should be further edited or refined if necessary.

Forms of Tree Representation

1. Which of the following is incorrect statement?

a) In a phylogram, the branch lengths represent the amount of evolutionary divergence

b) Trees like cladogram are said to be scaled

c) The scaled trees have the advantage of showing both the evolutionary relationships and information about the relative divergence time of the branches

d) In a cladogram, the external taxa line up neatly in a row or column View Answer

Answer: b

Explanation: In a cladogram, the external taxa line up neatly in a row or column. Their branch lengths are not proportional to the number of evolutionary changes and thus have no phylogenetic meaning. In such an unscaled tree, only the topology of the tree matters, which shows the relative ordering of the taxa.

2. Which of the following is incorrect statement about Newick Format?

a) It was designed to provide information of tree topology to computer programs without having to draw the tree itself

b) In this format, trees are represented by taxa excluded in nested parentheses

c) In this linear representation, each internal node is represented by a pair of parentheses

d) For a tree with scaled branch lengths, the branch lengths in arbitrary units are placed immediately after the name of the taxon separated by a colon

View Answer

Answer: b

Explanation: Trees are represented by taxa included in nested parentheses. In this linear representation, each internal node is represented by a pair of parentheses that enclose all members of a monophyletic group, separated by commas.
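
To make the nested-parentheses idea concrete, the sketch below writes a small tree as a Newick string. The in-memory representation is an assumption chosen for illustration (a leaf is a string, an internal node is a list of children, and a branch length is attached by wrapping a subtree in a `(subtree, length)` tuple); the function names are hypothetical.

```python
def _format_node(node):
    """Format a leaf (str) or an internal node (list of children)."""
    if isinstance(node, str):
        return node
    # each internal node becomes one pair of parentheses
    return "(" + ",".join(_format_branch(child) for child in node) + ")"

def _format_branch(node):
    """Format a subtree, appending ':length' when a length is attached."""
    if isinstance(node, tuple):  # (subtree, branch_length)
        subtree, length = node
        return f"{_format_node(subtree)}:{length}"
    return _format_node(node)

def newick_string(tree):
    """Return the Newick string for the whole tree, ending with ';'."""
    return _format_branch(tree) + ";"
```

With this convention, `newick_string([["A", "B"], ["C", "D"]])` yields `((A,B),(C,D));`, and `newick_string([("A", 0.1), ("B", 0.2)])` yields `(A:0.1,B:0.2);`, with each branch length placed after the taxon name and a colon.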

3. Sometimes a tree-building method may result in several equally optimal trees. A consensus tree can be built by showing the commonly resolved bifurcating portions and collapsing the ones that disagree among the trees, which results in a polytomy.

a) True

b) False View Answer

Answer: a

Explanation: Combining the nodes can be done either by strict consensus or by majority rule.

In a strict consensus tree, all conflicting nodes are collapsed into polytomies. In a consensus tree based on a majority rule, among the conflicting nodes, those that agree by more than 50% of the nodes are retained whereas the remaining nodes are collapsed into multifurcation.

4. The number of rooted trees (NR) for n taxa is

a) NR = (2n − 3)! / [2^(n+2) (n − 2)!]

b) NR = (2n − 3)! / [2^n (n − 2)!]

c) NR = (2n − 3)! / [2^(n−2) (n − 5)!]

d) NR = (2n − 3)! / [2^(n−2) (n − 2)!]

View Answer

Answer: d

Explanation: The number of potential tree topologies can be enormously large even with a moderate number of taxa. The increase of possible tree topologies follows an exponential function. In this formula, (2n − 3)! is a factorial, which is the product of the positive integers from 1 to 2n − 3. For example, 5! = 1 × 2 × 3 × 4 × 5 = 120.

5. For unrooted trees, the number of unrooted tree topologies (NU) is

a) NU = (2n − 5)! / [2^(n−3) (n − 5)!]

b) NU = (2n − 5)! / [2^(n−3) (n − 3)!]

c) NU = (2n − 5)! / [2^(−2) (n − 3)!]

d) NU = (2n − 5)! / [2^n (n − 3)!]

View Answer

Answer: b

Explanation: The number of possible topologies increases extremely rapidly with the number of taxa. For six taxa, there are 105 unrooted trees and 945 rooted trees. If there are ten taxa, there can be 2,027,025 unrooted trees and 34,459,425 rooted ones.
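
The counts quoted in this explanation can be checked directly from the two formulas, NR = (2n − 3)! / [2^(n−2) (n − 2)!] and NU = (2n − 5)! / [2^(n−3) (n − 3)!]. The following is a small sketch; the function names are made up for illustration.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating tree topologies for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (6, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
# prints:
# 6 105 945
# 10 2027025 34459425
```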

6. It can be computationally very demanding to find a true phylogenetic tree when the number of sequences is large.

a) True

b) False View Answer

Answer: a

Explanation: Because the number of rooted topologies is much larger than that for unrooted ones, the search for a true phylogenetic tree can be simplified by calculating the unrooted trees first. Once an optimal tree is found, rooting the tree can be performed by designating a number of taxa in the data set as an outgroup based on external information to produce a rooted tree.

7. Which of the following is incorrect statement about Molecular Markers?

a) For studying very closely related organisms, protein sequences are preferred

b) The decision to use nucleotide or protein sequences depends on the purposes of the study

c) For constructing molecular phylogenetic trees, one can use either nucleotide or protein sequence data

d) The decision to use nucleotide or protein sequences depends on the properties of the sequences

View Answer

Answer: a

Explanation: The choice of molecular markers is an important matter because it can make a major difference in obtaining a correct tree. For studying very closely related organisms, nucleotide sequences, which evolve more rapidly than proteins, can be used. For example, for evolutionary analysis of different individuals within a population, noncoding regions of mitochondrial DNA are often used.

8. For studying the evolution of ______ divergent groups of organisms, one may choose either ______ nucleotide sequences, such as ribosomal RNA, or protein sequences.

a) less widely, slowly evolving

b) more widely, slowly evolving

c) more widely, rapidly evolving

d) less widely, rapidly evolving View Answer

Answer: b

Explanation: If the phylogenetic relationships to be delineated are at the deepest level, such as between bacteria and eukaryotes, using conserved protein sequences makes more sense than using nucleotide sequences.

9. In many cases, ______ sequences are preferable to ______ sequences because they are relatively ______ conserved.

a) protein, nucleotide, less

b) nucleotide, protein, less

c) protein, nucleotide, more

d) nucleotide, protein, more View Answer

Answer: c

Explanation: Protein sequences are preferable to nucleotide sequences because protein sequences are relatively more conserved as a result of the degeneracy of the genetic code in which sixty-one codons encode for twenty amino acids, meaning thereby a change in a codon may not result in a change in amino acid.

10. Protein sequences can remain the same while the corresponding DNA sequences have more room for variation.

a) True

b) False View Answer

Answer: a

Explanation: The protein sequences can remain the same while the corresponding DNA sequences have more room for variation, especially at the third codon position. The significant difference in evolutionary rates among the three nucleotide positions also violates one of the assumptions of tree-building. In contrast, the protein sequences do not suffer from this problem, even for divergent sequences.

11. DNA sequences are sometimes more biased than protein sequences because of preferential codon usage in different organisms.

a) True

b) False View Answer

Answer: a

Explanation: In this case, different codons for the same amino acid are used at different frequencies, leading to sequence variations not attributable to evolution. In addition, the genetic code of mitochondria varies from the standard genetic code. Therefore, for comparison of mitochondria protein-coding genes, it is necessary to translate the DNA sequences into protein sequences.

12. In the Jukes–Cantor model for correcting evolutionary distances, a formula for deriving evolutionary distances that include hidden changes is introduced by using a logarithmic function. It is

a) dAB = −(3/4) log[1 − (4/7)pAB].

b) dAB = −(3/4) ln[1 − (5/3)pAB].

c) dAB = −(3/4) log[1 − (4/3)pAB].

d) dAB = −(3/4) ln[1 − (4/3)pAB].

View Answer

Answer: d

Explanation: The simplest nucleotide substitution model is the Jukes–Cantor model, which assumes that all nucleotides are substituted with equal probability. Here, dAB is the evolutionary distance between sequences A and B, and pAB is the observed sequence distance measured by the proportion of substitutions over the entire length of the alignment.
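
As a quick illustration of the Jukes–Cantor correction dAB = −(3/4) ln[1 − (4/3)pAB], here is a minimal sketch. The function names are hypothetical, and the helper assumes gap-free aligned sequences of equal length.

```python
import math

def observed_distance(seq_a, seq_b):
    """pAB: proportion of differing sites over the alignment length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

def jukes_cantor_distance(p_ab):
    """dAB = -(3/4) ln[1 - (4/3) pAB]; undefined once pAB reaches 0.75."""
    if not 0.0 <= p_ab < 0.75:
        raise ValueError("pAB must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p_ab)
```

For an observed distance of 0.3, the corrected distance is about 0.383. The correction grows as pAB approaches 0.75, the point at which the sequences are saturated.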

12. Questions on Phylogenetic Tree Construction Methods and Programs

Distance Based Methods

1. Which of the following is untrue about distance based methods?

a) The computed evolutionary distances can be used to construct a matrix of distances between all individual pairs of taxa

b) Clustering is the only method among the algorithms for the distance-based tree-building method

c) The clustering-type algorithms compute a tree based on a distance matrix starting from the most similar sequence pairs

d) Based on the pairwise distance scores in the matrix, a phylogenetic tree can be constructed for all the taxa involved

View Answer

Answer: b

Explanation: The algorithms for the distance-based tree-building method can be subdivided into either clustering based or optimality based. The clustering-based algorithms include the unweighted pair group method using arithmetic average (UPGMA) and neighbor joining. The optimality-based algorithms compare many alternative tree topologies and select one that has the best fit between estimated distances in the tree and the actual evolutionary distances.

2. Which of the following is untrue about the Unweighted Pair Group Method Using Arithmetic Average?

a) The simplest clustering method is UPGMA, which builds a tree by a sequential clustering method

b) Given a distance matrix, it starts by grouping two taxa with the largest pairwise distance in the distance matrix

c) The distances between this new composite taxon and all remaining taxa are calculated to create a reduced matrix

d) The grouping process is repeated and another newly reduced matrix is created View Answer

Answer: b

Explanation: It starts by grouping two taxa with the smallest pairwise distance in the distance matrix. A node is placed at the midpoint or half distance between them. It then creates a reduced matrix by treating the new cluster as a single taxon.
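
The sequential clustering described here can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the input format (a dict keyed by `frozenset` pairs), the nested-tuple output, and the function name are all choices made for this example.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a nested tuple (left, right, node_height).
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # group the two clusters with the smallest pairwise distance
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        # the new node sits at half the distance between them
        merged = (a, b, d[frozenset((a, b))] / 2)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # reduced matrix: size-weighted average distance to every other cluster
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))
```

With distances d(A,B) = 2, d(A,C) = 6, and d(B,C) = 6, the sketch first joins A and B at height 1.0 and then joins that cluster with C at height 3.0.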

3. The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and that they are equally distant from the root, implying that a molecular clock is in effect.

a) True

b) False View Answer

Answer: a

Explanation: However, real data rarely meet this assumption. Thus, UPGMA often produces erroneous tree topologies. However, owing to its fast speed of calculation, it has found extensive usage in clustering analysis of DNA microarray data.

4. In the Neighbor Joining step, the UPGMA method uses unweighted distances and assumes that all taxa have constant evolutionary rates.

a) True

b) False View Answer

Answer: a

Explanation: Since this molecular clock assumption is often not met in biological sequences, to build more accurate phylogenetic trees, the neighbor joining (NJ) method can be used, which is somewhat similar to UPGMA in that it builds a tree by using stepwise reduced distance matrices. However, the NJ method does not assume the taxa to be equidistant from the root.

5. The NJ method corrects for unequal evolutionary rates between sequences by using a conversion step. This conversion requires the calculations of “r-values” and “transformed r-values” using the following formula

a) dAB’= dAB − 1/4 × (rA + rB)

b) dAB’= dAB − 1/2 × (rA + rB)

c) dAB’= dAB − 1/3 × (rA + rB)

d) dAB’= (dAB/3) − 1/2 × (rA + rB)

View Answer

Answer: b

Explanation: dAB’ is the converted distance between A and B and dAB is the actual evolutionary distance between A and B. The value of rA (or rB) is the sum of distances of A (or B) to all other taxa.

6. A generalized expression of the r-value is ri calculated based on the following formula

a) ri = ∑dij + dj2

b) ri = ∑dij

c) ri = ∑dij + di

d) ri = ∑dij + dj View Answer

Answer: b

Explanation: i and j are two different taxa. The r-values are needed to create a modified distance matrix. The transformed r-values (r’) are used to determine the distances of an individual taxon to the nearest node: ri’ = ri / (n − 2), where n is the total number of taxa.
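
The r-value calculations used in the neighbor joining conversion step can be written out as a short sketch. The formulas are taken as quoted in these questions (ri = ∑dij, the transformed value ri / (n − 2), and a converted distance of dAB − 1/2 × (rA + rB)). Because the text does not make explicit whether plain or transformed r-values enter the conversion, the sketch leaves that choice to the caller. The function names and the list-of-lists matrix format are assumptions.

```python
def r_values(dist):
    """ri = sum of distances from taxon i to all other taxa.

    dist: symmetric matrix (list of lists) with zeros on the diagonal.
    """
    return [sum(row) for row in dist]

def transformed_r_values(dist):
    """Transformed r-values: ri / (n - 2), where n is the number of taxa."""
    n = len(dist)
    if n <= 2:
        raise ValueError("need more than two taxa")
    return [r / (n - 2) for r in r_values(dist)]

def converted_distance(dist, r, a, b):
    """Converted distance dAB - 1/2 * (rA + rB) for taxa with indices a, b.

    r is whichever vector of r-values the caller decides to use.
    """
    return dist[a][b] - 0.5 * (r[a] + r[b])

# Hypothetical four-taxon distance matrix, used only as an example.
example = [
    [0, 5, 9, 9],
    [5, 0, 10, 10],
    [9, 10, 0, 8],
    [9, 10, 8, 0],
]
r_prime = transformed_r_values(example)  # [11.5, 12.5, 13.5, 13.5]
```

Using the transformed values for this example matrix, the converted distance is −7.0 for taxa 0 and 1 and −3.5 for taxa 0 and 2, so the pair (0, 1) has the shortest corrected distance and would be joined first.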

7. The tree construction process in NJ is somewhat similar to that used in UPGMA.

a) True

b) False View Answer

Answer: b

Explanation: Rather than building trees from the closest pair of branches and progressing to the entire tree, the NJ tree method begins with a completely unresolved star tree by joining all taxa onto a single node and progressively decomposes the tree by selecting pairs of taxa based on the above modified pairwise distances. This allows the taxa with the shortest corrected distances to be joined first as a node.

8. Which of the following is untrue about the Optimality-Based Methods?

a) The clustering-based methods produce multiple trees as output

b) Optimality-based methods select a tree that best fits the actual evolutionary distance matrix

c) There is no criterion in judging how this tree is compared to other alternative trees

d) Optimality-based methods have a well-defined algorithm to compare all possible tree topologies

View Answer

Answer: a

Explanation: The clustering-based methods produce a single tree as output. Based on the differences in optimality criteria, there are two types of algorithms, Fitch–Margoliash and minimum evolution, that are described next. The exhaustive search for an optimal tree necessitates a slow computation, which is a clear drawback especially when the dataset is large.

9. Which of the following is untrue about the Fitch–Margoliash?

a) Method selects a best tree among all possible trees based on minimal deviation between the distances calculated in the overall branches in the tree and the distances in the original dataset

b) It starts by randomly clustering two taxa in a node

c) It starts by creating three equations to describe the distances

d) The method searches for some specific tree topologies View Answer

Answer: d

Explanation: It solves the three algebraic equations for unknown branch lengths. The clustering of the two taxa helps to create a newly reduced matrix. This process is iterated until a tree is completely resolved. The method searches for all tree topologies and selects the one that has the lowest squared deviation of actual distances and calculated tree branch lengths.

10. Minimum evolution (ME) constructs a tree with a similar procedure, but uses a different optimality criterion that finds a tree among all possible trees with a minimum overall branch length. The optimality criterion relies on the formula S = ∑bi where bi is the (i)th branch length.

a) True

b) False View Answer

Answer: a

Explanation: Searching for the minimum total branch length is an indirect approach to achieving the best fit of the branch lengths with the original dataset. Analysis has shown that minimum evolution in fact slightly outperforms the least square-based FM method.

Character Based Methods

1. Which of the following is incorrect statement about Character-based methods?

a) They are also called discrete methods

b) They are based directly on the sequence characters rather than on pairwise distances

c) They do not count mutational events accumulated on the sequences

d) They may avoid the loss of information when characters are converted to distances View Answer

Answer: c

Explanation: They count mutational events accumulated on the sequences. This preservation of character information means that evolutionary dynamics of each character can be studied.

Ancestral sequences can also be inferred. The two most popular character-based approaches are the maximum parsimony (MP) and maximum likelihood (ML) methods.

2. Which of the following is incorrect statement about Maximum Parsimony Method?

a) By cutting off the unnecessary variables, model development may become difficult, and there may be more chances of introducing inconsistencies, ambiguities, and redundancies, hence, the name Occam’s razor

b) In dealing with problems that may have an infinite number of possible solutions, choosing the simplest model may help to ‘cut off’ those variables that are not really necessary to explain the phenomenon

c) This method chooses a tree that has the fewest evolutionary changes or shortest overall branch lengths

d) It is based on a principle related to a medieval philosophy called Occam’s razor View Answer

Answer: a

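Explanation: Cutting off unnecessary variables makes model development easier, with less chance of introducing inconsistencies, ambiguities, and redundancies. The theory was formulated by William of Occam in the fourteenth century and states that the simplest explanation is probably the correct one. This is because the simplest explanation requires the fewest assumptions and the fewest leaps of logic.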

3. Which of the following is incorrect statement about Building Work of MP tree?

a) It works by searching for all possible tree topologies and reconstructing ancestral sequences that require the minimum number of changes to evolve to the current sequences

b) Other than informative sites are non-informative, which are constant sites or sites that have changes occurring only once

c) Informative sites are the ones that can often be explained by a unique tree topology

d) Constant sites have the same state in all taxa and are quite useful in evaluating the various topologies

View Answer

Answer: d

Explanation: Constant sites have the same state in all taxa and are obviously useless in evaluating the various topologies. The sites that have changes occurring only once are not very useful either for constructing parsimony trees because they can be explained by multiple tree topologies. The non-informative sites are thus discarded in parsimony tree construction.
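
The distinction between constant, non-informative, and informative sites can be illustrated with a short sketch. It uses the commonly applied criterion that a site is parsimony-informative when at least two different character states each occur in at least two taxa. The function names are hypothetical, and the alignment is assumed to be a list of equal-length strings.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column as constant, informative, or non-informative."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # states that are shared by at least two taxa
    shared_states = [state for state, c in counts.items() if c >= 2]
    return "informative" if len(shared_states) >= 2 else "non-informative"

def informative_columns(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if classify_site(column) == "informative"
    ]

alignment = ["AAGA", "AAGA", "ACCA", "ACCT"]
print(informative_columns(alignment))  # prints [1, 2]
```

In this toy alignment, column 0 is constant and column 3 has a change occurring only once, so only columns 1 and 2 would be kept for parsimony tree construction.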

4. Because these ancestral character states are not known directly, multiple possible solutions may exist. In this case, the parsimony principle applies to choose the character states that result in a minimum number of substitutions.

a) True

b) False View Answer

Answer: a

Explanation: The inference of an ancestral sequence is made by first going from the leaves to internal nodes and to the common root to determine all possible ancestral character states. Then it goes back from the common root to the leaves to assign ancestral sequences that require the minimum number of substitutions.
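
The leaves-to-root pass described above is commonly implemented with Fitch's algorithm, which is named here as an assumption since the question does not name a specific algorithm. The sketch below covers only that first pass for a single site: it collects the possible ancestral states at each node and counts the minimum number of substitutions. The return pass from root to leaves is omitted. The tree format (nested 2-tuples of taxon names) and the function name are illustrative choices.

```python
def fitch_score(tree, states):
    """Bottom-up pass for one site.

    tree:   a taxon name (str) or a 2-tuple of subtrees.
    states: dict mapping taxon name -> observed character.
    Returns (set of possible states at this node, minimum substitutions).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    common = left_set & right_set
    if common:
        # the children agree on at least one state: no extra substitution
        return common, left_cost + right_cost
    # the children disagree: keep both possibilities and count one change
    return left_set | right_set, left_cost + right_cost + 1

tree = (("A", "B"), ("C", "D"))
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch_score(tree, site)[1])  # prints 1
```

For the same tree, a site with states G, T, G, T for taxa A to D needs two substitutions, so the first pattern supports this topology more strongly under the parsimony principle.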

5. The unweighted method treats all mutations as equivalent.

a) True

b) False View Answer

Answer: a

Explanation: This may be an oversimplification; mutations of some sites are known to occur less frequently than others, for example, transversions versus transitions, functionally important sites versus neutral sites. Therefore, a weighting scheme that takes into account the different kinds of mutations helps to select tree topologies more accurately. The MP method that incorporates a weighting scheme is called weighted parsimony.

6. Which of the following is incorrect statement about Tree-Searching Methods?

a) The choice of the first three taxa can be random

b) Parsimony method examines all possible tree topologies to find the maximally parsimonious tree.

c) It starts by building a three taxa unrooted tree, for which only one topology is available

d) This is different than exhaustive search method View Answer

Answer: d

Explanation: This is an exhaustive search method. The next step is to add a fourth taxon to the existing branches, producing three possible topologies. The remaining taxa are progressively added to form all possible tree topologies. Obviously, this brute-force approach only works if there are relatively few sequences.

7. Which of the following is incorrect statement about branch-and-bound?

a) It uses a shortcut to find an MP tree

b) It establishes an upper limit (or upper bound) for the number of allowed sequence variations

c) It solely uses UPGMA method

d) It starts by building a distance tree for all taxa involved View Answer

Answer: c

Explanation: It starts by building a distance tree for all taxa involved using either NJ or UPGMA and then computing the minimum number of substitutions for this tree. The resulting number defines the upper bound to which any other trees are compared. The rationale is that a maximally parsimonious tree must be equal to or shorter than the distance-based tree.

8. The branch-and-bound method starts building trees in a similar way as in the exhaustive method.

a) True

b) False View Answer

Answer: a

Explanation: The difference is that the previously established upper bound limits the tree growth. Whenever the overall tree length at any stage exceeds the upper bound, the topology search in that particular direction aborts. By doing so, it dramatically reduces the number of trees considered, and hence the computing time, while still guaranteeing to find the most parsimonious tree.

9. When the number of taxa exceeds twenty, even the branch-and-bound method becomes computationally unfeasible.

a) True

b) False View Answer

Answer: a

Explanation: A more heuristic search method must be used. A computer heuristic procedure is an approximation strategy to find an empirical solution for a complicated problem. This strategy generates quick answers, but not necessarily the best answer.

10. In a heuristic tree search, only a small subset of all possible trees is examined.

a) True

b) False View Answer

Answer: a

Explanation: This method starts by carrying out a quick initial approximation, which is to build an NJ tree and subsequently modifying it slightly into a different topology to see whether that leads to a shorter tree.

Phylogenetic Tree Evaluation

1. Which of the following is incorrect about Bootstrapping?

a) It is a statistical technique that tests the sampling errors of a phylogenetic tree

b) It does the tests by repeatedly sampling trees through slightly perturbed datasets

c) A newly constructed tree is not biased at all

d) The robustness of the original tree can be assessed here View Answer

Answer: c

Explanation: The rationale for bootstrapping is that a newly constructed tree is possibly biased owing to incorrect alignment or chance fluctuations of distance measurements. To determine the robustness or reproducibility of the current tree, trees are repeatedly constructed with slightly perturbed alignments that have some random fluctuations introduced.

2. A truly robust phylogenetic relationship should have enough characters to support the relationship even if the dataset is perturbed in such a way.

a) True

b) False View Answer

Answer: a

Explanation: Otherwise, the noise introduced in the resampling process is sufficient to generate different trees, indicating that the original topology may be derived from weak phylogenetic signals. Thus, this type of analysis gives an idea of the statistical confidence of the tree topology.

3. Which of the following is incorrect about nonparametric bootstrapping?

a) A new multiple sequence alignment of the same length is generated with random duplication of some of the sites

b) A new multiple sequence alignment of the distinct lengths is generated with random duplication of some of the sites

c) Certain sites are randomly replaced by other existing sites

d) Certain sites may appear multiple times, and other sites may not appear at all in the new alignment

View Answer

Answer: b

Explanation: In nonparametric bootstrapping, a new multiple sequence alignment of the same length is generated with random duplication of some of the sites (i.e., the columns in an alignment) at the expense of some other sites. This process is repeated 100 to 1,000 times to create 100 to 1,000 new alignments that are used to reconstruct phylogenetic trees using the same method as the originally inferred tree.
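
The column resampling behind nonparametric bootstrapping can be sketched as follows. This is a minimal illustration, assuming the alignment is a list of equal-length strings; the function name is hypothetical, and tree reconstruction from each replicate is left out.

```python
import random

def bootstrap_alignment(alignment, rng=None):
    """Return one bootstrap replicate of the same length as the original.

    Columns (sites) are drawn with replacement, so some sites appear
    several times and others not at all.
    """
    rng = rng or random.Random()
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in alignment]

alignment = ["ACGTAC", "ACGTTC", "ACCTAG"]
replicates = [bootstrap_alignment(alignment, random.Random(seed)) for seed in range(100)]
```

Each of the 100 replicates here has the same number of sequences and sites as the original, and every column in a replicate is a copy of some original column. The fixed seeds are used only to make the example reproducible.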

4. Which of the following is incorrect about nonparametric bootstrapping?

a) All the bootstrapped trees are summarized into a consensus tree based on a majority rule

b) The most supported branching patterns shown at each node are labeled with bootstrap values

c) The most supported branching patterns are the percentage of appearance of a particular clade.

d) This test doesn’t provide a measure for evaluating the confidence levels of the tree topology.

View Answer

Answer: d

Explanation: The bootstrap test provides a measure for evaluating the confidence levels of the tree topology. Analysis has shown that a bootstrap value of 70% approximately corresponds to 95% statistical confidence, although the issue is still a subject of debate.
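
Bootstrap values are simply the percentage of replicate trees in which a clade appears. The sketch below assumes each bootstrapped tree has already been reduced to the set of clades it contains, with each clade given as a `frozenset` of taxon names. That representation and the function name are assumptions for this example.

```python
from collections import Counter

def clade_support(replicate_trees):
    """Percentage of replicate trees that contain each clade.

    replicate_trees: list of sets of clades, each clade a frozenset of taxa.
    """
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    total = len(replicate_trees)
    return {clade: 100.0 * n / total for clade, n in counts.items()}

trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC")},
]
support = clade_support(trees)
print(support[frozenset("AB")])  # prints 75.0
```

Under a majority rule, the clade {A, B} (75%) would be retained in the consensus tree, while {C, D} (50%) and {A, C} (25%) would not exceed the 50% threshold.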

5. Which of the following is incorrect about Caveats?

a) Unusually high GC content in the original dataset is the potential cause for generating biased trees

b) Unusually accelerated evolutionary rates is the potential cause for generating biased trees

c) Unusually accelerated evolutionary rates is the potential cause for generating biased bootstrap estimates

d) A large number of bootstrap resampling steps is not needed to achieve meaningful results

View Answer

Answer: d

Explanation: In addition, from a statistical point of view, a large number of bootstrap resampling steps are needed to achieve meaningful results. It is generally recommended that a phylogenetic tree should be bootstrapped 500 to 1,000 times. However, this presents a practical dilemma.

6. Which of the following is an incorrect statement about jackknifing?

a) In this method one half of the sites in a dataset are randomly deleted

b) It creates datasets half as long as the original

c) Each new dataset is subjected to phylogenetic tree construction using a different method from the original

d) One criticism of this approach is that the size of datasets has been changed into one half and that the datasets are no longer considered replicates

View Answer

Answer: c

Explanation: Each new dataset is subjected to phylogenetic tree construction using the same method as the original. The advantage of jackknifing is that sites are not duplicated relative to the original dataset and that computing time is much shortened because of shorter sequences.
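
Jackknifing differs from bootstrapping only in how the sites are resampled, as the following sketch shows. It assumes the alignment is a list of equal-length strings and keeps half of the sites, rounded down for odd lengths, which is a choice made for this example. The function name is hypothetical.

```python
import random

def jackknife_alignment(alignment, rng=None):
    """Return a replicate with half of the sites randomly deleted.

    Sites are sampled without replacement, so no column is duplicated.
    """
    rng = rng or random.Random()
    length = len(alignment[0])
    kept = sorted(rng.sample(range(length), length // 2))
    return ["".join(seq[i] for i in kept) for seq in alignment]

alignment = ["ACGTACGT", "ACGTTCGA", "ACCTAGGT"]
replicate = jackknife_alignment(alignment, random.Random(1))
```

The replicate has four sites instead of eight, which is why tree construction on jackknifed datasets is faster than on bootstrap replicates.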

7. Which of the following is incorrect about Bayesian Simulation?

a) It does not require bootstrapping

b) It requires bootstrapping

c) The MCMC procedure itself involves thousands or millions of steps of resampling

d) Posterior probabilities are assigned at each node of a best Bayesian tree as statistical support

View Answer

Answer: b

Explanation: Because of fast computational speed of MCMC tree searching, the Bayesian method offers a practical advantage over regular ML and makes the statistical evaluation of ML trees more feasible. Unlike bootstrap values, Bayesian probabilities are normally higher because most trees are sampled near a small number of optimal trees. Therefore, they have a different statistical meaning from bootstrap.

8. In phylogenetic analysis, it is also important to test whether two competing tree topologies can be distinguished and whether one tree is significantly better than the other.

a) True

b) False View Answer

Answer: a

Explanation: The task is different from bootstrapping in that it tests the statistical significance of the entire phylogeny, not just portions of it. For that purpose, several statistical tests have been developed specifically for each of the three types of tree reconstruction methods: distance, parsimony, and likelihood. A test devised specifically for MP trees is called the Kishino–Hasegawa (KH) test.

9. The KH test sets out to test the null hypothesis that the two competing tree topologies are not significantly different.

a) True

b) False View Answer

Answer: a

Explanation: A paired Student t-test is used to assess whether the null hypothesis can be rejected at a statistically significant level. In this test, the difference of branch lengths at each informative site between the two trees is calculated.

10. In the Shimodaira–Hasegawa test, the degrees of freedom used for the analysis depend on the substitution model used. The test relies on the formula d = 2(ln L_A − ln L_B) = 2 ln(L_A/L_B). Here, d is the log likelihood ratio score, and ln L_A and ln L_B are the log likelihood scores for tree A and tree B, respectively.

a) True

b) False View Answer

Answer: a

Explanation: A frequently used statistical test for ML trees is the Shimodaira–Hasegawa (SH) test (likelihood ratio test). It tests the goodness of fit of two competing trees using the χ2 test. For this test, log likelihood scores of two competing trees have to be obtained first.
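
The test statistic above can be sketched in a few lines. This is only an illustration: the log likelihood values are made up, the function names are hypothetical, and the p-value helper assumes one degree of freedom purely for the example (the appropriate degrees of freedom depend on the substitution model, as stated in the question).

```python
import math

def likelihood_ratio_statistic(ln_l_a, ln_l_b):
    """d = 2 (ln L_A - ln L_B), computed from two log likelihood scores."""
    return 2.0 * (ln_l_a - ln_l_b)

def chi2_p_value_one_df(d):
    """Upper-tail chi-square probability, ASSUMING one degree of freedom."""
    return math.erfc(math.sqrt(d / 2.0))

# Hypothetical log likelihood scores for two competing trees.
d = likelihood_ratio_statistic(-1234.5, -1240.0)
p = chi2_p_value_one_df(d)
```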

Phylogenetic Programs

1. PAUP is a Macintosh program (UNIX version available in the GCG package) with a very user-friendly graphical interface.

a) True

b) False View Answer

Answer: a

Explanation: PAUP stands for Phylogenetic Analysis Using Parsimony. It is a commercial phylogenetic package available from Sinauer Publishers and is probably one of the most widely used phylogenetic programs. PAUP was originally developed as a parsimony program, but it has expanded into a comprehensive package that is capable of performing distance, parsimony, and likelihood analyses.

2. In PAUP, the distance options include NJ, ME, FM, and UPGMA.

a) True

b) False View Answer

Answer: a

Explanation: For distance or ML analyses, PAUP has the option for detailed specifications of substitution models, base frequencies, and among site rate heterogeneity (γ -shape parameters, proportion of invariant sites). PAUP is also able to perform nonparametric bootstrapping, jackknifing, KH testing, and SH testing.

3. Phylip stands for Phylogenetic inference package (by Joe Felsenstein)

a) True

b) False View Answer

Answer: a

Explanation: PHYLIP is a free, multiplatform, comprehensive package containing thirty-five subprograms for performing distance, parsimony, and likelihood analysis, as well as bootstrapping, for both nucleotide and amino acid sequences.

4. In PAUP, to complete an analysis the user is not required to move between different subprograms while repeatedly renaming the intermediate output files.

a) True

b) False View Answer

Answer: a

Explanation: This is a contrast with PHYLIP, whose only problem is that to complete an analysis the user is required to move between different subprograms while repeatedly renaming the intermediate output files. PHYLIP is command-line based, but relatively easy to use for each single program.

5. Which of the following is untrue regarding TREE-PUZZLE?

a) It is a program performing quartet puzzling

b) It allows various substitution models for likelihood score estimation

c) It doesn’t incorporate a discrete γ model

d) Because of the heuristic nature of the program, it allows ML analyses of large datasets

View Answer

Answer: c

Explanation: The advantage is that it allows various substitution models for likelihood score estimation. Also, it incorporates a discrete γ model for rate heterogeneity among sites.

6. Which of the following is untrue regarding TREE-PUZZLE?

a) The resulting puzzle trees are automatically assigned puzzle support values to internal branches

b) The support values are percentages of consistent quartet trees

c) The support values do not have the same meaning as bootstrap values

d) The support values have the same meaning as bootstrap values View Answer

Answer: d

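Explanation: The puzzle support values are percentages of consistent quartet trees, so they do not have the same meaning as bootstrap values. Because of the heuristic nature of the program, it allows ML analyses of large datasets. TREE-PUZZLE version 5.0 is available for Mac, UNIX, and Windows.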

7. PHYML is a web-based ______ program using the ______.

a) phylogenetic, GA (Genetic Algorithm )

b) sequence based alignment, GA (Genetic Algorithm )

c) phylogenetic, dynamic programming

d) sequence based alignment, dynamic programming View Answer

Answer: a

Explanation: It first builds an NJ tree. Further it uses it as a starting tree for subsequent iterative refinement through subtree swapping. Branch lengths are simultaneously optimized during this process.

8. In PHYML, the tree searching ______ when the total ML score no longer ______.

a) ceases, increases

b) stops, decreases

c) terminates, decreases

d) stops, increases View Answer

Answer: d

Explanation: PHYML is a web-based phylogenetic program using the GA. The main advantage of this program is the ability to build trees from very large datasets with hundreds of taxa and to complete tree searching within a relatively short time frame.

9. MrBayes is a Bayesian phylogenetic inference program.

a) True

b) False View Answer

Answer: a

Explanation: It randomly samples tree topologies using the MCMC procedure. Next it infers the posterior distribution of tree topologies.

10. MrBayes has a range of probabilistic models available to search for a set of trees with the lowest posterior probability.

a) True

b) False View Answer

Answer: b

Explanation: MrBayes has a range of probabilistic models available to search for a set of trees with the highest posterior probability. It is fast and capable of handling large datasets. The program is available in multi platform versions. A web program that also employs Bayesian inference for phylogenetic analysis is BAMBE.

Maximum Parsimony Method

1. Which of the following is untrue regarding the maximum parsimony method?

a) This method predicts the evolutionary tree

b) It minimizes the number of steps required to generate the observed variation in the sequences

c) The method is also sometimes referred to as the minimum evolution method

d) Only a pairwise sequence alignment is required to predict which sequence positions are likely to correspond

View Answer

Answer: d

Explanation: A multiple sequence alignment is required to predict which sequence positions are likely to correspond. These positions will appear in vertical columns in the multiple sequence alignment. For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified.
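
To make the idea of counting changes per aligned position concrete, here is a small sketch that scores one alignment column on two candidate trees. It uses Fitch's small-parsimony counting, which is a swapped-in standard technique and is not named in the text above; the tree encoding (nested tuples of leaf names), the function name fitch, and the toy column are all assumptions made for the example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf names;
    `states` maps each leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1

# One hypothetical alignment column for four sequences.
column = {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}
tree_1 = (("s1", "s2"), ("s3", "s4"))
tree_2 = (("s1", "s3"), ("s2", "s4"))
changes_1 = fitch(tree_1, column)[1]  # 1 change
changes_2 = fitch(tree_2, column)[1]  # 2 changes
```

For this column the first tree is the more parsimonious one; a full analysis would sum such counts over every column and compare the totals across trees.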

2. Which of the following is untrue regarding the maximum parsimony method?

a) The analysis steps are continued for every position in the sequence alignment

b) This method is used for large numbers of sequences

c) Those trees that produce the smallest number of changes overall for all sequence positions are identified

d) This method is used for sequences that are quite similar View Answer

Answer: b

Explanation: The algorithm followed is not particularly complicated, but it is guaranteed to find the best tree, because all possible trees relating a group of sequences are examined. For this reason, the method is quite time-consuming and is not useful for data that include a large number of sequences or sequences with a large amount of variation.

3. Which of the following is untrue regarding the programs for analysis of nucleic acid sequences?

a) DNAPARS treats gaps as a fifth nucleotide state.

b) DNAPENNY performs parsimonious phylogenies by branch-and-bound search

c) DNAPENNY can analyze up to 11 or 12 sequences

d) Compatibility criterion is not involved in DNACOMP View Answer

Answer: d

Explanation: DNACOMP performs phylogenetic analysis using the compatibility criterion. Rather than searching for overall parsimony at all sites in the multiple sequence alignment, this method finds the tree that supports the largest number of sites. This method is recommended when the rate of evolution varies among sites.

4. PROTPARS counts the minimum number of mutations to change a codon for the first amino acid into a codon for the second amino acid, but only scores those mutations in the mutational path that actually change the amino acid.

a) True

b) False View Answer

Answer: a

Explanation: PROTPARS is used for the analysis of protein sequences. As mentioned, silent mutations that do not change the amino acid are not scored, on the grounds that they have little evolutionary significance.

5. Parsimony can give ______ information when rates of sequence change ______ in the different branches of a tree that are represented by the sequence data.

a) misleading, vary

b) useful, change

c) misleading, are constant

d) sometimes contradicting, are constant View Answer

Answer: a

Explanation: These variations produce a range of branch lengths, long ones representing more extended periods of time and short ones representing shorter times. Although other columns in the sequence alignment that show less variation may provide the correct tree, the columns representing greater variation dominate the analysis.

6. Which of the following is untrue regarding the distance methods?

a) The sequence pairs that have the largest number of sequence changes between them are termed ‘neighbors’

b) On a tree, these sequences share a node or common ancestor position and are each joined to that node by a branch

c) It produces a phylogenetic tree of the group

d) It employs the number of changes between each pair in a group of sequences View Answer

Answer: a

Explanation: The goal of distance methods is to identify a tree that positions the neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible. Finding the closest neighbors among a group of sequences by the distance method is often the first step in producing a multiple sequence alignment.

7. Which of the following is untrue regarding the distance methods?

a) The distance method was pioneered by Feng and Doolittle

b) A collection of programs by authors Feng and Doolittle will produce both an alignment and tree of a set of protein sequences

c) The program CLUSTALW uses the neighbor-joining distance method as a guide to multiple sequence alignments

d) Among the Programs of the PHYLIP package, DNADIST is not one of them View Answer

Answer: d

Explanation: DNADIST and PROTDIST are the programs of the PHYLIP package that perform a distance analysis. They automatically read in a sequence file in the PHYLIP infile format and automatically produce a file called outfile containing a distance table.

8. Which of the following is untrue regarding the Distance analysis programs in PHYLIP?

a) FITCH estimates a phylogenetic tree assuming additivity of branch lengths

b) FITCH uses the Fitch-Margoliash method

c) FITCH assumes a molecular clock but KITSCH does not

d) NEIGHBOR estimates phylogenies using the neighbor-joining or unweighted pair group method with arithmetic mean (UPGMA)

View Answer

Answer: c

Explanation: KITSCH assumes a molecular clock but FITCH does not. Also, in NEIGHBOR the neighbor-joining method does not assume a molecular clock and produces an unrooted tree. The UPGMA method assumes a molecular clock and produces a rooted tree.

9. Which of the following is untrue regarding the neighbor-joining method?

a) It is very much like the Fitch-Margoliash method

b) It is totally dissimilar to the Fitch-Margoliash method

c) It is especially suitable when the rate of evolution of the separate lineages under consideration varies

d) When the branch lengths of trees of known topology are allowed to vary in a manner that simulates varying levels of evolutionary change, it is the most reliable method View Answer

Answer: b

Explanation: The neighbor-joining method is very much like the Fitch-Margoliash method, except that the choice of which sequences to pair is determined by a different algorithm. In the situation mentioned in option d, the neighbor-joining method and the Sattath and Tversky method are the most reliable in predicting the correct tree.

10. Neighbor-joining chooses the sequences that should be joined to give the best least-squares estimates of the branch lengths that most closely reflect the actual distances between the sequences.

a) True

b) False View Answer

Answer: a

Explanation: It is not necessary to compare all possible trees to find the least squares fit as in the Fitch-Margoliash method. The method pairs sequences based on the effect of the pairing on the sum of the branch lengths of the tree.
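
The pairing step can be sketched as follows. This assumes the commonly used Q criterion, Q(i, j) = (n − 2) d(i, j) − Σ d(i, k) − Σ d(j, k), as the measure of how a pairing affects the total branch length; the function names and the five-taxon distance matrix are hypothetical and serve only as an illustration.

```python
def nj_q_matrix(dist):
    """Compute the neighbor-joining Q matrix from a symmetric distance matrix."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
    return q

def pick_pair(q):
    """Return the index pair (i, j), i < j, with the smallest Q value."""
    n = len(q)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: q[p[0]][p[1]])

# Hypothetical pairwise distances among five sequences a..e.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
q = nj_q_matrix(dist)
first_pair = pick_pair(q)  # (0, 1): sequences a and b are joined first
```

Note that a and b are chosen even though d and e have the smallest raw distance, which is the point of correcting each distance by the row sums.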

The Maximum Likelihood Approach

1. Which of the following is wrong statement about the maximum likelihood approach?

a) This method doesn’t always involve probability calculations

b) It finds a tree that best accounts for the variation in a set of sequences

c) The method is similar to the maximum parsimony method

d) The analysis is performed on each column of a multiple sequence alignment View Answer

Answer: a

Explanation: This method involves probability calculations to find a tree that best accounts for the variation in a set of sequences. All possible trees are considered. Hence, the method is only feasible for a small number of sequences.

2. In the maximum likelihood approach, for each tree, the number of sequence changes or mutations that may have occurred to give the sequence variation is considered.

a) True

b) False View Answer

Answer: a

Explanation: Because the rate of appearance of new mutations is very small, the more mutations needed to fit a tree to the data, the less likely that tree (Felsenstein 1981). The maximum likelihood method resembles the maximum parsimony method in that trees with the least number of changes will be the most likely.

3. The maximum likelihood method can be used to explore relationships among more diverse sequences, conditions that are not well handled by maximum parsimony methods.

a) True

b) False View Answer

Answer: a

Explanation: The maximum likelihood method presents an additional opportunity to evaluate trees with variations in mutation rates in different lineages. Also it provides opportunity to use explicit evolutionary models such as the Jukes-Cantor and Kimura models with allowances for variations in base composition.

4. The main disadvantage of maximum likelihood methods is that they are

a) mathematically less folded

b) mathematically less complex

c) computationally lucid

d) computationally intense View Answer

Answer: d

Explanation: The main disadvantage of maximum likelihood methods is that they are computationally intense. However, with faster computers, the maximum likelihood method is seeing wider use and is being used for more complex models of evolution.

5. Maximum likelihood has also been used for an analysis of mutations in overlapping reading frames in viruses.

a) True

b) False View Answer

Answer: a

Explanation: PAUP version 4 can be used to perform a maximum likelihood analysis on DNA sequences. The method has also been applied for changes from one amino acid to another in protein sequences.

6. Which of the following is wrong statement about DNAML and DNAMLK?

a) PHYLIP includes mentioned two programs for this maximum likelihood analysis

b) DNAML estimates phylogenies from nucleotide sequences by the maximum likelihood method

c) DNAMLK estimates phylogenies in the same manner as DNAML

d) DNAMLK estimates phylogenies without molecular clock View Answer

Answer: d

Explanation: DNAMLK estimates phylogenies from nucleotide sequences by the maximum likelihood method in the same manner as DNAML, but assumes a molecular clock. DNAML allows for variable frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different rates of change in different categories of sites, as specified by the program.

7. Which of the following is wrong statement about the maximum likelihood method’s steps?

a) It starts with an evolutionary model of sequence change that provides estimates of rates of substitution of one base for another

b) In the beginning there is an evolutionary model of sequence change that provides estimates of transitions and transversions in a set of nucleic acid sequences

c) The rates of all possible substitutions are chosen so that the base composition differs

d) The set of sequences is then aligned View Answer

Answer: c

Explanation: The rates of all possible substitutions are chosen so that the base composition remains the same. The set of sequences is then aligned, and the substitutions in each column are examined for their fit to a set of trees that describe possible phylogenetic relationships among the sequences.

8. Once all positions in the sequence alignment have been examined, the likelihoods given by each column in the alignment for each tree are ______ to give the likelihood of the tree.

a) multiplied

b) added

c) divided

d) squared View Answer

Answer: a

Explanation: Because these likelihoods are very small numbers, their logarithms are usually added to give the logarithm likelihood of each tree. The most likely tree given the data is then identified.
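
The equivalence between multiplying column likelihoods and adding their logarithms can be checked with a short sketch. The per-column likelihood values below are invented for illustration, and tree_log_likelihood is a hypothetical helper name.

```python
import math

def tree_log_likelihood(column_likelihoods):
    """Sum the logs of per-column likelihoods to get the tree's log likelihood."""
    return sum(math.log(p) for p in column_likelihoods)

# Hypothetical per-column likelihoods for one candidate tree.
column_likelihoods = [1e-3, 5e-4, 2e-3, 1e-4]

log_l = tree_log_likelihood(column_likelihoods)
product = math.prod(column_likelihoods)  # direct product, tiny for long alignments
```

Working in log space matters in practice because the direct product of thousands of small numbers underflows to zero in floating-point arithmetic, while the sum of logs stays representable.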

9. A method of sequence alignment based on a Model (Bishop and Thompson 1986) was introduced that predicts the manner in which DNA sequences change during evolution. Which of the following is wrong about it?

a) The basis of this method is to devise a scheme for introducing substitutions, insertions, and gaps into sequences

b) The basis of this method is to provide a probability that each of these changes occurs over certain periods of evolutionary time

c) Given each of these predicted changes, the method examines all the possible combinations of mutations to change one sequence into another

d) Multiple combinations are selected that will be the most likely over time View Answer

Answer: d

Explanation: One of these combinations will be the most likely one over time and that is selected. Once this combination has been determined, a sequence alignment and the distance between the sequences will be known.

10. A method of sequence alignment based on a Model (Bishop and Thompson 1986) was introduced that predicts the manner in which DNA sequences change during evolution. Which of the following is wrong about it?

a) This method is different from the Smith-Waterman local alignment algorithm

b) This method is quite similar to the Smith-Waterman local alignment algorithm

c) The underlying mutational theory is like those used to produce the PAM matrices for predicting changes in DNA and protein sequences

d) Sequences are predicted to change by a Markov process such that each mutation in the sequence is independent of previous mutations at that site or at other sites

View Answer

Answer: b

Explanation: This method is different from the Smith-Waterman local alignment algorithm in that it identifies the most probable alignment (the maximum likelihood alignment) based on an evolutionary model of change in sequences, as opposed to a score based on observed substitutions in related proteins and a gap scoring system. An example for option d: a given nucleotide at any sequence position can mutate into another at the same rate or may not change at all during a period of evolutionary time.

Reliability of Phylogenetic Predictions

1. Phylogenetic analysis of a set of sequences that aligns ______ is straightforward because the positions that correspond in the sequences can be readily identified in a ______ of the sequences.

a) very well, multiple sequence alignment

b) in a haphazard manner, multiple sequence alignment

c) in a distorted way, multiple sequence alignment

d) very well, self alignment View Answer

Answer: a

Explanation: Option d, here, becomes irrelevant as there is phylogenetic analysis involved. The types of changes in the aligned positions or the numbers of changes in the alignments between pairs of sequences then provide a basis for a determination of phylogenetic relationships among the sequences by the above methods of phylogenetic analysis.

2. For sequences that have ______, a phylogenetic analysis is ______.

a) diverged considerably, more challenging

b) not diverged, more challenging

c) diverged considerably, less challenging

d) diverged considerably, a less work to do View Answer

Answer: a

Explanation: Clearly, option a and b contradict. For diverged sequences, the analysis steps increase as well. A determination of the sequence changes that have occurred is more difficult because the multiple sequence alignment may not be optimal and because multiple changes may have occurred in the aligned sequence positions.

3. The choice of a suitable multiple sequence alignment method depends on the degree of variation among the sequences.

a) True

b) False View Answer

Answer: a

Explanation: The degree of variation also sometimes affects the efficiency and nature of the output. Once a suitable alignment has been found, one may also ask how well the predicted phylogenetic relationships are supported by the data in the multiple sequence alignment.

4. In the bootstrap method, the data are resampled by ______ choosing ______ columns from the aligned sequences to produce, in effect, a new sequence alignment of the ______.

a) randomly, horizontal, same length

b) specifically, vertical, different lengths

c) randomly, vertical, same length

d) randomly, vertical, different lengths View Answer

Answer: c

Explanation: Each column of data may be used more than once and some columns may not be used at all in the new alignment. Trees are then predicted from many of these alignments of resampled sequences (Felsenstein 1988).
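
The resampling step can be sketched as below. This is a minimal illustration under the assumption that the alignment is a list of equal-length strings; bootstrap_alignment and the toy data are hypothetical names and values. Columns are drawn with replacement, so some appear more than once and others not at all, and the replicate has the same length as the original.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by sampling columns with replacement."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence so columns stay intact.
    return ["".join(seq[i] for i in picks) for seq in alignment]

# Toy alignment (hypothetical data): three sequences, ten sites each.
alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"]
replicate = bootstrap_alignment(alignment, random.Random(1))
```

Repeating this many times and rebuilding a tree from each replicate gives the frequency with which each branch of the original tree is recovered.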

5. In the bootstrap method, for branches in the predicted tree topology to be significant, the resampled data sets should frequently predict the same branches.

a) True

b) False View Answer

Answer: a

Explanation: Bootstrap analysis is supported by most of the commonly used phylogenetic inference software packages and is commonly used to test tree branch reliability. Another method of testing the reliability of one part of the tree is to collapse two branches into a common node (Maddison and Maddison 1992).

6. In the final steps of the bootstrap method, the ______ the decay value, the ______ significant the original branches.

a) greater, less

b) greater, more

c) lesser, more

d) more, less View Answer

Answer: b

Explanation: In the final steps, after two branches have been collapsed into a common node, the tree length is again evaluated and compared to the original length, and any increase is the decay value. The greater the decay value, the more significant the original branches. In addition to these methods, there are some additional recommendations that increase confidence in a phylogenetic prediction.

7. A common recommendation is to use at least two of the methods—maximum parsimony, distance, or maximum likelihood, for the analysis.

a) True

b) False View Answer

Answer: a

Explanation: If two of these methods provide the same prediction, confidence in the prediction is much higher. Another recommendation is to pay careful attention to the evolutionary assumptions and models that are used for both sequence alignment and tree construction.

8. The traditional use of phylogenetic analysis is to discover evolutionary relationships among species.

a) True

b) False View Answer

Answer: a

Explanation: In such cases, a suitable gene or DNA sequence that shows just enough, but not too much, variation among a group of organisms is selected for phylogenetic analysis. For example, analysis of mitochondrial sequences is used to discover evolutionary relationships among mammals.

9. Two more recent uses of phylogenetic analysis are to analyze ______ and to trace the evolutionary history of specific genes. Which of the following could not be the correct blank?

a) gene families

b) genomes

c) proteomes

d) physical separation methods View Answer

Answer: d

Explanation: Option d indicates laboratory operations, unlike the computational data mentioned in the other options. For example, database similarity searches may identify several proteins in a plant genome that are similar to a yeast query protein.

10. Tracking the evolutionary history of individual genes in a group of species can reveal which genes have remained in a genome for a long time and which genes have been horizontally transferred between species.

a) True

b) False View Answer

Answer: a

Explanation: Thus, phylogenetic analysis can also contribute to an understanding of genome evolution. For example, from a phylogenetic analysis of the protein family, the plant gene most closely related to the yeast gene, and therefore most likely to have the same function, can be determined.

13. Questions & Answers on Gene and Promoter Prediction

Categories of Gene Prediction Programs

1. Which of the following is true regarding the methods of gene prediction?

a) They solely consist of a type called ab initio–based methods

b) The ab initio–based approach predicts genes based on the given sequence alone

c) The ab initio–based approach predicts genes based on the given sequence and relative homology data

d) They solely consist of a type called homology-based approaches View Answer

Answer: b

Explanation: The current gene prediction methods can be classified into two major categories, ab initio–based and homology-based approaches. The ab initio–based approach predicts genes based on the given sequence alone.

2. In the ab initio–based approaches—they rely on two major features associated with genes: one of them being the existence of gene signals, which include start and stop codons, intron splice signals, transcription factor binding sites etc

a) True

b) False View Answer

Answer: a

Explanation: They also include ribosomal binding sites and polyadenylation (poly-A) sites. In addition, the triplet codon structure limits the coding frame length to multiples of three, which can be used as a condition for gene prediction.

3. In the ab initio–based approaches—they rely on two major features associated with genes: one of them being gene content, which is statistical description of coding regions.

a) True

b) False View Answer

Answer: a

Explanation: It has been observed that nucleotide composition and statistical patterns of the coding regions tend to vary significantly from those of the non-coding regions. The unique features can be detected by employing probabilistic models such as Markov models or hidden Markov models to help distinguish coding from non-coding regions.

4. The homology-based method makes predictions based on significant matches of the query sequence with sequences of known genes.

a) True

b) False View Answer

Answer: a

Explanation: For instance, if a translated DNA sequence is found to be similar to a known protein or protein family from a database search, this can be strong evidence that the region codes for a protein. Alternatively, when possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region.

5. FGENESB is a web-based program that is also based on fifth-order HMMs for detecting coding regions.

a) True

b) False View Answer

Answer: a

Explanation: The program is specifically trained for bacterial sequences. It uses the Viterbi algorithm to find an optimal match for the query sequence with the intrinsic model. A linear discriminant analysis (LDA) is used to further distinguish coding signals from non-coding signals.

6. Which of the following is untrue about GeneMark?

a) It is a suite of gene prediction programs based on the fifth-order HMMs

b) The main program is trained on a number of complete microbial genomes

c) A GeneMark heuristic program can be used to improve accuracy

d) If the sequence to be predicted is from a non-listed organism, the most closely related organism can be chosen as the basis for computation

View Answer

Answer: c

Explanation: Another option for predicting genes from a new organism is to use a self-trained program GeneMarkS as long as the user can provide at least 100 kbp of sequence on which to train the model. If the query sequence is shorter than 100 kbp, a GeneMark heuristic program can be used with some loss of accuracy. In addition to predicting prokaryotic genes, GeneMark also has a variant for eukaryotic gene prediction using HMM.

7. Which of the following is untrue about Glimmer?

a) It stands for Gene Locator and Interpolated Markov Modeler

b) It is a UNIX program from TIGR

c) It does not necessarily use the IMM algorithm

d) It is used to predict potential coding regions View Answer

Answer: c

Explanation: The computation consists of two steps, namely model building and gene prediction. The model building involves training by the input sequence, which optimizes the parameters of the model. In an actual gene prediction, the overlapping frames are “flagged” to alert the user for further inspection. Glimmer also has a variant, GlimmerM, for eukaryotic gene prediction.

8. RBS finder is a UNIX program that uses the prediction output from Glimmer and searches for the Shine–Dalgarno sequences in the vicinity of predicted start sites.

a) True

b) False View Answer

Answer: a

Explanation: If a high-scoring site is found by the intrinsic probabilistic model, a start codon is confirmed. Otherwise the program moves to other putative translation start sites and repeats the process.

Gene Prediction in Prokaryotes

1. Which of the following is a wrong statement?

a) Prokaryotes include bacteria and Archaea

b) Prokaryotes have relatively large genomes

c) Prokaryotes have relatively small genomes

d) In Prokaryotes, The gene density in the genomes is high, with more than 90% of a genome sequence containing coding sequence

View Answer

Answer: b

Explanation: Prokaryotes have relatively small genomes with sizes ranging from 0.5 to 10 Mbp (1 Mbp = 10^6 bp). Each prokaryotic gene is composed of a single contiguous stretch of ORF coding for a single protein or RNA with no interruptions within a gene.

2. In bacteria, the majority of genes have a start codon ATG (or AUG in mRNA; because prediction is done at the DNA level, T is used in place of U), which codes for methionine.

a) True

b) False View Answer

Answer: a

Explanation: Occasionally, GTG and TTG are used as alternative start codons. But methionine is still the actual amino acid inserted at the first position.

3. The presence of these codons at the beginning of the frame ______ give a clear indication of the translation initiation site.

a) always

b) does not necessarily

c) does not

d) never View Answer

Answer: b

Explanation: Because there may be multiple ATG, GTG, or TTG codons in a frame, the presence of these codons at the beginning of the frame does not necessarily give a clear indication of the translation initiation site. Instead, to help identify this initiation codon, other features associated with translation are used.

4. The Shine–Dalgarno sequence is a stretch of purine-rich sequence complementary to 16S rRNA in the ribosome.

a) True

b) False

Answer: a

Explanation: It is located immediately downstream of the transcription initiation site and slightly upstream of the translation start codon. In many bacteria, it has a consensus motif of AGGAGGT. Identification of the ribosome binding site can help locate the start codon.
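As a rough illustration of how a ribosome binding site can help locate the start codon, the Python sketch below looks for a Shine–Dalgarno-like hexamer in a short window upstream of a candidate start. The function name, the 20-base window, and the one-mismatch tolerance are assumptions made for this example; they are not taken from RBS finder or any other program.

```python
# Sketch only: window size and mismatch limit are illustrative assumptions.
SD_CORE = "AGGAGG"  # core of the AGGAGGT consensus quoted above


def find_sd(seq, start_pos, window=20, max_mismatch=1):
    """Return (position, mismatches) of the best SD-like hexamer upstream
    of start_pos, or None if no hexamer is within max_mismatch."""
    seq = seq.upper()
    lo = max(0, start_pos - window)
    best = None
    # Only hexamers that end at or before the candidate start are considered.
    for i in range(lo, start_pos - len(SD_CORE) + 1):
        hexamer = seq[i:i + len(SD_CORE)]
        mismatches = sum(a != b for a, b in zip(hexamer, SD_CORE))
        if mismatches <= max_mismatch and (best is None or mismatches < best[1]):
            best = (i, mismatches)
    return best
```

A candidate ATG with an SD-like hit a few bases upstream would be ranked above one without, which is the kind of supporting evidence the explanation describes.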

5. There are _____ possible stop codons, identification of which is straightforward.

a) five

b) two

c) ten

d) three

Answer: d

Explanation: At the end of the protein coding region is a stop codon that causes translation to stop. There are three possible stop codons, identification of which is straightforward. Many prokaryotic genes are transcribed together as one operon.

6. Which of the following is a wrong statement regarding the conventional determination of open reading frames?

a) Without the use of specialized programs, prokaryotic gene identification can rely on manual determination of ORFs and major signals related to prokaryotic genes

b) Prokaryotic DNA is first subject to conceptual translation in all six possible frames, two frames forward and four frames reverse

c) A stop codon occurs in about every twenty codons by chance in a noncoding region

d) Prokaryotic DNA is first subject to conceptual translation in all six possible frames, three frames forward and three frames reverse

Answer: b

Explanation: Prokaryotic DNA is first subject to conceptual translation in all six possible frames, three frames forward and three frames reverse. Because a stop codon occurs in about every twenty codons by chance in a noncoding region, a frame longer than thirty codons without interruption by stop codons is suggestive of a gene coding region, although the threshold for an ORF is normally set even higher at fifty or sixty codons.
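The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular program: the function names are hypothetical, the 50-codon default simply mirrors the threshold mentioned in the explanation, and the standard stop codons TAA, TAG, and TGA are assumed.

```python
# Sketch only: names and defaults are illustrative.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_codons=50):
    """Yield (strand, frame, start, end, n_codons) for every stop-free run
    of at least min_codons codons in the six reading frames.

    Coordinates are 0-based offsets on the scanned strand; hits on the
    reverse strand are reported on the reverse complement, not mapped back.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    n = (i - run_start) // 3
                    if n >= min_codons:
                        yield strand, frame, run_start, i, n
                    run_start = i + 3
            # Trailing run that reaches the end of the sequence.
            last = frame + ((len(s) - frame) // 3) * 3
            n = (last - run_start) // 3
            if n >= min_codons:
                yield strand, frame, run_start, last, n
```

Each reported run is only a candidate; as the next question notes, it would still need confirmation by a start codon, a ribosome binding site, or a database match.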

7. The putative ORF can be translated into a protein sequence, which is then used to search against a protein database.

a) True

b) False

Answer: a

Explanation: The putative frame is further manually confirmed by the presence of other signals such as a start codon and Shine–Dalgarno sequence. Detection of homologs from this search is probably the strongest indicator of a protein-coding frame.

8. Which of the following is a wrong statement regarding the TESTCODE method?

a) This is based on the nucleotide composition of the third position of a codon

b) In practice, because genes can be in any of the six frames, the statistical patterns are computed for all possible frames

c) It is implemented in the commercial GCG package

d) It exploits the fact that the third codon nucleotides in a coding region fail to repeat themselves

Answer: d

Explanation: In a coding sequence, it has been observed that the third codon position has a preference to use G or C over A or T. By plotting the GC composition at this position, regions with values significantly above the random level can be identified, which are indicative of the presence of ORFs. This method exploits the fact that the third codon nucleotides in a coding region tend to repeat themselves.
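The idea of profiling GC composition at the third codon position can be illustrated with the following Python sketch. It is a simplified stand-in for the statistic described above, not the TESTCODE implementation in GCG; the function names, window size, and step are assumptions chosen for the example.

```python
def gc3(seq, frame=0):
    """Fraction of G or C at the third codon position of one reading frame."""
    thirds = seq.upper()[frame + 2::3]
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc3_profile(seq, window=120, step=30):
    """GC3 in sliding windows for the three forward frames.

    Returns {frame: [(window_start, gc3_value), ...]}. The step should be a
    multiple of 3 so that frame numbering stays consistent between windows.
    """
    profile = {f: [] for f in range(3)}
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        for f in range(3):
            profile[f].append((start, gc3(chunk, f)))
    return profile
```

Plotting the three profiles side by side, and the same for the reverse complement, would show which frame, if any, rises above the background level.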

9. The conventional methods for determining open reading frames identify only typical genes and tend to miss atypical genes in which the rule of codon bias is not strictly followed.

a) True

b) False

Answer: a

Explanation: These statistical methods, which are based on empirical rules, examine the statistics of a single nucleotide (either G or C). To improve the prediction accuracies, the new generation of prediction algorithms uses more sophisticated statistical models.

10. Which of the following is a wrong statement regarding Gene Prediction Using Markov Models and Hidden Markov Models?

a) Markov models and HMMs can be very helpful in providing finer statistical description of a gene

b) A Markov model describes the probability of the distribution of nucleotides in a DNA sequence

c) In a Markov model the conditional probability of a particular sequence position depends on k alternate positions

d) A zero-order Markov model assumes each base occurs independently with a given probability

Answer: c

Explanation: In a Markov model the conditional probability of a particular sequence position depends on k previous positions. In this case, k is the order of a Markov model. A zero-order Markov model assumes each base occurs independently with a given probability, which is often the case for noncoding sequences. A first-order Markov model assumes that the occurrence of a base depends on the base preceding it. A second-order model looks at the preceding two bases to determine which base follows, which is more characteristic of codons in a coding sequence.
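To make the notion of model order concrete, the Python sketch below estimates a k-th order Markov model from training sequences and scores a query by log-likelihood. The function names and the add-one pseudocount are assumptions for illustration; real gene finders use more elaborate variants such as interpolated models.

```python
import math
from collections import defaultdict


def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(base | previous k bases) from training sequences.

    Returns a function prob(context, base). The pseudocount keeps unseen
    transitions from getting zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Sum of log conditional probabilities over positions k..end."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))
```

Training one model on known coding sequence and another on noncoding sequence, then comparing the two log-likelihoods for a query window, is one simple way such models could be used to discriminate coding from noncoding regions. With k = 0 the context is empty and the model reduces to independent base frequencies.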

Gene Prediction in Eukaryotes – 1

1. Which of the following is untrue?

a) Eukaryotic nuclear genomes are much larger than prokaryotic ones

b) They tend to have a very high gene density

c) Eukaryotic nuclear genomes’ sizes range from 10 Mbp to 670 Gbp (1 Gbp = 10^9 bp)

d) They tend to have a very low gene density

Answer: b

Explanation: In humans, for instance, only 3% of the genome codes for genes, with about 1 gene per 100 kbp on average. The space between genes is often very large and rich in repetitive sequences and transposable elements.

2. Which of the following is untrue about translation and transcription?

a) The first event is capping at the 5’ end of the transcript, which involves methylation at the initial residue of the RNA

b) The splicing process involves a large RNA-protein complex called spliceosome

c) The second event is splicing, which is the process of removing exons and joining introns

d) The second event is splicing, which is the process of removing introns and joining exons

Answer: c

Explanation: The reaction requires intermolecular interactions between a pair of nucleotides at each end of an intron and the RNA component of the spliceosome. To make the matter even more complex, some eukaryotic genes can have their transcripts spliced and joined in different ways to generate more than one transcript per gene. This is the phenomenon of alternative splicing.

3. The main issue in prediction of eukaryotic genes is the identification of exons, introns, and splicing sites.

a) True

b) False

Answer: a

Explanation: From a computational point of view, it is a very complex and challenging problem. Because of the presence of split gene structures, alternative splicing, and very low gene densities, the difficulty of finding genes in such an environment is likened to finding a needle in a haystack.

4. Most vertebrate genes use _____ as the translation start codon and have a uniquely conserved flanking sequence called a Kozak sequence (CCGCCATGG).

a) AAG

b) ATG

c) AUG

d) AGG

Answer: b

Explanation: In addition, most of these genes have a high density of CG dinucleotides near the transcription start site. This region is referred to as a CpG island (p refers to the phosphodiester bond connecting the two nucleotides), which helps to identify the transcription initiation site of a eukaryotic gene. The poly-A signal can also help locate the final coding sequence.

5. Which of the following is untrue about Ab Initio–Based Programs for Gene Prediction?

a) The goal of the ab initio gene prediction programs is to discriminate exons from noncoding sequences

b) The goal is joining exons together in the correct order

c) The main difficulty is correct identification of exons

d) To predict exons, the algorithms rely solely on gene signals

Answer: d

Explanation: To predict exons, the algorithms rely on two features, gene signals and gene content. Signals include gene start and stop sites and putative splice sites, recognizable consensus sequences such as poly-A sites.

6. In ab initio–based programs for gene prediction, gene content refers to coding statistics, which include nonrandom nucleotide distribution, amino acid distribution, synonymous codon usage, and hexamer frequencies.

a) True

b) False

Answer: a

Explanation: Among these features, the hexamer frequencies appear to be most discriminative for coding potentials. To derive an assessment for this feature, HMMs can be used, which require proper training. In addition to HMMs, neural network-based algorithms are also common in the gene prediction field.

7. Which of the following is untrue about Prediction Using Neural Networks for Gene Prediction?

a) A neural network is a statistical model with a special architecture for pattern recognition and classification

b) It is composed of a network of mathematical variables

c) They resemble ab initio approaches

d) The variables in neural networks resemble the biological nervous system, with variables or nodes connected by weighted functions that are analogous to synapses

Answer: c

Explanation: The aspect of the model that makes it look like a biological neural network is its ability to “learn” and then make predictions after being trained. The network is able to process information and modify parameters of the weight functions between variables during the training stage. Once it is trained, it is able to make automatic predictions about the unknown. This is quite different from the ab initio methods.

8. Which of the following is untrue about Prediction Using Neural Networks for Gene Prediction?

a) A neural network is constructed with multiple layers; the input, output, and hidden layers

b) The input is the gene sequence with intron and exon signals

c) The model is not fed with a sequence of known gene structure

d) The output is the probability of an exon structure

Answer: c

Explanation: Between input and output, there may be one or several hidden layers where the machine learning takes place. The machine learning process starts by feeding the model with a sequence of known gene structure. The gene structure information is separated into several classes of features such as hexamer frequencies, splice sites, and GC composition during training. The weight functions in the hidden layers are adjusted during this process to recognize the nucleotide patterns and their relationship with known structures.

9. GRAIL is a web-based program that is based on a neural network algorithm, which is trained on several statistical features such as splice junctions, start and stop codons, poly-A sites, promoters, and CpG islands.

a) True

b) False

Answer: a

Explanation: The program scans the query sequence with windows of variable lengths and scores for coding potentials and finally produces an output that is the result of exon candidates. The program is currently trained for human, mouse, Arabidopsis, Drosophila, and Escherichia coli sequences.

10. Which of the following is untrue about Prediction Using Discriminant Analysis for Gene Prediction?

a) QDA draws a curved line based on a quadratic function

b) LDA works by drawing a diagonal line that best separates coding signals from noncoding signals based on knowledge learned from training data sets of known gene structures

c) Some gene prediction algorithms rely on discriminant analysis, either LDA or quadratic discriminant analysis (QDA), to improve accuracy

d) LDA works by plotting a three-dimensional graph of coding signals versus all potential 3’ splice site positions

Answer: d

Explanation: QDA draws a curved line based on a quadratic function instead of drawing a straight line to separate coding and noncoding features. This strategy is designed to be more flexible and provide a more optimal separation between the data points.
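For readers who want to see what "drawing a line" means in practice, the following Python sketch fits a two-class linear discriminant on two-dimensional feature points. It is a generic Fisher-style LDA written for illustration; the feature axes, the function names, and the midpoint threshold are assumptions, and no gene prediction program is claimed to be implemented this way.

```python
def fit_lda(class0, class1):
    """Fit a two-class linear discriminant on 2-D points.

    Returns (w, threshold): a point p is assigned to class 1 when
    w[0]*p[0] + w[1]*p[1] exceeds the threshold.
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    m0, m1 = mean(class0), mean(class1)
    # Pooled within-class scatter matrix [[a, b], [b, d]].
    a = b = d = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            a += dx * dx
            b += dx * dy
            d += dy * dy
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular scatter matrix")
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((d * diff[0] - b * diff[1]) / det, (-b * diff[0] + a * diff[1]) / det)
    mid = ((m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def classify(point, w, threshold):
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0
```

Here class 0 and class 1 could stand for noncoding and coding training examples described by any two numeric features. QDA would replace the straight decision boundary with a quadratic one by giving each class its own covariance.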

Gene Prediction in Eukaryotes – 2

1. Which of the following is untrue about FGENES?

a) It stands for FindGenes

b) It is a web-based program that uses LDA

c) It is used to determine whether a signal is an exon

d) It does not make use of HMMs

Answer: d

Explanation: In addition to FGENES, there are many variants of the program. Some programs, such as FGENESH, make use of HMMs. There are others, such as FGENESH_C, that are similarity based. Some programs, such as FGENESH+, combine both ab initio and similarity-based approaches.

2. GENSCAN is a web-based program that makes predictions based on fifth-order HMMs.

a) True

b) False

Answer: a

Explanation: It combines hexamer frequencies with coding signals (initiation codons, TATA box, cap site, poly-A, etc.) in prediction. Putative exons are assigned a probability score (P) of being a true exon. Only predictions with P > 0.5 are deemed reliable. This program is trained for sequences from vertebrates, Arabidopsis, and maize. It has been used extensively in annotating the human genome.

3. Which of the following is wrong about HMMgene?

a) It is also an HMM-based web program

b) It uses a criterion called the conditional maximum likelihood to discriminate coding from non-coding features

c) HMM prediction is unbiased toward the locked region

d) If a sequence already has a sub-region identified as coding region, which may be based on similarity with cDNAs or proteins in a database, these regions are locked as coding regions

Answer: c

Explanation: An HMM prediction is subsequently made with a bias toward the locked region and is extended from the locked region to predict the rest of the gene coding regions and even neighboring genes. The program is in a way a hybrid algorithm that uses both ab initio-based and homology-based criteria.

4. Which of the following is untrue about Homology-Based Programs?

a) They are based on the fact that exon structures and exon sequences of related species are less conserved

b) This approach assumes that the database sequences are correct

c) It is a reasonable assumption in light of the fact that many homologous sequences to be compared with are derived from cDNA or expressed sequence tags (ESTs) of the same species

d) Potential coding frames in a query sequence are translated and used to align with closest protein homologs found in databases

Answer: a

Explanation: Homology-based programs are based on the fact that exon structures and exon sequences of related species are highly conserved. When potential coding frames in a query sequence are translated and used to align with closest protein homologs found in databases, near perfectly matched regions can be used to reveal the exon boundaries in the query.

5. The drawback of Homology-based approach is its reliance on the presence of homologs in databases.

a) True

b) False

Answer: a

Explanation: If the homologs are not available in the database, the method cannot be used. Novel genes in a new species cannot be discovered without matches in the database. A number of publicly available programs use this approach.

6. GenomeScan is a web-based server that combines GENSCAN prediction results with BLASTX similarity searches.

a) True

b) False

Answer: a

Explanation: The user provides genomic DNA and protein sequences from related species. The genomic DNA is translated in all six frames to cover all possible exons. The translated exons are then used to compare with the user-supplied protein sequences.

7. Which of the following is untrue about EST2Genome?

a) It is a web-based program purely based on the sequence alignment approach to define intron–exon boundaries

b) It compares an EST (or cDNA) sequence with a genomic DNA sequence containing the corresponding gene

c) The alignment is rarely done using a dynamic programming–based algorithm

d) Advantage of the approach is the ability to find very small exons and alternatively spliced exons that are very difficult to predict by any ab initio–type algorithms

Answer: c

Explanation: The alignment is done using a dynamic programming–based algorithm. Another advantage is that there is no need for model training, which provides much more flexibility for gene prediction. The limitation is that EST or cDNA sequences often contain errors or even introns if the transcripts are not completely spliced before reverse transcription.

8. Which of the following is untrue about SGP-1?

a) The program translates all potential exons in each sequence and does pair wise alignment for the translated protein sequences using a dynamic programming approach

b) The near-perfect matches at the protein level define coding regions

c) It is a similarity-based web program that aligns two genomic DNA sequences from distantly related organisms

d) It stands for Syntenic Gene Prediction

Answer: c

Explanation: It aligns two genomic DNA sequences from closely related organisms. Similar to EST2Genome, there is no training needed. The limitation is the need for two homologous sequences having similar genes with similar exon structures; if this condition is not met, a gene escapes detection from one sequence when there is no counterpart in another sequence.

9. TwinScan is also a similarity-based gene-finding server, and it is similar to GenomeScan in that it uses GENSCAN to predict all possible exons from the genomic sequence.

a) True

b) False

Answer: a

Explanation: The putative exons are used for BLAST searching to find closest homologs. The putative exons and homologs from BLAST searching are aligned to identify the best match. Only the closest match from a genome database is used as a template for refining the previous exon selection and exon boundaries.

10. Because different prediction programs have different levels of sensitivity and specificity, it makes sense to combine results of multiple programs based on consensus. This idea has prompted development of consensus-based algorithms.

a) True

b) False

Answer: a

Explanation: These programs work by retaining common predictions agreed on by most programs and removing inconsistent predictions. Such an integrated approach may improve the specificity by correcting the false positives and the problem of overprediction. However, since this procedure punishes novel predictions, it may lead to lowered sensitivity and missed predictions.
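The consensus idea can be sketched as a simple vote over exon predictions. In the Python example below, each program's output is assumed to be a list of (start, end) exon coordinates, and two predictions count as the same exon only if both boundaries match exactly; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=None):
    """Keep exons (start, end) reported by at least min_votes programs.

    predictions is a list with one list of exons per program. By default a
    strict majority of programs must agree.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    # set() ensures a program cannot vote twice for the same exon.
    votes = Counter(exon for program in predictions for exon in set(program))
    return sorted(exon for exon, n in votes.items() if n >= min_votes)
```

In this toy setting an exon found by only one program is discarded, which shows directly why a consensus approach can raise specificity while lowering sensitivity.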

14. Questions on Promoter and Regulatory Element Prediction

Promoter and Regulatory Elements in Prokaryotes & Eukaryotes

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Promoter and Regulatory Elements in Prokaryotes & Eukaryotes”.

1. In bacteria, transcription is initiated by DNA polymerase.

a) True

b) False

Answer: b

Explanation: In bacteria, transcription is initiated by RNA polymerase, which is a multi-subunit enzyme. The σ subunit (e.g., σ70) of the RNA polymerase is the protein that recognizes specific sequences upstream of a gene and allows the rest of the enzyme complex to bind.

2. The upstream sequence where the σ protein binds constitutes the promoter sequence.

a) True

b) False

Answer: a

Explanation: This includes the sequence segments located 35 and 10 base pairs (bp) upstream from the transcription start site. They are also referred to as the −35 and −10 boxes. For the σ70 subunit in Escherichia coli, for example, the −35 box has a consensus sequence of TTGACA. The –10 box has a consensus of TATAAT.
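A naive way to look for the two boxes is a pattern match on the consensus sequences. The Python sketch below does exactly that and nothing more. The 15 to 19 bp spacer between the boxes is an assumption added for the example (the text above gives only the box positions), and because real promoters rarely match both consensus sequences exactly, this should be read as an illustration of the signal rather than a usable predictor.

```python
import re

# Sketch only: exact consensus matches with an assumed 15-19 bp spacer.
# The lookahead lets overlapping candidates be reported.
PROMOTER = re.compile(r"(?=(TTGACA)([ACGT]{15,19})(TATAAT))")


def find_sigma70_like(seq):
    """Return (pos_35, spacer_length, pos_10) for each exact -35/-10 pair."""
    hits = []
    for m in PROMOTER.finditer(seq.upper()):
        hits.append((m.start(1), len(m.group(2)), m.start(3)))
    return hits
```

Position-specific scoring matrices, discussed later in this set, are the usual way to tolerate the mismatches that this exact-match version cannot.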

3. The promoter sequence may determine the expression of one gene or a number of linked genes downstream.

a) True

b) False

Answer: a

Explanation: In the latter case, the linked genes form an operon. It is controlled by the promoter.

4. In addition to the RNA polymerase, there are also a number of DNA-binding proteins that facilitate the process of transcription.

a) True

b) False

Answer: a

Explanation: These proteins are called transcription factors. They bind to specific DNA sequences to either enhance or inhibit the function of the RNA polymerase.

5. The specific DNA sequences to which the transcription factors bind are referred to as

a) replication elements

b) blocking factors

c) transcription factors

d) regulatory elements

Answer: d

Explanation: The regulatory elements may lie in the vicinity of the promoter or at a site several hundred bases away from the promoter. Regulatory proteins bound at a long distance can still exert their effect because of the flexible structure of DNA, which is able to bend and bring the transcription factors into close contact with the RNA polymerase complex.

6. In eukaryotes, gene expression is not regulated by a protein complex formed between transcription factors and RNA polymerase.

a) True

b) False

Answer: b

Explanation: Here, the gene expression is also regulated by a protein complex formed between transcription factors and RNA polymerase. However, eukaryotic transcription has an added layer of complexity in that there are three different types of RNA polymerase complexes, namely RNA polymerases I, II, and III.

7. Which of the following is untrue?

a) RNA polymerase I is responsible for the transcription of ribosomal RNA

b) RNA polymerase III is responsible for the transcription of tRNA

c) RNA polymerase II is exclusively responsible for transcribing protein-encoding genes

d) Synthesis of mRNAs is carried out by RNA polymerase I

Answer: d

Explanation: Each polymerase transcribes different sets of genes. RNA polymerase II is exclusively responsible for transcribing protein-encoding genes or synthesis of mRNAs.

8. In eukaryotes, genes often form an operon with a shared promoter.

a) True

b) False

Answer: b

Explanation: Unlike in prokaryotes, where genes often form an operon with a shared promoter, each eukaryotic gene has its own promoter. The eukaryotic transcription machinery also requires many more transcription factors than its prokaryotic counterpart to help initiate transcription.

9. Eukaryotic RNA polymerase II does not directly bind to the promoter, but relies on a dozen or more transcription factors to recognize and bind to the promoter in a specific order before its own binding around the promoter.

a) True

b) False

Answer: a

Explanation: The core of many eukaryotic promoters is a so-called TATA box, located about 30 bp upstream from the transcription start site, having a consensus motif TATA(A/T)A(A/T). However, not all eukaryotic promoters contain the TATA box. Many genes such as housekeeping genes do not have the TATA box in their promoters.

10. The TATA box is often used as an indicator of the presence of a promoter.

a) True

b) False

Answer: a

Explanation: In addition, many genes have a unique initiator sequence (Inr), which is a pyrimidine rich sequence with a consensus (C/T)(C/T)CA(C/T)(C/T). This site coincides with the transcription start site. Most of the transcription factor binding sites are located within 500 bp upstream of the transcription start site.
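The two eukaryotic consensus motifs quoted in this set, TATA(A/T)A(A/T) for the TATA box and (C/T)(C/T)CA(C/T)(C/T) for the initiator, translate directly into regular expressions. The Python sketch below shows one way to write and scan them; the variable and function names are illustrative only.

```python
import re

# Consensus motifs written as regular expressions; the lookahead form
# lets overlapping matches be reported.
TATA_BOX = re.compile(r"(?=(TATA[AT]A[AT]))")
INR = re.compile(r"(?=([CT][CT]CA[CT][CT]))")


def scan(pattern, seq):
    """Return (position, matched_text) for every match, including overlaps."""
    return [(m.start(1), m.group(1)) for m in pattern.finditer(seq.upper())]
```

Short degenerate patterns like these match frequently by chance, so a hit is only weak evidence of a promoter unless it is supported by spacing and other signals.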

Prediction Algorithms – 1

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Prediction Algorithms – 1”.

1. The ab initio type of algorithm predicts prokaryotic and eukaryotic promoters and regulatory elements based on characteristic sequence patterns for promoters and regulatory elements.

a) True

b) False

Answer: a

Explanation: Some ab initio programs are signal based, relying on characteristic promoter sequences such as the TATA box. Other programs rely on content information such as hexamer frequencies.

2. The advantage of the ab initio method is that the sequence can be applied as such without having to obtain experimental information.

a) True

b) False

Answer: a

Explanation: The limitation is the need for training, which makes the prediction programs species specific. In addition, this type of method has a difficulty in discovering new, unknown motifs.

3. Which of the following is incorrect regarding the ab initio approaches?

a) The conventional approach to detecting a promoter or regulatory site is through matching a consensus sequence pattern represented by regular expressions

b) The conventional approach to detecting a promoter or regulatory site is through matching a position-specific scoring matrix constructed from well-characterized binding sites

c) The consensus sequences or the matrices are relatively short, covering 6 to 10 bases

d) The consensus sequences or the matrices are relatively large, covering 700 to 1000 bases

Answer: d

Explanation: To determine whether a query sequence matches a weight matrix, the sequence is scanned through the matrix. Scores of matches and mismatches at all matrix positions are summed up to give a log odds score, which is then evaluated for statistical significance. This simple approach, however, often has difficulty differentiating true promoters from random sequence matches and generates high rates of false positives as a result.
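The scanning step described above can be sketched as follows in Python. The example builds a log-odds matrix from a handful of aligned sites and slides it along a query. The uniform background of 0.25 per base, the add-one pseudocount, and the function names are assumptions for illustration, and the query is assumed to contain only A, C, G, and T. Assessing statistical significance of the scores is left out.

```python
import math


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds matrix (one dict per column) from aligned sites of equal length."""
    n = len(sites)
    pssm = []
    for col in zip(*sites):
        column = {}
        for base in "ACGT":
            freq = (col.count(base) + pseudocount) / (n + 4 * pseudocount)
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def scan_pssm(seq, pssm):
    """Slide the matrix along seq; return [(position, log_odds_score), ...]."""
    width = len(pssm)
    seq = seq.upper()
    return [(i, sum(pssm[j][seq[i + j]] for j in range(width)))
            for i in range(len(seq) - width + 1)]
```

Because the matrix is only a few bases wide, many positions in a long genome will score well by chance, which is the false-positive problem the explanation points out.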

4. To improve the specificity of prediction, some algorithms selectively _____ coding regions and focus on the upstream regions _____, which are most likely to contain promoters. In that sense, promoter prediction and gene prediction are coupled.

a) include, (0.5 to 2.0 kb) only

b) include, (0.5 to 2.0 Mb) only

c) exclude, (0.5 to 2.0 Mb) only

d) exclude, (0.5 to 2.0 kb) only

Answer: d

Explanation: To better discriminate true motifs from background noise, a new generation of algorithms has been developed that take into account the higher order correlation of multiple subtle features by using discriminant functions, neural networks, or hidden Markov models (HMMs) that are capable of incorporating more neighboring sequence information.

5. Operon prediction is less important in prokaryotic promoter prediction.

a) True

b) False

Answer: b

Explanation: One of the unique aspects in prokaryotic promoter prediction is the determination of operon structures, because genes within an operon share a common promoter located upstream of the first gene of the operon. Hence, operon prediction is the key in prokaryotic promoter prediction.

6. Once an operon structure is known, _____ for the presence of a promoter and regulatory elements, _____ in the operon do not possess such DNA elements.

a) only the first gene is predicted, whereas other genes

b) only the first hundred genes are predicted, whereas next few genes

c) only first two genes are predicted, whereas next few genes

d) only first ten genes are predicted, whereas next few genes

Answer: a

Explanation: Only the first gene is predicted for the presence of a promoter and regulatory elements, whereas other genes in the operon do not possess such DNA elements. There are a number of methods available for prokaryotic operon prediction. The most accurate is a set of simple rules developed.

7. Which of the following is correct regarding the method for prokaryotic operon prediction?

a) It relies on two kinds of information: gene orientation and intergenic distances of a pair of genes of interest and conserved linkage of the genes based on comparative genomic analysis

b) It relies only on the gene orientation and intergenic distances of a pair of genes of interest

c) It relies only on the conserved linkage of the genes based on comparative genomic analysis

d) The prediction cannot be done manually using the rules

Answer: a

Explanation: A scoring scheme is developed to assign operons with different levels of confidence. This method is claimed to produce accurate identification of an operon structure, which in turn facilitates the promoter prediction. The prediction can be done manually using the rules. The few dedicated programs for prokaryotic promoter prediction do not apply the rule for historical reasons. The most frequently used program is BPROM.

8. Which of the following is incorrect regarding BPROM?

a) It is a web-based program for prediction of bacterial promoters

b) It is a web-based program only for prediction of eukaryotic promoters

c) It uses a linear discriminant function

d) The linear discriminant function is combined with signal and content information

Answer: b

Explanation: The linear discriminant function is combined with signal and content information such as consensus promoter sequence and oligonucleotide composition of the promoter sites. The program first predicts bacterial operon structures in a given sequence, using an intergenic distance of 100 bp as the basis for assigning genes to an operon.
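The distance rule mentioned above can be illustrated with a short Python sketch that groups adjacent genes into putative operons when they lie on the same strand and are separated by no more than 100 bp. The gene tuple layout and the function name are assumptions for this example, and the sketch ignores the comparative-genomics evidence discussed in the previous question; it is not BPROM's actual procedure.

```python
def group_operons(genes, max_gap=100):
    """Group genes into putative operons by strand and intergenic distance.

    genes is a list of (name, start, end, strand) tuples. Adjacent genes on
    the same strand whose gap is at most max_gap bp share an operon.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if operons:
            prev = operons[-1][-1]
            same_strand = prev[3] == gene[3]
            gap = gene[1] - prev[2]
            if same_strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons
```

With the operons assigned, only the region upstream of the first gene in each group would then be searched for a promoter.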

9. In BPROM, once the operons are assigned, the program is able to predict putative promoter sequences.

a) True

b) False

Answer: a

Explanation: Most bacterial promoters are located within 200 bp of the protein coding region. Hence, the program is most effectively used when about 200 bp of upstream sequence of the first gene of an operon is supplied as input to increase specificity.

10. Which of the following is incorrect regarding FindTerm?

a) It is a program for searching bacterial ρ-independent termination signals located at the end of operons

b) It is a program for searching bacterial ρ-dependent termination signals located within the operons

c) The predictions are made based on matching of known profiles of the termination signals combined with energy calculations

d) It is available from the same site as FGENES and BPROM.

Answer: b

Explanation: The predictions are made based on matching of known profiles of the termination signals combined with energy calculations for the derived RNA secondary structures for the putative hairpin-loop structure. The sequence region that scores best in features and energy terms is chosen as the prediction. The information can sometimes be useful in defining an operon.

Prediction Algorithms – 2

This set of Bioinformatics Assessment Questions and Answers focuses on “Prediction Algorithms – 2”.

1. The eukaryotic transcription initiation is less dependent on transcription factors.

a) True

b) False

Answer: b

Explanation: The eukaryotic transcription initiation requires cooperation of a large number of transcription factors. Co-operativity means that the promoter regions tend to contain a high density of protein-binding sites. Thus, finding a cluster of transcription factor binding sites often enhances the probability of individual binding site prediction.

2. CpGProD is a web-based program that predicts promoters containing a high density of CpG islands

a) in archaeal genomic sequences

b) in mammalian genomic sequences

c) in eukaryotic and bacterial genomic sequences

d) only in bacterial genomic sequences

Answer: b

Explanation: It calculates moving averages of GC% and CpG ratios (observed/expected) over a window of a certain size (usually 200 bp). When the values are above a certain threshold, the region is identified as a CpG island.
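The window calculation described above can be sketched in Python as follows. The observed/expected ratio uses the common definition (number of CG dinucleotides divided by the product of C and G counts over the window length). The thresholds of 0.5 for GC fraction and 0.6 for the ratio are assumptions chosen for this example, since the text only says "a certain threshold", and the function names are hypothetical rather than CpGProD's own.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")
    gc = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc, ratio


def cpg_windows(seq, size=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Start positions of windows whose GC fraction and CpG o/e ratio
    both exceed the given thresholds."""
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            hits.append(start)
    return hits
```

Merging runs of adjacent qualifying windows would give candidate island boundaries.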

3. Which of the following is incorrect regarding Eponine?

a) It is a web-based program that predicts transcription start sites

b) It is a web-based program that particularly predicts transposons and retrotransposons

c) The regulatory sites include the TATA box, the CCAAT box, and CpG islands

d) It is based on a series of pre-constructed PSSMs of several regulatory sites

Answer: b

Explanation: The query sequence from a mammalian source is scanned through the PSSMs. The sequence stretches with high-score matching to all the PSSMs, as well as matching of the spacing between the elements, are declared transcription start sites. A Bayesian method is also used in decision making.

4. Which of the following is incorrect regarding Cluster-Buster?

a) It is an HMM-based web program

b) A query sequence is scanned with a window size of 1 kb for putative regulatory motifs using motif HMMs

c) It works by detecting a region of high concentration of unknown transcription factor binding sites and regulatory motifs at the initiation

d) It is designed to find clusters of regulatory binding sites

Answer: c

Explanation: It works by detecting a region of high concentration of known transcription factor binding sites and regulatory motifs. If multiple motifs are detected within a window, a positive score is assigned to each motif found. The total score of the window is the sum of each motif score subtracting a gap penalty, which is proportional to the distances between motifs. If the score of a certain region is above a certain threshold, it is predicted to contain a regulatory cluster.
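The window score described above (sum of motif scores minus a gap penalty proportional to the distances between motifs) can be sketched in a few lines. This simplified arithmetic is not Cluster-Buster's HMM; the function names and the penalty of 0.01 per base are assumptions made for illustration.

```python
def window_score(hits, gap_penalty=0.01):
    """hits: list of (position, motif_score) found inside one window."""
    if not hits:
        return 0.0
    hits = sorted(hits)
    total = sum(score for _, score in hits)
    # Distance between each pair of neighbouring motifs in the window.
    gaps = sum(b[0] - a[0] for a, b in zip(hits, hits[1:]))
    return total - gap_penalty * gaps


def has_cluster(hits, threshold, gap_penalty=0.01):
    """A window is called a regulatory cluster when its score exceeds the threshold."""
    return window_score(hits, gap_penalty) > threshold
```

For example, three motifs scoring 5, 4, and 3 at positions 100, 150, and 400 give 12 minus 0.01 × 300, that is, 9.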

5. Which of the following is incorrect regarding FirstEF?

a) It is a program that predicts promoters for bacterial DNA

b) It is a web-based program that predicts promoters for human DNA

c) It stands for First Exon Finder

d) It integrates gene prediction with promoter prediction

Answer: a

Explanation: It uses quadratic discriminant functions (see Chapter 8) to calculate the probabilities of the first exon of a gene and its boundary sites. A segment of DNA (15 kb) upstream of the first exon is subsequently extracted for promoter prediction on the basis of scores for CpG islands.

6. McPromoter, a web-based program, uses a neural network to make promoter predictions.

a) True

b) False

Answer: a

Explanation: It has a unique promoter model containing six scoring segments. The program scans a window of 300 bases for the likelihoods of being in each of the coding, noncoding, and promoter regions.

7. The input for the McPromoter neural network includes parameters for sequence physical properties and for sequence signals. Which of the following is not one of these inputs?

a) DNA bendability

b) Signals such as the TATA box

c) Signals such as initiator box

d) Signals such as CpAA islands

Answer: d

Explanation: Option d is the incorrect one: the signal used is CpG islands, not "CpAA" islands. The hidden layer combines all the features to derive an overall likelihood for a site being a promoter. Another unique feature is that McPromoter does not require that certain patterns be present; instead, the combination of all features is important. For instance, even if the TATA box score is very low, a promoter prediction can still be made if the other features score highly. The program is currently trained for Drosophila and human sequences.

8. TSSW is a web program that distinguishes promoter sequences from non-promoter sequences based on a combination of unique content information, such as hexamer/trimer frequencies, and signal information, such as the TATA box in the promoter region.

a) True

b) False

Answer: a

Explanation: TSSW uses unique content information, such as hexamer/trimer frequencies, and signal information, such as the TATA box in the promoter region. The values are fed to a linear discriminant function to separate true motifs from background noise.

9. Which of the following is incorrect regarding CONPRO?

a) It is a web-based program that uses a consensus method

b) It is used to identify promoter elements for human DNA

c) cDNA does not play a role in prediction

d) The program uses the information to search the human genome database for the position of the gene

Answer: c

Explanation: To use the program, a user supplies the transcript sequence of a gene (cDNA). It then uses the GENSCAN program to predict 5’ untranslated exons in the upstream region. Once the 5’-most exon is located, a further upstream region (1.5 kb) is used for promoter prediction, which relies on a combination of five promoter prediction programs, TSSG, TSSW, NNPP, PROSCAN, and PromFD.

10. In CONPRO, for each program, the highest score prediction is taken as the promoter in the region.

a) True

b) False

Answer: a

Explanation: If three predictions fall within a 100-bp region, this is considered a consensus prediction. If no three-way consensus is achieved, the TSSG and PromFD predictions are taken. Because no coding sequence is used in prediction, specificity is improved relative to each individual program.
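The consensus rule can be sketched as below. This is an illustration of the rule as stated, not CONPRO's code: the function name is hypothetical, the input is assumed to hold the single top-scoring position from each program, and "within a 100-bp region" is interpreted as a spread of at most 100 bp.

```python
def consensus_promoter(predictions, span=100, fallback=("TSSG", "PromFD")):
    """predictions: dict mapping program name to its top-scoring position (bp)."""
    items = sorted(predictions.items(), key=lambda kv: kv[1])
    # Look for any three neighbouring predictions that fit inside `span` bp.
    for i in range(len(items) - 2):
        first_pos = items[i][1]
        if items[i + 2][1] - first_pos <= span:
            members = {name: pos for name, pos in items
                       if 0 <= pos - first_pos <= span}
            return "consensus", members
    # No three-way consensus: fall back to the TSSG and PromFD predictions.
    return "fallback", {p: predictions[p] for p in fallback if p in predictions}
```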

Prediction Algorithms – 3

This set of Bioinformatics Problems focuses on “Prediction Algorithms – 3”.

1. Which of the following is incorrect regarding Phylogenetic Footprinting–Based Method?

a) It is possible to obtain promoter sequences for a particular gene through comparative analysis

b) The conservation from closely related organisms is both at the sequence level and at the level of organization of the elements

c) The conservation from closely related organisms is most at the sequence level

d) It has been observed that promoter and regulatory elements from closely related organisms such as human and mouse are highly conserved

Answer: c

Explanation: The identification of conserved noncoding DNA elements that serve crucial functional roles is referred to as phylogenetic footprinting; the elements are called phylogenetic footprints. This type of method can apply to both prokaryotic and eukaryotic sequences.

2. A caveat of phylogenetic footprinting is that noncoding sequences upstream of the corresponding genes must be extracted and the comparison restricted to this region only, which helps to prevent false positives.

a) True

b) False

Answer: a

Explanation: The predictive value of this method also depends on the quality of the subsequent sequence alignments. The advanced alignment programs can be used. Even more sophisticated expectation maximization (EM) and Gibbs sampling algorithms can be used in detecting weakly conserved motifs.

3. Which of the following is untrue about ConSite?

a) It is a web server that finds putative promoter elements

b) It includes comparing two orthologous sequences

c) The program does not accept pre-computed alignment

d) The program accepts pre-computed alignment

Answer: c

Explanation: The user provides two individual sequences which are aligned by ConSite using a global alignment algorithm. Conserved regions are identified by calculating identity scores, which are then used to compare against a motif database of regulatory sites (TRANSFAC). High-scoring sequence segments upstream of genes are returned as putative regulatory elements.

4. rVISTA uses two orthologous sequences as input and first identifies all putative regulatory motifs based on TRANSFAC matches.

a) True

b) False

Answer: a

Explanation: rVISTA is a cross-species comparison tool for promoter recognition. It aligns the two sequences using a local alignment strategy. The motifs that have the highest percent identity in the pairwise comparison are presented graphically as regulatory elements.

5. Which of the following is untrue about Bayes Aligner?

a) Posterior probability values, which are considered estimates of the true alignment, are calculated for each alignment.

b) The method generates a single best alignment

c) It aligns two sequences using a Bayesian algorithm which is a unique sequence alignment method

d) It is a web-based footprinting program

Answer: b

Explanation: Instead of returning a single best alignment, the method generates a distribution of a large number of alignments using a full range of scoring matrices and gap penalties. By studying the distribution, the alignment that has the highest likelihood score, which is in the extreme margin of the distribution, is chosen. Based on this unique alignment searching algorithm, weakly conserved motifs can be identified with high probability scores.

6. Which of the following is untrue about FootPrinter?

a) It is a web-based program for phylogenetic footprinting using multiple input sequences

b) The motifs from organisms spanning over the widest evolutionary distances are identified as promoter or regulatory motifs

c) The program performs multiple alignment of the input sequences to identify conserved motifs

d) The user does not necessarily need to provide a phylogenetic tree that defines the evolutionary relationship of the input sequences

Answer: d

Explanation: The user also needs to provide a phylogenetic tree that defines the evolutionary relationship of the input sequences. One may obtain the tree information from the “Tree of Life” web site, which archives known phylogenetic trees using ribosomal RNAs as gene markers. It identifies unusually well-conserved motifs across a set of orthologous sequences.

7. Which of the following is untrue?

a) MEME is the EM based program only for protein motif discovery

b) AlignACE is a web-based program using the Gibbs sampling algorithm to find common motifs

c) AlignACE is optimized for DNA sequence motif extraction

d) Melina stands for Motif Elucidator In Nucleotide sequence Assembly

Answer: a

Explanation: MEME is an EM-based program that is used for protein motif discovery but can also be used for DNA motif finding; its use is similar to that for protein sequences. AlignACE automatically determines the optimal number and lengths of motifs from the input sequences. Melina is a web-based program that runs four individual motif-finding algorithms (MEME, GIBBS sampling, CONSENSUS, and Core search) simultaneously. The user compares the results to determine the consensus of motifs predicted by all four prediction methods.

8. Which of the following is untrue about Expression Profiling–Based Method?

a) Genes with similar expression profiles are considered coexpressed, which can be identified through a clustering approach

b) This approach appears to be less effective for finding transcription factor binding sites.

c) An advanced alignment-independent profile construction method such as EM and Gibbs motif sampling is often used in finding the subtle sequence motifs

d) The basis for coexpression is thought to be due to common promoters and regulatory elements.

Answer: b

Explanation: This approach is essentially experimentally based and appears to be robust for finding transcription factor binding sites. The problem is that the regulatory elements of coexpressed genes are usually short and weak. Their patterns are difficult to discern using simple multiple sequence alignment approaches.

9. INCLUSive is a suite of web based tools designed to streamline the process of microarray data collection and sequence motif detection.

a) True

b) False

Answer: a

Explanation: The pipeline processes microarray data, automatically clusters genes according to expression patterns, retrieves upstream sequences of coregulated genes, and detects motifs using a Gibbs sampling approach (Motif Sampler). To further avoid the problem of getting stuck in a local optimum, each sequence dataset is submitted to Motif Sampler ten times. The results may vary in each run. The results from the ten runs are compiled to derive consensus motifs.

10. Which of the following is untrue about PhyloCon?

a) It stands for Phylogenetic Consensus

b) It is used to identify regulatory motifs

c) It is a UNIX program that combines phylogenetic footprinting with gene expression profiling analysis

d) No conservation among orthologous genes and conservation among coregulated genes is a disadvantage

Answer: d

Explanation: This approach takes advantage of conservation among orthologous genes as well as conservation among coregulated genes. For each individual gene in a set of coregulated genes, multiple sequence homologs are aligned to derive profiles. Based on the gene expression data, profiles between coregulated genes are further compared to identify functionally conserved motifs among evolutionarily conserved motifs.

15. Predicting the Structure of Protein – Biomolecular Interactions

Molecular Complementarity

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Molecular Complementarity”.

1. Which of the following is untrue about Amino acid conservation?

a) It has been known for some time that conservation of residues at the surface of a protein family is often related to function

b) Conservation of residues at the core of a protein family is often related to function

c) This may be an enzyme active-site or binding site

d) Unlike hydrophobicity or electrostatic potential, displaying residue conservation on the molecular surface has no physical or chemical basis

Answer: b

Explanation: However, the evolutionary information can sometimes delineate a functional epitope allowing residues to be identified that are important for binding. In order to infer structure-function relationships the proteins must be structurally related, preferably a large family of proteins with related function.

2. Which of the following is untrue about Shape Complementarity?

a) The complementarity between two proteins or a protein and ligand can be described by surface contact or overlap

b) The complementarity between two proteins or a protein and ligand can be described by the overall buried surface area of two molecules in contact

c) The complementarity between two proteins or a protein and ligand can be described by the number of adjacent surface points (atoms or residues)

d) The hydrophobic effect barely has impact on protein folding

Answer: d

Explanation: For example simple atom neighbor counting and the simple surface contact scores are of this type. Whilst these measures are easily calculated they also have some physical basis for being effective scoring functions. The principal driving force for protein folding and binding is the hydrophobic effect, which involves the free energy gain of removing non-polar surface area from water.
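Simple atom neighbor counting, mentioned above, can be sketched as a count of atom pairs across the interface that lie within a distance cutoff. The function name and the 4.5 Å cutoff are assumptions for illustration; coordinates are plain (x, y, z) tuples.

```python
import math


def contact_score(atoms_a, atoms_b, cutoff=4.5):
    """Count atom pairs (one from each molecule) closer than the cutoff (in angstroms)."""
    count = 0
    for a in atoms_a:
        for b in atoms_b:
            if math.dist(a, b) <= cutoff:
                count += 1
    return count
```

A higher count indicates more surface contact. The score treats polar and non-polar atoms alike, so it is only a rough proxy for the hydrophobic effect.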

3. The burial of surface area (or the maximization of surface contact) is an approximation of the effect of shape complementarity.

a) True

b) False

Answer: a

Explanation: However, since no distinction is generally made between polar and non-polar surface area, this is only an approximation: burial of polar surface area is unfavorable or neutral.

4. Which of the following is untrue about Grid Representation?

a) To speed up the matching process the topology of the protein can be simplified from atomic level detail to a series of cubic elements

b) To speed up the matching process discretizing the 3-dimensional space using a grid is done

c) Discretizing allows very fast computer matching using search methods such as Fourier transform

d) Fourier transform is hardly used in computer matching

Answer: d

Explanation: The shape of a molecule is described by mapping it to a 3-D grid of uniformly spaced points. Clearly the level of detail is controlled by the grid spacing. The larger the grid spacing the cruder the representation with respect to the atomic level.

5. In grid representation, in a translational scan the mobile molecule B moves through the grid representing the static molecule A, and a signal describing shape complementarity, f_C, is generated for each mapping. Mathematically, the correlation function f_C of f_A and f_B is given by

$$f_C(\alpha,\beta,\gamma) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N} f_A(i,j,k)\, f_B(i+\alpha,\, j+\beta,\, k+\gamma)$$

where N is the number of grid points along the cubic axes i, j, and k, and α, β, and γ are the translational vectors of the mobile molecule B relative to the static one A.

Answer: a

Explanation: The overlap between points representing the surface of the molecules is scored favorably, however, overlap with points representing the core of the static molecule are scored unfavorably. Zero correlation score is given to two molecules not in contact. Negative scores represent surface overlap with the core region of molecule A.
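The scoring behavior described above can be reproduced on a toy grid. The sketch evaluates the standard grid correlation, f_C(α,β,γ) = Σ f_A(i,j,k) · f_B(i+α, j+β, k+γ), by direct summation with cyclic indices. The function names are hypothetical, and the grid values (1 for surface points, -15 for core points of the static molecule, 0 elsewhere) are assumed for illustration. In practice the exhaustive scan would be replaced by a Fourier transform.

```python
from itertools import product


def empty_grid(n):
    """An n x n x n grid of zeros as nested lists."""
    return [[[0] * n for _ in range(n)] for _ in range(n)]


def correlation(f_a, f_b, n, shift):
    """Direct evaluation of f_C for one translation (alpha, beta, gamma), indices taken mod n."""
    al, be, ga = shift
    return sum(
        f_a[i][j][k] * f_b[(i + al) % n][(j + be) % n][(k + ga) % n]
        for i, j, k in product(range(n), repeat=3)
    )


def best_translation(f_a, f_b, n):
    """Translation with the highest correlation score."""
    return max(product(range(n), repeat=3),
               key=lambda s: correlation(f_a, f_b, n, s))


n = 4
f_a = empty_grid(n)
f_a[1][1][1] = -15   # core point of static molecule A (assumed penalty value)
f_a[2][1][1] = 1     # surface point of A
f_b = empty_grid(n)
f_b[0][1][1] = 1     # single surface point of mobile molecule B
```

With these grids, the translation that places B on the surface point of A scores 1, the one that places it on the core scores -15, and translations with no contact score 0.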

6. In Grid representation, the lowest score represents the best surface complementarity for a given translational scan.

a) True

b) False

Answer: b

Explanation: The highest score represents the best surface complementarity for a given translational scan. Note that the mobile molecule must be mapped differently to the grid for each rotational change applied to the molecule.

7. In Property-based measures, displaying physical properties on the molecular surface of molecules can help to guide molecular docking.

a) True

b) False

Answer: a

Explanation: Indeed in many cases, displaying physical properties on the molecular surface of molecules can help to guide molecular docking. Alternatively sequence conservation might also help, particularly where a homologous family of proteins maintain a specific binding partner.

8. Which of the following is untrue about Hydrophobicity?

a) The hydrophobic effect plays a dominant role in the folding of proteins

b) Hydrophobic residues aggregate away from contact with water

c) Hydrophobic residues aggregate to form hydrophobic cores, with more polar residues forming the solvent-accessible surface

d) Hydrophobic residues form the solvent accessible surface but restrict the solubility of the protein

Answer: d

Explanation: Clearly a hydrophobic interface will drive the formation of protein-protein or protein-ligand interactions. It has been noted that hydrophobicity is fairly common at protein-protein interfaces, particularly in homodimers (two identical protein monomers that associate) and oligomeric proteins.

9. Oligomers are often obligate complexes meaning that the free-energy cost of dissociation is high and they exist as oligomers under physiological conditions.

a) True

b) False

Answer: a

Explanation: In some cases biological function is dependent on this. However, many protein interactions are non-obligatory being made/broken according to their environment. These proteins must be independently stable in solution. These are commonly heterodimeric complexes including enzyme inhibitor and antibody-antigen complexes as well as a host of other casual interacting proteins. The hydrophobic effect is often much less dominant and interfaces are more polar in nature partly because of issues relating to protein stability and aggregation. Therefore, hydrophobicity is useful as a guide to molecular complementarity in selective protein-protein interactions.

10. Which of the following is untrue about Electrostatic Complementarity?

a) The electrostatic properties of biomolecules play an important role in determining interactions.

b) The burial of charged residues at protein-protein/DNA interfaces is thought to be generally net destabilizing with the hydrophobic effect being the primary driving force

c) Charged groups involved in the biomolecular interface are often stabilized by other polar or oppositely charged groups on the interacting molecule

d) Charged groups involved in the biomolecular interface are often stabilized by similar polar or same charged groups on the interacting molecule

Answer: d

Explanation: Therefore charge complementarity can play an important role in determining the specificity of the interaction. In many cases in biology a protein must recognize a highly charged molecule such as poly-anions like DNA or RNA. In order to make a close approach the protein must have charged residues that complement the negative charges present on the phosphate backbone. Electrostatic complementarity is also important in many protein-protein and protein-ligand interactions.

Conformational Flexibility

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Conformational Flexibility”.

1. Proteins are dynamic entities that undergo

a) only fluctuation of flexible loop regions about equilibrium positions when in solution

b) only limited conformational change of amino acid side-chains when in solution

c) both limited conformational change of amino acid side-chains and fluctuation of flexible loop regions about equilibrium positions when in solution

d) illimitable or total conformational change of amino acid side-chains when in solution

Answer: c

Explanation: Such flexibility can often be adequately treated by ‘soft’ potentials or limited conformational flexibility and/or refinement of side-chains on docking. However, proteins often undergo more extensive conformational changes which may involve large-scale motions of domains relative to one another or possibly conformational change involving order-disorder transitions.

2. Typical motions of proteins will be primarily treated by a rigid body model for docking.

a) True

b) False

Answer: b

Explanation: These types of motions will be poorly treated by a rigid body model for docking. Again distinction is often made between the general treatment of protein-protein docking and protein-ligand docking.

3. In the case of protein-ligand docking, ligands are often _____ and _____ in adapting their shape to fit the receptor binding pocket.

a) small molecule, highly flexible

b) large molecule, highly flexible

c) large molecule, more flexible

d) small molecule, less flexible

Answer: a

Explanation: The degree of conformational flexibility of a small molecule ligand (substrate, cofactor or inhibitor) can be considerable particularly where there are multiple torsion angles. This presents a major challenge in protein-ligand docking and several different approaches have been used to solve this problem.

4. In the multiple conformation rigid-body method, a ligand is assumed to be able to adopt a number (N) of different _____ that are computed _____ the ligand being docked into the receptor.

a) low-energy conformations, after

b) low-energy conformations, prior to

c) high-energy conformations, prior to

d) high-energy conformations, after

Answer: b

Explanation: These, N, low-energy conformations are then docked individually into the receptor assuming a rigid conformation using a descriptor-based approach. The scoring function is used to determine which of the resulting solutions is optimal.

5. The disadvantage of the multiple conformation rigid-body method is that the active conformation may be missed as the result of a minor structural difference not considered in the _____ ligand conformations, where N is the number of low-energy conformations.

a) N +1

b) N

c) N^2

d) N/2

Answer: b

Explanation: The disadvantage is that the active conformation may be missed as the result of a minor structural difference not considered in the N ligand conformations. The advantage of this approach is that the search can be restricted to a smaller number of relevant ligand conformations.

6. Stochastic search methods include Monte Carlo simulation, simulated annealing, Tabu search, genetic algorithms, and evolutionary programming.

a) True

b) False

Answer: a

Explanation: Stochastic processes use a random sampling procedure to search conformational space. The ligand molecule performs a random walk in space in the receptor cavity. Usually, the ligand is placed in a random orientation in the receptor cavity. Then at each step a small displacement is made in any of the degrees of freedom of the ligand molecule (translation, rotation or torsion angle).

7. In a Monte Carlo simulation the score (or energy) is calculated at each step and compared to the previous step. The probability of accepting the step is given by $P(\Delta E) = e^{-\Delta E / k_B T}$, where ΔE is the difference in energy, k_B is Boltzmann’s constant, and T is the temperature.

a) True

b) False

Answer: a

Explanation: If the new energy is lower the step is accepted, otherwise the result is treated probabilistically by a Boltzmann mechanism. If P (ΔE) is greater than a random number generated between 0 and 1 then the step is accepted. The higher the temperature (or the smaller ΔE at a given T) the higher the likelihood the step is accepted. A conventional Monte Carlo simulation proceeds at constant temperature, whilst in simulated annealing the temperature is gradually cooled during the simulation in an attempt to locate a globally optimal solution. In simulated annealing the computer stores a single solution and generates a new solution randomly.
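The acceptance rule and the cooling schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, energies are in reduced units with k_B = 1, the acceptance probability is the standard Boltzmann factor exp(-ΔE / (k_B T)), and the geometric cooling schedule is one arbitrary choice among many.

```python
import math
import random


def accept_step(delta_e, temperature, k_b=1.0, rng=random):
    """Metropolis criterion: always accept downhill moves, uphill moves with Boltzmann probability."""
    if delta_e <= 0:
        return True
    p = math.exp(-delta_e / (k_b * temperature))
    return p > rng.random()


def anneal(energy, propose, state, t_start=10.0, t_end=0.01,
           cooling=0.95, steps_per_t=50, seed=0):
    """Simulated annealing: keep a single solution and lower the temperature gradually."""
    rng = random.Random(seed)
    t = t_start
    current_e = energy(state)
    while t > t_end:
        for _ in range(steps_per_t):
            candidate = propose(state, rng)      # small random displacement
            candidate_e = energy(candidate)
            if accept_step(candidate_e - current_e, t, rng=rng):
                state, current_e = candidate, candidate_e
        t *= cooling                             # assumed geometric cooling
    return state, current_e
```

Holding `t` fixed instead of multiplying it by `cooling` would give a conventional constant-temperature Monte Carlo run.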

8. Genetic methods (genetic algorithms and evolutionary programming) store multiple solutions. These solutions form a population of members.

a) True

b) False

Answer: a

Explanation: Each member has an associated score or fitness. During the search for the global optimal solution successive new populations are created by a procedure involving selection of the fittest members. These members then have offspring to create a new population. Differences arise in how the methods generate offspring. In a genetic algorithm two solutions are mated to form a new offspring solution. In evolutionary programming each member of the population generates an offspring by mutation.
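The difference between the two ways of generating offspring can be sketched on solutions encoded as lists of numbers. All names here are hypothetical, and the details are assumptions made for illustration: the fitter half of the population survives, crossover uses a single cut point, mutation uses Gaussian noise, and in the evolutionary-programming variant a randomly chosen survivor is mutated rather than every member.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Genetic algorithm: two solutions are mated at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(parent, rng, scale=0.1):
    """Evolutionary programming: one parent yields one mutated offspring."""
    return [x + rng.gauss(0.0, scale) for x in parent]


def ga_offspring(survivors, rng):
    a, b = rng.sample(survivors, 2)
    return crossover(a, b, rng)


def ep_offspring(survivors, rng):
    return mutate(rng.choice(survivors), rng)


def next_generation(population, fitness, make_offspring, rng):
    """Select the fittest members, then refill the population with their offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, len(ranked) // 2)]
    children = [make_offspring(survivors, rng)
                for _ in range(len(population) - len(survivors))]
    return survivors + children
```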

9. Stochastic methods can guarantee reaching a global optimal solution and the methods are computationally costly in comparison to the other methods.

a) True

b) False

Answer: b

Explanation: The advantages of stochastic methods are that the ligand is able to explore conformational space in a relatively unconstrained way, frequently leading to the globally optimal solution. The disadvantages are that it cannot guarantee reaching a global optimal solution and the methods are computationally costly in comparison to the other methods.

10. Which of the following is true regarding Protein flexibility?

a) Methods have been described for the introduction of side-chain flexibility to both protein-ligand and protein-protein docking

b) Methods have been described for the introduction of side-chain flexibility to protein-ligand docking only

c) The Mean Field principle is rarely used in this

d) Methods have been described for the introduction of side-chain flexibility to protein-protein docking only

Answer: a

Explanation: The Mean Field approach is one type of what are called bounded search methods. Others include the Dead-end-elimination theorem and the A* algorithm. These methods use different approaches to find a solution and a detailed discussion is beyond the scope of this chapter. However, they use a multiple copy representation of protein side chains built using a rotamer library.

Evaluation of Models & Visualization Methods

This set of Bioinformatics Multiple Choice Questions & Answers (MCQs) focuses on “Evaluation of Models & Visualization Methods”.

1. In a similar way to structure prediction methods models can be evaluated using RMSD to measure the similarity between two molecular complexes.

a) True

b) False

Answer: a

Explanation: This is only the case if an experimental structure already exists. Several docking methods have been evaluated at ‘Critical Assessment of Structure Prediction 2’ (CASP2) for both protein-ligand and protein-protein interactions.
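As a reminder of what the measure computes, RMSD between a predicted complex and an experimental one can be sketched as below. The function name is hypothetical, and the sketch assumes the two coordinate lists contain the same atoms in the same order and are already in a common reference frame (no superposition is performed).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```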

2. In case of protein-protein docking, the level of success is dependent on the system under study.

a) True

b) False

Answer: a

Explanation: For the protein-protein docking evaluation several of the same docking methodologies were used in both docking challenges. In the first challenge involving a protein inhibitor all the groups successfully predicted the complex.

3. There was little degree of success in docking an antibody-antigen complex in the second challenge for the protein-protein docking evaluation.

a) True

b) False

Answer: a

Explanation: This may be because modeling molecular recognition is in general more difficult in antibody-antigen than protein-inhibitor systems. In the protein-ligand docking challenge at CASP2 there was generally a good level of success, however, again certain targets proved to be problematic.

4. The _____ is an ongoing community-wide experiment on the comparative evaluation of protein-protein docking for structure prediction.

a) Chronological Assignment of Prediction of Interactions (CAPRI)

b) Chronological Assessment of Prediction of Interactions (CAPRI)

c) Critical Assignment of Prediction of Interactions (CAPRI)

d) Critical Assessment of Prediction of Interactions (CAPRI)

Answer: d

Explanation: Clearly there is an ongoing need for assessment of methods. There is currently no one universal method or scoring function that will work on all occasions.

5. An understanding of the importance of different factors in a particular interaction is important if confidence in the results is required. Which of the following is not one of those factors?

a) Shape

b) Hydrophobicity

c) Electrostatics

d) pH

Answer: d

Explanation: pH is not unrelated to these interactions, since it affects the state of the molecule at a given instant, but it is not one of the factors considered here. The remaining factors are evolutionary relationships and conformational flexibility.

6. Information about shape, hydrophobicity, electrostatics, evolutionary relationships and conformational flexibility is not always available and the search for a universally applicable scoring function as well as an adequate treatment of conformational flexibility is ongoing.

a) True

b) False

Answer: a

Explanation: The technique of virtual screening of small molecule drugs by computer has become commonplace and a very useful tool in narrowing down the very large number of drugs that might need to be screened by experimental methods. In one study virtual screening provided lead compounds where conventional experimental random screening had failed.

7. In practical circumstances there exists an experimental structure of the complex.

a) True

b) False

Answer: b

Explanation: In practical circumstances an experimental structure of the complex does not already exist. Some other evaluation criteria are needed. This is by necessity experimental validation. The hypothetical protein-biomolecular complex predicts a mode of interaction between the two molecules that can be tested experimentally.

8. Visualization methods are very important in viewing molecular properties on molecules. Of particular note is the rendering of molecular surfaces according to their various properties (that can be expressed numerically).

a) True

b) False

Answer: a

Explanation: This approach has been popularized by the GRASP program. Also Virtual Reality Modeling Language (VRML) viewers can provide similar displays.

9. The popular program RasMol can be made to view molecular properties by assigning those properties to the temperature factor column of the sdf file in question only.

a) True

b) False

Answer: b

Explanation: The popular program RasMol can also be made to view molecular properties by assigning those properties to the temperature factor column of the PDB file in question. The molecules are then best viewed in Spacefill mode, however, the color scheme used is not flexible.

10. GRASP and VRML have been incorporated into the GRASS server.

a) True

b) False

Answer: a

Explanation: This allows a Web-based interactive exploration of molecules in the PDB allowing the molecular properties to be viewed on the molecular surface. There are also several other popular molecular graphics programs that allow a similar visualization.

16. Global Approaches for Studying Protein – Protein Interactions

Protein – Protein Interactions

1. The physical contacts between domains are crucial for the functioning of the cellular machinery.

a) True

b) False

Answer: a

Explanation: Interactions between domains occur in multidomain proteins, in stable complexes and in transient interactions between proteins that also exist independently. Experimental approaches for the large-scale determination of protein interactions are emerging. Theoretical analyses based on protein structures have unraveled some of the overall principles and features of the way domains evolved to interact with each other.

2. There exist three types of interactions between domains. Which of the following is not one of them?

a) Stable complex

b) Transient interaction

c) Multi-domain protein

d) Unstable interaction

Answer: d

Explanation: Interactions between domains determine the structure of multidomain proteins, in which there are several domains on one polypeptide chain. Given that all proteins consist of domains, interactions between domains also occur between the proteins that are permanently associated in stable complexes and proteins that interact transiently, but also exist independently of each other.

3. Stable complexes consist of proteins that are ______ associated with each other, like many ______ proteins for instance.

a) temporarily, oligomeric

b) temporarily, monomeric

c) permanently, oligomeric

d) permanently, monomeric View Answer

Answer: c

Explanation: Well-known stable complexes include the histone octamer, the ribosome and DNA and RNA polymerases. Transient interactions on the other hand are all those protein-protein interactions that occur between proteins that also exist independently.

4. Sets of proteins that are part of stable complexes and sets of proteins involved in transient interactions ______ in terms of the similarity in gene expression among the set of proteins.

a) are similar

b) differ

c) are same

d) show similar function View Answer

Answer: b

Explanation: Proteins permanently associated in a stable complex need to be present or absent in the cell at the same time. Analysis of microarray data by Gerstein and co-workers has shown that the members of stable complexes in the yeast Saccharomyces cerevisiae have highly correlated gene expression patterns.
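The idea of correlated expression can be illustrated with a Pearson correlation coefficient between two expression profiles. The following is only a minimal sketch: the expression values are made-up numbers, and the names `subunit_a`, `subunit_b` and `unrelated` are hypothetical, not data from the study mentioned above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression levels measured over five conditions (illustrative only).
subunit_a = [1.0, 2.0, 3.0, 4.0, 5.0]
subunit_b = [2.1, 4.0, 6.2, 7.9, 10.0]   # rises and falls together with subunit_a
unrelated = [3.0, 1.0, 4.0, 1.0, 5.0]    # no clear relationship to subunit_a

print("complex members:", round(pearson(subunit_a, subunit_b), 3))
print("unrelated pair: ", round(pearson(subunit_a, unrelated), 3))
```

A coefficient close to 1 for the first pair and a much lower value for the second is the kind of difference that separates members of a stable complex from arbitrary protein pairs.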

5. Correlation of gene expression for pairs of transiently interacting proteins is ______ compared to randomly chosen pairs of proteins.

a) not significant

b) only marginally significant

c) totally significant

d) significant to much extent View Answer

Answer: b

Explanation: In prokaryotes, genes are co-regulated if they are a member of the same operon, and many proteins that are members of the same stable complex are part of the same operon. For instance, Ouzounis and Karp determined that over 90% of the enzymes that are in stable complexes in E. coli metabolic pathways are adjacent on the E. coli chromosome.

6. Membership in a stable complex also differs from transient interaction in terms of evolutionary constraints upon sequence divergence.

a) True

b) False View Answer

Answer: a

Explanation: Thus the proteins in stable complexes are more similar across species, having higher sequence identity between orthologs, than the proteins in transient interactions. A calculation by Teichmann showed that there are significant differences between the average values for sequence identities between S. cerevisiae and S. pombe orthologs in stable complexes, transient interactions and monomers.

7. For proteins in stable complexes the average sequence identity is 46%, while for proteins in transient interactions it is 41%.

a) True

b) False View Answer

Answer: a

Explanation: Proteins not known to be involved in any type of interaction have an average sequence identity of 38%. One of the main reasons for this is the surface area involved in interfaces of stable complexes, which is larger than in transient complexes. Sequence divergence may be slower in order to conserve these extensive interfaces.

8. Which of the following is incorrect about Yeast-two-hybrid screens?

a) The yeast-two-hybrid system uses the transcription of a reporter gene driven by the Gal4 transcription factor to monitor whether or not two proteins are interacting

b) The DNA-binding domain chimeric protein will not bind upstream of the reporter gene

c) If the activation domain chimeric protein interacts with the DNA-binding domain chimeric protein, the reporter gene will be transcribed

d) Disadvantages of the method are that only pairwise interactions are tested, and not interactions that can only take place when multiple proteins come together, as well as a high false positive rate

View Answer

Answer: b

Explanation: If the interaction between two proteins, A and B, is being tested, one of their genes would be fused to the DNA-binding domain of the Gal4 transcription factor (Gal4-DBD) while the other would be fused to the activation domain (Gal4-AD). The DNA-binding domain chimeric protein will bind upstream of the reporter gene. This experiment can be carried out hundreds or even thousands of times on microassay plates, as in the case of the study by Uetz and colleagues on S. cerevisiae (yeast) interactions. Each array element on these plates contains yeast cells transformed with a particular combination of two plasmids, one carrying the DNA-binding domain chimeric protein and the other the activation domain chimeric protein.

9. Which of the following is incorrect about Purification of protein complexes followed by mass spectrometry?

a) Isolating protein complexes from cells allows identification of interactions between ensembles of proteins instead of just pairs

b) Systematic purification of complexes on a large scale is done by tagging hundreds of genes with an epitope

c) Unlike in the yeast-two-hybrid assay, this does not involve chimeric genes

d) Affinity purification based on the epitope will then extract all the proteins attached to the bait protein from cell lysates

View Answer

Answer: c

Explanation: As in the yeast-two-hybrid assay, this is done by making chimeric genes that are introduced into cells. The principle of mass spectrometric identification of proteins is that the protein is chopped into fragments by tryptic digestion, and the mass of each fragment is measured by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). This measurement is so accurate that the combination of amino acids in each fragment can be calculated and compared to a database of all the proteins in the proteome of the organism in order to find the correct one.
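The digestion-and-mass step can be sketched in a few lines. This is a simplified illustration, assuming the common rule that trypsin cleaves after K or R unless the next residue is P, and using standard monoisotopic residue masses; the example sequence is made up, and real identification software also handles missed cleavages and modifications.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(protein):
    """Cut after every K or R, except when the following residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

# Hypothetical protein fragment, for illustration only.
for pep in tryptic_digest("AKRPGKLM"):
    print(pep, round(peptide_mass(pep), 4))
```

The list of peptide masses computed this way for every protein in a proteome is what the measured masses would be compared against.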

Structural Analyses of Domain Interactions

1. Which of the following is untrue?

a) Many entries in the Protein DataBank (PDB) are three-dimensional structures of multiple domains

b) The structures in PDB provide experimental information about interactions between domains at atomic detail

c) There are comparatively few three-dimensional structures compared to the amount of data available from the lower resolution large-scale experiments

d) Many entries in the Protein DataBank (PDB) are two-dimensional structures of multiple domains

View Answer

Answer: d

Explanation: Analysis of structures consisting of multiple domains has uncovered some of the principles of domain interactions in three dimensions. This information can therefore be complementary to the experimental data on protein interactions and to the predicted interactions.

2. In protein domain family interaction map, the physical contacts of the domains in different families are represented by the lines between the nodes.

a) True

b) False View Answer

Answer: a

Explanation: Each node in this graph represents a protein domain family. There are a few families that are hubs in the network: these are large families that are functionally versatile, such as Rossmann domains indicated by an ‘R’ here. Most families engage in only one or two types of interactions.

3. In the Interaction map of domain families, the interactions of one family represent the sum of all the interactions of domains in that family.

a) True

b) False View Answer

Answer: a

Explanation: To study the large-scale patterns and evolution of interactions between protein domains, the interactions in terms of the domain families can be summarized. Thus the interactions of one family represent the sum of all the interactions of domains in that family. Precise information about contacts between individual domains can be extracted by analysis of PDB entries.

4. Most domain families only interact with one or two other families, while a few families are extremely versatile in their interactions and are connected to many families.

a) True

b) False View Answer

Answer: a

Explanation: The result of the known interactions between members of structural protein families is a graph of connections between families, where the nodes are protein families and the edges represent an interaction between at least one of the domains from each of the two families. This pattern is observed at the level of individual proteins as well, as similar networks can be constructed for the individual proteins in the yeast proteome, for instance.
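The graph described here can be sketched with a dictionary of sets. The family names and contacts below are hypothetical placeholders, and the threshold used to call a family a hub is an arbitrary assumption for the example.

```python
from collections import defaultdict

def build_family_graph(domain_contacts):
    """Nodes are families; an edge means at least one pair of domains is in contact."""
    graph = defaultdict(set)
    for fam_a, fam_b in domain_contacts:
        graph[fam_a].add(fam_b)
        graph[fam_b].add(fam_a)
    return graph

def hubs(graph, min_partners=3):
    """Families connected to at least min_partners families (assumed cutoff)."""
    return sorted(f for f, partners in graph.items() if len(partners) >= min_partners)

# Hypothetical domain-domain contacts, one tuple per observed contact.
contacts = [
    ("Rossmann", "CatalyticA"),
    ("Rossmann", "CatalyticB"),
    ("Rossmann", "CatalyticC"),
    ("Rossmann", "Rossmann"),     # contact between two domains of the same family
    ("Rossmann", "CatalyticA"),   # a repeated observation adds no new edge
    ("FamilyX", "FamilyY"),
]

graph = build_family_graph(contacts)
print(hubs(graph))                                  # versatile families
print({f: len(p) for f, p in graph.items()})        # number of partner families
```

Because edges are stored as sets, many contacts between domains of the same two families collapse into a single edge, which matches the "at least one" definition in the explanation.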

5. Almost ______ engage in interactions with domains from their own family when one includes oligomeric proteins.

a) one fifth of all known families

b) one fourth of all known families

c) all of all known families

d) half of all known families View Answer

Answer: d

Explanation: In this case, half of all known families engage in interactions with domains from their own family. Such symmetrical interactions appear to be particularly favorable.

6. In order to understand the geometry of domain combinations, different structures of homologous pairs of domains must be studied.

a) True

b) False View Answer

Answer: a

Explanation: This is important, because though the methods for structure prediction of individual domains are well established, much less is known about assemblies of domains. The network of domain family interactions is a purely two-dimensional map: it lays out the connections between families but does not provide information on the three-dimensional geometry of interactions.

7. The investigation (Aloy and Russell) of domain combinations in multidomain proteins by Bashton and Chothia focuses on two-domain proteins belonging to the Rossmann domain family.

a) True

b) False View Answer

Answer: a

Explanation: These proteins generally consist of one Rossmann domain and one catalytic domain. As for the analysis of transient interactions, all the proteins belonging to one family of catalytic domains form the same type of interface to the Rossmann domains.

8. The linkers between the catalytic domain and the Rossmann domain were conserved in each family.

a) True

b) False View Answer

Answer: a

Explanation: This means that interface conservation within one catalytic family is a result of the direct evolutionary relationship between the proteins that have a particular pair of domains. In other words, each set of Rossmann domain proteins with a particular catalytic domain has descended from one common ancestral recombination event.

9. Across the different types of catalytic families, the position of the two domains with respect to one another varied, but only within a range of about

a) 20°

b) 10°

c) 90°

d) 80°

View Answer

Answer: c

Explanation: This is the result of a functional constraint in these enzymes: the catalytic domain can only take up a limited variety of positions, as the substrate needs to be held sufficiently close to the NAD(P) cofactor of the Rossmann domain. In other multidomain proteins where there is no such strict functional constraint, the domain interfaces of one domain family to other families may well be more variable.

The Use of Gene Order & Phylogeny to Predict Protein – Protein Interactions

1. Experimentation is preferable to computational methods in every respect.

a) True

b) False View Answer

Answer: b

Explanation: Computational methods for predicting protein-protein interactions are desirable as experimental determination is time-consuming and expensive. Several prediction methods are based on the observation that if two proteins are part of the same complex, it is favorable for the two interaction partners to be co-expressed and co-regulated.

2. Interactions between proteins can be predicted computationally by looking for sets of genes that occur as a

a) single gene in at least one genome

b) multiple genes in at least one genome

c) multiple genes in various genomes

d) single gene in various genomes View Answer

Answer: a

Explanation: Interactions between proteins can be predicted computationally by looking for sets of genes that occur as a single gene in at least one genome. Or it can be done by looking for prokaryotic genes that have conserved adjacent gene order across several genomes.

3. Genes that are consistently part of the same operon across different, distantly related genomes are likely to be part of the same protein complex or functional process across all species.

a) True

b) False View Answer

Answer: a

Explanation: This is because they have been selected to remain as a co-regulated unit throughout the extensive shuffling of gene order that takes place in prokaryote genomes. Thus conservation of gene order across different, distantly related genomes has been used as a method for predicting protein interactions.
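The conserved gene order idea can be sketched by counting how many genomes contain a given pair of orthologs as immediate neighbors. The genome names and gene labels below are hypothetical, each genome is reduced to a list of ortholog identifiers in chromosomal order, and strand and operon boundaries are ignored for simplicity.

```python
from collections import Counter

def adjacent_pairs(gene_order):
    """Unordered pairs of genes that sit next to each other on the chromosome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}

def conserved_neighbors(genomes, min_genomes):
    """Pairs that are adjacent in at least min_genomes genomes."""
    counts = Counter()
    for order in genomes.values():
        counts.update(adjacent_pairs(order))
    return {tuple(sorted(pair)) for pair, n in counts.items() if n >= min_genomes}

# Hypothetical gene orders; letters stand for orthologous genes.
genomes = {
    "genome1": ["a", "b", "c", "d"],
    "genome2": ["d", "b", "a", "c"],
    "genome3": ["c", "x", "a", "b"],
}

print(conserved_neighbors(genomes, min_genomes=3))
```

Only the pair that stays adjacent in all three shuffled orders is reported, which is the signal used to predict an interaction or functional association.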

4. When comparing pairs of genes or sets of genes in different genomes for this purpose, it is not mandatory for the genes to be orthologs.

a) True

b) False View Answer

Answer: b

Explanation: When comparing pairs of genes or sets of genes in different genomes for this purpose, it is important to ensure that the genes are truly equivalent, in other words that they are orthologs, as opposed to merely similar genes. This is frequently done by only accepting a pair of proteins as orthologs if they are ‘bi-directional best hits’. This means that both proteins are the best match to each other when searching against the other proteome. The other extreme would be to consider proteins as equivalent if they share just one of many domains, for instance.
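The bi-directional best hit test can be sketched as follows. The similarity scores and protein names are invented for illustration; in practice they would come from a sequence search of each proteome against the other.

```python
def best_hit(scores):
    """scores: {query: {subject: score}} -> {query: highest-scoring subject}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hit(a_vs_b)
    best_ba = best_hit(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical search scores between proteome A and proteome B.
a_vs_b = {"A1": {"B1": 200, "B2": 150}, "A2": {"B1": 180}}
b_vs_a = {"B1": {"A1": 210, "A2": 170}, "B2": {"A1": 140}}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here A2 finds B1 as its best match, but B1 prefers A1, so only the reciprocal pair is accepted as orthologs.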

5. Members of a stable complex are often co-regulated and thus will be detected by the method of Conservation of gene order.

a) True

b) False View Answer

Answer: a

Explanation: However, proteins that are part of the same metabolic pathway or the same biological process can also be co-regulated. Therefore, although some of the genes detected as interacting by this method physically interact, others are just functionally associated with each other.

6. A quantitative assessment of this method (conservation of gene order) used the genome of the parasitic organism Mycoplasma genitalium as a benchmark.

a) True

b) False View Answer

Answer: a

Explanation: Huynen and colleagues found that two-thirds to four-fifths of the general interactions detected correspond to physical interactions and another 13% correspond to a metabolic or non-metabolic pathway. Genes that are physically associated with each other while being regulated and expressed individually will not be detected by this method.

7. Conservation of gene order due to operon structure is ______, so interactions of proteins specific to eukaryotes cannot be detected by the method of conservation of gene order.

a) not applicable to archaea genomes

b) not applicable to prokaryote genomes

c) applicable to eukaryote genomes

d) not applicable to eukaryote genomes View Answer

Answer: d

Explanation: Co-regulation of genes in eukaryotes can be inferred by similarity in the expression patterns of genes. A comparison of co-expressed yeast and worm genes showed that 90% of those pairs of genes with conserved co-expression are members of stable complexes. Thus, the principle of conserved co-regulation across distantly related organisms applies to stable complexes in both prokaryotes and eukaryotes, and can be used as a prediction tool in both.

8. An approach for predicting ______ is to look for cases across a set of genomes where ______ are part of the same gene in one genome; this is the gene fusion method.

a) gene interactions, only three to four orthologs

b) gene interactions, two orthologs

c) protein interactions, two or more orthologs

d) protein interactions, two orthologs View Answer

Answer: c

Explanation: The prediction is then that the orthologs that are on separate genes in the other genomes interact with each other. In the case of gene fusion, the fused proteins are not only co-regulated, as in conservation of gene order described above, but also permanently colocalized in the cell. The additional requirement of colocalization beyond just co-regulation poses a further limitation on the prediction method.
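The gene fusion logic can be sketched by recording which orthologous groups occur in each gene. All genome, gene and group names below are hypothetical, and the sketch assumes each unfused gene maps to exactly one group and that a group has at most one unfused gene per genome.

```python
from itertools import combinations

def fused_pairs(genome):
    """Pairs of orthologous groups found together in a single gene."""
    pairs = set()
    for groups in genome.values():
        pairs.update(combinations(sorted(groups), 2))
    return pairs

def predict_by_fusion(genomes):
    """Predict interactions between separate genes whose groups are fused elsewhere."""
    fused = set()
    for genome in genomes.values():
        fused |= fused_pairs(genome)
    predictions = []
    for name, genome in genomes.items():
        # Map each group to the gene that carries it alone (unfused).
        single = {}
        for gene, groups in genome.items():
            if len(groups) == 1:
                single[next(iter(groups))] = gene
        for x, y in sorted(fused):
            if x in single and y in single:
                predictions.append((name, single[x], single[y]))
    return predictions

# Hypothetical genomes: gene name -> set of orthologous groups it contains.
genomes = {
    "genome1": {"g1": {"OG_A", "OG_B"}},                      # fused gene
    "genome2": {"h1": {"OG_A"}, "h2": {"OG_B"}, "h3": {"OG_C"}},
}

print(predict_by_fusion(genomes))
```

The separate genes h1 and h2 are predicted to interact because their counterparts are fused in genome1, while h3 has no fusion evidence and is left out.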

9. Domains that are part of a multidomain protein are

a) neither co-regulated nor colocalized

b) not co-regulated but colocalized

c) co-regulated but not colocalized

d) co-regulated and colocalized View Answer

Answer: d

Explanation: Therefore, members of stable complexes as well as consecutive enzymes may be involved in gene fusions. However, elements of signal transduction chains, for instance, are seldom part of the same gene, as it is an essential part of their function that they can be regulated and localized independently.

10. Due to the requirement for co-regulation as well as colocalization, the method is mostly limited to certain classes of protein-protein interactions.

a) True

b) False View Answer

Answer: a

Explanation: Those classes are members of the same stable complex and proteins in the same metabolic pathway. In their assessment of computational methods for prediction of all types of protein interactions, Huynen and colleagues found that two-thirds of the interactions detected in this way were between proteins that physically interact and another 15% between proteins that are part of the same metabolic pathway. The remaining interactions involved hypothetical proteins of unknown function.

11. The phylogenetic profile method relies on detection of orthologs (or homologs, in a variation of the method) in a set of genomes.

a) True

b) False View Answer

Answer: a

Explanation: If the pattern of ortholog presence or absence is the same in a group of proteins, then these proteins are clustered together as belonging to the same functional class. A method that appears to reliably predict a loose functional correlation between proteins is the phylogenetic profile method developed by Pellegrini and colleagues.

12. In the assessment of methods to predict protein-protein interactions, one third of such pairs were found to physically interact, and an additional third to belong to the same metabolic pathway or functional process.

a) True

b) False View Answer

Answer: a

Explanation: As mentioned, one third of such pairs were found to physically interact. The phylogenetic patterns of clusters of orthologous groups of proteins deposited in the COG database of Koonin and co-workers could in principle be used for prediction in the same way.

13. In the phylogenetic profile method for predicting protein interaction, presence or absence of orthologous genes is scored across a variety of genomes.

a) True

b) False View Answer

Answer: a

Explanation: This is represented by presence or absence of a dot in the row. Genes that have the same pattern of presence or absence across genomes are predicted to interact.
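A phylogenetic profile can be represented as a tuple of 1s and 0s, one entry per genome. The sketch below groups genes with identical profiles; the gene names and profiles are made up, and real implementations usually tolerate small differences between profiles rather than requiring an exact match.

```python
from collections import defaultdict

def group_by_profile(presence):
    """presence: {gene: tuple of 0/1 across genomes} -> groups sharing a profile."""
    clusters = defaultdict(list)
    for gene, profile in presence.items():
        clusters[tuple(profile)].append(gene)
    return [sorted(genes) for genes in clusters.values() if len(genes) > 1]

# Hypothetical presence (1) or absence (0) of an ortholog in four genomes.
presence = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),   # same pattern as P1
    "P3": (0, 1, 1, 0),
    "P4": (1, 1, 1, 1),   # present everywhere
}

print(group_by_profile(presence))
```

P1 and P2 are gained and lost together across the genomes, so they are the pair predicted to be functionally linked.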

14. Structural analyses on small sets of proteins have shown that the domains from a pair of families bind to each other with the same geometry in multi-domain proteins and in transient interactions.

a) True

b) False View Answer

Answer: a

Explanation: Most domain families engage in interactions with one or two other types of families, but a few families are very versatile and interact with many families. These versatile families are ubiquitously useful families such as P-loop nucleotide triphosphate hydrolases and Rossmann domains.

15. The most detailed experimental information about protein-protein interactions comes from three-dimensional structures.

a) True

b) False View Answer

Answer: a

Explanation: A three-dimensional structure gives the most detailed information about the conformation of the interacting proteins. It is likely that complexed protein structures will be solved more frequently in the wake of the structural genomics projects.

17. Predicting the Structure of Protein – Biomolecular Interactions

DNA & Genomic Sequencing

1. Which of the following is untrue about DNA sequencing methods?

a) Purified fragments of DNA are cut from plasmid/phage clones or amplified by polymerase chain reaction (PCR)

b) Clones of DNA fragments are denatured to single strands, and one of the strands is hybridized to an oligonucleotide primer

c) Taq polymerase is quite heat sensitive

d) New strands of DNA are synthesized from the end of the primer View Answer

Answer: c

Explanation: In an automated procedure, new strands of DNA are synthesized from the end of the primer by heat-resistant Taq polymerase from a pool of deoxyribonucleotide triphosphates (dNTPs) that includes a small amount of one of four chain-terminating nucleotides (ddNTPs).

2. Using ddATP, the resulting synthesis creates a set of nested DNA fragments, each one ending at one of the As in the sequence through the substitution of a fluorescently labeled ddATP.

a) True

b) False View Answer

Answer: a

Explanation: A similar set of fragments is made for each of the other three bases. But each set is labeled with a different fluorescent ddNTP.

3. The combined mixture of all labeled DNA fragments is electrophoresed to ______ the fragments by ______, and the ladder of fragments is scanned for the presence of each of the four labels.

a) separate, size

b) separate, pH

c) assimilate, pH

d) assimilate, size View Answer

Answer: a

Explanation: A computer program then determines the probable order of the bands and predicts the sequence. Depending on the actual procedure being used, one run may generate a reliable sequence of as many as 500 nucleotides.
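How the sequence is read from the ladder can be shown with a toy model. The sketch assumes an idealized run in which one fragment is produced for every position and every fragment is labeled by its terminating base; the strand `GATTACA` is an arbitrary example.

```python
def nested_fragments(synthesized_strand, base):
    """Lengths of the chain-terminated fragments made with one ddNTP."""
    return [i + 1 for i, b in enumerate(synthesized_strand) if b == base]

def read_ladder(fragments_by_base):
    """Sort all labeled fragments by size and read the labels in order."""
    ladder = sorted(
        (length, base)
        for base, lengths in fragments_by_base.items()
        for length in lengths
    )
    return "".join(base for _, base in ladder)

synthesized = "GATTACA"  # arbitrary example of a newly synthesized strand
fragments = {base: nested_fragments(synthesized, base) for base in "ACGT"}

print(fragments["A"])          # fragments ending at each A
print(read_ladder(fragments))  # sequence recovered from the size-sorted ladder
```

Sorting by fragment length plays the role of electrophoresis, and reading the label of each band in order recovers the synthesized strand.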

4. The sequence can also be verified by making an oligonucleotide primer complementary to the distal part of the readable sequence and using it to obtain the sequence of the complementary strand on the original DNA template.

a) True

b) False View Answer

Answer: a

Explanation: For accurate work, a printout of the scan is usually examined for abnormalities that decrease the quality of the sequence, and the sequence may then be edited manually. The first sequence can also be extended by making a second oligonucleotide matching the distal end of the readable sequence and using this primer to read more of the original template.

5. When the process is fully automated, a number of priming sites may be used to obtain sequencing results that give optimal separation of bands in each region of the sequence.

a) True

b) False View Answer

Answer: a

Explanation: By repeating this procedure, both strands of a DNA fragment several kilobases in length can be sequenced. Sequential sequencing of a DNA molecule using oligonucleotide primers is done later.

6. To sequence larger molecules, individual chromosomes are purified and broken into ______ or larger random fragments, which are cloned into vectors designed for large molecules.

a) 100-Mb

b) 100-kb

c) 5000-kb

d) 600-kb View Answer

Answer: b

Explanation: To sequence larger molecules, such as human chromosomes, individual chromosomes are purified and broken into 100-kb or larger random fragments, which are cloned into vectors designed for large molecules, such as yeast (YAC) or bacterial (BAC) artificial chromosomes. In a laborious procedure, the resulting library is screened for fragments called contigs, which have overlapping or common sequences, to produce an integrated map of the chromosome.

7. Many levels of clone redundancy may be required to build a consensus map because individual clones can have all of the following except

a) rearrangements

b) deletions

c) two separate fragments

d) vectors View Answer

Answer: d

Explanation: Individual clones can have rearrangements, deletions, or two separate fragments. These do not reflect the correct map and have to be eliminated. Every clone is carried in a vector, so option d is not a reason for requiring redundancy.

8. Once the correct map has been obtained, unique overlapping clones are chosen for sequencing.

a) True

b) False View Answer

Answer: a

Explanation: However, these molecules are too large for direct sequencing. One procedure for sequencing these clones is to subclone them further into smaller fragments that are of sizes suitable for sequencing, make a map of these clones and then sequence overlapping clones. However, this method is expensive because it requires a great deal of time to keep track of all the subclones.

9. An alternative method is to sequence all the subclones, produce a computer database of the sequences, and then have the computer assemble the sequences from the overlaps that are found.

a) True

b) False View Answer

Answer: a

Explanation: Up to 10 levels of redundancy are used to get around the problem of a small fraction of abnormal clones. This procedure was first used to obtain the sequence of the approximately 1.8-Mb chromosome of the bacterium Haemophilus influenzae by The Institute for Genomic Research (TIGR) team. Only a few regions could not be joined because of a problem subcloning those regions into plasmids, requiring manual sequencing of these regions from another library of phage subclones.
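Assembly from overlaps can be illustrated with a greedy merge of the best-overlapping pair of reads. This is a toy sketch, not the assembler used in the project described above: the reads are invented, error-free and all on the same strand, and the minimum overlap of 3 bases is an arbitrary assumption.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: the contigs cannot be joined
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Hypothetical reads sampled from the made-up sequence ATGCGTACGTTAGC.
reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(reads))
```

When no overlaps remain, the function returns several contigs instead of one, which corresponds to the regions that could not be joined. Repeated sequences would create misleading overlaps in a scheme like this.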

10. Which of the following is untrue about Shotgun Sequencing?

a) When DNA fragments derived from different chromosomal regions have repeats of the same sequence, they will appear to overlap

b) When DNA fragments derived from different chromosomal regions have repeats of the same sequence, they will appear to scrutinize

c) In a new whole shotgun approach, Celera Genomics is sequencing both ends of DNA fragments of short (2 kb), medium (10 kb), and long (BAC or >100 kb) lengths

d) A large number of reads are then assembled by computer View Answer

Answer: b

Explanation: A controversy has arisen as to whether or not the above shotgun sequencing strategy can be applied to genomes with repetitive sequences such as those likely to be encountered in sequencing the human genome. This method has been used to assemble the genome of the fruit fly Drosophila melanogaster after removal of the most highly repetitive regions and also to assemble a significant proportion of the human genome.

Sequencing cDNA Libraries of Expressed Genes, Submission of Sequences to the Databases

1. Two common goals in sequence analysis are to identify sequences that encode proteins, which determine all cellular metabolism, and to discover sequences that regulate the expression of genes or other cellular processes.

a) True

b) False View Answer

Answer: a

Explanation: Genomic sequencing meets both goals. However, only a small percentage of the genomic sequence of many organisms actually encodes proteins because of the presence of introns within coding regions and other noncoding regions in the genome.

2. cDNA libraries have been prepared that have the same sequences as the mRNA molecules produced by organisms, or else cDNA copies are sequenced directly by RT-PCR (copying of mRNA by reverse transcriptase followed by sequencing of the cDNA copy by the polymerase chain reaction).

a) True

b) False View Answer

Answer: a

Explanation: There has been a great deal of progress in developing computational methods for analyzing genomic sequences and finding these protein-encoding regions. But these methods are not completely reliable and, furthermore, such genomic sequences are often not available.

3. Using cDNA sequences with the ______, it is much simpler to locate protein-encoding sequences in these molecules.

a) exons taken out

b) exons removed

c) introns added

d) introns removed View Answer

Answer: d

Explanation: The only possible difficulty is that a gene of interest may be developmentally expressed or regulated in such a way that the mRNA is not present. This problem has been circumvented by pooling mRNA preparations from tissues that express a large proportion of the genome, from a variety of tissues and developing organs or from organisms subjected to several environmental influences.

4. An important development for computational purposes was the decision by Craig Venter to prepare databases of partial sequences of the expressed genes, called expressed sequence tags or ESTs.

a) True

b) False View Answer

Answer: a

Explanation: This was an important development because each EST has just enough DNA sequence to give a good idea of the protein sequence.

5. The translated sequence can then be compared to a database of protein sequences with the hope of finding a strong similarity to a protein of known function, and hence to identify the function of the cloned EST.

a) True

b) False View Answer

Answer: a

Explanation: Comparing the translated sequence to a protein database in this way can identify the function of the cloned EST. The corresponding cDNA clone of the gene of interest can then be obtained and the gene completely sequenced.

6. Investigators are encouraged to submit their newly obtained sequences directly to a member of the International Nucleotide Sequence Database Collaboration, such as the NCBI, DDBJ, and EMBL.

a) True

b) False View Answer

Answer: a

Explanation: NCBI stands for National Center for Biotechnology Information. It manages GenBank. DDBJ and EMBL stand for DNA Data Bank of Japan and European Molecular Biology Laboratory respectively.

7. NCBI reviews new entries and updates existing ones, as requested.

a) True

b) False View Answer

Answer: a

Explanation: A database accession number, which is required to publish the sequence, is provided. New sequences are exchanged daily by the GenBank, EMBL, and DDBJ databases.

8. Which of the given statements is incorrect?

a) The simplest and newest way of submitting sequences is through the Web site on a Web form page called BankIt

b) The sequence can also be annotated with information about the sequence, such as mRNA start and coding regions

c) The submitted form is transformed into GenBank format and returned to the submitter for review before being added to GenBank

d) Sequin does not run on UNIX View Answer

Answer: d

Explanation: The other method of submission is to use Sequin (formerly called Authorin), which runs on personal computers and UNIX machines. The program provides an easy-to-use graphic interface and can manage large submissions such as genomic sequence information.

9. Which of the given statements is untrue?

a) There is no detailed check of sequence accuracy prior to submission to GenBank and other databases

b) Often, a sequence is submitted at the time of publication of the sequence in a journal article, providing a certain level of checking by the editorial peer review process

c) No sequence is submitted without being published or prior to publication

d) In laboratories performing large sequencing projects, such as those engaged in the Human Genome Project or the genome projects of model organisms, the granting agency requires a certain level of accuracy of the order of 1 possible error per 10 kb View Answer

Answer: c

Explanation: Many sequences are submitted without being published or prior to publication. As mentioned in option d, the level of accuracy should be sufficient for most sequence analysis applications such as sequence comparisons, pattern searching, and translation.

10. Granting agencies require a certain level of sequence accuracy. Which of the given statements about sequence errors is untrue?

a) In other laboratories, such as those performing a single-attempt sequencing of ESTs, the error rate may be much higher, approximately 1 in 100, including incorrectly identified bases and inserted or deleted bases

b) Incorrect bases always translate to the right amino acid

c) Base insertions/deletions will cause frame-shifts in the sequence

d) Making alignment with a protein sequence becomes difficult because of frameshifts View Answer

Answer: b

Explanation: In translating EST sequences in GenBank and other databases, incorrect bases may translate to the wrong amino acid. Another type of database sequence that is error-prone is a fragment of sequence from the immunological variant of a pathogenic organism, such as the regions in the protein coat of the human immunodeficiency virus (HIV). Although this low level of accuracy may be suitable for some purposes such as identification, for more detailed analyses, e.g., evolutionary analyses, the accuracy of such sequence fragments should be verified.
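The different effects of a wrong base and an inserted base can be seen by translating a short sequence with the standard genetic code. The DNA sequences below are invented examples; the sketch translates only the first reading frame and ignores any incomplete trailing codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate frame 1 of a DNA sequence; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

correct      = "ATGGCCAAAGGT"    # made-up coding sequence
substitution = "ATGGCCAGAGGT"    # one incorrect base (A -> G in the third codon)
insertion    = "ATGTGCCAAAGGT"   # one extra base after the first codon

print(translate(correct))        # MAKG
print(translate(substitution))   # MARG: a single wrong amino acid
print(translate(insertion))      # MCQR: every codon after the insertion is shifted
```

A substitution changes at most one amino acid, whereas an insertion or deletion changes everything downstream, which is why frameshifts make alignment with a protein sequence difficult.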

Sequence Formats & Computer Storage of Sequences

1. Which of the following is wrong about GenBank DNA Sequence Entry?

a) The information is organized into fields, each with an identifier, shown as the first text on each line

b) In some entries, these identifiers may be abbreviated to two letters, e.g., RF for reference

c) Some identifiers may have additional subfields

d) The CDS subfield in the field FEATURES does not offer the amino acid sequence View Answer

Answer: d

Explanation: The CDS subfield in the field FEATURES gives the amino acid sequence obtained by translation of known and potential open reading frames. The format of a database entry in GenBank, the NCBI nucleic acid and protein sequence database, is as follows: Information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations.

2. An open reading frame is a consecutive set of three-letter words that could be codons specifying the amino acid sequence of a protein. In a GenBank entry, the sequence is assumed by computer programs to lie between the identifiers “ORIGIN” and “//”.

a) True

b) False

Answer: a

Explanation: The sequence includes numbers on each line so that sequence positions can be located by eye. Because the sequence count or a sequence checksum value may be used by the computer program to verify the sequence composition, the sequence count should not be modified except by programs that also modify the count. The GenBank sequence format often has to be changed for use with sequence analysis software.
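
The “ORIGIN” to “//” convention can be illustrated with a minimal Python sketch. The miniature entry below is made up for illustration, and the function name is hypothetical; real GenBank entries carry many more fields, and a full parser would handle them.

```python
def extract_origin_sequence(entry_text):
    """Return the sequence found between the ORIGIN line and the // line."""
    in_sequence = False
    pieces = []
    for line in entry_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Keep letters only: this drops the position numbers and spaces.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)


# Made-up miniature entry, for illustration only.
example_entry = """LOCUS       EXAMPLE
ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct
//
"""

sequence = extract_origin_sequence(example_entry)
print(sequence, len(sequence))
```

Stripping the position numbers in this way is one reason the GenBank format often has to be converted before it is used with other software.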

3. In the organization of the GenBank database and the search procedure used by ENTREZ, each row is a sequence entry and each column is a GenBank field.

a) True

b) False

Answer: a

Explanation: When one sequence entry is retrieved, all of these fields will be displayed. A search for the term “SOS regulon and coli” in all fields will find two matching sequences. Finding these sequences is simple because indexes have been made listing all of the sequences that have any given term, one index for each field. Similarly, a search for “transcriptional regulator” will find three sequences.
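
The idea of one index per field can be sketched in Python as follows. The three entries, their accession numbers, and the function names are invented for illustration; this is not how ENTREZ is actually implemented, only a small model of the table-plus-index idea.

```python
from collections import defaultdict

# Made-up entries: each row is one sequence entry, each key one field.
entries = [
    {"accession": "X0001", "organism": "Escherichia coli",
     "definition": "lexA gene SOS regulon repressor"},
    {"accession": "X0002", "organism": "Escherichia coli",
     "definition": "recA gene SOS regulon"},
    {"accession": "X0003", "organism": "Bacillus subtilis",
     "definition": "transcriptional regulator"},
]


def build_indexes(rows):
    """Build one index per field: field -> term -> set of row numbers."""
    indexes = defaultdict(lambda: defaultdict(set))
    for row_number, row in enumerate(rows):
        for field, text in row.items():
            for term in text.lower().split():
                indexes[field][term].add(row_number)
    return indexes


def search_all_fields(indexes, term):
    """Return the row numbers of entries that contain the term in any field."""
    hits = set()
    for field_index in indexes.values():
        hits |= field_index.get(term.lower(), set())
    return hits


indexes = build_indexes(entries)
sos_hits = search_all_fields(indexes, "SOS")
coli_hits = search_all_fields(indexes, "coli")
# Requiring both terms is a set intersection of the two hit lists.
both = sorted(sos_hits & coli_hits)
print([entries[i]["accession"] for i in both])
```

Because each lookup is a dictionary access rather than a scan of every entry, combining terms stays cheap even when the table is large.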

4. Which of the following is wrong about European Molecular Biology Laboratory Data Library Format?

a) EMBL maintains DNA and protein sequence databases

b) As with GenBank entries, a large amount of information describing each sequence entry is given

c) Sequence entry includes literature references and information about the function of the sequence, but not locations of mRNAs and coding regions

d) Information is organized into fields, each with an identifier, shown as the first text on each line

Answer: c

Explanation: Sequence entry includes literature references and information about the function of the sequence, locations of mRNAs and coding regions and positions of important mutations. The sequence count or a checksum value for the sequence may be used by computer programs to make sure that the sequence is complete and accurate. For this reason, the sequence part of the entry should usually not be modified except with programs that also modify this count.

5. The format of an entry in the SwissProt protein sequence database is very similar to the EMBL format.

a) True

b) False

Answer: a

Explanation: The format is quite similar to the EMBL format, except that considerably more information about the physical and biochemical properties of the protein is provided. Also, the output of a DDBJ DNA sequence entry is almost identical to that of GenBank.

6. Which of the following is wrong about FASTA Sequence Format?

a) The FASTA sequence format includes a comment line identified by a “>” character in the first column followed by the name and origin of the sequence

b) The FASTA sequence format includes the sequence in standard one-letter symbols

c) This format provides a very convenient way to copy just the sequence part from one window to another because there are no numbers or other nonsequence characters within the sequence

d) The presence of ‘*’ is not quite essential for reading the sequence correctly by some sequence analysis programs

Answer: d

Explanation: The FASTA sequence format includes an optional ‘*’ that indicates the end of the sequence; it may or may not be present, and its presence may be essential for some sequence analysis programs to read the sequence correctly. The FASTA sequence format is similar to the protein information resource (NBRF) format except that the NBRF format includes a first line with a “>” character in the first column followed by information about the sequence, a second line containing an identification name for the sequence, and the third to last lines containing the sequence.
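
A minimal Python sketch of reading the FASTA layout described above is given here. The function name and the two example records are made up, and the handling of ‘*’ assumes it appears only as an end-of-sequence marker.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Drop a trailing '*' (assumed to be the end-of-sequence marker).
            chunks.append(line.rstrip("*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 made-up example sequence
MKTAYIAKQR
QISFVKSHFS*
>seq2 second made-up sequence
GATTACA
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, seq)
```

Since there are no numbers or other nonsequence characters inside the sequence lines, joining them is all that is needed to recover the sequence.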

7. Which of the following is wrong about National Biomedical Research Foundation/Protein Information Resource Sequence Format?

a) Sequences retrieved from the PIR database are not in this compact format, but in an expanded format with much more information about the sequence

b) The NBRF format is similar to the FASTA sequence format but with significant differences

c) This is different than PIR format

d) The first line includes an initial “>” character followed by a two-letter code such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate type of sequence, then a semicolon, then a four- to six-character unique name for the entry

Answer: c

Explanation: This sequence format is sometimes also called the PIR format. It has been used by the National Biomedical Research Foundation/Protein Information Resource (NBRF) and also by other sequence analysis programs.

8. In the Stanford University/Intelligenetics sequence format, a 1 is placed at the end of the sequence if the sequence is linear, and a 2 if the sequence is circular.

a) True

b) False

Answer: a

Explanation: Started by a molecular genetics group at Stanford University and subsequently continued by a company, Intelligenetics, the IG format is similar to the PIR format, except that a semicolon is usually placed before the comment line. The identifier on the second line is also present.

9. Which of the following is wrong about Genetics Computer Group Sequence Format?

a) Earlier versions of the Genetics Computer Group (GCG) programs require a unique sequence format and include programs that convert other sequence formats into GCG format

b) Information about the sequence in the GenBank entry is not included but the line information is carried out

c) If one or more sequence characters become changed through error, a program reading the sequence will be able to determine that the change has occurred because the checksum value in the sequence entry will no longer be correct

d) Lines of information are terminated by two periods, which mark the end of information and the start of the sequence on the next line

Answer: b

Explanation: Information about the sequence in the GenBank entry is first included, followed by a line of information about the sequence and a checksum value. This value (not shown) is provided as a check on the accuracy of the sequence by the addition of the ASCII values of the sequence. If the sequence has not been changed, this value should stay the same.
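
The checksum idea can be sketched in Python as a plain sum of ASCII values. This is an illustrative simplification: the function name and the modulus of 10000 are assumptions, and the actual GCG checksum algorithm may differ in its details.

```python
def ascii_sum_checksum(sequence, modulus=10000):
    """Sum the ASCII values of the sequence characters, reduced modulo `modulus`."""
    return sum(ord(ch) for ch in sequence.upper()) % modulus


original = "GATTACA"
stored_checksum = ascii_sum_checksum(original)

# One character changed through error (A -> G at the fifth position).
corrupted = "GATTGCA"
print(stored_checksum, ascii_sum_checksum(corrupted))
print("sequence changed:", ascii_sum_checksum(corrupted) != stored_checksum)
```

A plain sum detects a single substituted character but not two characters that swap places, which is one reason practical checksums are usually more elaborate than this sketch.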

10. Which of the following is wrong about Abstract Syntax Notation Sequence Format?

a) The information is much more difficult to read by eye than a GenBank formatted sequence

b) Abstract Syntax Notation (ASN.1) is a formal data description language that has been developed by the computer industry

c) All the information found in other forms of sequence storage, e.g., the GenBank format, is present. For example, sequences can be retrieved in this format by ENTREZ

d) Taxonomic information and bibliographic information cannot be encoded with this format

Answer: d

Explanation: ASN.1 has been adopted by the National Center for Biotechnology Information (NCBI) to encode data such as sequences, maps, taxonomic information, molecular structures, and bibliographic information. These data sets may then be easily connected and accessed by computers. The ASN.1 sequence format is a highly structured and detailed format especially designed for computer access to the data.

11. Which of the given statements is incorrect?

a) Before using a sequence file in a sequence analysis program, it is important to ensure that computer sequence files contain only sequence characters and not special characters used by text editors

b) Computer sequence files might contain special characters used by text editors

c) Editing a sequence file with a word processor can introduce such changes if one is not careful to work only with text or so-called ASCII files

d) Most text editors normally create text files that include control characters in addition to standard ASCII characters

Answer: b

Explanation: Options a and b contradict each other, and option a is right: one should check for special characters. The control characters will only be recognized correctly by the text editor program. Sequence files that contain such control characters may not be analyzed correctly, depending on whether or not the sequence analysis program filters them out. Editors usually provide a way to save files with only standard ASCII characters, and these files will be suitable for most sequence analysis programs.
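
One way to check a sequence file for such characters before analysis is sketched below in Python. The function name and the choice to allow newline, carriage return, and tab are assumptions made for this example.

```python
# Line endings and tabs are tolerated here; this choice is an assumption.
ALLOWED_CONTROL = {"\n", "\r", "\t"}


def find_suspect_characters(text):
    """Return (position, repr) for characters that are not plain printable ASCII."""
    suspects = []
    for position, ch in enumerate(text):
        code = ord(ch)
        is_printable_ascii = 32 <= code <= 126
        if not is_printable_ascii and ch not in ALLOWED_CONTROL:
            suspects.append((position, repr(ch)))
    return suspects


clean = ">seq1\nGATTACA\n"
# A form feed and a non-breaking space, as a word processor might insert.
dirty = ">seq1\nGAT\x0cTA\u00a0CA\n"

print(find_suspect_characters(clean))
print(find_suspect_characters(dirty))
```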

12. Which of the given statements is incorrect about ASCII and Hexadecimal?

a) Computers store sequence information as simple rows of sequence characters called strings, which are similar to the sequences shown on the computer terminal

b) Each character is stored in binary code in the smallest unit of memory, called a byte

c) Each character is stored in binary code in the smallest unit of memory, called a bit

d) By convention, many of these combinations have a specific definition, called their ASCII equivalent

Answer: c

Explanation: Each byte comprises 8 bits, with each bit having a possible value of 0 or 1, producing 256 possible combinations. Some ASCII values are defined as keyboard characters, others as special control characters, such as signaling the end of a line (a line feed and a carriage return), or the end of a file full of text (end-of-file character). A file with only ASCII characters is called an ASCII file.
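
The arithmetic can be checked with a short Python sketch: an 8-bit byte has 2^8 = 256 distinct bit patterns (the values 0 through 255), and each sequence character maps to one of them through its ASCII code.

```python
bits_per_byte = 8
combinations = 2 ** bits_per_byte  # each bit is 0 or 1
print("bit patterns per byte:", combinations)

# ASCII code of each base symbol in decimal, binary, and hexadecimal.
for ch in "ACGT":
    code = ord(ch)
    print(ch, code, format(code, "08b"), format(code, "02X"))
```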

13. Which of the given statements is untrue?

a) Sequence and other data files that contain non-ASCII characters also may not be transferred correctly from one machine to another and may cause unpredictable behavior of the communications software

b) The ASCII mode is useful for transferring text files, and the binary mode is useful for transferring compressed data files, which also contain non-ASCII characters

c) ASCII and binary modes cannot be set by the user

d) Most sequence analysis programs also require not only that a DNA or protein sequence file be a standard ASCII file, but also that the file be in a particular format such as the FASTA format

Answer: c

Explanation: The file transfer program (FTP) has ASCII and binary modes, which may be set by the user. Some communications software can be set to ignore such control characters. The use of windows on a computer has simplified such problems, since one merely has to copy a sequence from one window, for example, a window that is running a Web browser on the ENTREZ Web site, and paste it into another, for example, that of a translation program.

14. According to standard amino acid code letters which of the given pair is not right?

a) K- lysine

b) Y- tyrosine

c) Q- glutamine

d) R- serine

Answer: d

Explanation: In addition to the standard four base symbols, A, T, G, and C, the Nomenclature Committee of the International Union of Biochemistry has established a standard code to represent bases in a nucleic acid sequence that is uncertain or ambiguous. In the amino acid code, R represents arginine; serine is represented by S.

15. For computer analysis of proteins, it is more convenient to use single-letter than three letter amino acid codes.

a) True

b) False

Answer: a

Explanation: For example, GenBank DNA sequence entries contain a translated sequence in single-letter code. The standard, single-letter amino acid code was established by a joint international committee.
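
The convenience of the single-letter code for computer analysis can be seen in a small Python sketch that converts a hyphen-separated three-letter string into one-letter symbols. The function name and the input style are assumptions for this example; only the 20 standard amino acids are covered.

```python
# Standard three-letter to one-letter amino acid codes (20 standard residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(peptide):
    """Convert a hyphen-separated three-letter peptide to one-letter code."""
    return "".join(THREE_TO_ONE[residue] for residue in peptide.split("-"))


one_letter = three_to_one("Lys-Tyr-Gln-Arg-Ser")
print(one_letter)
```

With one character per residue, sequence positions correspond directly to string positions, which simplifies comparison and pattern searching.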

Multiple Sequence Formats & Storage of Information in a Sequence Database

1. Which of the given statements is incorrect about Block multiple sequence alignment format?

a) The ID line contains a short identifier for the group of sequences from which the block was made, often the original Prosite group ID

b) The identifier is terminated by a comma, and “BLOCK” indicates the entry type

c) AC contains the block number, a seven-character group number for sequences from which the block was made, followed by a letter (A–Z) indicating the order of the block in the sequences

d) The block number is a 5-digit number preceded by BL (BLOCKS database) or PR (PRINTS database)

Answer: b

Explanation: The identifier is terminated by a semicolon, and “BLOCK” indicates the entry type. Min and max are the minimum and maximum number of amino acids from the previous block or from the start of the sequence. DE describes the sequences from which the block was made.

2. BL contains information about the block: xxx is the amino acids in the spaced triplet found by MOTIF upon which the block is based.

a) True

b) False

Answer: a

Explanation: In addition to this, w is the width of the sequence segments (columns) in the block. s is the number of sequence segments (rows) in the block. Other values (n1, n2) describe statistical features of the block. Sequence id is a list of sequences. Each sequence line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment.
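
The layout of a sequence line described above (identifier, offset in parentheses, segment, weight) can be sketched as a small Python parser. The sample line is invented, and the exact spacing of real BLOCKS entries is an assumption here, so the regular expression may need adjusting for actual files.

```python
import re

# identifier, "(" offset ")", sequence segment, weight
SEQUENCE_LINE = re.compile(r"^(\S+)\s*\(\s*(\d+)\)\s+(\S+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_block_sequence_line(line):
    """Split one block sequence line into (identifier, offset, segment, weight)."""
    match = SEQUENCE_LINE.match(line)
    if match is None:
        raise ValueError("not a block sequence line: %r" % line)
    identifier, offset, segment, weight = match.groups()
    return identifier, int(offset), segment, float(weight)


# Made-up sample line, for illustration only.
sample = "EXAMPLE_SEQ1 (  42) GDSGGPLVC  87"
parsed = parse_block_sequence_line(sample)
print(parsed)
```

The length of the parsed segment corresponds to w, the width of the block, and the number of such lines corresponds to s.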

3. Which of the given statements is incorrect about READSEQ?

a) It is an extremely useful sequence formatting program developed by D. G. Gilbert at Indiana University, Bloomington

b) It was developed at Indiana University, Bloomington

c) It can recognize a DNA or protein sequence file in any of the formats

d) It can recognize a DNA or protein sequence file in some particular formats

Answer: d

Explanation: It can identify the format, and write a new file with an alternative format. Some of these formats are used for special types of analyses such as multiple sequence alignment and phylogenetic analysis.
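
Format recognition of this kind can be illustrated with a rough Python heuristic that looks only at the first non-blank line. This is a hypothetical sketch and not how READSEQ itself works; it also cannot tell an EMBL entry from a SwissProt entry, since both begin with an ID line.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line (rough heuristic)."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("LOCUS"):
            return "GenBank"
        if line.startswith("ID "):
            return "EMBL"
        if line.startswith(">"):
            # NBRF/PIR headers look like ">P1;NAME"; plain FASTA has no such code.
            if len(line) > 3 and line[3] == ";":
                return "NBRF/PIR"
            return "FASTA"
        return "unknown"
    return "unknown"


print(guess_format(">seq1 made-up example\nGATTACA\n"))
print(guess_format(">P1;EXMPL\nmade-up example\nMKTAYIAK*\n"))
print(guess_format("LOCUS       EXAMPLE\n"))
```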

4. Data files that have multiple sequences, such as those required for multiple sequence alignment and phylogenetic analysis using parsimony (PAUP), are not converted in READSEQ.

a) True

b) False

Answer: b

Explanation: Data files with multiple sequences, such as those mentioned, are converted by READSEQ. Options to reverse-complement and to remove gaps from sequences are included. SEQIO is another sequence conversion program for UNIX machines.

5. The “from” programs convert sequence files from GCG format into the named format, and the “to” programs convert the alternative format into GCG format.

a) True

b) False

Answer: a

Explanation: In addition, the GCG programs include the following sequence formatting programs:

(1) GETSEQ, which converts a simple ASCII file being received from a remote PC to GCG format; (2) REFORMAT, which will format a GCG file that has been edited, and will also perform other functions; and (3) SPEW, which sends a GCG sequence file as an ASCII file to a remote PC.

6. The Common Object Request Broker Architecture (CORBA) is the Object Management Group’s interface for objects.

a) True

b) False

Answer: a

Explanation: It allows different computer applications to communicate with each other through a common language, Interface Definition Language (IDL). To plan an object-oriented database by defining the classes of objects and the relationships among these objects, a specific set of procedures called the Unified Modeling Language (UML) has been devised by the OMG group.

7. The FASTA format is readily converted into other formats and also is smaller and simpler

a) True

b) False

Answer: a

Explanation: It contains just a line of sequence identifiers followed by the sequence without numbers, and it is very useful for browsing and analysis. One browser window may retrieve sequences from a database and a second may analyze these sequences.

8. Each DNA or protein sequence database entry has much information. Which of the following is not included?

a) an assigned accession number(s)

b) source organism

c) name of locus

d) reference number type(s)

Answer: d

Explanation: In addition to these, the sequence database entry gives keywords that apply to the sequence; features in the sequence such as coding regions, intron splice sites, and mutations; and finally the sequence itself. The above information is organized into a tabular form very much like that found in a relational database.

9. Which of the following is an incorrect statement?

a) The last column contains the sequences themselves

b) It is quite tough making an index of the information in each of these fields so that a search query can locate all the occurrences through the index

c) If one imagines a large table with each sequence entry occupying one row, then each column will include one of the above types of information for each sequence, and each column is called a FIELD

d) The DNA, protein, and reference databases have all been cross-referenced so that moving between them is readily accomplished

Answer: b

Explanation: It is very easy to make an index of the information in each of these fields so that a search query can locate all the occurrences through the index. Even related sequences are cross-referenced. In addition, the information in one database can be cross-referenced to that in another database.

10. Which of the given statements is incorrect about Database Types?

a) Relational databases are more useful in the development of biological databases

b) The tables in relational database are carefully indexed and cross-referenced with each other, sometimes using additional tables, so that each item in the database has a unique set of identifying features

c) The relational database orders data in tables made up of rows giving specific items in the database, and columns giving the features as attributes of those items

d) The two principal types of DBs are the relational and object-oriented databases

Answer: a

Explanation: The object-oriented database structure has been useful in the development of biological databases. The objects, such as genetic maps, genes, or proteins, each have an associated set of utilities for analysis and display of the object and a set of attributes such as identifying name or references.

Using the Database Access Program ENTREZ

1. Which of the following is incorrect about ENTREZ?

a) It is a resource prepared only by the staff of the National Center for Biotechnology Information

b) It provides a series of forms that can be filled out to retrieve a Medline reference related to the molecular biology sequence databases

c) One straightforward way to access the sequence databases is through ENTREZ

d) It provides a series of forms that can be filled out to retrieve a DNA or protein sequence

Answer: a

Explanation: It is a resource prepared by the staff of the National Center for Biotechnology Information and the National Library of Medicine, Bethesda, Maryland. After a search for either a protein or a DNA sequence is chosen at the above address, another Web page is provided with a form to fill out for the search.

2. The databases GenBank, EMBL and DDBJ are updated daily.

a) True

b) False

Answer: a

Explanation: The mentioned database centers are updated daily and exchange new sequences daily, so that it is only necessary to access one of them. The EMBL stands for European Molecular Biology Laboratory and DDBJ for DNA DataBank of Japan.

3. Using Boolean logic, the search looks for database entries that include the first term ________ the second, and subsequent terms repeated until the last term.

a) AND

b) OR

c) ExOR

d) NAND

Answer: a

Explanation: On the ENTREZ form, make a selection in the data entry window after the term “Search,” then enter search terms in the longer data entry window after “for.” The database will be searched for sequence database entries that contain all of these terms or related ones.

4. To assist in finding suitable terms, for each field, ENTREZ provides a list of index entries.

a) True

b) False

Answer: a

Explanation: When searching for terms in a particular field, some knowledge of the terms that are in the database can be helpful. The “Limits” link on the ENTREZ form page is used to limit the GenBank field to be searched, and various logical combinations of search terms may be designed by this method. These fields refer to the GenBank fields.

5. For a protein search, for example, current choices for fields include ________. Which of the following is a wrong choice for the blank?

a) Accession (number)

b) E. C. number

c) Issue

d) Journal number

Answer: d

Explanation: Other fields are author name, journal name, keyword, and modification date. Also included are organism, page number, primary accession (number), properties, protein name, publication date (of reference), seqID string, sequence length, substance name, text word, title word, volume, and sequence ID. Similar fields are shown for the DNA database search.

6. The results of searches in separate fields may be combined to narrow down the choices.

a) True

b) False

Answer: a

Explanation: The number of terms to be searched for and the field to be searched are the main decisions to be made. In doing so, it is important to be as specific as possible, or else there may be a great many possibilities.

7. Knowing ________ should be enough to find the required entry quickly.

a) publication date, protein name, journal name

b) accession number, protein name, or name of gene

c) publication date, protein name, or volume

d) properties, protein name, or title word

Answer: b

Explanation: If the same protein has been sequenced in several organisms, providing an organism name is also helpful. When the chosen search terms and fields have been decided and submitted, a database comprising all of the currently available sequences (called the non redundant or NR database) will be searched. Other database selections can also be made.

8. The program returns the number of matches found and provides an opportunity to narrow this list by including more terms.

a) True

b) False

Answer: a

Explanation: When the number of matching sequences has been narrowed to a reasonable number, the sequence may be retrieved in a chosen format in several straightforward steps. This helps in getting to the required data in fewer steps.

9. Which of the following is incorrect about ENTREZ?

a) There is no simple way to find the correct sequence without manually checking the information provided in each sequence, but this usually takes longer time

b) Before leaving ENTREZ, it is often useful to check for sequence database entries that are similar to the one of interest, called “neighbors” by ENTREZ

c) The expanded query searches other database entries of interest, such as the same protein in another organism, a large chromosomal sequence that includes the gene, or members of the same gene family

d) While visiting the site, note that ENTREZ has been adapted to search through a number of other biological databases, and also through Medline, and these searches are available from the initial ENTREZ Web page

Answer: a

Explanation: Contrary to what is stated in option a, checking each sequence manually usually takes only a short time. It is important to look through the sequences to locate the one intended. There may be several different copies of the sequence because it may have been sequenced from more than one organism, or the sequence may be a mutant sequence, a particular clone, or a fragment.

10. Which of the following is incorrect about Retrieving a Specific Sequence?

a) It can be difficult to retrieve the sequence of a specific gene or protein simply because of the sheer number of sequences in the GenBank database and the complex problem of indexing them

b) Other projects may benefit from the availability of better curated and annotated protein sequence databases, but not PIR and SwissProt

c) For projects that require the most currently available sequences, the NR databases should be searched

d) The genomic databases can also provide the sequence of a particular gene or protein. Protein sequences in the Genpro database are generated by automatic translation of DNA sequences

Answer: b

Explanation: Curated and annotated protein sequence databases include PIR and SwissProt. When read from cDNA copies of mRNA sequences, they provide a reliable sequence, given a certain amount of uncertainty as to the translational start site. Many protein sequences are now predicted by translation of genomic sequences, requiring a prediction of exons, a somewhat error-prone step.

Genome Anatomy – 1

1. The chromosomes comprised linear DNA molecules in a tightly compact form that was wrapped around protein complexes, called the nucleosome.

a) True

b) False

Answer: a

Explanation: Nuclei and chromosomes were not observed in bacteria (a prokaryotic cell), but when bacterial DNA was eventually detected, the molecule was usually circular and was also in a compacted form. The following sections outline the structure and composition of prokaryotic and eukaryotic genomes.

2. The first bacterial genome to be sequenced was that of a mild human pathogen

a) Hemophilus influenzae

b) Lactobacillus

c) Vibrio cholerae

d) Clostridium botulinum

Answer: a

Explanation: This project was carried out at The Institute for Genomic Research (TIGR). It was carried out in part to prove a new genome sequencing method, the shotgun method.

3. In sequencing the first bacterial genome, a large number of random overlapping fragments were sequenced, and then a consensus sequence of the entire ________ chromosome of Hemophilus was assembled by computer.

a) 8.6 × 10^9 bp

b) 1.8 × 10^6 bp

c) 6.9 × 10^5 bp

d) 1.8 × 10^4 bp

Answer: b

Explanation: It was done excepting several regions that had to be assembled manually. Once available, open reading frames were identified, and these were compared to the existing proteins by a database similarity search.

4. In sequencing the first bacterial genome, approximately ________ of the ________ predicted genes matched genes of another species, the bacterial species E. coli K-12, which had been the subject of many years of genetic and biochemical research.

a) 46%, 1500

b) 58%, 1496

c) 72%, 1743

d) 58%, 1743

Answer: d

Explanation: The function of the other 42% of the Hemophilus genes could not be identified, although some of them were similar to the 38% of E. coli genes that were also of unknown function. Other unique sequences that appeared to be associated with the ability of the organism to behave as a human pathogen were also found.

5. After the Hemophilus genome was sequenced, other organisms were selected for sequencing based on a set of criteria. Which of the following is not one of them?

a) They had been subjected to a good deal of biological analysis

b) They were model eukaryotic organisms

c) They were an important human pathogen, e.g., Mycobacterium tuberculosis (tuberculosis)

d) They were of phylogenetic interest

Answer: b

Explanation: They were model prokaryotic organisms, e.g., E. coli and Bacillus subtilis. The success of sequencing the Hemophilus genome in a relatively short time and with a modest budget heralded the sequencing of a large number of additional prokaryotic organisms.

6. Analysis of the ribosomal RNA molecules of prokaryotes and eukaryotes had led to the prediction of three main branches in the tree of life.

a) True

b) False

Answer: a

Explanation: The three branches are represented by the Archaea, the Bacteria, and the Eukarya. For genome sequencing projects, organisms have been sampled from throughout the tree, including some that are in deeper branches of the tree and that have growth properties reminiscent of an ancient environment.

7. Annotation involves identifying open reading frames in the genome sequence.

a) True

b) False

Answer: a

Explanation: It is done by using the predicted proteins as query sequences in a database similarity search. It further adds significant matches to the genome sequence entry in the sequence database.
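
The first step of annotation, identifying open reading frames, can be sketched in Python as a scan for ATG-to-stop stretches in each of the three forward reading frames. This is a deliberately simplified illustration: the function name, the `min_codons` threshold, and the example sequence are assumptions, and a real gene finder would also examine the reverse strand and use more refined criteria.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up example sequence containing one short ORF.
example = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(example)
for start, end, frame in orfs:
    print(frame, example[start:end])
```

Each ORF found this way would then be translated and used as a query in the database similarity search mentioned above.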

8. A simple way to retrieve sequences of viral and other extra-chromosomal genetic elements such as organelles is through the National Center for Biotechnology Information (NCBI)

a) True

b) False

Answer: a

Explanation: Prior to the sequencing of H. influenzae, the first free-living organism to be sequenced, a large number of viruses had been sequenced. Many of these organisms also serve as model systems for studying replication and gene expression. As an example, the nucleotide sequence of bacteriophage lambda was completed by Sanger.

9. Which of the given is wrongly matched?

a) Escherichia coli – Bacteria

b) Methanococcus jannaschii – Archaea

c) Synechocystis sp. – Archaea

d) Aquifex aeolicus – Bacteria

Answer: c

Explanation: Synechocystis sp. – Bacteria is the correct pair. It is an ancient organism that produces oxygen by light-harvesting.

10. In examining the results of analysis, it is important to look for the method used, the statistical significance of the result, and the overall degree of confidence in the alignments.

a) True

b) False

Answer: a

Explanation: The analysis should be repeated if necessary. Annotation errors occur when the above criteria are not followed.

Genome Anatomy – 2

1. Computational resources can facilitate the analysis of bacterial genomes.

a) True

b) False

Answer: a

Explanation: GeneQuiz is an example of such a resource. There are Web sites that provide a complete annotation of the prokaryotic genomes that have been sequenced.

2. Telomeres hold newly replicated daughter chromosomes together.

a) True

b) False

Answer: b

Explanation: Centromeres hold newly replicated daughter chromosomes together. They serve as a point of attachment for pulling the chromosomes apart during cell division.

3. Prokaryotic genomes commonly have tandem repeats of sequences and include introns in protein-coding genes.

a) True

b) False

Answer: b

Explanation: It is eukaryotic genomes that commonly have tandem repeats of sequences and introns in protein-coding genes. In addition, they have linear chromosomes within a nucleus, differing from prokaryotic genomes in this respect.

4. The sequences of satellite DNA fall into different types, each with a different repeat unit of length ________.

a) 5–400 Mbp

b) 3–300 kbp

c) 6–900 Mbp

d) 5–200 bp

Answer: d

Explanation: Because of the skewed base composition of regions that have repeats, they may be purified by virtue of having different buoyant densities. The repeat unit length is 5–200 bp, not on the megabase scale.

5. Most of the repetitive DNA is found near the open ends of chromosomes.

a) True

b) False

Answer: b

Explanation: Some of it is found near the open ends of chromosomes (the telomeres), but most of this repetitive DNA is found near the centromere, which serves as a point of attachment for pulling the chromosomes apart during cell division.

6. Minisatellites are made up of repeat units of up to ________ and microsatellites are composed of repeat units of ________ or less.

a) 25 bp, 10 bp

b) 70 bp, 6 bp

c) 80 bp, 9 bp

d) 25 bp, 4 bp

Answer: d

Explanation: They are also found in eukaryotic Genomes. Microsatellite repeats are found at the ends of eukaryotic chromosomes at the telomeres, which in humans comprise hundreds of copies of a 6-bp repeat TTAGGG.
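
Tandem copies of a repeat unit such as TTAGGG can be located with a short Python sketch based on a regular expression. The function name and the example sequence are made up for illustration; dedicated repeat-finding programs also handle imperfect copies, which this sketch does not.

```python
import re


def longest_tandem_run(sequence, unit):
    """Return (start, copies) of the longest uninterrupted run of `unit`."""
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    best_start, best_copies = -1, 0
    for match in pattern.finditer(sequence):
        copies = (match.end() - match.start()) // len(unit)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Made-up sequence: five tandem copies, an interruption, then two more copies.
example = "ACGT" + "TTAGGG" * 5 + "CCA" + "TTAGGG" * 2
run = longest_tandem_run(example, "TTAGGG")
print(run)
```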

7. In nondividing cells, a mixture of lightly and darkly stained chromosomal regions called euchromatin and heterochromatin respectively, are observed.

a) True

b) False

Answer: a

Explanation: In nondividing cells, the lightly stained chromosomal regions are euchromatin and the darkly stained regions are heterochromatin. The centromeric and telomeric regions are located in the heterochromatin, which is in a compact configuration and is thought not to be transcribed.

8. Genes that are transcribed are located in the

a) euchromatin

b) heterochromatin

c) heterochromatin and euchromatin

d) tightly bound DNA

Answer: a

Explanation: They are located in the less compact euchromatin. This gives the regulatory proteins the access to the genetic material.

9. ______ can comprise a large proportion of the eukaryotic genome as ______.

a) transposable elements, single copy sequences

b) transposable elements, repetitive sequences

c) macrosatellite DNA elements, single copy sequences

d) satellite DNA elements, single copy sequences

Answer: b

Explanation: Transposable elements (TEs) are thought to play an important role in the evolution of these genomes. TEs are DNA sequences that can move from one chromosomal location to another faster than the chromosome can replicate.

10. Transposable elements (TEs) have the potential to ______ in number until they comprise a ______ proportion of the genome sequence.

a) decrease, large

b) decrease, micro

c) increase, micro

d) increase, large

Answer: d

Explanation: It is a feature already observed in many plants and animals. They remain detectable in the genome until they blend into the background sequence by mutation.

Genome Anatomy – 3

1. More than ______ of the human genome consists of interspersed repetitive sequences derived from TEs (transposable elements).

a) one-third

b) one-eighth

c) one-fifth

d) half


Answer: a

Explanation: The presence of these elements may be demonstrated using programs that detect low-complexity regions in sequences. By comparison, about 15% of the genome of the fruit fly Drosophila is made up of transposable elements.

2. The retroposons include short interspersed nuclear elements (SINEs), which are ______.

a) 90–4000 bp long

b) 80–500 Mbp long

c) 80–300 bp long

d) 100–3000 bp long

Answer: c

Explanation: The retroposons also include long interspersed nuclear elements (LINEs), which are 6–8 kbp long. Different types of transposable elements are present in high copy numbers in mammalian genomes, to varying degrees.

3. ______ of the human genome comprises one particular family of the SINE element, designated Alu (1.2 million copies).

a) 10%

b) 20%

c) 60%

d) 40%


Answer: a

Explanation: Ten percent of the human genome comprises one particular family of SINE element, Alu. A further 14.6% comprises one particular LINE family, designated LINE1 (593,000 copies).

4. Vertebrate chromosomes have long (>300 kb) regions of distinct GC richness, repeat content, and gene density, designated isochores in a model of genome organization proposing that genomes are made up of distinct segments of unique composition.

a) True

b) False

Answer: a

Explanation: Human and mouse chromosomal regions that have a low density of genes are AT-rich and have more LINE1 elements than Alu or B1/B2 elements (SINEs). The reverse is true for regions that have a high gene density, which are more GC-rich.

5. The human genome contains about ______ of class II elements that probably predate human evolution (Smit 1996).

a) 2,000 copies

b) 200,000 copies

c) 20,000,000 copies

d) 2,000,000 copies

Answer: b

Explanation: The second class of TEs, class II, is made up of elements that employ a DNA-based mechanism of transposition. Class II elements also include the Activator-Dissociation (Ac-Ds) family in maize and the P element in Drosophila.

6. A third category of TEs has features of both class I and class II TEs. These miniature, inverted repeat TEs (MITEs) are ______ in length.

a) 400 bp

b) 500 Mbp

c) 300 kbp

d) 600 kbp

Answer: a

Explanation: They were discovered in diverse flowering plants where they are frequently associated with regulatory regions of genes. Hence, they could be exerting an influence on regulation of gene expression.

7. Which of the given features is incorrect?

a) TEs are present in few particular chromosomes

b) TEs are present in all of the chromosomes

c) Abundance of TEs varies

d) TEs can comprise a large portion of the genomes of higher eukaryotes, both plants and animals


Answer: a

Explanation: TEs are present in all of the chromosomes of organisms ranging from bacteria to humans, but their abundance varies. They can comprise a large portion of the genomes of higher eukaryotes, so only a small fraction of the genome of these organisms carries gene sequences.

8. Eukaryotic genes that encode proteins are interrupted by

a) exons of varying length and number

b) introns of varying length and number

c) exons of varying length but the same number

d) introns of varying number but the same length

Answer: b

Explanation: In S. cerevisiae (budding yeast), only a small fraction of the genes contain introns, and there are a total of 239 introns in the entire genome. In contrast, in individual human genes, introns may be present in numbers exceeding 100 and comprise more than 95% of the gene.

9. Introns can remain at a corresponding position in a eukaryotic gene for long periods of evolutionary time.

a) True

b) False

Answer: a

Explanation: The origin of introns in eukaryotic genes is not understood but has been accounted for by two models. The “introns-early” view proposes that introns were used to assemble the first genes from sets of ancient conserved exons, whereas the “introns-late” view proposes that introns broke up previously continuous genes by inserting into them.

10. The intron structure of genes in a particular eukaryote is used for predicting the location of genes in genome sequences.

a) True

b) False

Answer: a

Explanation: Other features of eukaryotic genes in a particular organism that are useful for gene prediction include the consensus sequences at exon–intron and intron–exon splice junctions, base composition, codon usage, and preference for neighboring codons. Computational methods incorporate this information into a gene model that may be used to predict the presence of genes in a genome sequence.
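
The use of splice-junction consensus sequences can be illustrated with a small sketch. It assumes the common GT...AG rule (introns usually begin with GT and end with AG) and simply lists every span that satisfies it; the function name `candidate_introns` and the `min_len` threshold are hypothetical, and real gene finders combine such signals with base composition and codon usage in a trained gene model.

```python
import re

def candidate_introns(seq, min_len=20):
    """Return (start, end) spans that begin with GT and end with AG.

    This is only the consensus-dinucleotide test; it does not score
    the spans or decide which ones are real introns.
    """
    seq = seq.upper()
    donors = [m.start() for m in re.finditer("GT", seq)]         # 5' splice sites
    acceptors = [m.start() + 2 for m in re.finditer("AG", seq)]  # end just after AG
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

if __name__ == "__main__":
    dna = "ATGGCCAAGGTAAGTCTTTTCCCTTTTCTCTTTCAGGATCCTGA"  # made-up sequence
    for start, end in candidate_introns(dna):
        print(start, end, dna[start:end])
```

Because the dinucleotides GT and AG are very common, this test alone produces many false candidates, which is why the additional features listed above are needed.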

Sequence Assembly and Gene Identification – 1

1. Sequencing of genomes depends on the assembly of a large number of DNA reads into a linear, contiguous DNA sequence.

a) True

b) False

Answer: a

Explanation: The cost and efficiency of this process have been greatly improved by automatic methods of sequence assembly, first used for the sequencing of the bacterium Haemophilus influenzae. The same method of assembly was also used, in part, to complete the sequencing of the Drosophila and human genomes in a timely manner.
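
The idea of assembling reads into a contiguous sequence from their overlaps can be sketched as follows. This is a toy greedy merger, not the algorithm used by any production assembler; the names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and sequencing errors, repeats, and reverse-complement reads are ignored.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; the remaining reads stay separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

A greedy merge can be misled by repeated sequences, which is one reason a highly redundant set of fragments or a physical map is used in practice.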

2. Each genome sequence is scanned for protein-encoding genes using gene models trained on known gene sequences from the same organism.

a) True

b) False

Answer: a

Explanation: For a new genome, each predicted gene is translated into a protein sequence; the collection of protein sequences encoded by the genome is the proteome of the organism. Every protein in the proteome is then used as a query sequence in a database similarity search. Matching database sequences are realigned with the query sequence to evaluate the extent and significance of the alignment.
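
The translation step can be sketched in a few lines. The example below assumes the standard genetic code and an intron-free coding sequence that is already in frame; the function name `translate` is illustrative, and unknown codons are reported as X.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG ordering of the three bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCCAAGTGA"))
```

Applying such a function to every predicted gene gives the predicted proteome, whose members can then be used one by one as database queries.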

3. Screening the predicted protein sequences against an ______ library confirms the prediction and expression of the gene.

a) expressed sequence tag (EST)

b) tags

c) palindromes

d) proteomes

Answer: a

Explanation: The collective information on proteome function can then be further analyzed by self-comparison to find duplicated genes (paralogs) and by a proteome-by-proteome comparison to identify orthologs, genes that have maintained the same function through speciation, and other sequence and evolutionary relationships that are important for metabolic, regulatory, and cellular functions.

4. Which of the given statements about genome sequence assembly is incorrect?

a) Full chromosomal sequences are assembled from the overlaps in a highly redundant set of fragments by an automatic computational method or from the fragment order on a physical map

b) Chromosome cloning is carried out in bacterial artificial chromosomes (BACs)

c) Chromosomes of a target organism are purified, fragmented, and subcloned in fragments of size hundreds of bp

d) Genome sequences are assembled from DNA sequence fragments of approximate length 500 bp obtained using DNA sequencing machines


Answer: c

Explanation: Chromosomes of a target organism are purified, fragmented, and subcloned in fragments of size hundreds of kbp and not bp. The BAC fragments are then further subcloned as smaller fragments into plasmid vectors for DNA sequencing.

5. TEs (transposable elements) can at most comprise one-fourth of the genome sequence.

a) True

b) False

Answer: b

Explanation: TEs (transposable elements) can comprise one-half or more of the genome sequence. Eukaryotic genomes comprise classes of repeated elements, including tandem repeats present in centromeres and telomeres, dispersed tandem repeats (minisatellites and microsatellites), and interspersed TEs.

6. Gene identification in prokaryotic organisms is simplified by their lacking

a) exons

b) introns

c) coding segments

d) useful nucleotide sequences

Answer: b

Explanation: Once the sequence patterns that are characteristic of the genes in a particular prokaryotic organism (e.g., codon usage, codon neighbor preference) have been found, gene locations in the genome sequence can be predicted quite accurately. The presence of introns in eukaryotic genomes makes gene prediction more involved because, in addition to the above features, locations of intron–exon and exon–intron splice junctions must also be predicted.
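
Codon usage, one of the sequence patterns mentioned above, is simple to tabulate. The sketch below counts codons in a single in-frame coding sequence and returns relative frequencies; the name `codon_usage` is hypothetical, and a real gene model would pool counts over many known genes of the organism.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

if __name__ == "__main__":
    usage = codon_usage("ATGGCCGCCAAGGCTTAA")  # made-up sequence
    for codon, freq in sorted(usage.items()):
        print(codon, round(freq, 3))
```

A candidate open reading frame whose codon frequencies resemble those of known genes from the same organism is more likely to be a real gene than one whose frequencies look like random sequence.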

7. Which of the given statements is incorrect?

a) The predicted set of proteins for the genome is referred to as the proteome

b) The amino acid sequence of proteins encoded by the predicted genes is used as a query of the protein sequence databases in a database similarity search

c) A match of a predicted protein sequence to one or more database sequences serves only to identify the gene function but it doesn’t validate the gene prediction

d) The genome sequence is annotated with the information on gene content and predicted structure, gene location, and functional predictions


Answer: c

Explanation: A match of a predicted protein sequence to one or more database sequences not only serves to identify the gene function, but also validates the gene prediction. Pseudogenes, gene copies that have lost function, may also be found in this analysis.

8. Which of the following information is not directly obtained by microarray analysis?

a) Which genes are expressed at a particular stage of the cell cycle

b) Which genes are expressed at a particular stage of developmental cycle of an organism

c) Which genes are depleted at what time

d) Genes that respond to a given environmental signal to the same extent

Answer: c

Explanation: Information on when genes are depleted is obtained by other techniques. The expression information that microarrays do provide indicates which genes share a related biological function or may act in the same biochemical pathway, and may thereby give clues that assist in gene identification.

9. Which of the given statements about functional genomics is incorrect?

a) Functional genomics involves the preparation of mutant or transgenic organisms with a mutant form of a particular gene usually designed to prevent expression of the gene

b) Abnormal properties of the mutant organism do not reveal the gene function

c) When two or more members of a gene family are found, rather than a single match to a known gene, the biological activity of these members may be analyzed by functional genomics to look for diversification of function in the family

d) A more detailed analysis of the relative amount of sequence variability in a chromosomal region within populations of closely related species can reveal the presence of genes that are under selection


Answer: b

Explanation: The gene function is revealed by any abnormal properties of the mutant organism. This methodology provides a way to test a gene function that is predicted by sequence similarity to be the same as that of a gene of known function in another organism. If the other organism is very different biologically (comparing a predicted plant or animal gene to a known yeast gene), then functional genomics can also shed light on any newly acquired biological role.

10. Which of the given statements about gene maps is incorrect?

a) Gene order in two related organisms reflects the order that was present in a common ancestor genome. Chromosomal breaks followed by a reassembly of fragments in a different order can produce new gene maps

b) Gene order is only revealed by the physical order of genes on the chromosome

c) Sequence variations (polymorphisms) that are close to (tightly linked) a trait may be used to trace the trait by virtue of the fact that the polymorphism and the trait are seldom separated from one generation to the next

d) These types of evolutionary changes in genomes have been modeled by computational methods


Answer: b

Explanation: Gene order is revealed not only by the physical order of genes on the chromosome, but also by genetic analysis. Populations of an organism show sequence variations that are readily detected by DNA sequencing and other analysis methods. The inheritance of genetic diseases in humans and animals (e.g., cancer and heart disease), and of desirable traits in plants, can be traced genetically by pedigree analysis or genetic crosses.

Sequence Assembly and Gene Identification – 2

1. In the program COGNITOR each protein in the proteome is used as a query of a database of protein clusters.

a) True

b) False

Answer: a

Explanation: The database was made by performing an all-by-all genome comparison across a spectrum of prokaryotic organisms and a portion of the yeast proteome. Orthologous pairs of sequences were then merged into clusters of orthologous groups (COGs) for multiple proteomes.

2. WU-BLAST produces P scores and BLAST (NCBI) produces E scores where

a) $E = \ln(1 + P^2)$

b) $E = \ln(1 - P^2)$

c) $E = \ln(1 + P)$

d) $E = -\ln(1 - P)$

Answer: d

Explanation: The two scores are related by $P = 1 - e^{-E}$, so $E = -\ln(1 - P)$. For values less than 0.05, $E \approx P$. The choice of a cutoff score of $E < 10^{-20}$ is a conservative one for identification of orthologs that should have a similar domain structure.
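
The relationship between the two scores can be checked numerically. The sketch below assumes the usual Poisson relationship P = 1 - exp(-E), which gives E = -ln(1 - P); the function names `e_from_p` and `p_from_e` are illustrative and are not part of either BLAST implementation.

```python
import math

def e_from_p(p):
    """Convert a P value to an E value: E = -ln(1 - P)."""
    return -math.log1p(-p)   # log1p keeps precision when p is tiny

def p_from_e(e):
    """Convert an E value to a P value: P = 1 - exp(-E)."""
    return -math.expm1(-e)   # expm1 keeps precision when e is tiny

if __name__ == "__main__":
    for p in (1e-20, 0.01, 0.05, 0.5):
        print(p, e_from_p(p))
```

Running the loop shows that the two values are nearly identical below about 0.05 and diverge as P approaches 1.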

3. In proteolysis and fragment sequencing, protein spots may be excised from a two-dimensional protein gel and subjected to a combination of amino acid sequencing and cleavage analyses using the techniques of mass spectrometry and high-pressure liquid chromatography.

a) True

b) False

Answer: a

Explanation: Genome regions that encode these sequences can then be identified and the corresponding gene located. A similar method may be used to identify the gene that encodes a particular protein that has been purified and characterized in the laboratory.

4. In protein 2D gel electrophoresis, individual proteins produced by the genome can be separated to ______ by this method and specific ones identified by various ______.

a) smaller extent, biochemical and immunological tests

b) a large extent, biochemical and immunological tests

c) a large extent, biochemical tests only

d) smaller extent, purely mechanical tests

Answer: b

Explanation: Moreover, changes in levels of proteins in response to an environmental signal can be monitored in much the same way as a microarray analysis is performed. Microarrays only detect untranslated mRNAs, whereas a two-dimensional gel protein analysis detects translation products, thus revealing an additional level of regulation.

5. In Metabolic pathways and regulation, as genes are identified in a new genome sequence, some will be found that are known to act sequentially in a metabolic pathway or to have a known role in gene regulation in other organisms.

a) True

b) False

Answer: a

Explanation: From this information, the metabolic pathways and metabolic activities of the organism will become apparent. In some cases, the apparent absence of a gene in a well-represented pathway may lead to a more detailed search for the gene. Clustering of genes in the pathway on the genome of a related organism can provide a further hint as to where the gene may be located.

6. ______ of the Drosophila sequence is composed of TEs and ______ is heterochromatic regions that do not include genes.

a) one-fourth, one-third

b) one-fifth, one-fourth

c) one-sixth, one-eighth

d) one-sixth, one-third

Answer: d

Explanation: Hence, in the euchromatic regions, the gene density in the Drosophila genome is one gene per 9 kb. Although the number of predicted genes in Drosophila is smaller than that of the other genomes, the amount of functional diversity, as evidenced by protein family representation, is similar.

7. Yeast is about ______ ______ compact than E. coli.

a) fivefold, less

b) threefold, more

c) twofold, less

d) twofold, more

Answer: c

Explanation: Of the remaining genomes, C. elegans and A. thaliana have approximately the same density of genes (one gene per 6 kb). Drosophila is the least dense in this comparison (one gene per 14 kb).

8. The ______ are ______ by genetic structure to retroviruses.

a) STR retrotransposons, related

b) LTR retrotransposons, related

c) STR transposons, related

d) LTR retrotransposons, not related

Answer: b

Explanation: There are three main subclasses of class I TEs: the long terminal repeat (LTR) retrotransposons, retroposons, and retrovirus-like elements with LTRs. Class I elements encode a reverse transcriptase and use RNA-mediated mechanisms of transposition.

9. Which of the given statements is incorrect?

a) As in an all-by-all protein comparison within a proteome, a matrix of alignment scores with E values is made, and the most closely related sequences in the two organisms are identified

b) To perform a between-proteome analysis, proteome databases are made for the known and predicted genes of two or more genomes

c) Each protein of one proteome is selected in turn as a query of the proteome of another organism or the combined proteome of a group of organisms

d) Each protein of one proteome is selected in turn as a query of the proteome of another single organism only


Answer: d

Explanation: This analysis can predict orthologs, that is, proteins that have an identical function attributable to descent of the respective genes from a common ancestor.

10. The higher the E value, the more significant the alignment between a pair of matching sequences.

a) True

b) False

Answer: b

Explanation: The lower the E value, the more significant the alignment between a pair of matching sequences. The E value of an alignment score is the expected number of alignments with a score as good as the one found that would be observed between random or unrelated sequences in a search of a database of the same size.

Sequence Assembly and Gene Identification – 3

1. Which of the given statements about proteome analyses is incorrect?

a) For BLAST, setting an effective database size appropriate for each search and program is important

b) Due to the large number of comparisons that must be made in these types of analyses and due to the volume of program output, the procedure must be automated on a local machine using Perl scripts or a similar method and a database system

c) BLAST is used for obtaining a correct statistical evaluation of alignment scores

d) BLAST does not give statistical evaluation of alignment scores

Answer: d

Explanation: Each protein encoded by the genome is used as a query in database similarity searches to identify similar database proteins, some having a known structure or function. Additional searches of EST databases can be used to identify additional relatives of the query sequence.

2. An all-against-all analysis requires first making a database of the proteome. This database is then sequentially searched by each individual protein sequence of the proteome using a rapid database similarity search tool such as ______.

a) XPBLAST

b) WU-BLAST

c) BLAST

d) FASTA

Answer: a

Explanation: P values of WU-BLAST are similar to E values of NCBI BLAST (Rubin et al. 2000) for values of P and E < 0.05. This analysis generates a matrix of alignment scores, each with an E value and corresponding alignment for each pair of proteins. The E value of an alignment score is the expected number of alignments with a score as good as the one found that would be observed between random or unrelated sequences in a search of a database of the same size.

3. Evolutionary modeling can include various types of analyses. Which of the following is generally not one of them?

a) The prediction of chromosomal rearrangements

b) Eu/Hetero-chromatin structures

c) Duplications at gene, chromosomal and full genome level

d) Duplications at the protein domain level

Answer: b

Explanation: Option b describes structural studies rather than evolutionary modeling. Evolutionary modeling does include predicting the chromosomal rearrangements that preceded the present arrangement (e.g., a comparison of mouse and human chromosomes).

4. Which of the given statements about clusters of functionally related genes is incorrect?

a) In microbial genomes, genes specifying a metabolic pathway may be contiguous on the genome where they are coregulated transcriptionally in an operon by a common promoter

b) In related organisms, gene order on the chromosome is least likely to be conserved

c) As the relationship between the organisms decreases, local groups of genes remain clustered together, but chromosomal rearrangements move the clusters to other locations

d) The function of a particular gene can sometimes be predicted, given the known function of a neighboring, closely linked gene


Answer: b

Explanation: In related organisms, both gene content of the genome and gene order on the chromosome are likely to be conserved.

5. Which of the given statements about orthologs is incorrect?

a) In comparing two proteomes, a common standard is to require that for each pair of orthologs, the first of the pair is the best hit when the second is used to query the proteome of the first

b) To identify orthologs, each protein in the proteome of an organism is used as a query in a similarity search of a database comprising the proteomes of only one different organism

c) The best hit in each proteome is likely to be with an ortholog of the query gene

d) Orthologs are genes that are so highly conserved by sequence in different genomes that the proteins they encode are strongly predicted to have the same structure and function and to have arisen from a common ancestor through speciation


Answer: b

Explanation: To identify orthologs, each protein in the proteome of an organism is used as a query in a similarity search of a database comprising the proteomes of one or more different organisms.
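
The reciprocal best-hit standard described in option a can be sketched as follows. The example assumes that similarity searches have already been run in both directions and summarized as nested dictionaries of scores where higher means better (bit scores, for instance); the function names, protein identifiers, and scores are all made up for illustration.

```python
def best_hit(query, scores):
    """Return the highest-scoring subject for a query, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a in a_vs_b:
        b = best_hit(a, a_vs_b)
        if b is not None and best_hit(b, b_vs_a) == a:
            pairs.append((a, b))
    return pairs

if __name__ == "__main__":
    # Hypothetical scores: proteome A queried against B, and B against A.
    a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b1": 120}}
    b_vs_a = {"b1": {"a1": 245, "a2": 118}, "b2": {"a1": 88}}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2 is not paired with b1 because b1's own best hit is a1, which is the kind of one-directional match that often indicates a paralog rather than an ortholog.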

6. In protein/domain analysis, each protein in the predicted proteome is again used as a query of a curated protein sequence database such as ______ in order to locate similar domains and sequences. To find orthologs, very low E values ($E < 10^{-20}$) for the alignment score and an alignment that includes 60–80% of the query sequence are generally required in order to avoid matches to paralogs.

a) PubChem

b) GenBank

c) MeSH

d) SwissProt

Answer: d

Explanation: The domain composition of each protein is also determined by searching for matches in domain databases such as InterPro. The analysis reveals how many domains and domain combinations are present in the proteome, and reveals any unusual representation that might have biological significance. The number of expressed genes in each family can also be compared to the number in other organisms to determine whether or not there has been an expansion of the family in the genome.

7. In all-against-all self-comparison, a comparison is made in which every protein is used as a query in a similarity search against a database composed of the rest of the proteome, and the significant matches are identified by a low expect value.

a) True

b) False

Answer: a

Explanation: Because many proteins comprise different combinations of a common set of domains, proteins that align along most of their lengths (80% is a conservative choice) are chosen in order to select those that have a conserved domain structure.

8. Processed pseudogenes are also derived from a functional gene and they contain introns and a promoter.

a) True

b) False

Answer: b

Explanation: Processed pseudogenes are also derived from a functional gene, but they do not contain introns and lack a promoter; hence, they are not expressed. The origin of these pseudogenes is probably due to reverse transcription of the mRNA of the functional gene and insertion of the cDNA copy into a new chromosomal location by a LINE1 reverse transcriptase.

9. Pseudogenes are DNA sequences that were derived from distinct genes but which have acquired mutations that are deleterious to function in the same period of time.

a) True

b) False

Answer: b

Explanation: Pseudogenes are DNA sequences that were derived from a functional copy of a gene but which have acquired mutations that are deleterious to function. For example, the pseudogene TRY5 is similar to the nearby functional gene TRY4.

10. New gene functions are thought to be gained by duplication of an existing gene creating two tandem copies.

a) True

b) False

Answer: a

Explanation: Functional differentiation then occurs between the copies by mutation and selection. However, because most mutations are deleterious, and because only one gene copy may be needed for function, there is a strong tendency of one copy to accumulate mutations that render the gene nonfunctional.

Comparative Genomics – 1

1. Comparative genomics includes a comparison of gene number, gene content, and gene location in both prokaryotic and eukaryotic groups of organisms.

a) True

b) False

Answer: a

Explanation: The availability of complete genome sequences makes possible a comparison of all of the proteins encoded by one genome, the proteome of that organism, with those of another. Because the genome sequence provides both the sequence and the map location of each gene, both the sequence and location can be compared.

2. Which of the following kinds of information do sequence comparisons not provide?

a) Gene relationships

b) Function history

c) Evolutionary history

d) Gene locations

Answer: d

Explanation: Map locations of orthologous genes may also be compared. If a set of genes is grouped together at a particular chromosomal location, and if a set of similar genes is also grouped together in the genome of another organism, these groups share an evolutionary history.

3. Which of the given statements is incorrect?

a) Proteins may be clustered into families on the basis of either sequence or structural similarity

b) Proteins often comprise separate domains

c) The number of protein sequences that are available is insufficient to determine that domain shuffling occurs in evolution

d) Proteins are modular

Answer: c

Explanation: Contrary to option c, the number of available protein sequences is sufficient to show that domain shuffling occurs in evolution. Comparisons of the proteomes of different organisms can identify the type of domain changes and also provide an indication as to what biological role they may have in a particular organism.

4. Which of the given statements is incorrect?

a) Two tandem copies of a gene are produced while proteins with new functions are produced

b) Proteins with new functions are produced by a gene duplication event

c) Assortment and reassortment of protein domains takes place in individual genomes

d) In no case do the two duplicated genes both undergo change

Answer: d

Explanation: In one possible scenario, two duplicated genes both undergo change, but interactions between the proteins stabilize the original function and support the evolution of new ones. Through mutation and natural selection, one of the copies can develop a new function, leaving the other copy to cover for the original function. However, because most mutations are deleterious to function, often one of the copies becomes a pseudogene. Not all gene duplications are thought to have the above effects.

5. Which of the given statements is incorrect?

a) The processes of domain assortment and gene duplication produce families of proteins in organisms

b) Following speciation, a newly derived genome will inherit the families of ancestor organisms, but will also develop new ones to meet evolutionary challenges

c) Comparison of each of the proteins encoded by an organism with every protein, an all-against-all comparison, reveals which protein families have been amplified and what rearrangements have occurred as steps in the evolutionary process

d) When two or more proteins in the proteome share a high degree of similarity they are least likely to be paralogs


Answer: d

Explanation: When two or more proteins in the proteome share a high degree of similarity because they share the same set of domains, they are likely to be paralogs, genes that arose by gene duplication events. Proteins that align over shorter regions share some domains, but also may not share others. Although gene duplication events could have created such variation, other rearrangements may have also occurred, blurring the evolutionary history.

6. Which of the given statements is incorrect about All-against-all Self-comparison?

a) A comparison of each protein in the proteome with all other proteins distinguishes unique proteins from proteins that have arisen from gene duplication, and also reveals the number of protein families but the domain content of these proteins cannot be known

b) In all-against-all proteome comparison, each protein is used as a query in a similarity search against the remaining proteome

c) In all-against-all proteome comparison, the similar sequences are ranked by the quality and length of the alignments found

d) In all-against-all proteome comparison, the search is conducted with each alignment score receiving a statistical evaluation (P or E value)

Answer: a

Explanation: The domain content of these proteins may also be analyzed. In all-against-all proteome comparison, a match between a query sequence and another proteome sequence with the same domain structure will produce a high-scoring, highly significant alignment. These proteins are designated paralogs because they have almost certainly originated from a gene duplication event.

7. Which of the given statements is incorrect about Cluster analysis?

a) Clustering organizes the proteins into groups by some objective criterion

b) One criterion for a matching protein pair is the statistical significance of their alignment score

c) The P or E value from BLAST searches cannot be the criterion for a matching protein pair

d) A criterion for clustering proteins is the distance between each pair of sequences in a multiple sequence alignment


Answer: c

Explanation: Options b and c describe the same criterion, except that option c negates it. The P or E value from BLAST searches is such a criterion: the lower this value, the better the alignment. There will be a cutoff P or E value at which the matches in the BLAST search are no longer considered significant. A value of P or E = 0.01–0.05 is usually the point at which the alignment score is no longer considered to be significant, and a lower cutoff may be used in order to focus on a more closely related group of proteins.

8. Which of the given statements is incorrect about Clustering by making subgraphs?

a) Each sequence is a vertex and each pair of sequences that is matched with a significant alignment score is joined by an edge that is weighted according to the statistical significance of the alignment score

b) One way to identify the most strongly supported clusters is simply to add the most weakly supported edges in the graph

c) One way to identify the most strongly supported clusters is simply to remove the most weakly supported edges in the graph

d) An edge is weighted according to the statistical significance of the alignment score

View Answer

Answer: b

Explanation: As weaker and weaker links are removed, the remaining combinations of vertices and edges represent most strongly linked sequences. This type of analysis was performed on an initial collection of E. coli genes by Labedan and Riley (1995).
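A minimal sketch of clustering by subgraphs, assuming each edge weight is a significance measure where larger means stronger (for example -log10 of the E value). The function name `clusters`, the sequence labels, and the weights are hypothetical.

```python
from collections import defaultdict


def clusters(edges, min_weight):
    """Drop edges weaker than min_weight, then return the connected components."""
    graph = defaultdict(set)
    nodes = set()
    for a, b, weight in edges:
        nodes.update((a, b))
        if weight >= min_weight:      # weakly supported edges are removed
            graph[a].add(b)
            graph[b].add(a)

    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                  # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Hypothetical sequences A..E; the C-D link is the weakest.
edges = [("A", "B", 50.0), ("B", "C", 45.0), ("C", "D", 3.0), ("D", "E", 60.0)]
print(clusters(edges, 2.0))   # [['A', 'B', 'C', 'D', 'E']]
print(clusters(edges, 10.0))  # [['A', 'B', 'C'], ['D', 'E']]
```

Raising the threshold removes the weak C-D edge and splits one loose group into two strongly supported clusters.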

9. Which of the given statements is incorrect about Clustering by single linkage?

a) In the first step, a group of related sequences found in the all-against-all proteome comparison is subjected to a multiple sequence alignment, usually by CLUSTALW

b) A neighbor-joining algorithm is rarely used in this method

c) This procedure and the algorithms are the same as those used to make a phylogenetic tree by the distance methods

d) A distance matrix that shows the number of amino acid changes between each pair of sequences is made

View Answer

Answer: b

Explanation: The matrix is then used to cluster the sequences by a neighbor-joining algorithm.

These methods produce a tree or a different representation of the tree called a dendrogram, which minimizes the number of amino acid changes that would generate the group of sequences.
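The distance-matrix step can be sketched as follows, assuming the sequences are already aligned to equal length and that gapped columns are simply skipped. This covers only the counting of amino acid changes, not the neighbor-joining step itself; the sequences and the function name are invented for illustration.

```python
def distance_matrix(aligned):
    """Count amino acid differences between each pair of aligned sequences."""
    names = sorted(aligned)

    def differences(s, t):
        # Columns containing a gap in either sequence are ignored (an assumption).
        return sum(1 for x, y in zip(s, t) if x != y and x != "-" and y != "-")

    matrix = [[differences(aligned[a], aligned[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical aligned fragments.
aligned = {"seq1": "MKTAYIAK", "seq2": "MKTAYLAK", "seq3": "MRTGYLAK"}
names, matrix = distance_matrix(aligned)
for name, row in zip(names, matrix):
    print(name, row)
```

A tree-building method would then join the pair with the smallest distance first (here seq1 and seq2, which differ at one position).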

10. The all-against-all analyses provide an indication as to the number of protein/gene families in an organism. This number represents the core proteome of the organism from which all biological functions have diversified.

a) True

b) False

View Answer

Answer: a

Explanation: In Haemophilus, 1247 of the total number of 1709 proteins do not have paralogs. The core proteomes of the worm and fly are similar in size but with a greater number of duplicated genes in the worm. It is quite remarkable that the core proteome of the multicellular organisms (worm and fly) is only twice that of yeast.

Comparative Genomics – 2

1. Which of the given statements is incorrect about Grouping Sequences?

a) The problem of deciding which sequences to include in the same group or cluster and which to separate into different groups or clusters is a recurring one

b) Divergence is necessary, but the sequences chosen should be clearly related based on inspection of each pair-wise alignment and a statistical analysis

c) The conservative approach is to group distinct sequences

d) The adventurous approach is to choose a set of marginally alignable sequences to pursue the difficult task of making a multiple sequence alignment and then to make profile models that may recognize divergence but will also give false predictions

View Answer

Answer: c

Explanation: The conservative approach is to group only very similar sequences together. However, in making a conservative multiple sequence alignment with only very alike sequences, it is not possible to analyze the evolutionary divergence that may have occurred in a family of proteins. Furthermore, if a matrix or profile model is made from this alignment, that model will not be useful for identifying more divergent members of a family.

2. Which of the given statements is incorrect about Clusters of orthologous groups?

a) Using the protein from one of the organisms to search the proteome of the other for high-scoring matches should identify the ortholog as the highest- scoring match, or best hit

b) When entire proteomes of the two organisms are available, orthologs may be identified

c) A pair of orthologous genes in two organisms share so much sequence similarity that they may be assumed to have arisen from a common ancestor gene

d) Each of the orthologs belongs to a family composed of paralogous sequences that are unrelated to each other

View Answer

Answer: d

Explanation: In many cases, each of the orthologs belongs to a family composed of paralogous sequences related to each other by gene duplication events. Hence, in the above database search, the ortholog will not only match the orthologous sequence in the second proteome but also these other paralogous sequences. The objective of the clusters of orthologous groups (COG) approach is to identify all matching proteins in the organisms, defined as an orthologous group related by both speciation and gene duplication events.

3. Which of the given statements is incorrect about Clusters of orthologous groups?

a) Paralogs may include a best hit or a high-scoring match of one of the sequences by another, but the reciprocal match can have low similarity that does not have to be significant

b) Paralogs defined by sets of three matching sequences in the selected organisms were kept separated from the clusters

c) Orthologous pairs were first defined by the best hits in reciprocal searches

d) To produce COGs, similarity searches were performed among the proteomes of phylogenetically distinct clades of prokaryotes

View Answer

Answer: b

Explanation: Paralogs defined by sets of three matching sequences in the selected organisms were also added to these clusters. Sixty percent of the original set of 720 COGs does not include paralogs, or includes paralogs from one lineage only, suggesting that there has not been extensive duplication of this group.
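The reciprocal best-hit criterion for orthologous pairs can be sketched as below. The score tables, protein names, and function names are hypothetical; in practice the scores would come from similarity searches of each proteome against the other.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    return {query: max(subjects, key=subjects.get)
            for query, subjects in scores.items() if subjects}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical alignment scores between proteome A and proteome B.
a_vs_b = {"a1": {"b1": 310, "b2": 120}, "a2": {"b2": 95, "b1": 90}, "a3": {"b1": 200}}
b_vs_a = {"b1": {"a1": 305, "a3": 198}, "b2": {"a1": 80, "a2": 97}}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Here a3 hits b1, but b1 prefers a1, so the a3 match is one-directional. That is the situation option a describes for paralogs.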

4. Which of the given statements is incorrect about the Comparison of proteomes to EST databases of an organism?

a) ESTs are single DNA sequence reads that contain a small fraction of incorrect base assessments, insertions, and deletions

b) Many sequences arise from near the 5’ end of the mRNA, although every effort is usually made to read as far 3’ as possible into the upstream portion of the cDNA

c) EST libraries are useful for preliminary identification of genes by database similarity searches

d) An EST database of an organism can be analyzed for the presence of gene families, orthologs, and paralogs

View Answer

Answer: b

Explanation: Many sequences arise from near the 3’ end of the mRNA, although every effort is usually made to read as far 5’ as possible into the upstream portion of the cDNA. Because not all of the genes may be expressed in the tissues chosen for analysis, the library will often not be complete.

5. Which of the given statements is incorrect about Searching for orthologs to a protein family in an EST database?

a) Searches of EST databases for matches to a query sequence routinely produce minimal amounts of output that must be searched manually for significant hits

b) ESTs with a high percent identity with the query sequence, a long alignment with the query sequence, and a very low E value of the alignment score represent groups of paralogous and orthologous genes

c) To identify orthologs as the most closely related sequence, ESTs were aligned using the amino acid alignment as a guide

d) To identify orthologs as the most closely related sequence, a phylogenetic tree was produced by the maximum likelihood method

View Answer

Answer: a

Explanation: Searches of EST databases for matches to a query sequence routinely produce large amounts of output that must be searched manually for significant hits. An automatic method was described in 1999 utilizing a computer script, FAST-PAN, that scans EST databases with multiple queries from a protein family, sorts the alignment scores, and produces charts and alignments of the matches found.

6. Which of the given statements is incorrect about Family and Domain Analysis?

a) Gene identification of predicted proteins in the genome is designed to discover the metabolic features of an organism

b) In a particular organism or group of organisms, one particular domain can be expanded to perform a particular function

c) Comparison of the domain content of an entire proteome with that of another proteome cannot help in revealing the biological roles of diverse domains in different organisms

d) Different proteins are mosaics of domains that occur in different combinations in a given protein

View Answer

Answer: b

Explanation: In a particular organism or group of organisms, various domains can be expanded to perform a particular function. More than 2000 fly and worm proteins are multidomain proteins, compared to about one-third this number in yeast.

7. Which of the given statements is incorrect about Ancient Conserved Regions?

a) The method involves database similarity searches of the SwissProt database with human, worm, yeast, or E. coli genes and identification of matches with sequences from a different phylum than the query sequence

b) An analysis of ACRs that predate the radiation of the major animal phyla some 580–540 million years ago suggested that 50–60% of coding sequences are ACRs

c) These ACRs may represent proteins present at the time of the prokaryotic–eukaryotic divergence

d) Phylogenetically diverse groups of organisms have been analyzed for the presence of conserved proteins and protein domains that have been conserved over long periods of evolutionary time, called ancient conserved regions or ACRs

View Answer

Answer: b

Explanation: The analysis of ACRs that predate the radiation of the major animal phyla some 580–540 million years ago suggested that 20–40% of coding sequences are ACRs. For example, a search with 1916 E. coli proteins detected 266 ACRs found in 439 sequences, roughly one-quarter of the SwissProt database.

8. Which of the given statements is incorrect about Horizontal Gene Transfer?

a) The genomes of most organisms are derived by vertical transmission, the inheritance of chromosomes from parents to offspring from one generation to the next

b) It is the acquisition of genetic material from a different organism

c) The transferred material becomes a temporary addition to the recipient genome

d) An extreme example is the proposed endosymbiont origin of mitochondria in eukaryotic cells and chloroplasts in plants

View Answer

Answer: c

Explanation: The transferred material becomes a permanent addition to the recipient genome. Although these exchanges do not occur very often on a generation-to-generation basis, a significant number can occur over a period of hundreds of millions of years.

9. Which of the given statements is incorrect about Horizontal Gene Transfer?

a) It is a significant source of genome variation in bacteria, allowing them to exploit new environments

b) Such transfer is rendered possible by a variety of natural mechanisms in bacteria for transferring DNA from one species to another

c) Detection of HT is made possible by the fact that each genome of each bacterial species has a unique base composition

d) The time of transfer of DNA cannot be estimated by the composition of the HT DNA

View Answer

Answer: d

Explanation: The time of transfer of DNA may be estimated by the degree to which the composition of the HT DNA has blended into that of the recipient genome. Transfer of a portion of a genome from one organism to another can generally be detected as an island of sequence of different composition in the recipient. If the amino acid composition of transferred genes is typical, these islands may be detected by a codon usage analysis.
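Detecting an island of atypical composition can be sketched with a simple GC-content scan. This is only an illustration under assumed parameters: the toy genome, the window size, and the threshold are invented, and a real analysis would more likely use codon usage or richer statistics.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(genome, window, threshold):
    """Return (start, GC) for non-overlapping windows far from the genome-wide GC."""
    background = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_content(genome[start:start + window])
        if abs(gc - background) > threshold:
            flagged.append((start, round(gc, 2)))
    return flagged


# Toy AT-rich "genome" with one GC-rich island starting at position 40.
genome = "ATATATTAAT" * 4 + "GCGCGGCCGC" + "ATTATAATAT" * 3
print(atypical_windows(genome, window=10, threshold=0.3))  # [(40, 1.0)]
```

The more a transferred segment has blended into the recipient's composition, the smaller this deviation becomes, which is the basis for estimating how long ago the transfer occurred.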

10. Annotation is based on finding significant alignment to sequences of known function in database similarity searches.

a) True

b) False

View Answer

Answer: a

Explanation: Accurate annotation of genome sequences is an important first step in genome analysis. Matches of lesser significance provide only a tentative or hypothetical prediction and should be used as a working hypothesis of function.

18. Questions on Genome Analysis

Functional Classification of Genes

1. GeneQuiz focuses on deriving a predicted protein function, based on a variety of available evidence, including the evaluation of the similarity to the closest homolog in a database.

a) True

b) False

View Answer

Answer: a

Explanation: GeneQuiz is an integrated system for large-scale biological sequence analysis that uses a variety of search and analysis methods using current sequence databases. By applying expert rules to the results of the different methods, GeneQuiz creates a compact summary of findings.

2. Which of the given statements is incorrect regarding MAGPIE?

a) It analyzes the genome using a set of automated processes

b) It is designed for high-throughput genome sequence analysis

c) It is unable to locate potential promoters

d) It automatically annotates genomic sequence data and maintains a daily up-to-date record in response to user queries about one or more genomes

View Answer

Answer: c

Explanation: The system also uses a set of rules in logic programming to make decisions that may be used to interpret information from various sources. It has been used to locate potential promoters, terminators, start codons, Shine-Dalgarno sites, DNA motif sites, co-transcription units, and putative operons in microbial genomes. These sites are shown on a map display of the genome that may be edited.

3. Which of the given statements is incorrect?

a) Paralogous sequences are frequently found to have dissimilar functions

b) An early classification scheme for eight related groups of E. coli genes included categories for enzymes, transport elements

c) An early classification scheme for eight related groups of E. coli genes included categories for regulators, membranes, structural elements, protein factors, leader peptides, and carriers

d) Ninety percent of E. coli genes related by significant sequence similarity fell into these same broad categories

View Answer

Answer: a

Explanation: Genes that are significantly similar in an organism, i.e., paralogous sequences, frequently are found to have a related biological function. This discovery follows the expected origin of paralogs by gene duplication events, leaving one copy to perform the original function and producing a second copy to develop a new function not too distant from the original one under evolutionary selection.

4. The designation EC a.b.c.d conveys several pieces of information. Which of the following is not one of them?

a) One of twelve main classes of biochemical reactions

b) The group of substrate molecule

c) The nature of chemical bond that is involved in the reaction

d) Designation for acceptor molecules (cofactors)

View Answer

Answer: a

Explanation: Option a should be ‘one of six main classes of biochemical reactions’. The Enzyme Commission numbers formulated by the Enzyme Commission of the International Union of Biochemistry and Molecular Biology provide a detailed way to classify enzymes based on the biochemical reactions they catalyze.
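A minimal sketch of reading an EC designation, following the six main classes mentioned in the explanation above. The function name `parse_ec` and the returned dictionary layout are illustrative choices, not part of any standard library.

```python
# The six main reaction classes referred to in the explanation.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(designation):
    """Split 'EC a.b.c.d' into its main class name and the three finer fields."""
    fields = designation.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError("expected four dot-separated fields: " + designation)
    main = int(fields[0])
    return {"main_class": EC_CLASSES.get(main, "unknown"), "subfields": fields[1:]}


print(parse_ec("EC 1.1.1.1"))  # alcohol dehydrogenase, an oxidoreductase
```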

5. An approach to classification of genes that encode enzymes is to examine relationships among multiple enzymes that perform the same biochemical function in the same organism.

a) True

b) False

View Answer

Answer: a

Explanation: Although catalyzing the same reaction, these enzymes showed variations in metabolic regulation of their activity. More than one-half of multiple enzymes in E. coli share significant sequence similarity; i.e., they are paralogs. However, the remainder do not share any sequence similarity.

6. Other functional classification schemes for genes include a broader category for genes involved in the same biological process, e.g., a three-group scheme for energy-related, information-related, and communication-related genes has also been used.

a) True

b) False

View Answer

Answer: a

Explanation: By this scheme, plants devote more than one-half of their genome to energy metabolism, whereas animals devote one-half of their genome to communication-related functions.

7. Two species that have recently diverged from a common ancestor might be expected to have a _____ set of genes and _____ chromosomes, with these genes positioned along the chromosomes in the same order.

a) distinct, similar

b) similar, distinct

c) similar, dissimilar

d) similar, similar

View Answer

Answer: d

Explanation: Over evolutionary time, the sequence of each pair of genes will slowly diverge, as the species diverge and other changes such as gene duplication and gene loss change the gene content. In addition, the order of genes also changes over evolutionary time as a result of chromosomal rearrangements.

8. Which of the given statements is incorrect about the observations made with regard to gene order?

a) Order is highly conserved in closely related species

b) Order in closely related species becomes changed by rearrangements over evolutionary time

c) As more and more rearrangements occur, there will no longer be any correspondence in the order of orthologous genes on the chromosome of one organism with that of a second organism

d) Order is less conserved in closely related species

View Answer

Answer: d

Explanation: Order is more conserved in closely related species. Another observation is that the groups of genes that have a similar biological function tend to remain localized in a group or cluster.

9. Which of the given statements is incorrect about Chromosomal Rearrangements?

a) Comparison of the number of rearrangements in a given period of evolutionary history may vary significantly from one organism to the next

b) If gene A has a neighboring gene B, then if an ortholog of A occurs in another genome, there is an increased probability of an ortholog of B also occurring in the other organism

c) If gene A has a neighboring gene B, then if an ortholog of A occurs in another genome, the B ortholog is more likely to be a neighbor of the A ortholog of the genome of the second species if the two species are more divergent

d) In general, the order of orthologs is not well conserved in prokaryotes when the genomes have diverged sufficiently that the orthologs have < 50% identity

View Answer

Answer: c

Explanation: The B ortholog is less likely to be a neighbor of the A ortholog in the genome of the second species if the two species are more divergent. When genes are classified using a nine-class functional classification scheme, several genes falling into the same functional category are clustered together on the chromosomes of both of these organisms, and the clusters are in a similar order.
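Conservation of gene neighborhoods can be sketched by counting adjacent gene pairs in one genome whose orthologs are also adjacent in the other. The gene names, gene orders, and ortholog map below are hypothetical, and chromosomes are treated as simple linear lists.

```python
def conserved_neighbors(order_a, order_b, orthologs):
    """Adjacent pairs in genome A whose orthologs are also adjacent in genome B."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    kept = []
    for g1, g2 in zip(order_a, order_a[1:]):
        o1, o2 = orthologs.get(g1), orthologs.get(g2)
        if o1 in position_b and o2 in position_b \
                and abs(position_b[o1] - position_b[o2]) == 1:
            kept.append((g1, g2))
    return kept


order_a = ["a1", "a2", "a3", "a4", "a5"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}

close_species = ["b1", "b2", "b3", "b5", "b4"]      # one small rearrangement
divergent_species = ["b3", "b1", "b5", "b2", "b4"]  # heavily rearranged

print(len(conserved_neighbors(order_a, close_species, orthologs)))      # 3
print(len(conserved_neighbors(order_a, divergent_species, orthologs)))  # 0
```

The closely related toy genome keeps three of four neighbor pairs, while the divergent one keeps none, matching the trend described in the explanation.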

10. Which of the given statements is incorrect?

a) In a given organism or species, genes are found in a given order that is maintained on the chromosomes from one generation to the next

b) Genes with a related function are frequently found to be distorted on a chromosome

c) A possibility is that there is genetic variation (alleles) within each gene in a cluster of a given species and that only certain allelic combinations of different genes are compatible

d) Clustering of related genes presumably provides an evolutionary advantage to a species

View Answer

Answer: b

Explanation: Genetic analysis has revealed that genes with a related function are frequently found to be clustered at one chromosomal location. As genome-by-genome comparisons of the chromosomes of related species are made and the rearrangements are discovered, a further challenge to computational and evolutionary biologists is to estimate the number and types of rearrangements that have occurred and also to determine when they occurred. For example, a comparison of the mouse and human chromosomes reveals many rearrangements.

Global Gene Regulation

1. Which of the given statements is incorrect about global gene regulation?

a) One way to obtain useful information about a genome is to determine which genes are induced or repressed in response to a phase of the cell cycle

b) Sets of genes whose expression rises and falls under the same conditions are likely to have a related function

c) Sets of genes whose expression rises and falls under the same conditions are likely to have dissimilar functions

d) Cell cycle is a developmental phase, or a response to the environment

View Answer

Answer: c

Explanation: In addition, a pattern of gene expression may also be an indicator of abnormal cellular regulation and is a useful tool in cancer diagnosis. Because genomes, especially eukaryotic genomes, are so large, a new technology has been developed for studying the regulation of thousands of genes on a microscope slide.

2. Which of the given statements is incorrect about Microarray (or microchip) analysis?

a) It is a new technology in which all of the genes of an organism are represented by oligonucleotide sequences spread out in an 80 x 80 array on microscope slides

b) The oligonucleotide sequences cannot be synthesized directly on the slide

c) The oligonucleotides are collectively hybridized to a labeled cDNA library prepared by reverse-transcribing mRNA from cells

d) The amount of label binding to each oligonucleotide spot reflects the amount of mRNA in the cell

View Answer

Answer: b

Explanation: The oligonucleotide sequences can also be synthesized directly on the slide at densities of up to one million per square centimeter. Genes that respond the same way to an environmental signal (for example, the addition of serum to serum-starved skin cells) are clustered together in a display. From this analysis, a set of genes that responds in an identical manner may be identified.

3. Once a set of genes that are co-regulated has been found, the promoter regions of these genes may be analyzed for conserved patterns that represent sites of interaction with specific transcription factors.

a) True

b) False

View Answer

Answer: a

Explanation: Automatic methods for clustering related sets of genes have been devised, including hierarchical clustering, self-organizing maps, and support vector machines. The first of these methods, hierarchical clustering, is commonly used, but the other two methods are better designed to detect differences in patterns over a set of time points or samples.

4. Which of the given statements is incorrect about Microarray Analysis?

a) It is designed to detect global changes in transcription in a genome

b) It provides information about the levels of protein products of the genes

c) The proteins are first separated in a column on the basis of charge and then across a second dimension on a slab on the basis of size

d) Labeled protein samples may also be extracted from treated cells and separated by two-dimensional gel electrophoresis

View Answer

Answer: b

Explanation: Microarray analysis is designed to detect global changes in transcription in a genome but does not provide information about the levels of protein products of the genes, which may also be subject to translational regulation. Protein levels can instead be examined by two-dimensional gel electrophoresis, which can resolve thousands of proteins based on size and charge. There are databases of the patterns found in different organisms.

5. In cluster analysis of microarray data, if Xi is the log odds value for gene X at time i, then for two genes X and Y and N observations, a similarity score S(X,Y) is calculated. S(X,Y) is also known as the Pearson correlation coefficient. Xoffset and Yoffset can be the mean of the observations on X or Y, respectively, in which case the normalizing term is the standard deviation, or else Xoffset and Yoffset can be set to zero when a reference state is used. Which of the following best represents it?

View Answer

Answer: b

Explanation: After values of S(X,Y) have been calculated for all gene combinations, the most closely related pairs are identified in an above-diagonal scoring matrix. The object of clustering is to identify genes that respond the same way to the environmental treatment. Each gene is compared to every other gene and a gene similarity score (metric) is produced.
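The similarity score can be sketched in code, assuming the usual normalized form S(X,Y) = (1/N) Σ ((Xi − Xoffset)/ΦX) · ((Yi − Yoffset)/ΦY), where ΦX is the root mean square of (Xi − Xoffset). The symbol Φ, the function name, and the example profiles are assumptions made for this illustration.

```python
import math


def similarity(x, y, x_offset=None, y_offset=None):
    """Similarity score S(X, Y); with mean offsets it is the Pearson correlation."""
    n = len(x)
    if x_offset is None:
        x_offset = sum(x) / n      # default: mean of the observations on X
    if y_offset is None:
        y_offset = sum(y) / n      # default: mean of the observations on Y
    phi_x = math.sqrt(sum((v - x_offset) ** 2 for v in x) / n)
    phi_y = math.sqrt(sum((v - y_offset) ** 2 for v in y) / n)
    return sum(((a - x_offset) / phi_x) * ((b - y_offset) / phi_y)
               for a, b in zip(x, y)) / n


# Hypothetical log odds values for three genes at four time points.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]   # rises with gene_x
gene_z = [4.0, 3.0, 2.0, 1.0]   # falls as gene_x rises

print(similarity(gene_x, gene_y))            # close to +1
print(similarity(gene_x, gene_z))            # close to -1
print(similarity(gene_x, gene_y, 0.0, 0.0))  # offsets fixed at zero (reference state)
```

A profile with no variation around its offset would make Φ zero, so a real implementation would need to handle that case.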

6. In cluster analysis of microarray data, a node is created between the _____ scoring pair, the gene expression profiles of these two genes are averaged, and the joined elements are weighted by the _____ of elements they contain.

a) lowest, frequency

b) average, sequence

c) lowest, number

d) highest, number

View Answer

Answer: d

Explanation: The node is created as mentioned. The matrix is then updated replacing the two joined elements by the node.

7. In cluster analysis of microarray data, for n genes the process is repeated _____ times until a single element remains.

a) n^2

b) n

c) n-1

d) n-4

View Answer

Answer: c

Explanation: Each join replaces two elements with a single node, so n-1 joins are needed to reduce n genes to one element. In the final dendrogram, the order of genes within a cluster is determined by simple weighting schemes, e.g., average dendrogram level.

8. The hierarchical clustering method generates a similarity score [S(X,Y)] for all gene combinations, places the scores in a matrix, joins those genes that have the highest score, and then continues to join progressively less similar pairs.

a) True

b) False

View Answer

Answer: a

Explanation: The disadvantage of this method is that it fails to discriminate between different patterns of variation. For example, a gene expression pattern for which a high value is found at an intermediate time point will be clustered with another for which a high value is found at a late time point in the experiment. These variations have to be separated in a subsequent step.
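The joining procedure described in the questions above can be sketched as follows: score all pairs, join the highest-scoring pair, average the two profiles weighted by the number of genes each contains, and repeat until one element remains. The gene names and profiles are invented, and the code favors clarity over speed.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def hierarchical_cluster(profiles):
    """Return the list of joins (left, right, score), highest-scoring pair first."""
    # Each element maps a label to (profile, number of genes it contains).
    elements = {name: (list(p), 1) for name, p in profiles.items()}
    joins = []
    while len(elements) > 1:
        names = sorted(elements)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                score = pearson(elements[a][0], elements[b][0])
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        (pa, na), (pb, nb) = elements.pop(a), elements.pop(b)
        # Average the joined profiles, weighted by the number of genes in each.
        merged = [(u * na + v * nb) / (na + nb) for u, v in zip(pa, pb)]
        elements["(" + a + "," + b + ")"] = (merged, na + nb)
        joins.append((a, b, round(score, 3)))
    return joins


profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
joins = hierarchical_cluster(profiles)
for join in joins:
    print(join)
```

With four genes the loop runs three times (n-1 joins), and the sequence of joins is what a dendrogram would draw.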

9. In Self-organizing maps a choice is made of a number of clusters by which to organize the data.

a) True

b) False

View Answer

Answer: a

Explanation: The object is to move each node to the center of a cluster of data points. At each iteration a data point P is selected, and the node closest to that point is identified.
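One iteration of a self-organizing map can be sketched as below. The text only states that the node closest to P is identified; moving that node most and the other nodes by a smaller amount is the standard scheme and is an assumption here, as are the rates, the function names, and the two-dimensional toy data.

```python
def closest_node(nodes, point):
    """Index of the node nearest to the point (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda i: sum((n - p) ** 2 for n, p in zip(nodes[i], point)))


def som_step(nodes, point, rate=0.5, other_rate=0.1):
    """Move the closest node toward the point by rate, the others by other_rate."""
    winner = closest_node(nodes, point)
    for i, node in enumerate(nodes):
        step = rate if i == winner else other_rate
        nodes[i] = [n + step * (p - n) for n, p in zip(node, point)]
    return winner


# Two nodes (the chosen number of clusters) and one selected data point P.
nodes = [[0.0, 0.0], [10.0, 10.0]]
winner = som_step(nodes, [2.0, 2.0])
print(winner, nodes)
```

Repeating this step over many randomly selected data points, usually with shrinking rates, is what gradually moves each node toward the center of a cluster.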

10. SVMs (Support vector machines) are a binary classification method to discriminate one set of data points from another

a) True

b) False

View Answer

Answer: a

Explanation: They are similar to the types of discriminant analyses. For microarray analysis, sets of genes are identified that represent a target pattern of gene expression.

Prediction of Gene Function Based on a Composite Analysis

1. When two proteins share a considerable degree of sequence identity throughout the sequence alignment, they are least likely to share the same function.

a) True

b) False

View Answer

Answer: b

Explanation: In the mentioned case they are more likely to share the same function. A considerable fraction of a genome may encode proteins whose function may not be identified in this manner because the proteins are not related to another of known function.

2. Other types of evidence for a relationship between two genes are also given that are not dependent on sequence similarity. These include _____. Which of the following is a wrong way to fill the blank?

a) genes are closely linked on the same chromosomes

b) genes are transcribed from the same DNA strand

c) gene fusions are observed between otherwise separate genes

d) phylogenetic profiles show the genes are not that commonly present in organisms

View Answer

Answer: d

Explanation: Phylogenetic profiles show that both genes are commonly present in many organisms, implying that they have interdependent metabolic functions. Options a and b imply coordinated regulation in an operon-like structure. Option c suggests the encoded proteins are physically associated in a common complex.
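A phylogenetic profile comparison can be sketched with presence/absence vectors, one position per sequenced genome. The gene names, the profiles, and the requirement of an exact match are simplifying assumptions for illustration.

```python
def profile_matches(profiles, min_present=2):
    """Pairs of genes with identical presence/absence profiles across genomes."""
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Require the genes to be present in several genomes, so that
            # uninformative all-absent profiles are not reported as matches.
            if profiles[a] == profiles[b] and sum(profiles[a]) >= min_present:
                pairs.append((a, b))
    return pairs


# 1 = a homolog is present in that genome, 0 = absent (five hypothetical genomes).
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (1, 0, 1, 0, 1),
    "geneD": (0, 0, 0, 0, 0),
    "geneE": (0, 0, 0, 0, 0),
}
print(profile_matches(profiles))  # [('geneA', 'geneB')]
```

Genes that are always gained and lost together, like geneA and geneB here, are candidates for a shared pathway or complex even without any sequence similarity.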

3. In Genome-wide prediction of protein functions by a combinatorial method– Each point represents a protein, and branches between proteins indicate a relationship by one of several criteria indicated in the legend.

a) True

b) False

View Answer

Answer: a

Explanation: Branches are shorter for closely related proteins and thicker when two or more prediction methods indicate a relationship. The links are based on experimental data, proteins whose homologs are known to operate sequentially in metabolic pathways, proteins that evolved in a correlated fashion as evidenced by presence in fully sequenced genomes, proteins whose homologs are fused into a single protein in another organism, and proteins whose mRNA expression profiles are similar under a range of cellular and environmental conditions.

4. Which of the given statements is untrue about functional genomics?

a) Known functions are derived from experimental evidence in molecular biology and genetic studies with model organisms

b) Non-Orthologous genes between biologically distinct species can be identified, and it is strong evidence for a related function

c) Sequence-based methods of gene prediction can be augmented by the types of genome comparisons that are designed to identify related genes based on common patterns of expression, evolutionary profiles, chromosomal locations, and other features

d) Genome analysis depends to a large extent on sequence analysis methods that identify gene function based on similarity between proteins of unknown function and proteins of known function

View Answer

Answer: b

Explanation: Orthologous genes between biologically distinct species (for example, yeast and fruit flies) can be identified, and the high sequence similarity between them is strong evidence for a related function. Given the more complex multicellular biology of flies, the fly gene could have an additional function that is not predictable by the yeast model. In other cases, the occurrence of families of paralogous genes that share common domains can make a precise guess of function of one of these proteins more difficult because all match a model protein to some degree.

5. In case of functional genomics– Two general types of approaches are used—one in which a genetic construct is made that interferes with the expression of a particular gene (and sometimes a set of related genes) and a second in which a large number of random mutations are generated in a population of organisms.

a) True

b) False

View Answer

Answer: a

Explanation: The individual with a mutation in a particular gene is then identified. Once mutants are obtained, the effect of the mutant genes on phenotype is determined. The gene function may then be predicted on the basis of the observed alterations. Because such extreme genetic experiments cannot be performed with humans, the mouse model for the human genome serves the same purpose.

6. A genome database may also be interfaced with other types of data, such as clinical data.

a) True

b) False

View Answer

Answer: a

Explanation: This type of organization, termed data warehousing, can facilitate the search for novel relationships among the data by data-mining methods. These methods include genetic algorithms, neural networks, and others.

7. The ultimate step in genome analysis is to collect the information found on gene and protein sequences, alignments, gene function and location, protein families and domains, relationships of genes to those in other organisms, chromosomal rearrangements, and so on, into a comprehensive database.

a) True

b) False

View Answer

Answer: a

Explanation: This database should be logically organized so that all types of information are readily accessible and easily retrievable by users who have widely divergent knowledge of the organism. This goal is best achieved by using controlled vocabularies that can identify the same genetic or biochemical function in different organisms without ambiguity.

8. In addition to the care needed in organizing genome databases, a great deal of human input is needed to annotate the genome manually with information.

a) True

b) False

View Answer

Answer: a

Explanation: This information can be about individual genes and proteins, effects of mutations in these genes, and other types of genome variations that cannot be readily incorporated into the database by automated methods. For the human genome, this activity will occupy the time of many scientists for many years to come.

9. In reverse-genetics analysis of gene function, even though a particular gene may be _____ ortholog of a gene of known function in another organism, that gene may have acquired a _____ function.

a) a highly predicted, similar

b) a highly predicted, same

c) a highly predicted, novel

d) less predicted, novel

View Answer

Answer: c

Explanation: For example, a defect in a plant or animal gene that is a homolog of a yeast gene may have an effect on a developmental process or other biologically unique function of multicellular organisms. Information on knockout mutants in model organisms is available through the genome Web sites.

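Databases developed

| S.N. | Database | Developed | Full name | Repository data |
|------|----------|-----------|-----------|-----------------|
| 1 | NCBI | 1988 | National Center for Biotechnology Information | |
| 2 | GenBank | 1992 | | Included in NCBI |
| 3 | DDBJ | | DNA Data Bank of Japan | |
| 4 | PROSITE | | | |
| 5 | PDB | | | |
| 6 | STRING | | | |
| 7 | KAGGLE | | | |
| 8 | GO | | | |
| 9 | EMBL | | | |
| 10 | NLM | 1986 | | |
| 11 | Entrez | 1991 | | |
| 12 | UniProtKB/Swiss-Prot | | | |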

BIOINFORMATICS Multiple Choice Questions

1. The first bioinformatics database was created by

A. Richard Durbin

B. Dayhoff

C. Michael J. Dunn

D. Pearson Answer:- B

2. The Swiss-Prot protein sequence database began in

A. 1985

B. 1986

C. 1987

D. 1988

Answer:- B

3. Which of the following is an example of a homology and similarity tool?

A. PROSPECT

B. EMBOSS

C. RASMOL

D. BLAST Answer:- D

4. Which tool is used for the identification of motifs?

A. COPIA

B. PatternHunter

C. PROSPECT

D. BLAST Answer:- A

5. The first molecular biology server, ExPASy, was launched in the year

A. 1991

B. 1992

C. 1993

D. 1994

Answer:- C

6. Deposition of cDNA onto an inert structure is called

A. DNA fingerprinting

B. DNA polymerase

C. DNA probes

D. DNA microarrays Answer:- D

7. Human genome contains about

A. 2 billion base pairs

B. 3 billion base pairs

C. 4 billion base pairs

D. 5 billion base pairs Answer:- B

8. The identification of drugs through genomic study is called

A. Genomics

B. Cheminformatics

C. Pharmacogenomics

D. Pharmacogenetics Answer:- C

9. Analysing or comparing the entire genome of a species is

A. Bioinformatics

B. Genomics

C. Proteomics

D. Pharmacogenomics Answer:- B

10. Characterizing molecular components is

A. Genomics

B. Cheminformatics

C. Proteomics

D. Bioinformatics Answer:- D

11. If you were using a proteomics approach to find the cause of a muscle disorder, which of the following techniques might you be using?

a. creating a genomic library

b. sequencing the gene responsible for the disorder

c. developing physical maps from genomic clones

d. determining which environmental factors influence the expression of your gene of interest

e. annotating the gene sequence

Answer:- D

12. Shotgun cloning differs from the clone-by-clone method in which of the following ways?

A. The location of the clone being sequenced is known relative to other clones within the genomic library in shotgun cloning.

B. Genetic markers are used to identify clones in shotgun cloning.

C. Computer software assembles the clones in the clone-by-clone method.

D. The entire genome is sequenced in the clone-by-clone method, but not in shotgun sequencing.

E. No genetic or physical maps of the genome are needed to begin shotgun cloning. Answer:- E

13. CpG islands and codon bias are tools used in eukaryotic genomics to ______.

a. identify open reading frames

b. differentiate between eukaryotic and prokaryotic DNA sequences

c. find regulatory sequences

d. look for DNA-binding domains

e. identify a gene’s function Answer:- A
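Note: The first signal named in question 13 can be illustrated with a small calculation. The sketch below computes the GC content and the observed/expected CpG ratio of a DNA string. The function name `cpg_stats` is hypothetical, and the thresholds mentioned in the comments (GC content above 50% and a ratio above 0.6 over a window of about 200 bp) are commonly quoted rules of thumb, not values taken from this quiz.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA string.

    Hypothetical helper for illustration only. A window is often called
    CpG-island-like when gc_content > 0.5 and the ratio > 0.6 (assumed
    rule-of-thumb thresholds).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))  # CpG-rich toy sequence
    print(cpg_stats("ATATATAT"))  # no C or G at all
```

A real gene finder would slide such a window along the genome and combine the result with other evidence such as codon usage.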

14. As the complexity of an organism increases, all of the following characteristics emerge except ______.

a. the gene density decreases

b. the number of introns increases

c. the gene size increases

d. an increase in the number of chromosomes

e. repetitive sequences are present Answer:- D

15. Gene duplication has been found to be one of the major reasons for genome expansion in eukaryotes. In general, what would be the selective advantage of gene duplication?

a. If one gene copy is non-functional, a backup is available.

b. Larger genomes are more resistant to spontaneous mutations.

c. Duplicated genes will make more of the protein product.

d. Gene duplication will lead to new species evolution. Answer:- A

16. How are so many different antibodies produced from fewer than 300 major genes?

A. gene duplication

B. alternative splicing mechanisms

C. the formation of polyproteins

D. the formation of nonspecific B cells

E. recombination, deletions, and random assortment of DNA segments Answer:- E

17. Two-dimensional gels are used to ______.

A. separate DNA fragments

B. separate RNA fragments

C. separate different proteins

D. observe a protein in two dimensions

E. separate DNA from RNA Answer:- C

18. What would be a likely explanation for the existence of pseudogenes?

a. gene duplication

b. gene duplication and mutation events

c. mutation events

d. unequal crossing over

e. evolutionary pressure Answer:- B

19. If you enter a set of IUPAC codes into BLAST, you are probably trying to

A. find out whether a certain protein has any role in human disease.

B. search for the genes that are located on the same chromosome as a gene whose sequence you have.

C. find which section of a piece of DNA is transcribed into mRNA.

D. determine the identity of a protein. Answer:- D

20. Your lab partner is using BLAST, and his best E value is 3. This means that

A. he’s found 3 proteins in the database that have the same sequence as his protein.

B. the chance that these similarities arose due to chance is one in 10^3.

C. there would be 3 matches that good in a database of this size by chance alone.

D. the match in amino acid sequences is perfect, except for the amino acids at 3 positions. Answer:- C
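Note: The reading of the E value in question 20 can be checked numerically. The sketch below uses the Karlin-Altschul form E = K·m·n·e^(−λS) and the conversion P = 1 − e^(−E). The function names are hypothetical, and the default `k` and `lam` are illustrative parameters assumed for the example, not values that BLAST will necessarily use for a given search.

```python
import math


def expected_hits(score, query_len, db_len, k=0.041, lam=0.267):
    """E value: number of hits scoring >= `score` expected by chance.

    Uses E = K * m * n * exp(-lambda * S). The defaults for K and lambda
    are assumed, illustrative values.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_at_least_one(e_value):
    """Probability of seeing at least one chance hit, given an E value."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # An E value of 3 means about 3 chance matches of that quality are
    # expected, so a chance hit is very likely.
    print(round(p_at_least_one(3.0), 3))
```

Because E scales with the database length, the same alignment score becomes less significant as the database grows.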

21. You do a BLAST search on a DNA sequence and it identifies it as ‘Exon 1’ of a certain gene. An exon is

A. a section of a eukaryotic gene that is translated into protein.

B. a section of a eukaryotic gene that is NOT translated into protein.

C. a regulatory sequence that turns genes on and off.

D. DNA that has no genetic role, but does maintain the physical structure of a chromosome. Answer:- A

22. You see that your lab partner is staring at a colorful Swiss-Prot page. He’s probably trying to

A. translate a DNA segment into protein.

B. find out structural and functional information about a protein he’s identified.

C. determine how many harmful mutations have been reported in a certain gene.

D. identify an amino acid sequence. Answer:- B

23. Your TA tells you to go to the NCBI Human Genome page. What does she probably want you to do?

A. Determine what genes are around ‘your’ protein’s gene on its chromosome.

B. Identify a DNA sequence and see if it came from a human.

C. Look up papers about diseases caused by abnormalities in a certain protein.

D. Look at colourful, rotating, 3-D pictures of the tertiary structure of a protein. Answer:- A

24. Many scientists are very interested in studying mitochondrial DNA because it

A. is only present in vertebrates closely related to humans.

B. replicates by synthesizing an mRNA that then acts as a DNA polymerase.

C. contains over 50% of the genes in the human genome.

D. mutates rapidly and allows us to study evolution over short time scales. Answer:- D

25. A single piece of information in a database is called

A. File

B. Field

C. Record

D. Data set Answer:- B

26. Which of the following is a nucleotide sequence database?

A. EMBL

B. Swiss-Prot

C. PROSITE

D. TrEMBL Answer:- A

26. Operating system is

A. A collection of hardware components

B. A collection of input-output devices

C. A collection of software routines

D. All of the above Answer:- C

27. A database of the current sequence map of the human genome is called

A. OMIM

B. HGMD

C. Golden Path

D. GeneCards Answer:- C

28. BLAST programme is used in

A. DNA sequencing

B. Amino acid sequencing

C. DNA bar coding

D. Bioinformatics Answer:- D

29. Swiss-Prot is related to

A. Portable data

B. Swiss Bank data

C. Sequence data bank

D. Sequence sequence data Answer:- C

30. BLOSUM matrices are used for

A. Multiple sequence alignment

B. Pairwise sequence alignment

C. Phylogenetic analysis

D. All of the above Answer:- B
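Note: To show how a substitution matrix enters pairwise alignment scoring, the sketch below sums per-position scores for two already-aligned, ungapped sequences. The dictionary is a small excerpt intended to match BLOSUM62 and should be checked against the published matrix before any real use. The names `BLOSUM62_EXCERPT`, `pair_score`, and `score_alignment` are hypothetical.

```python
# Small excerpt of substitution scores (assumed to match BLOSUM62).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("I", "L"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2, ("W", "W"): 11,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a residue pair, treating the matrix as symmetric."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]  # raises KeyError for pairs outside the excerpt


def score_alignment(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Score two aligned, ungapped sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # A/A + I/L + K/R + W/W = 4 + 2 + 2 + 11
    print(score_alignment("AIKW", "ALRW"))  # 19
```

Conservative substitutions such as I/L or K/R still score positively, which is why the second sequence scores close to an identical match. A full aligner would add gap penalties and dynamic programming on top of this lookup.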

31. Phylogenetic relationship can be shown by

A. Dendrogram

B. GenBank

C. Data retrieving tool

D. Data search tool Answer:- A

32. PRINTS is software used for

A. detection of genes from genome sequence

B. detection of tRNA genes

C. prediction of function of a new gene

D. identification of functional domains/motifs of proteins Answer:- D

33. The term bioinformatics was coined by

A. J D Watson

B. Margaret Dayhoff

C. Pauline Hogeweg

D. Frederic Sanger Answer:- C

34. Margaret Dayhoff developed the first protein sequence database called

a) Swiss-Prot

b) PDB

c) Atlas of Protein Sequence and Structure

d) Protein sequence databank Answer:- C

35. A stepwise method for solving problems in computer science is called

a) flowchart

b) sequential design

c) procedure

d) algorithm Answer:- D

36. The first published complete genome sequence was of

a) M13 phage

b) T4 phage

c) φX174

d) lambda phage Answer:-C

37. The term used to refer to something ‘performed on a computer or by computer simulation’ is

a) dry lab

b) web lab

c) in vitro

d) in silico Answer:-D

38. ‘Laboratory work using chemicals, drugs, etc., and water’ is referred to as

a) dry lab

b) web lab

c) wet lab

d) in silico Answer:-C

39. ‘Laboratory work using computers and computer-generated models, generally offline’ is referred to as

a) dry lab

b) web lab

c) wet lab

d) in silico Answer:-A

40. ‘Laboratory work using computers and associated web-based analysis, generally online’ is referred to as

a) dry lab

b) web lab

c) wet lab

d) in silico Answer:-B

41. ‘In vitro’ in Latin means

a) within the glass

b) within the lab

c) outside the lab

d) outside the glass Answer:-A

42. Applications of bioinformatics include

a) data storage and management

b) drug designing

c) understanding relationships between organisms

d) all of the above Answer:-D

43. The computational methodology that tries to find the best match between two molecules, a receptor and a ligand, is called

a) molecular matching

b) molecular docking

c) molecular fitting

d) molecule affinity checking Answer:-B

44. Proteomics is the study of

a) set of proteins

b) set of proteins in a specific region of the cell

c) entire set of expressed proteins in a cell

d) none of these Answer:-C

45. The process of finding the relative location of genes on a chromosome is called

a) gene tracing

b) genome mapping

c) genome walking

d) chromosome walking Answer:-B
