I HISTORY OF BIOINFORMATICS
Bioinformatics is an interdisciplinary field that develops methods and software tools
for understanding biological data. It combines computer science, statistics, mathematics,
and engineering to analyze and interpret biological data, and it is used for in silico analyses
of biological questions using mathematical and statistical techniques. Bioinformatics derives
knowledge from computer analysis of biological data, which can consist of the information
stored in the genetic code as well as experimental results from various sources, patient
statistics, and the scientific literature. Research in bioinformatics includes method development
for the storage, retrieval, and analysis of such data. Bioinformatics is a rapidly developing
branch of biology and is highly interdisciplinary, using techniques and concepts from
informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has
many practical applications in different areas of biology and medicine.
Bioinformatics: Research, development, or application of computational tools and approaches
for expanding the use of biological, medical, behavioral or health data, including those to
acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation techniques to the study of
biological, behavioral, and social systems.
"Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to
solve biological problems using DNA and amino acid sequences and related information.”
The National Center for Biotechnology Information (NCBI, 2001) defines
bioinformatics as: "Bioinformatics is the field of science in which biology, computer science,
and information technology merge into a single discipline. There are three important
sub-disciplines within bioinformatics: the development of new algorithms and statistics with which
to assess relationships among members of large data sets; the analysis and interpretation of
various types of data including nucleotide and amino acid sequences, protein domains, and
protein structures; and the development and implementation of tools that enable efficient access
and management of different types of information."
Even though the three terms bioinformatics, computational biology and bioinformation
infrastructure are often used interchangeably, broadly, the three may be defined as
follows:
1. bioinformatics refers to database-like activities, involving persistent sets of data that are
maintained in a consistent state over essentially indefinite periods of time;
2. computational biology encompasses the use of algorithmic tools to facilitate biological
analyses; while
3. bioinformation infrastructure comprises the entire collective of information management
systems, analysis tools and communication networks supporting biology. Thus, the latter may
be viewed as a computational scaffold of the former two.
There are three important sub-disciplines within bioinformatics:
• the development of new algorithms and statistics with which to assess
relationships among members of large data sets;
• the analysis and interpretation of various types of data including nucleotide
and amino acid sequences, protein domains, and protein structures;
• and the development and implementation of tools that enable efficient access
and management of different types of information.
Bioinformatics definition - other sources
• Bioinformatics or computational biology is the use of mathematical and informational
techniques, including statistics, to solve biological problems, usually by creating or
using computer programs, mathematical models or both. One of the main areas of
bioinformatics is the data mining and analysis of the data gathered by the various
genome projects. Other areas are sequence alignment, protein structure prediction,
systems biology, protein-protein interactions and virtual evolution. (source:
www.answers.com)
• Bioinformatics is the science of developing computer databases and algorithms for the
purpose of speeding up and enhancing biological research. (source: www.whatis.com)
• "Biologists using computers, or the other way around. Bioinformatics is more of a tool
4
than a discipline.(source: An Understandable Definition of Bioinformatics , The
O'Reilly Bioinformatics Technology Conference, 2003) (4)
• The application of computer technology to the management of biological information.
Specifically, it is the science of developing computer databases and algorithms to
facilitate and expedite biological research.(source: Webopedia)
• Bioinformatics: a combination of Computer Science, Information Technology and
Genetics to determine and analyze genetic information. (Definition from
BitsJournal.com)
• Bioinformatics is the application of computer technology to the management and
analysis of biological data. The result is that computers are being used to gather, store,
analyse and merge biological data.(EBI - 2can resource)
• Bioinformatics is concerned with the creation and development of advanced
information and computational technologies to solve problems in biology.
• Bioinformatics uses techniques from informatics, statistics, molecular biology and
high-performance computing to obtain information about genomic or protein
sequence data.
Bioinformaticist versus Bioinformatician
A bioinformaticist is an expert who not only knows how to use bioinformatics tools,
but also knows how to write interfaces for effective use of the tools.
A bioinformatician, on the other hand, is a trained individual who only knows how to use
bioinformatics tools without a deeper understanding.
Aims of Bioinformatics
In general, the aims of bioinformatics are three-fold.
1. The first aim of bioinformatics is to store biological data in an organized form, i.e., a
database. This gives researchers easy access to existing information and lets them submit
new entries. The data must be annotated to give them a suitable meaning or to assign
functional characteristics, and the databases must be able to correlate different
hierarchies of information. Examples include GenBank for nucleotide and protein
sequence information and the Protein Data Bank for 3D macromolecular structures.
2. The second aim is to develop tools and resources that aid in the analysis of data.
Examples include BLAST for finding similar nucleotide or amino-acid sequences,
ClustalW for aligning two or more nucleotide or amino-acid sequences, and Primer3
for designing primers and probes for PCR techniques.
3. The third and most important aim of bioinformatics is to use these computational
tools to analyze the biological data and to interpret the results in a biologically meaningful
manner.
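To make the idea of sequence comparison concrete, here is a minimal, self-contained sketch of global pairwise alignment scoring in the Needleman-Wunsch style, the kind of comparison that tools such as BLAST and ClustalW perform on a much larger scale with far more sophisticated scoring schemes and heuristics. The example sequences and score values are illustrative assumptions only, not the defaults of any named tool.

```python
# Minimal sketch of global pairwise alignment scoring (Needleman-Wunsch style).
# All scoring values below are illustrative assumptions, not any tool's defaults.

def global_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Fill the dynamic-programming matrix and return the optimal alignment score."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialise the first row and column with cumulative gap penalties.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap

    # Each cell takes the best of a diagonal (match/mismatch) step or a gap step.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

if __name__ == "__main__":
    # A higher score indicates greater similarity between the two sequences.
    print(global_alignment_score("GATTACA", "GATCACA"))
```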
Goals
The goal of bioinformatics is thus to provide scientists with a means to explain:
1. Normal biological processes
2. Malfunctions in these processes which lead to diseases
3. Approaches to improving drug discovery
To study how normal cellular activities are altered in different disease states, the biological
data must be combined to form a comprehensive picture of these activities. Therefore, the field
of bioinformatics has evolved such that the most pressing task now involves the analysis and
interpretation of various types of data. This includes nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of analyzing and interpreting data
is referred to as computational biology.
Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to,
use and management of, various types of information
• Development of new algorithms (mathematical formulas) and statistical measures that
assess relationships among members of large data sets. For example, there are methods to
locate a gene within a sequence, to predict protein structure and/or function, and to
cluster protein sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes.
What sets it apart from other approaches, however, is its focus on developing and applying
computationally intensive techniques to achieve this goal. Examples include: pattern
recognition, data mining, machine learning algorithms, and visualization. Major research
efforts in the field include sequence alignment, gene finding, genome assembly, drug design,
drug discovery, protein structure alignment, protein structure prediction, prediction of gene
expression and protein–protein interactions, genome-wide association studies, the modeling of
evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms,
computational and statistical techniques, and theory to solve formal and practical problems
arising from the management and analysis of biological data.
Tools: Used in three areas
• Molecular Sequence Analysis
• Molecular Structural Analysis
• Molecular Functional Analysis
Over the past few decades, rapid developments in genomic and other molecular research
technologies and developments in information technologies have combined to produce a
tremendous amount of information related to molecular biology.
Bioinformatics is the name
given to these mathematical and computing approaches used to glean understanding of
biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein
sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-
D models of protein structures.
Bioinformatics encompasses the use of tools and techniques from three separate disciplines:
molecular biology (the source of the data to be analyzed), computer science (which supplies the
hardware for running analyses and the networks for communicating the results), and the
data-analysis algorithms which strictly define bioinformatics. For this reason, the editors have
decided to incorporate events from these areas into a brief history of the field.
A SHORT HISTORY OF BIOINFORMATICS
1933 A new technique, electrophoresis, is introduced by Tiselius for separating proteins
in solution.
1951 Pauling and Corey propose the structures for the alpha-helix and beta-sheet (Proc.
Natl. Acad. Sci. USA, 37: 205-211, 1951; Proc. Natl. Acad. Sci. USA, 37: 729-740,
1951).
1953 Watson and Crick propose the double helix model for DNA based on x-ray data
obtained by Franklin and Wilkins (Nature, 171: 737-738, 1953).
1954 Perutz's group develop heavy atom methods to solve the phase problem in protein
crystallography.
1955 The sequence of the first protein to be analyzed, bovine insulin, is announced by
F. Sanger.
1969 The ARPANET is created by linking computers at Stanford and UCLA.
1970 The details of the Needleman-Wunsch algorithm for sequence comparison are
published.
1972 The first recombinant DNA molecule is created by Paul Berg and his group.
1973 The Brookhaven Protein Data Bank is announced (Acta. Cryst. B, 1973, 29:
1746).
Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes
Ethernet.
1974 Vint Cerf and Robert Kahn develop the concept of connecting networks of
computers into an "internet" and develop the Transmission Control Protocol (TCP).
1975 Microsoft Corporation is founded by Bill Gates and Paul Allen.
Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide
gel is combined with separation according to isoelectric points, is announced by P. H.
O'Farrell (J. Biol. Chem., 250: 4007-4021, 1975).
E. M. Southern published the experimental details for the Southern Blot technique of
specific sequences of DNA (J. Mol. Biol., 98: 503-517, 1975).
1977 The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is
published (Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.;
Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M.J.; J. Mol. Biol., 1977, 112:,
535).
Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical
Research Council), report methods for sequencing DNA.
1980 The first complete genome sequence for an organism (the bacteriophage ΦX174) is
published. The genome consists of 5,386 base pairs coding for nine proteins.
Wüthrich and co-workers publish a paper detailing the use of multi-dimensional NMR for
protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys.
Res. Comm., 1980, 95: 1).
IntelliGenetics, Inc. founded in California. Their primary product is the IntelliGenetics
Suite of programs for DNA and protein sequence analysis.
1981 The Smith-Waterman algorithm for sequence alignment is published.
IBM introduces its Personal Computer to the market.
1982 The Genetics Computer Group (GCG) is created as part of the University of Wisconsin
Biotechnology Center. The company's primary product is the Wisconsin Suite of molecular
biology tools.
1983 The Compact Disk (CD) is launched.
1984 Jon Postel's Domain Name System (DNS) is placed on-line.
The Macintosh is announced by Apple Computer.
1985 The FASTP algorithm is published.
The polymerase chain reaction (PCR) is described by Kary Mullis and co-workers.
1986 The term "genomics" appears for the first time to describe the scientific
discipline of mapping, sequencing, and analyzing genes. The term is coined by
Thomas Roderick as the name for a new journal.
Amoco Technology Corporation acquires IntelliGenetics.
NSFnet debuts.
The SWISS-PROT database is created by the Department of Medical Biochemistry of
the University of Geneva and the European Molecular Biology Laboratory (EMBL).
1987 The use of yeast artificial chromosomes (YACs) is described (David T. Burke, et
al., Science, 236: 806-812).
The physical map of E. coli is published (Y. Kohara, et al., Cell, 51: 319-337).
1988 The National Center for Biotechnology Information (NCBI) is established at the
National Library of Medicine.
The Human Genome Initiative is started (Commission on Life Sciences, National
Research Council. Mapping and Sequencing the Human Genome, National Academy
Press: Washington, D.C.), 1988.
The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
The Morris worm, an Internet computer program written by a student, infects an estimated
6,000 computers in the US.
1989 The Genetics Computer Group (GCG) becomes a private company.
Oxford Molecular Group, Ltd. (OMG) founded in Oxford, UK by Anthony
Marchington, David Ricketts, James Hiddleston, Anthony Rees, and W. Graham
Richards. Primary products: Anaconda, Asp, Cameleon and others (molecular
modeling, drug design, protein design).
1990 The BLAST program (Altschul et al.) is implemented.
Molecular Applications Group is founded in California by Michael Levitt and Chris
Lee. Their primary products are Look and SegMod which are used for molecular
modeling and protein design.
InforMax is founded in Bethesda, MD. The company's products address sequence
analysis, database and data management, searching, publication graphics, clone
construction, mapping and primer design.
1991 The research institute in Geneva (CERN) announces the creation of the
protocols which make up the World Wide Web.
The creation and use of expressed sequence tags (ESTs) is described (J. Craig Venter,
et al., Science, 252: 1651-1656).
Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California,
is formed.
Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the
discovery of major common human disease genes and their related pathways. The
Company has discovered and sequenced, with its academic collaborators, the following major
genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2, CtIP, p16, p19, and MTS2.
1992 Human Genome Sciences, Gaithersburg, Maryland, is formed by William
Haseltine.
The Institute for Genomic Research (TIGR) is established by Craig Venter.
Genome Therapeutics announces its incorporation.
Mel Simon and coworkers announce the use of BACs for cloning.
1993 CuraGen Corporation is formed in New Haven, CT.
Affymetrix begins independent operations in Santa Clara, California
1994 Netscape Communications Corporation is founded and releases Navigator, the
commercial version of NCSA's Mosaic.
Gene Logic is formed in Maryland.
The PRINTS database of protein motifs is published by Attwood and Beck.
Oxford Molecular Group acquires IntelliGenetics.
1995 The Haemophilus influenzae genome (1.8 Mb) is sequenced.
The Mycoplasma genitalium genome is sequenced.
1996 Oxford Molecular Group acquires the MacVector product from Eastman
Kodak.
The genome for Saccharomyces cerevisiae (baker's yeast, 12.1 Mb) is sequenced.
The Prosite database is reported by Bairoch et al.
Affymetrix produces the first commercial DNA chips.
1997 The genome for E. coli (4.7 Mbp) is published.
Oxford Molecular Group acquires the Genetics Computer Group.
LION bioscience AG founded as an integrated genomics company with strong focus on
bioinformatics. The company is built from IP out of the European Molecular Biology
Laboratory (EMBL), the European Bioinformatics Institute (EBI), the German Cancer
Research Center (DKFZ), and the University of Heidelberg.
Paradigm Genetics Inc., a company focused on the application of genomic
technologies to enhance worldwide food and fiber production, is founded in Research
Triangle Park, NC.
deCode genetics publishes a paper describing the location of the FET1 gene, which
is responsible for familial essential tremor, on chromosome 13 (Nature Genetics).
1998 The genomes for Caenorhabditis elegans and baker's yeast are published.
The Swiss Institute of Bioinformatics is established as a non-profit foundation.
Craig Venter forms Celera in Rockville, Maryland.
PE Informatics was formed as a Center of Excellence within PE Biosystems. This
center brings together and leverages the complementary expertise of PE Nelson and
Molecular Informatics, to further complement the genetic instrumentation expertise of
Applied Biosystems.
Inpharmatica, a new Genomics and Bioinformatics company, is established by
University College London, the Wolfson Institute for Biomedical Research, five
leading scientists from major British academic centers and Unibio Limited.
GeneFormatics, a company dedicated to the analysis and prediction of protein structure
and function, is formed in San Diego.
Molecular Simulations Inc. is acquired by Pharmacopeia.
1999 deCode genetics maps the gene linked to pre-eclampsia as a locus on
chromosome 2p13.
2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is published.
The A. thaliana genome (100 Mb) is sequenced.
The D. melanogaster genome (180 Mb) is sequenced.
Pharmacopeia acquires Oxford Molecular Group.
2001 The human genome (3,000 Mbp) is published.
2002 The Chang Gung Genomic Research Center is established, comprising a
Bioinformatics Center, a Proteomics Center and a Microarray Center.
Figure 1: Key milestones in the history of bioinformatics (timeline, 1950-2020).
Applications
Bioinformatics joins mathematics, statistics, computer science and information technology
to solve complex biological problems, usually at the molecular level, that cannot be solved
by other means. This field of science has many applications and research areas where it
can be applied.
All bioinformatics applications operate at the user level: biologists, including students at
various levels, can use these applications and apply the output in their research or study.
The various bioinformatics applications can be categorized under the following groups:
Sequence Analysis
Function Analysis
Structure Analysis
Figure 2: Categories of bioinformatics applications.
Sequence Analysis: All applications that analyze various types of sequence information,
and that can compare similar types of information, are grouped under sequence analysis.
Function Analysis: These applications analyze the function encoded within sequences and
help predict functional interactions between various proteins or genes. Expression analysis
of various genes is also a prime research topic these days.
Structure Analysis: In the realm of RNA and proteins, structure plays a vital role in the
interaction with other molecules. This gave birth to a whole new branch, termed
structural bioinformatics, which is devoted to predicting the structures of proteins and
RNA and the possible roles of these structures.
Sequence Analysis:
Sequence analysis uses sequencing information to determine which genes encode regulatory
sequences or peptides. Many powerful tools and computers perform the task of analyzing the
genomes of various organisms; they also detect DNA mutations in an organism and identify
related sequences. Shotgun sequencing techniques are used to obtain the sequences of
numerous fragments of DNA, and special software is then used to detect overlapping
fragments and assemble them.
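As an illustration of the overlap step that shotgun assembly software performs, the hedged sketch below finds the longest exact suffix-prefix overlap between two DNA fragments and merges them. Real assemblers use indexing structures and error-tolerant matching; the fragments and the minimum overlap length used here are purely illustrative.

```python
# Toy illustration of fragment overlap detection in shotgun sequence assembly.
# Exact matching only; real assemblers tolerate sequencing errors and use indexes.

def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0

def merge_fragments(left, right, min_len=3):
    """Merge two fragments if they overlap sufficiently, otherwise return None."""
    overlap = longest_overlap(left, right, min_len)
    return left + right[overlap:] if overlap else None

if __name__ == "__main__":
    a, b = "ATGGCGTACGT", "TACGTTAGGCA"     # illustrative fragments
    print(longest_overlap(a, b))            # 5 ("TACGT")
    print(merge_fragments(a, b))            # ATGGCGTACGTTAGGCA
```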
Prediction of Protein Structure:
The primary structure of a protein, the amino acid sequence encoded by the DNA, is easy to
determine, but it is difficult to determine the secondary, tertiary or quaternary structures. For
this purpose either the method of crystallography or the tools of bioinformatics can be used
to determine complex protein structures.
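Before any prediction is attempted, a common first step is simply to read an experimentally solved structure from the Protein Data Bank and inspect it. The sketch below does this with Biopython's Bio.PDB module, assuming Biopython is installed and that a PDB file named example.pdb has been downloaded locally; both the identifier and the file name are placeholders.

```python
# Hedged sketch: parse a locally downloaded PDB file with Biopython and count
# the residues in each chain. "example.pdb" is a placeholder file name.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)                      # suppress parser warnings
structure = parser.get_structure("example", "example.pdb")

for model in structure:                             # a PDB entry may contain several models
    for chain in model:
        n_residues = sum(1 for _ in chain.get_residues())
        print(f"Model {model.id}, chain {chain.id}: {n_residues} residues")
```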
Genome Annotation:
In genome annotation, genomes are marked up to identify protein-coding regions and
regulatory sequences. It is a very important part of the human genome project, as it
identifies the regulatory sequences.
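One very small piece of genome annotation is locating candidate protein-coding regions. The toy sketch below scans the forward strand of a DNA string for open reading frames (ORFs) that start with ATG and end at a stop codon; real annotation pipelines combine statistical gene models, homology evidence and much more, and the example sequence is invented.

```python
# Toy forward-strand ORF scan, a tiny piece of what genome annotation involves.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Yield (start, end) coordinates of simple forward-strand ORFs."""
    dna = dna.upper()
    for frame in range(3):                       # check all three reading frames
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] == START:
                end = pos + 3
                while end + 3 <= len(dna) and dna[end:end + 3] not in STOPS:
                    end += 3
                if end + 3 <= len(dna) and (end - pos) // 3 >= min_codons:
                    yield pos, end + 3           # include the stop codon
                pos = end                        # skip past this ORF
            pos += 3

if __name__ == "__main__":
    print(list(find_orfs("CCATGGCTGCTTAAGGGATGAAACCCTGA")))  # invented sequence
```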
Comparative Genomics:
Comparative genomics is the branch of bioinformatics that determines the relationship between
genomic structure and function across different biological species. For this purpose,
intergenomic maps are constructed which enable scientists to trace the evolutionary processes
that occur in the genomes of different species. These maps contain information about point
mutations as well as about duplications of large chromosomal segments.
Health and Drug Discovery:
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease
management. Complete sequencing of the human genome has enabled scientists to make
medicines and drugs which can target more than 500 genes. Different computational tools and
drug targets have made drug delivery easier and more specific, because now only those cells
which are diseased or mutated can be targeted. It is also easier to determine the molecular
basis of a disease.
Application of Bioinformatics in various Fields
Molecular medicine
The human genome will have profound effects on the fields of biomedical research and clinical
medicine. Every disease has a genetic component. This may be inherited (as is the case with an
estimated 3000-4000 hereditary diseases, including cystic fibrosis and Huntington's disease) or
a result of the body's response to an environmental stress which causes alterations in the
genome (e.g. cancers, heart disease, diabetes). The completion of the human genome sequence
means that we can search for the genes directly associated with different diseases and begin to
understand their molecular basis more clearly. This new knowledge of the molecular
mechanisms of disease will enable better treatments, cures and even preventative tests to be
developed.
Personalised medicine
Clinical medicine will become more personalised with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the
body's response to drugs. At present, some drugs fail to make it to the market because a small
percentage of the clinical patient population shows adverse effects to a drug due to sequence
variants in their DNA. As a result, potentially life-saving drugs never make it to the
marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular
patient as those with the same clinical symptoms can show a wide range of responses to the
same treatment. In the future, doctors will be able to analyse a patient's genetic profile and
prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unravelled, the
development of diagnostic tests to measure a person's susceptibility to different diseases may
become a distinct reality. Preventative actions, such as changes of lifestyle or having treatment
at the earliest possible stage when it is more likely to be successful, could result in huge
advances in our struggle to conquer disease.
Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may
become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by
changing the expression of a person's genes. Currently, this field is in its infancy, with
clinical trials for many different types of cancer and other diseases ongoing.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved
understanding of disease mechanisms and using computational tools to identify and validate
new drug targets, more specific medicines that act on the cause, not merely the symptoms, of
the disease can be developed. These highly specific drugs promise to have fewer side effects
than many of today's medicines.
Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found
surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are
present in the environment, our bodies, the air, food and water. Traditionally, use has been
made of a variety of microbial properties in the baking, brewing and food industries. The arrival
of the complete genome sequences and their potential to provide a greater insight into the
microbial world and its capacities could have broad and far reaching implications for
environment, health, energy and industrial applications. For these reasons, in 1994, the US
Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence
genomes of bacteria useful in energy production, environmental cleanup, industrial processing
and toxic waste reduction. By studying the genetic material of these organisms, scientists can
begin to understand these microbes at a very fundamental level and isolate the genes that give
them their unique abilities to survive under extreme conditions.
Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium, and it is the most
radiation-resistant organism known. Scientists are interested in this organism because of its
potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals.
Climate change Studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels
for energy, are thought to contribute to global climate change. Recently, the DOE (Department
of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One
method of doing so is to study the genomes of microbes that use carbon dioxide as their sole
carbon source.
Alternative energy sources
Scientists are studying the genome of the microbe Chlorobium tepidum, which has an unusual
capacity for generating energy from light.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential
for practical applications in industry and government-funded environmental remediation.
These microorganisms thrive in water temperatures above the boiling point and therefore may
provide the DOE, the Department of Defence, and private companies with heat-stable enzymes
suitable for use in industrial processes. Other industrially useful microbes include
Corynebacterium glutamicum which is of high industrial interest as a research object because
it is used by the chemical industry for the biotechnological production of the amino acid lysine.
The substance is employed as a source of protein in animal nutrition. Lysine is one of the
essential amino acids in animal nutrition. Biotechnologically produced lysine is added to feed
concentrates as a source of protein, and is an alternative to soybeans or meat and bonemeal.
Xanthomonas campestris pv. is grown commercially to produce the exopolysaccharide xanthan
gum, which is used as a viscosifying and stabilising agent in many industries. Lactococcus
lactis is one of the most important micro-organisms involved in the dairy industry. It is a non-pathogenic bacterium that is critical for manufacturing dairy products like
buttermilk, yogurt and cheese. This bacterium, Lactococcus lactis ssp., is also used to prepare
pickled vegetables, beer, wine, some breads and sausages and other fermented foods.
Researchers anticipate that understanding the physiology and genetic make-up of this
bacterium will prove invaluable for food manufacturers as well as the pharmaceutical industry,
which is exploring the capacity of L. lactis to serve as a vehicle for delivering drugs.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of
bacterial infection among hospital patients. They have discovered a virulence region made up
of a number of antibiotic-resistance genes that may contribute to the bacterium's transformation
from a harmless gut bacterium to a menacing invader. The discovery of the region, known as a
pathogenicity island, could provide useful markers for detecting pathogenic strains and help to
establish controls to prevent the spread of infection in hospital wards.
Forensic analysis of microbes
Scientists used genomic tools to help distinguish the strain of Bacillus anthracis used in the
2001 terrorist attacks in Florida from closely related anthrax strains.
The reality of bioweapon creation
Scientists have recently built the poliovirus using entirely artificial means. They did
this using genomic data available on the Internet and materials from a mail-order chemical
supply. The research was financed by the US Department of Defence as part of a biowarfare
response program to prove to the world the reality of bioweapons. The researchers also hope
their work will discourage officials from ever relaxing programs of immunisation. This project
has been met with very mixed feelings.
Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea
means that evolutionary studies can be performed in a quest to determine the tree of life and
the last universal common ancestor.
Crop improvement
Comparative genetics of the plant genomes has shown that the organisation of their genes has
remained more conserved over evolutionary time than was previously believed. These findings
suggest that information obtained from model crop systems can be used to suggest
improvements to other food crops. At present the complete genomes of Arabidopsis thaliana
(thale cress) and Oryza sativa (rice) are available.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been
successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist
insect attack means that the amount of insecticides being used can be reduced and hence the
nutritional quality of the crops is increased.
Improve nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin
A, iron and other micronutrients. This work could have a profound impact in reducing
occurrences of blindness and anaemia caused by deficiencies in Vitamin A and iron
respectively. Scientists have inserted a gene from yeast into the tomato, and the result is a plant
whose fruit stays longer on the vine and has an extended shelf life.
Development of drought-resistant varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil
alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to succeed
in poorer soil areas, thus adding more land to the global production base. Research is also in
progress to produce crop varieties capable of tolerating reduced water conditions.
Veterinary Science
Sequencing projects of many farm animals including cows, pigs and sheep are now well under
way in the hope that a better understanding of the biology of these organisms will have huge
impacts for improving the production and health of livestock and ultimately have benefits for
human nutrition.
Comparative Studies
Analysing and comparing the genetic material of different species is an important method for
studying the functions of genes, the mechanisms of inherited diseases and species evolution.
Bioinformatics tools can be used to make comparisons between the numbers, locations and
biochemical functions of genes in different organisms.
Organisms that are suitable for use in experimental research are termed model organisms. They
have a number of properties that make them ideal for research purposes, including short life
spans, rapid reproduction, ease of handling, low cost, and the ability to be manipulated at
the genetic level.
An example of a human model organism is the mouse. Mouse and human are very closely
related (>98%) and for the most part we see a one to one correspondence between genes in the
two species. Manipulation of the mouse at the molecular level, together with genome
comparisons between the two species, is revealing detailed information on the functions of
human genes, the evolutionary relationship between the two species and the molecular
mechanisms of many human diseases.
Table 1
Definitions of Fields Related to Bioinformatics
Bioinformatics has various applications in research in medicine, biotechnology, agriculture,
etc. The following research fields have bioinformatics as an integral component:
1. Computational Biology: The development and application of data-analytical and
theoretical methods, mathematical modeling and computational simulation techniques
to the study of biological, behavioral, and social systems.
2. Genomics: Genomics is any attempt to analyze or compare the entire genetic
complement of a species or of several species. It is, of course, possible to compare
genomes by comparing more-or-less representative subsets of genes within genomes.
3. Proteomics: Proteomics is the study of proteins - their location, structure and
function. It is the identification, characterization and quantification of all proteins
involved in a particular pathway, organelle, cell, tissue, organ or organism that can be
studied in concert to provide accurate and comprehensive data about that system.
Proteomics is the study of the function of all expressed proteins. The study of the
proteome, called proteomics, now evokes not only all the proteins in any given cell,
but also the set of all protein isoforms and modifications, the interactions between
them, the structural description of proteins and their higher-order complexes, and,
for that matter, almost everything 'post-genomic'.
4. Pharmacogenomics: Pharmacogenomics is the application of genomic approaches
and technologies to the identification of drug targets. In short, pharmacogenomics is
using genetic information to predict whether a drug will help make a patient well or
sick. It studies how genes influence the response of humans to drugs, from the
population to the molecular level.
5. Pharmacogenetics: Pharmacogenetics is the study of how the actions of and reactions
to drugs vary with the patient's genes. All individuals respond differently to drug
treatments; some positively, others with little obvious change in their conditions and
yet others with side effects or allergic reactions. Much of this variation is known to
have a genetic basis. Pharmacogenetics is a subset of pharmacogenomics which uses
genomic/bioinformatic methods to identify genomic correlates, for example SNPs
(Single Nucleotide Polymorphisms), characteristic of particular patient response
profiles and use those markers to inform the administration and development of
therapies. Strikingly, such approaches have been used to "resurrect" drugs previously
thought to be ineffective but subsequently found to work in a subset of patients, or to
optimize the doses of chemotherapy for particular patients.
6. Cheminformatics:
Chemical informatics: 'Computer-assisted storage, retrieval and analysis of chemical
information, from data to chemical knowledge.' This definition is distinct from
chemoinformatics, which focuses on drug design.
Chemometrics: The application of statistics to the analysis of chemical data (from
organic, analytical or medicinal chemistry) and the design of chemical experiments and
simulations.
Computational chemistry: A discipline using mathematical methods for the calculation of
molecular properties or for the simulation of molecular behavior. It also includes, e.g.,
synthesis planning, database searching and combinatorial library manipulation.
7. Structural genomics or structural bioinformatics refers to the analysis of
macromolecular structure, particularly proteins, using computational tools and
theoretical frameworks. One of the goals of structural genomics is the extension of
the idea of genomics to obtain accurate three-dimensional structural models for all
known protein families, protein domains or protein folds. Structural alignment is a
tool of structural genomics.
8. Comparative genomics: The study of human genetics by comparisons with model
organisms such as mice, the fruit fly, and the bacterium E. coli.
9. Biophysics: The British Biophysical Society defines biophysics as: "an
interdisciplinary field which applies techniques from the physical sciences to
understanding biological structure and function".
10. Biomedical informatics / Medical informatics: "Biomedical Informatics is an
emerging discipline that has been defined as the study, invention, and implementation
of structures and algorithms to improve communication, understanding and
management of medical information."
11. Mathematical Biology: Mathematical biology also tackles biological problems, but
the methods it uses to tackle them need not be numerical and need not be implemented
in software or hardware. It includes things of theoretical interest which are not
necessarily algorithmic, not necessarily molecular in nature, and are not necessarily
useful in analyzing collected data.
12. Computational chemistry: Computational chemistry is the branch of theoretical
chemistry whose major goals are to create efficient computer programs that calculate
the properties of molecules (such as total energy, dipole moment, vibrational
frequencies) and to apply these programs to concrete chemical objects. It is also
sometimes used to cover the areas of overlap between computer science and
chemistry.
13. Functional genomics: Functional genomics is a field of molecular biology that is
attempting to make use of the vast wealth of data produced by genome sequencing
projects to describe genome function. Functional genomics uses high-throughput
techniques like DNA microarrays, proteomics, metabolomics and mutation analysis
to describe the function and interactions of genes.
14. Pharmacoinformatics: Pharmacoinformatics concentrates on the aspects of
bioinformatics dealing with drug discovery.
15. In silico ADME-Tox Prediction: Drug discovery is a complex and risky treasure
hunt to find the most efficacious molecule which does not have toxic effects but at the
same time has the desired pharmacokinetic profile. The hunt starts when researchers
look for the binding affinity of a molecule to its target. A huge amount of research is
required to come up with a molecule which has a reliable binding profile.
Once such molecules have been identified, the traditional methodology is to subject the
molecule to further optimization with the aim of improving efficacy.
The molecules which show better binding are then evaluated for their toxicity and
pharmacokinetic profiles. It is at this stage that most of the candidates fail in the race to
become a successful drug.
16. Agroinformatics / Agricultural informatics: Agroinformatics concentrates on the
aspects of bioinformatics dealing with plant genomes.
INTERNET
The Internet is a global system of interconnected computer networks that use the standard
Internet protocol suite (TCP/IP) to serve billions of users worldwide. It is a network of
networks that consists of millions of private, public, academic, business, and government
networks, of local to global scope, that are linked by a broad array of electronic, wireless and
optical networking technologies. The Internet carries a vast range of information resources and
services, such as the inter-linked hypertext documents of the World Wide Web (WWW) and
the infrastructure to support electronic mail.
Uses of Internet
The Internet has been the most useful technology of modern times, helping us not only in
our daily lives but also in our personal and professional development.
For students and educational purposes, the Internet is widely used to gather information
for research or to add to knowledge of various subjects. Business professionals, and
professionals such as doctors, also access the Internet to filter out the information they
need. The Internet is therefore the largest encyclopedia for everyone, in all age categories.
It is also very useful for maintaining contact with friends and relatives who live abroad.
Advantages of Internet:
E-mail: E-mail is now an essential communication tool in business. With e-mail you can
send and receive instant electronic messages, which works like writing letters. Your messages
are delivered instantly to people anywhere in the world, unlike traditional mail that takes
a lot of time. E-mail is free, fast and very cheap when compared to telephone, fax and postal
services.
24 hours a day, 7 days a week: The Internet is available for use around the clock.
Information: Information is probably the biggest advantage internet is offering. There is a
huge amount of information available on the internet for just about every subject, ranging
from government law and services, trade fairs and conferences, market information, new
ideas and technical support. You can find almost any type of data on almost any subject
you are looking for by using search engines like Google, Yahoo, MSN, etc.
Online Chat: You can access many ‘chat rooms’ on the web that can be used to meet new
people, make new friends, as well as to stay in touch with old friends. You can chat on
MSN and Yahoo websites.
Services: Many services are provided on the internet like net banking, job searching,
purchasing tickets, hotel reservations, and guidance services on an array of topics
covering every aspect of life.
Communities: Communities of all types have sprung up on the Internet. It's a great way
to meet people with similar interests and discuss common issues.
E-commerce: Along with getting information on the Internet, you can also shop
online. There are many online stores and sites that can be used to look for products as
well as buy them using your credit card. You do not need to leave your house and can do
all your shopping from the convenience of your home. There is an amazingly wide range
of products, from household needs and electronics to entertainment.
Entertainment: The Internet provides access to a wide range of audio/video content, songs,
plays and films, many of which can be downloaded. One such popular website is YouTube.
Software Downloads: You can freely download innumerable software programs like
utilities, games, music, videos, movies, etc. from the Internet.
Limitations of Internet
Theft of personal information: Electronic messages sent over the Internet can be easily
snooped and tracked, revealing who is talking to whom and what they are talking about.
If you use the Internet, your personal information such as your name, address, credit card,
bank details and other information can be accessed by unauthorized persons. If you use
a credit card or internet banking for online shopping, then your details can also be ‘stolen’.
Negative effects on family communication: It is generally observed that due to more time
spent on Internet, there is a decrease in communication and feeling of togetherness among the
family members.
Internet addiction: There is some controversy over whether it is possible to actually be
addicted to the Internet or not. Some researchers claim that it is simply people trying to
escape their problems in an online world.
Children's use of the Internet has become a big concern. Most parents do not realize the
dangers involved when their children log onto the Internet. When children talk to others
online, they do not realize they could actually be talking to a harmful person. Moreover,
pornography is also a very serious issue concerning the Internet, especially when it comes to
young children. There are thousands of pornographic sites on the Internet that can be easily
found and are a detriment to letting children use the Internet.
Virus threat: Today, not only are humans getting viruses, but computers are as well.
Computers mainly get these viruses from the Internet. A virus is a program which disrupts the
normal functioning of your computer system. Computers attached to the Internet are more
prone to virus attacks, which can end up crashing your whole hard disk.
Spamming: Spamming is the act of sending unsolicited e-mail. This multiple or
vast e-mailing is often compared to mass junk mailings. It needlessly obstructs the entire
system. Most spam is commercial advertising, often for dubious products, get-rich-quick
schemes, or quasi-legal services. Spam costs the sender very little to send; most of the
costs are paid by the recipient or the carriers rather than by the sender.
SERVICES OF INTERNET - E-mail, FTP, Telnet
Email, discussion groups, long-distance computing, and file transfers are some of the important
services provided by the Internet. Email is the fastest means of communication. With email one
can also send software and certain forms of compressed digital images as attachments. News
groups or discussion groups allow Internet users to take part in various kinds of debate,
discussion and news sharing. Long-distance computing was an original inspiration for the
development of the ARPANET and still provides a very useful service on the Internet.
Programmers can maintain
accounts on distant, powerful computers and execute programs. File transfer service allows
Internet users to access remote machines and retrieve programs, data or text.
E-Mail (Electronic Mail)
E-mail or Electronic mail is a paperless method of sending messages, notes or letters from
one person to another, or even to many people at the same time, via the Internet. E-mail is very
fast compared to normal post: messages usually take only a few seconds to arrive at
their destination. One can send messages at any time of the day or night and they will be
delivered immediately. You need not wait for the post office to open and you don't have
to worry about holidays. It works 24 hours a day and seven days a week. What's more,
the copy of the message you have sent will be available whenever you want to look at it even in
the middle of the night. You have the privilege of sending something extra such as a file,
graphics, images, etc. along with your e-mail. The biggest advantage of using e-mail is that
it is cheap, especially when sending messages to other states or countries, and at the same time
it can be delivered to a number of people around the world.
E-mail allows you to compose a note, get the address of the recipient and send it. Once the mail
is received and read, it can be forwarded, replied to, stored for later use, or deleted. The sender
can even request a delivery receipt and a read receipt from the recipient.
Features of E-mail:
⚫ One-to-one or one-to-many communications
⚫ Instant communications
⚫ Physical presence of recipient is not required
⚫ Most inexpensive mail services, 24-hours a day and seven days a week
⚫ Encourages informal communications
Components of an E-mail Address
As in the case of the normal mail system, e-mail is also based upon the concept of a recipient
address. The e-mail address provides all of the information required to get a message to the
recipient from anywhere in the world. Consider the e-mail ID:
john@hotmail.com
In this example, john is the username of the person who will be sending/receiving the
e-mail, hotmail is the mail server where the username john has been registered, and com is the
type of organization on the Internet which is hosting the mail server.
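The same decomposition can be sketched in a couple of lines of Python, splitting the address at the "@" separator and at the final dot; the address is the illustrative one used above.

```python
# Split an e-mail address into user name, mail server and organization type.
address = "john@hotmail.com"                       # illustrative address from the text

username, _, host = address.partition("@")         # part before / after the "@"
mail_server, _, org_type = host.rpartition(".")    # split the host at its last dot

print(username)      # john    -> user name registered on the mail server
print(mail_server)   # hotmail -> the mail server (domain) name
print(org_type)      # com     -> type of organization hosting the mail server
```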
FTP (File Transfer Protocol)
File Transfer Protocol (FTP) is an Internet utility used to upload and download files.
It gives access to directories or folders on remote computers and allows software, data and
text files to be transferred between different kinds of computers. FTP works on the same
client/server principle as other Internet services: an FTP "client" is a program running on your
computer that enables you to communicate with remote computers. The FTP client takes
FTP commands and sends them as requests for information to remote computers
known as FTP servers. To access a remote FTP server one normally has an account on
that server: when the FTP client connects, the FTP server asks for identification in the form
of the FTP client's login name and password. If one does not have an account on the remote
FTP server, one can still connect to the server using anonymous login.
Using anonymous login, anyone can log in to an FTP server and access public archives
anywhere in the world without having an account. One can easily log in to the FTP site with
the username "anonymous" and an e-mail address as the password.
Objectives of FTP:
⚫ Provide flexibility and promote sharing of computer programs, files and data
⚫ Transfer data reliably and more efficiently over network
⚫ Encourage implicit or indirect use of remote computers using Internet
⚫ Shield a user from variations in storage systems among hosts.
The basic steps in an FTP session
Start up your FTP client by typing ftp at your system's command line/'C>' prompt (or,
if you are in Windows, double-click on the FTP icon).
Give the FTP client an address to connect. This is the FTP server address to which the
FTP client will get connected
Identify yourself to the FTP remote site by giving the Login Name
Give the remote site a password
Remote site will verify the Login Name/ Password to allow the FTP client to access its
files
Look through the directory for files on the FTP server
Change Directories if required
Set the transfer mode (optional);
Get the file(s) you want, and
Quit.
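The steps above can be sketched with Python's standard ftplib module using anonymous login. The host name, directory and file name below are placeholders, so a real, reachable FTP server and an existing file path would have to be substituted before this would run successfully.

```python
# Hedged sketch of an anonymous FTP session using Python's standard ftplib.
# "ftp.example.org", the "pub" directory and "readme.txt" are placeholders.
from ftplib import FTP

ftp = FTP("ftp.example.org")            # connect to the FTP server address
ftp.login()                             # anonymous login (user "anonymous")
print(ftp.nlst())                       # list files in the current directory
ftp.cwd("pub")                          # change directory (assumed to exist)
with open("readme.txt", "wb") as fh:    # transfer a file in binary mode
    ftp.retrbinary("RETR readme.txt", fh.write)
ftp.quit()                              # end the session
```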
Figure 3: FTP client-server model (user interface, FTP client, FTP server system and file
system, and the connection between them).
Telnet (Remote Computing)
Telnet, or remote computing, is telecommunication utility software which uses
available telecommunication facilities and allows you to become a user on a remote
computer. Once you gain access to the remote computer, you can use it for the intended
purpose. TELNET works in a step-by-step procedure: the commands typed on the
client computer are sent to the local Internet Service Provider (ISP), and then from the ISP
to the remote computer to which you have gained access. Most ISPs provide the facility to
telnet into your own account from another city and check your e-mail while you are
travelling or away on business.
The following steps are required for a TELNET session
⚫ Start up the TELNET program
⚫ Give the TELNET program an address to connect (some really nifty TELNET
packages allow you to combine steps 1 and 2 into one simple step)
⚫ Make a note of what the “escape character” is
⚫ Log in to the remote computer,
⚫ Set the “terminal emulation”
⚫ Play around on the remote computer, and
⚫ Quit.
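To illustrate the underlying idea of a remote text session, the sketch below simply opens a TCP connection to a host and reads whatever greeting the server sends back. The host name is a placeholder, port 23 is the traditional Telnet port, and most modern servers no longer accept plain (unencrypted) Telnet, so this is for illustration only.

```python
# Illustrative sketch of the client side of a remote text session:
# open a TCP connection and read the server's greeting banner.
import socket

HOST, PORT = "example.com", 23          # placeholder host; 23 is the classic Telnet port

with socket.create_connection((HOST, PORT), timeout=10) as conn:
    banner = conn.recv(1024)            # read up to 1 KB of the greeting
    print(banner.decode(errors="replace"))
```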
TYPES OF INTERNET CONNECTIONS
There are five types of Internet connections, which are as follows:
(i) Dial up Connection
(ii) Leased Connection
(iii) DSL connection
(iv) Cable Modem Connection
(v) VSAT
Dial up connection
Dial-up refers to an Internet connection that is established using a modem. The modem
connects the computer to standard phone lines, which serve as the data transfer medium.
When a user initiates a dial-up connection, the modem dials a phone number of an Internet
Service Provider (ISP) that is designated to receive dial-up calls. The ISP then establishes the
connection, which usually takes about ten seconds and is accompanied by several beeping
and buzzing sounds. After the dial-up connection has been established, it is active until
the user disconnects from the ISP. Typically, this is done by selecting the “Disconnect”
option using the ISP’s software or a modem utility program. However, if a dial-up
connection is interrupted by an incoming phone call or someone picking up a phone in
the house, the service may also be disconnected.
Advantages
Low Price
Secure connection – your IP address continually changes
Offered in rural areas – you need a phone line
Disadvantages
Slow speed.
Phone line is required.
Busy signals for friends and family members.
Leased Connection
Leased connection is a permanent telephone connection between two points set up by a
telecommunications common carrier. Typically, leased lines are used by businesses to
connect geographically distant offices. Unlike normal dial-up connections, a leased line is
always active. The fee for the connection is a fixed monthly rate. The primary factors
affecting the monthly fee are distance between end points and the speed of the circuit.
Because the connection doesn’t carry anybody else’s communications, the carrier can assure
a given level of quality.
For example, a T-1 channel is a type of leased line that provides a maximum transmission
speed of 1.544 Mbps. You can divide the connection into different lines for data
and voice communication or use the channel for one high-speed data circuit. Dividing
the connection is called multiplexing.
Increasingly, leased lines are being used by companies, and even individuals, for Internet
access because they afford faster data transfer rates and are cost-effective if the Internet is
used heavily.
Advantages
• Secure and private: dedicated exclusively to the customer
• Speed: symmetrical and direct
• Reliable: minimum down time
• Wide choice of speeds: bandwidth on demand, easily upgradeable
• Leased lines are suitable for in-house office web hosting
Disadvantages
• Leased lines can be expensive to install and rent.
• Not suitable for single or home workers
• Lead times can be as long as 65 working days
• Distance dependent to nearest POP
• Leased lines have traditionally been the more expensive access option. A Service
Level Agreement (SLA) confirms an ISP’s contractual requirement in ensuring the
service is maintained. This is often lacking in cheaper alternatives.
DSL connection
Digital Subscriber Line (DSL) is a family of technologies that provides digital data
transmission over the wires of a local telephone network. DSL originally stood for digital
subscriber loop. In telecommunications marketing, the term DSL is widely understood to
mean Asymmetric Digital Subscriber Line (ADSL), the most commonly installed DSL
technology. DSL service is delivered simultaneously with wired telephone service on the
same telephone line. This is possible because DSL uses higher frequency bands for data
separated by filtering. On the customer premises, a DSL filter on each outlet removes the
high frequency interference, to enable simultaneous use of the telephone and data.
The data bit rate of consumer DSL services typically ranges from 256 kbit/s to 40 Mbit/s in
the direction to the customer (downstream), depending on DSL technology, line conditions,
and service-level implementation. In ADSL, the data throughput in the upstream direction
(the direction to the service provider) is lower, hence the designation asymmetric
service. In Symmetric Digital Subscriber Line (SDSL) services, the downstream and
upstream data rates are equal.
Advantages:
Security: Unlike cable modems, each subscriber can be configured so that it will not be
on the same network. In some cable modem networks, other computers on the cable
modem network are left visibly vulnerable and are easily susceptible to break in as well
as data destruction.
Integration: DSL will easily interface with ATM and WAN technology.
High bandwidth
Cheap line charges from the phone company.
Good for “bursty” traffic patterns
Disadvantages
No current standardization: A person moving from one area to another might find that
their DSL modem is just another paperweight. Customers may have to buy new
equipment to simply change ISPs.
Expensive: Most customers are not willing to spend more than $20 to $25 per month
for Internet access. Current installation costs, including the modem, can be as high as
$750. Prices should come down within 1-3 years. As with all computer technology, being
first usually means an emptier wallet.
Distance Dependence: The farther you live from the DSLAM (DSL Access
Multiplexer), the lower the data rate. The longest run lengths are 18,000 feet, or a little
over 3 miles.
Cable Modem Connection
A cable modem is a type of network bridge and modem that provides bi-directional data
communication via radio frequency channels on an HFC or RFoG infrastructure. Cable
modems are primarily used to deliver broadband Internet access in the form of cable
Internet, taking advantage of the high bandwidth of an HFC or RFoG network. They are
commonly deployed in Australia, Europe, Asia and the Americas.
Figure 4
Figure 4 shows the most common network connection topologies when using cable modems.
The cable TV company runs a coaxial cable into the building to deliver their Internet service.
Although fed from the same coax that provides cable TV service, most companies place a
splitter outside of the building and run two cables in, rather than using a splitter at the set-top
box. The coax terminates at the cable modem.
The cable modem itself attaches to the SOHO computing equipment via its 10BASE-T port.
In most circumstances, the cable modem attaches directly to a user’s computer. If a LAN is
present on the premises (something many cable companies frown upon), some sort of router
can be connected to the cable modem.
Advantages
Always Connected: A cable modem connection is always connected to the Internet. This
is advantageous because you do not have to wait for your computer to “log on” to the
Internet; however, this also has the disadvantage of making your computer more
vulnerable to hackers.
Broadband: Cable modems transmit and receive data as digital packets, meaning they
provide high-speed Internet access. This makes cable modem connections much faster than
traditional dial-up connections.
Bandwidth: Cable modems have the potential to receive data from their cable provider
at speeds greater than 30 megabits per second; unfortunately, this speed is rarely ever
realized. Cable lines are shared by all of the cable modem users in a given area; thus, the
connection speed varies depending upon the number of other people using the Internet
and the amount of data they are receiving or transmitting.
File Transfer Capabilities: Downloads may be faster, but uploads are typically slower.
Since the same lines are used to transmit data to and from the modem, priority is often
given to data traveling in one direction.
Signal Integrity: Cable Internet can be transmitted long distances with little signal
degradation. This means the quality of the Internet signal is not significantly decreased
by the distance of the modem from the cable provider.
Routing: Cable routers allow multiple computers to be hooked up to one cable modem, so
several devices can share a single connection. Wireless routers can also be attached to your
cable modem.
Rely on Existing Connections: Cable modems connect directly to preinstalled cable
lines. This is advantageous because you do not need to have other services, such as
telephone or Internet, in order to receive Internet through your cable modem. The
disadvantage is that you cannot have cable internet in areas where there are no cable lines.
Disadvantages
Cable internet technology excels at maintaining signal strength over distance. Once it
is delivered to a region, however, such as a neighborhood, it is split among that region's
subscribers. While increased capacity has diminished the effect somewhat, it is
still possible that users will see significantly lower speeds at peak times when more people
are using the shared connection.
Bandwidth equals money, so cable’s advantage in throughput comes with a price.
Even in plans of similar speed, customers typically pay more per megabit with
cable than they do with DSL.
It’s hard to imagine, but there are still pockets of the United States without adequate cable
television service. There are far fewer such pockets without residential land-line telephone
service, meaning cable internet is, on balance, less accessible in remote areas.
VSAT
VSAT is short for very small aperture terminal, an earthbound station used in satellite
communications of data, voice and video signals, excluding broadcast television. A
VSAT consists of two parts, a transceiver that is placed outdoors in direct line of sight to
the satellite and a device that is placed indoors to interface the transceiver with the end
user’s communications device, such as a PC. The transceiver receives or sends a signal to a
satellite transponder in the sky. The satellite sends and receives signals from a ground
station computer that acts as a hub for the system. Each end user is interconnected with
the hub station via the satellite, forming a star topology. The hub controls the entire
operation of the network. For one end user to communicate with another, each
transmission has to first go to the hub station that then retransmits it via the satellite to
the other end user’s VSAT.
Advantages
Satellite communication systems have some advantages that can be exploited for the provision
of connectivity. These are:
• Costs Insensitive to Distance
• Single Platform service delivery (one-stop-shop)
• Flexibility
• Upgradeable
• Low incremental costs per unit
Disadvantages
However like all systems there are disadvantages also. Some of these are
• High start-up costs (hubs and basic elements must be in place before the
services can be provided)
• Higher than normal risk profiles
• Severe regulatory restrictions imposed by countries that prevent VSAT
networks and solutions from reaching critical mass and therefore profitability
• Some service quality limitations, such as high signal delays (latency)
• Natural availability limits that cannot be mitigated against
• Lack of skills required in the developing world to design, install and maintain
satellite communication systems adequately
DOWNLOADING FILES
Downloading is the process of copying a file (such as a game or utility) from one
computer to another across the internet. When you download a game from our web site,
it means you are copying it from the author or publisher’s web server to your own
computer. This allows you to install and use the program on your own machine.
Here’s how to download a file using Internet Explorer and Windows XP. (This example
shows a download of the file “dweepsetup.exe” from Dexterity Games.) If you’re using a
different browser such as Netscape Navigator or a different version of Windows, your
screen may look a little different, but the same basic steps should work.
Click on the download link for the program you want to download. Many sites offer
multiple download links to the same program, and you only need to choose one of these
links.
You may be asked if you want to save the file or run it from its current location. If you
are asked this question, select "Save." If not, don't worry; some browsers will
automatically choose "Save" for you.
You will then be asked to select the folder where you want to save the program or file,
using a standard “Save As” dialog box. Pay attention to which folder you select before
clicking the "Save" button. It may help you to create a folder like "C:\Download" for all
of your downloads, but you can use any folder you’d like.
The download will now begin. Your web browser will keep you updated on the progress of
the download by showing a progress bar that fills up as you download. You will also be
reminded where you're saving the file. The file will be saved as "C:\Download\dweepsetup.exe"
in the picture below.
Note: You may also see a check box labeled "Close this dialog box when download
completes." If you see this check box, it helps to uncheck it. You don't have to, but doing
so will make it easier to find the file after you download it.
Depending on which file you’re downloading and how fast your connection is, it may
take anywhere from a few seconds to a few minutes to download. When your download
is finished, if you left the “Close this dialog box when download completes” option
unchecked, you’ll see a dialog box as shown in fig. :
Figure 5a
Figure 5b and Figure 5c
Now click the “Open” button to run the file you just downloaded. If you don’t see
the “Download complete” dialog box, open the folder where you saved the file and
double-click on the icon for the file there.
What happens next will depend on the type of file you downloaded. The files you’ll
download most often will end in one of two extensions. (An extension is the last few letters
of the filename, after the period.) They are:
.EXE files: The file you downloaded is a program. Follow the on-screen instructions
from there to install the program to your computer and to learn how to run the program
after it’s installed.
.ZIP files: ZIP is a common file format used to compress and combine files to make them
download more quickly. Some versions of Windows (XP and sometimes ME) can read
ZIP files without extra software. Otherwise, you will need an unzipping program to
read these ZIP files. Common unzipping programs are WinZip, PKZIP, and Bit Zipper,
but there are also many others. Many unzipping programs are shareware, which means
you will need to purchase them if you use them beyond their specified trial period.
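For readers who prefer to script the same steps, this download-and-unpack workflow can be
reproduced with Python's standard library. This is only a sketch: the URL and the C:\Download
folder are placeholders standing in for whatever link and folder you actually use, not the
Dexterity Games link described above.

# Download a file to a chosen folder and unpack it if it is a ZIP archive.
import os
import urllib.request
import zipfile

url = "https://example.com/downloads/somefile.zip"   # placeholder download link
download_dir = r"C:\Download"                        # any folder you like
os.makedirs(download_dir, exist_ok=True)

local_path = os.path.join(download_dir, url.rsplit("/", 1)[-1])
urllib.request.urlretrieve(url, local_path)          # copy the file to this computer

if zipfile.is_zipfile(local_path):                   # .ZIP files must be unpacked before use
    with zipfile.ZipFile(local_path) as archive:
        archive.extractall(download_dir)

print("Saved to", local_path)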
World Wide Web
What is the Internet? What is the World Wide Web? How are they related?
The Internet is an international network (in this case, a collection of connected computers)
built for the purpose of communicating information. The Internet offers many
software services for this purpose, including:
• World Wide Web
• E-mail
• Instant messaging, chat
• Telnet (a service that lets a user login to a remote computer that the user has login privileges
for)
• FTP (File Transfer Protocol) – a service that lets one use the Internet to copy files from one
computer to another
The Web was originally designed for the purpose of displaying “public domain” data to anyone
who could view it. Although this is probably the most popular use of the Web today, other
uses of the Web include:
• Research, using tools such as “search engines” to find desired information.
• A variety of databases are available on the Web (this is another “research” tool). One
example of such a database: a library’s holdings.
• Shopping – most sizable commercial organizations have Web sites with forms you can fill
out to specify goods or services you wish to purchase. Typically, you must include your
credit card information in this form. Your credit card information is usually safe: the
system is typically automated so no human can see (and steal) your credit card number.
• We can generalize the above: Web forms can be filled out and submitted to apply for
admission to a university, to give a donation to a charity, to apply for a job, to become a
member of an organization, do banking chores, pay bills, etc.
• Listen to music or radio-like broadcasts, view videos or tv-like broadcasts.
• Some use the Web to access their e-mail or bulletin board services such as Blackboard.
• Most “browsers” today are somewhat like operating systems, in that they can enable a
variety of application programs. For example, a Word, Excel, or PowerPoint document can
be placed on the Web and viewed in its “native” application.
Some terminology you should know:
• Browser: A program used to view Web documents. Popular browsers include Microsoft
Internet Explorer (IE), Netscape, Opera; an old text-only browser called Lynx is still around
on some systems; etc. The browsers of Internet Service Providers (ISPs) like AOL,
Adelphia, Juno, etc., are generally one of the above, with the ISP’s logo displayed. Most
browsers work alike today. There may be minor differences (for example, what IE calls
"Favorites," Netscape calls "Bookmarks").
• A Web document is called a “page.” A collection of related pages is a “site.” A Web
site typically has a “home page” designed to be the first, introductory, page a user of the
site views.
• A Web page typically has an "address" or URL (Uniform Resource Locator). You can
view a desired page by using any of several methods to inform your browser of the URL
whose page you wish to view. The home page of a site typically has a URL of the form
http://www.DomainName.suffix
where the “DomainName” typically tells you something about the identity of the “host” or
“owner” of the site, and the “suffix” typically tells either the type of organization of the
owner or its country. Some common suffixes include:
✓ edu – An educational institution, usually a college or university.
✓ com – A commercial site – a company
✓ gov – a government site
✓ org – an organization that’s non-profit
✓ net – an alternative to “com” for network service providers
Also, the Internet originally was almost entirely centered in the US. As it spread to other
countries, it became common for sites outside the US to use a suffix that’s a 2-letter country
abbreviation: “ca” (without quotation marks) for Canada; “it” for Italy; “mx” for Mexico;
etc.
A page that isn’t a home page will typically have an address that starts with its site’s home
page address, and has appended further text to describe the page. For example, the Niagara
University home page is at http://www.niagara.edu/ and the Niagara University Academics
page is at http://www.niagara.edu/academic.htm.
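The parts of a URL described above can also be pulled apart programmatically. A minimal
sketch using Python's standard urllib.parse module on the Niagara University address quoted
above:

# Split a URL into its scheme, domain name and path.
from urllib.parse import urlparse

parts = urlparse("http://www.niagara.edu/academic.htm")

print(parts.scheme)                       # 'http'            - the protocol
print(parts.netloc)                       # 'www.niagara.edu' - host (domain name + suffix)
print(parts.path)                         # '/academic.htm'   - the page within the site
print(parts.netloc.rsplit(".", 1)[-1])    # 'edu'             - the suffix discussed above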
Navigating:
• One way to reach a desired page is to enter its URL in the “Address” textbox.
• You can click on a link (usually underlined text, or a graphic may also serve as a link; notice
that the mouse cursor changes its symbol, typically to a hand, when hovering over a Web
link) to get to the page addressed by the link.
• The Back button may be used to retrace your steps, revisiting pages recently visited.
• You can click the Forward button to retrace your steps through pages recently Backed out
of.
• Notice the drop-down button at the right side of the Address textbox. This reveals a menu
of URLs recently visited by users of the browser on the current computer. You may click
one of these URLs to revisit its page.
• Favorites (what Netscape calls “Bookmarks”) are URLs saved for the purpose of making
revisits easy. If you click a Favorite, you can easily revisit the corresponding page.
How do we find information on the Web? Caution: Don’t believe everything you see on the
Web. Many Web sites have content made up of hate literature, political propaganda, unfounded
opinions, and other content of dubious reliability. Therefore, you should try to use good
judgment about the sites you use for research.
Strategies for finding information on the Web include:
• Often, you can make an intelligent guess at the URL of a desired site. For example, you
might guess the UB Web site is http://www.ub.edu (turned out to be the University of
Barcelona) or http://www.buffalo.edu (was correct); similarly, if you’re interested in the
IRS Web site, you might try http://www.irs.gov, and it works. For Enron, you might try
http://www.enron.com, which redirected us to the page http://www.enron.com/corp/.
• “Search engines” are Web services provided on a number of Web sites, allowing you to
enter a keyword or phrase describing the topic you want information for. You may then
click a button to activate the search. A list of links typically appears, and you may explore
these links to find (you hope) the information you want. Note: if you use a phrase of
multiple words, and don’t place that phrase in quotation marks, you may get links by virtue
of matching all the words separately – e.g., “Diane” and “Pilarski” separately appeared in
a document that matched the phrase “Diane Pilarski” without quotation marks; but the same
link did not appear when we searched for “Diane Pilarski” with quotation marks. Also,
you may find if the phrase you enter is someone’s name, that many people have the same
name.
• Another strategy: Some Web sites (including some that offer search engines) have “Web
directories” or “indices” – classifications of Web pages. A good example: The Yahoo!
site at http://www.yahoo.com has such a Web directory. You can work your way through
the directory, often, to find desired information.
WEB BROWSER
A web browser (commonly referred to as a browser) is a software application for accessing
information on the World Wide Web. Each individual web page, image, and video is identified
by a distinct Uniform Resource Locator (URL), enabling browsers to retrieve these resources
from a web server and display them on a user's device.
A web browser is not the same thing as a search engine, though the two are often confused. For
a user, a search engine is just a website, such as google.com, that stores searchable data about
other websites. But to connect to a website's server and display its web pages, a user must have
a web browser installed on their device.
As of March 2019, more than 4.3 billion people use a browser, which is about 55% of the
world’s population.
The most popular browsers are Chrome, Firefox, Safari, Internet Explorer, and Edge.
History
The first web browser, called WorldWideWeb, was created in 1990 by Sir Tim Berners-Lee.
He then recruited Nicola Pellow to write the Line Mode Browser, which displayed web pages
on dumb terminals; it was released in 1991.
Nicola Pellow and Tim Berners-Lee in their office at CERN.
Marc Andreessen, lead developer of Mosaic and Navigator
1993 was a landmark year with the release of Mosaic, credited as "the world's first popular
browser". Its innovative graphical interface made the World Wide Web system easy to use and
thus more accessible to the average person. This, in turn, sparked the Internet boom of the
1990s when the Web grew at a very rapid rate. Marc Andreessen, the leader of the Mosaic
team, soon started his own company, Netscape, which released the Mosaic-influenced
Netscape Navigator in 1994. Navigator quickly became the most popular browser.
Microsoft debuted Internet Explorer in 1995, leading to a browser war with Netscape.
Microsoft was able to gain a dominant position for two reasons: it bundled Internet Explorer
with its popular Microsoft Windows operating system and did so as freeware with no
restrictions on usage. Eventually the market share of Internet Explorer peaked at over 95% in
2002.
In 1998, desperate to remain competitive, Netscape launched what would become the Mozilla
Foundation to create a new browser using the open source software model. This work evolved
into Firefox, first released by Mozilla in 2004. Firefox reached a 28% market share in 2011.
Apple released its Safari browser in 2003. It remains the dominant browser on Apple platforms,
though it never became a factor elsewhere.
The last major entrant to the browser market was Google. Its Chrome browser, which debuted
in 2008, has been a huge success. It steadily took market share from Internet Explorer and
became the most popular browser in 2012. Chrome has remained dominant ever since.
In terms of technology, browsers have greatly expanded their HTML, CSS, JavaScript, and
multimedia capabilities since the 1990s. One reason has been to enable more sophisticated
websites, such as web applications. Another factor is the significant increase of broadband
connectivity, which enables people to access data-intensive web content, such as YouTube
streaming, that was not possible during the era of dial-up modems.
Function
The purpose of a web browser is to fetch information resources from the Web and display them
on a user's device.
This process begins when the user inputs a URL, such as https://en.wikipedia.org/, into the
browser. Virtually all URLs on the Web start with either http: or https: which means the
browser will retrieve them with the Hypertext Transfer Protocol. In the case of https:, the
communication between the browser and the web server is encrypted for the purposes of
security and privacy. Another URL prefix is file: which is used to display local files already
stored on the user's device.
Once a web page has been retrieved, the browser's rendering engine displays it on the user's
device. This includes image and video formats supported by the browser.
Web pages usually contain hyperlinks to other pages and resources. Each link contains a URL,
and when it is clicked, the browser navigates to the new resource. Thus the process of bringing
content to the user begins again.
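The fetching half of this cycle is an ordinary HTTP(S) request, which can be imitated outside a
browser. A minimal sketch with Python's standard library, using the Wikipedia URL mentioned
above (rendering, of course, is far more involved and is not shown):

# Fetch a resource the way a browser does before handing it to the rendering engine.
import urllib.request

url = "https://en.wikipedia.org/"          # https: means the transfer is encrypted
with urllib.request.urlopen(url) as response:
    status = response.status                          # e.g. 200 when the page was retrieved
    content_type = response.headers.get("Content-Type")
    body = response.read()                            # raw bytes for the rendering engine

print(status, content_type, len(body), "bytes")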
Settings
Web browsers can typically be configured with a built-in menu. Depending on the browser, the
menu may be named Settings, Options, or Preferences.
The menu has different types of settings. For example, users can change their home page and
default search engine. They also can change default web page colors and fonts. Various
network connectivity and privacy settings are also usually available.
Privacy
During the course of browsing, cookies received from various websites are stored by the
browser. Some of them contain login credentials or site preferences. However, others are used
for tracking user behavior over long periods of time, so browsers typically provide settings for
removing cookies when exiting the browser. Finer-grained management of cookies requires a
browser extension.
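Outside of any particular browser, the same store-and-clear behaviour can be sketched with
Python's standard http.cookiejar module; the site below is a placeholder, not one named in this
document.

# Collect cookies sent by a site, inspect them, then discard them
# (roughly what a browser's "clear cookies on exit" setting does).
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

opener.open("https://www.example.org/")    # any site; its cookies land in the jar
for cookie in jar:
    print(cookie.domain, cookie.name)      # e.g. a session or tracking cookie

jar.clear()                                # remove the stored cookies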
Features
The most popular browsers have a number of features in common. They allow users to set
bookmarks and browse in a private mode. They also can be customized with extensions, and
some of them provide a sync service.
Most browsers have these user interface features:
Allow the user to open multiple pages at the same time, either in different browser windows or
in different tabs of the same window.
Back and forward buttons to go back to the previous page visited or forward to the next one.
A refresh or reload button to reload the current page.
A stop button to cancel loading the page. (In some browsers, the stop button is merged with
the reload button.)
A home button to return to the user's home page.
An address bar to input the URL of a page and display it.
A search bar to input terms into a search engine. (In some browsers, the search bar is merged
with the address bar.)
There are also niche browsers with distinct features. One example is text-only browsers that
can benefit people with slow Internet connections or those with visual impairments.
Security
Web browsers are popular targets for hackers, who exploit security holes to steal information,
destroy files, and other malicious activity. Browser vendors regularly patch these security
holes, so users are strongly encouraged to keep their browser software updated. Other
protection measures are antivirus software and avoiding known-malicious websites.
EMBnet
The European Molecular Biology network (EMBnet) is an international scientific network and
interest group that aims to enhance bioinformatics services by bringing together bioinformatics
expertise and capacity. As of 2011, EMBnet had 37 nodes spread over 32 countries. The nodes
include bioinformatics related university departments, research institutes and national service
providers.
Operations
The main task of most EMBnet nodes is to provide their national scientific community with
access to bioinformatics databanks, specialised software and sufficient computing resources
and expertise. EMBnet is also working in the fields of bioinformatics training and software
development. Examples of software created by EMBnet members are: EMBOSS, wEMBOSS,
UTOPIA.
EMBnet represents a wide user group and works closely together with the database producers
such as EMBL's European Bioinformatics Institute (EBI), the Swiss Institute of
Bioinformatics (Swiss-Prot), the Munich Information Center for Protein Sequences (MIPS), in
order to provide a uniform coverage of services throughout Europe. EMBnet is registered in
the Netherlands as a public foundation (Stichting).
Since its creation in 1988, EMBnet has evolved from an informal network of individuals in
charge of maintaining biological databases into the only worldwide organization bringing
bioinformatics professionals together to serve the expanding fields
of genetics and molecular biology. Although composed predominantly of academic nodes,
EMBnet gains an important added dimension from its industrial members. The success of
EMBnet is attracting increasing numbers of organizations outside Europe to join.
EMBnet has a tried-and-tested infrastructure to organise training courses, give technical help
and help its members effectively interact and respond to the rapidly changing needs of
biological research in a way no single institute is able to do.
In 2005 the organization created additional types of node to allow more than one member per
country. The new category denomination is "associated node".
Coordination and organization
EMBnet is governed by the Annual General Meetings (AGM), and is coordinated by an
Executive Board (EB) that oversees the activities of three project committees:
Education and Training committee (E&T). Educational support includes a series of courses
organised in the member countries and languages; the committee also works on the
continued development of on-line accessible education materials.
Publicity and Public Relations committee (P&PR). This committee is responsible for
promoting any type of EMBnet activities, for the advertisement of products and services
provided by the EMBnet community, as well as for proposing and developing new strategies
aiming to enhance EMBnet’s visibility, and for maintaining public relations with EMBnet
communities and related networks/societies.
Technical Manager committee (TM). The TM PC provides assistance and practical help to the
participating nodes and their users.
THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
The National Center for Biotechnology Information (NCBI) is part of the United States
National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The
NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored
by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an
important resource for bioinformatics tools and services. Major databases include GenBank for
DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other
databases include the NCBI Epigenomics database. All these databases are available online
through the Entrez search engine. NCBI was directed by David Lipman, one of the original
authors of the BLAST sequence alignment program and a widely respected figure in
bioinformatics. He also led an intramural research program, including groups led by Stephen
Altschul (another BLAST co-author), David Landsman, Eugene Koonin, John Wilbur, Teresa
Przytycka, and Zhiyong Lu. David Lipman stood down from his post in May 2017.
GenBank
NCBI has had responsibility for making available the GenBank DNA sequence database since
1992. GenBank coordinates with individual laboratories and other sequence databases such as
those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of
Japan (DDBJ).
Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI
provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D
protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference
Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates
with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI
assigns a unique identifier (taxonomy ID number) to each species of organism.
The NCBI has software tools that are available by WWW browsing or by FTP. For example,
BLAST is a sequence similarity searching program. BLAST can do sequence comparisons
against the GenBank DNA database in less than 15 seconds.
NCBI Bookshelf
The "NCBI Bookshelf is a collection of freely accessible, downloadable, on-line versions of
selected biomedical books. The Bookshelf covers a wide range of topics including molecular
biology, biochemistry, cell biology, genetics, microbiology, disease states from a molecular
and cellular point of view, research methods, and virology. Some of the books are online
versions of previously published books, while others, such as Coffee Break, are written and
edited by NCBI staff. The Bookshelf is a complement to the Entrez PubMed repository of peer-
reviewed publication abstracts in that Bookshelf contents provide established perspectives on
evolving areas of study and a context in which many disparate individual pieces of reported
research can be organized.
Basic Local Alignment Search Tool (BLAST)
BLAST is an algorithm used for calculating sequence similarity between biological sequences
such as nucleotide sequences of DNA and amino acid sequences of proteins. BLAST is a
powerful tool for finding sequences similar to the query sequence within the same organism or
in different organisms. It searches the query sequence against NCBI databases and servers and
posts the results back to the user's browser in the chosen format. Input sequences to BLAST are
mostly in FASTA or GenBank format, while output can be delivered in a variety of formats
such as HTML, XML and plain text. HTML is the default output format for NCBI's web page.
Results from NCBI BLAST are presented as a graphical overview of all the hits found, a table
of sequence identifiers for the hits with scoring-related data, and the alignments between the
sequence of interest and the hits, with the corresponding BLAST scores.
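BLAST searches can also be submitted to NCBI's servers from a script. The sketch below
assumes the third-party Biopython package is installed; the nucleotide sequence is an arbitrary
placeholder, and the result is requested in XML, one of the output formats mentioned above.

# Submit a nucleotide BLAST search to NCBI via Biopython (pip install biopython).
from Bio.Blast import NCBIWWW

query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"   # placeholder query sequence
result_handle = NCBIWWW.qblast(
    program="blastn",        # nucleotide-nucleotide search
    database="nt",           # NCBI nucleotide database
    sequence=query,
    format_type="XML",       # "HTML" and "Text" are also accepted
)

with open("blast_result.xml", "w") as out:
    out.write(result_handle.read())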
Entrez
The Entrez Global Query Cross-Database Search System is used at NCBI for all the major
databases such as Nucleotide and Protein Sequences, Protein Structures, PubMed, Taxonomy,
Complete Genomes, OMIM, and several others. Entrez is both an indexing and a retrieval
system, drawing data from various sources for biomedical research. NCBI distributed the first
version of Entrez in 1991, composed of nucleotide sequences from PDB and GenBank, protein
sequences from SWISS-PROT, translated GenBank, PIR, PRF and PDB, and associated
abstracts and citations from PubMed. Entrez is specially designed to integrate the data from
several different sources, databases and formats into a uniform information model and retrieval
system which can efficiently retrieve relevant references, sequences and structures.
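Entrez can also be queried programmatically through NCBI's E-utilities, which Biopython
wraps in its Bio.Entrez module. A minimal sketch, again assuming Biopython is installed; the
e-mail address and search term are placeholders.

# Search a database through Entrez and fetch the matching records.
from Bio import Entrez

Entrez.email = "your.name@example.org"     # NCBI asks callers to identify themselves

# Search the nucleotide database; the query term is only an example.
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND human[Organism]", retmax=5)
record = Entrez.read(handle)

# Retrieve the matching records in GenBank format.
fetch = Entrez.efetch(db="nucleotide", id=record["IdList"], rettype="gb", retmode="text")
print(fetch.read()[:500])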
Gene
Gene has been implemented at NCBI to characterize and organize the information about genes.
It serves as a major node in the nexus of genomic map, expression, sequence, protein function,
structure and homology data. A unique GeneID is assigned to each gene record that can be
followed through revision cycles. Gene records for known or predicted genes are established
here and are demarcated by map positions or nucleotide sequence. Gene has several advantages
over its predecessor, LocusLink, including better integration with other databases in NCBI,
broader taxonomic scope, and enhanced options for query and retrieval provided by the Entrez
system.
Protein
The Protein database maintains text records for individual protein sequences, derived from
many different resources such as the NCBI Reference Sequence (RefSeq) project, GenBank,
PDB and UniProtKB/Swiss-Prot. Protein records are present in different formats including
FASTA and XML and are linked to other NCBI resources. Protein provides relevant data to
users such as genes, DNA/RNA sequences, biological pathways, expression and variation data,
and literature. It also provides pre-determined sets of similar and identical proteins for each
sequence as computed by BLAST. The Structure database of NCBI contains 3D coordinate
sets for experimentally determined structures in PDB that are imported by NCBI. The
Conserved Domain Database (CDD) contains sequence profiles that characterize highly
conserved domains within protein sequences. It also has records from external resources
like SMART and Pfam. Another related database, the Protein Clusters database, contains sets
of protein sequences clustered according to the maximum alignments between the individual
sequences as calculated by BLAST.
Pubchem database
The PubChem database of NCBI is a public resource for molecules and their activities in
biological assays. PubChem is searchable and accessible via the Entrez information retrieval
system.
FILE TRANSFER PROTOCOL
The File Transfer Protocol (FTP) is a standard network protocol used for the transfer of
computer files between a client and server on a computer network.
FTP is built on a client-server model architecture using separate control and data connections
between the client and the server. FTP users may authenticate themselves with a clear-text
sign-in protocol, normally in the form of a username and password, but can connect
anonymously if the server is configured to allow it. For secure transmission that protects the
username and password, and encrypts the content, FTP is often secured with SSL/TLS (FTPS)
or replaced with SSH File Transfer Protocol (SFTP).
The first FTP client applications were command-line programs developed before operating
systems had graphical user interfaces, and are still shipped with most Windows, Unix, and
Linux operating systems. Many FTP clients and automation utilities have since been developed
for desktops, servers, mobile devices, and hardware, and FTP has been incorporated into
productivity applications, such as HTML editors.
History of FTP servers
The original specification for the File Transfer Protocol was written by Abhay Bhushan and
published as RFC 114 on 16 April 1971. Until 1980, FTP ran on NCP, the predecessor of
TCP/IP. The protocol was later replaced by a TCP/IP version, RFC 765 (June 1980) and RFC
959 (October 1985), the current specification. Several proposed standards amend RFC 959, for
example RFC 1579 (February 1994) enables Firewall-Friendly FTP (passive mode), RFC 2228
(June 1997) proposes security extensions, RFC 2428 (September 1998) adds support for IPv6
and defines a new type of passive mode.
Protocol overview
Communication and data transfer
Illustration of starting a passive connection using port 21
FTP may run in active or passive mode, which determines how the data connection is
established. In both cases, the client creates a TCP control connection from a random, usually
an unprivileged, port N to the FTP server command port 21.
In active mode, the client starts listening for incoming data connections from the server on port
M. It sends the FTP command PORT M to inform the server on which port it is listening. The
server then initiates a data channel to the client from its port 20, the FTP server data port.
In situations where the client is behind a firewall and unable to accept incoming TCP
connections, passive mode may be used. In this mode, the client uses the control connection to
send a PASV command to the server and then receives a server IP address and server port
number from the server, which the client then uses to open a data connection from an arbitrary
client port to the server IP address and server port number received
Both modes were updated in September 1998 to support IPv6. Further changes were introduced
to the passive mode at that time, updating it to extended passive mode.
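In practice, most FTP client libraries default to passive mode so that the client, not the server,
opens the data connection. A minimal sketch with Python's standard ftplib module; the host
name is a placeholder, and the login is anonymous (anonymous access is described further
below).

# Open the control connection to port 21, log in anonymously, and list a directory
# over a passive-mode data connection.
from ftplib import FTP

with FTP("ftp.example.org") as ftp:     # control connection to the server's port 21
    ftp.login()                         # anonymous login (USER anonymous)
    ftp.set_pasv(True)                  # passive mode: client opens the data connection
    print(ftp.getwelcome())             # server greeting, e.g. a "220 ..." status line
    ftp.retrlines("LIST")               # directory listing arrives over a data connection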
The server responds over the control connection with three-digit status codes in ASCII with an
optional text message. For example, "200" (or "200 OK") means that the last command was
successful. The numbers represent the code for the response and the optional text represents a
human-readable explanation or request. An ongoing
transfer of file data over the data connection can be aborted using an interrupt message sent
over the control connection.
While transferring data over the network, four data representations can be used:
ASCII mode: Used for text. Data is converted, if needed, from the sending host's character
representation to "8-bit ASCII" before transmission, and (again, if necessary) to the receiving
host's character representation. As a consequence, this mode is inappropriate for files that
contain data other than plain text.
Image mode (commonly called Binary mode): The sending machine sends each file byte by
byte, and the recipient stores the bytestream as it receives it. (Image mode support has been
recommended for all implementations of FTP).
EBCDIC mode: Used for plain text between hosts using the EBCDIC character set.
Local mode: Allows two computers with identical setups to send data in a proprietary format
without the need to convert it to ASCII.
For text files, different format control and record structure options are provided. These features
were designed to facilitate files containing Telnet or ASA.
Data transfer can be done in any of three modes:
Stream mode: Data is sent as a continuous stream, relieving FTP from doing any processing.
Rather, all processing is left up to TCP. No End-of-file indicator is needed, unless the data is
divided into records.
Block mode: FTP breaks the data into several blocks (block header, byte count, and data field)
and then passes it on to TCP.
Compressed mode: Data is compressed using a simple algorithm (usually run-length encoding).
Some FTP software also implements a DEFLATE-based compressed mode, sometimes called
"Mode Z" after the command that enables it. This mode was described in an Internet Draft, but
not standardized.
Login
FTP login uses normal username and password scheme for granting access. The username is
sent to the server using the USER command, and the password is sent using the PASS
command. This sequence is unencrypted "on the wire", so may be vulnerable to a network
sniffing attack. If the information provided by the client is accepted by the server, the server
will send a greeting to the client and the session will commence. If the server supports it, users
may log in without providing login credentials, but the same server may authorize only limited
access for such sessions.
Anonymous FTP
A host that provides an FTP service may provide anonymous FTP access. Users typically log
into the service with an 'anonymous' (lower-case and case-sensitive in some FTP servers)
account when prompted for user name. Although users are commonly asked to send their email
address instead of a password, no verification is actually performed on the supplied data. Many
FTP hosts whose purpose is to provide software updates will allow anonymous logins.
NAT and firewall traversal
FTP normally transfers data by having the server connect back to the client, after the PORT
command is sent by the client. This is problematic for both NATs and firewalls, which do not
allow connections from the Internet towards internal hosts. For NATs, an additional
complication is that the representation of the IP addresses and port number in the PORT
command refer to the internal host's IP address and port, rather than the public IP address and
port of the NAT.
There are two approaches to solve this problem. One is that the FTP client and FTP server use
the PASV command, which causes the data connection to be established from the FTP client
to the server. This is widely used by modern FTP clients. Another approach is for the NAT to
alter the values of the PORT command, using an application-level gateway for this purpose.
Differences from HTTP
HTTP essentially fixes the bugs in FTP that made it inconvenient to use for many small
ephemeral transfers as are typical in web pages.
FTP has a stateful control connection which maintains a current working directory and other
flags, and each transfer requires a secondary connection through which the data are transferred.
In "passive" mode this secondary connection is from client to server, whereas in the default
"active" mode this connection is from server to client. This apparent role reversal when in
active mode, and random port numbers for all transfers, is why firewalls and NAT gateways
have such a hard time with FTP. HTTP is stateless and multiplexes control and data over a
single connection from client to server on well-known port numbers, which trivially passes
through NAT gateways and is simple for firewalls to manage.
Setting up an FTP control connection is quite slow due to the round-trip delays of sending all
of the required commands and awaiting responses, so it is customary to bring up a control
connection and hold it open for multiple file transfers rather than drop and re-establish the
session afresh each time. In contrast, HTTP originally dropped the connection after each
transfer because doing so was so cheap. While HTTP has subsequently gained the ability to
reuse the TCP connection for multiple transfers, the conceptual model is still of independent
requests rather than a session.
When FTP is transferring over the data connection, the control connection is idle. If the transfer
takes too long, the firewall or NAT may decide that the control connection is dead and stop
tracking it, effectively breaking the connection and confusing the download. The single HTTP
connection is only idle between requests and it is normal and expected for such connections to
be dropped after a time-out.
Web browser support
Most common web browsers can retrieve files hosted on FTP servers, although they may not
support protocol extensions such as FTPS. When an FTP URL (rather than an HTTP URL) is
supplied, the accessible contents on the remote server are presented in a manner that is similar
to that used for other web content. A full-featured FTP client can be run within Firefox in the
form of an extension called FireFTP.
Syntax
FTP URL syntax is described in RFC 1738, taking the form:
ftp://[user[:password]@]host[:port]/url-path (the bracketed parts are optional).
For example, the URL ftp://public.ftp-servers.example.com/mydirectory/myfile.txt represents
the file myfile.txt from the directory mydirectory on the server public.ftp-servers.example.com
as an FTP resource. The URL
ftp://user001:secretpassword@private.ftp-servers.example.com/mydirectory/myfile.txt adds a
specification of the username and password that must be used to access this resource.
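The optional parts of this syntax map directly onto fields that Python's standard urllib.parse
module can extract; a sketch using the second example URL above:

# Pull the user, password, host and path out of an FTP URL.
from urllib.parse import urlsplit

url = "ftp://user001:secretpassword@private.ftp-servers.example.com/mydirectory/myfile.txt"
parts = urlsplit(url)

print(parts.username)   # 'user001'
print(parts.password)   # 'secretpassword'
print(parts.hostname)   # 'private.ftp-servers.example.com'
print(parts.path)       # '/mydirectory/myfile.txt'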
More details on specifying a username and password may be found in the browsers'
documentation (e.g., Firefox and Internet Explorer). By default, most web browsers use passive
(PASV) mode, which more easily traverses end-user firewalls.
Some variation has existed in how different browsers treat path resolution in cases where there
is a non-root home directory for a user.
Security
FTP was not designed to be a secure protocol, and has many security weaknesses. In May
1999, the authors of RFC 2577 listed vulnerabilities to the following problems:
Brute force attack
FTP bounce attack
Packet capture
Port stealing (guessing the next open port and usurping a legitimate connection)
Spoofing attack
Username enumeration
DoS or DDoS
FTP does not encrypt its traffic; all transmissions are in clear text, and usernames, passwords,
commands and data can be read by anyone able to perform packet capture (sniffing) on the
network. This problem is common to many of the Internet Protocol specifications (such as
SMTP, Telnet, POP and IMAP) that were designed prior to the creation of encryption
mechanisms such as TLS or SSL.
Common solutions to this problem include:
Using the secure versions of the insecure protocols, e.g., FTPS instead of FTP and TelnetS
instead of Telnet.
Using a different, more secure protocol that can handle the job, e.g. SSH File Transfer Protocol
or Secure Copy Protocol.
Using a secure tunnel such as Secure Shell (SSH) or virtual private network (VPN).
FTP over SSH
FTP over SSH is the practice of tunneling a normal FTP session over a Secure Shell connection.
Because FTP uses multiple TCP connections (unusual for a TCP/IP protocol that is still in use),
it is particularly difficult to tunnel over SSH. With many SSH clients, attempting to set up a
tunnel for the control channel (the initial client-to-server connection on port 21) will protect
only that channel; when data is transferred, the FTP software at either end sets up new TCP
connections (data channels), which therefore have no confidentiality or integrity protection.
Otherwise, it is necessary for the SSH client software to have specific knowledge of the FTP
protocol, to monitor and rewrite FTP control channel messages and autonomously open new
packet forwardings for FTP data channels. Software packages that support this mode include:
Tectia ConnectSecure (Win/Linux/Unix) of SSH Communications Security's software suite
Derivatives
FTPS
Explicit FTPS is an extension to the FTP standard that allows clients to request FTP sessions
to be encrypted. This is done by sending the "AUTH TLS" command. The server has the option
of allowing or denying connections that do not request TLS. This protocol extension is defined
in RFC 4217. Implicit FTPS is an outdated standard for FTP that required the use of an SSL or
TLS connection. It was specified to use different ports than plain FTP.
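Explicit FTPS is available in Python's standard library as ftplib.FTP_TLS, which issues the
AUTH TLS command described above before credentials are sent. A minimal sketch; the host
and credentials are placeholders.

# Explicit FTPS: upgrade the control connection with AUTH TLS, then protect the
# data connections as well.
from ftplib import FTP_TLS

with FTP_TLS("ftps.example.org") as ftps:
    ftps.login("user001", "secretpassword")   # sent only after the TLS handshake
    ftps.prot_p()                             # switch data connections to TLS too
    ftps.retrlines("LIST")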
SSH File Transfer Protocol
The SSH file transfer protocol (chronologically the second of the two protocols abbreviated
SFTP) transfers files and has a similar command set for users, but uses the Secure Shell
protocol (SSH) to transfer files. Unlike FTP, it encrypts both commands and data, preventing
passwords and sensitive information from being transmitted openly over the network. It cannot
interoperate with FTP software.
Trivial File Transfer Protocol
Trivial File Transfer Protocol (TFTP) is a simple, lock-step FTP that allows a client to get a
file from or put a file onto a remote host. One of its primary uses is in the early stages of booting
from a local area network, because TFTP is very simple to implement. TFTP lacks security and
most of the advanced features offered by more robust file transfer protocols such as File
Transfer Protocol. TFTP was first standardized in 1981 and the current specification for the
protocol can be found in RFC 1350.
Simple File Transfer Protocol
Simple File Transfer Protocol (the first protocol abbreviated SFTP), as defined by RFC 913,
was proposed as an (unsecured) file transfer protocol with a level of complexity intermediate
between TFTP and FTP. It was never widely accepted on the Internet, and is now assigned
Historic status by the IETF. It runs through port 115, and often receives the initialism of SFTP.
It has a command set of 11 commands and supports three types of data transmission: ASCII,
binary and continuous. For systems with a word size that is a multiple of 8 bits, the
implementation of binary and continuous is the same. The protocol also supports login with
user ID and password, hierarchical folders and file management (including rename, delete,
upload, download, download with overwrite, and download with append).